Concentration Inequalities
CS 265/CME309: Randomized Algorithms — Lectures 4–5. Gregory Valiant, updated by Mary Wootters (Stanford, Fall 2020)
Table of Contents
- 1. Markov’s Inequality
- 2. Chebyshev’s Inequality
- 3. Sampling-Based Median Finding
- 4. Moment-Generating Functions
- 5. The Chernoff Bound
- 6. Application: Randomized Routing on the Hypercube
- References
1. Markov’s Inequality 📐
Concentration inequalities are tools for bounding the probability that a random variable deviates far from its mean. They form the analytical backbone of almost every randomized algorithm analysis, from hashing to routing to sampling.
1.1 Statement and Proof
Proposition (Markov’s Inequality). Let \(X \geq 0\) be a non-negative random variable with finite expectation \(\mathbb{E}[X] < \infty\). For any \(c > 0\),
\[\Pr[X \geq c] \leq \frac{\mathbb{E}[X]}{c}.\]
Proof. Write \(\mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}_{X \geq c}] + \mathbb{E}[X \cdot \mathbf{1}_{X < c}]\). Since \(X \geq 0\), the second term is non-negative. On the event \(\{X \geq c\}\), we have \(X \geq c\), so \(X \cdot \mathbf{1}_{X \geq c} \geq c \cdot \mathbf{1}_{X \geq c}\). Therefore
\[\mathbb{E}[X] \geq \mathbb{E}[c \cdot \mathbf{1}_{X \geq c}] = c \cdot \Pr[X \geq c].\]
Dividing both sides by \(c\) gives the result. \(\square\)
Equivalent form. Markov’s inequality is sometimes stated as: \(\Pr[X \geq t \cdot \mathbb{E}[X]] \leq 1/t\) for \(t > 1\). Setting \(c = t \cdot \mathbb{E}[X]\) recovers the above. Both forms are equivalent; choose whichever is more convenient in a given analysis.
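As a quick numeric sanity check, here is a sketch that verifies Markov's inequality on a small discrete distribution (the values and probabilities are illustrative, chosen only to keep the arithmetic exact):

```python
from fractions import Fraction

# A non-negative discrete random variable: value -> probability (illustrative)
dist = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 8), 8: Fraction(1, 8)}
assert sum(dist.values()) == 1

mean = sum(x * p for x, p in dist.items())  # E[X] = 1/4 + 1/2 + 1 = 7/4

for c in [1, 2, 4, 8]:
    tail = sum(p for x, p in dist.items() if x >= c)  # Pr[X >= c]
    assert tail <= mean / c  # Markov: Pr[X >= c] <= E[X]/c
```

Exact rational arithmetic via `fractions.Fraction` avoids any floating-point slack in the comparison.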
1.2 Sharpness and Limitations
The bound is tight: for any \(c > 0\) and any \(\mu > 0\) with \(\mu \leq c\), consider \(X\) that equals \(c\) with probability \(\mu/c\) and \(0\) otherwise. Then \(\mathbb{E}[X] = \mu\) and \(\Pr[X \geq c] = \mu/c\), exactly matching the bound.
⚠️ Markov only requires non-negativity. This generality comes at the cost of quality: the bound can be very loose when we have more structural information (variance, sub-Gaussianity, etc.). Chebyshev exploits variance; Chernoff exploits moment-generating functions.
Exercise 1. This exercise shows that Markov’s inequality is tight in a strong sense: no universal improvement is possible using only mean information.
Let \(\mu, c\) be given with \(0 < \mu < c\). Show that for any \(\varepsilon > 0\), there exists a non-negative random variable \(X\) with \(\mathbb{E}[X] = \mu\) such that \(\Pr[X \geq c] > \mu/c - \varepsilon\).
Prerequisites: Markov’s inequality proof (Section 1.1); basic properties of expectation.
[!TIP]- Solution to Exercise 1 Key insight: The two-point distribution achieves the bound exactly.
Sketch: Define \(X\) to be \(c\) with probability \(p = \mu/c\) and \(0\) with probability \(1 - p\). Then \(\mathbb{E}[X] = cp = \mu\) and \(\Pr[X \geq c] = \mu/c\). For any \(\varepsilon > 0\), this construction gives \(\Pr[X \geq c] = \mu/c > \mu/c - \varepsilon\). Any distribution with support in \([0, c)\) has \(\Pr[X \geq c] = 0\) but must still satisfy \(\mathbb{E}[X] = \mu\), so it does not exceed the Markov bound. The two-point family achieves equality, proving no room for improvement exists without additional structural information.
2. Chebyshev’s Inequality 💡
Chebyshev strengthens Markov by using second-moment (variance) information. It applies to arbitrary real-valued random variables, not just non-negative ones.
2.1 Statement and Proof
Proposition (Chebyshev’s Inequality). Let \(X\) be a random variable with mean \(\mu = \mathbb{E}[X]\) and variance \(\sigma^2 = \mathrm{Var}(X) < \infty\). For any \(k > 0\),
\[\Pr[|X - \mu| \geq k] \leq \frac{\sigma^2}{k^2}.\]
Proof. The random variable \((X - \mu)^2\) is non-negative with expectation \(\sigma^2\). By Markov’s inequality applied to \((X - \mu)^2\) with threshold \(k^2\):
\[\Pr[(X - \mu)^2 \geq k^2] \leq \frac{\mathbb{E}[(X-\mu)^2]}{k^2} = \frac{\sigma^2}{k^2}.\]
The event \(\{(X-\mu)^2 \geq k^2\}\) is identical to \(\{|X - \mu| \geq k\}\), giving the result. \(\square\)
Standard deviation form. Setting \(k = t\sigma\) yields \(\Pr[|X - \mu| \geq t\sigma] \leq 1/t^2\). So to reduce failure probability to \(\delta\), we need a \(\sigma/\sqrt{\delta}\)-width window — a polynomial dependence on \(1/\delta\).
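To see the \(1/t^2\) decay concretely, here is an exact check for the sum of two fair dice (an illustrative distribution with mean \(7\) and variance \(35/6\)):

```python
from fractions import Fraction
from itertools import product

# Exact distribution of the sum of two fair dice
dist = {}
for d1, d2 in product(range(1, 7), repeat=2):
    dist[d1 + d2] = dist.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

mu = sum(x * p for x, p in dist.items())               # mean: 7
var = sum((x - mu) ** 2 * p for x, p in dist.items())  # variance: 35/6

# Chebyshev: Pr[|X - mu| >= k] <= var / k^2, verified exactly
for k in [2, 3, 4, 5]:
    tail = sum(p for x, p in dist.items() if abs(x - mu) >= k)
    assert tail <= var / k**2
```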
2.2 Pairwise Independence and Variance Linearity
Recall that random variables \(X_1, \ldots, X_n\) are pairwise independent if for every \(i \neq j\) and every \(a, b\), \(\Pr[X_i = a, X_j = b] = \Pr[X_i = a]\Pr[X_j = b]\). This is weaker than full mutual independence.
Proposition (Variance of Pairwise Independent Sum). If \(X_1, \ldots, X_n\) are pairwise independent, then
\[\mathrm{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{Var}(X_i).\]
Proof. Let \(S = \sum_i X_i\) and \(\mu_i = \mathbb{E}[X_i]\). Then
\[\mathrm{Var}(S) = \mathbb{E}\!\left[\left(\sum_i (X_i - \mu_i)\right)^2\right] = \sum_i \mathbb{E}[(X_i - \mu_i)^2] + \sum_{i \neq j} \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)].\]
For \(i \neq j\), pairwise independence gives \(\mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)] = \mathbb{E}[X_i - \mu_i]\mathbb{E}[X_j - \mu_j] = 0\). So all cross-terms vanish and \(\mathrm{Var}(S) = \sum_i \mathrm{Var}(X_i)\). \(\square\)
Pairwise vs. mutual independence. Variance linearity requires only pairwise independence, not full independence. This is important for deriving tight bounds in hashing: the hash family we construct achieves pairwise independence but not full \(k\)-wise independence for large \(k\).
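The classic three-bit example makes this concrete: \(X_1, X_2\) fair coin flips and \(X_3 = X_1 \oplus X_2\) are pairwise independent but not mutually independent, yet the variance of their sum still splits. A sketch that checks this by exhaustive enumeration:

```python
from fractions import Fraction
from itertools import product

# X1, X2 fair bits; X3 = X1 XOR X2. Each of the 4 outcomes has probability 1/4.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
p = Fraction(1, 4)

def expect(f):
    return sum(f(o) * p for o in outcomes)

# Pairwise independence: every pair of coordinates is uniform on {0,1}^2
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        pr = expect(lambda o: 1 if (o[i], o[j]) == (a, b) else 0)
        assert pr == Fraction(1, 4)

# Not mutually independent: the outcome (1,1,1) never occurs
assert expect(lambda o: 1 if o == (1, 1, 1) else 0) == 0

# Variance of the sum equals the sum of the variances (= 3 * 1/4)
mean_s = expect(lambda o: sum(o))
var_s = expect(lambda o: (sum(o) - mean_s) ** 2)
assert var_s == 3 * Fraction(1, 4)
```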
2.3 Application: Universal Hashing
A universal hash family \(\mathcal{H}\) maps a universe \(U\) to a table \([m]\) such that for any \(x \neq y \in U\),
\[\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq \frac{1}{m}.\]
Construction. Fix a prime \(p \geq |U|\). Identify \(U \subseteq \mathbb{Z}_p\). The family
\[\mathcal{H} = \{h_{a,b}(x) = ((ax + b) \bmod p) \bmod m : a, b \in \mathbb{Z}_p,\; a \neq 0\}\]
is universal. Moreover, for any two distinct \(x, y\), the pre-reduction values \(((ax+b) \bmod p,\ (ay+b) \bmod p)\) are uniformly distributed over pairs of distinct elements of \(\mathbb{Z}_p\), so the hash values \((h_{a,b}(x), h_{a,b}(y))\) are close to uniform over \([m]\) and nearly independent of each other.
Bucket analysis via Chebyshev. Fix a key \(x\) and suppose \(n\) keys are hashed to a table of size \(m = n\). Let \(Y_i = \mathbf{1}[h(x_i) = h(x)]\) for \(i \neq x\). Then \(\mathbb{E}[Y_i] = 1/n\) and pairwise independence gives
\[\mathrm{Var}\!\left(\sum_{i} Y_i\right) = \sum_i \mathrm{Var}(Y_i) \leq \sum_i \mathbb{E}[Y_i] = \frac{n-1}{n} < 1.\]
By Chebyshev, the number of collisions \(C\) with \(x\) satisfies \(\Pr[C \geq 1 + k] \leq \Pr[|C - \mathbb{E}[C]| \geq k] \leq 1/k^2\) for any \(k > 0\), since \(\mathbb{E}[C] < 1\) and \(\mathrm{Var}(C) < 1\).
Why \(a \neq 0\) matters. If \(a = 0\), then \(h_{0,b}(x) = b \bmod m\) is a constant function — every key maps to the same bucket. This catastrophically violates the universality property. Restricting \(a \in \mathbb{Z}_p^*\) ensures the linear map \(ax + b\) is a bijection on \(\mathbb{Z}_p\), making any two distinct inputs map to distinct values before reduction mod \(m\).
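A brute-force check of universality for one small, illustrative choice of parameters (\(p = 13\), \(m = 4\), keys \(x = 3\), \(y = 7\)), enumerating the entire family:

```python
from itertools import product

p, m = 13, 4   # small prime and table size, chosen for an exhaustive check
x, y = 3, 7    # two distinct keys

collisions = 0
total = 0
for a, b in product(range(1, p), range(p)):  # all hash functions with a != 0
    hx = ((a * x + b) % p) % m
    hy = ((a * y + b) % p) % m
    total += 1
    collisions += (hx == hy)

# Universality: collision probability at most 1/m over the random (a, b)
assert collisions / total <= 1 / m
```

Because there are only \(p(p-1) = 156\) functions in the family, the collision probability is computed exactly here rather than estimated by sampling.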
Exercise 2. This exercise verifies the universality of the family \(h_{a,b}\) by counting colliding pairs directly.
For \(h_{a,b}(x) = ((ax+b) \bmod p) \bmod m\) with \(p\) prime and \(a \neq 0\), show that for distinct \(x, y \in \mathbb{Z}_p\), \(\Pr_{a,b}[h_{a,b}(x) = h_{a,b}(y)] \leq 1/m\).
Prerequisites: universal hash family construction (Section 2.3); properties of linear maps over \(\mathbb{Z}_p\).
[!TIP]- Solution to Exercise 2 Key insight: The map \((a, b) \mapsto (ax+b, ay+b)\) is a bijection from \(\mathbb{Z}_p^* \times \mathbb{Z}_p\) onto the pairs \((u, v) \in \mathbb{Z}_p^2\) with \(u \neq v\), so it suffices to count distinct pairs that collide mod \(m\).
Sketch: Fix \(x \neq y\). As \((a, b)\) ranges uniformly over \(\mathbb{Z}_p^* \times \mathbb{Z}_p\), the pair \((u, v) = (ax+b \bmod p,\ ay+b \bmod p)\) is uniform over the \(p(p-1)\) pairs with \(u \neq v\): the map is invertible, with \(a = (u-v)(x-y)^{-1}\) and \(b = u - ax\). For each fixed \(u\), the number of \(v \neq u\) with \(v \equiv u \pmod{m}\) is at most \(\lceil p/m \rceil - 1 \leq (p-1)/m\). Hence at most \(p(p-1)/m\) of the pairs collide, giving \(\Pr[h_{a,b}(x) = h_{a,b}(y)] \leq 1/m\).
3. Sampling-Based Median Finding 🔑
Chebyshev enables a clever algorithm that finds the median of \(n\) elements using only \(O(n)\) comparisons, much fewer than the \(\Theta(n \log n)\) needed for full sorting.
3.1 The Algorithm
Let \(S\) be a set of \(n\) elements with a total order. We want to find the median, the element ranked \(n/2\) (assume \(n\) is even).
```python
import random

def sampling_median(S):
    n = len(S)
    k = int(n ** (3 / 4))  # sample size
    # Step 1: sample k elements uniformly at random with replacement
    sample = random.choices(S, k=k)
    # Step 2: sort the sample (O(k log k) = o(n) comparisons)
    sample.sort()
    # Step 3: define window boundaries (clamped to valid sample indices)
    lo_rank = max(k // 2 - int(n ** (1 / 2)), 0)      # lower index in sorted sample
    hi_rank = min(k // 2 + int(n ** (1 / 2)), k - 1)  # upper index in sorted sample
    a = sample[lo_rank]
    b = sample[hi_rank]
    # Step 4: compare every element of S to a and b
    below_a = [x for x in S if x < a]  # elements below the window
    above_b = [x for x in S if x > b]  # elements above the window
    candidates = [x for x in S if a <= x <= b]
    # Step 5: check that the rank-(n/2) element lies in the window
    if len(below_a) >= n // 2 or len(above_b) > n - n // 2:
        raise ValueError("FAIL: median not in window")
    if len(candidates) > 4 * int(n ** (3 / 4)):
        raise ValueError("FAIL: window too large")
    candidates.sort()  # sort at most O(n^{3/4}) elements
    # 0-indexed position of the rank-(n/2) element among the candidates
    target_rank = n // 2 - len(below_a) - 1
    return candidates[target_rank]
```

Theorem (Sampling Median). The algorithm above:
1. Fails with probability \(O(n^{-1/4})\).
2. Uses \(2n + o(n)\) comparisons in expectation.
3.2 Analysis via Chebyshev
Let \(m\) denote the true median of \(S\) (the element with rank \(n/2\)). Define:
- \(Y_i = \mathbf{1}[\text{sample}_i \leq m]\) for each of the \(k = n^{3/4}\) sample draws.
- \(Y = \sum_i Y_i\) = number of sample elements \(\leq m\).
Each draw is uniform over \(S\), so \(\mathbb{E}[Y_i] = 1/2\) (exactly half of \(S\) lies \(\leq m\)). Thus \(\mathbb{E}[Y] = k/2\). Draws are independent (sampling with replacement), so
\[\mathrm{Var}(Y) = k \cdot \mathrm{Var}(Y_i) \leq k \cdot \frac{1}{4} = \frac{n^{3/4}}{4}.\]
The event that the window lies above the median (i.e., \(a > m\)) means \(Y < k/2 - \sqrt{n}\): fewer than \(k/2 - \sqrt{n}\) sample elements are \(\leq m\), so the sample element at sorted position \(k/2 - \sqrt{n}\) exceeds \(m\). By Chebyshev:
\[\Pr\!\left[Y \leq \frac{k}{2} - \sqrt{n}\right] \leq \Pr\!\left[|Y - k/2| \geq \sqrt{n}\right] \leq \frac{n^{3/4}/4}{n} = \frac{1}{4n^{1/4}}.\]
By symmetry, the same bound holds for \(b\) failing. By a union bound, the probability that \(m \notin [a, b]\) is \(O(n^{-1/4})\).
Window size. The window \([a, b]\) contains at most the elements between ranks \(k/2 - \sqrt{n}\) and \(k/2 + \sqrt{n}\) in \(S\). Each sample element is uniform, so the expected number of \(S\)-elements in \([a,b]\) is at most \(2\sqrt{n} \cdot (n/k) = 2n^{3/4}\). A Markov argument gives \(|[a,b] \cap S| \leq 4n^{3/4}\) with probability \(\geq 1/2\); a more careful Chebyshev analysis absorbs this into the failure probability.
Comparison count. Step 4 costs exactly \(2n\) comparisons (every element compared to \(a\) and \(b\)). Step 5 sorts \(O(n^{3/4})\) candidates in \(O(n^{3/4} \log n) = o(n)\) comparisons. Total: \(2n + o(n)\).
Context in comparison-based sorting. Any comparison-based median algorithm needs \(\Omega(n)\) comparisons (proved by adversary arguments). The naive approach, sorting everything, uses \(\Theta(n \log n)\). The sampling algorithm achieves \(2n + o(n)\) comparisons while tolerating a tiny failure probability. The optimal deterministic algorithm (Blum et al., 1973, “median of medians”) achieves \(O(n)\) worst-case deterministically, but with a larger constant.
Exercise 3. This exercise tightens the window-size analysis to show the window contains \(O(n^{3/4})\) elements with high probability, completing the correctness proof.
Let \(W\) be the number of elements of \(S\) falling in the interval \([a, b]\) defined by the algorithm. Show that \(\Pr[W > 4n^{3/4}] = O(n^{-1/4})\).
Prerequisites: the sampling-median Chebyshev analysis (Section 3.2); Chebyshev’s inequality (Section 2.1).
[!TIP]- Solution to Exercise 3 Key insight: The number of \(S\)-elements in the window equals the number of \(S\)-elements between the \((k/2 - \sqrt{n})\)-th and \((k/2 + \sqrt{n})\)-th order statistics of the sample. In expectation this is \(2\sqrt{n} \cdot (n/k) \cdot 1 = 2n^{3/4}\) (since each rank-gap in the sample corresponds to \(n/k = n^{1/4}\) elements of \(S\)).
Sketch: Suppose \(W > 4n^{3/4}\). Then at least \(2n^{3/4}\) window elements lie above \(m\), or at least \(2n^{3/4}\) lie below \(m\); consider the first case (the second is symmetric). Then \(b\) has rank greater than \(n/2 + 2n^{3/4}\) in \(S\), so fewer than \(k/2 + \sqrt{n}\) sample points landed among the \(n/2 + 2n^{3/4}\) smallest elements of \(S\). Each sample lands in that prefix independently with probability \(1/2 + 2n^{-1/4}\), so the expected count is \(k/2 + 2\sqrt{n}\), and a count below \(k/2 + \sqrt{n}\) is a deviation of more than \(\sqrt{n}\) below the mean. By Chebyshev (variance \(\leq k/4 = n^{3/4}/4\)), this has probability at most \((n^{3/4}/4)/n = \tfrac{1}{4}n^{-1/4}\). A union bound over both cases gives \(\Pr[W > 4n^{3/4}] = O(n^{-1/4})\).
4. Moment-Generating Functions 📐
To go beyond Chebyshev’s \(O(1/k^2)\) tail decay, we need to exploit the entire distribution of a random variable, not just its first two moments. Moment-generating functions (MGFs) encode all moments simultaneously.
4.1 Definition and Basic Properties
Definition (Moment-Generating Function). For a random variable \(X\), its MGF is
\[M_X(t) = \mathbb{E}[e^{tX}], \quad t \in \mathbb{R},\]
whenever the expectation is finite. When \(M_X(t)\) is finite in an open interval around \(0\), the \(k\)-th moment satisfies
\[\mathbb{E}[X^k] = M_X^{(k)}(0),\]
the \(k\)-th derivative of \(M_X\) evaluated at \(0\). This is immediate from \(e^{tX} = \sum_{k=0}^\infty (tX)^k/k!\) and term-by-term differentiation (justified when \(M_X\) is finite near \(0\)).
MGF characterizes the distribution. If \(M_X(t) = M_Y(t)\) for all \(t\) in an open interval around \(0\), then \(X\) and \(Y\) have the same distribution. This is a non-trivial theorem (a consequence of the injectivity of the bilateral Laplace transform), but it justifies using MGFs as a complete fingerprint of a distribution.
Key property for sums. If \(X\) and \(Y\) are independent, then
\[M_{X+Y}(t) = \mathbb{E}[e^{t(X+Y)}] = \mathbb{E}[e^{tX}]\mathbb{E}[e^{tY}] = M_X(t) \cdot M_Y(t).\]
This multiplicativity under independence is the engine of Chernoff-type bounds.
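A numeric sketch of this multiplicativity for two independent fair dice (any independent pair would do):

```python
import math

# MGF of a single fair die, uniform on {1, ..., 6}
def mgf_die(t):
    return sum(math.exp(t * k) for k in range(1, 7)) / 6

# MGF of the sum of two independent dice, computed from the joint distribution
def mgf_sum(t):
    return sum(math.exp(t * (j + k)) for j in range(1, 7) for k in range(1, 7)) / 36

# Check M_{X+Y}(t) = M_X(t) * M_Y(t) at several points
for t in [-1.0, -0.3, 0.0, 0.5, 1.2]:
    assert math.isclose(mgf_sum(t), mgf_die(t) ** 2, rel_tol=1e-12)
```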
4.2 Gaussian MGF
Proposition. If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then \(M_X(t) = e^{\mu t + \sigma^2 t^2 / 2}\).
Proof. Write \(X = \mu + \sigma Z\) with \(Z \sim \mathcal{N}(0,1)\). Then \(M_X(t) = e^{\mu t} M_Z(\sigma t)\). Compute \(M_Z(s) = \mathbb{E}[e^{sZ}]\) directly:
\[M_Z(s) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{sz} e^{-z^2/2}\,dz = \frac{e^{s^2/2}}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(z-s)^2/2}\,dz = e^{s^2/2}.\]
Therefore \(M_X(t) = e^{\mu t} e^{\sigma^2 t^2/2}\). \(\square\)
[!TIP]- Completing the square The key step is: \(sz - z^2/2 = -(z-s)^2/2 + s^2/2\). Adding and subtracting \(s^2/2\) completes the square, turning the integral into a standard Gaussian integral that equals \(\sqrt{2\pi}\).
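The closed form can be sanity-checked against a direct numerical integration of \(\mathbb{E}[e^{sZ}]\); a sketch using the midpoint rule on a truncated range (the truncation at \(|z| = 12\) and the step count are arbitrary accuracy knobs):

```python
import math

def mgf_std_normal(s, lo=-12.0, hi=12.0, steps=200_000):
    # Midpoint-rule approximation of E[e^{sZ}] for Z ~ N(0, 1)
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        total += math.exp(s * z - z * z / 2.0) * dz
    return total / math.sqrt(2.0 * math.pi)

# Compare against the closed form e^{s^2/2}
for s in [0.0, 0.5, 1.0, 2.0]:
    assert math.isclose(mgf_std_normal(s), math.exp(s * s / 2.0), rel_tol=1e-6)
```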
4.3 MGF of Sums of Independent Bernoullis
Let \(X_1, \ldots, X_n\) be independent Bernoulli random variables with \(\Pr[X_i = 1] = p_i\) and \(\Pr[X_i = 0] = 1 - p_i\). Let \(X = \sum_{i=1}^n X_i\) with mean \(\mu = \sum_i p_i\).
MGF computation. For a single Bernoulli \(X_i\):
\[M_{X_i}(t) = \mathbb{E}[e^{tX_i}] = (1-p_i) + p_i e^t = 1 + p_i(e^t - 1).\]
By independence,
\[M_X(t) = \prod_{i=1}^n M_{X_i}(t) = \prod_{i=1}^n \left(1 + p_i(e^t - 1)\right).\]
Bounding via \(1 + x \leq e^x\). Using the elementary inequality \(1 + x \leq e^x\) for all \(x \in \mathbb{R}\),
\[M_X(t) \leq \prod_{i=1}^n e^{p_i(e^t - 1)} = \exp\!\left(\mu (e^t - 1)\right).\]
This clean upper bound on \(M_X(t)\) is what drives the Chernoff bound derivation.
Why \(1 + x \leq e^x\). The function \(f(x) = e^x - 1 - x\) satisfies \(f(0) = 0\) and \(f'(x) = e^x - 1 \geq 0\) for \(x \geq 0\) and \(f'(x) \leq 0\) for \(x \leq 0\). So \(f\) has a global minimum of \(0\) at \(x = 0\), proving \(e^x \geq 1 + x\) always.
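A quick check of the MGF bound for a heterogeneous Bernoulli sum (the \(p_i\) values below are illustrative):

```python
import math

# Heterogeneous Bernoulli parameters (illustrative values)
ps = [0.1, 0.25, 0.4, 0.05, 0.3, 0.5]
mu = sum(ps)

def mgf_bernoulli_sum(t):
    prod = 1.0
    for p in ps:
        prod *= 1.0 + p * (math.exp(t) - 1.0)  # MGF of a single Bernoulli(p)
    return prod

# Check M_X(t) <= exp(mu * (e^t - 1)); the bound holds for every real t
for t in [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    assert mgf_bernoulli_sum(t) <= math.exp(mu * (math.exp(t) - 1.0)) + 1e-12
```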
Exercise 4. This exercise derives the MGF of a Poisson random variable and connects it to the Bernoulli-sum bound above via the Poisson limit theorem.
Let \(X \sim \mathrm{Poisson}(\lambda)\), so \(\Pr[X = k] = e^{-\lambda}\lambda^k/k!\).
Show \(M_X(t) = \exp(\lambda(e^t - 1))\).
Observe this matches the Bernoulli-sum MGF bound \(\exp(\mu(e^t - 1))\) exactly when \(\lambda = \mu\). Explain why this is expected from the Poisson limit of Binomial.
Prerequisites: Bernoulli sum MGF (Section 4.3); MGF definition and moment extraction (Section 4.1).
[!TIP]- Solution to Exercise 4 Key insight: The Poisson distribution arises as the limit of a sum of independent Bernoullis with small, equal \(p_i\), so their MGFs must agree in the limit.
Sketch (a): \(M_X(t) = \sum_{k=0}^\infty e^{tk} e^{-\lambda}\lambda^k/k! = e^{-\lambda}\sum_{k=0}^\infty (\lambda e^t)^k/k! = e^{-\lambda} e^{\lambda e^t} = \exp(\lambda(e^t-1))\).
Sketch (b): By the Poisson limit theorem, if \(X_1^{(n)}, \ldots, X_n^{(n)}\) are i.i.d. \(\mathrm{Bernoulli}(\lambda/n)\), then \(S_n = \sum_i X_i^{(n)} \xrightarrow{d} \mathrm{Poisson}(\lambda)\) as \(n \to \infty\). Correspondingly, \(M_{S_n}(t) = (1 + (\lambda/n)(e^t-1))^n \to e^{\lambda(e^t-1)}\). The bound \(\exp(\mu(e^t-1))\) saturates (becomes an equality) precisely in the Poisson limit.
5. The Chernoff Bound 🔑
Chernoff bounds give exponentially small tail probabilities for sums of independent bounded random variables — a dramatic improvement over Chebyshev’s polynomial \(1/k^2\) decay.
5.1 The Markov-on-MGF Trick
The key idea is to apply Markov’s inequality not to \(X\) itself, but to the exponential \(e^{tX}\) for a well-chosen \(t > 0\). Since \(e^{tX}\) is non-negative and monotone in \(X\), the event \(\{X \geq c\}\) equals \(\{e^{tX} \geq e^{tc}\}\). Markov gives:
\[\Pr[X \geq c] = \Pr[e^{tX} \geq e^{tc}] \leq \frac{\mathbb{E}[e^{tX}]}{e^{tc}} = \frac{M_X(t)}{e^{tc}}.\]
This holds for any \(t > 0\). To get the best bound, we minimize the right-hand side over \(t > 0\):
\[\Pr[X \geq c] \leq \inf_{t > 0} \frac{M_X(t)}{e^{tc}} = \inf_{t > 0} e^{\log M_X(t) - tc}.\]
The function \(\log M_X(t) - tc\) is convex in \(t\) (since \(\log M_X\) is convex), so the infimum is achieved at the \(t^*\) satisfying \((\log M_X)'(t^*) = c\), i.e., \(\mathbb{E}[Xe^{t^*X}]/M_X(t^*) = c\).
Why exponential improvement? Markov applied directly to \(X\) gives \(\Pr[X \geq c] \leq \mu/c\) — only polynomial decay in \(c\). By “lifting” to \(e^{tX}\), the bound becomes \(e^{\mu(e^t - 1)}/e^{tc}\). Optimizing \(t\) gives an exponential in \(c\) in the denominator: the probability decays like \(e^{-\Omega(c)}\) rather than \(1/c\).
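To see the improvement concretely, compare the exact Binomial tail with the Markov and Chernoff bounds at one illustrative setting (\(n = 100\), \(p = 1/2\), threshold \(c = 75\)):

```python
import math

def binom_tail(n, p, c):
    # Exact Pr[X >= c] for X ~ Binomial(n, p)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

n, p = 100, 0.5
mu = n * p
c = 75                   # threshold (1 + delta) * mu with delta = 0.5
delta = c / mu - 1.0

exact = binom_tail(n, p, c)
markov = mu / c
chernoff = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

# Both are valid upper bounds, but Chernoff is dramatically sharper here
assert exact <= chernoff <= markov
assert chernoff < 0.01 < markov
```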
5.2 Main Theorem: Upper and Lower Tails
Theorem (Chernoff Bounds). Let \(X = \sum_{i=1}^n X_i\) where \(X_1, \ldots, X_n\) are independent Bernoulli random variables (not necessarily identical) with \(\mathbb{E}[X] = \mu\).
Upper tail. For any \(\delta > 0\),
\[\Pr[X \geq (1+\delta)\mu] \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.\]
Lower tail. For any \(0 < \delta < 1\),
\[\Pr[X \leq (1-\delta)\mu] \leq \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^\mu.\]
Proof (upper tail). Set \(c = (1+\delta)\mu\). From Section 4.3, \(M_X(t) \leq e^{\mu(e^t - 1)}\). The Markov-on-MGF trick gives
\[\Pr[X \geq (1+\delta)\mu] \leq \frac{e^{\mu(e^t-1)}}{e^{t(1+\delta)\mu}}\]
for any \(t > 0\). Optimize: set \(t = \ln(1+\delta)\) (the minimizer, obtained by differentiating the exponent \(\mu(e^t - 1) - t(1+\delta)\mu\) and setting to zero: \(e^t = 1+\delta\)). Substituting:
\[\Pr[X \geq (1+\delta)\mu] \leq \frac{e^{\mu((1+\delta)-1)}}{e^{\ln(1+\delta)(1+\delta)\mu}} = \frac{e^{\mu\delta}}{(1+\delta)^{(1+\delta)\mu}} = \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.\]
Proof (lower tail). For the lower tail, apply Markov to \(e^{-tX}\) for \(t > 0\):
\[\Pr[X \leq (1-\delta)\mu] = \Pr[e^{-tX} \geq e^{-t(1-\delta)\mu}] \leq \frac{M_X(-t)}{e^{-t(1-\delta)\mu}} \leq \frac{e^{\mu(e^{-t}-1)}}{e^{-t(1-\delta)\mu}}.\]
Set \(t = -\ln(1-\delta) > 0\) (for \(\delta \in (0,1)\)). Then \(e^{-t} = 1-\delta\) and \(e^{\mu(e^{-t}-1)} = e^{-\mu\delta}\). Substituting:
\[\Pr[X \leq (1-\delta)\mu] \leq \frac{e^{-\mu\delta}}{(1-\delta)^{(1-\delta)\mu}} = \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^\mu. \quad \square\]
The Chernoff bound as a KL divergence. The exponent \(\mu\ln\!\left(\frac{(1+\delta)^{1+\delta}}{e^\delta}\right) = \mu\left[(1+\delta)\ln(1+\delta) - \delta\right]\) is exactly the Kullback–Leibler divergence \(D_{\mathrm{KL}}\big(\mathrm{Poisson}((1+\delta)\mu) \,\big\|\, \mathrm{Poisson}(\mu)\big)\). This connection to information theory makes Chernoff bounds central to coding theory and hypothesis testing.
5.3 Simplified Corollaries
The exact Chernoff form is algebraically unwieldy. Two simpler corollaries cover most applications:
Corollary (Small-\(\delta\) upper tail). For \(\delta \in (0, 1]\),
\[\Pr[X \geq (1+\delta)\mu] \leq e^{-\mu\delta^2/3}.\]
Proof sketch. It suffices to show \((e^\delta/(1+\delta)^{1+\delta})^\mu \leq e^{-\mu\delta^2/3}\), equivalently \(\delta - (1+\delta)\ln(1+\delta) \leq -\delta^2/3\). Using the bound \(\ln(1+\delta) \geq \delta - \delta^2/2\) (valid for \(\delta \geq 0\)) together with a short derivative comparison (carried out in Exercise 5) gives the result for \(\delta \in (0,1]\). \(\square\)
Corollary (Large-\(c\) upper tail). For \(c \geq 6\),
\[\Pr[X \geq c\mu] \leq 2^{-c\mu}.\]
Proof sketch. Write \(c = 1 + \delta\). Then \(\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu = \left(\frac{e^{c-1}}{c^c}\right)^\mu \leq \left(\frac{e}{c}\right)^{c\mu}\), and for \(c \geq 6\) we have \(e/c \leq e/6 < 1/2\), so the right-hand side is at most \(2^{-c\mu}\). \(\square\)
🔑 Combining both corollaries: For \(\delta \in (0,1]\), the bound \(e^{-\mu\delta^2/3}\) gives sub-Gaussian decay. For \(c \geq 6\), the bound \(2^{-c\mu}\) gives geometric decay.
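Both corollaries are easy to verify numerically from the exact Chernoff base; a sketch over a grid of \(\delta\) values and a few values of \(c\):

```python
import math

# Small-delta corollary: (e^d / (1+d)^{1+d}) <= exp(-d^2/3) on a grid in (0, 1]
for i in range(1, 101):
    d = i / 100.0
    exact_base = math.exp(d) / (1 + d) ** (1 + d)
    assert exact_base <= math.exp(-d * d / 3.0) + 1e-15

# Large-c corollary: e^{c-1}/c^c <= 2^{-c} for c >= 6
for c in [6, 7, 10, 20]:
    assert math.exp(c - 1) / c**c <= 2.0 ** (-c)
```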
When Chernoff is inapplicable. Chernoff bounds require independence (not just pairwise independence). Using them for pairwise-independent sums (e.g., Carter-Wegman hashing) is incorrect — Chebyshev is the right tool there. Chernoff also requires bounded summands; see Hoeffding’s inequality for the general bounded case.
5.4 Hoeffding’s and Bernstein’s Inequalities
Chernoff bounds extend to more general settings:
Theorem (Hoeffding’s Inequality). Let \(X_1, \ldots, X_n\) be independent random variables with \(a_i \leq X_i \leq b_i\) almost surely. Let \(X = \sum_i X_i\) and \(\mu = \mathbb{E}[X]\). For any \(t > 0\),
\[\Pr[X - \mu \geq t] \leq \exp\!\left(-\frac{2t^2}{\sum_i (b_i - a_i)^2}\right).\]
The key step is showing that for a bounded \([a,b]\) random variable \(Z\) with \(\mathbb{E}[Z] = 0\), the MGF satisfies \(\mathbb{E}[e^{tZ}] \leq \exp(t^2(b-a)^2/8)\) (Hoeffding’s lemma, proved via convexity of \(e^{tx}\)).
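Hoeffding's lemma can be checked numerically for the centered Bernoulli, where the range is \(b - a = 1\) (the grids of \(p\) and \(t\) below are illustrative):

```python
import math

# Centered Bernoulli: Z = 1-p with prob p, -p with prob 1-p.
# Then E[Z] = 0 and Z lies in [-p, 1-p], an interval of width 1.
def mgf_centered_bernoulli(p, t):
    return p * math.exp(t * (1 - p)) + (1 - p) * math.exp(-t * p)

# Hoeffding's lemma: E[e^{tZ}] <= exp(t^2 (b-a)^2 / 8) = exp(t^2 / 8)
for p in [0.1, 0.3, 0.5, 0.8]:
    for t in [-3.0, -1.0, -0.2, 0.2, 1.0, 3.0]:
        assert mgf_centered_bernoulli(p, t) <= math.exp(t * t / 8.0) + 1e-12
```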
Theorem (Bernstein’s Inequality). Under the same setup, if additionally \(|X_i - \mathbb{E}[X_i]| \leq M\) a.s. and \(\sum_i \mathrm{Var}(X_i) = \sigma^2\), then
\[\Pr[X - \mu \geq t] \leq \exp\!\left(-\frac{t^2/2}{\sigma^2 + Mt/3}\right).\]
Bernstein improves on Hoeffding in the regime where \(t\) is small relative to \(M\): the denominator \(\sigma^2 + Mt/3\) reflects the actual variance, whereas Hoeffding uses the worst-case variance \((b-a)^2/4\).
Hierarchy of bounds. From weakest to strongest (each requiring more structural assumptions):
- Markov: non-negativity only → polynomial decay \(O(1/c)\)
- Chebyshev: second moment → polynomial decay \(O(1/c^2)\)
- Hoeffding/Bernstein: boundedness → exponential decay \(e^{-\Omega(c^2)}\)
- Chernoff (0/1 variables): exact product formula → optimal exponential decay
Exercise 5. This exercise derives the exact Chernoff bound for the Binomial distribution and verifies the small-\(\delta\) corollary.
Let \(X \sim \mathrm{Binomial}(n, p)\) with \(\mu = np\). Show from the general Chernoff bound that \(\Pr[X \geq (1+\delta)\mu] \leq (e^\delta/(1+\delta)^{1+\delta})^\mu\), then prove the small-\(\delta\) corollary \(\Pr[X \geq (1+\delta)\mu] \leq e^{-\mu\delta^2/3}\) for \(\delta \in (0,1]\).
Prerequisites: Chernoff bound theorem (Section 5.2); Bernoulli sum MGF (Section 4.3).
[!TIP]- Solution to Exercise 5 Key insight: The small-\(\delta\) corollary follows from showing \(\delta - (1+\delta)\ln(1+\delta) \leq -\delta^2/3\) for \(\delta \in (0,1]\).
Sketch: For the Binomial, \(X = \sum_{i=1}^n X_i\) with \(X_i \sim \mathrm{Bernoulli}(p)\) i.i.d., so the general theorem applies directly with \(\mu = np\). For the corollary: let \(f(\delta) = \delta - (1+\delta)\ln(1+\delta) + \delta^2/3\). We want \(f(\delta) \leq 0\) for \(\delta \in (0,1]\). We have \(f(0) = 0\). Differentiating: \(f'(\delta) = 1 - \ln(1+\delta) - 1 + 2\delta/3 = 2\delta/3 - \ln(1+\delta)\). Using \(\ln(1+\delta) \geq \delta - \delta^2/2\): \(f'(\delta) \leq 2\delta/3 - \delta + \delta^2/2 = \delta(\delta/2 - 1/3)\). For \(\delta \leq 2/3\), \(f'(\delta) \leq 0\); checking \(\delta \in (2/3, 1]\) numerically or by second-order bounds confirms \(f(\delta) \leq 0\) throughout \([0,1]\).
Exercise 6. This exercise shows that Hoeffding’s inequality recovers the sub-Gaussian tail for sums of centered Bernoullis, and compares it to the Chernoff bound.
Let \(Z_i = X_i - p\) where \(X_i \sim \mathrm{Bernoulli}(p)\) i.i.d. Note \(Z_i \in [-p, 1-p]\).
Apply Hoeffding’s inequality to bound \(\Pr[\sum_i Z_i \geq t]\).
Compare the resulting bound to the Chernoff corollary \(e^{-\mu\delta^2/3}\) (with \(t = \delta\mu\)). Which is tighter, and when?
Prerequisites: Hoeffding’s inequality (Section 5.4); Chernoff corollaries (Section 5.3).
[!TIP]- Solution to Exercise 6 Key insight: Hoeffding uses the range \((b_i - a_i)^2 = 1^2 = 1\), giving a bound that does not depend on \(p\); Chernoff exploits the variance \(\mathrm{Var}(X_i) = p(1-p)\), which is smaller than \(1/4\) for \(p \neq 1/2\).
Sketch (a): Each \(Z_i \in [-p, 1-p]\) has range \(b_i - a_i = 1\). Hoeffding gives \(\Pr[\sum_i Z_i \geq t] \leq e^{-2t^2/n}\).
Sketch (b): The Chernoff corollary gives \(e^{-\mu\delta^2/3}\). Setting \(t = \delta\mu = \delta np\): Hoeffding gives \(e^{-2\delta^2 n^2 p^2/n} = e^{-2\delta^2 np^2}\), while Chernoff gives \(e^{-np\delta^2/3}\). The exponent ratio is \(2np^2\delta^2\) vs. \(np\delta^2/3\), so Chernoff is tighter whenever \(p < 1/6\) (small probability events) and Hoeffding is tighter for \(p > 1/6\). In general, Bernstein’s inequality interpolates between the two.
6. Application: Randomized Routing on the Hypercube 🌐
A beautiful application of Chernoff bounds is the analysis of Valiant’s randomized routing scheme, which shows that simple randomization achieves near-optimal routing on the hypercube network.
6.1 Setup and Protocol
Let \(N = 2^n\) processors form an \(n\)-dimensional hypercube: vertices are binary strings \(\{0,1\}^n\), and edges connect strings differing in exactly one bit. Each processor \(i\) holds one packet destined for processor \(t_i = \sigma(i)\) for some permutation \(\sigma\).
Bit-fixing routing. To route from \(s\) to \(d\), the naive bit-fixing algorithm corrects bits of \(s\) to match \(d\) from left to right: at each step, find the leftmost bit where the current node differs from \(d\) and traverse the corresponding edge. This produces a shortest path of length \(\leq n\).
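A minimal implementation of bit-fixing on the \(n\)-cube, representing nodes as \(n\)-bit integers (the specific source and destination in the check are arbitrary):

```python
def bit_fixing_path(s, d, n):
    """Left-to-right bit-fixing route from node s to node d on the n-cube.

    Nodes are n-bit integers; the most significant bit plays the role of
    the leftmost bit. Returns the list of nodes visited, from s to d.
    """
    path = [s]
    cur = s
    for k in range(n):             # fix bit positions left to right
        bit = 1 << (n - 1 - k)     # mask for the k-th most significant bit
        if (cur & bit) != (d & bit):
            cur ^= bit             # traverse the edge flipping this bit
            path.append(cur)
    return path

n = 4
p = bit_fixing_path(0b0110, 0b1011, n)
assert p[0] == 0b0110 and p[-1] == 0b1011
# The path is a shortest path: its length equals the Hamming distance
assert len(p) - 1 == bin(0b0110 ^ 0b1011).count("1")
```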
Oblivious routing lower bound. Deterministic oblivious routing (where the path of packet \(i\) depends only on its source and destination) has worst-case delivery time \(\Omega(2^{n/2}/n) = \Omega(\sqrt{N}/n)\). This is an exponential blow-up over the \(O(n)\) diameter.
Valiant’s two-phase protocol. Each packet \(i\) at source \(s_i\) picks a random intermediate destination \(\delta_i \sim \mathrm{Uniform}(\{0,1\}^n)\), independently of all others. The routing proceeds in two phases:
- Route packet \(i\) from \(s_i\) to \(\delta_i\) via bit-fixing.
- Route packet \(i\) from \(\delta_i\) to \(t_i\) via bit-fixing.
Each phase runs bit-fixing in parallel for all packets, using a simple FIFO queue at each edge: when multiple packets want to cross the same edge simultaneously, one proceeds and the rest wait.
```mermaid
flowchart LR
    s["Source s_i"]
    delta["Random intermediate δ_i"]
    t["Destination t_i"]
    s -->|"Phase 1: bit-fixing"| delta
    delta -->|"Phase 2: bit-fixing"| t
```
6.2 Bounding Delay via Chernoff
Definition. The delay \(D(i)\) of packet \(i\) in Phase 1 is (total time to reach \(\delta_i\)) minus (path length from \(s_i\) to \(\delta_i\)). Equivalently, \(D(i)\) counts the total number of steps packet \(i\) spends waiting.
Lemma. \(D(i) \leq\) (number of other packets whose Phase-1 path shares at least one edge with packet \(i\)’s path).
Proof sketch. Each time packet \(i\) waits one step, some other packet \(j\) crosses the next edge of \(i\)’s path at that moment, so only packets whose paths intersect \(P_i\) can ever delay \(i\). A crude bound charges each such packet once per shared edge, giving \(D(i) \leq n \cdot |\{j : P_j \cap P_i \neq \emptyset\}|\). The token-counting argument in the original Valiant-Brebner proof is sharper: once a packet leaves \(i\)’s path, it never delays \(i\) again, so each intersecting packet contributes at most one step of delay, giving \(D(i) \leq |\{j \neq i : P_j \text{ and } P_i \text{ share an edge}\}|\).
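The quantity \(T_i = |\{j \neq i : P_j \cap P_i \neq \emptyset\}|\) from the lemma can be measured directly in a small simulation of Phase 1 (here \(n = 8\), the fixed RNG seed, and the \(4n\) threshold are illustrative; the analysis below bounds \(\mathbb{E}[T_i]\) by \(n/2\)):

```python
import random

def bit_fixing_edges(s, d, n):
    # Set of directed edges on the left-to-right bit-fixing path from s to d
    edges, cur = set(), s
    for k in range(n):
        bit = 1 << (n - 1 - k)
        if (cur & bit) != (d & bit):
            edges.add((cur, cur ^ bit))
            cur ^= bit
    return edges

rng = random.Random(0)
n = 8
N = 1 << n
# Phase 1: every packet routes to an independent uniform intermediate node
inter = [rng.randrange(N) for _ in range(N)]
paths = [bit_fixing_edges(s, inter[s], n) for s in range(N)]

# T_i = number of other packets sharing at least one edge with packet i's path
T = [sum(1 for j in range(N) if j != i and paths[i] & paths[j]) for i in range(N)]

# E[T_i] <= n/2 = 4; exceeding 4n would be an astronomically unlikely event
assert max(T) <= 4 * n
```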
Expected congestion. Fix packet \(i\) and condition on its Phase 1 path \(P_i\), consisting of \(\ell \leq n\) edges. For each \(j \neq i\), let \(H_j = \mathbf{1}[P_j \text{ shares an edge with } P_i]\), and set \(T_i = \sum_{j \neq i} H_j\); the delay lemma gives \(D(i) \leq T_i\). Note that the \(H_j\) are independent Bernoullis, since each depends only on the independent choice of \(\delta_j\).
Lemma. \(\mathbb{E}[T_i] \leq n/2\).
Proof. For a directed edge \(e\), let \(C(e) = |\{j \neq i : e \in P_j\}|\) denote the traffic on \(e\). Every packet counted by \(T_i\) uses at least one edge of \(P_i\), so
\[T_i \;\leq\; \sum_{e \in P_i} C(e), \qquad\text{hence}\qquad \mathbb{E}[T_i] \;\leq\; \sum_{e \in P_i} \mathbb{E}[C(e)].\]
It therefore suffices to show \(\mathbb{E}[C(e)] \leq 1/2\) for every directed edge \(e\).
Fix \(e = (v, v')\), where \(v\) and \(v'\) differ in bit \(k\). The bit-fixing path of packet \(j\) crosses \(e\) if and only if (1) \(s_j\) agrees with \(v\) on bits \(k, k+1, \ldots, n\), so that after bits \(1, \ldots, k-1\) have been fixed, packet \(j\) sits at \(v\); and (2) \(\delta_j\) agrees with \(v'\) on bits \(1, \ldots, k\), so that the already-fixed bits match \(v\) and fixing bit \(k\) sends the packet across \(e\). Exactly \(2^{k-1}\) sources satisfy (1) (bits \(1, \ldots, k-1\) of \(s_j\) are unconstrained), and for each such source, (2) holds with probability \(2^{-k}\) over the uniform choice of \(\delta_j\). Hence
\[\mathbb{E}[C(e)] \;\leq\; 2^{k-1} \cdot 2^{-k} \;=\; \frac{1}{2},\]
and summing over the \(\ell \leq n\) edges of \(P_i\) gives \(\mathbb{E}[T_i] \leq n/2\). \(\square\)
🔑 Sanity check. The total expected path length over all \(N\) packets is \(N \cdot n/2\), since each bit of \(\delta_j\) differs from the corresponding bit of \(s_j\) with probability \(1/2\). Spread over the \(n \cdot 2^n = nN\) directed edges of the hypercube, the average load per edge is exactly \(1/2\), consistent with the per-edge computation above.
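As a sanity check on the lemma \(\mathbb{E}[T_i] \leq n/2\), here is a short Monte Carlo sketch (not from the lecture; the helper names `path_edges` and `mean_conflicts` are illustrative) that samples uniform intermediate destinations in the \(n = 8\) cube and estimates the expected number of packets conflicting with packet \(0\):

```python
import random

def path_edges(s, d, n):
    """Directed edges of the bit-fixing path from s to d (lowest-order bit fixed first)."""
    edges, cur = [], s
    for k in range(n):
        b = 1 << k
        if (cur ^ d) & b:
            edges.append((cur, cur ^ b))
            cur ^= b
    return edges

def mean_conflicts(n, trials=200, seed=1):
    """Monte Carlo estimate of E[T_i]: the expected number of other packets
    whose bit-fixing path shares an edge with packet 0's path."""
    rng = random.Random(seed)
    N = 1 << n
    total = 0
    for _ in range(trials):
        delta = [rng.randrange(N) for _ in range(N)]   # uniform intermediate destinations
        P0 = set(path_edges(0, delta[0], n))
        total += sum(1 for j in range(1, N)
                     if any(e in P0 for e in path_edges(j, delta[j], n)))
    return total / trials

n = 8
est = mean_conflicts(n)
assert 0 < est <= n / 2   # consistent with the lemma E[T_i] <= n/2
```

Since the estimate also averages over the random path \(P_0\), whose expected length is only \(n/2\), it typically lands well below the worst-case bound \(n/2\).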
Now apply Chernoff to \(T_i = \sum_{j \neq i} H_j\) (independent Bernoullis, since the \(\delta_j\) are independent):
\[\Pr[D(i) > 3n] \;\leq\; \Pr[T_i \geq 3n].\]
Since \(\mathbb{E}[T_i] \leq n/2\), the threshold satisfies \(3n = 6 \cdot (n/2) \geq 6\,\mathbb{E}[T_i]\), so the \(c \geq 6\) corollary (applied with \(R = 3n\)) gives
\[\Pr[D(i) > 3n] \leq 2^{-3n}.\]
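To see how much room the \(c \geq 6\) corollary (\(\Pr[X \geq R] \leq 2^{-R}\) for \(R \geq 6\,\mathbb{E}[X]\)) leaves, one can compare it against an exact binomial tail. A minimal sketch with illustrative parameters (\(n = 1000\), \(p = 0.005\), so \(\mu = 5\) and \(R = 30 = 6\mu\)):

```python
from math import comb

def binom_tail(n, p, r):
    """Exact Pr[Bin(n, p) >= r], summing the binomial pmf from r to n."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, n + 1))

n, p = 1000, 0.005        # mu = n * p = 5
R = 30                    # R = 6 * mu, the smallest threshold the corollary covers
tail = binom_tail(n, p, R)
bound = 2.0 ** (-R)
assert 0 < tail <= bound  # the corollary holds, with room to spare
```

At these parameters the exact tail is several orders of magnitude below \(2^{-R}\), reflecting the slack built into the corollary.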
Phase 2 identical Phase 2 routes each packet by bit-fixing from \(\delta_i\) to its true target \(t_i\). Since the \(\delta_i\) are uniform and independent, the per-edge congestion computation goes through with the roles of source and destination exchanged, and the Phase 2 delay likewise satisfies \(\Pr[D_2(i) > 3n] \leq 2^{-3n}\).
6.3 Global Termination and Deterministic Lower Bound
Theorem (Valiant Routing). The two-phase randomized routing scheme delivers all \(N = 2^n\) packets within \(8n\) steps with probability at least \(1 - 2^{1-2n}\).
Proof. For each packet \(i\), by the analysis above:
\[\Pr[D_1(i) > 3n] \leq 2^{-3n}, \qquad \Pr[D_2(i) > 3n] \leq 2^{-3n}.\]
Each phase consists of at most \(n\) routing steps plus the queueing delay, so packet \(i\) completes Phase 1 within \(n + D_1(i)\) steps and Phase 2 within \(n + D_2(i)\) steps, for a total of at most \(2n + D_1(i) + D_2(i)\) steps. (To keep the phases from interfering, all packets can wait until step \(4n\) to begin Phase 2; if every \(D_1(i) \leq 3n\) and \(D_2(i) \leq 3n\), this schedule still finishes by step \(8n\).) By a union bound over both phases and all \(N\) packets:
\[\Pr[\exists\, i : \text{total steps} > 8n] \leq \sum_{i=1}^N \left(\Pr[D_1(i) > 3n] + \Pr[D_2(i) > 3n]\right) \leq N \cdot 2 \cdot 2^{-3n} = 2^n \cdot 2^{1-3n} = 2^{1-2n}.\]
Hence all packets are delivered in \(8n = O(\log N)\) steps with probability \(\geq 1 - 2^{1-2n}\). \(\square\)
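The per-packet bound \(2n + D_1(i) + D_2(i)\) from the proof, combined with the delay lemma \(D(i) \leq T_i\), can be exercised empirically. A small simulation sketch (illustrative helper names; one seeded random permutation on the \(n = 8\) cube) measures the worst conflict counts over all packets in both phases:

```python
import random

def path_edges(s, d, n):
    """Directed edges of the bit-fixing path from s to d (lowest-order bit fixed first)."""
    edges, cur = [], s
    for k in range(n):
        b = 1 << k
        if (cur ^ d) & b:
            edges.append((cur, cur ^ b))
            cur ^= b
    return edges

def conflict_counts(paths):
    """T[i] = number of other packets whose path shares an edge with path i."""
    sets = [set(p) for p in paths]
    return [sum(1 for j, pj in enumerate(sets) if j != i and si & pj)
            for i, si in enumerate(sets)]

def worst_packet_bound(n, seed=0):
    """Max over packets of the proof's per-packet bound 2n + T1(i) + T2(i)."""
    rng = random.Random(seed)
    N = 1 << n
    t = list(range(N))
    rng.shuffle(t)                                  # an arbitrary target permutation
    delta = [rng.randrange(N) for _ in range(N)]    # uniform intermediate destinations
    T1 = conflict_counts([path_edges(i, delta[i], n) for i in range(N)])
    T2 = conflict_counts([path_edges(delta[i], t[i], n) for i in range(N)])
    return max(2 * n + T1[i] + T2[i] for i in range(N))

n = 8
assert worst_packet_bound(n) <= 8 * n   # both phase delays stayed under 3n
```

With \(N = 256\) packets the worst observed bound is typically far below \(8n\), in line with the \(2^{-3n}\) tail on each phase delay.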
🔑 Key contrast. The deterministic lower bound for any oblivious routing scheme on the hypercube is \(\Omega(2^{n/2}/n) = \Omega(\sqrt{N}/\log N)\) in the worst case. Valiant’s scheme achieves \(O(n) = O(\log N)\) with high probability using only randomization — an exponential improvement.
Tight lower bound for randomized routing? Valiant’s scheme uses \(6n\) steps, but is this optimal? The expected completion time is \(O(n)\), matching the diameter. Open question: what is the exact constant, and can the \(2^{-\Omega(n)}\) failure probability be tightened to \(e^{-\Omega(n^2)}\) with a different analysis?
Exercise 7 This exercise shows that dropping the two-phase structure and using direct bit-fixing routing is insufficient: a single bad permutation creates \(\Omega(\sqrt{N})\) congestion.
Consider the bit-reversal permutation \(\sigma\) on \(\{0,1\}^n\): \(\sigma(b_1 b_2 \cdots b_n) = b_n b_{n-1} \cdots b_1\). Show that when bit-fixing routing is used for \(\sigma\), there is a single edge through which \(\Omega(2^{n/2})\) paths pass. (For an arbitrary deterministic oblivious scheme, a counting argument still produces some permutation with \(\Omega(2^{n/2}/n)\) congestion, matching the lower bound quoted above.)
Prerequisites: #6.1 Setup and Protocol|Hypercube routing setup; counting arguments.
[!TIP]- Solution to Exercise 7 Key insight: bit-fixing funnels an exponential number of the bit-reversal paths through a single low-order node, and half of those continue along the same outgoing edge.
Sketch: Consider the \(2^{n/2}\) sources of the form \(s = b_1 \cdots b_{n/2}\, 0 \cdots 0\) (last \(n/2\) bits zero). The target is \(\sigma(s) = 0 \cdots 0\, b_{n/2} \cdots b_1\), whose first \(n/2\) bits are zero. Bit-fixing corrects bits \(1, \ldots, n/2\) first, clearing them, so every one of these packets passes through the all-zeros node \(0^n\). The next bit to be fixed is bit \(n/2 + 1\), whose target value is \(b_{n/2}\); the \(2^{n/2-1}\) packets with \(b_{n/2} = 1\) all leave \(0^n\) along the same edge, from \(0^n\) to \(0^{n/2} 1\, 0^{n/2-1}\). Hence that single edge carries \(2^{n/2-1} = \Omega(2^{n/2}) = \Omega(\sqrt{N})\) paths, and since an edge forwards only one packet per step, bit-fixing on \(\sigma\) requires \(\Omega(\sqrt{N})\) steps.
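The bottleneck edge can be exhibited directly. A short script (illustrative helper names; \(n = 10\), so \(N = 1024\)) routes the bit-reversal permutation by bit-fixing, tallies the load on every directed edge, and confirms that some edge carries at least \(2^{n/2-1}\) paths:

```python
from collections import Counter

def path_edges(s, d, n):
    """Directed edges of the bit-fixing path from s to d (lowest-order bit fixed first)."""
    edges, cur = [], s
    for k in range(n):
        b = 1 << k
        if (cur ^ d) & b:
            edges.append((cur, cur ^ b))
            cur ^= b
    return edges

def bit_reverse(x, n):
    """Reverse the n-bit binary representation of x."""
    r = 0
    for k in range(n):
        if x & (1 << k):
            r |= 1 << (n - 1 - k)
    return r

n = 10
load = Counter()
for s in range(1 << n):                      # route every source to its bit-reversal
    for e in path_edges(s, bit_reverse(s, n), n):
        load[e] += 1

max_load = max(load.values())
assert max_load >= 2 ** (n // 2 - 1)         # Omega(sqrt(N)) paths through one edge
# the edge leaving the all-zeros node is such a bottleneck:
assert load[(0, 1 << (n // 2))] >= 2 ** (n // 2 - 1)
```

With this bit ordering (lowest-order bit fixed first), the congested edge is exactly the one identified in the sketch: from node \(0\) toward the node with only bit \(n/2\) set.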
Exercise 8 This exercise shows that the \(1+x \leq e^x\) bound is tight enough for Chernoff purposes, but quantifies the slack introduced at each step.
Let \(X = \sum_{i=1}^n X_i\) with \(X_i \sim \mathrm{Bernoulli}(p)\) i.i.d. and \(\mu = np\). Compare the exact MGF \(M_X(t) = (1 - p + pe^t)^n\) against the upper bound \(e^{\mu(e^t-1)}\) used in the Chernoff derivation. Show that \(M_X(t) \leq e^{\mu(e^t-1)}\), with strict inequality whenever \(p > 0\) and \(t \neq 0\), and that the bound becomes tight in the Poisson limit \(p \to 0\), \(n \to \infty\) with \(np = \mu\) fixed.
Prerequisites: #4.3 MGF of Sums of Independent Bernoullis|Bernoulli sum MGF bound; #4.1 Definition and Basic Properties|MGF properties.
[!TIP]- Solution to Exercise 8 Key insight: The inequality \(1 + x \leq e^x\) is strict for \(x \neq 0\), so the bound is strict for any fixed \(p > 0\). Equality is achieved only in the Poisson limit.
Sketch: We need \((1 - p + pe^t)^n \leq e^{\mu(e^t-1)} = e^{np(e^t-1)}\). Taking logs, this is \(n\ln(1 + p(e^t-1)) \leq np(e^t-1)\). Setting \(x = p(e^t - 1) \geq 0\) (for \(t \geq 0\)), this becomes \(n\ln(1+x) \leq nx\), i.e., \(\ln(1+x) \leq x\), which is exactly \(1+x \leq e^x\). Equality holds iff \(x = 0\), i.e., \(p = 0\) or \(t = 0\). In the Poisson limit \(p \to 0, n \to \infty\) with \(np = \mu\) fixed, the exact MGF \((1 + p(e^t-1))^n \to e^{\mu(e^t-1)}\) (standard Poisson limit), so the bound becomes asymptotically tight.
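The comparison in the sketch is easy to reproduce numerically. A minimal check (illustrative names; parameters \(\mu = 4\), \(t = 0.5\) chosen for illustration) watches the exact MGF approach the Chernoff bound as \(n\) grows with \(np = \mu\) fixed:

```python
from math import exp

def exact_mgf(n, p, t):
    """MGF of Bin(n, p): (1 - p + p e^t)^n."""
    return (1 - p + p * exp(t)) ** n

def poisson_bound(mu, t):
    """Chernoff upper bound e^{mu (e^t - 1)}, i.e. the Poisson(mu) MGF."""
    return exp(mu * (exp(t) - 1.0))

mu, t = 4.0, 0.5
gaps = []
for n in (10, 100, 1000):
    p = mu / n
    m, b = exact_mgf(n, p, t), poisson_bound(mu, t)
    assert m <= b                            # the bound always dominates
    gaps.append(b - m)
assert gaps[0] > gaps[1] > gaps[2] > 0       # and tightens toward the Poisson limit
```

The strictly shrinking gaps are exactly the slack from \(1 + x \leq e^x\) at \(x = p(e^t - 1)\), which vanishes as \(p \to 0\).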
References
| Reference Name | Brief Summary | Link |
|---|---|---|
| Valiant & Brebner (1981) | Original two-phase randomized routing paper on the hypercube; proved \(O(\log N)\) time with high probability | A Scheme for Fast Parallel Communication |
| Motwani & Raghavan, Randomized Algorithms (1995) | Standard textbook; Chapters 3–4 cover Markov, Chebyshev, MGFs, and Chernoff bounds with extensive applications | Cambridge University Press |
| Mitzenmacher & Upfal, Probability and Computing (2005) | Accessible treatment of Chernoff bounds, hashing, and routing; Chapter 4 covers Chernoff | Cambridge University Press |
| Hoeffding (1963) | Original paper proving the Hoeffding inequality for sums of bounded independent random variables | JASA: Probability Inequalities for Sums of Bounded Random Variables |
| Bernstein (1924) | Original source of Bernstein’s inequality, exploiting variance structure beyond boundedness | Historical reference — see Boucheron et al. survey |
| Boucheron, Lugosi & Massart, Concentration Inequalities (2013) | Comprehensive modern reference; covers all inequalities here plus entropy methods, transportation inequalities | Oxford University Press |
| CS 265 Lecture Notes (Stanford, Fall 2020) | Primary source for this note; lectures 4–5 by Wootters cover Markov through Chernoff and routing | Stanford CS 265 |