Getting The Public Invested In Science

FundScience Blog

Welcome to the FundScience Blog. This page was created to bring you the news of our venture by the FundScience team (Category: FundScience News) as well as interesting subjects that are related to education and science. We welcome and encourage comments and discussions on the posted topics. If you are a writer and are interested in posting please contact us. If you are a reader we hope that you sign-up for a feed of our blog and/or a quarterly collection of the published articles in an easy to read and pass to friends PDF format.

We hope you stay with us as we develop this exciting project!
The FundScience Team

Blackwell’s proof of Wald’s identity

04.29.11 by mraginsky

Every once in a while you come across a mathematical argument of such incredible beauty that you feel compelled to tell the whole world about it. This post is about one such gem: David Blackwell’s 1946 proof of Wald’s identity on the expected value of a randomly stopped random walk. In fact, even forty years after the publication of that paper, in a conversation with Morris DeGroot, Blackwell said: “That’s a paper I’m still very proud of. It just gives me pleasant feelings every time I think about it.”

Let {X^\infty = (X_1,X_2,\ldots)} be a sequence of independent real-valued random variables with common mean {\mu}. Let {\tau} be a stopping time, i.e., a random variable taking values in {{\mathbb N}}, such that the occurrence or nonoccurrence of the event {\{\tau=n\}} is determined only by {X^n := (X_1,\ldots,X_n)}. We say that Wald’s identity holds for the pair {(X^\infty,\tau)} if

\displaystyle  \mathop{\mathbb E}\left[ X_1 + \ldots + X_\tau\right] = \mu \cdot \mathop{\mathbb E} \tau.

The first result of this type has appeared in the seminal paper of Abraham Wald on sequential hypothesis testing (hence the name), and it can be stated as follows:

Theorem 1 Assume that {X_1,X_2,\ldots} are i.i.d., {\mathop{\mathbb E} |X_i| = \mathop{\mathbb E} |X_1| < \infty}, and {\mathop{\mathbb E} \tau < \infty}. Then Wald’s identity holds for the pair {(X^\infty,\tau)}.

The standard proof of this theorem uses martingale arguments. In 1946, David Blackwell found a beautiful proof that uses only Kolmogorov‘s Strong Law of Large Numbers (SLLN) and its converse. Since the latter is, perhaps, less well known than the former, I will just give both for convenience (see, e.g., Chapter 12 in Probability With Martingales by David Williams):

  1. SLLN: Let {U_1,U_2,\ldots} be i.i.d. with {\mathop{\mathbb E} |U_i| = \mathop{\mathbb E} |U_1| < \infty} for all {i}. Then

    \displaystyle  	\lim_{n \rightarrow \infty} \frac{U_1 + \ldots + U_n}{n} = \mathop{\mathbb E} U_1 \qquad \text{a.s.}

  2. Converse to SLLN: If {U_1,U_2,\ldots} are i.i.d. with {\mathop{\mathbb E} |U_i| = \mathop{\mathbb E} |U_1| = \infty}, then

    \displaystyle  	\lim_{n \rightarrow \infty} \frac{U_1 + \ldots + U_n}{n} \text{ does not exist a.s.}

Blackwell’s proof

Let us define recursively a sequence of random variables {\tau_1,\tau_2,\ldots} as follows. Let {\tau_1 = \tau}. Then, assuming {\tau_1,\ldots,\tau_k} have been defined, let

\displaystyle  \tau_{k+1} = \tau(X_{\tau_1+\ldots+\tau_k+1}, X_{\tau_1 + \ldots + \tau_k + 2},\ldots).

In other words, {\tau_1} is the stopping time determined by the original full sequence {X^\infty}, {\tau_2} is the stopping time determined by the remaining data after the first stopping time, i.e., by {X^\infty_{\tau_1+1}}, {\tau_3} is the stopping time determined by the data remaining after the second stopping time, i.e., by {X^\infty_{\tau_1+\tau_2 + 1}}, etc. We now claim that {\tau^\infty = (\tau_1,\tau_2,\ldots)} is an i.i.d. sequence. To see this, let us compute the conditional probability that {\tau_{k+1} = n} given {\tau^k = (\tau_1,\ldots,\tau_k)}. Given positive integers {j_1,\ldots,j_k}, define the event

\displaystyle  S = \left\{ \tau_1 = j_1,\ldots,\tau_k = j_k \right\}

as well as the event

\displaystyle  R = \left\{ \tau(X^\infty_{j_1+\ldots +j_k +1}) = n \right\}.

By construction, the occurrence or nonoccurrence of {S} is determined only by {X^{j_1 + \ldots + j_k}}, so {S} is independent of {R}. Therefore,

\displaystyle  \begin{array}{rcl}  	\Pr\Big(\tau_{k+1}= n |\tau_1 = j_1,\ldots,\tau_k = j_k\Big) &=& \frac{\Pr\left(S \cap \left\{ \tau_{k+1} = n\right\}\right)}{\Pr(S)} \\ 	&=& \frac{\Pr(S \cap R)}{\Pr(S)} \\ 	&=& \Pr(R). \end{array}

Moreover, since the {X_i}‘s are i.i.d., {\Pr(R) = \Pr(\tau_1 = n) =: \pi_n}. Thus, we have shown that, for every {k = 1,2,\ldots} and every {n},

\displaystyle  \Pr(\tau_{k+1} = n |\tau_1,\ldots,\tau_k) = \pi_n,

from which we conclude that {\tau_1,\tau_2,\ldots} is an i.i.d. sequence.

Now let us define, for each {k}, the random variable

\displaystyle  S_k = X_{\tau_1 + \ldots + \tau_{k-1} + 1} + \ldots + X_{\tau_1 + \ldots + \tau_k}

(where {S_1 = X_1 + \ldots + X_{\tau_1} \equiv X_1 + \ldots + X_\tau}). Exactly the same argument as above can be used to show that {S_1,S_2,\ldots} are i.i.d.. Now, noting that

\displaystyle  S_1 + \ldots + S_k = X_1 + X_2 + \ldots + X_{\tau_1 + \ldots + \tau_k},

we can write

\displaystyle  	\frac{S_1 + \ldots + S_k}{k} = \frac{X_1 + \ldots + X_{\tau_1 + \ldots + \tau_k}}{k} = \frac{X_1 + \ldots + X_{\tau_1 + \ldots + \tau_k}}{\tau_1 + \ldots + \tau_k} \cdot \frac{\tau_1 + \ldots + \tau_k}{k}.

Let’s see what happens as {k \rightarrow \infty}. First of all, if we let {Z_n := n^{-1}(X_1 + \ldots + X_n)}, then, by the SLLN, {Z_n \rightarrow \mathop{\mathbb E} X_1 =\mu} a.s. as {n \rightarrow \infty}. Since any subsequence of a convergent sequence converges to the same limit, we have

\displaystyle  \lim_{k \rightarrow \infty} \frac{X_1 + \ldots + X_{\tau_1 + \ldots + \tau_k}}{\tau_1 + \ldots + \tau_k} = \lim_{k \rightarrow \infty} Z_{\tau_1 + \ldots + \tau_k} = \mu, \qquad \text{a.s.}

Moreover, since {\tau_1,\tau_2,\ldots} are i.i.d. with mean {\mathop{\mathbb E} \tau}, {k^{-1}(\tau_1 + \ldots + \tau_k) \rightarrow \mathop{\mathbb E}\tau} almost surely as {k \rightarrow \infty}. Therefore,

\displaystyle  \lim_{k \rightarrow \infty} \frac{S_1 + \ldots + S_k}{k} = \mu \cdot \mathop{\mathbb E}\tau \qquad \text{a.s.}

Now we invoke the converse to the SLLN (more precisely, the contrapositive to the converse): Since {S_1,S_2,\ldots} are i.i.d. and the limit of {k^{-1}(S_1+\ldots+S_k)} exists a.s., then {\mathop{\mathbb E} S_1} is finite and equal to this limit. Hence,

\displaystyle  \mathop{\mathbb E} (X_1 + \ldots + X_\tau) = \mathop{\mathbb E} S_1 = \mu \cdot \mathop{\mathbb E} \tau.

The theorem is proved.


| Posted in Mathematics, Probability | Comments Off

Divergence in everything: mixing rates for finite Markov chains

10.14.10 by mraginsky

This is the first post in a series aobut the versatility and the power of the good old Kullback-Leibler divergence.

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

Today I will describe a beautiful application of the divergence to the problem of determining mixing rates of finite Markov chains. This argument is quite recent, and comes from a nice paper by Sergey Bobkov and Prasad Tetali (“Modified logarithmic Sobolev inequalities in discrete settings,” Journal of Theoretical Probability, vol. 19, no. 2, pp. 209-336, 2006; preprint). Since my interest here is information-theoretic, I will take for granted certain facts from the theory of finite Markov chains.

Let {{\mathsf X}} be a finite state space, and consider a stochastic matrix {P} on {{\mathsf X}}, i.e., {P(x,y) \ge 0} for all {x,y \in {\mathsf X}} and

\displaystyle  \sum_{y \in {\mathsf X}}P(x,y) = 1.

Following standard usage, I will represent probability distributions on {{\mathsf X}} by row vectors and real-valued functions on {{\mathsf X}} by column vectors. With these conventions, {P} acts on probability distributions from the right, {\pi \mapsto \pi P}, and on functions from the left, {f \mapsto P f}.

It is easiest to work with continuous-time Markov chains. Any such chain is an {{\mathsf X}}-valued random process {\{ X_t \}_{t \ge 0}} with some specified initial distribution of {X_0}, say, {\mu_0}. Then the distribution of {X_t} is given by

\displaystyle  \mu_t = \mu_0 P_t, \qquad \text{where } P_t = e^{-t (I - P)} = e^{-t} \sum^\infty_{n=0} \frac{t^n P^n}{n!}.

Here {I} is the identity matrix, {L = -(I-P)} is the infinitesimal generator of the Markov semigroup {\{P_t\}_{t \ge 0}}, and {P^n} denotes the {n}th matrix power of {P}.

We are interested in the asymptotic behavior of the random variables {X_t} for large values of {t}. In order to be able to make sensible statements about it, we will start with the following basic assumption:

Irreducibility — For any pair {x,y \in {\mathsf X}}, there exists some positive integer {n \ge 1}, such that {P^n (x,y) > 0}.

From this it can be shown that:

  1. There exists a stationary distribution {\pi}, such that {\pi = \pi P}. In other words, if {X_0 \sim \pi}, then {X_t \sim \pi} for all {t}.
  2. This {\pi} is unique, and {\pi(x) > 0} for all {x \in {\mathsf X}}.

Let {\mu_0} be the initial distribution, and let {\mu_t} denote the distribution of {X_t}. We will show that, no matter what {\mu_0} is, {\mu_t} will converge to {\pi} in total variation, and moreover the convergence is exponentially fast. This property, often referred to as geometric ergodicity, is a fundamental statement about stability, and so finds applications in many areas, from mathematical statistical mechanics (in connection with the second law of thermodynamics) to stochastic control. Therefore, it is not surprising that there are numerous papers and even books devoted to geometric ergodicity. A standard way of deriving geometric ergodicity is through spectral methods, pertaining to the eigenvalues of {L} viewed as a linear operator on the Hilbert space {L^2(\pi)}. There is quite a bit of work along this direction; see, for example, papers by Persi Diaconis and Daniel Stroock (for the reversible case) and by James Allen Fill (for the nonreversible case).

The argument due to Bobkov and Tetali, which I am about to describe, is information-theoretic in nature. Instead of studying the spectral properties of {L}, they look at the optimal rate at which the divergence {D(\mu_t \| \pi)} converges to zero.

Since {\pi} is strictly positive, any probability distribution {\mu} on {{\mathsf X}} has a density relative to it, i.e., the function {f(x) = \mu(x)/\pi(x)} is nonnegative and finite. To study the asymptotic behavior of {\mu_t}, we will examine the evolution of the density {f_t(x) = \mu_t(x)/\pi(x)}. To that end, let us define the time reversal of {P} (what Fill calls the reversibilization of {P}), which we denote by {P^*}, by

\displaystyle  \pi(x) P^*(x,y) = \pi(y) P(y,x), \qquad \forall x,y \in {\mathsf X}.

Note that we did not assume that the Markov chain is reversible, i.e., {P^* = P}. We have the following basic fact:

Lemma 1 For every positive integer {n \ge 1} and any two {x,y \in {\mathsf X}},

\displaystyle  	\pi(x) (P^*)^n(x,y) = \pi(y) P^n(y,x).

Proof: The proof is by induction. The base case, {n = 1}, holds by the definition of {P^*}. Assume then that the statement is true for some {n \ge 1}. Then

\displaystyle  \begin{array}{rcl}  		\pi(x)(P^*)^{n+1}(x,y) &=& \sum_{u \in {\mathsf X}} \pi(x) (P^*)^n(x,u) P^*(u,y) \\ 		&=& \sum_{u \in {\mathsf X}} \pi(u) P^n(u,x) P^*(u,y) \\ 		&=& \sum_{u \in {\mathsf X}} P^n(u,x) \pi(u) P^*(u,y) \\ 		&=& \sum_{u \in {\mathsf X}} P^n(u,x) \pi(y) P(y,u) \\ 		&=& \pi(y) \sum_{u \in {\mathsf X}} P(y,u)P^n(u,x) \\ 		&\equiv& \pi(y) P^{n+1}(y,x), 	\end{array}

where we have used the rule for matrix multiplication, the induction hypothesis, and the definition of {P^*}. This finishes the proof. \Box

Let us also define the dual semigroup {\{P^*_t\}_{t \ge 0}} by

\displaystyle  P^*_t = e^{-t(I - P^*)} = e^{-t}\sum^\infty_{n=0}\frac{t^n (P^*)^n}{n!}.

Armed with this definition, we can proceed to establish the following key result:

Lemma 2 For any {\mu_0} and any {t \ge 0}, {f_t = P^*_t f_0}. Moreover, for any {x \in {\mathsf X}}

\displaystyle  	\frac{df_t(x)}{dt} = L^* f_t(x),

where {L^* = -(I - P^*)} is the generator of the dual Markov semigroup.

Proof: Starting with {\mu_t = \mu_0 P_t}, we can write

\displaystyle  \begin{array}{rcl}  		\mu_t(x) &=& \sum_{y \in {\mathsf X}} \mu_0(y) P_t(y,x) \\ 		&=& \sum_{y \in {\mathsf X}} f_0(y) \pi(y) P_t(y,x) \\ 		&=& \sum_{y \in {\mathsf X}} f_0(y) e^{-t}\sum^\infty_{n=0}\frac{t^n \pi(y) P^n(y,x)}{n!} \\ 		&=& \sum_{y \in {\mathsf X}} f_0(y)e^{-t}\sum^\infty_{n=0}\frac{t^n \pi(x) (P^*)^n(x,y)}{n!} \qquad \text{(by Lemma 1)}\\ 		&=& \pi(x) \sum_{y \in {\mathsf X}} e^{-t} \left( \sum^\infty_{n=0} \frac{t^n (P^*)^n(x,y)}{n!}\right) f_0(y) \\ 		&=& \pi(x) \sum_{y \in {\mathsf X}} P^*_t(x,y) f_0(y) \\ 		&=& \pi(x) \cdot P^*_t f_0 (x). 	\end{array}

Dividing both sides by {\pi(x)} (which is strictly positive by irreducibility), we obtain the first statement of the lemma.

Next, by linearity we have

\displaystyle  	\frac{df_t(x)}{dt} = \left( \frac{d}{dt}P^*_t\right) f_0 (x) = L^* P^*_t f_0(x) = L^* f_t(x).

Note: the time derivative of {P^*_t} is rigorously defined via

\displaystyle  \frac{dP^*_t}{dt} = \lim_{\varepsilon \rightarrow 0}\frac{P^*_{t+\varepsilon} - P^*_t}{\varepsilon},

and it is not hard to show, using the infinite series representation of {P^*_t}, that {(d/dt)P^*_t = L^* P_t}. \Box

Before we can finally get to business, we need to introduce one more definition. For any two functions {f, g : {\mathsf X} \rightarrow {\mathbb R}}, define the so-called Dirichlet form

\displaystyle  {\mathcal E}(f,g) = - \sum_{x \in {\mathsf X}} f(x) Lg(x) \pi(x).

Here is why this definition comes in handy:

Theorem 3 For any {t > 0},

\displaystyle  	\frac{d}{dt} D(\mu_t \| \pi) = -{\mathcal E} (f_t, \log f_t).

Proof: Using the fact that {P} is irreducible and Lemma~1, it is easy to show that {f_t > 0} for all {t}. Hence, the function {t \mapsto D(\mu_t \| \pi)} is differentiable for all {0 < t < +\infty}. Now, by definition,

\displaystyle  \begin{array}{rcl}  	D(\mu_t \| \pi) &=& \sum_{x \in {\mathsf X}} \mu_t(x) \log \frac{\mu_t(x)}{\pi(x)} \\ 	&=& \sum_{x \in {\mathsf X}} \pi(x) f_t(x) \log f_t(x) \\ 	&=& \mathop{\mathbb E}{}_\pi [f_t \log f_t]. \end{array}

Therefore, by linearity of expectation

\displaystyle  \begin{array}{rcl}  	\frac{d}{dt} D(\mu_t \| \pi) &=& \mathop{\mathbb E}{}_\pi \left[ \frac{d}{dt}(f_t \log f_t) \right] \\ 	&=& \mathop{\mathbb E}{}_\pi\left[ (\log f_t + 1) \frac{df_t}{dt} \right] \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t + 1) L^* f_t\right] \qquad \qquad \qquad \qquad \text{(by Lemma 2)} \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t) (L^* f_t) \right] + L^* \mathop{\mathbb E}{}_\pi [f_t] \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t) (L^* f_t)\right] - (I - P^*) 1 \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t)(L^* f_t)\right]. \end{array}

Next, we use the fact that {L^*} is the adjoint of {L} w.r.t. the {L^2(\pi)} inner product {\langle f, g \rangle_\pi = \mathop{\mathbb E}{}_\pi [fg]}:

\displaystyle  \begin{array}{rcl}  	\langle f, Lg \rangle_\pi &=& \sum_{x \in {\mathsf X}} f(x) Lg(x) \pi(x) \\ 	&=& \sum_{x \in {\mathsf X}} f(x) Pg(x) \pi(x) - \sum_{x \in {\mathsf X}} f(x) g(x) \pi(x) \\ 	&=& \sum_{x \in {\mathsf X}} \sum_{y \in {\mathsf X}} f(x) P(x,y) g(y) \pi(x) - \sum_{x \in {\mathsf X}} f(x) g(x) \pi(x) \\ 	&=& \sum_{x \in {\mathsf X}} \sum_{y \in {\mathsf X}} f(x) P^*(y,x) g(y) \pi(y) - \sum_{x \in {\mathsf X}} f(x) g(x) \pi(x) \\ 	&=& \sum_{y \in {\mathsf X}} P^*f (y) g(y) \pi(y) - \sum_{y \in {\mathsf X}} f(y) g(y) \pi(y) \\ 	&=& \sum_{y \in {\mathsf X}} L^*f(y) g(y) \pi(y) \\ 	&=& \langle L^* f, g \rangle_\pi \end{array}

(note that when {P} is reversible, {L} is self-adjoint, {L = L^*}). Hence,

\displaystyle  	\mathop{\mathbb E}{}_\pi \left[ (L^* f_t)(\log f_t)\right] = \langle L^* f_t, \log f_t \rangle_\pi = \langle f_t, L \log f_t \rangle_\pi \equiv - {\mathcal E} (f_t,\log f_t),

and the theorem is proved. \Box

And here is the rub. Let us define, for any strictly positive {f : {\mathsf X} \rightarrow (0,+\infty)}, the entropy functional

\displaystyle  H_\pi(f) = \mathop{\mathbb E}{}_\pi \left[f \log \frac{f}{\mathop{\mathbb E}{}_\pi f}\right].

Note that if {f} is of the form {f(x) = \mu(x)/\pi(x)} for some probability distribution {\mu}, then {\mathop{\mathbb E}{}_\pi f = 1}, and so

\displaystyle  H_\pi(f) = \mathop{\mathbb E}{}_\pi [f \log f] = \sum_{x \in {\mathsf X}} \mu(x) \log f(x) \equiv D(\mu \| \pi).

Now let

\displaystyle  \rho_0 = \sup_{f > 0} \frac{{\mathcal E}(f, \log f)}{2 H_\pi(f)}. \ \ \ \ \ (1)

With this definition, we finally obtain the desired exponential convergence result:

Theorem 4 For any initial distribution {\mu_0},

\displaystyle  	\| \mu_t - \pi \|^2_{TV} \le 2 \log \left(\frac{1}{\pi_*}\right) e^{-2\rho_0 t}, \qquad \forall t \ge 0,

where {\pi_* = \min_{x \in {\mathsf X}}\pi(x)}.

Proof: Using Theorem 3 and the definition of {\rho_0}, we can write

\displaystyle  	\frac{d}{dt}D(\mu_t \| \pi) = - {\mathcal E} (f_t, \log f_t) \le - 2\rho_0 H_\pi(f_t) = - 2\rho_0 D(\mu_t \| \pi).

Integrating the resulting inequality, we obtain

\displaystyle  D(\mu_t \| \pi) \le D(\mu_0 \| \pi) e^{-2\rho_0 t}.

Using Pinsker’s inequality, we have

\displaystyle  \| \mu_t - \pi \|^2_{TV} \le 2 D(\mu_t \| \pi) \le 2 D(\mu_0 \| \pi) e^{-2\rho_0 t}.

We can upper-bound {D(\mu_0 \| \pi)} as follows:

\displaystyle  	D(\mu_0 \| \pi) = \sum_{x \in {\mathsf X}} \mu_0(x) \log \frac{\mu_0(x)}{\pi(x)} \le \sum_{x \in {\mathsf X}} \mu_0(x) \log \frac{1}{\pi_*} = \log \frac{1}{\pi_*}.

This finishes the proof. \Box

Among other things, Bobkov and Tetali show that the constant {\rho_0} defined in (1) often gives a much tighter bound on the rate of convergence to stationarity than the usual Poincaré constant, derived using the {L^2(\pi)} spectral methods (cf. Diaconis–Stroock). So the whole affair hinges on being able to derive good lower bounds on {\rho_0}.


Copyright 2010 FundScience LLC