04.29.11 by mraginsky
Every once in a while you come across a mathematical argument of such incredible beauty that you feel compelled to tell the whole world about it. This post is about one such gem: David Blackwell’s 1946 proof of Wald’s identity on the expected value of a randomly stopped random walk. In fact, even forty years after the publication of that paper, in a conversation with Morris DeGroot, Blackwell said: “That’s a paper I’m still very proud of. It just gives me pleasant feelings every time I think about it.”
Let be a sequence of independent real-valued random variables with common mean . Let be a stopping time, i.e., a random variable taking values in , such that the occurrence or nonoccurrence of the event is determined only by . We say that Wald’s identity holds for the pair if
Theorem 1 Assume that are i.i.d., , and . Then Wald’s identity holds for the pair .
The standard proof of this theorem uses martingale arguments. In 1946, David Blackwell found a beautiful proof that uses only Kolmogorov‘s Strong Law of Large Numbers (SLLN) and its converse. Since the latter is, perhaps, less well known than the former, I will just give both for convenience (see, e.g., Chapter 12 in Probability With Martingales by David Williams):
- SLLN: Let be i.i.d. with for all . Then
- Converse to SLLN: If are i.i.d. with , then
Let us define recursively a sequence of random variables as follows. Let . Then, assuming have been defined, let
In other words, is the stopping time determined by the original full sequence , is the stopping time determined by the remaining data after the first stopping time, i.e., by , is the stopping time determined by the data remaining after the second stopping time, i.e., by , etc. We now claim that is an i.i.d. sequence. To see this, let us compute the conditional probability that given . Given positive integers , define the event
as well as the event
By construction, the occurrence or nonoccurrence of is determined only by , so is independent of . Therefore,
Moreover, since the ‘s are i.i.d., . Thus, we have shown that, for every and every ,
from which we conclude that is an i.i.d. sequence.
Now let us define, for each , the random variable
(where ). Exactly the same argument as above can be used to show that are i.i.d.. Now, noting that
we can write
Let’s see what happens as . First of all, if we let , then, by the SLLN, a.s. as . Since any subsequence of a convergent sequence converges to the same limit, we have
Moreover, since are i.i.d. with mean , almost surely as . Therefore,
Now we invoke the converse to the SLLN (more precisely, the contrapositive to the converse): Since are i.i.d. and the limit of exists a.s., then is finite and equal to this limit. Hence,
The theorem is proved.
10.14.10 by mraginsky
This is the first post in a series aobut the versatility and the power of the good old Kullback-Leibler divergence.
Today I will describe a beautiful application of the divergence to the problem of determining mixing rates of finite Markov chains. This argument is quite recent, and comes from a nice paper by Sergey Bobkov and Prasad Tetali (“Modified logarithmic Sobolev inequalities in discrete settings,” Journal of Theoretical Probability, vol. 19, no. 2, pp. 209-336, 2006; preprint). Since my interest here is information-theoretic, I will take for granted certain facts from the theory of finite Markov chains.
Let be a finite state space, and consider a stochastic matrix on , i.e., for all and
Following standard usage, I will represent probability distributions on by row vectors and real-valued functions on by column vectors. With these conventions, acts on probability distributions from the right, , and on functions from the left, .
It is easiest to work with continuous-time Markov chains. Any such chain is an -valued random process with some specified initial distribution of , say, . Then the distribution of is given by
Here is the identity matrix, is the infinitesimal generator of the Markov semigroup , and denotes the th matrix power of .
We are interested in the asymptotic behavior of the random variables for large values of . In order to be able to make sensible statements about it, we will start with the following basic assumption:
Irreducibility — For any pair , there exists some positive integer , such that .
From this it can be shown that:
- There exists a stationary distribution , such that . In other words, if , then for all .
- This is unique, and for all .
Let be the initial distribution, and let denote the distribution of . We will show that, no matter what is, will converge to in total variation, and moreover the convergence is exponentially fast. This property, often referred to as geometric ergodicity, is a fundamental statement about stability, and so finds applications in many areas, from mathematical statistical mechanics (in connection with the second law of thermodynamics) to stochastic control. Therefore, it is not surprising that there are numerous papers and even books devoted to geometric ergodicity. A standard way of deriving geometric ergodicity is through spectral methods, pertaining to the eigenvalues of viewed as a linear operator on the Hilbert space . There is quite a bit of work along this direction; see, for example, papers by Persi Diaconis and Daniel Stroock (for the reversible case) and by James Allen Fill (for the nonreversible case).
The argument due to Bobkov and Tetali, which I am about to describe, is information-theoretic in nature. Instead of studying the spectral properties of , they look at the optimal rate at which the divergence converges to zero.
Since is strictly positive, any probability distribution on has a density relative to it, i.e., the function is nonnegative and finite. To study the asymptotic behavior of , we will examine the evolution of the density . To that end, let us define the time reversal of (what Fill calls the reversibilization of ), which we denote by , by
Note that we did not assume that the Markov chain is reversible, i.e., . We have the following basic fact:
Proof: The proof is by induction. The base case, , holds by the definition of . Assume then that the statement is true for some . Then
where we have used the rule for matrix multiplication, the induction hypothesis, and the definition of . This finishes the proof.
Let us also define the dual semigroup by
Armed with this definition, we can proceed to establish the following key result:
Proof: Starting with , we can write
Dividing both sides by (which is strictly positive by irreducibility), we obtain the first statement of the lemma.
Next, by linearity we have
Note: the time derivative of is rigorously defined via
and it is not hard to show, using the infinite series representation of , that .
Before we can finally get to business, we need to introduce one more definition. For any two functions , define the so-called Dirichlet form
Here is why this definition comes in handy:
Proof: Using the fact that is irreducible and Lemma~1, it is easy to show that for all . Hence, the function is differentiable for all . Now, by definition,
Therefore, by linearity of expectation
Next, we use the fact that is the adjoint of w.r.t. the inner product :
(note that when is reversible, is self-adjoint, ). Hence,
and the theorem is proved.
And here is the rub. Let us define, for any strictly positive , the entropy functional
Note that if is of the form for some probability distribution , then , and so
Theorem 4 For any initial distribution ,
Proof: Using Theorem 3 and the definition of , we can write
Integrating the resulting inequality, we obtain
Using Pinsker’s inequality, we have
We can upper-bound as follows:
This finishes the proof.
Among other things, Bobkov and Tetali show that the constant defined in (1) often gives a much tighter bound on the rate of convergence to stationarity than the usual Poincaré constant, derived using the spectral methods (cf. Diaconis–Stroock). So the whole affair hinges on being able to derive good lower bounds on .