Noise Sensitivity and Influence 1

Last week, Simon gave a junior probability seminar about noise sensitivity and its applications, in particular to dynamical critical percolation, so I thought I would write something about this topic based on his talk and some other sources.

The motivation is critical site percolation on the triangular lattice. This has critical probability p_c=\frac12, and, at this critical probability, almost surely there is no infinite component. But it turns out that there is ‘almost’ an infinite component, in the following technical sense. Dynamical site percolation is given by switching the state of each site according to independent Poisson processes with the same parameter. Then, by construction, if we start at time 0 with the critical probability, then at any fixed time T>0, the configuration is distributed as critical site percolation, and hence almost surely has no infinite component. If, however, we consider the interval of times [0,T], then almost surely there *is* a time when there is an infinite component, and indeed the set of such times has positive Hausdorff dimension.

Definitions

Noise sensitivity addresses the question of how likely some function of a product measure is to change under small perturbations to the state. To give a precise definition, we take a function f_n:\{0,1\}^n \rightarrow \{0,1\} on a product space, with product measure \mu. Then given \omega\in\Omega_n=\{0,1\}^n, define \omega^\epsilon to be the outcome when each component of \omega is independently resampled with probability \epsilon. Note that it matters quantitatively whether we consider ‘resampled’ or ‘swapped’, but it doesn’t affect any of the results qualitatively.

Then the noise stability of f at level \epsilon is defined to be

S_\epsilon(f):=\text{Cov}\left(f(\omega),f(\omega^\epsilon)\right),

where \omega \sim \mu. Note that \omega^\epsilon\sim \mu also, but obviously \omega,\omega^\epsilon are not independent. Noise stability measures the magnitude of this dependence. The sequence of functions (f_n) is said to be noise-sensitive if, for every fixed \epsilon>0, S_\epsilon(f_n)\rightarrow 0 as n\rightarrow\infty.
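To make the definition concrete, here is a minimal Monte Carlo sketch in Python (under the uniform product measure, with the resampling convention above); the function names and parameter choices are mine, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def resample(omega, eps):
    # omega^eps: each component is independently resampled with probability eps
    mask = rng.random(omega.shape) < eps
    fresh = rng.integers(0, 2, size=omega.shape)
    return np.where(mask, fresh, omega)

def noise_covariance(f, n, eps, samples=50_000):
    # Monte Carlo estimate of Cov(f(omega), f(omega^eps)) under the uniform measure
    vals = np.empty(samples)
    vals_eps = np.empty(samples)
    for k in range(samples):
        omega = rng.integers(0, 2, size=n)
        vals[k] = f(omega)
        vals_eps[k] = f(resample(omega, eps))
    return np.cov(vals, vals_eps)[0, 1]

# Majority of n bits: the estimate stays bounded away from 0 as n grows,
# consistent with majority being noise-stable rather than noise-sensitive.
majority = lambda omega: int(omega.sum() > len(omega) / 2)
print(noise_covariance(majority, n=101, eps=0.1))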

A related concept is influence. This should be viewed as a local version of noise sensitivity, where we consider the likelihood that the function will change as a result of swapping a single fixed component of the state. Perhaps abusing standard notation, define \omega^{(i)} to be the result of swapping the ith component of \omega. Then the influence of the ith coordinate is

I_i(f) := \mu(\{\omega\in\Omega_n: f(\omega)\ne f(\omega^{(i)})\}).
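For small n the influences can be computed exactly by enumeration. Here is a sketch (Python, uniform measure, with the flipping convention above); the three-bit majority is a convenient sanity check, since a coordinate is pivotal exactly when the other two coordinates disagree, giving influence 1/2.

from itertools import product

def influence(f, n, i):
    # I_i(f) under the uniform measure: probability that flipping coordinate i changes f
    count = 0
    for omega in product((0, 1), repeat=n):
        flipped = omega[:i] + (1 - omega[i],) + omega[i + 1:]
        if f(omega) != f(flipped):
            count += 1
    return count / 2 ** n

majority3 = lambda omega: int(sum(omega) >= 2)
print([influence(majority3, 3, i) for i in range(3)])   # [0.5, 0.5, 0.5]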

Examples and Results

As in so many models, we are interested in the relationship between the local setting and the global setting. Benjamini, Kalai and Schramm (1999) [1] showed that, in the case of the uniform product measure,

\sum_{i\in[n]} I_i(f_n)^2 \stackrel{n\rightarrow\infty}\rightarrow 0,

is a sufficient condition for noise-sensitivity of (f_n). Note the converse is trivially false, by considering the function f(w) that records the parity of the number of 1s in w. Here, the influence is 1, since changing one component by definition changes the value of the function. But the function is noise sensitive, as changing a proportion \epsilon of the components gives a roughly 1/2 probability for large n that the value of the function changes.

However, if we restrict to the class of monotone increasing functions, where changing a 0 to a 1 can never cause f to decrease, the converse is also true. Outside the setting of the uniform product measure, many versions of this theorem remain open.

This perhaps seems counter-intuitive. After all, we would expect high global sensitivity to follow from high local sensitivity, whereas here it is uniformly small influences that force noise sensitivity. I don’t have a good intuition for why this should be true. The original authors themselves describe this as ‘paradoxical’ in [1].

It is natural to ask which component has the largest influence, and in particular whether we can guarantee the existence of a component with influence at least some particular value. Obviously, if the Boolean function is constant, all the influences are zero, and other examples, such as the indicator function of a specific state, will also have very small influences. But such functions also have small variance (under the uniform product measure), so any lower bound on the influences should be made relative to the variance. We can show that

\text{Var}(f)\le \frac14 \sum_{i\in[n]}I_i(f),

which is a (discrete) example of a Poincaré inequality, in which the deviation of a function from its mean (in L^p) is bounded in terms of its derivative. Here, influence is a discrete analogue of the derivative.

Proof: We observe first that if \mathbb{P}(f(w)=1)=p, then \mathbb{E}[f(w)]=p, and

\text{Var}(f)=p(1-p)^2+(1-p)p^2=p(1-p)=\frac12 \mathbb{P}(f(w)\ne f(\bar w)),

where w,\bar w are independent realisations of the underlying measure. We move from w\mapsto \bar w one component at a time, examining the influence at each step. Let v^{(i)}\in \{0,1\}^n be given by the first i components of \bar w and the final (n-i) components of w. Then v^{(0)}=w, v^{(n)}=\bar w. Finally, define

J:=\{i:w_i\ne \bar w_i\}.

We also say that a state is i-pivotal if changing the ith component changes the value of the function. Note that if f(w)\ne f(\bar w), then there exists some 0\le i<n such that f(v^{(i)})\ne f(v^{(i+1)}). In particular:

\{f(w)\ne f(\bar w)\}\subset \{1\in J, v^{(0)}\text{ 1-pivotal}\}\cup

\{2\in J, v^{(1)}\text{ 2-pivotal}\}\cup\ldots\cup\{n\in J, v^{(n-1)}\text{ n-pivotal}\}.

Note now that for any i\in[n], v^{(i-1)} has the uniform product measure on \Omega_n, since it is an interpolation of sorts between independent realisations of the uniform product measure. Furthermore, the ith component of v^{(i-1)} is w_i, while the event \{i \in J\} depends additionally only on \bar w_i, which is independent of everything else; so v^{(i-1)} is independent of the event \{i \in J\}. Thus

\mathbb{P}(i\in J, v^{(i-1)}\text{ i-pivotal})=\frac12 I_i(f).

Putting everything together with a union bound gives

\text{Var}(f)=\frac12 \mathbb{P}(f(w)\ne f(\bar w))\le \frac14 \sum_{i\in [n]}I_i(f),

as desired.
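The inequality is easy to check by brute force for small examples. Here is a self-contained sketch (Python, uniform measure); the five-bit majority is a convenient illustration, for which Var(f)=1/4 while the bound on the right-hand side is 15/32.

from itertools import product

def variance_and_bound(f, n):
    # returns (Var(f), (1/4) * sum_i I_i(f)) under the uniform measure
    states = list(product((0, 1), repeat=n))
    p = sum(f(w) for w in states) / 2 ** n
    variance = p * (1 - p)
    total_influence = 0.0
    for i in range(n):
        pivotal = sum(f(w) != f(w[:i] + (1 - w[i],) + w[i + 1:]) for w in states)
        total_influence += pivotal / 2 ** n
    return variance, total_influence / 4

maj5 = lambda w: int(sum(w) >= 3)
print(variance_and_bound(maj5, 5))   # (0.25, 0.46875): the inequality holds comfortably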

We might ask when this result is tight. It is certainly not tight if, during the interpolation procedure, the value of f changes many times. Since a consequence of this Poincaré inequality is that there is some component i such that I_i(f)\ge \frac{4\,\text{Var}(f)}{n}, we might conjecture that there is a better lower bound.

It turns out, as shown in [2], that we can add an extra logarithmic factor. Precisely, for some absolute constant c>0, there exists an i such that:

I_i(f)\ge c \text{Var}(f)\frac{\log n}{n}.

Indeed, there’s an example which shows that this is optimal. Consider partitioning the n components into groups, called tribes, each of size roughly \log_2 n - \log_2\log_2 n. The function takes the value 1 if there is at least one tribe in which every component is 1. A particular component is then pivotal precisely when all the other components in its tribe are 1 and no other tribe consists entirely of 1s; with this choice of tribe size the latter event has probability bounded away from zero, so

I_i(f)\approx \left(\frac{1}{2}\right)^{\log_2 n - \log_2\log_2 n}= \frac{\log_2 n}{n}.
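Here is a rough numerical check (Python). The tribes are taken to be consecutive blocks, the tribe size is rounded to an integer, and the influence of a single coordinate is estimated by Monte Carlo, so constants are only approximate; all the specific choices below are mine.

import math
import numpy as np

rng = np.random.default_rng(1)

def tribes(omega, tribe_size):
    # 1 if some consecutive block ('tribe') of the given size is all 1s; leftover bits are ignored
    blocks = len(omega) // tribe_size
    return int(omega[:blocks * tribe_size].reshape(blocks, tribe_size).all(axis=1).any())

def estimated_influence(n, samples=50_000):
    tribe_size = max(1, round(math.log2(n) - math.log2(math.log2(n))))
    changed = 0
    for _ in range(samples):
        omega = rng.integers(0, 2, size=n)
        flipped = omega.copy()
        flipped[0] = 1 - flipped[0]        # flip the first coordinate
        changed += tribes(omega, tribe_size) != tribes(flipped, tribe_size)
    return changed / samples

n = 1024
print(estimated_influence(n), math.log2(n) / n)   # both of order (log n)/n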

One consequence of this is that, by BKS, the tribes function is noise-sensitive. Is this clear directly? Note that if \omega contains a tribe of 1s, then in \omega^\epsilon, for large n, roughly \epsilon(\log_2 n-\log_2\log_2 n) of the components of that tribe are resampled. In particular, it is unlikely (for large n) that the same tribe is still a tribe of 1s in \omega^\epsilon. So heuristically, the presence of an all-1 tribe in \omega^\epsilon feels almost independent of the presence of one in \omega. But because of the (approximate) symmetry of the setup, it is much easier to work with influences and use BKS.

I will continue the discussion of the Fourier techniques used to derive some of these results in a new post soon.

References

[1] – Benjamini, Kalai, Schramm (1999) – Noise Sensitivity of Boolean Functions and Applications to Percolation. (arXiv)

[2] – Kahn, Kalai, Linial (1988) – The influence of variables on Boolean Functions. (link)

Garban, Steif – Noise sensitivity of Boolean functions and Percolation. (arXiv) I have borrowed heavily from these excellent notes. The discrete Poincaré inequality appears in Section 1.2, with an outline of the solution in Exercise 1.5.

Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics as some parameter of the model, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or because it is the uniform distribution on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn’t really matter at this heuristic level), but we need to know how this scales. If we double the size of the system, do we need to double the number of iterations of the chain, or square it? This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time, which is a stopping time T, for which X_T\stackrel{d}{=}\pi. Note that there is one trivial way to do this. We can sample Y from distribution \pi before starting the process, then stop X at the first time T for which X_T=Y. But this is no help really, as we need to have Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order. At each step we move the top card in the pile to a randomly-chosen location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom, that is, the step at which that card is taken from the top and inserted somewhere, is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

Anyway, so this gives a somewhat artificial way to sample from the uniform distribution on a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as n\log n, where n is the number of cards.
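This is easy to simulate. Here is a quick sketch (Python): we perform top-to-random moves until the card that started at the bottom is itself moved off the top, and record how many moves that took.

import numpy as np

rng = np.random.default_rng(2)

def top_to_random_time(n):
    # number of moves until the original bottom card is taken from the top and inserted
    deck = list(range(n))            # deck[0] is the top, deck[-1] is the bottom
    bottom_card = deck[-1]
    moves = 0
    while True:
        card = deck.pop(0)                          # remove the top card
        deck.insert(int(rng.integers(0, n)), card)  # re-insert at a uniform position (top allowed)
        moves += 1
        if card == bottom_card:
            return moves

n = 52
times = [top_to_random_time(n) for _ in range(2000)]
print(np.mean(times))   # roughly n * H_n (about 236 for n = 52), i.e. of order n log n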

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn’t matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and Wilmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},

where U is a U(0,1) random variable independent of choice of i. In particular, once we have the random value of u, we can consider \phi(i,u) as i varies, to obtain a random map S\rightarrow S. Crucially, this map is not necessarily a bijection.
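As a concrete illustration, here is one possible random mapping representation (a Python sketch for a hypothetical example chain, namely nearest-neighbour random walk on \{0,\ldots,m-1\} with holding at the two boundary states): the same value of u drives the transition from every state simultaneously, and the induced map on the state space fails to be a bijection.

import random

def phi(i, u, m=8):
    # random mapping representation of the walk on {0, ..., m-1}: step up if u >= 1/2,
    # step down otherwise, clamped at the boundaries. For each fixed i this gives the
    # correct transition probabilities, but as a map of the whole state space it is
    # never a bijection here: e.g. for u < 1/2 both 0 and 1 are sent to 0.
    return min(i + 1, m - 1) if u >= 0.5 else max(i - 1, 0)

u = random.random()
print([phi(i, u) for i in range(8)])   # one realisation of the induced map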

Note first that there are many possibilities for constructing the representation \phi. For some chains and some representations, in particular random walks on vertex-transitive graphs (such as SRW; for now we are restricting attention to finite state spaces), it is possible to choose \phi so that the induced map is always a bijection, but it is also always possible to choose it so that there is some probability it is not a bijection.

Let U_1,U_2,\ldots be an IID sequence of U[0,1] random variables, and write \phi_i for the random map induced by U_i. Then consider the sequence of iterated maps:

\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,

and let T be the (random) smallest time such that the image of \phi_1\circ\ldots\circ \phi_T is a single state. That is, as we go backwards in time through the maps \phi_i, the image gradually loses states, corresponding to the maps not being bijections. Since the state space is finite, and there is positive probability that the maps fail to be bijections, it can be shown (for a suitable choice of \phi) that T is almost surely finite. The claim then is that

Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)

is distributed as the equilibrium distribution of the chain. We finish by proving this.

Proof: Since the algorithm terminates after finite time almost surely, given any \epsilon>0, we can choose N such that the probability the algorithm stops in at most N steps is greater than 1-\epsilon.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random mapping driven by U_t. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event \{T\le N\}, then X_0=\phi_1\circ \ldots \circ\phi_N(X_{-N})=Y regardless of the value of X_{-N}. So \mathbb{P}(X_0\ne Y)<\epsilon, and hence the result follows, since \epsilon>0 was arbitrary.
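In code, the whole procedure is short. Here is a minimal sketch (Python), reusing the clamped-walk representation from the sketch above; composing one extra map at a time corresponds to stepping one unit further back into the past, and we stop as soon as the image of the composition is a single state. The stationary distribution of that walk is uniform, which the empirical counts should reflect.

import random
from collections import Counter

M = 8   # state space {0, ..., M-1}

def random_map(u):
    # the map phi(., u) on the whole state space induced by a single uniform u
    return [min(i + 1, M - 1) if u >= 0.5 else max(i - 1, 0) for i in range(M)]

def cftp_sample():
    # compose phi_1 o ... o phi_T, adding new maps on the right (further into the past),
    # until the image is a single state; return that state
    composed = list(range(M))                  # identity map to start with
    while len(set(composed)) > 1:
        new_map = random_map(random.random())  # the next map, applied first
        composed = [composed[new_map[i]] for i in range(M)]
    return composed[0]

counts = Counter(cftp_sample() for _ in range(10_000))
print(sorted(counts.items()))   # empirically close to uniform on {0, ..., 7}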

What makes this easier than strong stationary times is that we don’t have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, the state here is a sequence of infinite rows of 0s and 1s, where the value of each element is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you have to consider further elements at the edges, so the coupling-from-the-past algorithm has to eliminate possibilities fast enough to make up for this growth, if we want it to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.


Duality for Stochastic Processes

In the past couple of weeks, we’ve launched a new junior probability seminar in Oxford. (If you’re reading this and would like to come, please drop me an email!) The first session featured Daniel Straulino talking about his work on the Spatial Lambda Fleming-Viot process, a model for the evolution of populations allowing for selection and geometry. A lively discussion of duality broke out halfway through, following which it seemed reasonable to devote another session to this topic. Luke Miller presented some classical and less classical examples of the theory this afternoon. I found all of this interesting and relevant, and thought it would be worth writing some things down, and tying it in with one of the courses on this subject that we attended at ALEA in Luminy last October.

The majority of this article is based on Luke’s talk. Errors, omissions and over-simplifications are of course my own.

The setup is that we have two stochastic processes, X_t taking values in R and Y_t taking values in S. For now we make no assertion about whether the two state spaces R and S are the same or related, and we make no comment on the dependence relationship between X and Y. Let P_x, Q_y be the respective probability measures, representing starting from x and y respectively. Then given a bivariate measurable function h(\cdot,\cdot) on R\times S such that:

E^P_x h(X_t,y)=E^Q_y h(x,Y_t),\quad \forall x,y\quad\forall t,

we say that X and Y are dual with respect to h.

The interpretation should be that X is a process running forwards in time, and Y is a process running backwards in time. So X_t, Y_0 represent the present, while X_0, Y_t represent the past, which is the initial time for the original process X. The fact that the relation holds for all times t allows us to carry the equality through a derivative, to obtain an equality of generators:

\mathcal{G}^X h(x,y)=\mathcal{G}^Y h(x,y),\quad \forall x,y.

On the LHS, the generator acts on x, while on the RHS it acts on y. Although it still isn’t obvious (at least to me) when a pair of processes might have this property, especially for an arbitrary function, this seems the more natural definition to think about.

Note that this does indeed require a specific function h. There were murmurings in our meeting about the possibility of two processes having a strong duality property, where this held for all h in some broad class of test functions. On more reflection, which may nonetheless be completely wrong, this seems unlikely to happen very often, except in some obviously degenerate cases, such as h constant. If it did hold, then since the set of expectations of a class of functions of a random variable determines its distribution, we would find that the instantaneous behaviour of Y is equal in distribution to the instantaneous behaviour of X when started from a fixed (x,y). It seems unlikely that there are many examples of this which are not deterministic or independent (eg two Brownian motions, or other space- and time-homogeneous Markov processes).

Anyway, a canonical example of this is the Wright-Fisher diffusion, which arises from a simple model for a population evolving in discrete generations. We assume that there are two types in the population: {A,a} seems to be the standard notation. Children choose their type randomly from the distribution of types in the previous generation. In other words, if there are N individuals at all times, and X_k is the number of type A individuals in generation k, then:

X_{k+1} | X_k \stackrel{d}{=} \mathrm{Bin}(N, \frac{X_k}{N}).
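A couple of lines of Python simulate this discrete model directly (the choices of N, starting proportion and number of generations below are arbitrary); averaging the final proportion over many runs illustrates the martingale property used in the next paragraph.

import numpy as np

rng = np.random.default_rng(3)

def wright_fisher_proportion(N=500, x0=0.3, generations=200):
    # discrete Wright-Fisher model: each generation is Bin(N, current proportion of type A)
    X = int(N * x0)
    for _ in range(generations):
        X = rng.binomial(N, X / N)
    return X / N

final = np.array([wright_fisher_proportion() for _ in range(5000)])
print(final.mean())   # close to 0.3: the proportion of type A individuals is a martingale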

It is not hard to see that in a diffusion limit as the number of individuals tends to infinity (with time sped up by a factor N), the proportion of type A individuals is a martingale, and so the generator of the limiting diffusion has no f' term. In fact, by checking a Taylor expansion, we can show that:

\mathcal{G}_{WF}f(x)=\frac{1}{2} x(1-x)f''(x),

for all f subject to mild regularity conditions. In particular, we can show that for f_n(x)=x^n, we have:

\mathcal{G}_{WF} f_n(x)=\binom{n}{2}(f_{n-1}(x)-f_n(x))

after some rearranging, using f_n''(x)=n(n-1)x^{n-2}. This looks like the generator of a jump process, indeed a jump process in which every jump is a decrease by 1. This suggests there might be a coalescent as the dual process, and indeed it turns out that Kingman’s coalescent, in which each pair of blocks coalesces at rate 1, is the dual. We have the relation in expectation:

\mathbb{E}_x[X_t^n]= \mathbb{E}_n[x^{N_t}],

where the latter term is the probability generating function of N_t, the number of blocks at time t of Kingman’s coalescent started from n blocks.

In particular, we can control the moments of the Wright-Fisher diffusion using the generating function of the block-counting process of Kingman’s coalescent, which might well be easier to work with.
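The duality relation can be checked numerically. Below is a rough Monte Carlo sketch (Python): the left-hand side is approximated by the discrete Wright-Fisher chain above with N individuals run for \lfloor Nt\rfloor generations (the usual diffusion scaling), and the right-hand side by simulating the block-counting process of Kingman's coalescent, in which k blocks wait an exponential time of rate \binom{k}{2} before the next merger. All the parameter choices are mine, and the agreement is only up to Monte Carlo and discretisation error.

import numpy as np

rng = np.random.default_rng(4)

def wf_moment(x, n, t, N=300, reps=3000):
    # Monte Carlo estimate of E_x[X_t^n] for the Wright-Fisher diffusion,
    # approximated by the discrete chain with N individuals and floor(N*t) generations
    total = 0.0
    for _ in range(reps):
        k = int(N * x)
        for _ in range(int(N * t)):
            k = rng.binomial(N, k / N)
        total += (k / N) ** n
    return total / reps

def kingman_moment(x, n, t, reps=20_000):
    # Monte Carlo estimate of E_n[x^{N_t}], with N_t the number of blocks of
    # Kingman's coalescent at time t, started from n blocks
    total = 0.0
    for _ in range(reps):
        blocks, time = n, 0.0
        while blocks > 1:
            time += rng.exponential(2.0 / (blocks * (blocks - 1)))   # rate = blocks choose 2
            if time > t:
                break
            blocks -= 1
        total += x ** blocks
    return total / reps

x, n, t = 0.4, 3, 0.5
print(wf_moment(x, n, t), kingman_moment(x, n, t))   # the two estimates should roughly agree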

That’s all very elegant, but we were talking about why this might be useful in a broader setting, and here there seems to be an obstacle to applying the idea more generally. An underlying idea in population genetics models is that, as well as the forward process, there is also a so-called ancestral process looking backwards in time, detailing how the individuals present at time t are related historically. It would be convenient if this process, which we might expect to be some nice coalescent, were the dual of the forward process.

But this seems to be a problem, as duals depend on the choice of duality function, so do we have uniqueness? It would not be very satisfying if there were several coalescent processes that could all be the dual of the forward process. Though some worked examples suggest this might not happen, because a dual and its duality function have to satisfy too many constraints, there seems no a priori reason why not. It seems that the strength of the results you can derive from the duality relation is only as strong as the duality relation itself. This is not necessarily a problem from the point of view of applications, so long as the duality function is something it might actually be useful to work with.

It’s getting late and this text is getting long, so I shall postpone a discussion of duality for interacting particle systems until tomorrow. In summary, by restricting to a finite state space, we can allow ourselves the option of a more algebraic approach, from which some direct uses of duality can be read off. I will also mention a non-technical but, I feel, helpful way to view many examples of duality in interacting particle systems as deterministic forward and backwards walks through a random environment, in what might be considered an extreme example of coupling.
