# Popoviciu’s Inequality

I’ve just returned to the UK after an excellent stay at the University of British Columbia. More about that will follow in some posts which are being queued. Anyway, I flew back in time to attend the last day of the camp held at Oundle School to select the UK team for this year’s International Mathematical Olympiad, to be held in Cape Town in early July. I chose to give a short session on inequalities, which is a topic I did not enjoy as a student and do not enjoy now, but perhaps that makes it a particularly suitable choice?

We began with a discussion of convexity. Extremely occasionally in olympiads, and merely slightly occasionally in real life, an inequality arises which can be proved by showing that a given function is convex in all its arguments, hence its maximum must be attained at a boundary value in each variable.

In general though, our main experience of convexity will be through the medium of Jensen’s inequality. A worthwhile check is to consider one form of the statement of Jensen’s inequality, with two arguments. We are always given a convex function f defined on an interval I=[a,b], and $x,y\in I$, and weights $\alpha,\beta$ which sum to 1. Then

$\alpha f(x)+\beta f(y)\ge f(\alpha x+\beta y).$

How do we prove this? Well, in fact this is the natural definition of convexity for a function. There had initially been vague murmurings that convexity should be defined as a property of the second derivative of the function. But this is somewhat unsatisfactory, as the function $f(x)=|x|$ is certainly convex, but the second derivative does not exist at x=0. One could argue that the second derivative may not be finite at x=0, but is nonetheless positive by defining it as a limit which happens to be infinite in this case. However, I feel it is uncontroversial to take the case of Jensen given above as the definition of convexity. It is after all a geometric property, so why raise objections to a geometric definition?

The general statement of Jensen’s inequality, with the natural definitions, is

$\sum_{i} \alpha_i f(x_i)\ge f(\sum_{i}\alpha_ix_i).$

This is sometimes called Weighted Jensen in the olympiad community, with ‘ordinary’ Jensen following when the weights are all 1/n. In a probabilistic context, we write

$\mathbb{E}[f(X)]\ge f(\mathbb{E}X),$

for X any random variable supported on the domain of f. Naturally, X can be continuous as well as discrete, giving an integral version of the discretely weighted statement.

Comparing ‘ordinary’ Jensen and ‘weighted’ Jensen, we see an example of the situation where the more general result is easier to prove. As is often the case in these situations, this arises because the more general conditions allow more ‘elbow room’ to perform an inductive argument. A stronger statement means that assuming the induction hypothesis is more useful! Anyway, I won’t digress too far onto the proof of discrete ‘weighted’ Jensen as it is a worthwhile exercise for olympiad students.

What I wanted to discuss principally was an inequality due to Tiberiu Popoviciu:

$\frac13[f(x)+f(y)+f(z)]+f(\frac{x+y+z}{3})\ge \frac23[f(\frac{x+y}{2})+f(\frac{y+z}{2})+f(\frac{z+x}{2})].$

We might offer the following highly vague intuition. Jensen asserts that for sums of the form $\sum f(x_i)$, you get larger sums if the points are more spread out. The effect of taking the mean is immediately to bring all the points as close together as possible. But Popoviciu says that this effect is so pronounced that even with only half the weight on the outer points (and the rest as close together as possible), it still dominates a system with the points twice as close together.

So how to prove it? I mentioned that there is, unsurprisingly, a weighted version of this result, which was supposed to act as a hint to avoid getting too hung up about midpoints. One can draw nice diagrams with a triangle of points (x,f(x)), (y,f(y)), (z,f(z)) and draw midpoints, medians and centroids, but the consensus seemed to be that this didn’t help much.

I had tried breaking up the LHS into three symmetric portions, and using weighted Jensen to obtain terms on the RHS, but this also didn’t yield much, so I warned the students against this approach unless they had a specific reason to suppose it might succeed.

Fortunately, several of the students decided to ignore this advice, and though most fell into a similar problem I had experienced, Joe found that by actively avoiding symmetry, a decomposition into two cases of Jensen could be made. First we assume WLOG that $x\le y \le z$, and so by standard Jensen, we have

$\frac13[f(x)+f(y)]\ge \frac23 f(\frac{x+y}{2}).$

It remains to show

$\frac13 f(z)+f(\frac{x+y+z}{3})\ge \frac23[f(\frac{x+z}{2})+f(\frac{y+z}{2})].$

If we multiply by ¾, then we have an expression on each side that looks like the LHS of Weighted Jensen. At this point, it is worth getting geometric again. One way to visualise Jensen is that for a convex function, a chord between two points on the function lies above the function. (For standard Jensen with two variables, in particular the midpoint lies above the function.) But indeed, suppose we have values $x_1, then the chord between $f(x_1),f(y_1)$ lies strictly above the chord between $f(x_2),f(y_2)$. Making precisely such a comparison gives the result required above. If you want to be more formal about it, you could consider replacing the values of f between $x_2,y_2$ with a straight line, then applying Jensen to this function. Linearity allows us to move the weighting in and out of the brackets on the right hand side, whenever the mean lies in this straight line interval.

Hopefully the diagram above helps. Note that we can compare the heights of the blue points (with the same abscissa), but obviously not the red points!

In any case, I was sceptical about whether this method would work for the weighted version of Popoviciu’s inequality

$\alpha f(x)+\beta f(y) + \gamma f(z)+f(\alpha x+\beta y+\gamma z)\ge$

$(\alpha+\beta)f(\frac{\alpha x+\beta y}{\alpha+\beta})+(\beta+\gamma)f(\frac{\beta y + \gamma z}{\beta+\gamma})+(\gamma+\alpha)f(\frac{\gamma z+\alpha x}{\gamma+\alpha}).$

It turns out though, that it works absolutely fine. I would be interested to see a solution to the original statement making use of the medians and centroid, as then by considering general Cevians the more general inequality might follow.

That’s all great, but my main aim had been to introduce one trick which somewhat trivialises the problem. Note that in the original statement of Popoviciu, we have a convex function, but we only evaluate it at seven points. So for given x,y,z, it makes no difference if we replace the function f with a piece-wise linear function going through the correct seven points. This means that if we can prove that the inequality for any convex piece-wise linear function with at most eight linear parts then we are done.

(There’s a subtlety here. Note that we will prove the inequality for all such functions and all x,y,z, but we will only use this result when x,y,z and their means are the points where the function changes gradient.)

So we consider the function

$g_a(x)=\begin{cases}0& x\le 0\\ ax & x\ge 0\end{cases}$

for some positive value of a. It is not much effort to check that this satisfies Popoviciu. It is also easy to check that the constant function, and the linear function g(x)=bx also satisfy the inequality. We now prove that we can write the piece-wise linear function as a sum of functions which satisfy the inequality, and hence the piece-wise linear function satisfies the inequality.

Suppose we have a convex piecewise linear function h(x) where $x_1 are the points where the derivative changes. We write

$a_i=h'(x_i+)-h'(x_i'),\quad a_0=h'(x_1-),$

for the change in gradient of h around point $x_i$. Crucially, because h is convex, we have $a_i\ge 0$. Then we can write h as

$h(x)=C+ a_0x+g_{a_1}(x-x_1)+\ldots+g_{a_{n}}(x-x_n),$

for a suitable choice of the constant C. This result comes according to [1] as an intermediate step in a short paper of Hardy, Littlewood and Polya, which I can’t currently find online. Note that inequalities are preserved under addition (but not under subtraction) so it follows that h satisfies Popoviciu, and so the original function f satisfies it too for the values of x,y,z chosen. These were arbitrary (but were used to construct h), and hence f satisfies the inequality for all x,y,z.

Some further generalisations can be found in [1]. With more variables, there are more interesting combinatorial aspects that must be checked, regarding the order of the various weighted means.

[1] – D. Grinberg – Generalizations of Popoviciu’s Inequality. arXiv

# Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to grow exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID $X_1,\ldots,X_n$. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

$Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],$

with laws $\mu_n$ supported on $L_\infty([0,1])$. Note that the expected behaviour is a straight line from (0,0) to (1,$\mathbb{E}X_1$). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

$\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].$

This is what we expect ‘standard’ behaviour to look like:

The deviations from a straight line are on a scale of $\sqrt{n}$. Here are two examples of potential large deviation behaviour:

Or this:

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for $(\mu_n)$. Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that $\Lambda(\lambda)$ is the logarithmic moment generating function of X_1 as before, with $\Lambda^*(x)$ the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures $(\mu_n)$ satisfy an LDP on $L_\infty([0,1])$ with good rate function:

$I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}$

where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in $L_\infty([0,1])$, so any open set contains a $\phi$ for which $I(\phi)$ is at least in principle finite. (Obviously, if $\Lambda^*$ is not finite everywhere, then extra restrictions of $\phi'$ are required.)

The following picture may be helpful at providing some motivation:

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some $\phi'(t)\neq \mathbb{E}X$, then this says that:

$Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),$

$\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),$

$= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},$

by Cramer. Now we can use independence:

$\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},$

as in fact is given by Mogulskii.

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general $\Lambda^*(x)$ will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as same the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of $\Lambda^*(x)$ is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of $L_\infty([0,1])$ requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.