Lagrange multipliers Part Two

My own question on last week’s BMO2 notwithstanding, inequalities seem out of fashion at the moment among mainstream international olympiads. Such problems often involve minimising some function subject to a constraint, and word has, over the years, filtered down to students interested in such things, that there’s a general method for achieving this via Lagrange multipliers. The motivation for my talk in Hungary, summarised by the previous post and this one, is to dispute the claim made by some of the UK students that these are hard to justify rigorously. I dispute this because I don’t think it’s qualitatively much harder to justify Lagrange multipliers rigorously than an unconstrained optimisation problem, whereas I would claim instead that Lagrange multipliers are merely unlikely to work at a computational level on the majority of olympiad problems.

Unconstrained optimisation in two variables

Before we can possibly discuss constrained optimisation, we should discuss unconstrained optimisation. That is, finding minima of a function of several variables. We don’t lose too much by assuming that our function f(x,y) depends on two variables.

Recall that our method in the previous post for justifying the A-level approach to minima was to find a necessary condition to be a local minimum, and also a general reason why there should be a global minimum. That way, if there are finitely many points satisfying the condition, we just check all of them, and the one with the smallest value of f is the global minimum. We’ll discuss the existence of the global minimum later.

If we hold one coordinate fixed, the local variation of a function is equivalent to the one-dimensional case.

f(x+h,y)-f(x,y)= h\frac{\partial f}{\partial x}(x,y)+O(h^2).

In general we want to vary both variables, which is fine since

f(x+h,y+\ell)-f(x+h,y)=\ell \frac{\partial f}{\partial y}(x+h,y) + O(\ell^2).

But since we really want everything to be determined by the function at (x,y), we really want

\frac{\partial f}{\partial y}(x+h,y) \approx \frac{\partial f}{\partial y}(x,y),

and so we should be mindful that we may have to assume that both partial derivatives are continuous everywhere they exist. Once we have this though, we can rewrite as

f(x+h,y+\ell) - f(x,y)= h\frac{\partial f}{\partial x}(x,y) + \ell \frac{\partial f}{\partial y}(x,y) + O(h^2\vee \ell^2)

= (\frac{\partial f}{\partial x}(x,y) ,\frac{\partial f}{\partial y}(x,y) )\cdot (h,\ell) + O(h^2\vee \ell^2).

In particular, we define grad of f to be \nabla f(x,y)=(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y) ) and apply a similar argument to that which we used in the original setting. If \nabla f(x,y)\ne 0, then we can choose some small (h,\ell), such that f(x+h,y+\ell)<f(x,y). Thus a necessary condition for (x,y) to be a local minimum for f is that \nabla f=0.
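As a quick numerical sanity check of this necessary condition, here is a minimal Python sketch. The function f(x,y)=(x-1)^2+2(y+1/2)^2, the finite-difference gradient and the step size t are all illustrative assumptions of mine rather than anything from the talk. It takes (h,\ell)=-t\nabla f at a point where the gradient is non-zero and confirms that f decreases, while the gradient at the true minimiser is (numerically) zero.

```python
def f(x, y):
    # Illustrative function (an assumption): minimised at (1, -0.5).
    return (x - 1) ** 2 + 2 * (y + 0.5) ** 2

def grad_f(x, y, eps=1e-6):
    # Central finite differences approximating the two partial derivatives.
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

def small_downhill_step(x, y, t=1e-3):
    # Choose (h, l) = -t * grad f, so (h, l) . grad f < 0 whenever grad f != 0.
    gx, gy = grad_f(x, y)
    return x - t * gx, y - t * gy

x0, y0 = 0.0, 0.0                      # grad f is non-zero here
x1, y1 = small_downhill_step(x0, y0)
print(f(x1, y1) < f(x0, y0))           # the small step strictly decreases f
print(grad_f(1.0, -0.5))               # numerically zero at the minimiser
```

The converse fails, of course: a vanishing gradient is necessary but not sufficient, which is why the argument still needs a separate reason for a global minimum to exist.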

Lagrange multipliers

This is a natural time to discuss where Lagrange multipliers emerge. The setting now is that we still want to minimise some function f(\mathbf{x}), but only across those values of \mathbf{x} which satisfy the constraint g(\mathbf{x})=0.

But our approach is exactly the same, namely we find a necessary condition to be a local minimum subject to the condition. As before, we have

f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x}) = \mathbf{h}\cdot \nabla f + O(|\mathbf{h}|^2),

but we are only interested in those small vectors \mathbf{h} for which \mathbf{x}+\mathbf{h} actually satisfies the constraint, namely g(\mathbf{x}+\mathbf{h})=0. But then

0 = g(\mathbf{x}+\mathbf{h})- g(\mathbf{x})=\mathbf{h}\cdot \nabla g + O(|\mathbf{h}|^2).

From this, we conclude that the set of small relevant \mathbf{h} is described by \mathbf{h}\cdot \nabla g=O(|\mathbf{h}|^2). And now we really can revert to the original argument. If there’s a small \mathbf{h} such that \mathbf{h}\cdot \nabla g=0 but \mathbf{h}\cdot \nabla f \ne 0, then we can find some \mathbf{h'_+},\mathbf{h'_-}=\pm \mathbf{h}+O(|\mathbf{h}|^2), both satisfying the constraint, such that at least one of f(\mathbf{x}+\mathbf{h'_+}),f(\mathbf{x}+\mathbf{h'_-}) is less than f(\mathbf{x}).

So a necessary condition to be a constrained local minimum is that every vector which is perpendicular to \nabla g must also be perpendicular to \nabla f. From which it follows that these two vectors must be parallel, that is \nabla f(\mathbf{x})=\lambda \nabla g(\mathbf{x}), where \lambda is the so-called Lagrange multiplier. Of course we must also have that g(\mathbf{x})=0, and so we have a complete characterisation for a necessary condition that the constrained optimisation has a local minimum at \mathbf{x}, assuming that all the derivatives of both f and g exist with suitable regularity near \mathbf{x}.
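To see the parallel-gradients condition in a concrete case, here is a small Python sketch. The objective f(x,y)=x+2y and the constraint g(x,y)=x^2+y^2-1 are my own illustrative choices. The code finds the constrained minimum by brute force around the circle, then checks that \nabla f and \nabla g are parallel there and reads off \lambda.

```python
import math

def f(x, y):
    # Illustrative objective (an assumption, not from the talk).
    return x + 2 * y

def grad_f(x, y):
    return (1.0, 2.0)

def grad_g(x, y):
    # Constraint g(x, y) = x^2 + y^2 - 1 = 0, i.e. the unit circle.
    return (2 * x, 2 * y)

def constrained_min(steps=200000):
    # Brute force: parametrise the circle by angle and keep the smallest f.
    best = None
    for i in range(steps):
        theta = 2 * math.pi * i / steps
        x, y = math.cos(theta), math.sin(theta)
        if best is None or f(x, y) < best[0]:
            best = (f(x, y), x, y)
    return best

value, x, y = constrained_min()
fx, fy = grad_f(x, y)
gx, gy = grad_g(x, y)
cross = fx * gy - fy * gx   # vanishes exactly when the gradients are parallel
lam = fx / gx               # the multiplier in grad f = lam * grad g
print(value, (x, y), cross, lam)
```

Solving \nabla f=\lambda\nabla g by hand gives (x,y)=\pm(1,2)/\sqrt{5}, with the minimum -\sqrt{5} at the negative sign and \lambda=-\sqrt{5}/2, which is what the search should reproduce up to the grid resolution.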


The point of setting up the one-variable case in the unusual way in the previous post was to allow me to say at this stage: “it’s exactly the same”. Well, we’ve already seen an extra differentiability condition we might require, but apart from that, the same approach holds. Multi-variate continuous functions also attain their bounds when the domain is bounded and includes its boundary.

Checking the boundary might be more complicated in this setting. If the underlying domain is \{x,y,z\ge 0\}, then one will have to produce a separate argument for why the behaviour when at least one of the variables is zero fits what you are looking for. Especially in the constrained case, it’s possible that the surface corresponding to the constraint doesn’t actually have a boundary, for example if it is the surface of a sphere. Similarly, checking that the objective function gets large as the variables diverge to infinity can be annoying, as there are many ‘directions’ down which to diverge to infinity.

Motivating Cauchy-Schwarz

In practice, you probably want to have such methods in hand as a last resort on olympiad problems. It’s always possible that something will slip through the net, but typically problem-setters are going to be trying to ensure that their problems are not amenable to mindless application of non-elementary methods. And even then, you run the risk of accusations of non-rigour if you don’t state exact, precise results which justify everything which I’ve presented above.

One thing that can be useful, on the other hand, is to observe that the Lagrange multiplier condition looks a lot like the equality condition for Cauchy-Schwarz. So, even if you can’t solve the family of Lagrange multiplier ‘equations’, this does suggest that applying Cauchy-Schwarz to the vectors involved might give you some insight into the problem.

The following inequality from the IMO 2007 shortlist is a good example.

Suppose a_1,\ldots,a_{100}\ge 0 satisfy a_1^2+\ldots+a_{100}^2=1. Prove that

a_1^2a_2+a_2^2a_3+\ldots + a_{100}^2 a_1<\frac{12}{25}.

We shouldn’t be perturbed by the strictness. Maybe we’ll end up showing a true bound in terms of surds that is less neat to write down…

Anyway, applying Lagrange multipliers would require us to solve

a_{k-1}^2 + 2a_ka_{k+1}=2\lambda a_k,\quad k=1,\ldots,100,

with indices taken modulo 100. As so often with these cyclic but non-symmetric expressions, this looks quite hard to solve. However, it turns out that applying Cauchy-Schwarz to the vectors (a_k),(a_{k-1}^2+2a_ka_{k+1}) gets us a long way into the problem by classical means. Working all the way through is probably best left as an exercise.
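Before attempting the exercise, it can be reassuring to check numerically that the bound is plausible. The Python sketch below is only a sanity check and certainly not a proof. It searches a grid of configurations in which just three consecutive entries are non-zero (a restriction I have chosen for convenience, with a hypothetical grid size) and records the largest value of the cyclic sum found.

```python
import math

def cyclic_sum(a):
    # a_1^2 a_2 + a_2^2 a_3 + ... + a_n^2 a_1, with indices taken cyclically.
    n = len(a)
    return sum(a[k] ** 2 * a[(k + 1) % n] for k in range(n))

def best_on_three_entries(steps=100):
    # Only a_1, a_2, a_3 are non-zero (an assumption made to keep the search
    # small); spherical coordinates enforce a_1^2 + a_2^2 + a_3^2 = 1.
    best = 0.0
    for i in range(steps + 1):
        for j in range(steps + 1):
            theta = (math.pi / 2) * i / steps
            phi = (math.pi / 2) * j / steps
            a = [0.0] * 100
            a[0] = math.sin(theta) * math.cos(phi)
            a[1] = math.sin(theta) * math.sin(phi)
            a[2] = math.cos(theta)
            best = max(best, cyclic_sum(a))
    return best

best = best_on_three_entries()
print(best, best < 12 / 25)
```

Even on this restricted family the largest value found sits not far below 12/25, which suggests there is little slack in the inequality.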


The Combinatorial Nullstellensatz

I’ve been taking a TCC course this term on Additive Combinatorics, delivered via video link from Bristol by Julia Wolf. At some point once the dust of this term has settled, I might write some things about the ideas of the course I’ve found most interesting, in particular the tools of discrete Fourier analysis to get a hold on some useful combinatorial properties of subsets of \mathbb{Z}/n\mathbb{Z} for example.

For this post, I want to talk instead about a topic that was merely mentioned in passing, the Combinatorial Nullstellensatz. The majority of this post is based on Alon’s original paper, which can be found here, and Chapter 9 of Tao and Vu’s book Additive Combinatorics. My aim is to motivate the theorem, give a proof, introduce one useful application from additive combinatorics, and solve Q6 from IMO 2007 as a direct corollary.

What does Nullstellensatz mean? Roughly speaking, it seems to mean ‘a theorem specifying the zeros’. We will be specifying the zeros of a polynomial. We are comfortable with how the zeros of a complex-valued polynomial of one variable behave. The number of zeros is given precisely by the degree of the polynomial (allowing appropriately for multiplicity). It is generally less clear how we might treat the zeros of a polynomial of many variables. The zero set is likely to be some surface, perhaps of dimension one less than the number of variables. In particular, it no longer really makes sense to talk about whether this set is finite or not. The Combinatorial Nullstellensatz gives us some control over the structure of this set of zeros.

The idea behind the generalisation is to view the Fundamental Theorem of Algebra as a statement not about existence of roots, but rather about (combinatorial) existence of non-roots. That is, given a polynomial P(x) of degree n, for any choice of (n+1) distinct complex numbers, at least one of them is not a root of P. This may look like a very weak statement in this context, where we only expect finitely many roots anyway, but in a multivariate setting it is much more intuitively powerful.

Recall that the degree of a monomial is given by the sum of the exponents of the variables present. So the degree of 4x^2 y^3 z is 6. The degree of a polynomial is then given by the largest degree of a monomial in that polynomial. A polynomial P(x_1,\ldots,x_n) over a field F with degree d might have lots of monomial terms of degree d. Suppose one of these monomials is x_1^{d_1}\ldots x_n^{d_n}, where \sum d_i=d. Then one version of the Combinatorial Nullstellensatz asserts that whenever you take subsets of the base field S_i\subset F with |S_i|\ge d_i+1, then there is a point with x_i\in S_i such that P(x_1,\ldots,x_n)\ne 0.

In other words, you can’t have a box (i.e. a product of sets) of dimensions (d_1+1) \times (d_2+1) \times\ldots\times (d_n+1) on which the polynomial is zero.
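The statement is easy to test by brute force over a small finite field. In the Python sketch below, the polynomial P(x,y)=x^2y+3xy+2 over the integers mod 7 is an arbitrary example of my own. It has degree 3 with top-degree monomial x^2y, so d_1=2 and d_2=1, and the theorem predicts that no box S_1\times S_2 with |S_1|=3 and |S_2|=2 can consist entirely of zeros of P.

```python
from itertools import combinations, product

def evaluate(poly, point, p):
    # poly maps exponent tuples to coefficients; arithmetic is done mod p.
    total = 0
    for exponents, coeff in poly.items():
        term = coeff % p
        for x, e in zip(point, exponents):
            term = term * pow(x, e, p) % p
        total = (total + term) % p
    return total

def find_nonzero_point(poly, sets, p):
    # Return some point of the box S_1 x ... x S_n where P != 0, if any.
    for point in product(*sets):
        if evaluate(poly, point, p) != 0:
            return point
    return None

p = 7
# P(x, y) = x^2 y + 3 x y + 2: an illustrative choice, not from Alon's paper.
poly = {(2, 1): 1, (1, 1): 3, (0, 0): 2}

all_boxes_ok = all(
    find_nonzero_point(poly, (S1, S2), p) is not None
    for S1 in combinations(range(p), 3)   # |S_1| = d_1 + 1 = 3
    for S2 in combinations(range(p), 2)   # |S_2| = d_2 + 1 = 2
)
print(all_boxes_ok)
```

This checks all 35 \times 21 boxes of the required size, and each should contain a non-root of P.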

Unsurprisingly, the proof proceeds by induction on the number of variables. Alon’s result proceeds via a more general theorem giving information about the possibility of writing multivariate polynomials as linear combinations of polynomials in one variable.

We would like to start this induction by fixing the x_n co-ordinate, then viewing P as a polynomial in x_1,\ldots,x_{n-1} only. One problem with this approach is that the largest degree monomials in P are not necessarily still the largest degree monomials in P with x_n fixed. So we need to apply a division algorithm argument.

I’m going to miss some steps so as to keep this of suitable blog post length. The key idea is to apply the division algorithm to P with respect to the simplest polynomial that is zero on all of S_n, which we define as:

g(x_n)=\prod_{s_n\in S_n}(x_n-s_n).

We can decompose as

P(\mathbf{x})=q_n(\mathbf{x})g(x_n)+\sum_{j=0}^{|S_n|-1}r_{n,j}(x_1,\ldots,x_{n-1})x_n^j.

So now we ask where the term x_1^{d_1}\ldots x_n^{d_n} is coming from, bearing in mind that d_n<|S_n|. The lower order terms in g cannot contribute to this, as they cannot be of maximal degree. Also, the first term in q_n(\mathbf{x})g(x_n) cannot contribute as the exponent of x_n is too large. So the term in question must be coming from r_{n,d_n}(x_1,\ldots,x_{n-1})x_n^{d_n}. So now we can apply the induction hypothesis to the polynomial r_{n,d_n} to find x_1\in S_1,\ldots, x_{n-1}\in S_{n-1} such that r_{n,d_n}(x_1,\ldots,x_{n-1}) is non-zero. With these values, we can view the remainder as a non-zero polynomial in x_n of degree less than |S_n|, which therefore has fewer than |S_n| roots, and so there is an x_n\in S_n such that

\sum_{j=0}^{|S_n|-1}r_{n,j}(x_1,\ldots,x_{n-1})x_n^j\neq 0.

Since g vanishes on S_n, P takes the same value as the remainder at this point, so P(x_1,\ldots,x_n)\ne 0. This concludes the proof by induction.

I want to discuss two relatively simple applications. The first is the Cauchy-Davenport Theorem, which one might view as the first non-trivial theorem in additive combinatorics, giving a bound on the size of a sumset.

Theorem (Cauchy-Davenport): Given A, B non-empty subsets of Z_p for p a prime, then

|A+B|\geq \min\{p,|A|+|B|-1\}.

( A+B:=\{c: c=a+b,a\in A,b\in B\} )

Note that the result isn’t especially surprising. Providing some sort of ordering to the elements of A and B might be a sensible way to proceed. Certainly if they were sets in \mathbb{Z}, this would give a proof immediately.

Proof: Only the case |A|+|B| \le p is interesting. Following Alon’s argument, suppose that |A+B| \le |A|+|B|-2, and let C be a set of exactly |A|+|B|-2 elements containing A+B. Set f(x,y)=\prod_{c\in C}(x+y-c), so f(a,b)=0 for all a\in A,b\in B.

Then the coefficient of x^{|A|-1}y^{|B|-1} in f is \binom{|A|+|B|-2}{|A|-1} as we have to choose which of the terms in the product supply an x and which supply a y. This is non-zero (in Z_p recall) since the upper integer is less than p. The Combinatorial Nullstellensatz, applied with S_1=A and S_2=B, then gives a contradiction.
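For small primes the theorem can be confirmed exhaustively. The following Python sketch checks every pair of non-empty subsets for p=5 and p=7; the function names are my own. It also tries the composite modulus 4, where the inequality fails (take A=B=\{0,2\}), which shows that primality really is being used.

```python
from itertools import combinations

def nonempty_subsets(m):
    # All non-empty subsets of {0, 1, ..., m-1}, as tuples.
    for size in range(1, m + 1):
        yield from combinations(range(m), size)

def sumset(A, B, m):
    # A + B computed modulo m.
    return {(a + b) % m for a in A for b in B}

def cauchy_davenport_holds(m):
    # Check |A+B| >= min(m, |A|+|B|-1) for every pair of non-empty subsets.
    subsets = list(nonempty_subsets(m))
    return all(
        len(sumset(A, B, m)) >= min(m, len(A) + len(B) - 1)
        for A in subsets
        for B in subsets
    )

print(cauchy_davenport_holds(5), cauchy_davenport_holds(7))
print(cauchy_davenport_holds(4))   # composite modulus: the bound can fail
```

The search is exponential in the modulus, so this is only feasible for very small cases.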

My second example is from the IMO in Vietnam which I attended. I spent a lot of time thinking about this problem, but made no progress.

IMO 2007 Question 6: Let n be a positive integer. Consider

S=\{(x,y,z) | x,y,z\in \{0,1,\ldots,n\}, x+y+z>0\}

as a set of (n+1)^3-1 points in 3D space. Determine the smallest number of planes, the union of which contains S but does not include (0,0,0).

Answer: 3n. Consider the planes x+y+z = k for k varying between 1 and 3n. The aim is to prove that you cannot do it with fewer.
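The upper bound is simple enough to confirm by machine for small n. This Python sketch (with a function name of my own choosing) checks that the 3n planes x+y+z=k cover every point of S and miss the origin.

```python
from itertools import product

def planes_cover(n):
    # The 3n planes are x + y + z = k for k = 1, ..., 3n.
    plane_constants = range(1, 3 * n + 1)
    points = [pt for pt in product(range(n + 1), repeat=3) if sum(pt) > 0]
    covers_S = all(sum(pt) in plane_constants for pt in points)
    misses_origin = 0 not in plane_constants
    right_size = len(points) == (n + 1) ** 3 - 1
    return covers_S and misses_origin and right_size

print(all(planes_cover(n) for n in range(1, 6)))
```

This says nothing about the lower bound, which is the hard direction and is where the polynomial argument comes in.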

To prove this, suppose we can do it with fewer planes, say k<3n. We write the equation of the i-th plane as

a_ix+b_iy+c_iz+d_i=0,\quad i=1,\ldots,k.

Note that the d’s are non-zero as (0,0,0) must not be a solution. Then take the product of all these degree one polynomials together and subtract a multiple of

\prod_{i=1}^n (x-i)(y-i)(z-i),

with the multiple chosen so the resulting polynomial has a root at (0,0,0). (This constant must be non-zero to cancel the non-zero product of the d’s.) This resulting polynomial is degree 3n by construction, since the product of the plane equations has degree less than 3n, and x^ny^nz^n has a non-zero coefficient, but it is zero on the box \{0,1,\ldots,n\}^3, which contradicts the Combinatorial Nullstellensatz.