Lagrange multipliers Part Two

My own question on last week’s BMO2 notwithstanding, inequalities seem out of fashion at the moment among mainstream international olympiads. Such problems often involve minimising some function subject to a constraint, and word has, over the years, filtered down to students interested in such things, that there’s a general method for achieving this via Lagrange multipliers. The motivation for my talk in Hungary, summarised by the previous post and this one, is to dispute the claim made by some of the UK students that these are hard to justify rigorously. I dispute this because I don’t think it’s qualitatively much harder to justify Lagrange multipliers rigorously than an unconstrained optimisation problem, whereas I would claim instead that Lagrange multipliers are merely unlikely to work at a computational level on the majority of olympiad problems.

Unconstrained optimisation in two variables

Before we can possibly discuss constrained optimisation, we should discuss unconstrained optimisation. That is, finding minima of a function of several variables. We don’t lose too much by assuming that our function f(x,y) depends on two variables.

Recall that our method in the previous post for justifying the A-level approach to minima was to find a necessary condition to be a local minimum, and also a general reason why there should be a global minimum. That way, if there are finitely many points satisfying the condition, we just check all of them, and the one with the smallest value of f is the global minimum. We’ll discuss the existence of the global minimum later.

If we hold one coordinate fixed, the local variation of a function is equivalent to the one-dimensional case.

f(x+h,y)-f(x,y)= h\frac{\partial f}{\partial x}(x,y)+O(h^2).

In general we want to vary both variables, which is fine since

f(x+h,y+\ell)-f(x+h,y)=\ell \frac{\partial f}{\partial y}(x+h,y) + O(h^2).

But since we really want everything to be determined by the function at (x,y), we really want

\frac{\partial f}{\partial y}(x+h,y) \approx \frac{\partial f}{\partial y}(x,y),

and so we be mindful that we may have to assume that both partial derivatives are continuous everywhere they exist. Once we have this though, we can rewrite as

f(x+h,y+\ell) - f(x,y)= h\frac{\partial f}{\partial x}(x,y) + \ell \frac{\partial f}{\partial y}(x,y) + O(h^2)

= (\frac{\partial f}{\partial x}(x,y) ,\frac{\partial f}{\partial y}(x,y) )\cdot (h,\ell) + O(h\vee \ell^2).

In particular, if we define grad of f to be \nabla f(x,y)=(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y) ) and apply a similar argument to that which we used in the original setting. If \nabla f(x,y)\ne 0, then we can choose some small (h,\ell), such that f(x+h,y+\ell)<f(x,y). Thus a necessary condition for (x,y) to be a local minimum for f is that \nabla f=0.

Lagrange multipliers

This is natural time to discuss where Lagrange multipliers emerge. The setting now is that we still want to minimise some function f(\mathbf{x}), but only across those values of \mathbf{x} which satisfy the constraint g(\mathbf{x})=0.

But our approach is exactly the same, namely we find a necessary condition to be a local minimum subject to the condition. As before, we have

f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x}) = \mathbf{h}\cdot \nabla f + O(|\mathbf{h}|^2),

but we are only interested in those small vectors \mathbf{h} for which \mathbf{x}+\mathbf{h} actually satisfies the constraint, namely g(\mathbf{x}+\mathbf{h})=0. But then

0 = g(\mathbf{x}+\mathbf{h})- g(\mathbf{x})=\mathbf{h}\cdot \nabla g + O(|\mathbf{h}|^2).

From this, we conclude that the set of small relevant \mathbf{h} is described by \mathbf{h}\nabla g=O(|\mathbf{h}|^2). And now we really can revert to the original argument. If there’s a small \mathbf{h} such that \mathbf{h}\cdot \nabla g=0 but \mathbf{h}\cdot \nabla f \ne 0, then we can find some \mathbf{h'_+},\mathbf{h'_-}=\pm \mathbf{h}+O(|\mathbf{h}|^2) such that at least one of f(\mathbf{h'_+}),f(\mathbf{h'_-})<f(\mathbf{h}).

So a necessary condition to be a constrained local minimum is that every vector which is perpendicular to \nabla g must also be perpendicular to \nabla f. From which it follows that these two vectors must be parallel, that is \nabla f(\mathbf{x})=\lambda \nabla g(\mathbf{x}), where \lambda is the so-called Lagrange multiplier. Of course we must also have that g(\mathbf{x})=0, and so we have a complete characterisation for a necessary condition that the constrained optimisation has a local minimum at \mathbf{x}, assuming that all the derivatives of both f and g exist with suitable regularity near \mathbf{x}.


The point of setting up the one-variable case in the unusual way in the previous post was to allow me to say at this stage: “it’s exactly the same”. Well, we’ve already seen an extra differentiability condition we might require, but apart from that, the same approach holds. Multi-variate continuous functions also attain their bounds when the domain is finite and includes its boundary.

Checking the boundary might be more complicated in this setting. If the underlying domain is \{x,y,z\ge 0\}, then one will have to produce a separate argument for why the behaviour when at least one of the variables is zero fits what you are looking for. Especially in the constrained case, it’s possible that the surface corresponding to the constraint doesn’t actually have a boundary, for example if it is the surface of a sphere. Similarly, checking that the objective function gets large as the variables diverge to infinity can be annoying, as there are many ‘directions’ down which to diverge to infinity.

Motivating Cauchy-Schwarz

In practise, you probably want to have such methods in hand as a last resort on olympiad problems. It’s always possible that something will slip through the net, but typically problem-setters are going to trying to ensure that their problems are not amenable to mindless application of non-elementary methods. And even then, one runs the risk of accusations of non-rigour if you don’t state exact, precise results which justify everything which I’ve presented above.

One thing that can be useful, on the other hand, is to observe that the Lagrange multiplier condition looks a lot like the equality condition for Cauchy-Schwarz. So, even if you can’t solve the family of Lagrange multiplier ‘equations’, this does suggest that applying Cauchy-Schwarz to the vectors involved might give you some insight into the problem.

The following inequality from the IMO 2007 shortlist is a good example.

Suppose a_1,\ldots,a_{100}\ge 0 satisfy a_1^2+\ldots+a_{100}^2=1. Prove that

a_1^2a_2+a_2^2a_3+\ldots + a_{100}^2 a_1<\frac{12}{25}.

We shouldn’t be perturbed by the strictness. Maybe we’ll end up showing a true bound in terms of surds that is less neat to write down…

Anyway, applying Lagrange multipliers would require us to solve

a_{k-1}^2 + 2a_ka_{k+1}=2\lambda a_k,\quad k=1,\ldots,100,

with indices taken modulo 100. As so often with these cyclic but non-symmetric expressions, this looks quite hard to solve. However, it turns out that by applying Cauchy-Schwarz to the vectors (a_k),(a_{k-1}^2+2a_ka_{k+1}) gets us a long way into the problem by classical means. Working all the way through is probably best left as an exercise.


Lagrange multipliers Part One: A much simpler setting

I am currently in northern Hungary for our annual winter school for some of the strongest young school-aged mathematicians in the UK and Hungary. We’ve had a mixture of lectures, problem-solving sessions and the chance to enjoy a more authentic version of winter than is currently on offer in balmy Oxford.

One of my favourite aspects of this event is the chance it affords for the students and the staff to see a slightly different mathematical culture. It goes without saying that Hungary has a deep tradition in mathematics, and the roots start at school. The British students observe fairly rapidly that their counterparts have a much richer diet of geometry, and methods in combinatorics at school, which is certainly an excellent grounding for use in maths competitions. By contrast, our familiarity with calculus is substantially more developed – by the time students who study further maths leave school, they can differentiate almost anything.

But the prevailing attitude in olympiad circles is that calculus is unrigorous and hence illegal method. The more developed summary is that calculus methods are hard, or at least technical. This is true, and no-one wants to spoil a measured development of analysis from first principles, but since some of the British students asked, it seemed worth giving a short exposition of why calculus can be made rigorous. They are mainly interested in the multivariate case, and the underlying problem is that the approach suggested by the curriculum doesn’t generalise well at all to the multivariate setting. Because it’s much easier to imagine functions of one variable, we’ll develop the machinery of the ideas in this setting in this post first.

Finding minima – the A-level approach

Whether in an applied or an abstract setting, the main use of calculus at school is to find where functions attain their maximum or minimum. The method can be summarised quickly: differentiate, find where the derivative is zero, and check the second-derivative at that value to determine that the stationary point has the form we want.

Finding maxima and finding minima are a symmetric problem, so throughout, we talk about finding minima. It’s instructive to think of some functions where the approach outlined above fails.

20160131_124059In the top left, there clearly is a minimum, but the function is not differentiable at the relevant point. We can probably assert this without defining differentiability formally: there isn’t a well-defined local tangent at the minimum, so we can’t specify the gradient of the tangent. In the top right, there’s a jump, so depending on the value the function takes at the jump point, maybe there is a minimum. But in either case, the derivative doesn’t exist at the jump point, so our calculus approach will fail.

In the middle left, calculus will tell us that the stationary point in the middle is a ‘minimum’, but it isn’t the minimal value taken by the function. Indeed the function doesn’t have a minimum, because it seems to go off to -\infty in both directions. In the middle right, the asymptote provides a lower bound on the values taken by the function, but this bound is never actually achieved. Indeed, we wouldn’t make any progress by calculus, since there are no stationary points.

At the bottom, the functions are only defined on some interval. In both cases, the minimal value is attained at one of the endpoints of the interval, even though the second function has a point which calculus would identify as a minimum.

The underlying problem in any calculus argument is that the derivative, if it exists, only tells us about the local behaviour of the function. At best, it tells us that a point is a local minimum. This is at least a necessary condition to be a global minimum, which is what we actually care about. But this is a change of emphasis from the A-level approach, for which having zero derivative and appropriately-signed second-derivative is treated as a sufficient condition to be a global minimum.

Fortunately, the A-level approach is actually valid. It can be shown that if a function is differentiable everywhere, and it only has one stationary point, where the second-derivative exists and is positive, then this is in fact the global minimum. The first problem is that this is really quite challenging to show – since in general the derivative might not be continuous, although it might have many of the useful properties of a continuous function. Showing all of this really does require setting everything up carefully with proper definitions. The second problem is that this approach does not generalise well to multivariate settings.

Finding minima – an alternative recipe

What we do is narrow down the properties which the global minimum must satisfy. Here are some options:

0) There is no global minimum. For example, the functions pictured in the middle row satisfy this.

Otherwise, say the global minimum is attained at x. It doesn’t matter if it is attained at several points. At least one of the following options must apply to each such x.

1) f'(x)=0,

2) f'(x) is not defined,

3) x lies on the boundary of the domain where f is defined.

We’ll come back to why this is true. But with this decomposition, the key to identifying a global minimum via calculus is to eliminate options 0), 2) and 3). Hopefully we can eliminate 2) immediately. If we know we can differentiate our function everywhere, then 2) couldn’t possibly hold for any value of x. Sometimes we will be thinking about functions defined everywhere, in which case 3) won’t matter. Even if our function is defined on some interval, this only means we have to check two extra values, and this isn’t such hard work.

20160131_124716It’s worth emphasising why if x is a local minimum not on the boundary and f'(x) exists, then f'(x)=0. We show that if f'(x)\ne 0, then x can’t be a local minimum. Suppose f'(x)>0. Then both the formal definition of derivative, and the geometric interpretation in terms of the gradient of a tangent which locally approximates the function, give that, when h is small,

f(x-h) = f(x)-h f'(x) +o(h),

where this ‘little o’ notation indicates that for small enough h, the final term is much smaller than the second term. So for small enough h, f(x-h)<f(x), and so we don’t have a local minimum.

The key is eliminating option 0). Once we know that there definitely is a global minimum, we are in a good position to identify it using calculus and a bit of quick checking. But how would we eliminate option 0)?

Existence of global minima

This is the point where I’m in greatest danger of spoiling first-year undergraduate course content, so I’ll be careful.

As we saw in the middle row, when functions are defined on the whole real line, there’s the danger that they can diverge to \pm \infty, or approach some bounding value while never actually attaining it. So life gets much easier if you work with functions defined on a closed interval. We also saw what can go wrong if there are jumps, so we will assume the function is continuous, meaning that it has no jumps, or that as y gets close to x, f(y) gets close to f(x). If you think a function can be differentiated everywhere, then it is continuous, because we’ve seen that once a function has a jump (see caveat 2) then it certainly isn’t possible to define the derivative at the jump point.

It’s a true result that a continuous function defined on a closed interval is bounded and attains its bounds. Suppose such a function takes arbitrarily large values. The main idea is that if the function takes arbitrarily large values throughout the interval, then because the interval is finite it also takes arbitrarily large values near some point, which will make it hard to be continuous at that point. You can apply a similar argument to show that the function can’t approach a threshold without attaining it somewhere. So how do you prove that this point exists? Well, you probably need to set up some formal definitions of all the properties under discussion, and manipulate them carefully. Which is fine. If you’re still at school, then you can either enjoy thinking about this yourself, or wait until analysis courses at university.

My personal opinion is that this is almost as intuitive as the assertion that if a continuous function takes both positive and negative values, then it has a zero somewhere in between. I feel if you’re happy citing the latter, then you can also cite the behaviour of continuous functions on closed intervals.

Caveat 2) It’s not true to say that if a function doesn’t have jumps then it is continuous. There are other kinds of discontinuity, but in most contexts these are worse than having a jump, so it’s not disastrous in most circumstances to have this as your prime model of non-continuity.

Worked example

Question 1 of this year’s BMO2 was a geometric inequality. I’ve chosen to look at this partly because it’s the first question I’ve set to make it onto BMO, but mainly because it’s quite hard to find olympiad problems which come down to inequalities in a single variable.

Anyway, there are many ways to parameterise and reparameterise the problem, but one method reduces, after some sensible application of Pythagoras, to showing

f(x)=x+ \frac{1}{4x} + \frac{1}{4x+\frac{1}{x}+4}\ge \frac{9}{8}, (*)

for all positive x.

There are simpler ways to address this than calculus, especially if you establish or guess that the equality case is x=1/2. Adding one to both sides is probably a useful start.

But if you did want to use calculus, you should argue as follows. (*) is certainly true when x\ge \frac{9}{8} and also when $x\le \frac{2}{9}$. The function f(x) is continuous, and so on the interval [\frac{2}{9},\frac{9}{8}] it has a minimum somewhere. We can differentiate, and fortunately the derivative factorises (this might be a clue that there’s probably a better method…) as

(1-\frac{1}{4x^2}) \left[ 1 - \frac{4}{(4x+\frac{1}{x}+4)^2} \right].

If x is positive, the second bracket can’t be zero, so the only stationary point is found at x=1/2. We can easily check that f(\frac12)=\frac98, and we have already seen that f(\frac29),f(\frac98)>\frac98. We know f attains its minimum on [\frac29,\frac98], and so this minimal value must be $\frac98$, as we want.

Overall, the moral of this approach is that even if we know how to turn the handle both for the calculation, and for the justification, it probably would be easier to use a softer approach if possible.

Next stage

For the next stage, we assess how much of this carries across to the multivariate setting, including Lagrange multipliers to find minima of a function subject to a constraint.

Nested Closed Intervals

Th UK team for this year’s International Mathematical Olympiad in Santa Marta, Colombia, has just been selected. For details, see the BMOC website.

During the selection camp, which was hosted at Oundle School near Peterborough, we spent a while discussing analytic questions that typically lie outside the scope of the olympiad syllabus. Furthermore, incorrect consideration of, for example, the exact conditions for a stationary point to be a global maximum, are likely to incur very heavy penalties if a candidate has attempted a solution using, for example, Lagrange multipliers. As a result we have a dilemma about how much analysis to teach during the training process: we want the students to be able to use sophisticated methods if necessary; but we don’t want to spoil the experience of learning this theory in a serious step-by-step manner as first year undergraduates.

This post does not present a solution to this dilemma. Rather, I want to discuss one question that arose on the last day of exams. Because the statement of the question is currently classified, I will have to be oblique in discussion of the solution, but this shouldn’t distract from the maths I actually want to talk about.

The setup is as follows. We have a sequence of nested closed intervals in the reals, that is:

[a_1,b_1]\supset [a_2,b_2]\supset [a_3,b_3]\supset\ldots

We want to demonstrate that there is some real number that lies in all of the intervals, that is \cap_{n\geq 1}[a_n,b_n]\neq \varnothing. This feels intuitively obvious, but some form of proof is required.

First, what does closed mean? Well a closed interval is a closed set, but what about, for example, [a,\infty)\subset\mathbb{R}? It turns out that it is most convenient to define an open set, and then take a closed set to be the complement of an open set.

The best way of thinking about an open set is to say that it does not contain its boundary. This is certainly the case for an open interval or an open ball. It is not immediately clear how to extend this to a general definition. But note that if no point in the set X can be on the boundary, then in all the natural examples to consider there must some finite distance between any point x\in X and the boundary. So in particular there is a small open ball centred on x that is entirely contained within X. This is then the definition of an open set in a metric space (in particular for some \mathbb{R}^d).

Note that it is not a sensible definition to say that a closed set has the property that there is a closed ball containing each point. Any open set has this property also! For if there is an open ball of radius R around a point x, then there is a closed ball of radius R/2 around that same point. So we really do have to say that a set is closed if the complement is open. Note that in \mathbb{R}, a closed interval is closed, and a finite union of closed intervals is closed, though not a countable union as:

(0,1]=\cup_{n\geq 1}[\frac{1}{n},1].

Now we know what a closed set is, we can start thinking about the question.

First we remark that it is not true if we allow the interval to be unbounded. Obviously \cap_n [n,\infty)=\varnothing. Note that even though it appears that these sets have an open upper boundary, they are closed because the complement (-\infty,n) is open. This will not be a problem in our question because everything is contained within the first interval [a_1,b_1] and so is bounded.

Secondly we remark that the result is not true if we move to a general host set. For example, it makes sense to consider open and closed sets in the rationals. For example, the open ball radius 1 either side of 1/2 is just all the rationals strictly between -1/2 and 3/2. We could write this as (-\frac12,\frac32)\cap\mathbb{Q}. But then note that

\cap_{n\geq 1}[\pi-\frac{1}{n},\pi+\frac{1}{n}]\cap\mathbb{Q}

cannot contain any rational elements, so must be empty.

There was various talk that this result might in general be a consequence of the Baire Category Theorem. I think this is overkill. Firstly, there is a straightforward proof for the reals which I will give at the end. Secondly, if anything it is a lemma required for the proof of BCT. This result is often called Cantor’s Lemma.

Lemma (Cantor): Let X be a complete metric space, and let F_1\supset F_2\supset\ldots be a nested sequence of non-empty closed subsets of X with \text{diam}(F_n)\rightarrow 0. There there exists x\in X such that

\cap F_n=\{x\}.

Translation: for ‘complete metric space’ think ‘the reals’ or \mathbb{R}^d. The diameter is, unsurprisingly, the largest distance between two points in the set. For reasons that I won’t go into, the argument for the olympiad question gave \text{diam}(F_{n+1})\leq \frac12\text{diam}(F_n) so this certainly holds here.

Proof: With reference to AC if necessary, pick an element x_n\in F_n for all n. Note that by nesting, x_m\in F_n\;\forall n\leq m. As a result, for m>n the distance d(x_n,x_m)\leq \text{diam}(F_n). This tends to 0 as n grows. The definition of complete is that such a sequence then has a limit point x.

Informally, completeness says that if a sequence of points get increasingly close, they must tend towards a point in the set. This is why it doesn’t work for the rationals. You can have a sequence of rationals that get very close together, but approach a point not in the set, eg an irrational. We use the definition of closed sets in terms of sequences: if the sequence is within a closed set, then the limit is too. This could only go wrong if we ‘leaked onto the boundary’ in the limit, but for a closed set, the boundary is in the set. This shows that x\in F_n for each n, and so $x\in\cap_n F_n$. But if there is another point in \cap_n F_n, then the distance between them is strictly positive, contradicting the claim that diameter tends to 0. This ends the proof.

Olympiad-friendly version: I think the following works fine as a fairly topology definition-free proof. Consider the sequence of left-boundaries

a_1\leq a_2\leq a_3\leq \ldots <b_1.

This sequence is non-decreasing and bounded, so it has a well-defined limit. Why? Consider the supremum. We can’t exceed the sup, but we must eventually get arbitrarily close, by definition of supremum and because the sequence is non-decreasing. Call this limit a. Then do the same for the upper boundaries to get limit b.

If a>b, then there must be some a_n>b_n, which is absurd. So we must have some non-empty interval as the intersection. Consideration of the diameter again gives that this must be a single point.