# Tightness in Skorohod Space

This post continues the theme of revising topics in the analytic toolkit relevant to proving convergence of stochastic processes. Of particular interest is the question of how to prove that families of Markov chains might have a process scaling limit converging to a solution of some stochastic differential equation, in a generalisation of Donsker’s theorem for Brownian motion. In this post, however, we address more general aspects of convergence of stochastic processes, with particular reference to Skorohod space.

Topological Background

I’ve discussed Skorohod space in a previous post. For now, we focus attention on compactly supported functions, D[0,T]. Some of what follows can be extended to the infinite-time setting easily, and some requires more work. Although we can define a metric on the space of cadlag functions in lots of ways, it is more useful to think topologically, or at least with a more vague sense of metric. We say two cadlag functions are close to one another if there is a reparameterisation of the time-axis, (a function [0,T] to itself) that is uniformly close to the identity function, and when applied to one of the cadlag functions, brings it close to the other cadlag function. Heuristically, two cadlag functions are close if their large jumps are close to one another and of similar size, and if they are uniformly close elsewhere. It is worth remembering that a cadlag function on even an unbounded interval can have only countably many jumps, and only finitely many with magnitude greater than some threshold on any compact interval.

For much of the theory one would like to use, it is useful for the spaces under investigation to be separable. Recall a topological space is separable if there exists a countable dense subset. Note in particular that D[0,T] is not separable under the uniform metric, since we can define $f_x(\cdot)=\mathbf{1}_{(\cdot \ge x)}$ for each $x\in[0,T]$, then $||f_x-f_y||_\infty=1$ whenever $x\ne y$. In particular, we have an uncountable collection of disjoint open sets given by the balls $\mathcal{B}(f_x,\frac12)$, and so the space is not countable. Similarly, $C[0,\infty)$ is not separable. A counterexample might be given by considering functions which take the values {0,1} on the integers. Thus we have a map from $\{0,1\}^{\mathbb{N}}\rightarrow C[0,\infty)$, where the uniform distance between any two distinct image points is at least one, hence the open balls of radius 1/2 around each image point give the same contradiction as before. However, the Stone-Weierstrass theorem shows that C[0,T] is separable, as we can approximate any such function uniformly well by a polynomial, and thus uniformly well by a polynomial with rational coefficients.

In any case, it can be shown that D[0,T] is separable with respect to the natural choice of metric. It can also be shown that there is a metric which gives the same open sets (hence is a topologically equivalent metric) under which D[0,T] is complete, and hence a Polish space.

Compactness in C[0,T] and D[0,T]

We are interested in tightness of measures on D[0,T], so first we need to address compactness for sets of deterministic functions in D[0,T]. First, we consider C[0,T]. Here, the conditions for a set of functions to be compact is given by the celebrated Arzela-Ascoli theorem. We are really interested in compactness as a property of size, so we consider instead relative compactness. A set is relatively compact (sometimes pre-compact) if its closure is compact. For the existence of subsequential limits, this is identical to compactness, only now we allow the possibility of the limit point lying outside the set.

We note that the function $C[0,T]\rightarrow \mathbb{R}$ given by $||f||_\infty$ is continuous, and hence uniform boundedness is certainly a required condition for compactness in C[0,T]. Arzela-Ascoli states that uniform boundedness plus equicontinuity is sufficient for a set of such functions to be compact. Equicontinuity should be thought of as uniform continuity that is uniform among all the functions in the set, rather than just within the argument of an individual particular function.

For identical reasons, we need uniform boundedness for relative compactness in D[0,T], but obviously uniform continuity won’t work as a criterion for discontinuous functions! We seek some analogue of the modulus of continuity that ignores jumps. We define

$\omega'_\delta(f):=\inf_{\{t_i\}} \max_i \sup_{s,t\in[t_{i-1},t_i)} |f(s)-f(t)|,$

where the infimum is taken over all meshes $0=t_0 with $t_i-t_{i-1}>\delta$. Note that as $\delta\downarrow 0$, we can, if we want, place the $t_i$ so that large jumps of the function f take place over the boundaries between adjacent parts of the mesh. In particular, for a given cadlag function, it can be shown fairly easily that $\omega'_\delta(f)\downarrow 0$ as $\delta\rightarrow 0$. Then, unsurprisingly, in a similar fashion to the Arzela-Ascoli theorem, it follows that a set of functions $A\subset D[0,T]$ is relatively compact if it is uniformly bounded, and

$\lim_{\delta\rightarrow 0} \sup_{f\in A}\omega'_\delta(f)=0.$

Note that this ‘modulus of continuity’ needs to decay uniformly across the set of functions, but that we do not need to choose the mesh at level $\delta$ uniformly across all functions. This would obviously not work, as then the functions $\mathbf{1}_{(\cdot\ge x_n)}$ for any sequence $x_n\rightarrow x$ would not be compact, but they clearly converge in Skorohod space!

Tightness in C[0,T] and D[0,T]

Naturally, we are mainly interested in (probability) measures on D[0,T], and in particular conditions for tightness on this space. Recall a family of measures is tight if for any $\epsilon>0$, there exists some compact set A such that

$\pi(A)>1-\epsilon,\quad \forall \pi\in\Pi.$

So, for measures $(\mu_n)$ on D[0,T], the sequence is tight precisely if for any $\epsilon>0$, there exists $M,\delta$ and some N such that for any n>N, both

$\mu_n(||f||_\infty >M)\le \epsilon,\quad \mu_n(\omega'_\delta(f)>\epsilon)\le \epsilon$

hold. In fact, the second condition controls variation sufficiently strongly, that we can replace the first condition with

$\mu_n(|f(0)|>M)\le \epsilon.$

Often we might be taking some sort of scaling limit of these processes in D[0,T], where the jumps become so small in the limit that we expect the limit process to be continuous, perhaps an SDE or diffusion. If we can replace $\omega'_\delta$ by $\omega_\delta$, the standard modulus of continuity, then we have the additional that any weak limit lies in C[0,T].

In general, to prove convergence of some stochastic processes, we will want to show that the processes are tight, by demonstrating the properties above, or something equivalent. Then Prohorov’s theorem (which I tend to think of as a probabilistic functional version of Bolzano-Weierstrass) asserts that the family of processes has a weak subsequential limit. Typically, one then shows that any weak subsequential limit must have the law of some particular random process. Normally this is achieved by showing some martingale property (eg for an SDE) in the limit, often by using the Skorohod representation theorem to use almost sure subsequential convergence rather than merely weak convergence. Then one argues that there is a unique process with this property and a given initial distribution. So since all weak subsequential limits are this given process, in fact the whole family has a weak limit.

# Non-separable Skorohod Representations

In the previous post, I discussed the statement and proof of the Skorohod representation theorem. This concerns the conditions under which it is possible to couple distributions which converge in law, to obtain a family of random variable on a possibly very large probability space, which converge almost surely. The condition for the theorem to hold is that the base space, or at least the support of the limiting distribution should be a separable metric space. Skorohod’s original proof concerned the case where all the distributions were supported on a complete, separable metric space (Polish space), but this extension is not particularly involved, and was proven not long after the original result.

It is natural to ask exactly what goes wrong in non-separable or non-metrizable spaces. Recall a space is separable if it contains a countable dense subset. Obviously, finite or countable sets are by definition separable with any metric. Considering the points with rational coordinates shows that $\mathbb{R}^d$ is separable for each d, and the Stone-Weierstrass theorem shows that continuous functions with on a bounded interval are also separable with the uniform topology, as they can be approximated uniformly well by polynomials with rational coefficients. One heuristic is that a separable space does not have ‘too many’ open sets.

There are references (for example, see [2]) to examples of Skorohod non-representation in non-metrizable topological spaces, which are ‘big’ enough to allow convergence in distribution with respect to a particular class of test functions, but where the distributions are not uniformly tight, so cannot converge almost surely. However, I don’t really understand this well at all, and have struggled to chase the references, some of which are unavailable, and some in French.

Instead, I want to talk about an example given in [1] of a family of distributions on a non-separable space, which cannot be coupled to converge almost surely. The space is (0,1) equipped with the discrete metric, which says that $d(x,y)=1$ whenever $x\ne y$. Note that it is very hard to have even deterministic convergence in this space, since the only way to be close to a element of the space is indeed to be equal to that element. We will construct random variables and it will unsurprising that they cannot possibly converge almost surely in any coupling, but the exact nature of the construction will lead to convergence in distribution.

Based on what we proved last time, the support of the limiting distribution will be non-separable. It turns out that the existence of such a distribution is equiconsistent in the sense of formal logic with the existence of an extension of Lebesgue measure to the whole power set of (0,1). This is not allowed under the Axiom of Choice, but is consistent under the slightly weaker Axiom of Dependent Choice (AC). This weaker condition says, translated into language more familiar to me, that every directed graph with arbitrary (and in particular, potentially uncountable) vertex set, and with all out-degrees at least 1 contains an infinite directed path. This seems obvious when viewed through the typically countable context of graph theory. But the natural construction is to start somewhere and ‘just keep going’ wherever possible, which involves making a choice from the out-neighbourhood at lots of vertices. Thus it is clear why this is weaker than AC. Anyway, in the sequel, we assume that this extension of Lebesgue measure exists.

Example (from [1]): We take $(X_n)_{n\ge 1}$ to be an IID sequence of non-negative RVs defined on the probability space $((0,1),\mathcal{B}(0,1),\mathrm{Leb})$, with expectation under Lebesgue measure equal to 1. It is not obvious how to do this, with the restriction on the probability space. One example might be to write $\omega\in(0,1)$ as $\overline{\omega_1\omega_2\ldots}$, the binary expansion, and then set $X_n=2\omega_n$. We will later require that $X_n$ is not identically 1, which certainly holds in this example just given.

Let $\mu$ be the extension of Lebesgue measure to the power set $\mathcal{P}=\mathcal{P}(0,1)$. Now define the measures:

$\mu_n(B)=\mathbb{E}_\mu(X_n \mathbf{1}_B),\quad \forall B\in\mathcal{P}.$

To clarify, we are defining a family of measures which also are defined for all elements of the power set. We have defined them in a way that is by definition a coupling. This will make it possible to show convergence in distribution, but they will not converge almost surely in this coupling, or, in fact, under any coupling. Now consider a restricted class of sets, namely $B\in \sigma(X_1,\ldots,X_k)$, the class of sets distinguishable by the outcomes of the first k RVs.

[Caution: the interpretation of this increasing filtration is a bit different to the standard setting with for example Markov processes, as the sets under consideration are actually subsets of the probability space on which everything is defined. In particular, there is no notion that a ‘fixed deterministic set’ lies in all the layers of the filtration.]

Anyway, by independence, when n>k, by independence, we have

$\mu_n(B)=\mathbb{E}_\mu(X_n\mathbf{1}_B)=\mathbb{E}_\mu(X_n)\mathbb{E}_\mu(\mathbf{1}_B)=\mu(B).$

So whenever $B\in\mathcal{F}\bigcup_k \sigma(X_1,\ldots,X_k)$, $\lim_n \mu_n(B)=\mu(B)$. By MCT, we can extend this convergence to any bounded $\mathcal F$-measurable function.

This is the clever bit. We want to show that $\mu_n(B)\rightarrow\mu(B)$ for all $B\in\mathcal P$, but we only have it so far for $B\in\mathcal F$. But since $\mathcal{F}\subset \mathcal P$, which is the base field of the probability space under the (non-AC) assumption, we can take conditional expectations. In particular for any $B\in\mathcal P$, $\mathbb{E}_\mu[\mathbf{1}_B | \mathcal{F}]$ is a bounded, $\mathcal F$-measurable function. Hence, by definition of $\mu_n$ and the extended MCT result:

$\mu_n(B)=\mathbb{E}_\mu[X_n\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]]=\mathbf{E}_{\mu_n}[\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]] \rightarrow \mathbb{E}_\mu [\mathbb{E}_\mu[\mathbf{1}_B |\mathcal{F}]].$

Now, since by definition $\mathbf{1}_B$ is $\mathcal{P}$-measurable, applying the tower law gives that this is equal to $\mu(B)$. So we have

$\mu_n(B)\rightarrow \mu(B),\quad \forall B\in\mathcal{P}.$ (*)

This gives weak convergence $\mu_n\Rightarrow \mu$. At first glance it might look like we have proved a much stronger condition than we need. But recall that in any set equipped with the discrete topology, any set is both open and closed, and so to use the portmanteau lemma, (*) really is required.

Now we have to check that we can’t have almost sure convergence in any coupling of these measures. Suppose that we have a probability space with random variables $Y,(Y_n)$ satisfying $\mathcal L(Y)=\mu, \mathcal L(Y_n)=\mu_n$. But citing the example I gave of $X_n$ satisfying the conditions, the only values taken by $Y_n$ are 0 and 2, and irrespective of the coupling,

$\mathbb{P}(Y_n=2\text{ infinitely often})>0.$

So it is impossible that $Y_n$ can converge almost surely to any supported on [0,1].

References

[1] Berti, Pratelli, Rigo – Skorohod Representation and Disintegrability (here – possibly not open access)

[2] Jakubowski – The almost sure Skorokhod representation for subsequences in non-metric spaces.

# Mixing Times 3 – Convex Functions on the Space of Measures

The meat of this course covers rate of convergence of the distribution of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.

$\mathcal{M}_1(E)=\{(x_v:v\in\Omega),x_v\geq 0, \sum x_v=1\}\subset \mathbb{R}^\Omega.$

Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the $L_1$ metric, but the motivation is worth exploring. Typically, the size of $|\Omega|$ will be function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set $\Omega$ for what will follow.

Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is $\mu$, and a bookmaker has priced their odds according to their evaluation of the outcome as $\nu$. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only $|\Omega|$ possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance $||\mu-\nu||_{TV}$.

Formally, we define:

$||\mu-\nu||_{TV}:=\max_{A\subset\Omega}|\mu(A)-\nu(A)|.$

Note that if this maximum is achieved at A, it is also achieved at $A^c$, and so we might as well go with the original intuition of

$||\mu-\nu||_{TV}=\max_{A\subset\Omega} \left[\mu(A)-\nu(A)\right].$

If we decompose $\mu(A)=\sum_{x\in A}\mu(x)$, and similarly for $A^c$, then add the results, we obtain:

$||\mu-\nu||_{TV}=\frac12\sum_{x\in\Omega}|\mu(A)-\nu(A)|.$

There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution $\lambda$ of $X_0$, we are interested in $||\lambda P^t-\pi||_{TV}$. The problem is that doing everything in terms of some general $\lambda$ is really annoying, at the very least for notational reasons. So really we want to investigate

$d(t)=\max_{\lambda\in\mathcal{M}_1(E)}||\lambda P^t-\pi||_{TV},$

the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then $\lambda P^t-\pi$ is some signed measure on $\Omega$ with zero total mass.

But although I haven’t yet been explicit about this, it is easy to see that $||\cdot||_{TV}$ is a norm on this space. In fact, it is (equivalent to – dividing by 1/2 makes no difference!) the product norm of the $L_1$ norm as defined before. Recall the norms are convex functions. This is an immediate consequence of the triangle inequality. The set of suitable distributions $\lambda$ is affine, because an affine combination of probability distributions is another probability distribution.

Then, we know from linear optimisation theory, that convex functions on an affine space achieve their maxima at boundary points. And the boundary points for this definition of $\lambda\in\mathcal{M}_1(E)$, are precisely the delta-measures at some point of the state space $\delta_v$. So in fact, we can replace our definition of d(t) by:

$d(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV},$

where $P^t(x,\cdot)$ is the same as $(\delta_x P^t)(\cdot)$. Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:

$\bar{d}(t):=\max_{x,y\in\Omega}||P^t(x,\cdot)-P^t(y,\cdot)||_{TV}.$

The triangle inequality gives $\bar{d}(t)\leq 2d(t)$ immediately. But we want to show $d(t)\leq \bar{d}(t)$, and we can do that as before, by considering

$\max_{\lambda,\mu\in\mathcal{M}_1(E)}||\lambda P^t-\mu P^t||_{TV}.$

The function we are maximising is a convex function on $\mathcal{M}_1(E)^2$, and so it attains its maximum at a boundary point, which must be $\lambda=\delta_x,\mu=\delta_y$. Hence $\bar{d}(t)$ is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as this is the maximum of the same expression over a strict subset.

I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.

# Weak Convergence and the Portmanteau Lemma

Much of the theory of Large Deviations splits into separate treatment of open and closed sets in the rescaled domains. Typically we seek upper bounds for the rate function on closed sets, and lower bounds for the rate function on open sets. When things are going well, these turn out to be same, and so we can get on with some applications and pretty much forget about the topology underlying the construction. Many sources made a comment along the lines of “this is natural, by analogy with weak convergence”.

Weak convergence is a topic I learned about in Part III Advanced Probability. I fear it may have been one of those things that leaked out of my brain shortly after the end of the exam season… Anyway, this feels like a good time to write down what it is all about a bit more clearly. (I’ve slightly cheated, and chosen definitions and bits of the portmanteau lemma which look maximally similar to the Large Deviation material, which I’m planning on writing a few posts about over the next week.)

The motivation is that we want to extend the notion of convergence in distribution of random variables to general measures. There are several ways to define convergence in distribution, so accordingly there are several ways to generalise it. Much of what follows will be showing that these are equivalent.

We work in a metric space (X,d) and have a sequence $(\mu_n)$ and $\mu$ of (Borel) probability measures. We say that $(\mu_n)$ converges weakly to $\mu$, or $\mu_n\Rightarrow\mu$ if:

$\mu_n(f)\rightarrow\mu(f), \quad\forall f\in\mathcal{C}_b(X).$

So the test functions required for result are the class of bounded, continuous functions on X. We shall see presently that it suffices to check a smaller class, eg bounded Lipschitz functions. Indeed the key result, which is often called the portmanteau lemma, gives a set of alternative conditions for weak convergence. We will prove the equivalence cyclically.

Portmanteau Lemma

The following are equivalent.

a) $\mu_n\Rightarrow \mu$.

b) $\mu_n(f)\rightarrow\mu(f)$ for all bounded Lipschitz functions f.

c) $\limsup_n \mu_n(F)\leq \mu(F)$ for all closed sets F. Note that we demanded that all the measures be Borel, so there is no danger of $\mu(F)$ not being defined.

d) $\liminf_n \mu_n(F)\geq \mu(G)$ for all open sets G.

e) $\lim_n \mu_n(A)=\mu(A)$ whenever $\mu(\partial A)=0$. Such an A is called a continuity set.

Remarks

a) All of these statements are well-defined if X is a general topological space. I can’t think of any particular examples where we want to use measures on a non-metrizable space (eg C[0,1] with topology induced by pointwise convergence), but there seem to be a few references (such as the one cited here) implying that the results continue to hold in this case provided X is locally compact Hausdorff. This seems like an interesting thing to think about, but perhaps not right now.

b1) This doesn’t strike me as hugely surprising. I want to say that any bounded continuous function can be uniformly approximated almost everywhere by bounded Lipschitz functions. Even if that isn’t true, I am still not surprised.

b2) In fact this condition could be replaced by several alternatives. In the proof that follows, we only use one type of function, so any subset of $\mathcal{C}_b(X)$ that contains the ones we use will be sufficient to determine weak convergence.

c) and d) Why should the sign be this way round? The canonical example to have in mind is some sequence of point masses $\delta_{x_n}$ where $x_n\rightarrow x$ in some non-trivial way. Then there is some open set eg X\{x} such that $\mu_n(X\backslash x)=1$ but $\mu(X\backslash x)=0$. Informally, we might say that in the limit, some positive mass could ‘leak out’ into the boundary of an open set.

e) is then not surprising, as the condition of being a continuity set precisely prohibits the above situation from happening.

Proof

a) to b) is genuinely trivial. For b) to c), find some set F’ containing F such that $\mu(F')-\mu(F)=\epsilon$. Then find a Lipschitz function f which is 0 outside F’ and 1 on F. We obtain

$\limsup_n \mu_n(F)\leq \limsup \mu_n(f)=\mu(f)\leq \mu(F').$

But $\epsilon$ was arbitrary, so the result follows as it tends to zero. c) and d) are equivalent after taking $F^c=G$. If we assume c) and d) and apply them to $A^\circ, \bar{A}$, then e) follows.

e) to a) is a little trickier. Given a bounded continuous function f, assume WLOG that it has domain [0,1]. At most countably many events $\{f=a\}$ have positive mass under each of $\mu, (\mu_n)$. So given $M>0$, we can choose a sequence

$-1=a_0 such that $|a_{k+1}-a_k|<\frac{1}{M}$,

and $\mu(f=a_k)=\mu_n(f=a_k)=0$ for all k,n. Now it is clear what to do. $\{f\in[a_k,a_{k+1}]\}$ is a continuity set, so we can apply e), then patch everything together. There are slightly too many Ms and $\epsilon$s to do this sensibly in WordPress, so I will leave it at that.

I will conclude by writing down a combination of c) and d) that will look very familiar soon.

$\mu(A^\circ)\leq \liminf_n \mu_n(A)\leq \limsup_n\mu_n(A)\leq \mu(\bar{B}).$

References

Apart from the Part III Advanced Probability course, this article was prompted by various books on Large Deviations, including those by Frank den Hollander and Ellis / Dupuis. I’ve developed the proof above from the hints given in the appendix of these very comprehensible notes by Rassoul-Agha and Seppalainen.

# The Sample Space for a Die

Also featuring: a non-Lebesgue measurable set and Dynkin’s Lemma.

At the National Maths Summer School last week, the senior students and I spent a while talking about probability space, and in particular, when it was reasonable to assign a probability to a potential event. We considered rolling a standard die, and the probabilities $\mathbb{P}(D\in\{\})$, the empty event, and $\mathbb{P}(D=7)$. Though it is tempting to conclude that the latter must be zero, in the end we decided that it should not actually be defined at all.

Why? Well, if we accept $\mathbb{P}(D=7)=0$, then by extension we must accept $\mathbb{P}(D=137)$ and $\mathbb{P}(D=\1.50)$ also both exist and are zero. What have we gained? In reality very little. But the cost is this: we might define an event to be any subset of the sample space. Before, our sample space was $\Omega=\{1,2,3,4,5,6\}$, and so there are exactly 64 events, including the possibly counter-intuitive empty event $\{\}$. This is finite, which is always nice. With the extra events, however, we must extend the sample space to $\Omega=\{1,2,3,4,5,6,7,\ldots$, where “…” means “the rest of the universe”. This is a fairly exotic mathematical object, and really has no place in any sensible discussion.

This reminded me of one of my favourite results from Part II Probability and Measure. Of course, for uncountable sample spaces, we cannot necessarily assume all subsets of $\Omega$ are measurable. Instead we build up a sigma-algebra of measurable sets, most importantly for Lebesgue measure on $\mathbb{R}$. An immediate question to ask is: are all subsets of $\mathbb{R}$ Lebesgue-measurable?

And the answer is ‘no’. Why? The standard counterexample is as follows. Consider Lebesgue measure on the unit interval $U=\mathbb{R}/\mathbb{Z}$, with endpoints identified. Now consider the rationals in U, which are actually a subgroup $\mathbb{Q}\leq U$, with uncountably many cosets. Pick an element from each coset (*). Call this sets A. Then, working modulo 1, $U=\cup_{q\in\mathbb{Q}\cap U}A+q$. If A is Lebesgue-measurable, then so is A+q, and $\mu(A+q)=\mu(A)$ (**). Combining these two results, using countable additivity:

$\mu(U)=$ (0 if $\mu(A)=0, \infty$ otherwise).

This is a contradiction, and hence we conclude that A is not Lebesgue-measurable.

Remark on (*): This relies on the Axiom of Choice. In fact, the existence of non-Lebesgue measurable sets MAY be equivalent to AC.

Remark on (**): I was suddenly unsure that this was obvious. I mean, this is such a weird set that it is in fact not measurable: why should its hypothetical measure be translation invariant? It is tempting to argue vaguely, by saying that the construction of Lebesgue measure is invariant under translation at all steps. As so often with elementary measure theory, recourse to Dynkin’s Lemma is more reliable.

Let D be the collection of measurable sets whose measure is invariant under translation. By definition, D is invariant under translation (of its elements). D certainly contains all intervals in U, which is a pi-system generating $\mathcal{B}([0,1])$. But, in a classic proof by suggestive notation, we can check that D is a d-system. The presence of the empty sets is clear. If we have $B\subset A$, both in D, then also $A\backslash B\in D$, as the translates of $x\in A\backslash B$ must be in A, but cannot be in B, as B is translation invariant. Finally, given $A_1\subset A_2\subset\ldots\subset D$, then $x\in\cup A_i\Rightarrow x\in A_n$ for some n, so all of x’s translates are in $A_n$, and hence in $\cup A_i$.

Now we can deploy Dynkin’s Lemma. D must be the sigma-algebra of all measurable sets, as we wanted.