The Fisher Information and Cramér-Rao Bound

Given that I’d done this twice before, and was giving the same tutorial five times this week, I was surprised at the extent to which the definition of the Fisher Information caused me to confuse both myself and the students. I thought it would be worth summarising some of the main ways to get confused, and talking about one genuine, quantitative use of the Fisher Information.

Recall we are in a frequentist model, where there is an unknown parameter \theta controlling the distribution of some observable random variable X. Our aim is to make inference about \theta using the value of X we observe. We use a lower case x to indicate a value we actually observe (i.e. a fixed value, as opposed to a random variable). For each value of \theta, there is a density function f_\theta(x) controlling X. We summarise this in the likelihood function L(x,\theta)=f_\theta(x).

The important thing to remember is that there are two unknowns here. X is unknown because it is genuinely a random variable. Whereas \theta is unknown because that is how the situation has been established. \theta is fixed, but we are ignorant of its value. If we knew \theta we wouldn’t need to be doing any statistical inference to find out about it! A useful thing to keep in mind in everything that follows is: “Is this quantity a RV or not?” This is equivalent to “Is this a function of X or not?”, but the original form is perhaps clearer.
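For concreteness (this is my own choice of running example, anticipating the binomial question at the end), if X\sim \text{Bin}(n,\theta) with n known, then

L(x,\theta)=\binom{n}{x}\theta^x(1-\theta)^{n-x},

viewed as a function of both the observed count x and the unknown parameter \theta.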

Given the observation X, we define the maximum likelihood estimator (MLE) to be

\hat\theta(X):=\text{argmax}_\theta L(X,\theta).

In words, given some data, \hat\theta is the parameter under which this data is most likely. Note that L(x,\theta) is a probability density function for fixed \theta, but NOT for fixed x. (This changes in a Bayesian framework.) For example, there might well be values of x for which L(x,\theta)=0\,\forall \theta\in\Theta.

Note also that we are only interested in the relative values of L(x,\theta_1), L(x,\theta_2), so it doesn’t matter if we rescale L by a constant factor (although this means the marginal in x is no longer a pdf). We typically consider the log-likelihood l(x,\theta)=\log L(x,\theta) instead, as this has a more manageable form when the underlying RV is an IID sequence: the likelihood factorises into a product, so the log-likelihood is a sum. And since we are interested in ratios of values of L, we are equally interested in differences between values of the log-likelihood l.
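To make this concrete, here is a minimal numerical sketch in Python. Everything in it (the Bernoulli model, the sample size, the grid search) is an illustrative assumption of mine, not something fixed by the discussion above; the point is just that the log-likelihood of an IID sequence is a sum, and the MLE maximises it.

```python
import numpy as np

# Illustrative assumption: an IID Bernoulli(theta) sample with a made-up true value.
rng = np.random.default_rng(0)
true_theta = 0.3
x = rng.binomial(1, true_theta, size=1000)

# Log-likelihood of the IID sequence: a sum of per-observation log-densities.
def log_lik(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Crude MLE by maximising over a grid of candidate parameter values.
# (For this model the argmax is x.mean() in closed form; the grid search
# is only there to mirror the definition.)
grid = np.linspace(0.01, 0.99, 999)
theta_hat = grid[np.argmax([log_lik(t, x) for t in grid])]
print(theta_hat)  # close to x.mean()
```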

Now we come to various versions of the information. Roughly speaking, this is a measure of how good the estimator is. We define the observed information:

J(\theta):=-\frac{d^2 l(\theta)}{d\theta^2}.

This is an excellent example of the merits of considering the question I suggested earlier. Here, J is indeed a random variable. The abbreviated notation being used can lead one astray. Of course, l(\theta)=l(X,\theta), and so it must be random. The second question is: “where are we evaluating this second derivative?”

For this, we should consider what our aim is. We know we are defining the MLE by maximising the likelihood function for fixed x. We have said that the difference between values of l gives a measure of relative likelihood. So if the likelihood function has a sharp peak at \hat\theta, then this gives us more confidence than if the peak is very shallow. (I am using ‘confidence’ in a non-technical sense. Confidence intervals are related, but I am not considering them now.) The absolute value of the second derivative is precisely a measure of this sharpness.

OK, but the observed information does not evaluate this second derivative at \hat\theta; it evaluates it at \theta. The key point is that it is still a good measure if we evaluate the second derivative at a point close to \hat\theta. And if \hat\theta is a good estimator, which it typically will be, especially when we have an IID sequence and the number of terms grows large, then \theta and \hat\theta will be close together, and so it remains a plausible measure.
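Continuing the (assumed) Bernoulli sketch from above, one can estimate the observed information with a central finite difference for the second derivative; the step size h is an arbitrary small number. A sharp likelihood peak shows up as a large value of J.

```python
# Observed information via a central second difference of the log-likelihood.
# (Continues the Bernoulli sketch above; h is an arbitrary small step.)
def observed_info(theta, x, h=1e-4):
    return -(log_lik(theta + h, x) - 2 * log_lik(theta, x)
             + log_lik(theta - h, x)) / h**2

print(observed_info(theta_hat, x))  # large when the peak at theta_hat is sharp
```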

This idea is particularly important when we come to consider the Fisher Information. This is defined as

I(\theta):= \mathbb{E}J(\theta)=\mathbb{E}\left[-\frac{d^2 l(\theta)}{d\theta^2}\right].

The cause for confusion is exactly what is meant by this expectation. It is plausible that an expectation should be present, since we have already explained why J(\theta) is a random variable. But we need to decide which distribution we are integrating with respect to. After all, we don’t actually know the distribution of X. If we did, we wouldn’t be doing statistical inference on it!

So the key thing to remember is that in I(\theta), the value \theta plays two roles. First, it gives the distribution of X with respect to which we integrate, namely f_\theta. Second, it tells us where to evaluate the second derivative. This makes sense overall: if the true distribution is f_\theta, then we expect \hat\theta to be close to the true value \theta, and so it makes sense to evaluate the second derivative there.
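Both roles of \theta are visible in the assumed Bernoulli sketch: we draw fresh samples from the distribution indexed by \theta, and we evaluate J at that same \theta. A Monte Carlo average of J then approximates I(\theta), which for n IID Bernoulli(\theta) observations is n/(\theta(1-\theta)).

```python
# Fisher information as an expectation: average the observed information J
# over fresh samples drawn from the distribution indexed by the same theta.
def fisher_info_mc(theta, n=1000, reps=2000):
    vals = [observed_info(theta, rng.binomial(1, theta, size=n))
            for _ in range(reps)]
    return np.mean(vals)

theta0 = 0.3
print(fisher_info_mc(theta0))          # Monte Carlo estimate of I(theta0)
print(1000 / (theta0 * (1 - theta0)))  # analytic n / (theta (1 - theta))
```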

Now we deduce the Cramér-Rao bound, which says that for any unbiased estimator \hat\theta of \theta, we have

\text{Var}(\hat\theta)\ge \frac{1}{I(\theta)}.

First we explain that unbiased means \mathbb{E}\hat\theta=\theta. This is a property that we would like any estimator to have, though often we have to settle for it asymptotically. Again, we should be careful about the role of \theta. Here we mean that, given some parameter \theta, \hat\theta is a RV depending on the actual data, and so has a variance, which happens to be bounded below by the reciprocal of the Fisher Information.

So let’s prove this. First we need a quick result about the score, which is defined as:

U(\theta)=\frac{dl(\theta)}{d\theta}.

Again, this is a random variable. We want to show that \mathbb{E}U(\theta)=0. This is not difficult, provided we may exchange differentiation and integration (a regularity condition which holds in all the standard examples). Writing f(x)=L(x,\theta), we have

\mathbb{E}U(\theta)=\int f(x)\frac{\partial}{\partial\theta}\log f(x)\,dx=\int \frac{\partial f(x)}{\partial\theta}\,dx

=\frac{d}{d\theta}\int f(x)\,dx=\frac{d}{d\theta}\,1=0,

as required. Next we consider the covariance of U and \hat\theta. Since we have established that \mathbb{E}U=0, this is simply \mathbb{E}[U\hat\theta].

\text{Cov}(U,\hat\theta)=\int \hat\theta(x)f(x)\frac{\partial \log f(x)}{\partial\theta}\,dx=\int \hat\theta(x)f(x)\cdot \frac{\frac{\partial f(x)}{\partial \theta}}{f(x)}\,dx

= \int \hat\theta(x)\frac{\partial f(x)}{\partial \theta}\,dx=\frac{\partial}{\partial \theta}\int \hat\theta(x)f(x)\,dx

= \frac{\partial}{\partial\theta} \mathbb{E}\hat\theta=\frac{d\theta}{d\theta}=1,

as we assumed at the beginning that \hat\theta was unbiased. Then, from Cauchy-Schwarz, we obtain

\text{Var}(U)\text{Var}(\hat\theta)\ge \text{Cov}(U,\hat\theta)^2=1.

So it suffices to prove that \text{Var}(U)=I(\theta). This is a very similar integral rearrangement to what has happened earlier, so I will leave it as an exercise (possibly an exercise in Googling).
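For those who would rather not Google, here is a sketch, under the same regularity assumptions as before. Differentiating l=\log f twice in \theta gives

\frac{\partial^2 l}{\partial\theta^2}=\frac{\partial}{\partial\theta}\left(\frac{1}{f}\frac{\partial f}{\partial\theta}\right)=\frac{1}{f}\frac{\partial^2 f}{\partial\theta^2}-\left(\frac{1}{f}\frac{\partial f}{\partial\theta}\right)^2=\frac{1}{f}\frac{\partial^2 f}{\partial\theta^2}-U^2.

Taking expectations, and using \int \frac{\partial^2 f}{\partial\theta^2}\,dx=\frac{d^2}{d\theta^2}\int f\,dx=0, we get

I(\theta)=-\mathbb{E}\frac{\partial^2 l}{\partial\theta^2}=\mathbb{E}U^2=\text{Var}(U),

since \mathbb{E}U=0.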

A good example of this is question 4 on the sheet, and it is also where we see the equality case. We are finding the MLE for \theta given an observation from \text{Bin}(n,\theta). Unsurprisingly, \hat\theta=\frac{X}{n}. We know from our knowledge of the binomial distribution that the variance of this is \frac{\theta(1-\theta)}{n}, and indeed it turns out that the Fisher Information is precisely the reciprocal of this, so the Cramér-Rao bound holds with equality.
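To spell that out (this is my own working, not the sheet's): for X\sim\text{Bin}(n,\theta),

l(\theta)=X\log\theta+(n-X)\log(1-\theta)+\text{const},\quad J(\theta)=\frac{X}{\theta^2}+\frac{n-X}{(1-\theta)^2},

and since \mathbb{E}X=n\theta,

I(\theta)=\frac{n\theta}{\theta^2}+\frac{n-n\theta}{(1-\theta)^2}=\frac{n}{\theta}+\frac{n}{1-\theta}=\frac{n}{\theta(1-\theta)},

which is indeed the reciprocal of \text{Var}(\hat\theta)=\frac{\theta(1-\theta)}{n}.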

The equality case of Cauchy-Schwarz occurs exactly when the score is an affine function of the estimator, that is, when U(\theta)=a(\theta)(\hat\theta-\theta) for some a(\theta). In the binomial example, U(\theta)=\frac{X-n\theta}{\theta(1-\theta)}=\frac{n(\hat\theta-\theta)}{\theta(1-\theta)}, which has exactly this form. Beyond checking it algebraically, I don’t have a particularly strong intuition for when and why this should happen.

In any case, I hope this was helpful and interesting in some way!
