Given that I’d done this twice before, and was giving the same tutorial five times this week, I was surprised at the extent to which the definition of the Fisher Information caused me to confuse both myself and the students. I thought it would be worth summarising some of the main ways to get confused, and talking about one genuine, quantitative use of the Fisher Information.
Recall we are in a frequentist model, where there is an unknown parameter $\theta$ controlling the distribution of some observable random variable $X$. Our aim is to make inference about $\theta$ using the value of $X$ we observe. We use a lower case $x$ to indicate a value we actually observe (ie a variable as opposed to a random variable). For each value of $\theta$, there is a density function $f(x;\theta)$ controlling $X$. We summarise this in the likelihood function $L(\theta;x) := f(x;\theta)$.
The important thing to remember is that there are two unknowns here. $X$ is unknown because it is genuinely a random variable, whereas $\theta$ is unknown because that is how the situation has been established: $\theta$ is fixed, but we are ignorant of its value. If we knew $\theta$ we wouldn’t need to be doing any statistical inference to find out about it! A useful thing to keep in mind in everything that follows is: “Is this quantity a RV or not?” This is equivalent to “Is this a function of $X$ or not?”, but the original form is perhaps clearer.
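To make this concrete, here is a small illustration of my own (not from the original tutorial): suppose $X$ is a single observation from an exponential distribution with rate $\theta > 0$. Then
\[ f(x;\theta) = \theta e^{-\theta x}, \qquad L(\theta;x) = \theta e^{-\theta x}, \]
where in $f$ we think of $x$ as the variable with $\theta$ fixed, and in $L$ we think of $\theta$ as the variable with the observed $x$ fixed.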
For some value of $X=x$, we define the maximum likelihood estimator to be
\[ \hat\theta(x) := \arg\max_{\theta} L(\theta;x). \]
In words, given some data, $\hat\theta(x)$ is the parameter under which this data is most likely. Note that $L(\theta;x)$ is a probability density function in $x$ for fixed $\theta$, but NOT in $\theta$ for fixed $x$. (This changes in a Bayesian framework.) For example, there might well be values of $x$ for which
\[ \int L(\theta;x)\,\mathrm{d}\theta \neq 1. \]
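Continuing the exponential illustration above (again my own example): $l(\theta;x) = \log\theta - \theta x$, so $l'(\theta) = 1/\theta - x$, which vanishes at $\hat\theta(x) = 1/x$, and since $l''(\theta) = -1/\theta^2 < 0$ this is indeed the maximiser. Note also that
\[ \int_0^\infty L(\theta;x)\,\mathrm{d}\theta = \int_0^\infty \theta e^{-\theta x}\,\mathrm{d}\theta = \frac{1}{x^2}, \]
which is not equal to 1 in general, so $L$ really is not a density in $\theta$.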
Note also that we are only interested in the relative values of $L(\theta;x)$. So it doesn’t matter if we rescale $L$ by a constant factor (although this means the marginal in $x$ is no longer a pdf). We typically consider the log-likelihood $l(\theta;x) := \log L(\theta;x)$ instead, as this has a more manageable form when the underlying RV is an IID sequence. Anyway, since we are interested in the ratio of different values of $L$, we are interested in the difference between values of the log-likelihood $l$.
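For instance, for an IID sequence $X = (X_1,\ldots,X_n)$ with common density $f(\cdot\,;\theta)$ (a standard fact, spelled out here for convenience), the likelihood factorises and the log-likelihood is a sum:
\[ L(\theta;x) = \prod_{i=1}^n f(x_i;\theta), \qquad l(\theta;x) = \sum_{i=1}^n \log f(x_i;\theta), \]
and a difference $l(\theta_1;x) - l(\theta_2;x)$ is exactly the log of the likelihood ratio $L(\theta_1;x)/L(\theta_2;x)$.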
Now we come to various versions of the information. Roughly speaking, this is a measure of how good the estimator is. We define the observed information:
\[ J(\theta) := -\frac{\partial^2}{\partial\theta^2}\, l(\theta;X). \]
This is an excellent example of the merits of considering the question I suggested earlier. Here, $J(\theta)$ is indeed a random variable. The abbreviated notation being used can lead one astray. Of course, $J(\theta)$ is a function of the data $X$, and so it must be random. The second question is: “where are we evaluating this second derivative?”
For this, we should be considering what our aim is. We know we are defining the MLE by maximising the likelihood function for fixed $x$. We have said that the difference between values of $l$ gives a measure of relative likelihood. So if the likelihood function has a sharp peak at $\hat\theta$, then this gives us more confidence than if the peak is very shallow. (I am using ‘confidence’ in a non-technical sense. Confidence intervals are related, but I am not considering that now.) The absolute value of the second derivative is precisely a measure of this property.

Ok, but the information does not evaluate this second derivative at $\hat\theta$, it evaluates it at $\theta$. The key point is that it is still a good measure if it evaluates the second derivative at a point close to $\hat\theta$. And if $\hat\theta$ is a good estimator, which it typically will be, especially when we have an IID sequence and the number of terms grows large, then $\theta$ and $\hat\theta$ will be close together, and so it remains a plausible measure.
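As a concrete illustration (my own, not the post's): take $X_1,\ldots,X_n$ IID Bernoulli$(p)$ and write $S = \sum_i X_i$. Then $l(p) = S\log p + (n-S)\log(1-p)$, so
\[ J(p) = -\frac{\partial^2 l}{\partial p^2} = \frac{S}{p^2} + \frac{n-S}{(1-p)^2}, \]
which is visibly a random variable, since it depends on $S$; evaluating it at $\hat p = S/n$ gives $n/\big(\hat p(1-\hat p)\big)$.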
This idea is particularly important when we come to consider the Fisher Information. This is defined as
\[ I(\theta) := \mathbb{E}_\theta\big[J(\theta)\big] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\, l(\theta;X)\right]. \]
The cause for confusion is exactly what is meant by this expectation. It is not implausible that an expectation is present, since we have already explained why $J(\theta)$ is a random variable. But we need to decide what distribution we are to integrate with respect to. After all, we don’t actually know the distribution of $X$. If we did, we wouldn’t be doing statistical inference on it!
So the key thing to remember is that in $I(\theta)$, the value $\theta$ plays two roles. First, it gives the distribution of $X$ with respect to which we integrate. Also, it tells us where to evaluate this second derivative. This makes sense overall. If the distribution we are considering is $f(\cdot\,;\theta)$, then we expect $\hat\theta$ to be close to the true value $\theta$, and so it makes sense to evaluate the second derivative there.
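Continuing the Bernoulli illustration: since $\mathbb{E}_p[S] = np$,
\[ I(p) = \mathbb{E}_p\big[J(p)\big] = \frac{np}{p^2} + \frac{n(1-p)}{(1-p)^2} = \frac{n}{p} + \frac{n}{1-p} = \frac{n}{p(1-p)}, \]
where the same $p$ both supplies the distribution of $S$ and is the point at which the second derivative is evaluated, exactly the two roles just described.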
Now we deduce the Cramér-Rao bound, which says that for any unbiased estimator $T(X)$ of $\theta$, we have
\[ \mathrm{Var}_\theta\big(T(X)\big) \ge \frac{1}{I(\theta)}. \]
First we explain that unbiased means that $\mathbb{E}_\theta\big[T(X)\big] = \theta$ for every $\theta$. This is a property that we would like any estimator to have, though often we have to settle for it holding asymptotically. Again, we should be careful about the role of $\theta$. Here we mean that given some parameter $\theta$, $T(X)$ is then a RV depending on the actual data, and so has a variance, which happens to be bounded below by a function of the Fisher Information.
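A standard example of the bound (mine, not the post's): if $X_1,\ldots,X_n$ are IID $N(\mu,\sigma^2)$ with $\sigma^2$ known, then $I(\mu) = n/\sigma^2$, and the unbiased estimator $\bar X$ has $\mathrm{Var}_\mu(\bar X) = \sigma^2/n = 1/I(\mu)$, so the Cramér-Rao bound is attained exactly here.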
So let’s prove this. First we need a quick result about the score, which is defined as:
\[ U(\theta) := \frac{\partial}{\partial\theta}\, l(\theta;X). \]
Again, this is a random variable. We want to show that $\mathbb{E}_\theta[U(\theta)] = 0$. This is not difficult. Writing $l(\theta;X) = \log f(X;\theta)$, we have
\[ \mathbb{E}_\theta[U(\theta)] = \int \frac{\partial}{\partial\theta}\log f(x;\theta)\; f(x;\theta)\,\mathrm{d}x = \int \frac{\partial f(x;\theta)}{\partial\theta}\,\mathrm{d}x = \frac{\partial}{\partial\theta}\int f(x;\theta)\,\mathrm{d}x = \frac{\partial}{\partial\theta}\, 1 = 0, \]
as required. Next we consider the covariance of $U$ and $T(X)$. Since we have established that $\mathbb{E}_\theta[U] = 0$, this is simply $\mathrm{Cov}_\theta\big(U, T(X)\big) = \mathbb{E}_\theta\big[U\,T(X)\big]$. By the same sort of rearrangement,
\[ \mathbb{E}_\theta\big[U\,T(X)\big] = \int T(x)\,\frac{\partial}{\partial\theta}\log f(x;\theta)\; f(x;\theta)\,\mathrm{d}x = \frac{\partial}{\partial\theta}\int T(x)\, f(x;\theta)\,\mathrm{d}x = \frac{\partial}{\partial\theta}\,\mathbb{E}_\theta\big[T(X)\big] = \frac{\partial}{\partial\theta}\,\theta = 1, \]
as we assumed at the beginning that $T$ was unbiased. Then, from Cauchy-Schwarz, we obtain
\[ 1 = \mathrm{Cov}_\theta\big(U, T(X)\big)^2 \le \mathrm{Var}_\theta(U)\,\mathrm{Var}_\theta\big(T(X)\big). \]
So it suffices to prove that $\mathrm{Var}_\theta(U) = I(\theta)$. This is a very similar integral rearrangement to what has happened earlier, so I will leave it as an exercise (possibly an exercise in Googling).
Note a good example of this is question 4 on the sheet. At any rate, this is where we see the equality case. We are finding the MLE for $p$ given an observation $X$ from a Bin$(n,p)$ distribution. Unsurprisingly, $\hat p = X/n$. We know from our knowledge of the binomial distribution that the variance of this is $p(1-p)/n$, and indeed it turns out that the Fisher Information is precisely the reciprocal of this.
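If you want to see this numerically, here is a minimal Python sketch of my own (the parameter values are made up) that estimates the variance of $\hat p = X/n$ by Monte Carlo and compares it with $1/I(p) = p(1-p)/n$:

```python
# Monte Carlo check of the binomial example: the variance of the MLE
# p_hat = X/n for X ~ Bin(n, p) should match the Cramer-Rao bound p(1-p)/n.
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 50, 0.3, 200_000

x = rng.binomial(n, p, size=reps)   # repeated observations of X ~ Bin(n, p)
p_hat = x / n                       # the MLE for each observation

empirical_var = p_hat.var()         # Monte Carlo estimate of Var(p_hat)
cramer_rao = p * (1 - p) / n        # 1 / I(p)

print(f"empirical Var(p_hat) = {empirical_var:.6f}")
print(f"Cramer-Rao bound     = {cramer_rao:.6f}")
```

With these values both numbers should come out close to 0.0042.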
The equality case must happen when the score is (up to an affine shift) proportional to the observed value, which is the equality condition in Cauchy-Schwarz. I don’t have a particularly strong intuition for when and why this should happen.
In any case, I hope this was helpful and interesting in some way!