What you have is a classic case of a high dimensional, low signal to noise ratio problem. There are a lot of ways to proceed, but ultimately you will want to know about three different effects:
- Bayesian estimation.
- Dimension reduction.
- Superefficient estimation.
These three ideas are frequently conflated by people with an informal understanding of them. So let's clear that up right away.
- Bayesian estimation can always be applied. To estimate the probability of success p of independent flips of a possibly unfair coin, you can always hold a prior belief that the coin is fair or loaded, and different beliefs lead to different posterior distributions. You cannot reduce the dimension of the parameter space - unless you decide that you have what passes for 'revealed wisdom' as to that probability. It is also impossible to perform superefficient estimation in this case, because the parameter space does not admit it.
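Here is a minimal sketch of that coin example in Python (the counts and the prior parameters are made up for illustration): the same data, two different priors, two different posteriors.

```python
# Minimal sketch: same coin-flip data, two different priors, two posteriors.
# The counts and prior parameters below are made up for illustration.
from scipy import stats

heads, tails = 7, 3  # hypothetical observed flips

# A vague prior: Beta(1, 1), i.e. uniform on p.
vague_posterior = stats.beta(1 + heads, 1 + tails)

# A prior expressing belief that the coin is close to fair: Beta(50, 50).
fair_posterior = stats.beta(50 + heads, 50 + tails)

print(vague_posterior.mean())  # ~0.667: the data dominate
print(fair_posterior.mean())   # ~0.518: the belief in fairness dominates
```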
Note that in high dimensional cases, the Bayesian prior may reflect beliefs about noise, parameter spaces, and high dimensional estimation in general, as opposed to the traditional understanding that the Bayesian prior reflects belief about the parameter value. The farther you go into the high dimensional / low signal to noise regime, the less you expect to find anything passing for "expert opinion" about the parameter, and the more you expect to find all sorts of "expert opinion" about how hard it is to estimate the parameter. This is an important point: you can throw out the bathwater and keep the baby by knowing a lot about babies. But, and this is the one that most statisticians and engineers don't keep in the front window: you can ALSO throw out the bathwater and keep the baby by knowing a lot about bathwater. One of the strategies for exploiting the high availability of noise is to study the noise itself, which you can do pretty easily. Another important point: if you are willing to admit that you do not have a good idea how to parameterize the probability density, you will be considering things like Jeffreys' rule for a prior - a prior which provides expectations invariant under change of parameterization. Jeffreys' rule is very thin soup as Bayesian priors go, but in the high dimensional, low signal to noise regime it can be significant. It represents another very important principle of this sort of work: "Don't know? Don't care." There are a lot of things you are not going to know in your situation; you should arrange, as much as possible, to line up what you are not going to know with what you will not care about. Don't know a good parameterization? Then appeal to a prior (e.g. Jeffreys' rule) which does not depend on the choice of parameterization.
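To make the "does not depend on the choice of parameterization" point concrete, here is a small numerical sketch of my own (not from the original argument): the Jeffreys prior for a Bernoulli success probability, written once in p and once in the log-odds, assigns the same probability to corresponding events.

```python
# Minimal sketch: Jeffreys' prior for a Bernoulli probability p in two
# parameterizations. The probability assigned to corresponding events agrees,
# which is the invariance under reparameterization being described above.
import numpy as np
from scipy import stats, integrate

# In the p parameterization, Jeffreys' prior is Beta(1/2, 1/2).
p_prior = stats.beta(0.5, 0.5)

# The same prior in the log-odds parameterization theta = log(p/(1-p)):
# density proportional to sqrt(I(theta)) = sqrt(p(1-p)) with p = sigmoid(theta).
def jeffreys_logit(t):
    p = 1.0 / (1.0 + np.exp(-t))
    return np.sqrt(p * (1.0 - p)) / np.pi  # 1/pi normalizes the density

a, b = 0.2, 0.6  # an arbitrary event: p lies in [0.2, 0.6]
prob_in_p = p_prior.cdf(b) - p_prior.cdf(a)
prob_in_logit, _ = integrate.quad(jeffreys_logit, np.log(a/(1-a)), np.log(b/(1-b)))
print(prob_in_p, prob_in_logit)  # the two numbers agree
```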
As an example, in the parameterization by poles and zeros, a finite dimensional linear time invariant system has a Jeffreys' prior given by the hyperbolic transfinite diameter of the set of poles and zeros. It turns out that poles and zeros are a provably exponentially bad general parameterization for finite dimensional linear time invariant systems in high dimensions. But you can use the Jeffreys' prior and know that the expectations you compute in this bad parameterization will be the same (at least on the blackboard) as if you had computed them in some unknown 'good' parameterization.
- Dimension reduction. A high dimensional model is by definition capable of dimension reduction: one can, by various means, map the high dimensional parameter space to a lower dimensional parameter space. For example, one can locally project onto the eigenspaces of the Fisher information matrix which have large eigenvalues. A lot of naive information theory is along the lines of "fewer parameters is better". It turns out that this is false in general, but sometimes true. There are many "information criteria" which seek to choose the dimension of the parameter space based on how much the likelihood function increases with the number of parameters. Be skeptical of these. In actual fact, every finite dimensional parameterization has some sort of bias. It is normally difficult to reduce the dimension intelligently unless you have some sort of side information. For example, if you have a translation invariant system, then dimension reduction becomes very feasible, although still very technical. Dimension reduction always interacts with the choice of parameterization. Practical model reduction is normally an ad hoc affair. You want to cultivate a deep understanding of Bayesian and superefficient estimation before you choose a form of dimension reduction. However, lots of people skip that step. There is a wide world of ad hoc dimension reduction. Typically, one gets these methods by adding a small ad hoc penalty to the likelihood function. For example, an L1 norm penalty tends to produce parameter estimates with many components equal to zero - because the L1 ball is a cross-polytope (e.g. an octahedron) and the vertices of the cross-polytope are the signed standard basis vectors. This is called compressed sensing, and it is a very active area. Needless to say, the estimates you get from this sort of approach depend critically on the coordinate system - it is a good idea to think through the coordinate system BEFORE applying compressed sensing. We will see an echo of this idea a bit later in superefficient estimation. However, it is important to avoid confounding dimension reduction with superefficient estimation; to see that they are distinct, consider a parameter with two real unconstrained components: you can do compressed sensing in two dimensions, but you cannot do superefficient estimation in the plane. Another very important aspect here is "Who's asking?". If you are estimating a high dimensional parameter, but the only use that will be made of that estimate is to examine its first component, stop doing that, OK? It is very worthwhile to parameterize the DECISIONS that will be made from the estimate and then look at the preimage of the decisions in the parameter space. Essentially, you want to compose the likelihood function (which is usually how you certify your belief about the observations that have been parameterized) with the decision function, and then maximize that (in the presence of your dimension reduction). You can consider the decision function another piece of the baby/bathwater separation, or you can look at it as maximal application of "don't care" to relieve the pressure on what you have to know.
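Here is a minimal sketch of the L1-penalty mechanism just described, my own illustration with synthetic data and an arbitrary penalty weight: iterative soft-thresholding applied to L1-penalized least squares. The point is only that many estimated components land exactly at zero, and that the answer would change if you rotated the coordinate system first.

```python
# Minimal sketch (synthetic data, arbitrary penalty weight): iterative
# soft-thresholding (ISTA) for L1-penalized least squares. Many components of
# the estimate come out exactly zero - the implicit dimension reduction
# described above - and the result depends on the coordinate system.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                      # more parameters than observations
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 3.0                    # only a few truly nonzero components
y = X @ w_true + 0.5 * rng.standard_normal(n)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, iters=2000):
    L = np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

w_hat = ista(X, y, lam=5.0)
print("nonzero components:", np.count_nonzero(w_hat), "out of", d)
```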
- Superefficient estimation. In the 1950s, following the discovery of the information inequality (formerly known as the Cramér-Rao bound), statisticians thought that the best estimates would be unbiased estimates. To their surprise, embarrassment, and dismay, they were able to show that in three dimensions and higher this is not true. The basic example is James-Stein estimation. Probably because the word "biased" sounds bad, people adhered to "unbiased" estimation much longer than they should have. Plus, the other main flavor of biased estimation, Bayesian estimation, was embroiled in an internecine philosophical war among statisticians. It is appropriate for academic statisticians to attend to the foundations of their subject, and to adhere fiercely to their convictions. That's what the 'academy' is for. However, you are faced with the high dimensional, low signal to noise ratio case, which dissolves all philosophy. You are in a situation where unbiased estimation will do badly, and an expert with an informative Bayesian prior is not to be found. But you can prove that superefficient estimation will be good in some important senses (and certainly better than unbiased estimation). So you will do it. In the end, the easiest thing to do is to transform your parameter so that the local Fisher information is the identity (we call this the Fisher unitary parameterization) and then apply a mild modification of James-Stein estimation (which you can find on Wikipedia). Yes, there are other things that you can do, but this one is as good as any if you can do it. There are some more ad hoc methods, mostly called "shrinkage" estimation. There is a large literature, and things like Beran's REACT theory are worth using, as well as the big back yard of wavelet shrinkage. Don't get too excited about the wavelet in wavelet shrinkage - it's just another coordinate transformation in this business (sorry, harmonic analysts). None of these methods can beat a Fisher unitary coordinate transformation IF you can find one. Oddly enough, a lot of the work that goes into having a Fisher unitary coordinate transformation is choosing a parameterization which affords you one. The global geometry of the parameter space and superefficient estimation interact very strongly. Go read Komaki's paper:
Komaki, F. (2006). Shrinkage priors for Bayesian prediction. Annals of Statistics, to appear. http://arxiv.org/pdf/math/0607021
which makes that clear. It is thought that if Brownian diffusion on your parameter space (with the Fisher information as Riemannian pseudo-metric) is transient, then you can do superefficient estimation, and not if it is recurrent. This corresponds to the behavior of the heat kernel on the parameter space, etc. This is very well known in differential geometry to be global information about the manifold. Note that the Bayesian prior and most forms of model reduction are entirely local information. This is a huge and purely mathematical distinction. Do not be confused by the fact that, after you do shrinkage, you can show that there EXISTS a Bayesian prior which agrees with the shrinkage estimate; that just means that shrinkage estimates have some properties in common with Bayesian estimates (from the point of view of decisions and loss functions, etc.), but it does not give you that prior until you construct the shrinkage estimate. Superefficient estimation is GLOBAL.
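To make the transience/recurrence criterion concrete, here is the simplest special case (my illustration, not taken from Komaki's paper): the Gaussian location model with identity covariance, where the Fisher metric is just the Euclidean metric.

```latex
% Illustrative special case: the Gaussian location model in d dimensions with
% identity covariance. The Fisher information metric is Euclidean, and
% Brownian motion on R^d is recurrent for d <= 2 and transient for d >= 3 --
% exactly the dimensions in which James-Stein beats the unbiased estimator.
\[
  X \sim \mathcal{N}(\theta, I_d), \qquad
  g_{ij}(\theta) = \delta_{ij}, \qquad
  \text{Brownian motion on } \mathbb{R}^d \text{ is }
  \begin{cases}
    \text{recurrent}, & d \le 2,\\
    \text{transient}, & d \ge 3.
  \end{cases}
\]
```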
One other excruciatingly scary aspect of superefficient estimation is that it is extremely "non-unique". There is an inexplicable arbitrary choice, typically of a "point of superefficiency", though it is really more like a measure of superefficiency that started out as an atomic measure. You have to choose this thing. There is no reason whatever for you to make one available choice over another. You might as well derive your choice from your social security number. And the parameter estimate that you get, as well as the performance of that estimate, depends on that choice. This is a very important reason that statisticians hated this kind of estimation. But they also hate the situation where you have tons of variance and grams of data, and that is where you are. You can prove (e.g. Komaki's paper gives an example of such a proof) that your estimation will be better if you make this choice, so you're going to do it. Just don't expect to ever understand much about that choice. Apply the don't know / don't care postulate - you will not know, so you're better off not caring. The defense of your estimation is the theorem that proves it's better.
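Here is a minimal sketch (my own, on synthetic data) of the positive-part James-Stein estimator in a coordinate system where the Fisher information is the identity. The vector mu0 below is exactly the arbitrary "point of superefficiency" being described: the risk improvement holds for any choice of mu0, but the estimate itself depends on it.

```python
# Minimal sketch (synthetic data): positive-part James-Stein estimation in a
# coordinate system where the Fisher information is the identity, i.e.
# x ~ N(theta, sigma2 * I). mu0 is the arbitrary point of superefficiency.
import numpy as np

def james_stein(x, mu0, sigma2=1.0):
    d = x.size
    if d < 3:
        return x.copy()                         # no superefficiency in d = 1, 2
    resid = x - mu0
    shrink = 1.0 - (d - 2) * sigma2 / np.dot(resid, resid)
    return mu0 + max(shrink, 0.0) * resid       # positive-part variant

rng = np.random.default_rng(1)
d = 100
theta = 0.3 * rng.standard_normal(d)            # weak signal
x = theta + rng.standard_normal(d)              # one noisy observation per component

mu0 = np.zeros(d)                               # the arbitrary choice (could be anything)
theta_js = james_stein(x, mu0)

print("unbiased estimate squared error:", np.sum((x - theta) ** 2))
print("James-Stein squared error:      ", np.sum((theta_js - theta) ** 2))
```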
It should be very clear now that these three "nonclassical" effects in estimation theory are really distinct. I think most people don't really understand that. And to some extent it's easy to see why.
Suppose you have an overparameterized generalized linear model (GLM), so your Fisher information is singular, but you do something like iteratively reweighted least squares for the Fisher scoring (because that's what you do with a GLM), and it turns out that the software you use solves the linear systems with, say, Householder QR with column pivoting. That is far enough down under the covers that many statisticians performing this estimation would not necessarily know it was happening. But because the QR with column pivoting regularizes the system, effectively they are estimating the parameter in a reduced dimension system, where the reduction was done in a Fisher unitary coordinate system (because of the R in the QR factorization). It's really hard for people to understand what they are really doing when they are not aware of the effect of each step. We used to call this "idiot regularization", but I think "sleepwalker" is more accurate than "idiot". And what if the software package used a modified Cholesky factorization to solve the system? That actually amounts to a form of shrinkage (again in a Fisher unitary coordinate system); it can also be considered a form of ("maximum a posteriori") Bayesian estimation with a particular prior.
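A minimal sketch of the two "sleepwalker" regularizations, using a synthetic, nearly rank-deficient design of my own invention: the same least squares step solved once with column-pivoted QR that drops the tiny trailing diagonal of R (implicit dimension reduction), and once with a small ridge term added before a symmetric solve (standing in here for what a modified Cholesky effectively does, i.e. implicit shrinkage).

```python
# Minimal sketch (synthetic, nearly rank-deficient design): one least squares
# step solved two ways. Truncating small diagonal entries of a column-pivoted
# QR amounts to dimension reduction; a small ridge term before a symmetric
# solve (a stand-in for modified Cholesky) amounts to shrinkage.
import numpy as np
from scipy import linalg

rng = np.random.default_rng(2)
n, d = 100, 20
X = rng.standard_normal((n, d))
X[:, -1] = X[:, 0] + 1e-10 * rng.standard_normal(n)   # nearly collinear column
y = rng.standard_normal(n)

# 1) Householder QR with column pivoting, truncating tiny diagonal entries of R.
Q, R, piv = linalg.qr(X, mode="economic", pivoting=True)
tol = 1e-8 * abs(R[0, 0])
k = int(np.sum(np.abs(np.diag(R)) > tol))              # effective rank kept
z = linalg.solve_triangular(R[:k, :k], (Q.T @ y)[:k])
beta_qr = np.zeros(d)
beta_qr[piv[:k]] = z                                   # dropped directions set to zero

# 2) Shrinkage-style fix: a small ridge added to the normal equations.
ridge = 1e-8 * np.trace(X.T @ X) / d
beta_ridge = linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y, assume_a="pos")

print("rank kept by pivoted QR:", k, "of", d)
print("solution norms:", np.linalg.norm(beta_qr), np.linalg.norm(beta_ridge))
```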
So in order to sort out what these effects do, and what that means you should do, you need a reasonably deep understanding of the custody of your digits all the way from the data being observed through the final decision being taken (treatment, grant proposal, etc.).
At this point (if you're still with me) you might want to know why I didn't just write a one line recommendation of some method suitable for beginners.
Well, there isn't one. If you really are in the high dimensional, low signal to noise case, then a high quality result is only available to someone who understands a pretty big part of estimation theory. This is because all three of these distinct effects can improve your result, but in each situation it is difficult to predict which combination will be the most successful. In fact, the reason you cannot prescribe a method in advance is precisely because you are in the regime you are in. You can bail out and do something out of a canned program, but that carries a real risk of leaving a good deal of the value on the table.