Friday, August 14, 2009

High Variance Partitioning and Marginal Effect

In my previous post, I talked about High Variance Partitioning: how to create many partitions of a set that satisfy some distance criterion. Read that post first, or this post will be very confusing. Now I'm going to explain why high variance partitions could be useful in a machine learning context. I'll present it in the bioinformatics setting I study, but it applies to any machine learning problem with similar qualities.

Small N Large P

Small N large P describes a machine learning problem with far more variables than individuals. For example, SNP data can have 500k+ variables, but only a few hundred genotyped individuals. This is the curse of dimensionality's older, angrier brother. The curse of dimensionality says that as the number of dimensions grows, the number of datapoints needed to evenly sample a unit sphere at a given granularity grows exponentially (and almost all of those datapoints end up near the surface of the sphere). In Small N Large P, not only do we have to deal with an unimaginably large number of dimensions (though personally I find 4 dimensions quite unimaginable enough), we also have to deal with exceptionally few datapoints (because of the cost of genotyping more individuals).
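
To put numbers on the "almost all of the datapoints are near the surface" claim, here's a quick back-of-the-envelope sketch (the 300-by-500k shape below is just an illustrative stand-in for a real SNP matrix):

```python
# The fraction of a d-dimensional ball's volume lying within 99% of its radius
# is (0.99)**d, so in high dimensions essentially everything sits in the
# thin shell near the surface.
for d in (2, 10, 100, 1000):
    inner = 0.99 ** d
    print(f"d={d:4d}: fraction of volume inside 99% of the radius = {inner:.4f}")

# And the "small N large P" shape of the data itself (illustrative numbers):
n_individuals = 300   # a few hundred genotyped individuals
n_snps = 500_000      # 500k+ SNP attributes
print(f"data matrix: {n_individuals} rows x {n_snps} columns")
```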

Low Marginal Effect

Bioinformatics scoffs at the above difficulties and also presents the problem of low marginal effect. Marginal effect is roughly the correlation between a particular attribute, taken alone, and your class labels. Many statistics can give a good representation of this; chi-square is a common one, which I'll use as roughly synonymous with a calculation of marginal effect.
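
As a concrete (and entirely made-up) illustration, here is a sketch of measuring the marginal effect of one SNP with a chi-square test using scipy; the 0/1/2 minor-allele coding, the sample size, and the random data are all placeholders:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical data: one SNP coded as 0/1/2 minor-allele counts,
# and a binary case/control label for each individual.
snp = rng.integers(0, 3, size=200)
label = rng.integers(0, 2, size=200)

# Marginal effect, roughly: does this SNP alone predict the label?
table = np.zeros((3, 2), dtype=int)
for g, y in zip(snp, label):
    table[g, y] += 1

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")  # unrelated random data: small chi-square, large p
```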

Low marginal effect usually means that an attribute is unrelated to the class label. Biology twists this, because genes often interact with other genes. Thus it's possible for one gene taken alone to have low marginal effect while the joint effect of a group of genes is very high (because that group is the true cause of the disease). There are a few known examples of this, like mouse colon cancer.

Playing devil's advocate, one can construct an xor disease model like the following (two binary attributes, A and B, and disease status Y):


A  B  Y
0  0  0
0  1  1
1  0  1
1  1  0

Assuming these attributes are evenly distributed throughout your sample, both attributes will appear to have zero marginal association, despite the exceptionally strong signal in the joint model (the penetrance here is 1, whereas normally it would be more like 0.2; penetrance is roughly the probability of getting the disease given that you carry the risk genotype). These types of models actually represent a realistic situation: that of two genes which regulate each other.

Another interesting property of this distribution is that, conditioned on one of the attributes, the marginal effect of the other attribute will be very high. This will be useful later.
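
Here's a small simulation of both points, using the xor model above; the attribute names, sample size, and seed are arbitrary, but the punchline is the point: near-zero marginal chi-square for either attribute alone, and a huge one once we condition on the other.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 400  # arbitrary sample size

# Two binary attributes, evenly distributed, and an xor disease with penetrance 1.
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
y = a ^ b

def chisq(attr, label):
    """Chi-square of a single binary attribute against the class label."""
    table = np.array([[np.sum((attr == i) & (label == j)) for j in (0, 1)]
                      for i in (0, 1)])
    return chi2_contingency(table)[0]

print("marginal effect of A alone:", round(chisq(a, y), 2))  # ~0
print("marginal effect of B alone:", round(chisq(b, y), 2))  # ~0

# Conditioned on A = 0, B alone now predicts the label perfectly.
mask = a == 0
print("effect of B given A = 0:", round(chisq(b[mask], y[mask]), 2))  # huge
```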

The State-of-the-Art

People have tried a number of different things when confronted with this problem:
  1. Just using single-attribute tests anyway
    1. This will miss complicated things, but it's fast, simple, and widely implemented.
    2. A lot of the low-hanging fruit that can be found this way already has been.
  2. Trying to brute force all possible k-way interactions
    1. Basically, you define a k-attribute version of your single attribute test (like chi-square), and use that instead.
    2. This is computationally intractable. Specifically, the number of groups of attributes one must examine is (M choose k), where M is the number of attributes. This gets big fast (see the quick calculation after this list).
    3. 2-way interactions can usually be done given well-implemented algorithms and some time on a cluster. Wei Wang's paper (also linked above) talks about a way to get a constant-factor speedup in the 2-way case (which can be non-trivial in this problem domain).
  3. Trying heuristic algorithms
    1. I'm not really sure what the point of these heuristic algorithms is. I suppose if one could characterize the types of multi-attribute distributions that are likely, the state space could be reduced somewhat, but that seems difficult and in practice doesn't work all that well (as of now).
    2. An example is SNPHarvester.
  4. Trying probabilistic algorithms
    1. These are algorithms which primarily work with single-attribute tests, but have some other components which might allow some detection of complicated disease models. A drawback here is that a lot of these algorithms rely on at least some marginal effect being present in one component of the disease.
    2. Examples would be BEAM and Random Forests.
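
Here's the quick calculation promised in point 2, just to show how fast (M choose k) blows up (the tests-per-second figure is a generous, made-up assumption):

```python
from math import comb

M = 500_000  # number of SNP attributes

for k in (2, 3, 4):
    print(f"{k}-way groups to test: {comb(M, k):.3e}")

# Even at a (generous, made-up) million k-way tests per second, the 2-way
# scan alone takes roughly a day and a half; 3-way and up are hopeless.
tests_per_second = 1e6
print(f"2-way scan: about {comb(M, 2) / tests_per_second / 3600:.0f} hours")
```
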
Now, presumably high variance partitioning falls into one of these categories, right?

Well... maybe. It's different, which is why it is interesting.

Given our xor disease model, what can we do differently from the 4 approaches above? Can we design an algorithm which won't miss any 2-locus interactions, without brute forcing the entire space (or at least exponentially many candidates)? It seems impossible...

It's not impossible (I think).

The Heritage of High Variance Partitioning

High variance partitioning comes from my own work on recursive partitioning ensembles such as Random Forests (for my purposes, they're all equivalent). Random Forests uses a single-attribute statistic called the Gini impurity to make greedy decisions about how to build its decision trees (it also does some other fluff with mtry, but that isn't important for us). This means that there must be some marginal effect for Random Forests to use an attribute. Luckily, Random Forests has some extra tricks up its sleeve. Specifically, there are two ways that Random Forests can detect interaction effects despite using a single-attribute statistic:
  1. Recursive Partitioning - After constructing a node, all of that node's children are conditioned on the value of their parent. This means that once we get one attribute of a complex disease into the tree, we will sometimes get more (the xor model above is an extreme example), because within these conditioned subsets the remaining attributes have higher marginal effect (a small numeric illustration follows this list).
  2. Bootstrap Resampling - Each tree is grown on a random bootstrap sample of the training set. Since we'll use many of these, at least some of them will (probably) show nonzero marginal effect for the interacting attributes, even under our xor model.
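
Here's the small numeric illustration promised in point 1. On simulated xor data, the Gini impurity decrease for either attribute at the root is essentially zero, but inside the node conditioned on one attribute the other looks perfect. This is only a sketch of the split criterion, not Random Forests' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
y = a ^ b  # the xor disease model again

def gini(labels):
    """Gini impurity of a set of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 1.0 - p**2 - (1 - p)**2

def gini_decrease(attr, labels):
    """Impurity decrease from splitting this node on a binary attribute."""
    left, right = labels[attr == 0], labels[attr == 1]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

# At the root, neither attribute looks useful on its own...
print("root split on A:", round(gini_decrease(a, y), 3))   # ~0
print("root split on B:", round(gini_decrease(b, y), 3))   # ~0

# ...but inside the A == 0 child node, B suddenly looks perfect.
mask = a == 0
print("split on B | A=0:", round(gini_decrease(b[mask], y[mask]), 3))  # ~0.5
```
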
High Variance Partitioning

The important observation is that bootstrap resampling in the traditional sense is a very bad way to find interactions: the bootstrap samples are distributed tightly around the training set (binomially?), so almost all of the bootstrap samples are very similar to each other and look much like the training set.
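
A quick sanity check of that claim (with an arbitrary training set size): any two ordinary bootstrap samples each contain about 63% of the individuals, and share nearly half of their members with each other.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500  # training set size (arbitrary)

# Two ordinary bootstrap samples: draw n individuals with replacement.
s1 = set(rng.integers(0, n, size=n).tolist())
s2 = set(rng.integers(0, n, size=n).tolist())

print("unique individuals per sample:", len(s1) / n, len(s2) / n)        # ~0.63 each
print("overlap between the two (Jaccard):", len(s1 & s2) / len(s1 | s2))  # ~0.46
```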

The AdaBoost algorithm generates bootstrap samples weighted toward individuals which have thus far proven difficult to classify. This isn't really much better than plain random bootstrapping, probably because of the noisiness of SNP data.

High variance partitioning lets us take bootstrap-like samples that give many very different views of the data. Since these views are very different from each other, perhaps far fewer samples need to be used.

Using High Variance Partitioning

Since high variance partitioning probably creates some subsets with marginal effect, one wonders what would happen if you just grew normal decision trees with it. I've tried that using the log2(N) partitions algorithm, and it turns out that the power to select 2-attribute models is similar to that of the normal AdaBoost algorithm (see this paper for how this selection is done, and an improvement I haven't tried). However, AdaBoost requires over 200 trees, while high variance partitioning creates only 9 (technically 18, each grown on a half-size dataset), so it runs about 22 times faster!
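
For concreteness, here's roughly what "grow a normal decision tree on each half of each partition" looks like as code. The bit_partitions construction is only my stand-in for the log2(N) scheme from the previous post, and grow_forest, the sklearn trees, and the random data are all illustrative rather than the exact procedure I ran:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bit_partitions(n):
    """One half/half partition per bit of the index: a stand-in for the
    log2(N) partitions of the previous post (assumes n is a power of two)."""
    bits = int(np.ceil(np.log2(n)))
    idx = np.arange(n)
    return [((idx >> j) & 1).astype(bool) for j in range(bits)]

def grow_forest(X, y):
    """Grow one ordinary decision tree per half of each partition."""
    trees = []
    for mask in bit_partitions(len(y)):
        for half in (mask, ~mask):
            t = DecisionTreeClassifier(criterion="gini")
            trees.append(t.fit(X[half], y[half]))
    return trees  # 2 * log2(N) trees, e.g. 18 trees when N = 512

# Hypothetical usage with a small random SNP matrix:
rng = np.random.default_rng(4)
X = rng.integers(0, 3, size=(512, 1000))
y = rng.integers(0, 2, size=512)
print(len(grow_forest(X, y)), "trees")
```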

High Variance Partitioning Creates Hard Questions...

In the previous post, I talked about high variance partitioning and some of the difficulties I've had in actually doing it. Note that these difficulties (such as the large space of possible partitions) are very different from the brute force problem above: once a decent algorithm, or even just one good set of partitions for a given N, has been found, it can be reused over and over for all problems.

In that post, I defined "high variance" to mean that we want to find k partitions into two halves such that the average pairwise distance between all (k choose 2) pairs of partitions is maximized (a code sketch of this objective follows the questions below). Here are some even tougher questions, assuming our xor model:
  1. How many partitions k must one make before there is guaranteed to be a partition with non-zero marginal effect?
  2. What is the minimal best marginal effect seen after k partitions?
  3. What is the expected best marginal effect after k partitions?
  4. How many partitions must be made to have minimum (or expected) best marginal effect past some threshold?
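
Since all of these questions are phrased in terms of the objective from the previous post, here's the code sketch of it promised above. The particular distance between two partitions used here (size of the symmetric difference, taken up to relabeling the halves) is just one reasonable choice for the sketch:

```python
import numpy as np
from itertools import combinations

def partition_distance(p, q):
    """Distance between two half/half partitions given as boolean masks.
    The halves are unlabeled, so take the smaller of the two relabelings.
    (This particular distance is an assumption made for the sketch.)"""
    h = int(np.sum(p != q))
    return min(h, len(p) - h)

def average_pairwise_distance(partitions):
    """The 'high variance' objective: mean distance over all (k choose 2) pairs."""
    pairs = list(combinations(partitions, 2))
    return sum(partition_distance(p, q) for p, q in pairs) / len(pairs)

# The bit-pattern partitions from earlier hit the maximum N/2 for every pair:
n, k = 512, 9
idx = np.arange(n)
parts = [((idx >> j) & 1).astype(bool) for j in range(k)]
print(average_pairwise_distance(parts))  # 256.0
```
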
Of course, to make one's head hurt even more, we can consider the above questions with an arbitrary model (instead of xor). I'm not going to tackle that systematically at this juncture. Here are my thoughts on these questions:
  1. For arbitrary N, at least 3. Probably this is a function of N, which means calculating it leads to all sorts of inconsistencies related to the pattern you want to make not dividing N evenly. This fact makes me not want to calculate this number.
  2. Obviously 0 for a while. This is probably a function of N also. I'm not sure how long it would take to become non-zero, but the probability of it being 0 falls as we add more partitions. If we wanted to prove this, we'd need to do some sort of adversarial argument (something I really should add to Wikipedia).
  3. This should go up as we add partitions, probably rather quickly.
  4. Sort of a more useful way to state the above problems, this is the question we'd really like to answer (rather than just finding high variance partitions). Finding high variance partitions should be equivalent (or at least well correlated), and it's much less intertwined with a particular disease model (which is usually unknown).
Note that I've definitely glossed over things here. For one, it is true that high variance partitioning will also increase the marginal effect of some of the noise SNPs. I haven't offered much proof that this doesn't get confused with the real signal (which is why we'd probably end up with some sort of 2-stage algorithm: HVP for tagging, then brute forcing a much smaller space of candidates), but in practice it doesn't seem to. I think that, given enough data, the true signal will eventually win, but I don't know how to make a tight bound on that.
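
If that confusion did turn out to be a problem, the 2-stage idea might look something like the sketch below; the function name, the pluggable scoring hooks, and the candidate count are all hypothetical, not a description of anything implemented:

```python
import numpy as np
from itertools import combinations

def two_stage_scan(X, y, partitions, single_score, pair_score, n_candidates=100):
    """Hypothetical two-stage pipeline: tag SNPs that show marginal effect
    inside any high-variance half, then brute force pairs among the tags only.
    `partitions` is a list of boolean masks over individuals; the scoring
    functions are plugged in (e.g. the chi-square sketch from earlier)."""
    m = X.shape[1]

    # Stage 1 (tagging): best single-attribute score over any half of any partition.
    best = np.zeros(m)
    for mask in partitions:
        for half in (mask, ~mask):
            for j in range(m):
                best[j] = max(best[j], single_score(X[half, j], y[half]))
    tags = np.argsort(best)[-n_candidates:]

    # Stage 2: brute force only (n_candidates choose 2) pairs instead of (m choose 2).
    return max(combinations(tags, 2),
               key=lambda pair: pair_score(X[:, pair[0]], X[:, pair[1]], y))
```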