Friday, August 14, 2009

High Variance Partitioning and Marginal Effect

In my previous post, I talked about High Variance Partitioning: how to create many partitions of a set that satisfy some distance criterion. Read that post first, or this post will be very confusing for you. Now I'm going to explain why high variance partitions could be useful in a machine learning context. I'll present it in the bioinformatics setting I study, but this is applicable to any machine learning problem with similar qualities.

Small N Large P

Small N large P describes machine learning problems with far more variables (P) than individuals (N). For example, SNP data can have 500k+ variables, but only a few hundred genotyped individuals. This is the curse of dimensionality's older, angrier brother. The curse of dimensionality basically says that as the number of dimensions grows, the number of datapoints needed to evenly sample a unit sphere at a given granularity grows exponentially (and almost all of these datapoints end up near the surface of the sphere). In small N large P, not only do we have to deal with an unimaginably large number of dimensions (though personally I find 4 dimensions quite unimaginable enough), we also have to deal with exceptionally few datapoints (because of the cost of genotyping more individuals).
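
To put a number on the "near the surface" part: the fraction of a d-dimensional unit ball's volume lying within 1% of its surface is 1 - 0.99^d, since volume scales as r^d. A quick Python check (the 1% shell is just my arbitrary choice):

for d in (2, 10, 100, 1000):
    # the inner 99% ball holds only 0.99**d of the total volume
    print(d, 1 - 0.99 ** d)
# 2 -> 0.0199, 10 -> 0.0956, 100 -> 0.634, 1000 -> 0.99996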

Low Marginal Effect

Bioinformatics scoffs at the above difficulties, and also presents the problem of low marginal effect. Marginal effect is roughly the correlation between a particular attribute, taken alone, and your class labels. There are many statistics that can give a good representation of this; chi-square is a common one, and I'll use it as roughly synonymous with the calculation of marginal effect.

Low marginal effect usually means that an attribute is unrelated to the class label. Biology twists this, because genes often interact with other genes. Thus, it's possible that one gene taken alone has low marginal effect, while the joint effect of a group of genes is very high (because the group is the true cause of the disease). There are a few documented examples of this, like mouse colon cancer.

Playing devil's advocate, one can construct an xor-disease model like the following:


A,B:  00  01  10  11
Y:     0   1   1   0

Assuming these attributes are evenly distributed throughout your sample, there will appear to be zero marginal association for each of these attributes, despite the exceptionally strong signal present in the joint model (the penetrance is 1, whereas normally it would be more like 0.2; penetrance is basically the probability of getting the disease given that you have the risk genes). These types of models actually represent a realistic situation: that of two genes which regulate each other.

Another interesting property of this distribution is that, conditioned on one of the attributes, the marginal effect of the other attribute will be very high. This will be useful later.
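
Here is a little numpy sketch of both claims; the chi2 helper and the sample size of 400 are just illustrative choices of mine, not part of any real pipeline:

import numpy as np

def chi2(x, y):
    # Pearson chi-square on the 2x2 contingency table of x vs y
    obs = np.array([[np.sum((x == i) & (y == j)) for j in (0, 1)]
                    for i in (0, 1)], dtype=float)
    exp = obs.sum(1, keepdims=True) * obs.sum(0, keepdims=True) / obs.sum()
    return float(((obs - exp) ** 2 / exp).sum())

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 400)
b = rng.integers(0, 2, 400)
y = a ^ b  # the xor disease model, penetrance 1

print(chi2(a, y))                  # ~0: a alone shows no marginal effect
print(chi2(a[b == 0], y[b == 0]))  # ~200: conditioned on b, a determines y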

The State-of-the-Art

Previously, people have been trying to do a lot of different things when confronted with this problem:
  1. Just using single-attribute tests anyway
    1. This will miss complicated things, but it's fast, simple, and widely implemented.
    2. A lot of the low-hanging fruit that can be found this way already has been.
  2. Trying to brute force all possible k-way interactions
    1. Basically, you define a k-attribute version of your single attribute test (like chi-square), and use that instead.
    2. This is computationally intractable. Specifically, the number of groups of attributes one must examine is (M choose k), where M is the number of attributes. This gets big fast (the sketch after this list puts numbers on it).
    3. 2-way interactions can usually be done given well-implemented algorithms and some time with a cluster. Wei Wang's paper talks about a way to get a constant-factor speedup in the 2-way case (which can be non-trivial in this problem domain).
  3. Trying heuristic algorithms
    1. I'm not really sure what the point of these heuristic algorithms is. I suppose if one can characterize the types of multi-attribute distributions which are likely, the state space could be reduced some, but it seems difficult, and in practice it doesn't work all that well (as of now).
    2. An example is SNPHarvester.
  4. Trying probabilistic algorithms
    1. These are algorithms which primarily work with single attribute tests, but have some other components which might allow some detection of complicated disease. A drawback here is that a lot of these algorithms rely on at least some marginal effect being present in one component of the disease.
    2. Examples would be BEAM and Random Forests.
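
To make the intractability in approach 2 concrete, here is the arithmetic for a 500k-SNP study (just numbers, not a real screen):

from math import comb

M = 500_000  # number of SNPs
for k in (1, 2, 3):
    print(k, comb(M, k))
# 1 -> 500000
# 2 -> 124999750000 (about 1.2e11 pairs)
# 3 -> about 2.1e16 triples; exhaustive testing is hopeless
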
Now, presumably high variance partitioning falls into one of these categories, right?

Well... maybe. It's different, which is why it is interesting.

Given our xor disease model, what can we do differently from the four approaches above? Can we design an algorithm which won't miss any 2-locus interactions without brute forcing the entire space (or at least an exponential number of things)? It seems impossible...

It's not impossible (I think).

The Heritage of High Variance Partitioning

High variance partitioning comes from my own work on recursive partitioning ensembles such as Random Forests (for my purposes, they're all equivalent). Random Forests uses a single-attribute statistic called the Gini impurity to make a greedy decision about how to build its decision trees (it also does some other fluff with mtry, but that isn't important for us). This means that there must be some marginal effect for Random Forests to use an attribute. Luckily, Random Forests has some extra tricks up its sleeve. Specifically, there are two ways that Random Forests can detect interaction effects despite using a single-attribute statistic:
  1. Recursive Partitioning - After constructing a node, all of that node's children are conditioned on the value of their parent. This means that once we get one attribute of a complex disease into a tree, we will sometimes get more (the xor model above is an extreme example), because the other interacting attributes have higher marginal effect within these conditioned subsets.
  2. Bootstrap Resampling - Each tree is grown on a random bootstrap sample of the training set. Since we'll use many of these, at least some of them are (probably) going to have greater-than-zero marginal effect, even with our xor model.

High Variance Partitioning

The important observation is that bootstrap resampling in the traditional sense is a very bad way to find interactions: the bootstrap samples are distributed around the training set (binomially?), so almost all of the bootstrap samples are very similar to each other and look much like the training set.
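
A quick simulation of that claim (plain Python; the sizes are arbitrary). The distinct individuals in two independent bootstrap samples overlap heavily, nowhere near the disjoint halves we would like:

import random

N, trials = 200, 1000
total = 0.0
for _ in range(trials):
    a = {random.randrange(N) for _ in range(N)}  # distinct indices in one bootstrap sample
    b = {random.randrange(N) for _ in range(N)}
    total += len(a & b) / len(a | b)
print(total / trials)  # Jaccard similarity of ~0.46, versus 0 for disjoint halves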

The AdaBoost algorithm generates bootstrap samples weighted toward individuals which have thus far proven difficult to classify. This isn't really much better than random bootstrapping, probably because of the noisiness of SNP data.

High variance partitioning lets us take bootstrap samples which give many very different views of the data. Since these views are very different from each other, perhaps far fewer samples need to be used.

Using High Variance Partitioning

Since high variance partitioning probably creates some subsets with marginal effect, one wonders what would happen if you just grew normal decision trees with it. I've tried that using the log2(N) partitions algorithm, and it turns out that the power to select 2-attribute models is similar to that of the normal AdaBoost algorithm (see this paper for how this selection is done, and an improvement I haven't tried). However, AdaBoost requires over 200 trees. High variance partitioning creates only 9 trees (technically 18, each grown on a half-size dataset), so it runs about 22 times faster!
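
For concreteness, here is roughly what "growing normal decision trees on the partitions" looks like. This is my own sketch, assuming scikit-learn is available; hvp_partitions is the log2(N) construction from the post below, and none of the names come from a published implementation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def hvp_partitions(N, k):
    # The log2(N) construction: element n goes to side (n // 2**kk) % 2.
    for kk in range(k):
        side = (np.arange(N) // 2 ** kk) % 2
        yield np.where(side == 0)[0], np.where(side == 1)[0]

def grow_hvp_trees(X, y, k):
    trees = []
    for half_a, half_b in hvp_partitions(len(y), k):
        for idx in (half_a, half_b):  # one tree per half-size dataset
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees  # 2k trees in total, e.g. 18 for k = 9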

High Variance Partitioning Creates Hard Questions...

In the previous post, I talked about high variance partitioning and some of the difficulties I've had in actually doing it. Note that these difficulties (such as the large space of partitions) are very different from the brute force problem above. Once a decent algorithm, or even just one good set of partitions for a given N, has been found, it can be used over and over again for all problems of that size.

In that post, I defined "high variance" to mean that we want to find k two-way partitions such that the average pairwise distance over all (k choose 2) pairs of partitions is maximized. Here are some even tougher questions, assuming our xor model:
  1. How many partitions k must one make before there is guaranteed to be a partition with non-zero marginal effect?
  2. What is the minimal best marginal effect seen after k partitions?
  3. What is the expected best marginal effect after k partitions?
  4. How many partitions must be made to have minimum (or expected) best marginal effect past some threshold?
Of course, to make one's head hurt even more, we can consider the above questions with an arbitrary model (instead of xor). I'm not going to tackle that systematically at this juncture. Here are my thoughts on these questions:
  1. For arbitrary N, at least 3. This is probably a function of N, which means calculating it leads to all sorts of inconsistencies related to the desired pattern not dividing N evenly. That fact makes me not want to calculate this number.
  2. Obviously 0 for a while. This is probably a function of N also. Not sure how long it takes to become non-zero, but the probability of it being 0 falls as we add more partitions. If we wanted to prove this, we'd need some sort of adversarial argument (something I really should add to Wikipedia).
  3. This should go up as we add partitions, probably rather quickly (a quick simulation sketch follows this list).
  4. Sort of a more useful way to state the above problems, this is the question we'd really like to answer (rather than just finding high variance partitions). Finding high variance partitions should be equivalent (or at least well correlated), and it's much less intertwined with a particular disease model (which is usually unknown).
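
Here is the kind of Monte Carlo I have in mind for question 3, using random half-partitions as a stand-in for high variance ones (so it is only a loose lower bound) and absolute correlation as a cheap proxy for marginal effect:

import numpy as np

rng = np.random.default_rng(1)
N = 200
a = rng.integers(0, 2, N)
b = rng.integers(0, 2, N)
y = a ^ b  # the xor model again

best = 0.0
for k in range(1, 13):
    half = rng.permutation(N)[: N // 2]  # one random half-partition
    best = max(best, abs(np.corrcoef(a[half], y[half])[0, 1]))
    print(k, round(best, 3))  # best marginal effect seen so far creeps upward
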
Note that I've definitely glossed over things here. For one, it is true that high variance partitioning will also increase the marginal effect of some of the noise SNPs. I haven't offered much proof that these don't get confused with the real signal (meaning we'd have to do some sort of 2-stage algorithm, which uses HVP for tagging and then brute forces a smaller space of candidates), but in practice it seems they don't. I think that, given enough data, the true signal will eventually win, but I don't know how to make a tight bound on that.

Thursday, July 23, 2009

High Variance Partitioning

Consider all of the partitions of a set of size N. There are B_N of these partitions, where B_N is the Nth Bell number. We want to select partitions that are "different" from other partitions, e.g., to have lots of varied partitions. Additionally, we want partitions of roughly equal size. So, to begin with, we'll only consider partitions of the set into two equal halves. The number of such partitions is given by the partial Bell polynomial B_{N,2}, specifically the coefficient of its x_{N/2}^2 term. This is (N choose N/2)/2.
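
As a sanity check on that count (plain Python):

from math import comb

for N in (8, 16, 100):
    print(N, comb(N, N // 2) // 2)
# 8 -> 35, 16 -> 6435, 100 -> about 5.0e28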

Distance Between Two Partitions


Define the distance δ between two partitions a, b (with |a| == |b|) as follows: for each group b_i in b, map it to a group a_i in a such that, over all groups, the total number of elements that differ between each a_i and its partner is minimized. δ is that minimum total.

For example, consider the following:
a = {{1,3,5,...}, {2,4,6,...}}
b = {{1,2,5,6,9,10,...}, {3,4,7,8,11,12,...}}

Here, we can match a0 with b0 or b1. Assuming it maps to b0, exactly half of the elements of b0 are in a0 (the integers that are 1 mod 4). Similarly, half of the elements of b1 are in a1 (the integers that are 0 mod 4). Since b0 and b1 are each of size N/2, and half of their elements differ, there are N/2 total differences; therefore δ(a,b) = N/2. Note that N/2 is the maximum distance between two half-partitions, due to the minimization mentioned above.
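
Since two half-partitions can only be matched up in two ways, δ is easy to compute. A minimal sketch (representing a partition as a pair of sets is my own choice):

def delta(a, b):
    # a and b are pairs of sets covering range(1, N + 1).
    # Try both matchings of the groups; keep the smaller total disagreement.
    straight = len(a[0] - b[0]) + len(a[1] - b[1])
    crossed = len(a[0] - b[1]) + len(a[1] - b[0])
    return min(straight, crossed)

N = 16
a = ({n for n in range(1, N + 1) if n % 2 == 1},   # odds
     {n for n in range(1, N + 1) if n % 2 == 0})   # evens
b = ({n for n in range(1, N + 1) if n % 4 in (1, 2)},
     {n for n in range(1, N + 1) if n % 4 in (3, 0)})
print(delta(a, b))  # N/2 = 8, the maximum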

Maximizing Differences Between Half Partitions

Now, we can pose some questions about partitioning in half:
  1. How do we select k half-partitions such that the average pairwise distance between all pairs of partitions is maximized (call this max δ_{N,k})?
  2. What is an algorithm which takes k chosen half-partitions and finds another partition p such that the average pairwise distance over the chosen partitions plus p is maximized?
You'll notice that question 2 sounds a lot like a greedy formulation of question 1. Question 1 is really frickin' hard, in the sense that there are ((N choose N/2)/2 choose k) choices, so brute forcing them won't work.

Here is an algorithm which builds partitions satisfying questions 1 and 2 (sort of) when k < floor(log2(N)):
partitions = []
for skip in (2**kk for kk in range(k)):
    partition = {0: [], 1: []}
    for n in range(N):
        t = (n // skip) % 2  # alternate sides in runs of length skip
        partition[t].append(n)
    partitions.append(partition)

The intuition is pretty simple. First, you pick every other element to be in one side. Then, the elements 0,1 mod 4. Then, the elements 0,1,2,3 mod 8, etc. Each of these partitions has a distance of N/2 from every other, meaning the average pairwise distance is maximized at N/2.

Now for the Really Hard Questions

Of course, if it were as simple as above, it wouldn't be interesting. So, here are some tougher questions.
  1. For a given N, what is the largest k such that max δ_{N,k} is N/2?
  2. Can the k partitions from 1. be found efficiently?
  3. Can the k partitions achieving max δ_{N,k} be found efficiently?
  4. Is the greedy formulation optimal? More importantly, is there an online way to find the (k+1)th partition that is "probably good enough?"
  5. Can 4 be done efficiently?
  6. What about 3-way or m-way partitions?
Unfortunately, these answers are elusive. Both Wikipedia and Google have failed to find any research on this, though it seems like an obvious subject. Here are my thoughts:
  1. It seems like this could be floor(log2(N)), but that's very wrong. A brute-force greedy simulation for N=8 finds 7 partitions with an average distance of N/2 (three of them are the ones my algorithm above would find). If the greedy algorithm is sub-optimal, there could be even more partitions.
  2. Perhaps...
  3. No idea.
  4. I think that the greedy formulation is probably sub-optimal, but very close to the real solutions. That is, a mistake made by the greedy algorithm will usually be recovered from as k grows.
  5. Note that my current algorithm, while labeled as greedy, actually needs to enumerate all partitions and compare them. This is bad. No idea if this is possible either.
  6. I think 3-way and m-way partitions won't be much more complicated than 2-way. Note that we don't really need to consider uneven partitions, because an uneven split like 1/3 vs 2/3 is basically a 3-way split with two of the groups merged.
Thoughts anyone?

Wednesday, April 29, 2009

So I'm in this Ruby on Rails class...

**This was a draft written in April 2009**
I'm publishing it now because, even though it isn't done, it will never get any more done. So that makes it done.

We use Ruby on Rails at work to manage our internal IT. Basically, we have a bunch of government-regulated quality control stuff to keep track of, and no existing solution had the features we needed. So, we rolled our own using Rails. Rails made it possible to do this quickly enough to be cost effective, and the end result is actually very pleasant to use. Rails has built-in AJAX features, which we use extensively.

I've known a little about Ruby since it's quite similar to Python. It didn't really do anything for me beyond that, though, so I never fully checked it out. But being given the opportunity to do that and pick up 3 credits in an applications graduate course was too much to pass up.

Programmers are Expensive, Computers are Cheap

Matz's belief in this statement is fundamental to what Ruby is. Ruby is a very flexible programming language which allows a developer to do a lot while writing very little. The main difference between Ruby and Python is that Ruby has been a bit... infected... by Matz's use of Perl in his youth. This actually isn't really bad, as I'll explain in a bit.

The Good

There are some things about Ruby/Rails that are just patently awesome.
  • Symbols - These are awesome, and every language probably should have them. Basically, they capture the semantic difference between "This is a string of text," and :programming_construct. This means you can do something like:

    redirect_to :action=>:view
    to cause the current action of a controller to do the same thing as the view action. Why is this better than strings? It separates two semantically different ideas: I want to tell a person something (string) and I want to tell the computer something (symbol). This is nicer than the keyword args some languages have.
  • @localVariable - Probably the biggest wart in Python's syntax is explicit self. Not that it is bad; I think it is a very pure way of doing OOP. On the other hand, Python is supposed to be a pragmatic language, and 5 characters is clearly worse than one. Using @ to denote a class's fields also has the nice advantage of making fields readily apparent when browsing code. This matters because neither vim nor pydev actually does syntax-based coloring of fields; using @ makes this trivial (and the ruby highlighter for vim handles it).
  • AJAX - Rails makes ajax almost as easy to do as a normal action. The one exception is when a call needs to update multiple parts of a page (for example, toggle a button and display a result). I couldn't figure out a way to do this without resorting to javascript hackery.
  • Regex Literals - One of the things that makes Perl sweet; I think this is a great idea. None of this Java "\\\\" to match a backslash in a regex crap. Since regexes are built into the language, the standard library strings use them nicely.
  • Open Classes - Hell ya. Wish this was making it into Java 7.

The Bad

There are some things about Ruby that suck.
  • Interpreters - There are more Ruby interpreters than bugs in my code. Seriously. Here's a short list:
    • MRI (ruby 1.8.x)
    • YARV (ruby 1.9.x)
    • JRuby (ruby on the JVM, more on this in a sec)
    • MagLev (based on some smalltalk VM, not much real info)
    • IronRuby (ruby on the CLR)
    • Rubinius (pure ruby VM)
    • MacRuby (ruby on OS X using llvm and the objective-c runtime)
    Now, Python has its own VM harem, but there is one important difference: the mainline implementation is adequate. MRI has been the subject of much abuse due to memory leaks and such, and is of course now on life support.
  • JRuby - I was really psyched to use JRuby for a few reasons. First, I do a lot of programming on the JVM and follow it pretty closely. Thus, I've been watching JSR 292 and generally the work the JRuby team has been doing with much anticipation. Secondly, I think that having a VM platform for many languages is a good idea, since it allows us to more efficiently apply optimization resources. Finally, I had heard bad things about MRI. The reality is that while MRI might not be as solid as CPython, it's still more than sufficient for a student. JRuby on the other hand, was slow to start up and had compatibility issues that a novice ran into in only a few minutes.
  • Errors - Both Ruby and Rails tend to emit some pretty cryptic messages sometimes.
  • End Statements - I blame Python for this entirely, but it has convinced me that whitespace is probably the optimal way to delimit code blocks. End is actually kind of worse than braces, since in most text editors you can't match an end back to whatever it is ending. Since Ruby doesn't use braces or parens as much, it's actually a little easier to get confused than in C-like languages. And, of course, I managed to write some code that was wrong because the logical indentation was different from the physical one. I also managed to put an extra end statement into my code, which took me forever to find...
  • Does Mongrel really need to do a printout every time the DB is hit? How am I supposed to make sure my params are correct?

Things that the Jury's Still Out On
  • GIL - sucks in Python, sucks in Ruby. We'll see if Unladen Swallow, PyPy, Parrot, or somebody else manages to develop a better dynamic language VM.
  • ERB - I don't like templating systems. They hurt the readability of your high-level language, they hurt the readability of your output language, and they require way too many start and end tags. Trying to write Ruby to write JavaScript to manipulate strings leaves you in a complete mess. Of course, I'm not sure of a better way to achieve this; I guess I'd like a templating language more closely integrated with HTML or JavaScript.
  • Blocks - Generally, I like the idea of passing code around as little bits. At least, I like anonymous functors a lot (why do I code in C, Python, and Java again?). Using blocks for iteration is a little weird, but I suppose I can see why, if you've already got blocks, you wouldn't want to add a keyword for looping. I am going to criticize the syntax though... bars? Semantic newlines? A do which isn't vertically aligned with its end? In the end, the following was always unclean:

    thing for thing in (x for x in y if z):
  • Dropping () {} - I can't decide whether I like this or not. On one hand, I really hate typing all of these crappy symbols while coding. On the other hand, it's annoying if you start a function call without (), and then decide you want to chain invocations or pass the result as an argument to another function. You'll have to go back and add them in again. After dealing with Objective-C's stupid smalltalk square brackets, I'm trying to decide what the best function call syntax would be. I suppose something like this:

    func1 arg1, arg2 .func2

    The idea here is that we want to end the argument list of the first function call; this is usually done with the closing paren. It's really annoying to have to move back over to func1 and add the matching paren, and clearly you can't have unbalanced parens, so I think a different symbol would be ideal.
  • Pervasive Metaprogramming - While metaprogramming is pretty cool, and I do like the Rails magic functions, they have their downsides. Specifically, if you go looking for one of these magic methods, it won't be there in .methods.

The WTF
  • Why does calling a class method use the dot notation, but accessing a class constant use the name space operator ::?
  • Exactly why does my ubuntu install want to open .rb files with WINE? I don't even use WINE on this machine.