Monday, June 21, 2010

My iOS Game, Towers in Space

Some friends from game design class wanted to keep on making games... and with Apple's growing market share and reasonably fast devices, it seemed like a fun thing to try. We chose to make a tower defense game because we thought it was a relatively simple genre which we understood well, and the play-style fit the more casual nature of cell phone games. Now we're working on a new game with a more unique concept :)

Here is our website.

I'm going to be following this post with some technology posts on how we implemented things. A tentative itinerary:

  1. Developing our advanced targeting AI
  2. Combining A*, flood-fill, and maze solving to efficiently generate paths
  3. Performance pitfalls
  4. Where to save data - and where not to save data
  5. Things we did wrong

Friday, August 14, 2009

High Variance Partitioning and Marginal Effect

In my previous post, I talked about High Variance Partitioning: how to create many partitions of a set which satisfy some distance criterion. Read that post first, or this post will be very confusing for you. Now, I'm going to explain why high variance partitions could be useful in a machine learning context. I'll present it in the bioinformatics setting I study, but this is applicable to any machine learning problem with similar qualities.

Small N Large P

Small N large P describes a machine learning problem which has far more variables than individuals. For example, SNP data can have 500k+ variables, but only a few hundred individuals genotyped. This is the curse of dimensionality's older, angrier brother. The curse of dimensionality basically says that as the number of dimensions grows, the number of datapoints needed to evenly sample a unit sphere at a given granularity grows exponentially (and almost all of these datapoints are near the surface of the sphere). In Small N Large P, not only do we have to deal with an unimaginably large number of dimensions (though personally I find 4 dimensions quite unimaginable enough), we also have to deal with exceptionally few datapoints (because of the cost of genotyping more individuals).

Low Marginal Effect

Bioinformatics scoffs at the above difficulties, and also presents the problem of low marginal effect. Marginal effect is roughly the correlation between a particular attribute alone and your class labels. There are many statistics that can give a good representation of this; chi-square is a common one which I'll use as roughly synonymous with calculation of marginal effect.
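
To make that concrete, here is a minimal sketch of the sort of calculation I mean (the function name and the pure-Python tallying are mine; in practice you'd reach for a stats package):

from collections import Counter

def chi_square_marginal(attribute, labels):
    """ Chi-square statistic of one attribute (e.g. a SNP genotype column)
        against the class labels. Bigger means more marginal effect;
        near zero means the attribute alone says nothing about the label.
    """
    n = len(labels)
    observed = Counter(zip(attribute, labels))   # cell counts
    attr_totals = Counter(attribute)             # row marginals
    label_totals = Counter(labels)               # column marginals
    chi2 = 0.0
    for a in attr_totals:
        for y in label_totals:
            expected = attr_totals[a] * label_totals[y] / n
            chi2 += (observed[(a, y)] - expected) ** 2 / expected
    return chi2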

Low marginal effect usually means that an attribute is unrelated to the class label. Biology twists this because often genes interact with other genes. Thus, it's possible that one gene taken alone has low marginal effect, but the joint effect of a group of genes is very high (because it's the true cause of the disease). There are a few examples of these, like mouse colon cancer.

Playing devil's advocate, one can construct an xor-disease model like the following:


A   0   0   1   1
B   0   1   0   1
Y   0   1   1   0

Assuming these attributes are evenly distributed throughout your sample, there will appear to be zero marginal association for both of these attributes, despite the exceptionally strong signal present in the joint model (the penetrance here is 1, whereas realistically it would be more like 0.2; penetrance is basically the probability of getting the disease given that you have the risk genes). These types of models actually represent a realistic situation: that of two genes which regulate each other.

Another interesting property of this distribution is that, conditioned on one of the attributes, the marginal effect of the other attribute will be very high. This will be useful later.
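
Here is a small simulation of the xor model showing all three behaviours: near-zero marginal effect, a huge joint effect, and a large effect after conditioning. The setup (balanced attribute frequencies, penetrance 1, the chi_square_marginal sketch from above) is my own toy version of the model:

import random

def xor_model_demo(n=1000, seed=0):
    rng = random.Random(seed)
    A = [rng.randint(0, 1) for _ in range(n)]
    B = [rng.randint(0, 1) for _ in range(n)]
    Y = [a ^ b for a, b in zip(A, B)]                    # penetrance-1 xor disease

    print(chi_square_marginal(A, Y))                     # ~0: no marginal effect
    print(chi_square_marginal(B, Y))                     # ~0: no marginal effect
    print(chi_square_marginal(list(zip(A, B)), Y))       # ~n: huge joint effect

    # Condition on A == 0: now B alone predicts Y perfectly.
    B_given_A0 = [b for a, b in zip(A, B) if a == 0]
    Y_given_A0 = [y for a, y in zip(A, Y) if a == 0]
    print(chi_square_marginal(B_given_A0, Y_given_A0))   # very large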

The State-of-the-Art

People have tried a number of different things when confronted with this problem:
  1. Just using single-attribute tests anyway
    1. This will miss complicated things, but it's fast, simple, and widely implemented.
    2. A lot of the low-hanging fruit that can be found this way already has been.
  2. Trying to brute force all possible k-way interactions
    1. Basically, you define a k-attribute version of your single attribute test (like chi-square), and use that instead (a sketch of the 2-way case follows this list).
    2. This is computationally intractable. Specifically, the number of groups of attributes one must examine is (M choose k), where M is the number of attributes. This gets big fast.
    3. 2-way interactions can usually be done given well implemented algorithms and some time with a cluster. Wei Wang's paper also linked above talks about a way to get a constant-time speedup in the 2-way case (which can be non-trivial in this problem domain).
  3. Trying heuristic algorithms
    1. I'm not really sure what the point of these heuristic algorithms is. I suppose if one could define the types of multi-attribute distributions which are likely, one could reduce the state-space some, but it seems difficult and in practice doesn't work all that well (as of now).
    2. An example is SNPHarvester.
  4. Trying probabilistic algorithms
    1. These are algorithms which primarily work with single attribute tests, but have some other components which might allow some detection of complicated disease. A drawback here is that a lot of these algorithms rely on at least some marginal effect being present in one component of the disease.
    2. Examples would be BEAM, and Random Forests.
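
For reference, the 2-way case of approach 2 looks roughly like the following sketch (the dict-of-columns data layout is an assumption of mine; chi_square_marginal is the sketch from earlier in the post):

from itertools import combinations

def pairwise_interaction_scan(attributes, labels, top=10):
    """ Score every pair of attributes by treating the pair of genotypes as a
        single joint attribute. This is the 2-way brute force: it makes
        (M choose 2) scoring passes over the data, which is the intractable part.
        attributes: dict of attribute name -> list of genotype values.
    """
    scores = []
    for (name_i, col_i), (name_j, col_j) in combinations(attributes.items(), 2):
        joint = list(zip(col_i, col_j))
        scores.append((chi_square_marginal(joint, labels), name_i, name_j))
    return sorted(scores, reverse=True)[:top]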
Now, presumably high variance partitioning falls into one of these categories, right?

Well... maybe. It's different, which is why it is interesting.

Given our xor disease model, what can we do differently from the 4 approaches above? Can we design an algorithm which won't miss any 2-locus interactions without brute forcing the entire space (or at least an exponential number of candidates)? It seems impossible...

It's not impossible (I think).

The Heritage of High Variance Partitioning

High variance partitioning comes from my own work on recursive partitioning ensembles such as Random Forests (for my purposes, they're all equivalent). Random Forests uses a single-attribute statistic called the gini impurity to make a greedy decision about how to build its decision trees (it also does some other fluff with mtry, but that isn't important for us; a sketch of the gini score follows the list below). This means that there must be some marginal effect for Random Forests to use an attribute. Luckily, Random Forests has some extra tricks up its sleeve. Specifically, there are two ways that Random Forests can detect interaction effects despite using a single-attribute statistic:
  1. Recursive Partitioning - After constructing a node, all of that node's children are conditioned on the value of their parent. This means that once we get one attribute of a complex disease into a tree, we will sometimes get more (the xor model above is an extreme example), because these conditioned subsets have higher marginal effect.
  2. Bootstrap Resampling - Each tree itself is grown on a random bootstrap sample of the training set. Since we'll use many of these, at least some of them are (probably) going to have larger than 0 marginal effect, even with our xor model.
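
For reference, the gini-based split score that Random Forests greedily optimizes looks roughly like this (a sketch of the idea, not the actual randomForest or scikit-learn internals):

def gini(labels):
    """ Gini impurity of a list of class labels. """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(attribute, labels, value):
    """ Impurity decrease from splitting the node on attribute == value.
        A purely marginal, single-attribute score: ~0 at the root under the
        xor model, but large once a parent node has already conditioned the
        data on the other gene.
    """
    left = [y for a, y in zip(attribute, labels) if a == value]
    right = [y for a, y in zip(attribute, labels) if a != value]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)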
High Variance Partitioning

The important observation is that bootstrap resampling in the traditional sense is a very bad way to find interaction: the bootstrap samples are distributed around the training set (binomially?), so almost all of the bootstrap samples are very similar to each other and look much like the training set.

The Adaboost algorithm generates bootstrap samples weighted toward individuals which have thus far proven difficult to classify. This isn't really much better than random bootstrapping, probably because of the noisiness of SNP data.

High variance partitioning lets us take samples that give many very different views of the data. Since these views are very different from each other, perhaps far fewer samples need to be used.

Using High Variance Partitioning

Since high variance partitioning probably creates some subsets with marginal effect, one wonders what would happen if you just grew normal decision trees with it. I've tried that using the log2(N) partitions algorithm, and it turns out that the power to select 2-attribute models is similar to that of the normal Adaboost algorithm (see this paper for how this selection is done, and an improvement I haven't tried). However, Adaboost requires over 200 trees. High variance partitioning creates only 9 trees (technically 18, each tree on a half-size dataset), so it runs about 22 times faster!
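
Roughly, the experiment looks like the following sketch (assuming scikit-learn's DecisionTreeClassifier and a partitions iterable produced by the log2(N) algorithm from the previous post; the function name and data layout are mine). The 2 * log2(N) trees come from training on both halves of each partition:

from sklearn.tree import DecisionTreeClassifier

def hvp_forest(X, y, partitions):
    """ Grow one decision tree per half of each high variance partition.
        X: (N, M) numpy genotype matrix, y: length-N numpy label vector,
        partitions: iterable of (half_a, half_b) index lists covering all N rows.
        With log2(N) partitions this is 2*log2(N) trees instead of hundreds
        of bootstrap trees.
    """
    trees = []
    for half_a, half_b in partitions:
        for half in (half_a, half_b):
            tree = DecisionTreeClassifier()
            tree.fit(X[half], y[half])
            trees.append(tree)
    return trees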

High Variance Partitioning Creates Hard Questions...

In the previous post, I talked about high variance partitioning and some of the difficulties I've had in actually doing it. Note that these difficulties (such as the large space of partitions) are very different from the brute force problem above. Once a decent algorithm, or even just one good set of partitions for a given N, has been found, it can be used over and over again for all problems.

In that post, I defined "high variance" to mean that we wanted to find k partitions into two such that the average pairwise distance between all (k choose 2) pairs of partitions was maximized. Here are some even tougher questions, assuming our xor model:
  1. How many partitions k must one make before there is guaranteed to be a partition with non-zero marginal effect?
  2. What is the minimal best marginal effect seen after k partitions?
  3. What is the expected best marginal effect after k partitions?
  4. How many partitions must be made to have minimum (or expected) best marginal effect past some threshold?
Of course, to make one's head hurt even more, we can consider the above questions with an arbitrary model (instead of xor). I'm not going to tackle that systematically at this juncture. Here's my thoughts on these questions:
  1. For arbitrary N, at least 3. Probably this is a function of N, which means calculating it leads to all sorts of inconsistencies related to the pattern you want to make not dividing N evenly. This fact makes me not want to calculate this number.
  2. Obviously 0 for a while. This is probably a function of N also. Not sure how long it would take to be non-zero, but the probability of it being 0 falls as we add more partitions. If we wanted to prove this, we'd need to do some sort of Adversarial Argument (something I really should add to wikipedia).
  3. This should go up as we add partitions, probably rather quickly (a Monte Carlo sketch of this appears after this list).
  4. Sort of a more useful way to state the above problems, this is the question we'd really like to answer (rather than just finding high variance partitions). Finding high variance partitions should be equivalent (or at least well correlated), and it's much less intertwined with a particular disease model (which is usually unknown).
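
For question 3, I don't have a closed form, but a Monte Carlo estimate is easy to sketch. Random half-partitions stand in for true high variance partitions here, which is an approximation on my part; chi_square_marginal is the sketch from earlier in the post:

import random

def expected_best_marginal(A, Y, k, trials=200, seed=0):
    """ Monte Carlo estimate of the expected best marginal chi-square for
        attribute A after k half-partitions of the individuals.
        Random halves are used as a stand-in for high variance partitions.
    """
    rng = random.Random(seed)
    n = len(Y)
    idx = list(range(n))
    total = 0.0
    for _ in range(trials):
        best = 0.0
        for _ in range(k):
            rng.shuffle(idx)
            half = idx[:n // 2]
            best = max(best, chi_square_marginal([A[i] for i in half],
                                                 [Y[i] for i in half]))
        total += best
    return total / trials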
Note that I've definitely glossed over things here. For one, it is true that high variance partitioning will increase the marginal effect of some of the noise SNPs as well. I haven't offered much proof that these don't get confused with the real signal (meaning we'd have to do some sort of 2-stage algorithm, which uses HVP for tagging and then brute forces a smaller space of candidates), but in practice they don't seem to. I think that given enough data the true signal will eventually win, but I don't know how to make a tight bound on that.

Thursday, July 23, 2009

High Variance Partitioning

Consider all of the partitions of a set of size N. There are B_N of these partitions, where B_N is the Nth Bell number. We want to select partitions that are "different" from other partitions, i.e., to have lots of varied partitions. Additionally, we want partitions of roughly equal size. So, to begin with we'll only consider partitions p of the set into two equal halves. The number of such partitions is given by the partial Bell polynomial B_{N,2}; specifically, the coefficient of its x_{N/2}^2 term, which is (N choose N/2)/2.
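
As a quick sanity check of that count (math.comb needs Python 3.8+; any binomial routine works):

from math import comb

def half_partition_count(N):
    """ Number of ways to split an N element set into two blocks of size N/2;
        divide by 2 because swapping the blocks gives the same partition. """
    return comb(N, N // 2) // 2

print(half_partition_count(8))   # 35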

Distance Between Two Partitions


Define the distance δ between two partitions a, b (with |a| == |b|) as follows: match each group b_i in b to a distinct group a_i in a such that the total number of elements that differ between each a_i and its partner is minimized.

For example, consider the following:
a = {{1,3,5,...}, {2,4,6,...}}
b = {{1,2,5,6,9,10,...}, {3,4,7,8,11,12,...}}

Here, we can match a0 with b0 or b1. Assuming a0 maps to b0, exactly half of the elements of b0 are in a0 (the integers that are 1 mod 4). Similarly, half of the elements of b1 are in a1 (the integers that are 0 mod 4). Since b0 and b1 are each of size N/2, and half of their elements differ, there are N/2 total differences; therefore δ(a,b) = N/2. Note that N/2 is the maximum distance between two partitions, due to the minimization mentioned above.
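
For two-block partitions, the distance reduces to trying both block matchings and keeping the smaller mismatch; a minimal sketch:

def delta(a, b):
    """ Distance between two 2-block partitions, each given as a pair of sets.
        Match the blocks of b to the blocks of a so that the number of
        differing elements is minimized, and return that total.
    """
    a0, a1 = map(set, a)
    b0, b1 = map(set, b)
    straight = len(a0 - b0) + len(a1 - b1)   # a0<->b0, a1<->b1
    swapped = len(a0 - b1) + len(a1 - b0)    # a0<->b1, a1<->b0
    return min(straight, swapped)

delta([{1,3,5,7}, {2,4,6,8}], [{1,2,5,6}, {3,4,7,8}])   # 4, i.e. N/2 for N = 8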

Maximizing Differences Between Half Partitions

Now, we can pose some questions about partitioning in half:
  1. How do we select k half-partitions such that the total distance between all pairs of partitions is maximized (max δ_{N,k})?
  2. What is an algorithm which takes k chosen half-partitions and finds another partition p such that the total distance between all pairs of the chosen partitions plus p is maximized?
You'll notice that question 2 sounds a lot like a greedy formulation to this problem. 1 is really frickin' hard in the sense that there are ((N choose N/2)/2 choose k) choices there, so brute forcing them won't work.

Here is an algorithm which builds partitions satisfying questions 1 and 2 (sort of) when k <= floor(log2(N)):

def half_partitions(N, k):
    for skip in (2**kk for kk in range(k)):
        partition = {0: [], 1: []}
        for n in range(N):
            t = (n // skip) % 2   # alternate groups in runs of length `skip`
            partition[t].append(n)
        yield partition

The intuition is pretty simple. First, you pick every other element to be in a. Then, the elements 0,1 mod 4. Then, the elements 0,1,2,3 mod 8, etc. Each of these partitions has a distance of N/2 from each of the others, meaning the average pairwise distance is maximized at N/2.
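
For N = 8 and k = 3, the function above produces exactly the three partitions this intuition describes:

for p in half_partitions(8, 3):
    print(p[0], p[1])
# [0, 2, 4, 6] [1, 3, 5, 7]
# [0, 1, 4, 5] [2, 3, 6, 7]
# [0, 1, 2, 3] [4, 5, 6, 7]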

Now for the Really Hard Questions

Of course, if it were as simple as above, it wouldn't be interesting. So, here are some tougher questions.
  1. For a given N, what is the largest k such that max δN,k is N/2?
  2. Can the k partitions from 1. be found efficiently?
  3. Can the k partitions with max δN,k be found efficiently?
  4. Is the greedy formulation optimal? More importantly, is there an online way to find partition k+1 that is "probably good enough?"
  5. Can 4 be done efficiently?
  6. What about 3-way or m-way partitions?
Unfortunately, these answers are elusive. Both wikipedia and google have failed to find any research on this, though it seems like an obvious subject. Here are my thoughts:
  1. It seems like this could be floor(log2(N)), but that's very wrong. A brute-force greedy simulation for N=8 finds 7 partitions with an average distance of N/2, three of which are the partitions my algorithm above would find (a sketch of such a simulation follows this list). If the greedy algorithm is sub-optimal, there could be even more partitions.
  2. Perhaps...
  3. No idea.
  4. I think that the greedy formulation is probably sub-optimal, but very close to the real solutions. That is, the greedy algorithm making a mistake will usually be recovered as k grows.
  5. Note that my current algorithm, while labeled as greedy, actually needs to enumerate all partitions and compare them. This is bad. No idea if this is possible either.
  6. I think 3-way and n-way won't be much more complicated than 2-way. Note that we don't really need to consider uneven partitions, because a 3-way split is basically the same as a 1/3 2/3 split.
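
And here's the sort of brute-force greedy simulation thought 1 refers to: enumerate all 35 half-partitions of an 8-element set, then greedily keep those that stay at the maximum distance N/2 from everything chosen so far (delta is the sketch from above; the candidate ordering is arbitrary, but for N = 8 this one does reach 7):

from itertools import combinations

def all_half_partitions(items):
    """ All ways to split items into two equal blocks, each pair counted once. """
    items = list(items)
    first = items[0]                      # fix one element to avoid mirror duplicates
    for group in combinations(items[1:], len(items) // 2 - 1):
        a = set(group) | {first}
        yield (a, set(items) - a)

def greedy_max_distance(N):
    """ Greedily collect half-partitions of range(N) that keep every pairwise
        distance (see delta above) at the maximum N/2. """
    candidates = list(all_half_partitions(range(N)))
    chosen = [candidates[0]]
    for cand in candidates[1:]:
        if all(delta(cand, c) == N // 2 for c in chosen):
            chosen.append(cand)
    return chosen

print(len(greedy_max_distance(8)))   # 7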
Thoughts anyone?

Wednesday, April 29, 2009

So I'm in this Ruby on Rails class...

**This was a draft written in April 2009**
I'm publishing it now because, even though it isn't done, it will never get any more done. So that makes it done.

We use Ruby on Rails at work to manage our internal IT. Basically, we have a bunch of government regulated quality control stuff to keep track of, and no existing solution had the features we needed. So, we rolled our own using Rails. Rails made it possible to do this quick enough for it to be cost effective, and the end result is actually very pleasant to use. Rails has built in AJAX features, which we use extensively.

I've known a little about Ruby since it's quite similar to Python. It didn't really do anything for me beyond that though, so I never fully checked it out. But, being given the opportunity to do that and pick up 3 credits in an applications graduate course was too much to pass up.

Programmers are Expensive, Computers are Cheap

Matz's belief in this statement is fundamental to what Ruby is. Ruby is a very flexible programming language which allows a developer to do a lot while writing very little. The main difference between Ruby and Python is that Ruby has been a bit... infected... by Matz's use of Perl in his youth. This actually isn't really bad, as I'll explain in a bit.

The Good

There are some things about Ruby/Rails that are just patently awesome.
  • Symbols - These are awesome and every language probably should have them. Basically, they separate the semantic difference between "This is a string of text," and :programming_construct. This means you can do something like:

    redirect_to :action=>:view
    to cause the current action of a controller to do the same thing as the view action. Why is this better than Strings? It separates two semantically different ideas: I want to tell a person something (string) and I want to tell the computer something (symbol). This is nicer than the keyword args some languages have.
  • @localVariable - Probably the biggest wart in Python's syntax is explicit self. Not that it is bad; I think it is a very pure way of doing OOP. On the other hand, Python is supposed to be a pragmatic language, and five characters is clearly worse than one. Using @ to denote a class's field also has the nice advantage of making fields readily apparent when browsing code. This is compounded by the fact that neither vim nor pydev actually do syntax-based field color changes. Using @ makes this trivial (and the ruby highlighter for vim handles it).
  • AJAX - Rails makes ajax almost as easy to do as a normal action. The one exception is when a call needs to update multiple parts of a page (for example, toggle a button and display a result). I couldn't figure out a way to do this without resorting to javascript hackery.
  • Regex Literals - One of the things that makes Perl sweet, I think this is a great idea. None of this java "\\\\" to match a backslash in a regex crap. Since regex are built into the language, the standard library strings use them nicely.
  • Open Classes - Hell ya. Wish this was making it into Java 7.

The Bad

There are some things about Ruby that suck.
  • Interpreters - There are more Ruby interpreters than bugs in my code. Seriously. Here's a short list:
    • MRI (ruby 1.8.x)
    • YARV (ruby 1.9.x)
    • JRuby (ruby on the JVM, more on this in a sec)
    • MagLev (based on some smalltalk VM, not much real info)
    • IronRuby (ruby on the CLR)
    • Rubinius (pure ruby VM)
    • MacRuby (ruby on OS X using llvm and the objective-c runtime)
    Now, Python has its own VM harem, but there is one important difference: the mainline implementation is adequate. MRI has been the subject of much abuse due to memory leaks and such, and is of course now on life support.
  • JRuby - I was really psyched to use JRuby for a few reasons. First, I do a lot of programming on the JVM and follow it pretty closely. Thus, I've been watching JSR 292 and generally the work the JRuby team has been doing with much anticipation. Secondly, I think that having a VM platform for many languages is a good idea, since it allows us to more efficiently apply optimization resources. Finally, I had heard bad things about MRI. The reality is that while MRI might not be as solid as CPython, it's still more than sufficient for a student. JRuby on the other hand, was slow to start up and had compatibility issues that a novice ran into in only a few minutes.
  • Errors - Both Ruby and Rails tend to emit some pretty cryptic messages sometimes.
  • End Statements - I blame Python for this entirely, but it's convinced me that whitespace is probably the optimal way to delimit code blocks. End is actually kind of worse than braces, since you can't match an end back to what it is ending in most text editors. And since ruby doesn't use braces or parens as much, it's a little easier to get confused than in C-like languages. And, of course, I managed to write some code that was wrong because the logical indentation was different from the physical one. I also managed to put an extra end statement into my code which took me forever to find...
  • Does Mongrel really need to do a printout every time the DB is hit? How am I supposed to make sure my params are correct?

Things that the Jury's Still Out On
  • GIL - sucks in Python, sucks in Ruby. We'll see if unladen swallow, pypy, parrot, or somebody manages to develop a better dynamic language VM
  • ERB - I don't like templating systems. They suck for the readability of your high level language, they suck for the readability of your output language, and they require way too many start and end tags. Trying to write ruby to write javascript to manipulate strings leaves you in a complete mess. Of course, I'm not sure of a better way to achieve this; I guess I'd like a templating language more closely integrated with HTML or JavaScript.
  • Blocks - Generally, I like the idea of passing code around as little bits. At least, I like anonymous functors a lot (why do I code in C, Python, and Java again?). Using blocks for iteration is a little weird, but I suppose I can see why if you've already got blocks and don't want to add a keyword for looping. I am going to criticize the syntax though... bars? semantic newlines? A do which isn't vertically aligned with its end? In the end, the following was always unclean:

    thing for thing in (x for x in y if z):
  • Dropping () {} - I can't decide whether I like this or not. On one hand, I really hate typing all of these crappy symbols while coding. On the other hand, it's annoying if you start a function call without (), and then decide you want to chain invocations or pass the result as an argument to another function. You'll have to go back and add them in again. After dealing with Objective-C's stupid smalltalk square brackets, I'm trying to decide what the best function call syntax would be. I suppose something like this:

    func1 arg1, arg2 .func2

    The idea here is that we want to end the argument list of the first function call; this is usually done with the closing paren. It's really annoying to have to move back over to func1 and add the matching paren, and clearly you can't have unbalanced parens, so I think a different symbol would be ideal.
  • Pervasive Metaprogramming - While metaprogramming is pretty cool, and I do like the rails magic functions, they have their downsides. Specifically if you are trying to find one, it won't be there in .methods.

The WTF
  • Why does calling a class method use the dot notation, but accessing a class constant use the name space operator ::?
  • Exactly why does my ubuntu install want to open .rb files with WINE? I don't even use WINE on this machine.

Saturday, December 20, 2008

Project Euler Part 2

**This was a draft written in December 2008/January 2009**
I'm publishing it now because, even though it isn't done, it will never get any more done. So that makes it done.

Problem 12: The sequence of triangle numbers is generated by adding the natural numbers. So the 7^(th) triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. What is the value of the first triangle number to have over five hundred divisors?
My first attempt was to iterate through the triangular numbers, checking their number of divisors:
        

Friday, December 19, 2008

Project Euler

**This was a draft written in December 2008**
I'm publishing it now because, even though it isn't done, it will never get any more done. So that makes it done.

[...] Boring as home is, downloading TF2 and losing my internet to the ensuing packet deluge has made it worse. As such, I've decided to take advantage of the down time and work on some solutions to project euler problems. This post will list the problems and how I solved them.

In solving each problem, I start with whatever pops to mind first. I strive for readability and conciseness first. Then, if this doesn't work, I will implement a more complex algorithm to solve the problem. I stick to algorithms I already know (though I obviously had to look some of them up for implementation).

It appears the biggest problem here is going to be windows. GVim isn't too amused by my attempts to make it behave like normal. Apparently, they assume windows users would... Also, I've decided that this will be more fun in python3k, so I need to set that up also. Windows 2003 isn't amused by this idea... Ahh, genius! Obviously, I shouldn't be allowed to run a file I've pushed over a WRITE mounted samba share! In the same user's session who is sharing the directory!

Problem 1: Find the sum of all the multiples of 3 or 5 below 1000.
Kind of can't believe this didn't work on the first try (wrote and instead of or; I need to read problems more thoroughly...)

def problem1():
    def problem1_generator():
        for n in range(0,1000):
            if n%3 == 0 or n%5 == 0:
                yield n
    result = sum(problem1_generator())
    return result

This isn't really as pythonic as it could be though:
def problem1b():
    numbers = (n for n in range(0,1000) if n%3 == 0 or n%5 == 0)
    return sum(numbers)



Problem 2: sum of all even fibonacci numbers less than 4 million.
Also got this one wrong on the first try, forgot the even part...

def problem2():
    def fib_generator(end=float("Infinity")):
        """ return fibonacci numbers < end """
        a, b = 1, 2
        while a < end:
            yield a
            a, b = b, a + b
    return sum(f for f in fib_generator(4*10**6) if f % 2 == 0)



Problem 3: What is the largest prime factor of the number 600851475143 ?
A simple brute force solution.

def problem3():
    from math import sqrt
    def factors_generator(n, primes=[2]):
        """ Due to the way python binds default method args,
            primes will persist across method calls...
        """
        for i in range(2, int(sqrt(n))):
            if not any(p for p in primes if i%p == 0):
                primes.append(i)
            if n%i == 0:
                n = n//i
                yield i
                if n == 1:
                    break

    return sorted(list(factors_generator(600851475143)))[-1]



Problem 4: Find the largest palindrome made from the product of two 3-digit numbers.

def problem4():
    numbers = list(range(100,1000))
    to_test = sorted([x*y for x in numbers for y in numbers])
    for n in reversed(to_test):
        s = str(n)
        if s == s[::-1]:
            return n

Here we can see the excellent reversed built-in. It's kind of unfortunate that str(reversed(s)) converts the generator object itself to a string, rather than the items in it.



Problem 5: What is the smallest number that is evenly divisible by all of the numbers from 1 to 20?
I was originally doing this the hard way, writing my own custom lcm routine. Then I remembered that the lcm can be computed in terms of the gcd, and things got much easier.

def problem5():
    from functools import reduce

    def gcd(a,b):
        while b:
            a,b = b, a%b
        return a

    def lcm(a,b):
        return a // gcd(a,b) * b

    return reduce(lcm, range(1,21))



Problem 6: Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.
This one is cake.

def problem6():
    numbers = range(1,101)
    squared = (x**2 for x in numbers)
    return sum(numbers)**2 - sum(squared)



Problem 7: Find the 10001st prime.
I reuse the factors_generator method created for problem 3. This takes about 30 seconds on my machine, so it isn't winning any speed tests.

def problem7():
    primes = []
    i = 2
    while len(primes) < 10001:
        for factor in factors_generator(i, primes):
            pass
        i *= 2
        print(len(primes))
    return primes[10000]

This is a bit entangled, so here's a standalone solution. I think it's even less code...
def problem7b():
    primes = []
    i = 2
    while len(primes) < 10001:
        if not any(p for p in primes if i%p == 0):
            primes.append(i)
        i += 1
    return primes[10000]



Problem 8: Find the greatest product of five consecutive digits in a (given) 1000-digit number.
For the number, check the code. First problem: how the hell do you represent a 1000 digit number cleanly in code? I was forced to use a multiline string and then strip whitespace... Other than that, not much to see here.

def problem8():
    from functools import reduce
    from operator import mul
    numberS = """
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
"""
number = "".join(numberS.split()) #remove whitespace
product = 0
for i in range(0, len(number)-5):
digits = (int(a) for a in number[i:i+5])
t = reduce(mul, digits)
product = max(product, t)

return product




Problem 9: There exists exactly one Pythagorean triplet for which a + b + c = 1000. Find the product abc.
This is pretty simple, if you know about the modified version of Euclid's formula. This version is guaranteed to generate all Pythagorean triples (eventually).

def problem9():
    from math import sqrt
    def pythagorean_generator():
        """ return all pairs (a,b) s.t. a**2 + b**2 = c**2
            and a,b,c integers
        """
        k = 1
        mstart = 2
        batch_size = 100
        while True:
            for m in range(mstart, mstart+batch_size):
                for n in range(1,m):
                    yield ( k*2*m*n, k*(m**2-n**2) )
            mstart += batch_size
            k += 1

    for a,b in pythagorean_generator():
        c = int(sqrt(a**2 + b**2))
        if a + b + c == 1000:
            return a*b*c



Problem 10: Find the sum of all the primes below two million.
The naive algorithm just isn't going to be fast enough in python, clocking in around 10 minutes. Thus, it is interesting to note that in a lower level language, the naive algorithm probably would clock in at under a minute.

def problem10():
    primes = [2]
    i = 3
    while i < 2*10**6:
        if not any(p for p in primes if i%p == 0):
            primes.append(i)
        i += 2
    return sum(primes)

Of course, we can always try a naive implementation of the Sieve of Eratosthenes.
def problem10b():
    from math import sqrt
    primes = []
    numbers = list(range(0, 2*10**6))
    crossed = [False]*len(numbers)

    for i in range(2, len(numbers)):
        if not crossed[i]:
            p = numbers[i]
            primes.append(p)
            for j in range(p, len(numbers), p):
                crossed[j] = True
    return sum(primes)



This gives us the answer in a couple seconds. Of course, I also managed to introduce an off-by-two error due to my indexing of the arrays (which is why numbers now includes 0 and 1, so that numbers[i] = i). I always hate these situations where you have to iterate through two arrays simultaneously. In a more complicated problem I'd consider wrapping things up into an object, but that seems unwarranted here.



Well, that's enough for now. Something wrong or unpythonic? Hit me back.

Monday, October 27, 2008

First Experiences with my Ipod Touch

**This was a draft, originally written in August 2008**
I'm publishing it now because, even though it isn't done, it will never get any more done. So that makes it done ;)

So, my boss bought me an Ipod touch. Not sure if it is worth the extra hours I agreed to work or not (not all that many really, and I still get paid!).

The first thing I noticed was the unhappy request for iTunes. The device refused to turn on until it was connected to iTunes. Given that I regard iTunes with about as much regard as an evangelical Christian in Kansas has for Barack Obama, this sort of upset me. Luckily, I was able to side step this by connecting it to my friend's mac. One second, and the device started working (no idea what it wanted).

Obviously being an internet junkie (and having heard people rave about Safari on the touch), I wanted to browse reddit. Unfortunately, our campus' wireless seems to have decided we don't deserve internet and is refusing to give anyone DHCP leases. Both my touch and my friend's mac happily accepted the DHCP fake IP (169.x.x.x). Of course, I fail at networking for not recognizing this...

The next step was to try it under linux. Hmm... looks like it still requires jailbreaking. Unfortunate; I won't be able to do that for a few months at least. I guess Apple really has the platform locked down.

So, my file server gets tarnished with iTunes. Oh well, it's got enough other crap on it already. iTunes didn't respond well when I set its directory to my music folder, requiring that folder to be added manually afterwards. A minor annoyance.

Registration is a pain; Apple's servers are really slow. And the tab navigation of the form is completely wrong... And apparently I can't do it without entering a credit card. I guess I'll do that, if only because there is some music I have been meaning to buy, and the iTunes store might have DRM free versions.

iTunes seems to believe I want another copy of all of my album art. Not sure where it's going; there isn't all that much space on my OS partition, so this might not end well. It also seems to dislike my wma music, which is an opinion I share with it. No idea what it's transcoding to; m4a would be equally useless.

I wonder how well iTunes can manage my xvid/VC9 encoded videos. Automatic transcoding to h264 for my device would be sweet... Tried to add a folder of .mkv files; I think that confused it. Went to movies and selected open; this turned out to be a poor decision. The UI went away and I couldn't figure out how to get back. No helpful back arrow. Oh, it's another window entirely (didn't see it pop up through the VNC session). Tried a folder of more normal files... It seems to have detected the mpg files at least. Not sure what it's doing. A quick google search suggests using HandBrake to convert my files, so I guess I'm stuck doing it manually. I'm sure iTunes isn't exactly going to let me script the HandBrake cli either... Actually, HandBrake only seems to work for DVDs.

Next step: sync music. I have more than 8GB of music, so we'll see what iTunes decides to do when I request a sync. Hmm... it seems to have created an iPod playlist. It's initially populated as I would like: random albums. I wonder if it automatically changes each time I sync. It does not, which prevented me from changing my selection for a few months.

Eventually I sat down and spent about half an hour searching the net, and I discovered how to set up a new smart playlist... which is okay. It isn't smart about the albums like the original one, but that's probably my failure. I managed to delete the sync history, which I think will enable me to regenerate the original selection playlist and see what it did. What an odd button to have in preferences... Unfortunately, it didn't offer to set up the playlist again, so I'm stuck searching. After about half an hour more of googling, I stumbled upon the 'select-by album' field, which does what I want.