A first foray into predicting stock prices

A whole bunch of other failed strategies.

Pasting below some old correspondence with a friend. This was my first cut at applying a machine learning approach to trading.

Thought I'd share the results I had from modeling stock price jumps. I took another stab at it this past week; here's how I did it:
I collected historical prices from Yahoo for all S&P 500 tickers. Training data comes from every trading day of 2011, test data from the first six months of 2012.
Features: for a given ticker, take two weeks of closing prices and compute the percent change for each day, relative to the first closing price and also relative to the last closing price. I also threw in the same percent changes for SPY as a market indicator. I tried throwing in VIX too, as both relative and absolute values, but it made things worse.

So that gives about 60 features (15 x 2 for the ticker and another 15 x 2 for SPY) for each ticker, and we have a sample for every trading day in 2011 after the first two weeks, for every ticker in the S&P 500. That comes to roughly 100,000 training examples.
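As a rough illustration of the feature construction described above, here is a minimal sketch in plain Python. The function names, the `WINDOW` constant, and the choice to express changes in percent are my own assumptions for illustration; the window length of 15 is taken from the "15 x 2" count above.

```python
WINDOW = 15  # assumed window length, inferred from the "15 x 2" feature count


def pct_changes(closes):
    """Percent change of every close relative to the first and to the last close."""
    first, last = closes[0], closes[-1]
    rel_first = [100.0 * (c - first) / first for c in closes]
    rel_last = [100.0 * (c - last) / last for c in closes]
    return rel_first + rel_last


def sample_features(ticker_closes, spy_closes, end, window=WINDOW):
    """Features for the window of `window` closes ending at index `end` (inclusive).

    The ticker features come first, followed by the same features for SPY.
    """
    start = end - window + 1
    if start < 0:
        raise ValueError("not enough history for a full window")
    return (pct_changes(ticker_closes[start:end + 1])
            + pct_changes(spy_closes[start:end + 1]))


# Made-up prices, just to show the shape of the output.
ticker = [100.0 + i for i in range(20)]
spy = [130.0 + 0.5 * i for i in range(20)]
features = sample_features(ticker, spy, end=14)
print(len(features))
```

Note that two entries per series are always zero (the first close relative to itself and the last close relative to itself), so the vector carries slightly fewer than 60 informative values.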

Then I labeled the examples by looking at the closing price five trading days after the last close, and marking examples where there was more than 5% gain as positives. Everything else was marked as a negative.
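The labeling rule can be sketched the same way. This is only an illustration: `label_window` and its parameter names are hypothetical, and returning `None` when the future close is not available is my own choice, not something stated above.

```python
def label_window(closes, end, horizon=5, threshold=0.05):
    """Label the window whose last close is at index `end`.

    Returns True if the close `horizon` trading days later is more than
    `threshold` (as a fraction) above the last close, False otherwise,
    and None if that future close is not in the data.
    """
    future = end + horizon
    if future >= len(closes):
        return None
    gain = (closes[future] - closes[end]) / closes[end]
    return gain > threshold


# Made-up prices: a 6% gain five days out is a positive, a 2% gain is not.
print(label_window([100, 101, 99, 102, 103, 106], end=0))
print(label_window([100, 101, 99, 102, 103, 102], end=0))
```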

Then I trained a binary classifier on the training examples. I used LogitBoost with 1000 stumps, but there are plenty of alternatives you could try.

Finally, I scored the test samples from 2012 and computed a ROC curve: for each possible score threshold, how many examples exceeding the threshold are actually positives (true positives) and how many are actually negatives (false positives). See Wikipedia for details.
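For readers who have not computed a ROC curve by hand, the threshold sweep described above looks roughly like this. It is a generic sketch, not the tool I actually used; `roc_points` and `auc` are hypothetical names, and the scores and labels in the example are made up.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points, sweeping the threshold from high to low.

    `labels` holds 1/True for positives and 0/False for negatives.
    Examples with tied scores are moved across the threshold together.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, is_pos) in enumerate(ranked):
        if is_pos:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all examples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(points)
print(auc(points))
```

Random guessing traces the diagonal from (0, 0) to (1, 1), which has an area of 0.5, so anything meaningfully above that is better than chance.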

The first attached image is the ROC plot. The blue curve is the classifier performance and the red one is random-guessing performance (in the absence of a prior, which I'll come to). TP rate = true positives / all positives; FP rate = false positives / all negatives.

So this is an encouraging first step: we do better than random guessing. The problem is that the prior is not 50-50. In the test set only about 7% of the examples saw a gain of more than 5%, so it's not enough to 'just' perform better than random. We want to know the precision, which is #true positives / (#true positives + #false positives), and which depends on the prior. The second attached image is precision vs. recall (recall is just another name for TP rate). So if we were to put this classifier out there making calls on which stocks to buy, it would only be right about 10-15% of the time and would miss more than half the opportunities (e.g. if we chose a threshold in the 30% recall range we'd have about 12% precision).
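To make the dependence on the prior concrete, here is a small sketch that turns a ROC operating point into a precision. The function name is hypothetical, and the FP rate of 0.166 in the example is not a number I measured; it is simply the value that is consistent with 30% recall, a 7% prior, and roughly 12% precision.

```python
def precision_from_rates(tpr, fpr, prior):
    """Precision at a ROC operating point, given the fraction of positives (`prior`)."""
    true_pos = tpr * prior            # fraction of all examples that are true positives
    false_pos = fpr * (1.0 - prior)   # fraction of all examples that are false positives
    if true_pos + false_pos == 0:
        raise ValueError("no examples exceed the threshold")
    return true_pos / (true_pos + false_pos)


# Random guessing (TP rate == FP rate) just recovers the prior.
print(precision_from_rates(0.30, 0.30, 0.07))

# Illustrative operating point: 30% recall with an assumed FP rate of 0.166.
print(precision_from_rates(0.30, 0.166, 0.07))
```

With a 50-50 prior the same operating point would give a precision of about 64%, which is why a ROC curve that looks decent can still translate into mostly false alarms when positives are rare.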

Ok, then, I thought maybe 5% bumps are too rare and I should just try to build a classifier that predicts positive vs. negative price swings. So I relabeled the examples so that a gain over the next week was a positive and a loss was a negative. The third attached image is the result, which shows that we would do a little bit worse than random guessing... :-(

So, long story short: there *are* features that seem to be indicative of a pending price bump, but more often than not they are false alarms. And there is basically zero correlation between two weeks of closing prices and whether the return the following week is positive or negative, even when throwing in a market indicator like SPY. I also experimented a little with vanilla regression rather than a binary classifier, but my toolset can only handle a small number of examples in that case, and again there was almost zero correlation using a downsampled training set (1000 examples, using radial-basis-function regression).

A couple of additional things I can try: including daily price ranges, and also throwing the actual ticker in as a feature. Maybe I'm looking at too broad a set of tickers, and the classifier might be able to zero in on some specific tickers that have more predictable behavior.