Mastering Kaggle

Hi everyone.  My name is Phil, and I’m a Kaggle Master.  It took me just under four months to become one.

I’m not a terribly imposing Kaggle Master, I suspect.  You see, I’m not wildly clever or keenly intelligent.  I’m more of a friendly, slightly schlubby programmer who routinely trips on things.  I don’t write books or invent algorithms; I mostly hang out with my family all day and make video games.  I’m your average guy in the Internet Age.

In fact, having chatted with quite a few of my fellow Kaggle Masters, I’ve been surprised at how many of them are almost entirely unlike me.  Many, if not all, are full-time data scientists, or are positively brilliant, or both.  PhDs abound.  Many have their own machine learning products.  When I talk with them I say “what?” a lot, and they patiently explain again the very basic thing they were just talking about.  In Kaggle terms I am Harry Potter’s third cousin’s neighbor from down the street.  I’m only technically a magician, and then only on holidays.

I know there are people like me out there.  Folks who think data is neat, or who’ve heard of Kaggle or machine learning and wanted to try their hand at it, but were afraid.  Perhaps people who thought they weren’t smart enough to do it.  It’s not true; I do it successfully every day, and I have a ball – all this despite being ridiculously unschooled in it.  If I can do it, you can.

I’d like to share my stories – cautionary tales, mostly – in hopes that you find something you can use.  I’ll also talk about machine learning from my perspective – that of a total, unrelenting neophyte – and hopefully we’ll learn something together.  I expect I’ll be wrong at least half the time, and the other half will hopefully involve me learning from my mistakes.

Let’s begin, shall we?

How I became a Kaggle Master in 4 months, Part I: The EnKaggling -or- Utter Failure Is Fun-damental

It began in April of 2014 with a competition called the

Allstate Purchase Prediction Challenge

In huge letters like that.  It involved working out which insurance policy options customers were going to pick from seven different categories.  The trick was that you had to figure out exactly which options they’d pick.  So you couldn’t just work out that Joe Schmoe was going to pick Option 3 from Option Group D and call it a partial victory; you had to guess each of the seven categories correctly.  It’s a bit like guessing correctly what each of your family members wants for Christmas, and you only get presents if you get ALL of them right.  And you have several million family members, and only basic demographic data to go on.  It’s a terrible analogy, but still – it was utter madness.

So!  It was a bit of a challenge.

I shopped around a bit and ended up using scikit-learn, a Python library that serves up all kinds of fun machine learning techniques.  I’m more of a C++ guy in my day-to-day life, but Python seemed to handle this stuff pretty well.  I played around with it for a day or two and had a stonking good time, then jumped into the fray.
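For anyone who hasn’t tried it, the “playing around” stage looks something like this – a minimal scikit-learn sketch on a built-in toy dataset, since the competition data itself isn’t shown here (the choice of a random forest is my illustrative assumption, not necessarily what I used):

```python
# A minimal scikit-learn workflow: split data, fit a model, score it.
# Uses the built-in iris dataset as a stand-in for real competition data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier and check accuracy on the held-out quarter of the data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Swap in your own data and a different estimator and the shape of the code barely changes – that uniform fit/predict/score interface is a big part of why the library is so pleasant to fiddle with.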

I did terribly.  I didn’t know how to do anything properly.  I’ve got an (increasingly dusty) background in natural language processing, but it was about as useful as twin-linked chainsaw launchers on a Subaru.  I struggled, day after day, never making much progress.  At one point I moved forward a few places on the leaderboard, only to lose ten places the next day.  And that was one of the very few times I moved up the leaderboard.  It was a brutal slog, inasmuch as anything can be brutal while seated in front of a computer drinking coffee.

In the end I made every mistake in the book.  I didn’t change strategies – I pursued the same basic models relentlessly, over-iterating, trying tiny meaningless variations.  I didn’t validate my theories – I used cross-validation in only the most basic ways.  I didn’t visualize the data set.  And I didn’t take good notes about what had gone right or wrong, so I was constantly in the dark.  I spent dozens upon dozens of hours beating my head against a proverbial brick wall.  And I did fairly terribly – placing 1015th.  When the competition ended, I was crushed.
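To make the cross-validation mistake concrete: the fix is dead simple in scikit-learn, which will happily score a model across k folds for you instead of letting you trust a single split (or the public leaderboard). A sketch, again on toy data standing in for the real thing:

```python
# k-fold cross-validation: each fold is held out once while the model
# trains on the remaining folds, giving k accuracy estimates instead of one.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Five folds; the spread of scores tells you how stable the model is.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores.round(2))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

The mean tells you roughly how good the model is; the standard deviation tells you how much to trust that number – which is exactly the information I was flying without.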

Reading through the forums on the day after the competition, I noticed that approaches to the problem were all over the place.  The top competitors had done wildly different things. I had expected to see one or two solid, well-known solutions that varied slightly, with final scores largely being determined by how intelligently those solutions had been applied.  I couldn’t have been more wrong.  People did wild things that I’d never heard of.  No one solution was very similar to – or very much worse than! – any of the others.

What was strange was that one of my submissions had actually done all right – it would have placed me 270th out of 1500 competitors, a pretty respectable position.  In nearly all Kaggle competitions you need to pick one or two of your submissions to be your “final” submissions, the sum and representation of your work on the competition (free Kaggle lesson: choose carefully).

What was stranger still was that this halfway-good submission was the only one in which I had really tried to use any new techniques, or to think outside the box.  It was the only submission I had taken real notes for.  It was the only one where I’d done some real cross-validation.  But I hadn’t picked it because it hadn’t scored well on the leaderboard.

I had a sudden, stunning realization.

Doing well in Kaggle wasn’t a matter of being smarter than my competitors.  It was a matter of trying to be smarter than myself.

I decided, in my next competition, to prove myself right.
