Making data science a sport – why Kaggle makes my blood boil

So, if you haven’t heard of them, there is this company called Kaggle that runs data mining competitions. Their tagline is “Making data science a sport”. Essentially they offer prizes (or help people who want to offer prizes) to whoever can build the best predictive model on a set of data.

Sometimes the prize is modest – say $10 000 – other times it is huge.  The biggest prize at the moment is $3M for health prediction.

The concept of a data mining prize most famously started with the Netflix challenge – where teams competed to identify the best films to recommend to people.

I am a Data Miner.  I like building models.  I like predicting things from data.  So why does the thought of Kaggle make my blood boil?  And should it? 

Well, I spend a lot of time talking to people about data mining. I explain what I think the best approach is (I strongly recommend the CRISP-DM methodology, link currently unavailable, but we’re working on it). And the thing that hits me time and time again is that building the model – the bit that Kaggle has elevated to the pinnacle of the process – is just about the least important part of it all. There are numerous techniques available, all of which are well understood. There are many software vendors (SAS, IBM, Revolution, KXEN), and there are just as many opinions on the best algorithm. But:

  • If you have the wrong business question* then no algorithm will fix your problem.
  • If you have the wrong data then no algorithm will fix your problem.
  • Conversely, if you have the right question and the right data it’s pretty hard to get it wrong, even if you make a poor algorithm choice.
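To make those bullets concrete, here is a minimal sketch in Python on made-up data. Everything in it is an assumption for illustration: the synthetic label, the "relevant" and "irrelevant" features, and the two deliberately crude algorithms (a threshold at the training mean and a one-nearest-neighbour lookup). It compares how much the choice of data matters against how much the choice of algorithm matters.

```python
import random


def make_data(n, rng):
    """Return rows of (relevant_feature, irrelevant_feature, label)."""
    rows = []
    for _ in range(n):
        signal = rng.gauss(0, 1)
        label = 1 if signal > 0 else 0
        relevant = signal + rng.gauss(0, 0.3)  # noisy view of the real driver
        irrelevant = rng.gauss(0, 1)           # has nothing to do with the label
        rows.append((relevant, irrelevant, label))
    return rows


def threshold_model(train, col):
    """Crude algorithm: predict 1 when the feature exceeds the training mean."""
    mean = sum(row[col] for row in train) / len(train)
    return lambda x: 1 if x > mean else 0


def nearest_neighbour_model(train, col):
    """A different algorithm: copy the label of the closest training point."""
    points = [(row[col], row[2]) for row in train]
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def accuracy(model, test, col):
    """Share of test rows where the prediction matches the label."""
    return sum(model(row[col]) == row[2] for row in test) / len(test)


rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_data(500, rng), make_data(500, rng)

results = {}
for data_name, col in [("right data", 0), ("wrong data", 1)]:
    for algo_name, fit in [("threshold", threshold_model),
                           ("1-NN", nearest_neighbour_model)]:
        results[(data_name, algo_name)] = accuracy(fit(train, col), test, col)
        print(f"{data_name:10s} {algo_name:9s} {results[(data_name, algo_name)]:.2f}")
```

In this toy setup both algorithms do well on the relevant feature and neither does better than a coin flip on the irrelevant one. It is a sketch of the argument, not evidence about any real data set.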

Does this make Kaggle bad? No, it doesn’t. But if this is making data science a sport, we might want to think about the value of putting large sums of money into doing things that are essentially of little real value. And no, I don’t mean golf.

*Business question = important thing that you want to make predictions about.

2 comments on “Making data science a sport – why Kaggle makes my blood boil”

  1. One of the big problems with Kaggle is that in order to win a competition, your model’s predictions need to be the best at conforming to their test data set. The model that best conforms to one test data set isn’t necessarily the model that will best conform to actual data in general.

    Although test data sets are used to prevent over-fitting a model to the training data set, over-fitting can still occur by having overly complicated models. Typically the winners of Kaggle competitions have used overly complicated models (which combine dozens of other models) rather than something simpler that is likely to perform better in the real world.

    Another problem with Kaggle is that it is a ‘winner-takes-all’ system, where most people don’t get paid for their work (work which could be of a higher quality than the winner’s work).
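As a rough illustration of the first point in this comment, here is a minimal Python sketch on purely synthetic data. The set sizes, the number of entries and the "models" are all assumptions made up for the example: each entry guesses at random, so none has any real skill. The sketch picks the entry that scores best on a small fixed test set and then checks that same entry on fresh data. It shows only the selection effect of ranking many entries on one finite test set; it says nothing about model complexity or ensembles.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
N_MODELS, N_TEST, N_FRESH = 200, 100, 5000  # illustrative sizes only


def coin_flips(n):
    """Return n random 0/1 values, used both as labels and as skill-free guesses."""
    return [1 if rng.random() < 0.5 else 0 for _ in range(n)]


def score(predictions, labels):
    """Share of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


test_labels = coin_flips(N_TEST)    # the fixed "leaderboard" test set
fresh_labels = coin_flips(N_FRESH)  # stands in for data seen after the contest

best_test, best_fresh = 0.0, 0.0
for _ in range(N_MODELS):
    test_score = score(coin_flips(N_TEST), test_labels)
    fresh_score = score(coin_flips(N_FRESH), fresh_labels)
    if test_score > best_test:
        # Keep the leaderboard winner and how that same entry does on fresh data.
        best_test, best_fresh = test_score, fresh_score

print(f"winner on the test set: {best_test:.2f}")
print(f"same entry on fresh data: {best_fresh:.2f}")
```

The winning entry looks clearly better than chance on the test set, yet on fresh data it falls back to roughly 50%, because its test-set lead was luck rather than skill.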
