So, if you haven’t heard of them, there is a company called Kaggle that runs data mining competitions. “Making data science a sport”. Essentially they offer prizes (or help people who want to offer prizes) to whoever can build the best predictive model on a set of data.
Sometimes the prize is modest (say $10,000); other times it is huge. The biggest prize at the moment is $3M for health prediction.
The concept of a data mining prize most famously started with the Netflix challenge – where teams of people could compete to identify the best films to offer people.
I am a Data Miner. I like building models. I like predicting things from data. So why does the thought of Kaggle make my blood boil? And should it?
Well, I spend a lot of time talking to people about data mining. I explain what I think the best approach is (I strongly recommend the CRISP-DM methodology, link currently unavailable, but we’re working on it). And the thing that hits me time and time again is that building the model, the bit that Kaggle has elevated to the pinnacle of the process, is just about the least important part of it all. There are numerous techniques available, all of which are well understood. There are many software vendors (SAS, IBM, Revolution, KXEN), and there are just as many opinions on the best algorithm. But:
- If you have the wrong business question* then no algorithm will fix your problem.
- If you have the wrong data then no algorithm will fix your problem.
- Conversely, if you have the right question and the right data, it’s pretty hard to get it wrong even if you make a poor algorithm choice.
Does this make Kaggle bad? No, it doesn’t. But if this is making data science a sport, we might want to think about the value of putting large sums of money into doing things that are essentially of little real value. And no, I don’t mean golf.
*Business question = important thing that you want to make predictions about.