Three topics which are unlikely to get you invited to the best dinner parties. So how (if at all) are they related?
Let’s start with data mining.
Every year at about this time, Karl Rexer sends out a survey to data miners asking them about the tools they use, and the problems they address.
Over the years since 2007 there have been a few notable changes. The first is the rise in the number of people completing the survey, from 300 to 1300 (2011). The second is the change in the toolsets people are using.
If you go back to 2007, the most popular (and liked) tools were SAS and SPSS Clementine. In 2008, R appeared for the first time, and by 2010 R was the most popular tool, while Clementine (by then IBM SPSS Modeler) had almost disappeared.
Some aspects of this are hardly surprising. Clementine was always an expensive tool to purchase, and it has a limited set of algorithms (yes, I know you can add to nodes, but…). R is free. There are algorithms galore, even if they can be tricky to find. Increasingly it’s the tool that students, especially those researching their PhDs, will use at university. It’s the data mining tool of the big data movement.
Back in the 1500s religion, specifically the tensions within religion, was deeply intertwined with the disruptive technology of the day: movable type.
The religious dispute was the Reformation of the Catholic Church, and one of the core aspects of this was the relationship between individuals and God. Did we need a priesthood or saints to intercede on our behalf? Couldn’t we read the word of God ourselves, and learn the lessons directly? The Bible was the big data of the 1500s, and the printing press was the technology that allowed people to translate the Bible and provide access to the truth.
Sound a bit familiar?
Yet here we are in the midst of the big data reformation and it appears that we are creating a priesthood, not removing it. What makes Clementine a great tool is that it is intensely and beautifully usable. It has an interface that is elegant and intuitive. This allows data miners to focus on the meaning of the analysis, not the mechanism of the analysis.
R, for all its strengths, is not elegant or beautiful. When you use R it is like stepping back into the world of SAS procs in the 90s. Too much effort is spent getting the software to work, and not enough is spent on the bigger picture.
“Aha!” you say, “But I’m an expert – I can use R at lightning speeds, I don’t need an elegant GUI!” And so the politics is born.
Do we believe that data mining and data science are the province of a priesthood? The select, who will interpret the holy truths on behalf of the unwashed masses (let’s call them managers)? Or do we believe that data science should be democratized and made available to as many people as possible? Can they handle the truth?
Can we navigate the political waters, get over our religious differences, and deliver on the promise of analysis in the world of big data?