Strata’s hot topics 2012

Last year I created some quick and dirty analysis of the topics that were going to be covered at Strata NY, and as I’m leaving for California at the weekend for another Strata conference I thought I’d try again.  Why?  Well to my mind Strata is by far the most interesting conference about Data Science, so hopefully this will have some predictive power as to where the market is going.

Given my theory about the west coast leading and Europe lagging that should be even truer.

As a word of caution these are created in Wordle.  If you don’t like Wordle, stop reading.  Yes, I know it’s limited, but it’s colourful and you can gain some quick insights from it.

The big topics

When you take all the track sessions this is what you see:

All topics at Strata

No surprise that data is still BIG, and that hadoop is a huge topic (there are two tracks dedicated to it), but nice that analytics is significant as is data science.

Data Science track

Data science topics

Perhaps the most surprising thing in the Data Science track is that data scientists don’t like calling it Big Data.  They also seem to be keener on data mining than on analytics as a topic.   I’m heartened by this – I’m not a big fan of b*g data as a phrase…

Business and Industry

Business topics

Back in the ‘real’ world Big is Big.  Some odd words appear too: democratizing?  In business?

Visualization

Visualization topics

A simple glance at this would suggest that the world of visualisation is still searching for the right toolset.  Whereas the architecture war seems to be firmly with Hadoop there is no such dominance for the visual.  Also interesting that visualisation experts aren’t bothered by ‘big’.  Perhaps there isn’t anything unique about b*g data from a visualisation perspective – or perhaps they just haven’t hit on it yet.

The Hadoop tracks

There are two separate Hadoop tracks, applied and technical.  Spot the difference:

Applied Hadoop topics

Technical Hadoop topics

To my mind the interesting thing is what isn’t there in the applied sessions: business cases.  That list still looks pretty techy to me.

Privacy and Policy

Privacy and policy topics

As with Strata NY you almost wonder if this is from the same conference.  I wish this was a bigger track, because until we take privacy seriously then we are all at risk (especially in Europe).

Domain data

Domain data topics

No surprise that Social and Web data are prominent, since for many people that’s all that data science is about.  But where are the manufacturing and locational data?  (OK, geo is starting to creep in.)

Sponsored track

Last time I was quite dismissive of the Sponsored track.  But on mature reflection – and now being employed by a company that sponsors tracks – I’m looking at this as a guide to where people are putting their money for the next 12 months or so.  Where are businesses willing to invest in data science?

Sponsored topics

Unfortunately many of them seem to be blowing their marketing budget trying to own the word ‘big’. Also interesting is how little prominence Hadoop gets, given its significance elsewhere in the agenda.

Methodology

I’ve taken the text from the short presentation overviews on the Strataconf.com site, and removed venue and speaker information.  I’ve also tried to take out words like ‘talk’, ‘session’, ‘panel’, and for reasons I hope are obvious, ‘data’. I’ve made all words lower case, and stripped out a few oddities (this isn’t meant to be a perfect analysis).

I’ve then limited the number of words, usually to 75, but in the case of smaller tracks to fewer.
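For anyone who wants to reproduce the counting step outside Wordle, here is a minimal sketch of the clean-up described above. It is not the process I actually ran (that was done by hand and in Wordle); the function name `top_words`, the regular expression and the exact exclusion list are assumptions made for illustration.

```python
import re
from collections import Counter

# Words named in the post as removed. The full list actually used is not
# recorded, so treat this set as an assumption.
EXCLUDED = {"talk", "session", "panel", "data"}


def top_words(abstracts, limit=75, excluded=EXCLUDED):
    """Return the `limit` most frequent words across a list of abstracts.

    Text is lower-cased, split into alphabetic words, and any word in
    `excluded` is dropped before counting.
    """
    counts = Counter()
    for text in abstracts:
        # Lower-case first, then keep runs of letters (with an optional
        # internal apostrophe, e.g. "isn't").
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        counts.update(w for w in words if w not in excluded)
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        "Big Data talk on Hadoop",
        "Hadoop session: data science and Hadoop",
    ]
    for word, count in top_words(sample, limit=5):
        print(word, count)
```

The resulting word and count pairs could then be pasted into Wordle or any other word cloud tool.  Wordle also drops common English words such as ‘and’ by default, which this sketch does not attempt.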

Is it time to retire B*g Data?

Before we start, why would I even start to think such a heresy? After all, I believe in Big Data, in the excitement and opportunity and possibilities and analytics and (am I beginning to sound like a fanboi here?) and and and…

Well, first a history lesson: In the early 90s there was a movement.  It was about taking a customised and analytical approach to marketing and service that allowed you to treat customers as individuals, and it promised that if you did this that you would be a better company.  You might have heard of it. It was called Customer Relationship Management.

It became a BIG THING in the heady days before the millennium, and soon everything became CRM… And nothing did.  As Syndrome put it “because when everyone is super, no one will be”.   For those of us who were involved in CRM then, and who look around now, it is amazing how little CRM changed the ways that companies do business.

In the process CRM changed its name.  It became Customer Experience Management (so I didn’t have a relationship with my bank after all), EMM (forget about the customer), VRM (like I can control Apple), or even worse JAPOS (just another piece of software).

But businesses did change, didn’t they?

What has actually changed things is the advent of Big Data, especially the limited, but special case of social media.  Not in the sense of the way that companies understand you (that is still sadly lacking), but in the way that customers interact with each other and can now know things that were previously hidden.
  • What is the second hand value of something? Thanks eBay.
  • What is the cheapest way to get something? Thanks Amazon.
  • What do people think about products or companies?  Thanks blogs and hate sites.
  • Where are things that I like? Thanks FourSquare.
  • Who do I enjoy hearing from? Thanks Facebook.
  • What pointless chatter can’t I live without today? Thanks Twitter.
All of these (and more) have effectively used mass crowdsourcing input to open up information flow in a way that wasn’t thinkable a few years ago.  They show us the route that Big Data could take… If we let it.

So why retire Big Data?

Because any day now* we’ll have SAP Big Data. Oracle Big Data. SAS Big Data. IBM Big Data.  And like CRM before it the hope that Big Data had for changing the world will fritter away on a tide of corporate blandness. **

And these companies are likely to make the most basic mistakes:
  • Big Data is about size
  • Big Data is about technology
  • Big Data is just social media
  • Big Data = big hardware and big software sales

The/My fight back starts here.

I’m going to try not to refer directly to Big Data in future. I will refer to Data Science, which I think is a more useful term anyway.  It takes things out of the realm of marketing speak and into an arena where we need to justify the use of the word science. That requires thought, and thought is good.

If I have to refer to it I will bleep it where possible.  This may require me to rewrite a large number of slides, but so be it.

I will bear in mind that there is no such thing as B*g Data.  There is just data.  And if there is a lot of it then it may be big, but it is never B*g.

*probably several months ago actually

** full disclosure, I work for Teradata, an IT company that has a foot in the door. But in our defence we see B*g Data as a business thing, not a technology thing.

A data in the life… what do I do that generates data?

Whilst I was working on a presentation it occurred to me that we talk a lot about the amount of data being generated in our new ‘Big Data’ world, but when asked to quantify it we tend to talk in generalities… smart phones, IP, Twitter, Facebook (and too many of us just talk about those last two).

So instead I thought I’d try and record what I actually did, and which bits generated data.  I’m not too bothered about whether the data is actually captured – let’s just assume that it might be at some point in the future.

I also tried to go further and think of the types of question that could be answered by the data, if it was stored.  I’m not sure how successful that was, but any failings are in my imagination, not in the concept.

About me (and my data)

I live in ruralish Leicestershire.  So many of the data opportunities available to Metropolitan man are unavailable to me.  No Oyster/Muni tickets, for example.  I travel a lot.  I have an iPhone and an iPad.  I’m probably not typical, but I’m probably not that unusual either.  I’m married with children.  Not that that stops the spam bots.

So here are the images – feel free to borrow them, comment on them etc…

Data generated in the morning

Data from the afternoon

“Data Scientists are hardcore coders…”: discuss

Yesterday I almost had a heart attack when an esteemed colleague (who shall remain nameless) came out with the statement: “Data Scientists are people who are hardcore Hadoop coders”…  had I totally misunderstood him?  Or was I so out of step with the world that I had totally misunderstood data science?

This is all the more important for two reasons:

  1. My job title (full disclosure, I made it up) is Director Data Science
  2. I’m busy trying to recruit data scientists for my team.

Well, to be honest, I could probably live with being wrong about 1.  LinkedIn will never find out, so that’s OK.

But whilst I’ve been engaged in recruitment I have had to decide what it is I’m looking for in candidates.  So here it is… in descending order of importance, a data scientist will be:

Curious

The first, and most important trait is curiosity.  Insane curiosity.  In many walks of life evolution selects against the kind of person who decides to find out what happens “if I push that button”.  In Data Science it selects for it.

In my own analytical experience nothing has come close to the feeling when you discover something new (even if other people have already been there).  In 5th form working out how to prove what root -9 was.  At University… well too much to drink there, but at work discovering that we could push complex analytics onto an MPP system.  That complex things (cars) failed with the same distribution as simple things (their components).  That social networks could be used to predict some things, and that they couldn’t be used to predict others.

And that last one is important too: the joy of disproving something!

Analytical

I expect any data scientist to have a background in, and an understanding of, complex analytics.  I don’t mean reporting.  I’ve nothing against reporting, it’s important and someone has to do it.  But not a data scientist.  I’m after people who can build a model that predicts something, or who can cluster data, who know the tricks of creating a good dataset, and when a model result is too good.  And importantly people who can tell me if the result is statistically relevant or just one of those things.

When it comes to Big Data “those things” will come up more and more often as our data gets bigger.
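One way to read that (and it is only one reading) is as the multiple comparisons problem: the more variables and relationships you test, the more likely it is that at least one looks significant purely by chance.  A rough sketch of the arithmetic, assuming independent tests at a fixed significance level; the function names and the 5% default are illustrative choices, not anything prescribed here:

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    gives a false positive when each is run at significance level `alpha`."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance level under a Bonferroni correction, which
    keeps the overall false positive rate at or below `alpha`."""
    return alpha / num_tests


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, chance_of_false_positive(n), bonferroni_threshold(n))
```

With a single test the chance of a spurious result is the familiar 5%; with a hundred tests it is above 99%, unless the threshold for each test is tightened accordingly.  Bonferroni is the bluntest of the available corrections, used here only because it fits in one line.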

Communicative

I have no use for people who are unable to communicate with non-specialists.  It’s hard enough discussing these topics within the community – we need people who can explain to those outside the community.  The users of the services we will provide.

Of course communication is two way, and the data scientist needs to listen too.

Novel

The data scientist needs to provide additional value above and beyond what’s happening already.  You can provide a fantastic new way of predicting churn that will only cost $1 million and uses data sources that are already in use?  And it doesn’t outperform the existing methods.  Hmmm.

Novelty, either in ways of thinking, or in terms of the data and approaches to be used, is vital.

Business focused

Obviously by business I mean “focused on the overall objectives of the organisation you’re working with or in”, but that’s a bit long winded.  Again data scientists need to get their heads out of the algorithms and into the business problems.  If you can tell me the correct parameterisation for a novel take on SVM, but can’t tell me the top three issues for a business (and how big data can help fix them), then you aren’t a data scientist, you’re an academic.

A coder

Last and least.  Yes, it would be nice if you can code Hadoop.  Or C#.  Or R.  But this is a passing phase brought on by a lack of good interfaces, it’s not a permanent state of affairs.  So, if you have this skill, good for you.  But if you only have this skill it’s time to get out into the world.