A big data glossary

Big data is serious. Really serious. But at Big Data Week it became clear that we need to be able to laugh at ourselves… So here is my initial attempt at an alternative glossary for big data. Thanks to the many contributors (intentional and not), and apologies to anyone who disagrees. Enjoy.


Agile analytics: Fast, iterative, experimental analysis often performed by people who aren’t.

Apache: Shorthand for the Apache Software Foundation, an open source collection of 100 projects. Not funny, but important.

Big brother: 1) Notional leader of Oceania in dystopian future novel Nineteen Eighty-Four. 2) Regular data related leader on front page of the Daily Mail in dystopian present 2013.

Big data: Any data problem that I’m working on. See also small data.

Cassandra: Open source, distributed data management system developed at Facebook. See Evil Empire.

Clickstream data: Data logged by client or web server as users navigate a website or software system. As in “In case of death please delete my clickstream”.

Data <insert profession>: Way of making a profession sound up to 22.7 times more sexy. Examples: data journalist, data scientist, data philanthropist. Not yet tried: data accountant, data politician.

Data journalist: Journalist who lets facts get in the way of a story.

Data model: the output of data modelling. Note that this is not a model who wants to sound more sexy.

Data modelling: Evil way of understanding how a process is reflected in data. Frowned upon.

Data miner: a data scientist who isn’t interested in a pay rise. Note that this is not a miner who wants to sound more sexy.

Data mining: The queen of sciences. Alternatively a business process that discovers patterns, or makes predictions from data sets using machine learning approaches.

Data scientist: A magical unicorn. With a good salary.

Dilbert test: Simple test for separating data scientists from IT people. If you take two equally funny cartoon strips, one Dilbert, the other XKCD, then a data scientist will prefer XKCD. An IT professional will prefer Dilbert. If anyone prefers Garfield they are in completely the wrong profession.

Elephants: Obligatory visual or textual reference by anyone involved in Hadoop. These are not the only animal in the zoo.

ETL: Extract, transform and load (ETL) – software and process used to move data.  Not to be confused with FTL, ELT, BLT, or more importantly DLT.

Evil Empire: large software/hardware/service vendor of choice. For example, Apple, Google, Facebook, IBM, Microsoft etc… The tag is independent of their evil doing capabilities.

Facebook: Privacy wrecking future Evil Empire.  Source of interesting graph data.

Fail fast: Low-risk, experimental approach to big data innovation where failure is seen as an opportunity to learn. Also excellent excuse when analysis goes terribly, terribly wrong.

Fail slow: Really, really bad idea.

Fork: Split in an open source project when two or more people stop talking to each other.

Google: Privacy wrecking future Evil Empire.  Source of interesting search data.

Google BigQuery: See Evil Empire.

Hadoop: Open source software controlled by the Apache Software Foundation that enables the distributed processing of large data sets across clusters of commodity servers. Often heard from senior managers “Get me one of those Hadoops that everyone is talking about”

Hadoop Hive: SQL interface to Hadoop MapReduce. Because SQL is evil. Except when it isn’t.

In-Memory Database: Trying to remember what you just did rather than writing it down. See also fail fast.

Internet of things: Stuff talking to other stuff. See singularity.

Java: Dominant programming language developed in the 90s at Sun Microsystems. It was later used to form Hadoop and other big data technologies.  More importantly a type of coffee.

Key value pairs: a way of avoiding evil data modelling.

MapReduce: Programming paradigm that enables scalability across multiple servers in Hadoop, essentially making it easier to process information on a vast scale. Sounds easy, doesn’t it?

MongoDB: NoSQL open source document-oriented database system developed and supported by 10gen. Its name derives from the word ‘humongous’. Seriously, where do they get these names?

NoSQL: No SQL. It’s evil. Do not use it, use something else instead.

NOSQL: Frequently (and understandably) confused with NoSQL. It actually means not only SQL, or that you forgot to take off capslock.

No! SQL!: Phrase frequently heard as a badly written query takes down your database/drops your table.  See Bobby Tables.

Open data: Data, often from government, made freely available to the public. See also .pdf

Open source: an expensive way to not pay for something.

Python: 1) Dynamic programming language, first developed in the late 1980s. 2) Dynamic comedy team, first developed in the late 1960s.

Pig: High-level platform for creating MapReduce programmes used with Hadoop, also from Apache.

Real time: Illusory requirement for projects.

Relational database management systems (RDBMS): Big ol’ databases.

Singularity: When the stuff talking to other stuff finds our conversation boring.

Small data: (pejorative) Data that you are working on.

Social Media: Facebook, Twitter etc… Sources of trivial data in large volumes. Will save the world somehow.

Social Network: 1) A collection of people and how they interact. May also be on social media. 2) What data scientists really hope they will be part of.

SQL: Standard, structured query language specifically designed for managing data held in Big ol’ databases.

Twitter: Privacy wrecking future #EvilEmpire.  Source of interesting data in less than 140 characters.

Unstructured data: 1) The crap that just got handed to you with a deadline of tomorrow 2) The brilliant data source you just analysed way ahead of expectations. Often includes video and audio. You weren’t watching YouTube for fun.

V: Letter mysteriously associated with big data definitions. Usually comes in threes.  No one really knows why.

Variety: The thing you fail to get in big data definitions. “I didn’t see much variety in that definition!”

Velociraptors: Scary dinosaurs that really should be part of the definition of big data, terribly underused.

Velocity: The speed at which a speaker at a conference moves from the topic to praising their own company.

Volume: The degree to which a business analyst’s hair exceeds expectations.

XKCD: Important reference manual for big data.

ZooKeeper: Another open source Apache project, which provides a centralised infrastructure and services that enable synchronisation across a cluster.  Given all the animals in big data you can see why this was needed.

7 Things from Strata Santa Clara

This is the fifth time I’ve made the pilgrimage to Strata – I was lucky enough to be at the very first event, here in Santa Clara, and that’s made me think about how things have changed over the last two years.

Two years ago big data was new and shiny. About 1600 people turned up at the south end of Silicon Valley to enthuse about what we could do with data.

Now we’re talking about legacy Hadoop systems, and data science (big data is so 2011), but what else has changed?

1)   Hadoop grew up

The talk this year wasn’t about this new shiny thing called Hadoop, it was about which distro was the best (with a new Intel distro being announced), and which company had the biggest number of coders working in the original open source code. Seriously there were almost fistfights over the bragging rights.

As a mark of the new seriousness the sports jacket to t-shirt ratio was up. But don’t worry the PC to Mac ratio was still tending to zero (the battle was between Air and Pro).

2)   NoSQL is very much secondary to Hadoop

The show is extremely analytically oriented (in a database sense… but more of that later). The NoSQL vendors are there, but attract a fraction of the attention.

3)   SQL is back

Yes, really.  It turns out it is useful for something after all.

4)   Everyone is looking for ways to make it actually work (and work fast)

Hadoop isn’t perfect, and there is a wide range of companies trying to make it work better. Oddly there is a fetishisation of speed. Odd because this is something that the data warehouse companies went through in the days when it was all about the TPC benchmarks. No, people, scaling doesn’t just mean big and fast. It means usable by lots of people and a whole raft of other things.

Greenplum were trying to convince us that theirs was fastest. Intel told us that SAP HANA was faster, and more innovative. Really. And the list went on.

Rather worryingly there seems to be a group out there who want to try and reinvent the dead end of TPC* but for Hadoop.

5)   There’s a lot of talk about Bayes, but not many data miners 

I ran a session on Data Mining. Only a handful of people out of about 200 in the room would admit to being data miners. This is terrifying as data scientists are trying to do analytical tasks. Coding a random forest does not make you a data miner!

6)   Data philanthropy is (still) hot

We had a keynote from codeforamerica, and lots of talk about ethics, black hats etc… I ran a Birds of a Feather table on Doing Good With Data. A group of us were talking about the top secret PROJECT EPHASUS. 

7)   The Gartner Hypecycle is in the trough of disillusionment

At least as far as big data is concerned. The show sold out. And the excitement was still there. Data Science has arrived.


* For a fun review about the decision that Teradata made to withdraw from TPC benchmarks, try this by my esteemed colleague Martin Willcox.

Religion, politics and data mining

Three topics which are unlikely to get you invited to the best dinner parties. So how (if at all) are they related?

Let’s start with data mining.

Data mining

Every year at about this time, Karl Rexer sends out a survey to data miners asking them about the tools they use, and the problems they address. 

Over the years since 2007 there have been a few notable changes. Firstly the rise in the number of people who completed the survey, from 300 to 1300 (2011). But also in the change in the toolsets people are using.

If you go back to 2007 the most popular (and liked) tools were SAS and SPSS Clementine. In 2008, R appeared for the first time, and by 2010 R was the most popular tool, while Clementine (by then IBM SPSS Modeler) had almost disappeared.

Some aspects of this are hardly surprising. Clementine was always an expensive tool to purchase, and it has a limited set of algorithms (yes, I know you can add to nodes, but…).  R is free. There are algorithms galore, even if they can be tricky to find.  Increasingly it’s the tool that students, especially those researching their PhDs, will use at University.  It’s the data mining tool of the big data movement.

Religion

Back in the 1500s religion, specifically the tensions within religion, was deeply intertwined with the disruptive technology of the day: movable type.

The religious dispute was the reformation of the Catholic church, and one of the core aspects of this was the relationship between individuals and God. Did we need a priesthood or saints to intercede on our behalf? Couldn’t we read the word of God ourselves, and learn the lessons directly?  The bible was the big data of the 1500s, and the printing press was the technology that allowed people to translate the bible and provide access to the truth.

Sound a bit familiar?

Politics

Yet here we are in the midst of the big data reformation and it appears that we are creating a priesthood, not removing it.  What makes Clementine a great tool is that it is intensely and beautifully useable. It has an interface that is elegant and intuitive.  This allowed data miners to focus on the meaning of the analysis, not the mechanism of the analysis.

R, for all its strengths, is not elegant or beautiful. When you use R it is like stepping back into the world of SAS procs in the 90s. Too much effort is spent getting the software to work, and not enough is spent on the bigger picture.

“Aha!” you say, “But I’m an expert – I can use R at lightning speeds, I don’t need an elegant GUI!”   And so the politics is born.

Do we believe that data mining and data science are the province of a priesthood?  The select, who will interpret the holy truths on behalf of the unwashed masses (let’s call them managers)? Or do we believe that data science should be democratized and made available to as many people as possible? Can they handle the truth?

Can we navigate the political waters, get over our religious differences, and deliver on the promise of analysis in the world of big data?

Doing good with analytics: the pledge

As a data professional I pledge

  • I will be Aware of the outcome and impact of my analysis
  • I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
  • I will be an Agent for change: use my analytical powers for positive good
  • I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine

Sign the pledge online at 
http://www.causes.com/actions/1694321

Doing Good With Data: the case for the ethical Data Scientist

This post is designed to accompany my presentation to the Teradata Partners User Group, but hopefully some of the links will prove useful even if you couldn’t get to the presentation itself.

Needless to say, the most important part – the Pledge – is right at the bottom.  Feel free to skip to it if you like!

Law

The law (as it relates to data – well actually pretty much all law) is complex, highly jurisdictional, and most importantly of all at least 10 years behind reality.  Neither Judy nor I are lawyers, but hopefully these links provide some general background:

One of the first legal agreements was the OECD’s position on data transfers between countries. It was adopted in 1980, after work that began in the 70s, when flares were hot and digital watches were the dream of a young Douglas Adams: 
http://itlaw.wikia.com/wiki/Guidelines_on_the_Protection_of_Privacy_and_Transborder_Flows_of_Personal_Data

Much later the EU released the snappily titled EU 95/46/EC – better known as the Data Directive. The joy is that each country can implement it differently, resulting in confusion.  There are currently proposals out for consultation on updating it too: 
http://en.wikipedia.org/wiki/Data_Protection_Directive

Of course the EU and the US occasionally come to different decisions, and for a brief discussion of some of the major differences between them you can try this: 
http://www.privireal.org/content/dp/usa.php

Don’t do evil

Google’s famous take on the Hippocratic oath can be simplified as ‘don’t do evil’. As we say in the presentation, this is necessary, but scarcely enough.  It also has the disadvantage of being passive. In its expanded form it’s available here: 
http://investor.google.com/corporate/code-of-conduct.html

Doing Good With Data

Now for the fun bits!  For information on the UN Global Pulse initiative:
http://www.unglobalpulse.org/ 

Data 4 Development – the Orange France Telecom initiative in Ivory Coast: 
http://www.d4d.orange.com/home

If you have a bent for European socialised medicine, then the NHS hack days are here:
http://nhshackday.com/

DataKind

And our favourite – with a big thanks to Jake Porway and Craig Barowsky – is DataKind: 
http://datakind.org/
You can also follow @DataKind

To find out more about the UK charities mentioned check out Place2Be
https://www.place2be.org.uk/
 and KeyFund 
http://www.keyfund.org.uk/

Please take the time to register with DataKind, and keep your eyes open for other opportunities.  We hope that DataKind will be open for business in the UK too soon!

The Pledge

Please go and look at the Pledge, and if you think you can, then sign up.  If you have one of our printed cards, take it, sign it and put it on your cube wall (or your refrigerator - wherever it will remind you of your commitment). But sign the online one too.  And once you’ve done that, let the world know! Get them to sign up. If you want a Word copy of the printable one just drop me a line.


http://www.causes.com/actions/1694321

The day the (medical) data broke free…


Today is a good day for data – at least in healthcare. At last the data from the NHS is being set free.

For my international colleagues and friends it’s worth pointing out some things about the NHS.  The UK National Health Service* is actually a very large and complex organisation that cares for health needs. The main arms are the GP services and the Hospital services.  GPs are self-employed and effectively contracted by the NHS. Hospitals are islands unto themselves within regional groupings.  Above all lie funding and commissioning structures. Sounds complex? From a data perspective it certainly is. The data that is generated by the system is often written, frequently in isolated systems, and is barely there for joined-up services, never mind research.

On the positive side, it’s free** at point of use, and generally does a good job.

There have been signs for a while that the NHS has been starting to think about data.

  • Dr Carl Reynolds (@drcjar) at 
    http://openhealthcare.org.uk/
    has been leading the way on doing good things with health data, including running NHS hack days.  If you want to get involved the next one is in Liverpool on the 22-23 September
  • The UK set up the BioBank project, which aims to provide a longitudinal study of people who aren’t necessarily ill.  If you think about it, it’s fairly obvious that most people who go to their doctor are ill – BioBank aims to understand the factors in their lives that were the same as, or different from, other people’s before and after they were ill.
  • Dr Ben Goldacre (@bengoldacre) has been leading a crusade to get clinical research data (even from trials that are abandoned or not published) into the public domain so that it can be used to compare outcomes.

But now the Government has gone much, much further and has created the Clinical Practice Research Datalink. In addition to having a funky website this aims to bring together data from the NHS so that this vast set of data can be used to improve health outcomes.

Of course there is a very, very big cloud hanging over this. How do you anonymise patient data so that it is still useful?  Simply removing names and addresses won’t deal with the issue, as Ross Anderson of Cambridge University identifies (the Guardian again – don’t say they aren’t fair and balanced!).

But I think, on balance, I disagree with Ross. I’ve come to the conclusion that we can’t rely on privacy, and that the exchange of a guarantee on privacy for free medical care is probably reasonable in itself. Especially as the guarantee isn’t really worth much these days.  When you add to this the potential benefits to research, then the answer is even more obvious. How many people would be happy to give up their privacy if they knew that one day they, or their kids, might be relying on the treatment that resulted?
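To make the anonymisation worry a little more concrete, here is a minimal sketch of the kind of check a data scientist might run before releasing "de-identified" records. It is an illustration only: the records are made up, the field names (`postcode_area`, `birth_year`, `sex`) and the helper `smallest_group` are hypothetical, and this has nothing to do with how the Clinical Practice Research Datalink actually works. The idea is simply to count how many people share each combination of supposedly harmless fields; if the smallest group has one member, someone can be picked out without a name or address in sight.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values.

    A result of 1 means at least one person is unique on those fields.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Made-up records with names and addresses already removed.
records = [
    {"postcode_area": "B13", "birth_year": 1970, "sex": "F", "diagnosis": "asthma"},
    {"postcode_area": "B13", "birth_year": 1970, "sex": "F", "diagnosis": "diabetes"},
    {"postcode_area": "B14", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
    {"postcode_area": "B14", "birth_year": 1982, "sex": "F", "diagnosis": "eczema"},
]

# Smallest group when all three fields are combined.
k = smallest_group(records, ["postcode_area", "birth_year", "sex"])
print(k)
```

With these toy records the smallest group has a single member, so at least one "anonymous" patient is unique on postcode area, birth year and sex alone. Dropping `sex` from the list raises the smallest group to two, which hints at the trade-off: the coarser the fields, the safer the data, and the less useful it becomes for research.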

*Actually there are three, NHS England and Wales, NHS Scotland, and NHS Northern Ireland, but let’s assume they are the same thing for this argument. NHS E&W is by far the largest.

**Nearly.


Data *insert profession here*

In thinking about our data revolution, and pondering on my self-declared job title (always the best kind of job title if you ask me), it occurred to me: how would professions change if we put the word “data” in front of the title? Would it make them better, or worse? Would data actually improve the way that the professions work?

Of course, this has already happened for data science, and increasingly for data journalism – and reflects two different approaches.  In the first case it is applying science to data.  In the second it is applying data to journalism.

But if you assume that Data *insert profession here* is about applying data science approaches to <profession>, then what could it mean? Would it make the world better? Let’s try it and see…


A DataKind data dive: more on that later

Data Journalism – leading the way

Assuming we all have our opinions about data science, let’s look instead at Data Journalism. Go and examine The Guardian’s data section (led by the excellent @simonrogers).  Here you will find stories developed from public data, data being used to test the truth of other stories, and the crowdsourcing of journalism.

Famously, when the UK Government tried to convince everyone that Blackberry (remember them?) Messenger and Twitter were the cause of the riots in the UK, The Guardian dug up the data that proved that messages followed incidents, not vice-versa.

It’s exciting, and it’s not clear where it will all end up.  Simon would probably tell you the same.  But it does change the way that journalists work and interact with sources.

So if it works for Data Journalism, where else might it work?

Data Politician

How many politicians have a science background?  The answer is very few.  In the UK there was my MP, Dr Lynne Jones, but she retired at the last election.   Actually there are sites that will tell you that the number is 27 (although being me I looked further and found that was only based on a sample of 430).

It’s still not many.

Could a scientific approach to politics, using data, help? How would we feel if politicians actively set up experiments to validate their policies?  Let’s be clear that current policy ‘trials’ tend to get the results that politicians want, and tend to be neither controlled nor statistically significant. I’ve got to wonder if that is because politicians as a whole are unfamiliar with the words “control”, “statistical” or “significant”.

How would we react to experiments?  Would we be willing to tolerate being in one? And how would we treat a politician who did an experiment and changed their position as a result? Would we just shout “U-turn” at them?

Ironically politicians are already using the results of experimentation in their marketing/election efforts. Obama has a large team of predictive modellers whose task is to identify and target likely voters, and I’m sure Romney has too.

If only they would apply this to their policies!

Data Judge

Another area where data could surely add value, is in the policing and criminal justice system. We can predict vulnerabilities, identify mitigating strategies, and even try and modify punishment using data and experimentation.

Does the death penalty reduce the murder rate? What programmes reduce reoffending? What are the causes of criminality, and can they be reduced?

Sadly we seem to be heading in exactly the opposite direction in the UK. In my opinion, given the importance of statistical understanding in modern forensic evidence, any Judge who can’t do basic level statistics should immediately recuse themselves from any case.

Data Philanthropist

Now to my mind this is the biggest, and one that already has traction in the US.  Jake Porway (@jakeporway) has been leading the field with his wonderful DataKind (@datakind).  The idea is golden: if we can do cool things with data for business/industry, why can’t we do cool things with data for charity/not-for-profits?

And it’s coming to the UK. On the 29th and 30th of September we hope there will be a DataKind data dive in London.  A first chance for UK not-for-profits to get free insight into their data.  A first chance for data scientists to try their hands at data philanthropy.

If you know a charity that could benefit, get in touch. If you want to be involved get in touch too.


The death of Cartography


So it appears that Apple have decided it’s time to ditch Google Maps in favour of their own version, built on TomTom data.

Many people have commented on the business wisdom of this, the relative amounts of money that Google make from Maps compared with Android (about four times as much!), and the relative strengths and weaknesses of the different platforms.

What few people have commented on, which is surprising, is the death of the map maker’s art.

Not in the press release

I accept that this wasn’t one of the things that Apple chose to highlight, but it’s there nevertheless: in the comment that they will access “anonymous real-time crowdsourced data from our iOS users to keep this up to date.”

In layman’s terms, they will be using your input to make the maps more accurate.  And near real time.

Of course this isn’t the first time this has been done.  Openstreetmap.org has done this explicitly for a while now, and in some countries is as reliable as the official maps. So why not go with them, Apple?  Well firstly, because they are open source, Apple would have to release the data back into the wild.  And secondly because Apple need something with a consistent minimum standard now – not in three months’ time.

Another difference is that openstreetmap requires active participation.  With the right analytics behind it, and a far bigger community, Apple’s maps could do so much more.

Number of openstreetmap users: 600 000

Number of iPhones : 100 000 000 +

Issues: privacy and unemployed cartographers

As the folks from openpaths have identified, your phone’s geography tells people a lot about you, and at least in the US there is a question about whether police need a warrant to get at your phone data.  This is a whole lot more accurate data, gathered in a similar way to Waze.

And what about the cartographers? Well, if there is still a place for them it may be as curators of the information, rather like Wikipedia senior editors. If not…

Prof Peter Fader looking dynamic, but wrong (Wharton/Peter Olson)

Why Big Data *isn’t* like CRM

It gives me great pleasure to be able to disagree with a learned document from MIT. Or a Professor from the Wharton Business School.  So both at once?  Joy!  I accept that this is a character flaw, but there we have it.

So what has got me so annoyed?

Well, this article has Peter Fader likening Big Data to the failures of CRM.  Now I was there.  I worked in CRM.  And you, Big Data, are no CRM.

So why is Prof Fader so anti Big Data?

Some of the reasons are just plain dumb.  Yes, more data is not always the same as better data, but deliberately ignoring data is a crazy idea.

What else could it be?  Well (without wanting to go ad-hominem on him) it’s often the case that standing out against perceived wisdom is a better way to make your mark in academia than going with the flow.  Don’t believe in the Higgs Boson?  You’ll get airtime much faster than the thousands who do. Don’t believe in Big Data?  Perhaps MIT will do an article with you…

But perhaps, just perhaps he has some good points.


So let’s explore (for a moment) why CRM failed.

The failures of CRM

When I started out in CRM, Peppers and Rogers had just released the seminal, and still brilliant One-to-One Future.  They argued that companies who made the leap to treating their customers as individuals, who learned from the data that customers provided, would be leaders.  To my mind this idea never failed.  We can look to the world around us and ask the question: which companies actually implemented that one-to-one vision?  Precious few.

So what went wrong?  Why does Prof Fader link the words “frustration,” “disaster,” “expensive,” and “out of control” to CRM?

It’s because for many, including the software company I worked for at the time, CRM became a technology solution and not a business philosophy.

And often the technology didn’t work quite as well as people hoped.  And when it did, companies assumed that putting software in place while changing nothing else was a good approach.  It wasn’t: they just enabled marketeers to do bad things more efficiently.

And if you haven’t seen a lesson for Big Data there then you haven’t been paying attention: Big Data does not equal Hadoop.  If it does then we are in danger of running down the CRM rabbit hole, and Prof Fader will be right.  And I will be denying ever disagreeing with him.


A discussion on Big Data – Teradata Universe 2012

The following notes are recreated from the Big Data Panel Session held at the Teradata Universe conference in Dublin, April 2012.

The panel consisted of Dr Judy Bayer (Director Strategic Analytics, Teradata), Tom Fastner @tfastner (Senior Architect, eBay), Navdeep Alam @YoshiNav (Director Data Architecture, Mzinga), and Professor Mark Whitehorn (University of Dundee).

It was moderated by me… so any false memory syndrome is laid at my door.  Note: I have edited it slightly to turn a conversation into something a bit more readable, I hope the contributors will forgive me!

Let’s start with an easy question: what one word sums up Big Data for you?

Mark: Fun

Judy: For this decade – noisy

Navdeep: Bringing together

Tom: Fun(damental)

Navdeep: Big Data is bringing together technologies, it requires interoperability between systems such as Teradata and Hadoop, SQL and MapReduce, it’s also bringing people together.

What makes it fundamental? And fun?

Tom: If you go back to Crossing the Chasm, we are on the left side of the chasm: the innovators. It is fundamental to get our job right as we are doing it first.

Mark: And I can’t believe people pay me to do this, it’s such fun.

You mentioned noise, Judy, why is that?

Judy: Big Data has always been around, it’s defined as much by current capabilities as anything. And each generation of big data brings noise and confusion as to what to do with it.

Audience: It’s also all about telling a story from the data.

So what makes a good Data Scientist?

Tom: There are six characteristics, they are a special breed who need to be: curious, questioning, good communicator, open minded, someone who can dig deeper…

We have five to ten concurrent users of Hadoop and these are the data scientists. I sit next to one and he’s constantly going “Wow!”.  But they also cause the most problems with their big questions.

Judy: I’d add creativity, a passion to experiment and fail, and a passion for finding the stories in data.

Mark (stroking his luxuriant facial hair): A beard and sandals! No: someone who can think outside the box and be adventurous.

Navdeep: They need to be insatiable when it comes to data.  They also need to be a cry-baby – in that they cannot be satisfied, they should always want more data, hardware, more resources.

The McKinsey Global Institute report from 2011 showed a huge skill shortfall for Big Data Analysis – would you agree?

Navdeep: There is clearly a shortage of skills, you need to mix business and technology, so collaboration is key

Tom: Yes!

Mark: In 2008 I was at a conference when someone asked what the academic world was doing to fix this problem. In response the University of Dundee set up a Masters course in business intelligence.

Audience: Do Data Scientists exist without realising it? Is Data Science a rebranding of existing skills like data mining?

Judy and I have had disputes about whether Data Scientists actually exist…

Judy: Well I believe analysts are born not made, but they need training to fulfill their potential. When it comes to Data Science I think there may be something new here. Data Scientists will be better at collaboration than traditional Data Miners. But we’re at the infancy of the subject, with data and tools that don’t really exist yet. In many ways this is a parallel with the early days of data mining.

Tom: Take Kaggle for example, it’s interesting because of the collaboration between the individuals in the teams. You have to form teams and build on skillsets to produce the best algorithm to solve the problem.

Audience: This is probably re-branding, you need an analyst who can work across areas…

Audience: I find Data Science a restrictive term, it doesn’t capture the art side and the creativity that is required – people are rejecting the easy to use GUI tools and going back to R and programming languages.

Which brings us on to a related topic: what is the most important big data analytical technology?

Navdeep: Massively parallel processing, with fault tolerance on commodity hardware and with support for virtual machines. In other words, removing the complexity of parallel processing, allowing organisations outside of the military and academia to experiment.

Judy: Visualisation – for example ways of visualizing network graphs.

Tom: It isn’t a single technology, it’s an eco-system, and it’ll take many years to develop.

Mark: R – we need languages that let us use this data.

But isn’t there a danger that these languages restrict usage to a niche specialism?

Mark: Good fundamental languages will allow tools to be built on top.

Do those tools exist?  Judy, do you see visualisation as a mature technology? It’s clear that part of the data science skill set is telling stories but the visualisation doesn’t seem to be quite there yet.

Judy: Some of the visualization you see has too much wow factor (trying to be clever) but isn’t easy enough to understand.  It needs to be easy to communicate but also to be actionable.

Mark: The work of Hans Rosling is a brilliant example of clear visualisation.

Navdeep: It’s clear that BI tools are not sufficient alone, custom visualisation needs to be written.

Audience: Have we collected the right data? Do we need to look at what we have and keep everything, or just what’s relevant?

Tom: There are limits to what you can actually store. eBay do delete historical data and certain things like pictures. Some data can be reproduced rather than stored.

Mark: It’s a balance. In the case of proteomics it is relatively more expensive to produce than store – and reprocessing may be required at a later date.

Navdeep: Cloud storage is expensive – so at Mzinga we focus on keeping behavioural data that can’t be reproduced. We use a private cloud solution to store our archives. In the case of Facebook data, we use Hadoop to process it, and keep the results. Currently we purge the source data when it is over 5 years old. We try to recognise what’s valuable and hard to reproduce, and keep that.

If Big Data Fails in 2012 – what will be the cause?

Judy: Keeping everything, our data and businesses, siloed. Not recognising that we have to integrate everything we have.

Mark: Stupidity! We can do it, we can get value. Technically it works, it is people who could cause it to fail.

Navdeep: It comes down to a lack of people who understand and can use the tech. People are needed to drive innovation.

Tom: People, and expectations set by management. It takes time to grow and it is being done successfully. Big Data is a buzzword that will not go away.

Do you see anything that worries you about Big Data?  What about data protection or security?

Tom: We have a lot of data at eBay, but need to be cautious over what we do with the users’ data to avoid alienating them. As a result there are lots of rules regarding what can and can’t be done with the data.

Navdeep: We’ve worked with security agencies, and understand the need to be careful.  It’s important to respect the different laws in different countries.

Judy: Privacy and security will increase in relevance but won’t cause big data to fail. Ways will be found to increase privacy – and laws will need to change to cope with the new world.

Isn’t it our job as data professionals to think about what is reasonable and ethical? Thinking about the Target case, a good comment was: don’t use data to do anything indirectly that you wouldn’t do directly.

Finally, if you were starting a big data project tomorrow and could do anything at all, what would you do?

Navdeep: I would study the universe.  For decades we’ve had measurements from sensors, so I would take all the information and build some analyses and correlations. There is a huge opportunity to bring all this data together.

Mark: Proteomics! But as we’re already doing it I would opt for quantum data, bringing probability theory to the subatomic world.

Tom: Neuro Linguistic Programming, understanding language – can it be done in a database? Could that be more efficient than Hadoop?

Judy: Analytics that would do good for society, for example using analysis to increase literacy. But I’ve got to back Mark too: proteomics, it’s awesome.

Thank you

Additional reporting by @chillax7