The Conference Conundrum

I’m here at the Teradata Partners conference in Dallas, one of my favourite conferences (full disclosure: I’m employed by Teradata), and I’m enjoying myself immensely.

Of course there are always a few problems with these big conferences:

  1. The air-con is set to arctic
  2. The coffee at breakfast is terrible
  3. I always want to go to sessions that clash

I’ve long since given up on the air-con and the coffee.  It seems these are pretty much immutable laws of conferences.  But what about the scheduling?  Surely there is a (and I hesitate to say it) big data approach to making the scheduling better?

I have no* evidence, but I suspect that current scheduling approaches essentially revolve around avoiding having the same person speak in two places at the same time, and making sure that your ‘big’ speakers are in the biggest halls.

But we’ve all been to presentations in aircraft hangars with three people in the audience, and we’ve all been to the broom-closet with a thousand people straining to hear the presenter.

And above all, we’ve all been hanging around in the corridor trying to decide which of the three clashing sessions we should go to next.

The long walk

So maybe, just maybe, there is a better way.

How? Well this year’s Partners Conference gave us the ability to use an app or an online tool to choose which sessions we wanted to see.  So I did it.  Two minutes in: BZZZZZZZ – you have a clash!  But I wanted to see both of those sessions!  Tough.  Choose one.

But.  What if they had asked me what I wanted to see before they had allocated sessions to time slots and rooms?

They would have ended up with a dataset that would allow someone to optimise the sessions for every attendee.  This would really change the game: we’d be moving from an organiser-designed system to a user-designed system.
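
I have no evidence the organisers would want my help, but here is a minimal sketch of the idea, with entirely made-up attendees, sessions and data structures (a real scheduler would use a proper optimiser, plus room capacities and speaker constraints). Count how many attendees want each pair of sessions, then keep the expensive pairs out of the same time slot:

```python
# Toy sketch of preference-driven scheduling. All names and data are
# invented; this illustrates the idea, not any real conference system.
from collections import defaultdict
from itertools import combinations

# Hypothetical survey results: attendee -> sessions they want to see.
prefs = {
    "alice": {"s1", "s2", "s3"},
    "bob":   {"s2", "s3"},
    "carol": {"s1", "s3", "s4"},
}

# Count how many attendees want each *pair* of sessions: pairs with a
# high count should not share a time slot.
clash_cost = defaultdict(int)
for wanted in prefs.values():
    for pair in combinations(sorted(wanted), 2):
        clash_cost[pair] += 1

def added_clashes(slot, session):
    """How many clashes putting `session` into `slot` would create."""
    return sum(clash_cost[tuple(sorted((session, other)))] for other in slot)

# Greedy pass: place the most-wanted sessions first, each into the time
# slot where it upsets the fewest attendees.
sessions = sorted({s for wanted in prefs.values() for s in wanted},
                  key=lambda s: -sum(s in wanted for wanted in prefs.values()))
slots = [set(), set()]  # two parallel time slots
for s in sessions:
    min(slots, key=lambda slot: added_clashes(slot, s)).add(s)

print(slots)  # e.g. [{'s3', 's4'}, {'s1', 's2'}] - only two clashes survive
```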

But wait! There’s more!

Are you tired of having to walk 500m between sessions?  We could also optimise for distances walked.  And we could make a better guess at which sessions need the aircraft hangar, and which would be just fine in the broom closet. And we could do collaborative filtering and suggest sessions that other people were interested in…
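
And a similarly hypothetical sketch of the suggestion part, using the same made-up shape of preference data: recommend the sessions whose audiences most overlap with the audiences of the sessions you already picked.

```python
# Toy item-based collaborative filtering over session choices.
prefs = {  # invented survey data: attendee -> chosen sessions
    "alice": {"s1", "s2", "s3"},
    "bob":   {"s2", "s3"},
    "carol": {"s1", "s3", "s4"},
}

def jaccard(a, b):
    """Overlap between two sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

def suggest(me, top_n=3):
    # Invert the data: session -> the set of attendees who picked it.
    fans = {}
    for person, wanted in prefs.items():
        for s in wanted:
            fans.setdefault(s, set()).add(person)
    # Score each session I *didn't* pick by how similar its audience is
    # to the audience of my best-matching chosen session.
    scores = {
        s: max(jaccard(fans[s], fans[mine]) for mine in prefs[me])
        for s in fans if s not in prefs[me]
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(suggest("bob"))  # ['s1', 's4'] - bob's crowd also liked s1
```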

And guess what?  We have the technology to do this.

Next year, Partners?

The worst big data article ever?

There are many, many bad articles on big data. It’s almost impossible to move without tripping over another pundit trying to rubbish the topic. Partly this is just the inevitable sound of people trying to make a name for themselves by being counter-factual. It’s far easier to stand out when you’re fighting against the tide. Even if you end up getting very wet…

Fortunately St Nate of Silver has actually analysed the data, and it’s clear that pundits are fundamentally useless.

But occasionally you come across something so egregiously crap that you have to comment.

This week’s crap-du-jour comes courtesy of Tim Leberecht and Fortune.

In it he decides to lump together almost every woe in the world and pile them at the feet of big data. So here are my rebuttals:

Big Data = Big Brother?

This hasn’t been a good couple of weeks for the field of data mining. The NSA scandal has caused sales of Nineteen Eighty-Four to rise, though unfortunately not quite as fast as journalists’ use of the phrase “Big Brother”.

In his article Leberecht oddly passes this over, and instead mentions the evil of passing on data to private companies in a sideways swipe at the quantified self movement.

Perhaps he forgets to mention it because it appears that people can also see the positive side? There are real issues around privacy, anonymity and data security, but pretending that the age of big data is the cause is rather odd.

Big data is not social

Well firstly, hasn’t he heard of Social Network Analysis? But secondly he seems to be advancing the argument that the status of X (relationships) is threatened by allowing Y (data analysis).  Sound familiar?  Yes, that’s the argument against gay marriage: somehow if my gay friends get married, my marriage will be threatened.

Well, for the record, analysing data doesn’t mean that humanity will be diminished. Welcome to the world of science! Were we more human in the 17th century? Or the 13th? Because there was a hell of a lot less analysis then than in the 1950s…

Finally on this topic, what about the growing Data Philanthropy movement? Every week we see new initiatives where people want to apply big data to address real social issues in ways that couldn’t happen before.

Big data creates smaller worlds

Apparently big data filters our perception, and limits our openness to new ideas and cultures. Really? To go back to gay marriage – can we imagine this being a possibility 20 years ago? The ability to interact and identify unusual events and groups means that there is far more diversity than there ever was. A goal of marketers is to open people’s eyes to new things (and to get them to buy them). Leberecht seems to think that the collaborative filtering that Amazon famously use would only ever return you to the same book.

Big data – and opening yourself up to ideas that aren’t part of your narrow ‘intuition’ – will surely make your world bigger and more diverse…

Big data is smarter, not wiser

The article makes it clear that wisdom has a twofold definition in Leberecht’s world: it is based on intuition (guesswork) and it is slow. Oh, and it also rejects feedback. Well firstly, big data isn’t always fast. Believe me, Hadoop isn’t a solution suited to the rigours of rapid operationalisation. There are other things for that. But as a definition of wisdom this seems to be a disaster.

Not only should you take the risk of making the wrong decision (intuition is guesswork), but you should do it slowly, and without paying attention to any feedback you get.  Truly this is fail slow writ large.

Big data is too obvious

I think this is the heart of Leberecht’s argument. He didn’t think of big data.

His example (that the financial collapse was caused by measurement) is patently wrong. The problem with mortgage-backed securities was exactly the opposite: people failed to measure the risk and relied on intuition that the boom was going to continue indefinitely.

Big data doesn’t give

And then finally we hit the sleeping policeman of C. P. Snow’s two cultures: you are either an artist or a scientist. A businessman or a data scientist. Creativity belongs to the former, sterile analysis to the latter.

I’ll let you guess what I think of that!

A big data glossary

Big data is serious. Really serious. But at Big Data Week it became clear that we need to be able to laugh at ourselves… So here is my initial attempt at an alternative glossary for big data. Thanks to the many contributors (intentional and not), and apologies to anyone who disagrees. Enjoy.

[Image: Wordle word cloud of the big data glossary]

Agile analytics: Fast, iterative, experimental analysis often performed by people who aren’t.

Apache: Shorthand for the Apache Software Foundation, home to some 100 open source projects. Not funny, but important.

Big brother: 1) Notional leader of Oceania in dystopian future novel Nineteen Eighty-Four. 2) Regular data-related leader on the front page of the Daily Mail in dystopian present 2013.

Big data: Any data problem that I’m working on. See also small data.

Cassandra: Open source, distributed data management system developed at Facebook. See Evil Empire.

Clickstream data: Data logged by client or web server as users navigate a website or software system. As in “In case of death please delete my clickstream”.

Data <insert profession>: Way of making a profession sound up to 22.7 times more sexy. Examples: data journalist, data scientist, data philanthropist. Not yet tried: data accountant, data politician.

Data journalist: Journalist who lets facts get in the way of a story.

Data miner: A data scientist who isn’t interested in a pay rise. Note that this is not a miner who wants to sound more sexy.

Data mining: The queen of sciences. Alternatively a business process that discovers patterns, or makes predictions from data sets using machine learning approaches.

Data model: The output of data modelling. Note that this is not a model who wants to sound more sexy.

Data modelling: Evil way of understanding how a process is reflected in data. Frowned upon.

Data scientist: A magical unicorn. With a good salary.

Dilbert test: Simple test for separating data scientists from IT people. If you take two equally funny cartoon strips, one Dilbert, the other XKCD, then a data scientist will prefer XKCD. An IT professional will prefer Dilbert. If anyone prefers Garfield they are in completely the wrong profession.

Elephants: Obligatory visual or textual reference by anyone involved in Hadoop. They are not the only animals in the zoo.

ETL: Extract, transform and load (ETL) – software and process used to move data.  Not to be confused with FTL, ELT, BLT, or more importantly DLT.

Evil Empire: Large software/hardware/service vendor of choice. For example, Apple, Google, Facebook, IBM, Microsoft etc… The tag is independent of their evil-doing capabilities.

Facebook: Privacy wrecking future Evil Empire.  Source of interesting graph data.

Fail fast: Low-risk, experimental approach to big data innovation where failure is seen as an opportunity to learn. Also excellent excuse when analysis goes terribly, terribly wrong.

Fail slow: Really, really bad idea.

Fork: Split in an open source project when two or more people stop talking to each other.

Google: Privacy wrecking future Evil Empire.  Source of interesting search data.

Google BigQuery: See Evil Empire.

Hadoop: Open source software controlled by the Apache Software Foundation that enables the distributed processing of large data sets across clusters of commodity servers. Often heard from senior managers: “Get me one of those Hadoops that everyone is talking about.”

Hadoop Hive: SQL interface to Hadoop MapReduce. Because SQL is evil. Except when it isn’t.

In-Memory Database: Trying to remember what you just did rather than writing it down. See also fail fast.

Internet of things: Stuff talking to other stuff. See singularity.

Java: Dominant programming language developed in the 1990s at Sun Microsystems, and the language in which Hadoop and other big data technologies were later written.  More importantly, a type of coffee.

Key value pairs: A way of avoiding evil data modelling.

MapReduce: Programming paradigm that enables scalability across multiple servers in Hadoop, essentially making it easier to process information on a vast scale (a toy sketch of the idea appears after this glossary). Sounds easy, doesn’t it?

MongoDB: NoSQL open source document-oriented database system developed and supported by 10gen. Its name derives from the word ‘humongous’. Seriously, where do they get these names?

NoSQL: No SQL. It’s evil. Do not use it, use something else instead.

NOSQL: Frequently (and understandably) confused with NoSQL. It actually means not only SQL, or that you forgot to turn off caps lock.

No! SQL!: Phrase frequently heard as a badly written query takes down your database/drops your table.  See Bobby Tables.

Open data: Data, often from government, made freely available to the public. See also .pdf

Open source: An expensive way to not pay for something.

Pig: High-level platform for creating MapReduce programs used with Hadoop, also from Apache.

Python: 1) Dynamic programming language, first developed in the late 1980s. 2) Dynamic comedy team, first developed in the late 1960s.

Real time: Illusory requirement for projects.

Relational database management systems (RDBMS): Big ol’ databases.

Singularity: When the stuff talking to other stuff finds our conversation boring.

Small data: (pejorative) Data that you are working on.

Social Media: Facebook, Twitter etc… Sources of trivial data in large volumes. Will save the world somehow.

Social Network: 1) A collection of people and how they interact. May also be on social media. 2) What data scientists really hope they will be part of.

SQL: Standard, structured query language specifically designed for managing data held in Big ol’ databases.

Twitter: Privacy wrecking future #EvilEmpire.  Source of interesting data in less than 140 characters.

Unstructured data: 1) The crap that just got handed to you with a deadline of tomorrow 2) The brilliant data source you just analysed way ahead of expectations. Often includes video and audio. You weren’t watching YouTube for fun.

V: Letter mysteriously associated with big data definitions. Usually comes in threes.  No one really knows why.

Variety: The thing you fail to get in big data definitions. “I didn’t see much variety in that definition!”

Velociraptors: Scary dinosaurs that really should be part of the definition of big data, terribly underused.

Velocity: The speed at which a speaker at a conference moves from the topic to praising their own company.

Volume: The degree to which a business analyst’s hair exceeds expectations.

XKCD: Important reference manual for big data.

ZooKeeper: Another open source Apache project, which provides a centralised infrastructure and services that enable synchronisation across a cluster.  Given all the animals in big data you can see why this was needed.
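
And since the MapReduce entry promised a sketch: the whole paradigm boils down to a map step that emits key-value pairs, a shuffle that groups them by key, and a reduce step that combines each group. A toy, single-machine imitation (real Hadoop runs the phases in parallel across a cluster, which is the entire point):

```python
# Word counting, MapReduce-style, squeezed onto one machine.
from itertools import groupby
from operator import itemgetter

docs = ["big data is big", "data about data"]

# Map: emit a (key, value) pair for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the pairs by key (Hadoop does this between the phases).
mapped.sort(key=itemgetter(0))

# Reduce: combine all the values for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'about': 1, 'big': 2, 'data': 3, 'is': 1}
```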

7 Things from Strata Santa Clara

This is the fifth time I’ve made the pilgrimage to Strata – I was lucky enough to be at the very first event, here in Santa Clara, and that’s made me think about how things have changed over the last two years.

Two years ago big data was new and shiny. About 1600 people turned up at the south end of Silicon Valley to enthuse about what we could do with data.

Now we’re talking about legacy Hadoop systems, and data science (big data is so 2011), but what else has changed?

1)   Hadoop grew up

The talk this year wasn’t about this new shiny thing called Hadoop, it was about which distro was the best (with a new Intel distro being announced), and which company had the biggest number of coders working on the original open source code. Seriously, there were almost fistfights over the bragging rights.

As a mark of the new seriousness the sports jacket to t-shirt ratio was up. But don’t worry, the PC to Mac ratio was still tending to zero (the battle was between Air and Pro).

2)   NoSQL is very much secondary to Hadoop

The show is extremely analytically oriented (in a database sense… but more of that later). The NoSQL vendors are there, but attract a fraction of the attention.

3)   SQL is back

Yes, really.  It turns out it is useful for something after all.

4)   Everyone is looking for ways to make it actually work (and work fast)

Hadoop isn’t perfect, and there are a wide range of companies trying to make it work better. Oddly, there is a fetishisation of speed – odd because this is something the data warehouse companies went through in the days when it was all about the TPC benchmarks. No, people: scaling doesn’t just mean big and fast. It means usable by lots of people, and a whole raft of other things.

Greenplum were trying to convince us that theirs was fastest. Intel told us that SAP HANA was faster, and more innovative. Really. And the list went on.

Rather worryingly there seems to be a group out there who want to try and reinvent the dead end of TPC* but for Hadoop.

5)   There’s a lot of talk about Bayes, but not many data miners 

I ran a session on Data Mining. Only a handful of people out of about 200 in the room would admit to being data miners. This is terrifying, as data scientists are trying to do analytical tasks. Coding a random forest does not make you a data miner!

6)   Data philanthropy is (still) hot

We had a keynote from Code for America, and lots of talk about ethics, black hats etc… I ran a Birds of a Feather table on Doing Good With Data. A group of us were talking about the top secret PROJECT EPHASUS.

7)   The Gartner Hype Cycle is in the trough of disillusionment

At least as far as big data is concerned. The show sold out. And the excitement was still there. Data Science has arrived.

* For a fun review about the decision that Teradata made to withdraw from TPC benchmarks, try this by my esteemed colleague Martin Willcox.

Religion, politics and data mining

Three topics which are unlikely to get you invited to the best dinner parties. So how (if at all) are they related?

Let’s start with data mining.

Data mining

Every year at about this time, Karl Rexer sends out a survey to data miners asking them about the tools they use, and the problems they address. 

Over the years since 2007 there have been a few notable changes. Firstly, the rise in the number of people completing the survey: from 300 to 1300 (2011). But also the change in the toolsets people are using.

If you go back to 2007 the most popular (and liked) tools were SAS and SPSS Clementine. In 2008, R appeared for the first time, and by 2010 R was the most popular tool, while Clementine (by then IBM SPSS Modeler) had almost disappeared.

Some aspects of this are hardly surprising. Clementine was always an expensive tool to purchase, and it has a limited set of algorithms (yes, I know you can add your own nodes, but…).  R is free. There are algorithms galore, even if they can be tricky to find.  Increasingly it’s the tool that students, especially those researching their PhDs, will use at university.  It’s the data mining tool of the big data movement.

Religion

Back in the 1500s religion, specifically the tensions within religion, was deeply intertwined with the disruptive technology of the day: movable type.

The religious dispute was the Reformation of the Catholic Church, and one of the core aspects of this was the relationship between individuals and God. Did we need a priesthood or saints to intercede on our behalf? Couldn’t we read the word of God ourselves, and learn the lessons directly?  The bible was the big data of the 1500s, and the printing press was the technology that allowed people to translate the bible and provide access to the truth.

Sound a bit familiar?

Politics

Yet here we are in the midst of the big data reformation and it appears that we are creating a priesthood, not removing it.  What makes Clementine a great tool is that it is intensely and beautifully usable. It has an interface that is elegant and intuitive.  This allowed data miners to focus on the meaning of the analysis, not the mechanism of the analysis.

R, for all its strengths, is not elegant or beautiful. When you use R it is like stepping back into the world of SAS procs in the 90s. Too much effort is spent getting the software to work, and not enough is spent on the bigger picture.

“Aha!” you say, “But I’m an expert – I can use R at lightning speeds, I don’t need an elegant GUI!”   And so the politics is born.

Do we believe that data mining and data science are the province of a priesthood?  The select few, who will interpret the holy truths on behalf of the unwashed masses (let’s call them managers)? Or do we believe that data science should be democratised and made available to as many people as possible? Can they handle the truth?

Can we navigate the political waters, get over our religious differences, and deliver on the promise of analysis in the world of big data?

Doing good with analytics: the pledge

As a data professional I pledge

  • I will be Aware of the outcome and impact of my analysis
  • I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
  • I will be an Agent for change: use my analytical powers for positive good
  • I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine

Sign the pledge online at http://www.causes.com/actions/1694321

Doing Good With Data: the case for the ethical Data Scientist

This post is designed to accompany my presentation to the Teradata Partners User Group, but hopefully some of the links will prove useful even if you couldn’t get to the presentation itself.

Needless to say, the most important part – the Pledge – is right at the bottom.  Feel free to skip to it if you like!

Law

The law (as it relates to data – well actually pretty much all law) is complex, highly jurisdictional, and most importantly of all, at least 10 years behind reality.  Neither Judy nor I are lawyers, but hopefully these links provide some general background:

One of the first legal agreements was the OECD’s position on data transfers between countries. It dates from the late 70s (adopted in 1980), when flares were hot and digital watches were the dream of a young Douglas Adams: http://itlaw.wikia.com/wiki/Guidelines_on_the_Protection_of_Privacy_and_Transborder_Flows_of_Personal_Data

Much later the EU released the snappily titled EU 95/46/EC – better known as the Data Directive. The joy is that each country can implement it differently, resulting in confusion.  There are currently proposals out for consultation on updating it too: http://en.wikipedia.org/wiki/Data_Protection_Directive

Of course the EU and the US occasionally come to different decisions, and for a brief discussion of some of the major differences between them you can try this: http://www.privireal.org/content/dp/usa.php

Don’t be evil

Google’s famous take on the Hippocratic oath can be simplified as ‘don’t be evil’. As we say in the presentation, this is necessary, but scarcely enough.  It also has the disadvantage of being passive. In its expanded form it’s available here: http://investor.google.com/corporate/code-of-conduct.html

Doing Good With Data

Now for the fun bits!  For information on the UN Global Pulse initiative: http://www.unglobalpulse.org/ 

Data 4 Development – the Orange France Telecom initiative in Ivory Coast: http://www.d4d.orange.com/home

If you have a bent for European socialised medicine, then the NHS hack days are here: http://nhshackday.com/

DataKind

And our favourite – with a big thanks to Jake Porway and Craig Barowsky – is DataKind: http://datakind.org/ You can also follow @DataKind

To find out more about the UK charities mentioned check out Place2Be https://www.place2be.org.uk/ and KeyFund http://www.keyfund.org.uk/

Please take the time to register with DataKind, and keep your eyes open for other opportunities.  We hope that DataKind will be open for business in the UK soon too!

The Pledge

Please go and look at the Pledge, and if you think you can, then sign up.  If you have one of our printed cards, take it, sign it and put it on your cube wall (or your refrigerator – wherever it will remind you of your commitment). But sign the online one too.  And once you’ve done that, let the world know! Get them to sign up. If you want a Word copy of the printable one just drop me a line.

http://www.causes.com/actions/1694321

Why Big Data *isn’t* like CRM

It gives me great pleasure to be able to disagree with a learned document from MIT. Or a professor from the Wharton Business School.  So both at once?  Joy!  I accept that this is a character flaw, but there we have it.

So what has got me so annoyed?

Well this article has Peter Fader likening Big Data to the failures of CRM.  Now I was there.  I worked in CRM.  And you, Big Data, are no CRM.

So why is Prof Fader so anti Big Data?

Some of the reasons are just plain dumb.  Yes, more data is not always the same as better data, but deliberately ignoring data is a crazy idea.

What else could it be?  Well (without wanting to go ad-hominem on him) it’s often the case that standing out against received wisdom is a better way to make your mark in academia than going with the flow.  Don’t believe in the Higgs Boson?  You’ll get airtime much faster than the thousands who do. Don’t believe in Big Data?  Perhaps MIT will do an article with you…

But perhaps, just perhaps he has some good points.

Prof Peter Fader looking dynamic, but wrong (Wharton/Peter Olson)

So let’s explore (for a moment) why CRM failed.

The failures of CRM

When I started out in CRM, Peppers and Rogers had just released the seminal, and still brilliant, One-to-One Future.  They argued that companies who made the leap to treating their customers as individuals, who learned from the data that customers provided, would be leaders.  To my mind this idea never failed.  We can look to the world around us and ask the question: which companies actually implemented that one-to-one vision?  Precious few.

So what went wrong?  Why does Prof Fader link the words “frustration,” “disaster,” “expensive,” and “out of control” to CRM?

It’s because for many, including the software company I worked for at the time, CRM became a technology solution and not a business philosophy.

And often the technology didn’t work quite as well as people hoped.  And when it did, companies assumed that putting software in place but changing nothing else was a good approach.  It wasn’t: they just enabled marketeers to do bad things more efficiently.

And if you haven’t seen a lesson for Big Data there then you haven’t been paying attention: Big Data does not equal Hadoop.  If it does then we are in danger of running down the CRM rabbit hole, and Prof Fader will be right.  And I will be denying ever disagreeing with him.

A discussion on Big Data – Teradata Universe 2012

The following notes are recreated from the Big Data Panel Session held at the Teradata Universe conference in Dublin, April 2012.

The panel consisted of Dr Judy Bayer (Director Strategic Analytics, Teradata), Tom Fastner @tfastner (Senior Architect, eBay), Navdeep Alam @YoshiNav (Director Data Architecture, Mzinga), and Professor Mark Whitehorn (University of Dundee).

It was moderated by me… so any false memory syndrome is laid at my door.  Note: I have edited it slightly to turn a conversation into something a bit more readable, I hope the contributors will forgive me!

Let’s start with an easy question: what one word sums up Big Data for you?

Mark: Fun

Judy: For this decade – noisy

Navdeep: Bringing together

Tom: Fun(damental)

Navdeep: Big Data is bringing together technologies, it requires interoperability between systems such as Teradata and Hadoop, SQL and MapReduce, it’s also bringing people together.

What makes it fundamental? And fun?

Tom: If you go back to Crossing the Chasm, we are on the left side of the chasm: the innovators. It is fundamental to get our job right as we are doing it first.

Mark: And I can’t believe people pay me to do this, it’s such fun.

You mentioned noise, Judy, why is that?

Judy: Big Data has always been around, it’s defined as much by current capabilities as anything. And each generation of big data brings noise and confusion as to what to do with it.

Audience: It’s also all about telling a story from the data.

So what makes a good Data Scientist?

Tom: There are six characteristics – they are a special breed who need to be: curious, questioning, good communicators, open minded, someone who can dig deeper…

We have five to ten concurrent users of Hadoop and these are the data scientists. I sit next to one and he’s constantly going “Wow!”.  But they also cause the most problems with their big questions.

Judy: I’d add creativity, a passion to experiment and fail, and a passion for finding the stories in data.

Mark (stroking his luxuriant facial hair): A beard and sandals! No: someone who can think outside the box and be adventurous.

Navdeep: They need to be insatiable when it comes to data.  They also need to be a cry-baby – in that they cannot be satisfied: they should always want more data, more hardware, more resources.

The McKinsey Global Institute report from 2011 showed a huge skill shortfall for Big Data Analysis – would you agree?

Navdeep: There is clearly a shortage of skills, you need to mix business and technology, so collaboration is key

Tom: Yes!

Mark: In 2008 I was at a conference when someone asked what the academic world was doing to fix this problem. In response the University of Dundee set up a Masters course in business intelligence.

Audience: Do Data Scientists exist without realising it? Is Data Science a rebranding of existing skills like data mining?

Judy and I have had disputes about whether Data Scientists actually exist…

Judy: Well I believe analysts are born not made, but they need training to fulfil their potential. When it comes to Data Science I think there may be something new here. Data Scientists will be better at collaboration than traditional Data Miners. But we’re in the infancy of the subject, with data and tools that don’t really exist yet. In many ways this is a parallel with the early days of data mining.

Tom: Take Kaggle for example, it’s interesting because of the collaboration between the individuals in the teams. You have to form teams and build on skillsets to produce the best algorithm to solve the problem.

Audience: This is probably re-branding, you need an analyst who can work across areas…

Audience: I find Data Science a restrictive term, it doesn’t capture the art side and the creativity that is required – people are rejecting the easy to use GUI tools and going back to R and programming languages.

Which brings us on to a related topic: what is the most important big data analytical technology?

Navdeep: Massively parallel processing, with fault tolerance on commodity hardware and with support for virtual machines. In other words removing the complexity of parallel processing, allowing organisations outside of the military and academia to experiment.

Judy: Visualisation – for example ways of visualizing network graphs.

Tom: It isn’t a single technology, it’s an eco-system, and it’ll take many years to develop.

Mark: R – we need languages that let us use this data.

But isn’t there a danger that these languages restrict usage to a niche specialism?

Mark: Good fundamental languages will allow tools to be built on top.

Do those tools exist?  Judy, do you see visualisation as a mature technology? It’s clear that part of the data science skill set is telling stories but the visualisation doesn’t seem to be quite there yet.

Judy: Some of the visualization you see has too much wow factor (trying to be clever) but isn’t easy enough to understand.  It needs to be easy to communicate but also to be actionable.

Mark: The work of Hans Rosling is a brilliant example of clear visualisation.

Navdeep: It’s clear that BI tools are not sufficient alone, custom visualisation needs to be written.

Audience: Have we collected the right data? Do we need to look at what we have and keep everything, or just what’s relevant?

Tom: There are limitations on what you can actually store. eBay do delete historical data and certain things like pictures. Some data can be reproduced rather than stored.

Mark: It’s a balance. In the case of proteomics it is relatively more expensive to produce than store – and reprocessing may be required at a later date.

Navdeep: Cloud storage is expensive – so at Mzinga we focus on keeping behavioural data that can’t be reproduced. We use a private cloud solution to store our archives. In the case of Facebook data, we use Hadoop to process it, and keep the results. Currently we purge the source data when it is over 5 years old.  We try to recognise what’s valuable and hard to reproduce, and keep that.

If Big Data Fails in 2012 – what will be the cause?

Judy: Keeping everything, our data and businesses, siloed. Not recognising that we have to integrate everything we have.

Mark: Stupidity! We can do it, we can get value. Technically it works, it is people who could cause it to fail.

Navdeep: It comes down to a lack of people who understand and can use the tech. People are needed to drive innovation.

Tom: People, and expectations set by management. It takes time to grow and it is being done successfully.  Big Data is a buzz word that will not go away.

Do you see anything that worries you about Big Data?  What about data protection or security?

Tom: We have a lot of data at eBay, but need to be cautious over what we do with the users’ data to avoid alienating them. As a result there are lots of rules regarding what can and can’t be done with the data.

Navdeep: We’ve worked with security agencies, and understand the need to be careful.  It’s important to respect the different laws in different countries.

Judy: Privacy and security will increase in relevance but won’t cause big data to fail. Ways will be found to increase privacy – and laws will need to change to cope with the new world.

Isn’t it our job as data professionals to think about what is reasonable and ethical? Thinking about the Target case, a good comment was: don’t use data to do anything indirectly that you wouldn’t do directly.

Finally, if you were starting a big data project tomorrow and could do anything at all, what would you do?

Navdeep: I would study the universe.  For decades we’ve had measurements from sensors, so I would take all the information and build some analyses and correlations. There is a huge opportunity to bring all this data together.

Mark: Proteomics! But as we’re already doing it I would opt for quantum data, bringing probability theory to the subatomic world.

Tom: Natural language processing, understanding language – can it be done in a database? Could that be more efficient than Hadoop?

Judy: Analytics that would do good for society, for example using analysis to increase literacy. But I’ve got to back Mark too: proteomics, it’s awesome.

Thank you

Additional reporting by @chillax7

NoSQL? NOSQL? How about NOHadoop?

Comrades (for that is how all good exhortations to action begin), it is time for us to stand up against a Heresy that is sweeping the world of Big Data Science.

The Heresy is that there is only one God, and its name is Hadoop. This yellow elephant is being taken by some to be the alpha and omega of data science.  Just the other day an eminent blogger started a comment by saying “If Logo of BIG data is Elephant, What is the Logo of Analytics?”

And this is annoying.

It’s like saying the logo of driving is a prancing horse (sorry Ferrari).  Or calling a computer tablet an iPad.  Well forget that last one. But you get the idea.  Hadoop may be the fastest example of eponymy ever; it has almost become a generic brand name. “Let’s get us some of those there Hadoops” can virtually be heard coming from the boardrooms of the Global 3000.

But only almost.

There is still time for the business idea of big data science to triumph if all of us non-Hadoop-struck folks get together.

So, if you like MongoDB, if you think SQL has a few tricks up its sleeve, if you R a data mining pirate, if you think the use comes first and the technology comes second, then join the Not Only Hadoop campaign.

Say #NOHadoop