A discussion on Big Data – Teradata Universe 2012

The following notes are recreated from the Big Data Panel Session held at the Teradata Universe conference in Dublin, April 2012.

The panel consisted of Dr Judy Bayer (Director Strategic Analytics, Teradata), Tom Fastner @tfastner (Senior Architect, eBay), Navdeep Alam @YoshiNav (Director Data Architecture, Mzinga), and Professor Mark Whitehorn (University of Dundee).

It was moderated by me… so any false memory syndrome is laid at my door. Note: I have edited it slightly to turn a conversation into something a bit more readable; I hope the contributors will forgive me!

Let’s start with an easy question: what one word sums up Big Data for you?

Mark: Fun

Judy: For this decade – noisy

Navdeep: Bringing together

Tom: Fun(damental)

Navdeep: Big Data is bringing together technologies: it requires interoperability between systems such as Teradata and Hadoop, SQL and MapReduce. It’s also bringing people together.

What makes it fundamental? And fun?

Tom: If you go back to Crossing the Chasm, we are on the left side of the chasm: the innovators. It is fundamental to get our job right as we are doing it first.

Mark: And I can’t believe people pay me to do this, it’s such fun.

You mentioned noise, Judy, why is that?

Judy: Big Data has always been around, it’s defined as much by current capabilities as anything. And each generation of big data brings noise and confusion as to what to do with it.

Audience: It’s also all about telling a story from the data.

So what makes a good Data Scientist?

Tom: There are six characteristics. They are a special breed who need to be: curious, questioning, a good communicator, open-minded, someone who can dig deeper…

We have five to ten concurrent users of Hadoop and these are the data scientists. I sit next to one and he’s constantly going “Wow!”.  But they also cause the most problems with their big questions.

Judy: I’d add creativity, a passion to experiment and fail, and a passion for finding the stories in data.

Mark (stroking his luxuriant facial hair): A beard and sandals! No: someone who can think outside the box and be adventurous.

Navdeep: They need to be insatiable when it comes to data.  They also need to be a cry-baby – in that they cannot be satisfied, they should always want more data, hardware, more resources.

The McKinsey Global Institute report from 2011 showed a huge skill shortfall for Big Data Analysis – would you agree?

Navdeep: There is clearly a shortage of skills. You need to mix business and technology, so collaboration is key.

Tom: Yes!

Mark: In 2008 I was at a conference when someone asked what the academic world was doing to fix this problem. In response the University of Dundee set up a Masters course in business intelligence.

Audience: Do Data Scientists exist without realising it? Is Data Science a rebranding of existing skills like data mining?

Judy and I have had disputes about whether Data Scientists actually exist…

Judy: Well, I believe analysts are born not made, but they need training to fulfill their potential. When it comes to Data Science I think there may be something new here. Data Scientists will be better at collaboration than traditional Data Miners. But we’re in the infancy of the subject, with data and tools that don’t really exist yet. In many ways this is a parallel with the early days of data mining.

Tom: Take Kaggle for example, it’s interesting because of the collaboration between the individuals in the teams. You have to form teams and build on skillsets to produce the best algorithm to solve the problem.

Audience: This is probably re-branding, you need an analyst who can work across areas…

Audience: I find Data Science a restrictive term, it doesn’t capture the art side and the creativity that is required – people are rejecting the easy to use GUI tools and going back to R and programming languages.

Which brings us on to a related topic: what is the most important big data analytical technology?

Navdeep: Massively parallel processing, with fault tolerance on commodity hardware and with support for virtual machines. In other words, removing the complexity of parallel processing, allowing organisations outside the military and academia to experiment.

Judy: Visualisation – for example ways of visualising network graphs.

Tom: It isn’t a single technology, it’s an eco-system, and it’ll take many years to develop.

Mark: R – we need languages that let us use this data.

But isn’t there a danger that these languages restrict usage to a niche specialism?

Mark: Good fundamental languages will allow tools to be built on top.

Do those tools exist?  Judy, do you see visualisation as a mature technology? It’s clear that part of the data science skill set is telling stories but the visualisation doesn’t seem to be quite there yet.

Judy: Some of the visualisation you see has too much wow factor (trying to be clever) but isn’t easy enough to understand. It needs to be easy to communicate but also to be actionable.

Mark: The work of Hans Rosling is a brilliant example of clear visualisation.

Navdeep: It’s clear that BI tools alone are not sufficient; custom visualisation needs to be written.

Audience: Have we collected the right data? Do we need to look at what we have and keep everything, or just what’s relevant?

Tom: There are limits on what you can actually store. eBay do delete historical data and certain things like pictures. Some data can be reproduced rather than stored.

Mark: It’s a balance. In the case of proteomics the data is relatively more expensive to produce than to store – and reprocessing may be required at a later date.

Navdeep: Cloud storage is expensive – so at Mzinga we focus on keeping behavioural data that can’t be reproduced. We use a private cloud solution to store our archives. In the case of Facebook data, we use Hadoop to process it, and keep the results. Currently we purge the source data when it is over 5 years old. We try to recognise what’s valuable and hard to reproduce, and keep that.

If Big Data fails in 2012 – what will be the cause?

Judy: Keeping everything, our data and businesses, siloed. Not recognising that we have to integrate everything we have.

Mark: Stupidity! We can do it, we can get value. Technically it works, it is people who could cause it to fail.

Navdeep: It comes down to a lack of people who understand and can use the tech. People are needed to drive innovation.

Tom: People, and expectations set by management. It takes time to grow and it is being done successfully. Big Data is a buzzword that will not go away.

Do you see anything that worries you about Big Data?  What about data protection or security?

Tom: We have a lot of data at eBay, but need to be cautious over what we do with the users’ data to avoid alienating them. As a result there are lots of rules regarding what can and can’t be done with the data.

Navdeep: We’ve worked with security agencies, and understand the need to be careful.  It’s important to respect the different laws in different countries.

Judy: Privacy and security will increase in relevance but won’t cause big data to fail. Ways will be found to increase privacy – and laws will need to change to cope with the new world.

Isn’t it our job as data professionals to think about what is reasonable and ethical? Thinking about the Target case, a good comment was: don’t use data to do anything indirectly that you wouldn’t do directly.

Finally, if you were starting a big data project tomorrow and could do anything at all, what would you do?

Navdeep: I would study the universe.  For decades we’ve had measurements from sensors, so I would take all the information and build some analyses and correlations. There is a huge opportunity to bring all this data together.

Mark: Proteomics! But as we’re already doing it I would opt for quantum data, bringing probability theory to the subatomic world.

Tom: Neuro Linguistic Programming, understanding language – can it be done in a database? Could that be more efficient than Hadoop?

Judy: Analytics that would do good for society, for example using analysis to increase literacy. But I’ve got to back Mark too: proteomics, it’s awesome.

Thank you

Additional reporting by @chillax7