Sexist algorithms

Can an algorithm* be sexist? Or racist? In my last post I said no, and ended up in a debate about it. Partly that was about semantics, what parts of the process we call an algorithm, where personal ethical responsibility lies, and so on.

Rather than heading down that rabbit hole, I thought it would be interesting to go further into the ethics of algorithmic use…  Please remember – I’m not a philosopher, and I’m offering this for discussion. But having said that, let’s go!

The model

To explore the idea, let’s do a thought experiment based on a parsimonious linear model from the O’Reilly Data Science Salary Survey (and you should really read that anyway!)

So, here it is:

70577 intercept
 +1467 age (per year above 18; e.g., 28 is +14,670)
 –8026 gender=Female
 +6536 industry=Software (incl. security, cloud services)
–15196 industry=Education
 –3468 company size: <500
  +401 company size: 2500+
+32003 upper management (director, VP, CxO)
 +7427 PhD
+15608 California
+12089 Northeast US
  –924 Canada
–20989 Latin America
–23292 Europe (except UK/I)
–25517 Asia

The model was built from data supplied by data scientists across the world, and is in USD.  As the authors state:

“We created a basic, parsimonious linear model using the lasso with R2 of 0.382.  Most features were excluded from the model as insignificant”

Let’s explore potential uses for the model, and see if, in each case, the algorithm behaves in a sexist way.  Note: it’s the same model! And the same data.

Use case 1: How are data scientists paid?

In this case we’re really interested in what the model is telling us about society (or rather the portion of society that incorporates data scientists).

This tells us a number of interesting things: older people get paid more, California is a great place, and women get paid less.

–8026 gender=Female

This isn’t good.

Back to the authors:

“Just as in the 2014 survey results, the model points to a huge discrepancy of earnings by gender, with women earning $8,026 less than men in the same locations at the same types of companies. Its magnitude is lower than last year’s coefficient of $13,000, although this may be attributed to the differences in the models (the lasso has a dampening effect on variables to prevent over-fitting), so it is hard to say whether this is any real improvement.”

The model has discovered something (or, more probably, confirmed something we had a strong suspicion about).  It has noticed, and represented, a bias in the data.

Use case 2: How much should I expect to be paid?

This use case seems fairly benign.  I take the model, and add my data. Or that of someone else (or data that I wish I had!).

I can imagine that if I moved to California I might be able to command an additional $15,000. Which would be nice.
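To make that concrete, here is a minimal sketch of "adding my data" to the model. The coefficients are copied from the list above; the dictionary keys and the `predict_salary` helper are my own illustrative names, not anything from the survey.

```python
# Coefficients copied from the model listed above (USD).
# The key names are illustrative labels, not the survey's own.
COEFFICIENTS = {
    "intercept": 70577,
    "age_per_year_above_18": 1467,
    "gender_female": -8026,
    "industry_software": 6536,
    "industry_education": -15196,
    "company_size_under_500": -3468,
    "company_size_2500_plus": 401,
    "upper_management": 32003,
    "phd": 7427,
    "california": 15608,
    "northeast_us": 12089,
    "canada": -924,
    "latin_america": -20989,
    "europe_except_uk_i": -23292,
    "asia": -25517,
}

# These two terms are always handled by the function itself.
RESERVED = {"intercept", "age_per_year_above_18"}


def predict_salary(age, flags=()):
    """Sum the intercept, the age term and any yes/no features that apply."""
    total = COEFFICIENTS["intercept"]
    total += COEFFICIENTS["age_per_year_above_18"] * max(age - 18, 0)
    for flag in flags:
        if flag in RESERVED or flag not in COEFFICIENTS:
            raise ValueError(f"unknown feature: {flag}")
        total += COEFFICIENTS[flag]
    return total


# A hypothetical 28-year-old working in software, before and after the move.
before = predict_salary(28, ["industry_software"])
after = predict_salary(28, ["industry_software", "california"])
print(before, after, after - before)
```

Swap in `gender_female` and the same person's prediction drops by 8,026, which is exactly the issue the next use case runs into.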

Use case 3: How much should I pay someone?

On the other hand, this use case doesn’t seem so good. I’m using the model to reinforce the bad practice it has uncovered.  In some legal systems this might actually be illegal, because if I take the advice of the model I will be discriminating against women (I’m not a lawyer, but you don’t need legal advice on this: just don’t do it).

Even if you aren’t aware of the formula, if you rely on this model to support your decisions, then you are in the same ethical position, which raises an interesting challenge in terms of ethics. The defence “I was just following the algorithm” is probably about as convincing as “I was just following orders”.  You have a duty to investigate.

But imagine the model was a random forest. Or a deep neural network. How could a layperson be expected to understand what was happening deep within the code? Or for that matter, how could an expert know?

The solution, of course, is to think carefully about the model, adjust the data inputs (let’s take gender out), and measure the output against test data. That last one is really important, because in the real world there are lots of proxies…
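One way to do that last step is to hold gender back from the model but keep it alongside the test data, purely for auditing. The sketch below assumes you already have predictions from some model that never saw gender; the numbers and the `mean_prediction_gap` helper are made up for illustration.

```python
from statistics import mean


def mean_prediction_gap(predictions, genders, group_a="Female", group_b="Male"):
    """Mean predicted salary for group_a minus the mean for group_b."""
    a = [p for p, g in zip(predictions, genders) if g == group_a]
    b = [p for p, g in zip(predictions, genders) if g == group_b]
    if not a or not b:
        raise ValueError("both groups need at least one record")
    return mean(a) - mean(b)


# Made-up test records. Gender is NOT an input to the hypothetical model;
# it is kept only so the outputs can be checked.
test_genders = ["Female", "Female", "Male", "Male"]
model_predictions = [82000, 86000, 90000, 94000]

gap = mean_prediction_gap(model_predictions, test_genders)
print(gap)
```

A gap that stays large after gender has been removed suggests that some other input is acting as a proxy. A raw gap like this doesn't control for legitimate differences between the groups, so treat it as a prompt to investigate rather than a verdict.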

Use case 4: What salary level would a candidate accept?

And now we’re into really murky water. Imagine I’m a consultant, and I’m employed to advise an HR department. They’ve decided to make someone an offer of $X and they ask me “do you think they will accept it?”.

I could ignore the data I have available: that gender has an impact on salaries in the marketplace. But should I? My Marxist landlord (don’t ask) says: no – it would be perfectly reasonable to ignore the gender aspect, and say “You are offering above/below the typical salary”**. I think it’s more nuanced – I have a clash between professional ethics and societal ethics…

There are, of course, algorithmic ethics to be considered. We’re significantly repurposing the model. It was never built to do this (and, in fact, if you were going to build a model to do this kind of thing it might be very, very different).


It’s interesting to think that the same model can effectively be used in ways that are ethically very, very different. In all cases the model is discovering/uncovering something in the data, and – it could be argued – is embedding that fact. But the impact depends on how it is used, and that suggests to me that claiming the algorithm is sexist is (perhaps) a useful shorthand in some circumstances, but very misleading in others.

And in case we think that this sort of thing is going to go away, it’s worth reading about how police forces are using algorithms to predict misconduct.


*Actually to be more correct I mean a trained model…

** His views are personal, and not necessarily a representation of Marxist thought in general.



The ethics of data science (some initial thoughts)

Last night I was lucky enough to attend a dinner hosted by TechUK and the Royal Statistical Society to discuss the ethics of big data. As I’m really not a fan of the term I’ll pretend it was about the ethics of data science.

Needless to say there was a lot of discussion around privacy, the DPA and European Data Directives (although the general feeling was against a legalistic approach), and the very real need for the UK to do something so that we don’t end up having an approach imposed from outside.

People first


Kant: not actually a data scientist, but something to say on ethics

Both Paul Maltby and I were really interested in the idea of a code of conduct for people working in data – a bottom-up approach that could inculcate a data-for-good culture. This is possibly the best time to do this – there are still relatively few people working in data science, and if we can get these people now…

With that in mind, I thought it would be useful to remind myself of the data-for-good pledge that I put together, and (unsuccessfully) launched:

  • I will be Aware of the outcome and impact of my analysis
  • I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
  • I will be an Agent for change: use my analytical powers for positive good
  • I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine

OK, way too much alliteration. But (other than the somewhat West Coast Awesomeness) basically a good start. 

The key thing here is that, as a data scientist, I can’t pretend that it’s just data. What I do has consequences.

Ethics in process

But another way of thinking about it is to consider the actual processes of data science – here adapted loosely from the CRISP-DM methodology.  If we think of things this way, then we can consider ethical issues around each part of the process:

  • Data collection and processing
  • Analysis and algorithms
  • Using and communicating the outputs
  • Measuring the results

Data collection and processing

What are the ethical issues here?  Well ensuring that you collect with permission, or in a way that is transparent, repurposing data (especially important for data exhaust), thinking carefully about biases that may exist, and planning and thinking about end use.

Analysis and algorithms

I’ll be honest – I don’t believe that data science algorithms are racist or sexist. For a couple of reasons: firstly those require free-will (something that a random forest clearly doesn’t have), secondly that would require the algorithm to be able to distinguish between a set of numbers that encoded for (say) gender and another that coded for (say) days of the week. Now the input can contain data that is biased, and the target can be based on behaviours that are themselves racist, but that is a data issue, not an algorithm issue, and rightly belongs in another section.

But the choice of algorithm is important. As is the approach you take to analysis. And (as you can see from the pledge) an awareness that this represents people and that the outcome can have impact… although that leads neatly on to…

Using and communicating the outputs

Once you have your model and your scores, how do you communicate its strengths, and more importantly its weaknesses? How do you make sure that it is being used correctly and ethically? I would urge people to compare things against current processes rather than theoretical ideals.  For example, the output may have a gender bias, but (assuming I can’t actually remove it) is it less sexist than the current system? If so, it’s a step forwards…

I only touched on communication, but really this is a key, key aspect. Let’s assume that most people aren’t really aware of the nature of probability. How can we educate people about the risks and the assumptions in a probabilistic model? How can we make sure that the people who take decisions based on that model (and they probably won’t be data scientists) are aware of the implications?  What if they’re building it into an automated system? Well in that case we need to think about the ethics of:

Measuring the results

And the first question would be, is it ethical to use a model where you don’t effectively measure the results? With controls?

This is surely somewhere where we can learn from both medicine (controls and placebos) and econometrics (natural experiments). But both require us to think through the implications of action and inaction.

Using Data for Evil IV: The Journey Home

If you’re interested in talking through ethics more (and perhaps from a different perspective) then all of this will be a useful background for the presentation that Fran Bennett and I will be giving at Strata in London in early June.  And to whet your appetite, here is the hell-cycle of evil data adoption from last year…





It’s not just the Hadron Collider that’s Large: super-colliders and super-papers

During most of my career in data science, I’ve been used to dealing with analysis where there is an objective correct answer. This is bread and butter to data mining: you create a model and test it against reality.  Your model is either good or bad (or sufficiently good!) and you can choose to use it or not.

But since joining THE I’ve been faced with another, and in some ways very different problem – building our new World University Rankings – a challenge where there isn’t an absolute right answer.

So what can you, as a data scientist, do to make sure that the answer you provide is as accurate as possible? Well it turns out (not surprisingly) that the answer is being as certain as possible about the quality of, and biases in, the input data.

Papers and citations

One of the key elements of our ranking is the ability of a University to generate valuable new knowledge.  There are several ways we evaluate that, but one of the most important is around new papers that are generated by researchers. Our source for these is Elsevier’s Scopus database – a great place to get information on academic papers.

We are interested in a few things: the number of papers generated by a University, the number of papers with international collaboration, and the average number of citations that papers from a University get.

Citations are key. They are an indication that the work has merit. Imagine that in my seminal paper “French philosophy in the late post-war period” I chose to cite Anindya Bhattacharyya’s “Sets, Categories and Topoi: approaches to ontology in Badiou’s later work”. I am telling the world that he has done a good piece of research.  If we add up all the citations he has received we get an idea of the value of the work.

Unfortunately not all citations are equal. There are some areas of research where authors cite each other more highly than in others. To avoid this biasing our data in favour of Universities with large medical departments, and against those that specialise in French philosophy, we use a field weighted measure. Essentially we calculate an average number of citations for every* field of academic research, and then determine how a particular paper scores compared to that average.

These values are then rolled up to the University level so we can see how the research performed at one University compares to that of another.  We do this by allocating the weighted count to the University associated with an author of a paper.
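As a rough illustration of the idea (not our production method), here is a sketch with four made-up papers. The field names, citation counts and the choice to roll up by taking a simple mean per University are all assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

# Made-up papers: medicine is cited far more heavily than philosophy.
papers = [
    {"university": "A", "field": "medicine", "citations": 40},
    {"university": "B", "field": "medicine", "citations": 20},
    {"university": "A", "field": "philosophy", "citations": 2},
    {"university": "B", "field": "philosophy", "citations": 6},
]


def field_weighted_scores(papers):
    """Score each paper against its field average, then average per university."""
    by_field = defaultdict(list)
    for paper in papers:
        by_field[paper["field"]].append(paper["citations"])
    field_average = {field: mean(counts) for field, counts in by_field.items()}

    by_university = defaultdict(list)
    for paper in papers:
        average = field_average[paper["field"]]
        # A field with no citations at all cannot be weighted; score it as 0.
        weighted = paper["citations"] / average if average else 0.0
        by_university[paper["university"]].append(weighted)
    return {univ: mean(scores) for univ, scores in by_university.items()}


scores = field_weighted_scores(papers)
print(scores)
```

On raw counts University A looks far stronger (42 citations against 26), but once each paper is compared with its own field B comes out slightly ahead, because its philosophy paper is well above the philosophy average.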

The Many Authors problem

But what about papers with multiple authors?  Had Anindya been joined by Prof Jon Agar for the paper, then both learned gentlemen’s institutions would have received credit. Dr Meg Tait also joins, so we have a third institution that gains credit and so on.

Whilst the number of authors remains small that works quite well.  I can quite believe that Prof Agar, Dr Tait and Mr Bhattacharyya all participated in the work on Badiou.

At this point we must depart from the safe world of philosophy for the dangerous world of particle physics**. Here we have mega-experiments where the academic output is also mega. For perfectly sound reasons there are papers with thousands of authors. In fact “Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC” has 2932 authors.  

Did they all contribute to the experiment? Possibly. In fact, probably. But if we include the data in this form in our rankings it has some very strange results.  Universities are boosted hugely if a single researcher participated in the project.

I feel a bit text bound, so here is a graph of the distribution of papers with more than 100 authors.


Frequency of papers with more than 100 authors

Please note that the vast majority of the 11 million papers in the dataset aren’t shown!  In fact there are approximately 480 papers with more than 2000 authors.

Not all authors will have had the same impact on the research. It used to be assumed that there was a certain ordering to the way that authors were named, and this would allow the reduction of the count to only the significant authors. Unfortunately there is no consensus across academia about how this should happen, and no obvious way of automating the process of counting it.


How to deal with this issue? Well for this year we’re taking a slightly crude, but effective solution. We’re simply not counting the papers with more than 1000 authors. 1000 is a somewhat arbitrary cut off point, but a visual inspection of the distribution suggests that this is a reasonable separation point between the regular distribution on the left, and the abnormal clusters on the right.

In the longer term there are one technical and one structural approach that would be viable.  The technical approach is to use a fractional counting approach (2932 authors? Well you each get 0.034% of the credit).  The structural approach is more of a long term solution: to persuade the academic community to adopt metadata that adequately explains the relationship of individuals to the paper that they are ‘authoring’.  Unfortunately I’m not holding my breath on that one.
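Both this year’s cut-off and the fractional alternative are easy to sketch. The data layout below (a score per paper plus one institution per author) and the function names are assumptions for illustration; only the 1000-author threshold and the 2932-author example come from the text.

```python
from collections import defaultdict

MAX_AUTHORS = 1000  # this year's cut-off, as described above


def whole_counting(papers, max_authors=MAX_AUTHORS):
    """Every institution on a paper gets the full score; mega-papers are skipped."""
    credit = defaultdict(float)
    for paper in papers:
        authors = paper["author_institutions"]
        if len(authors) > max_authors:
            continue
        for institution in set(authors):
            credit[institution] += paper["score"]
    return dict(credit)


def fractional_counting(papers):
    """Every author gets an equal share of the paper's score."""
    credit = defaultdict(float)
    for paper in papers:
        authors = paper["author_institutions"]
        if not authors:
            continue
        share = paper["score"] / len(authors)
        for institution in authors:
            credit[institution] += share
    return dict(credit)


# Made-up example: one ordinary paper and one 2932-author paper.
small_paper = {"score": 2.0, "author_institutions": ["Uni X", "Uni Y"]}
mega_paper = {"score": 1.0, "author_institutions": ["Uni X"] + ["Other"] * 2931}
papers = [small_paper, mega_paper]

whole = whole_counting(papers)
fractional = fractional_counting(papers)
share_percent = round(100 / 2932, 3)
print(whole, fractional["Uni X"], share_percent)
```

With the cut-off, the mega-paper simply vanishes. With fractional counting it still counts, but the single Uni X researcher adds only about 0.034% of that paper’s score.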

*well, most

**plot spoiler: the world didn’t end

Some things I learned at Teradata

Over the last three and a half years I have led a fantastic team of data scientists at Teradata. But now it’s time for me to move on… so what did I learn in my time? What are the key Data Science messages that I’m going to take with me?


A lot of people don’t get it

What makes a good data scientist? One definition is that it is someone who can code better than a statistician, and do better stats than a coder. Frankly that’s a terrible definition, which really says you want someone who is bad at two different things.

In reality the thing that makes a good data scientist is a particular world view. One that appreciates the insight that data provides, and who is aware of the challenges and joys of data. A good data scientist will always want to jump into the data and start working on finding new questions, answers, and insights.  A great data scientist will want to do that, but will start by thinking about the question instead! If you throw a number at a good data scientist you’ll get a bunch of questions back…

Many people don’t have that worldview. And no matter how good they get at coding in R they will never make a good data scientist.

Data science is the Second Circle of data.

It’s one for the problem, two for the data, three for the technique

One of my favourite dislikes is the algorithm fetishists. A key learning from working across different customers and industries is that when analytical projects fail it’s very rarely because the algorithm was suboptimal. Usually it has been because the problem wasn’t right – or that the data didn’t match the problem.

Where choice of algorithm is important is in consideration of the use of the solution (and potentially in the productionisation of it) rather than in terms of simple measures of performance.

Don’t be afraid of the simple answer

Yes, you know how to run an n-path. Or do Markov chain analysis. Or build a random forest. But if the answer can be generated from a simple chart, why would you use those other techniques? To show how clever you are?

There is another side – being aware that the simple answer may be wrong, and that the lure of simplicity is dangerous in itself. But usually if you get it then you know about that…

And of course there is also something to be said about the idea that the best ideas seem simple, but only after you’ve found them.

Stories are powerful

When you’re trying to sell an analytical approach (or even analytical software or hardware) the story you tell is vital. And the story might not be where the actual value is. Because to tell the story best you often use the edge cases. The best example comes from some work a colleague was doing. The actual analysis was great, but the thing that sold it to the client was a one-off event (albeit one that was ongoing) of such astonishing stupidity that it instantly caught the imagination. Everyone could immediately see that it was both crazy, and also that it was bound to happen. And it had been found through analysis.

I really wish I could tell you what it was! Buy me a drink sometime and you might find out…

Some of you may say that you’re not selling analytics. But if you’re a data scientist you are – to your boss, your co-workers, people you want to impress… and if you’re selling analysis you need to tell stories.

You still need to munge that data

So much time is spent dealing with data. This is one of the reasons that so many data scientists still use SQL (and it’s also a reason why logical modelling is still more attractive than late binding – I’m lazy and want someone else to have done some of the work first).

I wish it wasn’t the case. And I wish that tools were better at it than they are.

Don’t look for data scientists, look for data science people

Remember, when you want to recruit (and retain) data scientists, that they are people. I’ve been lucky at Teradata to work with some fantastic people – in my team, in the wider company, and at our clients.

I have a concern, however, that we (the data science community) are undervaluing some people, and as a result overlooking fantastic talent. A recent survey on data science salaries by O’Reilly included a regression model, and one of the key findings was that if you were a woman your salary dropped by $13k. For no reason whatsoever.

This seems bizarre to me, as I have had the privilege to work with some fantastic women in data: Judy Bayer, Fran Bennett, Garance Legrand, Kaitlin Thaney, Yodit Stanton and many many more*.

Data Science can change the world

Teradata believe in data philanthropy – the idea that if more social organisations use data for decisions they will make better decisions, and that tech companies can play a part in helping them achieve this. Because of this they have supported DataKind and DataKind UK.

This is really important – because there are a bunch of challenges in helping charities and not for profits when it comes to data. The last thing these organisations need is well intentioned, but damaging, solutionism being dumped on them by West Coast gurus. There is nothing wrong in Elon Musk working on big issues through things like Tesla, but there is a whole bunch more that can be achieved if we can find sensitive ways to work with the people who deal with social problems every day.

In my work with DataKind I’ve seen what data can do for charities, and this, in turn, has made me a better data scientist.

Where am I going?

I’m about to start a new career leading the data team at Times Higher Education – where we produce the leading ranking of Universities across the world.  I’ve loved my time at Teradata, and I’ve learnt some important stuff, but it’s time for a change!

*sorry if I didn’t mention you here…

7 Things from Strata Santa Clara

This is the fifth time I’ve made the pilgrimage to Strata – I was lucky enough to be at the very first event, here in Santa Clara, and that’s made me think about how things have changed over the last two years.

Two years ago big data was new and shiny. About 1600 people turned up at the south end of Silicon Valley to enthuse about what we could do with data.

Now we’re talking about legacy Hadoop systems, and data science (big data is so 2011), but what else has changed?

1)   Hadoop grew up

The talk this year wasn’t about this new shiny thing called Hadoop, it was about which distro was the best (with a new Intel distro being announced), and which company had the biggest number of coders working in the original open source code. Seriously there were almost fistfights over the bragging rights.

As a mark of the new seriousness the sports jacket to t-shirt ratio was up. But don’t worry the PC to Mac ratio was still tending to zero (the battle was between Air and Pro).

2)   NoSQL is very much secondary to Hadoop

The show is extremely analytically oriented (in a database sense… but more of that later). The NoSQL vendors are there, but attract a fraction of the attention.

3)   SQL is back

Yes, really.  It turns out it is useful for something after all.

4)   Everyone is looking for ways to make it actually work (and work fast)

Hadoop isn’t perfect, and there is a wide range of companies trying to make it work better. Oddly there is a fetishisation of speed. Odd because this is something that the data warehouse companies went through in the days when it was all about the TPC benchmarks. No people, scaling doesn’t just mean big and fast. It means usable by lots of people and a whole raft of other things.

Greenplum were trying to convince us that theirs was fastest. Intel told us that SAP HANA was faster, and more innovative. Really. And the list went on.

Rather worryingly there seems to be a group out there who want to try and reinvent the dead end of TPC* but for Hadoop.

5)   There’s a lot of talk about Bayes, but not many data miners 

I ran a session on Data Mining. Only a handful of people out of about 200 in the room would admit to being data miners. This is terrifying as data scientists are trying to do analytical tasks. Coding a random forest does not make you a data miner!

6)   Data philanthropy is (still) hot

We had a keynote from codeforamerica, and lots of talk about ethics, black hats etc… I ran a Birds of a Feather table on Doing Good With Data. A group of us were talking about the top secret PROJECT EPHASUS. 

7)   The Gartner Hypecycle is in the trough of disillusionment

At least as far as big data is concerned. The show sold out. And the excitement was still there. Data Science has arrived.


* For a fun review about the decision that Teradata made to withdraw from TPC benchmarks, try this by my esteemed colleague Martin Willcox.

Doing Good With Data: the case for the ethical Data Scientist

This post is designed to accompany my presentation to the Teradata Partners User Group, but hopefully some of the links will prove useful even if you couldn’t get to the presentation itself.

Needless to say, the most important part – the Pledge – is right at the bottom.  Feel free to skip to it if you like!


The law (as it relates to data – well actually pretty much all law) is complex, highly jurisdictional, and most importantly of all at least 10 years behind reality.  Neither Judy nor I are lawyers, but hopefully these links provide some general background:

One of the first legal agreements was the OECD’s position on data transfers between countries. It dates from the early 70s, when flares were hot and digital watches were the dream of a young Douglas Adams:

Much later the EU released the snappily titled EU 95/46/EC – better known as the Data Directive. The joy is that each country can implement it differently, resulting in confusion.  There are currently proposals out for consultation on updating it too:

Of course the EU and the US occasionally come to different decisions, and for a brief discussion of some of the major differences between them you can try this:

Don’t do evil

Google’s famous take on the Hippocratic oath can be simplified as ‘don’t do evil’. As we say in the presentation, this is necessary, but scarcely enough.  It also has the disadvantage of being passive. In its expanded form it’s available here:

Doing Good With Data

Now for the fun bits!  For information on the UN Global Pulse initiative: 

Data 4 Development – the Orange France Telecom initiative in Ivory Coast:

If you have a bent for European socialised medicine, then the NHS hack days are here:


And our favourite – with a big thanks to Jake Porway and Craig Barowsky – is DataKind: You can also follow @DataKind

To find out more about the UK charities mentioned check out Place2Be and KeyFund

Please take the time to register with DataKind, and keep your eyes open for other opportunities.  We hope that DataKind will be open for business in the UK too soon!

The Pledge

Please go and look at the Pledge, and if you think you can, then sign up.  If you have one of our printed cards, take it, sign it and put it on your cube wall (or your refrigerator – wherever it will remind you of your commitment). But sign the online one too.  And once you’ve done that, let the world know! Get them to sign up. If you want a Word copy of the printable one just drop me a line.

Data *insert profession here*

In thinking about our data revolution, and pondering on my self declared job title (always the best kind of job title if you ask me), it occurred to me: how would professions change if we put the word “data” in front of the title? Would it make them better, or worse? Would data actually improve the way that the professions work?

Of course, this has already happened for data science, and increasingly for data journalism – and reflects two different approaches.  In the first case it is applying science to data.  In the second it is applying data to journalism.

But if you assume that Data *insert profession here* is about applying data science approaches to <profession>, then what could it mean? Would it make the world better? Let’s try it and see…


A DataKind data dive: more on that later

Data Journalism – leading the way

Assuming we all have our opinions about data science, let’s look instead at Data Journalism. Go and examine The Guardian’s data section (led by the excellent @simonrogers).  Here you will find stories developed from public data, data being used to test the truth of other stories, and the crowdsourcing of journalism.

Famously, when the UK Government tried to convince everyone that Blackberry (remember them?) Messenger and Twitter were the cause of the riots in the UK, The Guardian dug up the data that proved that messages followed incidents, not vice-versa.

It’s exciting, and it’s not clear where it will all end up.  Simon would probably tell you the same.  But it does change the way that journalists work and interact with sources.

So if it works for Data Journalism, where else might it work?

Data Politician

How many politicians have a science background?  The answer is very few.  In the UK there was my MP, Dr Lynne Jones, but she retired at the last election.   Actually there are sites that will tell you that the number is 27 (although being me I looked further and found that was only based on a sample of 430).

It’s still not many.

Could a scientific approach to politics, using data, help? How would we feel if politicians actively set up experiments to validate their policies?  Let’s be clear that current policy ‘trials’ tend to get the results that politicians want, and tend to be neither controlled nor statistically significant. I’ve got to wonder if that is because politicians as a whole are unfamiliar with the words “control”, “statistical” or “significant”.

How would we react to experiments?  Would we be willing to tolerate being in one? And how would we treat a politician who did an experiment and changed their position as a result? Would we just shout “U-turn” at them?

Ironically politicians are already using the results of experimentation in their marketing/election efforts. Obama has a large team of predictive modellers whose task is to identify and target likely voters, and I’m sure Romney has too.

If only they would apply this to their policies!

Data Judge

Another area where data could surely add value, is in the policing and criminal justice system. We can predict vulnerabilities, identify mitigating strategies, and even try and modify punishment using data and experimentation.

Does the death penalty reduce the murder rate? What programmes reduce reoffending? What are the causes of criminality, and can they be reduced?

Sadly we seem to be heading in exactly the opposite direction in the UK. In my opinion, given the importance of statistical understanding in modern forensic evidence, any Judge who can’t do basic level statistics should immediately recuse themselves from any case.

Data Philanthropist

Now to my mind this is the biggest, and one that already has traction in the US.  Jake Porway (@jakeporway) has been leading the field with his wonderful DataKind (@datakind).  The idea is golden: if we can do cool things with data for business/industry, why can’t we do cool things with data for charity/not-for-profits?

And it’s coming to the UK. On the 29th and 30th of September we hope there will be a DataKind data dive in London.  A first chance for UK not for profits to get free insight into their data.  A first chance for data scientists to try their hands at data philanthropy.

If you know a charity that could benefit, get in touch. If you want to be involved get in touch too.

A discussion on Big Data – Teradata Universe 2012

The following notes are recreated from the Big Data Panel Session held at the Teradata Universe conference in Dublin, April 2012.

The panel consisted of Dr Judy Bayer (Director Strategic Analytics, Teradata), Tom Fastner @tfastner (Senior Architect, eBay), Navdeep Alam @YoshiNav (Director Data Architecture, Mzinga), and Professor Mark Whitehorn (University of Dundee).

It was moderated by me… so any false memory syndrome is laid at my door.  Note: I have edited it slightly to turn a conversation into something a bit more readable, I hope the contributors will forgive me!

Let’s start with an easy question: what one word sums up Big Data for you?

Mark: Fun

Judy: For this decade – noisy

Navdeep: Bringing together

Tom: Fun(damental)

Navdeep: Big Data is bringing together technologies, it requires interoperability between systems such as Teradata and Hadoop, SQL and MapReduce, it’s also bringing people together.

What makes it fundamental? And fun?

Tom: If you go back to Crossing the Chasm, we are on the left side of the chasm: the innovators. It is fundamental to get our job right as we are doing it first.

Mark: And I can’t believe people pay me to do this, it’s such fun.

You mentioned noise, Judy, why is that?

Judy: Big Data has always been around, it’s defined as much by current capabilities as anything. And each generation of big data brings noise and confusion as to what to do with it.

Audience: It’s also all about telling a story from the data.

So what makes a good Data Scientist?

Tom: There are six characteristics. They are a special breed who need to be: curious, questioning, a good communicator, open minded, someone who can dig deeper…

We have five to ten concurrent users of Hadoop and these are the data scientists. I sit next to one and he’s constantly going “Wow!”.  But they also cause the most problems with their big questions.

Judy: I’d add creativity, a passion to experiment and fail, and a passion for finding the stories in data.

Mark (stroking his luxuriant facial hair): A beard and sandals! No: someone who can think outside the box and be adventurous.

Navdeep: They need to be insatiable when it comes to data.  They also need to be a cry-baby – in that they cannot be satisfied, they should always want more data, hardware, more resources.

The McKinsey Global Institute report from 2011 showed a huge skill shortfall for Big Data Analysis – would you agree?

Navdeep: There is clearly a shortage of skills. You need to mix business and technology, so collaboration is key.

Tom: Yes!

Mark: In 2008 I was at a conference when someone asked: what is the academic world doing to fix this problem? In response the University of Dundee set up a Masters course in business intelligence.

Audience: Do Data Scientists exist without realising it? Is Data Science a rebranding of existing skills like data mining?

Judy and I have had disputes about whether Data Scientists actually exist…

Judy: Well I believe analysts are born not made, but they need training to fulfil their potential. When it comes to Data Science I think there may be something new here. Data Scientists will be better at collaboration than traditional Data Miners. But we’re at the infancy of the subject, with data and tools that don’t really exist yet. In many ways this is a parallel with the early days of data mining.

Tom: Take Kaggle for example, it’s interesting because of the collaboration between the individuals in the teams. You have to form teams and build on skillsets to produce the best algorithm to solve the problem.

Audience: This is probably re-branding, you need an analyst who can work across areas…

Audience: I find Data Science a restrictive term, it doesn’t capture the art side and the creativity that is required – people are rejecting the easy to use GUI tools and going back to R and programming languages.

Which brings us on to a related topic: what is the most important big data analytical technology?

Navdeep: Massively parallel processing, with fault tolerance on commodity hardware and with support for virtual machines. In other words removing the complexity of parallel processing, allowing organisations outside of the military and academia to experiment.

Judy: Visualisation – for example ways of visualizing network graphs.

Tom: It isn’t a single technology, it’s an eco-system, and it’ll take many years to develop.

Mark: R – we need languages that let us use this data.

But isn’t there a danger that these languages restrict usage to a niche specialism?

Mark: Good fundamental languages will allow tools to be built on top.

Do those tools exist?  Judy, do you see visualisation as a mature technology? It’s clear that part of the data science skill set is telling stories but the visualisation doesn’t seem to be quite there yet.

Judy: Some of the visualization you see has too much wow factor (trying to be clever) but isn’t easy enough to understand.  It needs to be easy to communicate but also to be actionable.

Mark: The work of Hans Rosling is a brilliant example of clear visualisation.

Navdeep: It’s clear that BI tools are not sufficient alone, custom visualisation needs to be written.

Audience: Have we collected the right data? Do we need to look at what we have and keep everything, or just what’s relevant?

Tom: There are limitations on what you can actually store. eBay do delete historical data and certain things like pictures. Some data can be reproduced rather than stored.

Mark: It’s a balance. In the case of proteomics it is relatively more expensive to produce than store – and reprocessing may be required at a later date.

Navdeep: Cloud storage is expensive – so at Mzinga we focus on keeping behavioural data that can’t be reproduced. We use a private cloud solution to store our archives. In the case of Facebook data, we use Hadoop to process it, and keep the results. Currently we purge the source data when it is over 5 years old.  We try to recognise what’s valuable and hard to reproduce, and keep that.

If Big Data Fails in 2012 – what will be the cause?

Judy: Keeping everything, our data and businesses, siloed. Not recognising that we have to integrate everything we have.

Mark: Stupidity! We can do it, we can get value. Technically it works, it is people who could cause it to fail.

Navdeep: It comes down to a lack of people who understand and can use the tech. People are needed to drive innovation.

Tom: People, and expectations set by management. It takes time to grow and it is being done successfully.  Big Data is a buzz word that will not go away.

Do you see anything that worries you about Big Data?  What about data protection or security?

Tom: We have a lot of data at eBay, but need to be cautious over what we do with the users’ data to avoid alienating them. As a result there are lots of rules regarding what can and can’t be done with the data.

Navdeep: We’ve worked with security agencies, and understand the need to be careful.  It’s important to respect the different laws in different countries.

Judy: Privacy and security will increase in relevance but won’t cause big data to fail. Ways will be found to increase privacy – and laws will need to change to cope with the new world.

Isn’t it our job as data professionals to think about what is reasonable and ethical? Thinking about the Target case, a good comment was: don’t use data to do anything indirectly that you wouldn’t do directly.

Finally, if you were starting a big data project tomorrow and could do anything at all, what would you do?

Navdeep: I would study the universe.  For decades we’ve had measurements from sensors, so I would take all the information and build some analyses and correlations. There is a huge opportunity to bring all this data together.

Mark: Proteomics! But as we’re already doing it I would opt for quantum data, bringing probability theory to the subatomic world.

Tom: Natural Language Processing, understanding language – can it be done in a database? Could that be more efficient than Hadoop?

Judy: Analytics that would do good for society, for example using analysis to increase literacy. But I’ve got to back Mark too: proteomics, it’s awesome.

Thank you

Additional reporting by @chillax7

NoSQL? NOSQL? How about NOHadoop?

Comrades (for that is how all good exhortations to action begin), it is time for us to stand up against a Heresy that is sweeping the world of Big Data Science.

The Heresy is that there is only one God, and its name is Hadoop. This yellow elephant is being taken by some to be the alpha and omega of data science.  Just the other day an eminent blogger started a comment by saying “If Logo of BIG data is Elephant, What is the Logo of Analytics?”

And this is annoying.

It’s like saying the logo of driving is a prancing horse (sorry Daimler).  Or calling a computer tablet an iPad.  Well forget that last one. But you get the idea.  Hadoop may be the fastest example of eponymy ever; it has almost become a generic brand name. “Let’s get us some of those there Hadoops” can virtually be heard coming from the boardrooms of the Global 3000.

But only almost.

There is still time for the business idea of big data science to triumph if all of us non-Hadoop-struck folks get together.

So, if you like MongoDB, if you think SQL has a few tricks up its sleeve, if you R a data mining pirate, if you think the use comes first and the technology comes second, then join the Not Only Hadoop campaign.

Say #NOHadoop

LonData III: the MoshiMonsters paradox

Are you familiar with MoshiMonsters?  It is an online pet type game/junior social network developed by MindCandy.  If not, then you probably don’t have kids aged between 6 and 12…

At LonData III we were lucky enough to have a presentation from Toby Moore, CTO of MoshiMonsters, who took us through the world of data that the game generates, and how MindCandy got to where they are.

Toby took us through their aim of moving from no data, through big data, right data, predictive data, and eventually to strategic data.

At the beginning of their story they had lots of data, but no ETL, no reporting, and no analysis.  They realised they had to move forwards, and put in place a technology stack of:

  • MS SQL Server as an ETL platform
  • Hadoop for data storage
  • MS SQL for analysis/reporting

This still didn’t resolve their problems, and so they are moving to QlikView to give users direct access to their data.

So this is a Big Data play, right? Lots of data? Hadoop?  It must be!

Is this Big Data, or just big data?

There are lots of things that are great about this story – and let me be clear that none of my comments in any way take away from the amazing success of MoshiMonsters…

I like:

  • The fact that data is so important to them
  • The willingness to give end users direct access to data

But I think it fails to be Big Data because

  • They don’t try to experiment using the data
  • They don’t do predictive analysis (although they use six-sigma statistical approaches to identify issues)
  • There is very limited analysis
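
For readers unfamiliar with the six-sigma style of issue detection mentioned in the list above, here is a minimal sketch of one common form of it: a control-limit check that flags values more than three standard deviations from a baseline mean. This is an assumption about the general technique, not a description of what MindCandy actually run; the function names and the daily sign-up figures are invented for illustration.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits computed from a baseline sample."""
    centre = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return centre - sigmas * spread, centre + sigmas * spread


def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical daily sign-up counts, invented purely for illustration.
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
limits = control_limits(baseline)

# New observations to check against the baseline limits.
flagged = out_of_control([101, 99, 130, 100, 72], limits)
print(limits)
print(flagged)
```

A check like this spots when something has gone wrong, which is useful, but it is monitoring rather than prediction or experimentation.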

Data kills Creativity? Really?

In fact the most worrying issue was a C. P. Snow-like divide: on the one side Creativity.  On the other Data.

This came up several times in the presentation – they would never burden their creative staff with data. They don’t think that segmenting their customers, or analysing their behaviour is the way to go. They don’t test out alternative strategies on the website.

Partly this is because they are extremely sensitive to the nature of their customers (young children) who aren’t the same as the people paying the bills (adults). They say they try to avoid pressuring their customers out of the freemium and into the paying segments*.

I’ve got to say, I really don’t believe this divide to be true. Yes, an anally retentive approach to analysis might kill creativity, but anyone that anal probably doesn’t understand the limits of their analysis. Analysis leaves many, many grey areas.  And on the other hand creativity cannot work in a vacuum.

I came away somewhat disturbed by their approach, whilst still being in admiration of their success and drive. I don’t believe that Big Data approaches can be separated from creativity!

The conclusion:

  1. Is Hadoop necessary for Big Data? Possibly, but it isn’t sufficient.
  2. Is volume necessary for Big Data? Not on an absolute scale, although it helps.
  3. Is attitude necessary for Big Data? Yes, absolutely!
  4. Is it creative? Hell Yes!