Who hates Donald Trump the most?

By now you’re probably aware of the UK Government petition about Donald Trump. It has currently been signed by over 500,000 people, making it the most successful petition in the UK…

But who actually hates him most? The site conveniently provides a map, so you can see where people dislike him… but constituencies aren’t all equal, and you always run the risk of doing a heatmap of population densities:


(Source xkcd.com/1138)

So I thought I’d explore further.  Firstly, which are the top constituencies by proportion of the population who hate Donald?

Constituency | MP | Signed | Percentage
Bethnal Green and Bow | Rushanara Ali MP | 1874 | 1.50%
Bristol West | Thangam Debbonaire MP | 1779 | 1.43%
Brighton Pavilion | Caroline Lucas MP | 1405 | 1.36%
Hackney South and Shoreditch | Meg Hillier MP | 1496 | 1.27%
Islington North | Jeremy Corbyn MP | 1284 | 1.24%
Cities of London and Westminster | Rt Hon Mark Field MP | 1364 | 1.24%
Glasgow North | Patrick Grady MP | 876 | 1.22%
Hackney North and Stoke Newington | Ms Diane Abbott MP | 1556 | 1.22%
Islington South and Finsbury | Emily Thornberry MP | 1237 | 1.20%
Edinburgh North and Leith | Deidre Brock MP | 1279 | 1.20%
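For the record, the percentages in these tables are simply signatures divided by constituency population. A minimal sketch in Python – the signature counts come from the table, but the population figures below are illustrative stand-ins, not the actual ONS data:

```python
# Percentage of each constituency signing the petition.
# Signature counts are from the table above; population figures are
# illustrative stand-ins, not the real ONS numbers.
signatures = {
    "Bethnal Green and Bow": 1874,
    "Cynon Valley": 78,
}
population = {
    "Bethnal Green and Bow": 125_000,
    "Cynon Valley": 71_000,
}

percentage = {
    name: 100 * signatures[name] / population[name]
    for name in signatures
}

for name, pct in sorted(percentage.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.2f}%")
```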

OK, so we see some of the usual faces there. Jeremy Corbyn, Diane Abbott, Caroline Lucas. Maybe it’s just a political thing.  Let’s look at the bottom constituencies (again ranked by proportion signed – or, if you prefer, led by those keenest on the bewigged one):

Constituency | MP | Signed | Percentage
Blaenau Gwent | Nick Smith MP | 92 | 0.13%
Barnsley East | Michael Dugher MP | 119 | 0.13%
Makerfield | Yvonne Fovargue MP | 126 | 0.13%
Boston and Skegness | Matt Warman MP | 129 | 0.13%
Wentworth and Dearne | Rt Hon John Healey MP | 119 | 0.12%
Walsall North | Mr David Winnick MP | 117 | 0.12%
Doncaster North | Rt Hon Edward Miliband MP | 123 | 0.12%
Easington | Grahame Morris MP | 102 | 0.12%
Kingston upon Hull East | Karl Turner MP | 112 | 0.12%
Cynon Valley | Rt Hon Ann Clwyd MP | 78 | 0.11%

Well I wouldn’t (necessarily) have expected to see Ed Miliband and Ann Clwyd’s constituencies there.

So let’s go slightly more data sciency. What is the correlation between Labour vote and proportion hating Donald?

Rather than just looking at that, let’s look at a whole range of stuff:


Woah! What was that?  Well, those are the correlations between a whole load of variables and the percentage who hate Donald. Percentage is the second column (and row); a blue box means a positive correlation, and a red box a negative one.

So let’s look at some of the more interesting ones, and sort them from highest to lowest. Remember that these are at a Constituency level.

  • PopulationDensity   0.64
  • Green               0.55
  • FulltimeStudent     0.53
  • Muslim              0.37
  • LD                  0.19
  • Lab                 0.07
  • Con                -0.11
  • Ethnicity White    -0.48
  • Christian          -0.60
  • UKIP               -0.61
  • Retired            -0.67

So it would seem that the strongest correlation isn’t with Muslim populations, it’s actually strongest with built up areas, then with 2015 Green voters, then with full time students, and only then with Muslim areas.
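The matrix behind that heatmap is easy to reproduce. A minimal sketch with pandas – the column names and values here are invented for illustration, not the actual ONS census fields:

```python
import pandas as pd

# Toy reconstruction of the correlation heatmap. Column names and values
# are invented; the real data is constituency-level census and 2015
# election results.
df = pd.DataFrame({
    "pct_signed":  [1.50, 1.43, 1.36, 0.13, 0.12, 0.11],
    "density":     [140.0, 95.0, 110.0, 4.0, 3.0, 2.0],   # people/hectare
    "pct_retired": [6.1, 7.0, 6.5, 18.2, 19.5, 20.1],
})

# The heatmap in the post is just this matrix with blue/red colouring.
corr = df.corr()
print(corr["pct_signed"].sort_values(ascending=False))
```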

And it’s probably not surprising that the constituencies least likely to hate Donald are those with lots of retired people, a strong UKIP presence (some immigration is OK then, it would seem), a large number of people identifying as Christian, and a high proportion of white residents.

Taking it one small step further, and running our old friend, linear regression, we see a model like this (Labour and Con voters removed to avoid tons of correlation, LD voters removed out of sympathy):

Term | Estimate | Std. Error | t value | Pr(>|t|)
(Intercept) | 0.557 | 0.0705 | 7.89 | E-14
Green | 0.0201 | 0.0016 | 12.78 | < 2e-16
Muslim | 0.0036 | 0.0012 | 3.04 | 0.0025
Density | 0.0035 | 0.0003 | 12.54 | < 2e-16
White | 0.0033 | 0.0007 | 4.78 | 2.2e-6
Student | 0.0023 | 0.0011 | 2.05 | 0.041
Christian | -0.0021 | 0.0007 | -2.97 | 0.0031
Lab | -0.0036 | 0.0003 | -11.14 | < 2e-16
UKIP | -0.0118 | 0.0007 | -16.85 | < 2e-16
Retired | -0.0173 | 0.0020 | -8.55 | < 2e-16

A slightly different story – but looking at the key standouts: if you want to find a constituency that really hates Donald, first look for Green voters, then a densely populated area with quite a sizeable Muslim and white community, but keep away from those UKIP voters, especially the retired ones.
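For the curious, the model above is plain ordinary least squares. Here’s a toy version on synthetic data – the “true” coefficients are chosen to echo the table (Green +0.02, Retired -0.017), but none of this is the real dataset:

```python
import numpy as np

# Toy version of the constituency regression: OLS on synthetic data.
# Coefficients are chosen to echo the table above, not taken from it.
rng = np.random.default_rng(0)
n = 632  # roughly the number of GB constituencies (NI excluded)

green = rng.uniform(0, 20, n)      # % Green vote, 2015
retired = rng.uniform(5, 25, n)    # % retired residents
pct_signed = 0.55 + 0.02 * green - 0.017 * retired + rng.normal(0, 0.05, n)

X = np.column_stack([np.ones(n), green, retired])
coef, *_ = np.linalg.lstsq(X, pct_signed, rcond=None)
print(dict(zip(["intercept", "green", "retired"], coef.round(3))))
```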

*Data for the petition harvested at about noon, 11 December 2015
** Other data from ONS based on the 2011 Census
*** Featured image Donald Trump by Gage Skidmore
**** Northern Ireland doesn’t provide breakdown of petition numbers


STS forum. The strangest technology conference you’ve never heard of

At the beginning of October I was in Kyoto (yes, I can hear the tiny violins) attending the STS Forum on behalf of my employers.

What is the STS Forum?  Well this was the 12th meeting of a group focused on linking universities, technology companies, and governments to address global problems. The full name is Science and Technology in Society.

And it’s a really high level kind of thing. The opening was addressed by three prime ministers. There are more university vice-chancellors/provosts/rectors than you could imagine.  If you aren’t a professor then you’d better be a minister. No Nobel prize?  Just a matter of time.

So it’s senior.  But is it about technology?  Or at least the technology that I’m familiar with?

PM Abe addresses STS Forum

The usual players?

Well the first challenge is the sponsors.  A bunch of big companies. Huawei, Lockheed Martin, Saudi Aramco, Toyota, Hitachi, NTT, BAT, EDF.

All big, all important (I leave it up to you to decide if they’re good).  But are these really who you’d expect? Where are IBM?  Oracle? SAP? Even Siemens? Never mind Microsoft, Apple, or (dare I say it) LinkedIn, Facebook etc…

I daren’t even mention the world of big data: MongoDB, Cloudera or others.

Panels and topics

Then there are the panelists.  90% male. (In fact the median number of women on a panel is zero).  They are largely old.  None of them seem to be ‘real world’ experts – most are in Government and academia.

The topics are potentially interesting, but I’m nervous about the big data one. It’s not clear that there are any actual practitioners here (I will feed back later!)

Attendees and Ts

I have never been to a technology conference that is so suited. Even Gartner has a less uptight feel. Over 1000 people and not a single slogan. Wow. I feel quite daring wearing a pink shirt. And no tie.

What could they do?

I’m not saying it’s a bad conference. But I’m not sure it’s a technology conference, and I’m 100% certain it’s not a tech conference.

If they want it to be a tech conference then they need to take some serious action on diversity (especially gender and age)*.  They also need to think about inviting people who feel more comfortable in a T-shirt. The ones with slogans. And who know who xkcd is.

And this seems to be the biggest problem: the conference seems to highlight the gulf between the three components that they talk about (the triple helix) – universities, government, big business – and the markets where the theory hits the road. The innovators, the open source community, the disruptors.

On to the Big Data session

Well that was like a flashback to 2013. Lots of Vs, much confusion. Very doge.

It wasn’t clear what we were talking about big data for. Plenty of emphasis on HPC but not a single mention of Hadoop.

Some parts of the room seemed concerned about the possible impact of big data on society. Others wanted to explore if big data was relevant to science, and if so, how.  So, a lot of confusion, and not a lot of insight…

It’s not just the Hadron Collider that’s Large: super-colliders and super-papers

During most of my career in data science, I’ve been used to dealing with analysis where there is an objective correct answer. This is bread and butter to data mining: you create a model and test it against reality.  Your model is either good or bad (or sufficiently good!) and you can choose to use it or not.

But since joining THE I’ve been faced with another, and in some ways very different problem – building our new World University Rankings – a challenge where there isn’t an absolute right answer.

So what can you, as a data scientist, do to make sure that the answer you provide is as accurate as possible? Well it turns out (not surprisingly) that the answer is being as certain as possible about the quality of, and the biases in, the input data.

Papers and citations

One of the key elements of our ranking is the ability of a University to generate valuable new knowledge.  There are several ways we evaluate that, but one of the most important is around new papers that are generated by researchers. Our source for these is Elsevier’s Scopus database – a great place to get information on academic papers.

We are interested in a few things: the number of papers generated by a University, the number of papers with international collaboration, and the average number of citations that papers from a University get.

Citations are key. They are an indication that the work has merit. Imagine that in my seminal paper “French philosophy in the late post-war period” I chose to cite Anindya Bhattacharyya’s “Sets, Categories and Topoi: approaches to ontology in Badiou’s later work“. I am telling the world that he has done a good piece of research.  If we add up all the citations he has received we get an idea of the value of the work.

Unfortunately not all citations are equal. There are some areas of research where authors cite each other far more frequently than in others. To avoid this biasing our data in favour of Universities with large medical departments, and against those that specialise in French philosophy, we use a field-weighted measure. Essentially we calculate an average number of citations for every* field of academic research, and then determine how a particular paper scores compared to that average.

These values are then rolled up to the University level so we can see how the research performed at one University compares to that of another.  We do this by allocating the weighted count to the University associated with an author of a paper.
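In code, the field-weighting and roll-up might look something like this – the field baselines and papers are invented for illustration, and the real Scopus-based calculation is considerably more involved:

```python
# Sketch of field-weighted citation scoring and the university roll-up.
# Field baselines and paper data are invented for illustration.
field_average = {"medicine": 12.0, "philosophy": 2.0}

papers = [
    {"university": "A", "field": "medicine",   "citations": 24},
    {"university": "A", "field": "medicine",   "citations": 6},
    {"university": "B", "field": "philosophy", "citations": 4},
]

weighted = {}
for p in papers:
    score = p["citations"] / field_average[p["field"]]  # 1.0 = field norm
    weighted.setdefault(p["university"], []).append(score)

# Roll up to university level by averaging the weighted scores.
fwci = {u: sum(s) / len(s) for u, s in weighted.items()}
print(fwci)  # → {'A': 1.25, 'B': 2.0}
```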

The Many Authors problem

But what about papers with multiple authors?  Had Anindya been joined by Prof Jon Agar for the paper, then both learned gentlemen’s institutions would have received credit. Had Dr Meg Tait also joined, a third institution would have gained credit, and so on.

Whilst the number of authors remains small that works quite well.  I can quite believe that Prof Agar, Dr Tait and Mr Bhattacharyya all participated in the work on Badiou.

At this point we must depart from the safe world of philosophy for the dangerous world of particle physics**. Here we have mega-experiments where the academic output is also mega. For perfectly sound reasons there are papers with thousands of authors. In fact “Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC” has 2932 authors.  

Did they all contribute to the experiment? Possibly. In fact, probably. But if we include the data in this form in our rankings it has some very strange results.  Universities are boosted hugely if a single researcher participated in the project.

I feel a bit text bound, so here is a graph of the distribution of papers with more than 100 authors.


Frequency of papers with more than 100 authors

Please note that the vast majority of the 11 million papers in the dataset aren’t shown!  In fact there are approximately 480 papers with more than 2000 authors.

Not all authors will have had the same impact on the research. It used to be assumed that there was a certain ordering to the way that authors were named, and this would allow the reduction of the count to only the significant authors. Unfortunately there is no consensus across academia about how this should happen, and no obvious way of automating the process of counting it.


How to deal with this issue? Well for this year we’re taking a slightly crude, but effective solution. We’re simply not counting the papers with more than 1000 authors. 1000 is a somewhat arbitrary cut off point, but a visual inspection of the distribution suggests that this is a reasonable separation point between the regular distribution on the left, and the abnormal clusters on the right.

In the longer term there are two viable approaches: one technical and one structural.  The technical approach is fractional counting (2932 authors? Well, you each get 1/2932 – about 0.034% – of the credit).  The structural approach is more of a long-term solution: persuading the academic community to adopt metadata that adequately explains the relationship of individuals to the paper that they are ‘authoring’.  Unfortunately I’m not holding my breath on that one.
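Both approaches can be sketched in a few lines – the paper data and institution names below are made up:

```python
# Sketch of two answers to the many-authors problem, on invented data:
# drop papers over an author cap (this year's approach), and give each
# author a 1/n share of the credit (fractional counting).
AUTHOR_CAP = 1000

papers = [
    {"institutions": ["UCL", "Oxford"], "weighted_score": 3.0, "n_authors": 2},
    {"institutions": ["Institute X"],   "weighted_score": 50.0, "n_authors": 2932},
]

credit = {}
for p in papers:
    if p["n_authors"] > AUTHOR_CAP:
        continue  # current approach: exclude mega-author papers entirely
    share = p["weighted_score"] / p["n_authors"]  # fractional counting: 1/2932 ≈ 0.034%
    for inst in p["institutions"]:
        credit[inst] = credit.get(inst, 0.0) + share

print(credit)  # → {'UCL': 1.5, 'Oxford': 1.5}
```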

*well, most

**plot spoiler: the world didn’t end

Some things I learned at Teradata

Over the last three and a half years I have led a fantastic team of data scientists at Teradata. But now it’s time for me to move on… so what did I learn in my time? What are the key Data Science messages that I’m going to take with me?


A lot of people don’t get it

What makes a good data scientist? One definition is that it is someone who can code better than a statistician, and do better stats than a coder. Frankly that’s a terrible definition, which really says you want someone who is bad at two different things.

In reality the thing that makes a good data scientist is a particular world view. One that appreciates the insight that data provides, and who is aware of the challenges and joys of data. A good data scientist will always want to jump into the data and start working on finding new questions, answers, and insights.  A great data scientist will want to do that, but will start by thinking about the question instead! If you throw a number at a good data scientist you’ll get a bunch of questions back…

Many people don’t have that worldview. And no matter how good they get at coding in R they will never make a good data scientist.

Data science is the Second Circle of data.

It’s one for the problem, two for the data, three for the technique

One of my favourite dislikes is the algorithm fetishists. A key learning from working across different customers and industries is that when analytical projects fail it’s very rarely because the algorithm was sub-optimal. Usually it’s because the problem wasn’t right – or the data didn’t match the problem.

Where choice of algorithm is important is in consideration of the use of the solution (and potentially in the productionisation of it) rather than in terms of simple measures of performance.

Don’t be afraid of the simple answer

Yes, you know how to run an n-path. Or do Markov chain analysis. Or build a random forest. But if the answer can be generated from a simple chart, why would you use those other techniques? To show how clever you are?

There is another side – being aware that the simple answer may be wrong, and that the lure of simplicity is dangerous in itself. But usually if you get it then you know about that…

And of course there is also something to be said about the idea that the best ideas seem simple, but only after you’ve found them.

Stories are powerful

When you’re trying to sell an analytical approach (or even analytical software or hardware) the story you tell is vital. And the story might not be where the actual value is. Because to tell the story best you often use the edge cases. The best example comes from some work a colleague was doing. The actual analysis was great, but the thing that sold it to the client was a one-off event (albeit one that was ongoing) of such astonishing stupidity that it instantly caught the imagination. Everyone could immediately see that it was both crazy, and also that it was bound to happen. And it had been found through analysis.

I really wish I could tell you what it was! Buy me a drink sometime and you might find out…

Some of you may say that you’re not selling analytics. But if you’re a data scientist you are – to your boss, your co-workers, people you want to impress… and if you’re selling analysis you need to tell stories.

You still need to munge that data

So much time is spent dealing with data. This is one of the reasons that so many data scientists still use SQL (and it’s also a reason why logical modelling is still more attractive than late binding – I’m lazy and want someone else to have done some of the work first).

I wish it wasn’t the case. And I wish that tools were better at it than they are.

Don’t look for data scientists, look for data science people

Remember that when you want to recruit (and retain) data scientists that they are people. I’ve been lucky at Teradata to work with some fantastic people – both in my team, in the wider company, and at our clients.

I have a concern, however, that we (the data science community) are undervaluing some people, and as a result overlooking fantastic talent. A recent survey on data science salaries by O’Reilly included a regression model, and one of the key findings was that if you were a woman your salary dropped by $13k. For no reason whatsoever.

This seems bizarre to me, as I have had the privilege to work with some fantastic women in data: Judy Bayer, Fran Bennett, Garance Legrand, Kaitlin Thaney, Yodit Stanton and many many more*.

Data Science can change the world

Teradata believe in data philanthropy – the idea that if more social organisations use data for decisions, they will make better decisions, and that tech companies can play a part in helping them achieve this. Because of this they have supported DataKind and DataKind UK.

This is really important – because there are a bunch of challenges in helping charities and not for profits when it comes to data. The last thing these organisations need is well intentioned, but damaging, solutionism being dumped on them by West Coast gurus. There is nothing wrong in Elon Musk working on big issues through things like Tesla, but there is a whole bunch more that can be achieved if we can find sensitive ways to work with the people who deal with social problems everyday.

In my work with DataKind I’ve seen what data can do for charities, and this, in turn, has made me a better data scientist.

Where am I going?

I’m about to start a new career leading the data team at Times Higher Education – where we produce the leading ranking of Universities across the world.  I’ve loved my time at Teradata, and I’ve learnt some important stuff, but it’s time for a change!

*sorry if I didn’t mention you here…

Why the Prime Minister is wrong: the maths


Since this post was written we’ve had several new terrorist attacks in the UK. Most recently in Manchester and London. These are horrific events, and it’s natural that people want to do something. In each case there has been a call for the internet companies to ‘do more’, without ever being clear exactly what that means. Perhaps it means taking down posts. Perhaps it means reporting suspects. But whatever stance you take, the maths is still the maths, which makes this post that I wrote in 2014 more valid than ever…

Yesterday the UK Government suggested that an unnamed internet company could have prevented the murder of a soldier by taking action based on a message sent by the murderer.

It’s assumed that the company in question was Facebook.

The problem is that the maths tells us that this is simply wrong. It couldn’t have, and the reason why takes us to a graveyard near Old Street.


Buried in Bunhill Fields is Thomas Bayes, a non-conformist preacher and part-time statistician who died in 1761. He devised a theorem (now known as Bayes’ Theorem) that helps us to understand the real probability of infrequent events, based on what are called prior probabilities. And (thank God) events like this murder are infrequent.

For the sake of argument let’s imagine that Facebook can devise a technical way to scan through messages and posts and determine if the content is linked to a terrorist action. This, in itself, isn’t trivial. It requires a lot of text analytics: understanding idiom, distinguishing an exasperated “I could kill them” from a literal “I could kill them”, and so on.

But Facebook has some clever analysts, so lets assume that they build a test. And let’s be generous: it’s 98% accurate. I’d be very happy if I could write a text analytics model that was that accurate, but they are the best. Actually let’s make it 99% accurate. Heck! Let’s make it 99.9% accurate!

So now we should be 99.9% likely to catch events like this before they happen?


So let’s look at what Bayes and the numbers tell us.

The first number of interest is the actual number of terrorists in the UK. The number is certainly small. This is the only recent event.

But recently the Home Secretary, Theresa May, told us that 44 terrorist events have been stopped in the UK by the security services. I will take her at her word. Now let’s assume that this means there have been 100 actual terrorists. Again, you can move that number up or down, as you see fit, but it’s certainly true that there aren’t very many.

The second number is the number of people in the UK. There are (give or take) 60 million.

(I’m going to assume that terrorists are just as likely, or unlikely, as the population as a whole to use Facebook. This may not be true, but it’s a workable hypothesis.)

So what happens when I apply my very accurate model?

Well the good news is that I identify all of my terrorists – or at least I identify 99.9 of them. Pretty good.

But the bad news is that I also identify 60,000 non-terrorists as terrorists. These are the false positives that my model throws up.

The actual chance of a person being correctly identified as a terrorist is just 0.17%.

Now this is surely a huge advance over where we were – but imagine what would happen if we suddenly dropped 60,000 leads on the police. How would they be able to investigate? How would the legal system cope with generating these 60,000 warrants (yes, you would still need a warrant to see the material)?

And let’s be clear; if we’re more pessimistic about the model accuracy things get worse, fast. A 99% accurate model (still amazingly good) drops the chance of true detection to 0.017%. At 98% it’s 0.008%, and at a plausible 90% it would be 0.0015%. The maths is clear. Thank you Rev Bayes.
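If you want to check the arithmetic, here it is in a few lines of Python (as in the post, “accuracy” is used for both the true-positive and false-positive behaviour of the model):

```python
# Reproducing the post's numbers with Bayes' theorem: what fraction of
# people flagged by the model are actually terrorists?
def p_terrorist_given_flag(accuracy, terrorists=100, population=60_000_000):
    true_pos = accuracy * terrorists
    false_pos = (1 - accuracy) * (population - terrorists)
    return true_pos / (true_pos + false_pos)

for acc in (0.999, 0.99, 0.98, 0.90):
    print(f"{acc:.1%} accurate model -> {p_terrorist_given_flag(acc):.4%} of flags are real")
```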

The National Information Infrastructure – holes in the road(map)

In my work as part of the Open Data User Group I have come across a secret* document: the National Information Infrastructure.

The idea, which has come out of the Shakespeare Review, is to identify the Government datasets that need to be protected and (potentially) made open in the public interest.


Ignoring the inconvenient fact that two of the most significant datasets won’t fall within its remit (the Royal Mail’s Postal Address File, which was conveniently sold off with the Royal Mail; and Ordnance Survey’s MasterMap, which is never, ever going to be open) the idea seems sound.  Data is increasingly important, and Government has a role to play in supporting and protecting it.

But there are some big holes in the road.

Firstly, much of the important data isn’t, and will never be, public open data. It is data that we rely on, held by commercial organisations.  This data is vital to the economic well-being of the country. In fact, much of it is necessary just to make things work!

Just imagine what would happen to the country if there was a significant loss of data in one of the major telecommunications companies?  And bear in mind that telephony today is very much a data business. Or what about if one of our banks had its data maliciously wiped? Most money is data, not pound coins. It would make the financial meltdown look trivial (don’t believe me, then think – would you be willing to buy or sell things if you weren’t confident that the money you were using actually existed, or would continue to exist in ten minutes time?).

And it doesn’t take quite such a catastrophic event to cause problems. Fat finger incidents are already capable of causing significant problems. 

The second issue is the interlinking of physical and data assets. Yes, data is important. But, until the singularity, data sits somewhere. On servers. And it’s transferred via networks.  And these are vulnerable to attacks.  The attacks can be “friendly” (yes NSA, I’m giving you the benefit of the doubt) or malicious (the result of Heartbleed, for example), but they can happen.  And the cloud makes life more complex. Just where exactly is that data you were talking about?  Whose jurisdiction does your national asset reside in?  

And the third problem is legislative. What will the impact of legislation be on your national asset? Some will be beneficial (commitments to open data), others will be troublesome, or even damaging.  Best to think these through and highlight them upfront.

So, if we see the NII in its present form as an end point then it is a disappointing missed opportunity.  But, if we see it as the starting point for a recognition of the vital role of data in society, then it has promise… 

*Not really – the existence of the NII was made public last year.

Prescriptive analytics? My Twitter spat…

So at the Gartner BI Summit I got myself into a Twitter spat with the conference chair over the term “Prescriptive Analytics”.

Gartner have decided that the world of advanced analytics is split into four elements: Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics.  Two of those categories – Descriptive and Predictive – will be very familiar: there are clear technical and conceptual differences between them (perhaps most succinctly captured in the old neural network terms unsupervised and supervised).


Diagnostic and Prescriptive Analytics are a bit different though, and I’m struggling to see what they mean that is significantly different from Descriptive or Predictive.

Gartner have an image that tries to take this further:



Image (c) Gartner

So here are my issues.

1) Descriptive vs Diagnostic

I’m not convinced that there is a real difference here. I don’t buy the idea that Descriptive analysis wouldn’t answer the question “Why did it happen?” or that Diagnostic analysis wouldn’t ask the question “What happened?”.  In fact (of course) you also typically use techniques from predictive analysis to help you with both of these – Cox Proportional Hazard Modelling would be one approach that springs to mind.  Technically it’s a two target regression approach, but it’s used to understand as much as to predict.

2) Predictive vs Prescriptive

The apparent difference here is twofold: firstly Predictive doesn’t lead directly to action, but Prescriptive does.  This simply doesn’t hold water.  Predictive analysis can lead directly to action.  Many predictive algorithms are embedded in systems to do exactly that. And if you contend that even that involves some human intervention, then the same is absolutely true of Prescriptive analytics – someone has to create the logic that integrates the analysis into the business process.

3) Prescriptive involves techniques that are different than Predictive

The suggestion is that meta-techniques such as constraint-based optimisation and ensemble methods are qualitatively different and stand alone as a special category.  I don’t agree.  They don’t stand alone.  You can do predictive analytics without descriptive, and descriptive without predictive. You can’t do ‘prescriptive’ analytics without predictive.  It doesn’t stand on its own.  I’d also argue that these are techniques that have always been used with predictive models: sometimes internally within the algorithms, sometimes by hand, and sometimes by software.

4) Only prescriptive analytics leads directly to decision and action

Without human intervention. This also just isn’t true. I dare anyone to build prescriptive analytics without involving people to build the business logic, validate the activities, or just oversee the process. Yet this is the claim. Data mining is a fundamentally human, business focused activity. Think otherwise and you’re in for a big fall.  And, yet again, productionising predictive models has a long tradition – this is nothing new.

But the final defence of Prescriptive Analytics is that it is a term that has been adopted by users.  Unfortunately this doesn’t seem to be the case. Gartner use it, but they need to sell new ideas. SAS and IBM also use it, but they are desperate to differentiate themselves from R. A few other organisations do use it, but when pressed will admit they use it because “Gartner use it and we wanted to jump on their bandwagon”. But I could be wrong, so I looked at Google.

Predictive analytics: 904,000 results

Prescriptive analytics: 36,000 results

Take out SAS/IBM: 17,500 results

The Conference Conundrum

I’m here at the Teradata Partners conference in Dallas (one of my favourite conferences (full disclosure, I’m employed by Teradata)), and enjoying myself immensely.

Of course there are always a few problems with these big conferences:

  1. The air-con is set to arctic
  2. The coffee in breakfast is terrible
  3. I always want to go to sessions that clash

I’ve long since given up on the air-con and the coffee.  It seems these are pretty much immutable laws of conferences.  But what about the scheduling?  Surely there is a (and I hesitate to say it) big data approach to making the scheduling better?

I have no* evidence, but I suspect that current scheduling approaches essentially revolve around avoiding having the same person speak in two places at the same time, and making sure that your ‘big’ speakers are in the biggest halls.

But we’ve all been to presentations in aircraft hangars with three people in the audience, and we’ve all been to the broom-closet with a thousand people straining to hear the presenter.

And above all, we’ve all been hanging around in the corridor trying to decide which of the three clashing sessions we should go to next.

The long walk

So maybe, just maybe, there is a better way.

How? Well this year’s Partners Conference gave us the ability to use an app or an online tool to choose which sessions we wanted to see.  So I did it.  Two minutes in BZZZZZZZ – you have a clash!  But I wanted to see both of those sessions!  Tough.  Choose one.

But.  What if they had asked me what I wanted to see before they had allocated sessions to time slots and rooms?

They would have ended up with a dataset that would allow someone to optimise the sessions for every attendee.  This would really change the game: we’d move from an organiser-designed system to a user-designed system.

But wait! There’s more!

Are you tired of having to walk 500m between sessions?  We could also optimise for distances walked.  And we could make a better guess at which sessions need the aircraft hangar, and which would be just fine in the broom closet. And we could do collaborative filtering and suggest sessions that other people were interested in…
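As a toy illustration of the idea, here’s a brute-force clash-minimiser over invented wish-lists. Brute force only works at this tiny scale – a real conference would need a proper optimiser, plus room-capacity constraints – but the principle is the same:

```python
from itertools import product

# Toy clash-minimising scheduler. Attendees submit wish-lists before
# sessions are assigned to time slots; we pick the assignment with the
# fewest clashes (two wished-for sessions sharing a slot). Invented data.
sessions = ["hadoop", "sql", "viz", "ethics"]
wishlists = [
    {"hadoop", "sql"},
    {"hadoop", "viz"},
    {"sql", "viz"},
]
n_slots = 2  # sessions sharing a slot run in parallel

def clashes(assignment):
    slot_of = dict(zip(sessions, assignment))
    return sum(
        1
        for wl in wishlists
        for a in wl
        for b in wl
        if a < b and slot_of[a] == slot_of[b]
    )

# Try every assignment of sessions to slots and keep the best.
best = min(product(range(n_slots), repeat=len(sessions)), key=clashes)
print(dict(zip(sessions, best)), "clashes:", clashes(best))
```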

And guess what?  We have the technology to do this.

Next year, Partners?

The worst big data article ever?

There are many, many bad articles on big data. It’s almost impossible to move without tripping over another pundit trying to rubbish the topic. Partly this is just the inevitable sound of people trying to make a name for themselves by being counter-factual. It’s far easier to stand out when you’re fighting against the tide. Even if you end up getting very wet…

Fortunately St Nate of Silver has actually analysed the data and it’s clear that pundits are fundamentally useless.

But occasionally you come across something so egregiously crap that you have to comment.

This week’s crap-du-jour comes courtesy of Tom Leberecht and Fortune.

In it he decides to lump together almost every woe in the world and pile them at the feet of big data. So here are my rebuttals:

Big Data = Big Brother?

This hasn’t been a good couple of weeks for the field of data mining. The NSA scandal has caused sales of Nineteen Eighty-Four to rise – though unfortunately not quite as fast as journalists’ use of the phrase “Big Brother”.

In his article Leberecht oddly passes this over, and instead mentions the evil of passing on data to private companies in a sideways swipe at quantified self.

Perhaps he forgets to mention it because it appears that people can also see the positive side? There are real issues around privacy, anonymity and data security, but pretending that the age of big data is the cause is rather odd.

Big data is not social

Well firstly, hasn’t he heard of Social Network Analysis? But secondly he seems to be advancing the argument that the status of X (relationships) is threatened by allowing Y (data analysis).  Sound familiar?  Yes that’s the argument against gay marriage. Somehow if my gay friends get married, my marriage will be threatened.

Well for the record analysing data doesn’t mean that humanity will be diminished. Welcome to the world of science! Were we more human in the 17th century? Or the 13th? Because there was a hell of a lot less analysis then than in the 1950s…

Finally on this topic, what about the growing Data Philanthropy movement? Every week we see new initiatives where people want to apply big data to address real social issues in ways that couldn’t happen before.

Big data creates smaller worlds

Apparently big data filters our perception, and limits our openness to new ideas and cultures. Really? To go back to gay marriage – can we imagine this being a possibility 20 years ago? The ability to interact with and identify unusual events and groups means that there is far more diversity than there ever was. A goal of marketers is to open people’s eyes to new things (and to get them to buy them). Leberecht seems to think that the collaborative filtering that Amazon famously use would only ever return you to the same book.

Big data – and opening yourself up to ideas that aren’t part of your narrow ‘intuition’ – will surely make your world bigger and more diverse…
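Since collaborative filtering keeps coming up, here is roughly what it does, as a toy Python sketch with made-up readers and purchases: score every book you haven’t bought by how similar its buyers’ tastes are to yours.

```python
# Made-up purchase histories: user -> set of books bought.
purchases = {
    "u1": {"1984", "Brave New World", "Dune"},
    "u2": {"1984", "Dune", "Neuromancer"},
    "u3": {"Brave New World", "The Trial"},
}

def jaccard(a, b):
    """Similarity of two sets: shared items over all items."""
    return len(a & b) / len(a | b)

def recommend(user, k=2):
    """Top-k books the user hasn't bought, weighted by buyer similarity."""
    mine = purchases[user]
    scores = {}
    for other, theirs in purchases.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        for book in theirs - mine:  # only books the user hasn't bought
            scores[book] = scores.get(book, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The point: by construction it recommends books you haven’t bought, weighted towards tastes adjacent to your own – the opposite of returning you to the same book.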

Big data is smarter, not wiser

The article makes it clear that wisdom has a twofold definition in Leberecht’s world: it is based on intuition (guesswork) and it is slow. Oh, and it also rejects feedback. Well firstly, big data isn’t always fast. Believe me, Hadoop isn’t a solution suited to the rigours of rapid operationalisation. There are other things for that. But as a definition of wisdom this seems to be a disaster.

Not only should you take the risk of making the wrong decision (intuition is guesswork), but you should do it slowly, and without paying attention to any feedback you get.  Truly this is fail slow writ large.

Big data is too obvious

I think this is the heart of Leberecht’s argument. He didn’t think of big data.

His example (that the financial collapse was caused by measurement) is patently wrong. The problem with mortgage backed securities was exactly the opposite: people failed to measure the risk and relied on intuition that the boom was going to continue indefinitely.

Big data doesn’t give

And then finally we hit the sleeping policeman of C.P. Snow’s two cultures. You are either an artist or a scientist. A businessman or a data scientist. Creativity belongs to the former, sterile analysis to the latter.

I’ll let you guess what I think of that!


A big data glossary

Big data is serious. Really serious. But at Big Data Week it became clear that we need to be able to laugh at ourselves… So here is my initial attempt at an alternative glossary for big data. Thanks to the many contributors (intentional and not), and apologies to anyone who disagrees. Enjoy.

Agile analytics: Fast, iterative, experimental analysis often performed by people who aren’t.

Apache: Shorthand for the Apache Software Foundation, which hosts around 100 open source projects. Not funny, but important.

Big brother: 1) Notional leader of Oceania in dystopian future novel Nineteen Eighty-Four. 2) Regular data related leader on front page of the Daily Mail in dystopian present 2013.

Big data: Any data problem that I’m working on. See also small data.

Cassandra: Open source, distributed data management system developed at Facebook. See Evil Empire.

Clickstream data: Data logged by client or web server as users navigate a website or software system. As in “In case of death please delete my clickstream”.

Data <insert profession>: Way of making a profession sound up to 22.7 times more sexy. Examples: data journalist, data scientist, data philanthropist. Not yet tried: data accountant, data politician.

Data journalist: Journalist who lets facts get in the way of a story.

Data model: The output of data modelling. Note that this is not a model who wants to sound more sexy.

Data modelling: Evil way of understanding how a process is reflected in data. Frowned upon.

Data miner: A data scientist who isn’t interested in a pay rise. Note that this is not a miner who wants to sound more sexy.

Data mining: The queen of sciences. Alternatively a business process that discovers patterns, or makes predictions from data sets using machine learning approaches.

Data scientist: A magical unicorn. With a good salary.

Dilbert test: Simple test for separating data scientists from IT people. If you take two equally funny cartoon strips, one Dilbert, the other XKCD, then a data scientist will prefer XKCD. An IT professional will prefer Dilbert. If anyone prefers Garfield they are in completely the wrong profession.

Elephants: Obligatory visual or textual reference by anyone involved in Hadoop. These are not the only animal in the zoo.

ETL: Extract, transform and load (ETL) – software and process used to move data.  Not to be confused with FTL, ELT, BLT, or more importantly DLT.

Evil Empire: large software/hardware/service vendor of choice. For example, Apple, Google, Facebook, IBM, Microsoft etc… The tag is independent of their evil doing capabilities.

Facebook: Privacy wrecking future Evil Empire.  Source of interesting graph data.

Fail fast: Low-risk, experimental approach to big data innovation where failure is seen as an opportunity to learn. Also excellent excuse when analysis goes terribly, terribly wrong.

Fail slow: Really, really bad idea.

Fork: Split in an open source project when two or more people stop talking to each other.

Google: Privacy wrecking future Evil Empire.  Source of interesting search data.

Google BigQuery: See Evil Empire.

Hadoop: Open source software controlled by the Apache Software Foundation that enables the distributed processing of large data sets across clusters of commodity servers. Often heard from senior managers: “Get me one of those Hadoops that everyone is talking about”.

Hadoop Hive: SQL interface to Hadoop MapReduce. Because SQL is evil. Except when it isn’t.

In-Memory Database: Trying to remember what you just did rather than writing it down. See also fail fast.

Internet of things: Stuff talking to other stuff. See singularity.

Java: Dominant programming language developed in the 90s at Sun Microsystems. It was later used to form Hadoop and other big data technologies.  More importantly a type of coffee.

Key value pairs: A way of avoiding evil data modelling.

MapReduce: Programming paradigm that enables scalability across multiple servers in Hadoop, essentially making it easier to process information on a vast scale. Sounds easy, doesn’t it?

MongoDB: NoSQL open source document-oriented database system developed and supported by 10gen. Its name derives from the word ‘humongous’. Seriously, where do they get these names?

NoSQL: No SQL. It’s evil. Do not use it, use something else instead.

NOSQL: Frequently (and understandably) confused with NoSQL. It actually means “not only SQL”, or that you forgot to take off caps lock.

No! SQL!: Phrase frequently heard as a badly written query takes down your database/drops your table.  See Bobby Tables.

Open data: Data, often from government, made freely available to the public. See also .pdf

Open source: An expensive way to not pay for something.

Python: 1) Dynamic programming language, first developed in the late 1980s. 2) Dynamic comedy team, first developed in the late 1960s.

Pig: High-level platform for creating MapReduce programmes used with Hadoop, also from Apache.

Real time: Illusory requirement for projects.

Relational database management systems (RDBMS): Big ol’ databases.

Singularity: When the stuff talking to other stuff finds our conversation boring.

Small data: (pejorative) Data that you are working on.

Social Media: Facebook, Twitter etc… Sources of trivial data in large volumes. Will save the world somehow.

Social Network: 1) A collection of people and how they interact. May also be on social media. 2) What data scientists really hope they will be part of.

SQL: Standard, structured query language specifically designed for managing data held in Big ol’ databases.

Twitter: Privacy wrecking future #EvilEmpire.  Source of interesting data in less than 140 characters.

Unstructured data: 1) The crap that just got handed to you with a deadline of tomorrow 2) The brilliant data source you just analysed way ahead of expectations. Often includes video and audio. You weren’t watching YouTube for fun.

V: Letter mysteriously associated with big data definitions. Usually comes in threes.  No one really knows why.

Variety: The thing you fail to get in big data definitions. “I didn’t see much variety in that definition!”

Velociraptors: Scary dinosaurs that really should be part of the definition of big data, terribly underused.

Velocity: The speed at which a speaker at a conference moves from the topic to praising their own company.

Volume: The degree to which a business analyst’s hair exceeds expectations.

XKCD: Important reference manual for big data.

ZooKeeper: Another open source Apache project, which provides a centralised infrastructure and services that enable synchronisation across a cluster.  Given all the animals in big data you can see why this was needed.
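And since the glossary mocks MapReduce for merely “sounding easy”, here is the whole paradigm as a toy in-memory word count in Python (made-up documents; real MapReduce distributes exactly these three steps – map, shuffle, reduce – across a cluster):

```python
from collections import defaultdict
from itertools import chain

docs = ["big data big brother", "big data week"]

# Map: emit a (word, 1) pair for every word in every document.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, one in chain.from_iterable(mapper(d) for d in docs):
    groups[word].append(one)

# Reduce: sum the grouped counts for each word.
counts = {word: sum(ones) for word, ones in groups.items()}
# counts == {"big": 3, "data": 2, "brother": 1, "week": 1}
```

Sounds easy. It even is easy – right up until the cluster, the failures and the stragglers get involved.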