
A brief* analysis** of Labour’s NEC Candidates

One of the frustrations of the Labour NEC elections is that you never really know who you're voting for. Well, obviously you know their names, and they're happy to give you an impressive list of the Committees they have been elected to (and they are always Committees, not committees).

But what is their stance? Who do they support? Are they Traitorous Blairite Scum™ or Vile Entryist Trots®? How do we know? No one actually says: “I’m backing Eagle. She’s great” or “Corbyn for the Win!”. So we’re left guessing, or poring through their candidate statements, or worse still, relying on the various lobby groups to give us a clue to the slates etc.

So, instead of that, because, frankly, it's boring, I decided to do a little bit of word frequency analysis, to see if there were words that were unique to one group, or at least more likely to come up in one group than the other.

To help me categorise the candidates, I followed a helpful guide from Totnes Labour Party. That gave me the two major slates: Progress/Labour First*** vs the something something Grassroots Alliance***.

So, all I need to do is go to the Labour Party site, download the candidate statements, and away I go. Except, of course, the Party website is bloody useless at giving actual information.

So once more it’s back to the Totnes Labour website, supported by Google. I can’t find statements from every candidate, but I get most of them. Enough for this, anyway.

Normally I would fire up R, load the data, and away I go. But actually there is so little of it that it’s time for some simple Excel kludging.

Methodology*****

First I put all the text from each slate into a single document, removed case sensitivity, and worked out the frequency of each word (with common stop words removed, natch).

I know that this means I can’t play with tf-idf (sorry, Chris), but really there weren’t enough documents to call this a corpus anyway.

Now I create a common list of words, with the frequency in each slate. This isn't as easy as it sounds, because a straightforward stemming approach isn't really going to cut it. These documents are in SocialistSpeak (Centre-left social democratic/democratic socialist sub-variant), and so need handcrafting. There is, for example, a world of difference between inclusive and inclusion.

So once that tedious task was out of the way, I took the frequency with which each word appeared in each slate, and simply subtracted one from the other.
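For the curious: had I fired up R after all, the core of it would look something like this. It's only a sketch – the file names and the token stop word list are illustrative, not what I actually used.

# Rough sketch: word frequencies per slate, then the difference between them
stop_words <- c("the", "and", "a", "to", "of", "in", "i", "is", "for", "that")

word_freq <- function(path) {
  text  <- tolower(paste(readLines(path, warn = FALSE), collapse = " "))
  words <- unlist(strsplit(text, "[^a-z]+"))          # crude tokenisation
  words <- words[words != "" & !(words %in% stop_words)]
  table(words) / length(words)                        # relative frequency
}

momentum <- word_freq("momentum_statements.txt")      # illustrative file names
progress <- word_freq("progress_statements.txt")

# Align the two vocabularies and subtract one frequency from the other
vocab <- union(names(momentum), names(progress))
delta <- setNames(rep(0, length(vocab)), vocab)
delta[names(momentum)] <- delta[names(momentum)] + as.numeric(momentum)
delta[names(progress)] <- delta[names(progress)] - as.numeric(progress)

head(sort(delta, decreasing = TRUE), 20)   # the most "Momentum" words
head(sort(delta), 20)                      # the most "Progress" words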

Results

Firstly, which are the most frequently used words across both slates?

Word Avg Freq
MEMBER 2.7%
LABOUR 2.5%
PARTY 2.3%
NEC 1.5%
WORK 1.1%
NEED 1.1%
ELECTION 1.1%
POLICY 0.9%
SUPPORT 0.9%
YEARS 0.9%
WIN 0.8%
CLP 0.8%
CAMPAIGN 0.8%
PEOPLE 0.7%
WILL 0.7%
GOVERNMENT 0.7%
POLITICS 0.7%
WALES 0.6%
NATIONAL 0.6%
JEREMY 0.6%

These words appear frequently in both camps. You can easily imagine statements containing them:

“Since I was elected to the NEC in 2010 I have spent six years supporting members of the Labour Party. Sadly Jeremy can’t win in Wales/Only Jeremy can win an election against this Government”

I don’t really know why Wales pops up.

But let’s look at each slate independently.

Team M

Firstly, Team Momentum****** (or the Centre-Left Grassroots Alliance (It isn’t clear where the apostrophe should go (or indeed, if there should be one))).


These are the words most likely to be in a Momentum candidate’s statement and not in a Progress candidate’s statement, in order of momentum (sorry, couldn’t resist):

WALES 1.2%
JEREMY 0.9%
YEARS 1.2%
PUBLIC 0.9%
COMMITMENT 0.6%
LONDON 0.8%
SERVICES 0.7%
NATIONAL 0.9%
CLP 1.1%
LABOUR 2.7%
ELECTION 1.3%
EQUALITY 0.5%
JOBS 0.5%
COMMITTEE 0.6%
SECRETARY 0.4%
WORK 1.3%
CLEAR 0.4%
BELIEVE 0.6%
UNION 0.7%
GIVE 0.4%

I think we’ve found where Wales comes from: a proud Welsh candidate on the Momentum slate.

They are proud of Jeremy, are probably active in their union, and have been a party secretary.

Team P/L1

How about Team Progress? (Or the less impressively websited Labour First)


NEED 1.7%
MEMBER 3.3%
PEOPLE 1.1%
WILL 1.1%
NEC 1.8%
WIN 1.2%
PARTY 2.6%
COUNTRY 0.6%
THINK 0.6%
FIGHT 0.7%
PUT 0.7%
LOCAL 0.8%
LISTEN 0.5%
VOTE 0.4%
NEW 0.7%
DECISIONS 0.3%
ENGAGE 0.3%
MEAN 0.3%
WELCOME 0.3%
POLITICS 0.8%

Woah! That looks very different. Without wanting to judge either group's politics, one looks (at first) like a very technocratic list of achievements, and the other a much broader, forward-looking approach. This may simply reflect the fact that the Leader of the Party is on the Momentum side, so they have more things to look at and say "this is our record", whereas the opposition have to look to the future.

Team WTF?

Are there any oddities?

Well, Blair and Brown (and Iraq) are mentioned by Progress, and not by Momentum, which is perhaps strange. Afghanistan is mentioned by Momentum.

Britain, England, and Scotland are mentioned far more by Momentum (as is Wales, but let’s ignore that for now).

Everyone is claiming to be a socialist. But not very much. (There was one use of socialist on each side. Of course maybe everyone assumes that everyone else knows they are a socialist).

Privatisation is mentioned by Momentum, renationalisation by Progress.

The word used most by Momentum but not at all by Progress is commitment.

The words used most by Progress and not at all by Momentum are country and think (it’s a tie).

But surely…?

What should I really have done to make this a proper analysis?

  • Firstly I should have got all the data.
  • Then I should probably have thought about some more systematic stemming, and looked at n-grams (there is a huge difference between Labour and New Labour).

There are a whole range of interesting things you can do with text analysis, or there are word clouds.

 Footnotes

*Very brief

**Only just, but more later!

***And this is why there is a problem. No one wants to call themselves: No Progress at All, or Labour Last, and everyone thinks that they represent the grassroots. I’ve yet to see a group that markets itself as the home of the elitist intellectuals****

****Well maybe the Fabians. But I didn’t say it.

*****This seems to over dignify it. But heck.

******They deny this.

 

DataScience post Brexit


This is not an analysis of how people voted, or why. If you’re interested then look at some of John Burn-Murdoch’s analysis in the FT, or Lord Ashcroft’s polling on reasons.

It is, however, an attempt to start to think about some of the implications for DataScience and data scientists in the new world.

Of course, we don’t really know what the new world will be, but here we go:

Big companies

The big employers are still going to need data science and data scientists. Where they are pretty much stuck in the UK (telcos, for example), data science will continue. If there is a recession then it may impact investment, but it's just as likely to encourage more data science.

Of course it's a bit different for multi-nationals. Especially for banks (investment, not retail). These are companies with a lot of money, and big investments in data science. Initial suggestions are that London may lose certain financial rights that come with being in the EU, and that somewhere between 30,000 and 70,000 jobs may move.

Some of these will, undoubtedly, be data science roles.

Startups

But what about startups? There are a few obvious things: regulation, access to markets, and investment.

We may find that the UK becomes a less regulated country, with less red tape. I somehow doubt it. But even if we do, in order to trade with Europe we will need to comply (more on that later).

Access to markets is another factor. And it is one that suggests that startups might locate, or relocate, in the EU.

Investment is also key. Will VCs be willing to risk their money in the UK? There is some initial evidence that they are less keen to do so, with news that some deals have been postponed or cancelled. Time will tell.

In many ways startups are the businesses most able to move. But the one thing that might, just might, keep them in the UK is the availability of data scientists. After the US, the UK is probably the next most developed market for data scientists.

Data regulation

At the moment the UK complies with EU data provisions using the creaking Data Protection Act 1998. That was due to be replaced by the new data directive. But from the UK’s perspective that may no longer happen.

That brings different challenges. Firstly the UK was a key part of the team arguing for looser regulation, so it’s likely that the final data directive may be stronger than it would otherwise have been.

Secondly, the data directive may simply not apply. But in that case what happens to movement of data from the EU to the UK? Safe Harbor, the regime that allowed EU companies to send data to the US, has been successfully challenged in the courts, so it is unlikely that a similar approach would fly.

What then? Would we try to ignore the data directive? Would we create our own, uniquely English data environment? Would we hope that the DPA would still be usable 20, 30 or 40 years after it was written?

People

Data scientists, at the moment, are rare. When (or if) jobs move, then data scientists will be free to move with them. They will almost always be able to demonstrate that they have the qualities wanted by other countries.

And many of our great UK data scientists are actually great EU data scientists. Will they want to stay? They were drawn here by our vibrant data science community… but if that starts to wither?

 

Fair data – fair algorithm?

In my third post about the ethics of data science I’m heading into more challenging waters: fairness.

I was pointed to the work of Sorelle Friedler (thank you @natematias, @otfrom, and @mia_out) on trying to remove unfairness in algorithms by addressing the data that goes into them rather than trying to understand the algorithm itself.

I think this approach has some really positive attributes:

  • It understands the importance of the data
  • It recognises that the real world is biased
  • It deals with the challenges of complex algorithms that may not be as amenable to interpretation as the linear model in my previous example.

Broadly speaking – if I’ve understood the approach correctly – the idea is this…

Rather than trying to interpret an algorithm, let's see if the input data could be encoding for bias. In an ideal world I would remove the variables gender and ethnicity (for example) and build my model without them.

However, as we know, in the real world there are lots of variables that are very effective proxies for these variables. For example, height would be a pretty good start if I was trying to predict gender…

And so that is exactly what they do! They use the independent variables in the model to see if you can classify the gender (or sexuality, or whatever) of the record.

If you can classify the gender then the more challenging part of the work begins: correcting the data set.

This involves ‘repairing’ the data, but in a way that preserves the ranking of the variable… and the challenge is to do this in a way that minimises the loss of information.

It’s really interesting, and well worth a read.
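To make that concrete, here is a toy sketch of the two steps as I understand them. To be clear, this is my own illustration with invented data, not the authors' code: first check whether the protected attribute can be predicted from the other variables, then "repair" a variable by mapping each group's within-group ranks onto the pooled distribution.

# Toy sketch of the idea (my reading of it, not the authors' implementation)
set.seed(42)
df <- data.frame(
  gender = sample(c("F", "M"), 1000, replace = TRUE),
  height = rnorm(1000, 170, 10),
  score  = rnorm(1000, 50, 10)
)
df$height <- df$height + ifelse(df$gender == "M", 8, 0)   # height now encodes gender

# Step 1: can the other variables recover the protected attribute?
fit  <- glm(I(gender == "M") ~ height + score, data = df, family = binomial)
pred <- predict(fit, type = "response") > 0.5
mean(pred == (df$gender == "M"))          # well above 0.5, so gender is recoverable

# Step 2: 'repair' a variable so the group distributions match,
# while preserving each group's internal ranking (rank -> pooled quantile)
repair <- function(x, group) {
  pooled <- quantile(x, probs = seq(0, 1, length.out = 1001), type = 1)
  out <- x
  for (g in unique(group)) {
    idx <- group == g
    r <- rank(x[idx]) / sum(idx)                      # within-group rank in (0, 1]
    out[idx] <- pooled[pmax(1, round(r * 1000) + 1)]  # same rank, pooled distribution
  }
  out
}
df$height_repaired <- repair(df$height, df$gender)

tapply(df$height, df$gender, mean)             # group means differ before repair
tapply(df$height_repaired, df$gender, mean)    # and are (near) identical after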

Some potential difficulties

Whilst I think it’s a good approach, and may be useful in some cases, I think that there are some challenges that this approach needs to address, both at a technical, and at a broader level.  Firstly though let’s deal with a couple of obvious ones:

  • This approach is focused around a US legal definition of disparate impact. That has implications for the approach taken
  • The concept of disparate impact is itself a contentious ethical position, with arguments for and against it
  • Because the approach is based on a legal situation, it doesn’t necessarily deal with wider ethical issues.

Technical challenges

As always, the joy of technical challenges is that you can find technical solutions. So here we go:

  • The focus of the work has been on classifiers – where there is a binary outcome. But in reality we’re entering the world of probability, where decisions aren’t clear cut. This is particularly important when considering how to measure the bias. Where do you put the cutoff?
  • Non-linear and other complex models also tend to work differentially well in different parts of the problem space. If you're using non-linear models to determine if data is biased then you may have a model that passes because the average discrimination is fair (i.e. below your threshold) but where there are still pockets of discrimination.
  • The effect of sampling is important, not least because some discrimination happens to groups who are very much in the minority. We need to think carefully about how to cope with groups that are (in data terms) significantly under-represented.
  • What happens if you haven’t recorded the protected characteristic in the first place? Maybe because you can’t (iPhone data generally won’t have this, for example), or maybe because you didn’t want to be accused of the bias that you’re now trying to remove.  There is also the need to be aware of the biases with which this data itself is recorded…

The real difficulties

But smart people can think through approaches to those.  What about the bigger challenges?

Worse outputs have an ethical dimension too:

If you use this approach you get worse outputs. Your model will be less accurate. I would argue that when considering this approach you also need to consider the ethical impact of a less predictive model. For example, if you were assessing credit worthiness then you may end up offering loans to people who are not going to be able to repay them (which adversely affects them as well as the bank!), and not offering loans to people who need them (because your pool of money to lend is limited). This is partially covered by the idea of the 'business necessity defence' in US law, but when you start dealing with probabilities it becomes much more challenging. The authors do have the idea of partially adjusting the input data, so that you limit the impact of correcting the data, but I'm not sure I'm happy with this – it smacks a bit of being a little bit pregnant.

Multiple protected categories create greater problems:

Who decides what protected categories are relevant? And how do you deal with all of them?

The wrong algorithm?

Just because one algorithm can classify gender from the data doesn't mean that a different one will predict using gender. We could be discarding excellent, discrimination-free models because we fear they might discriminate, rather than because they do. This is particularly important as often the model will be used to support current decision making, which may be more biased than the model that we want to use… We run the risk of entrenching existing discrimination because we're worried about something that may not be discriminatory at all (or at least less discriminatory).

Conclusions

If it sounds like I think this approach is a bad one, let’s be clear, I don’t. I think it’s an imaginative and exciting addition to the discussion.

I like its focus on the data, rather than the algorithm.

But, I think that it shouldn’t be taken in isolation – which goes back to my main thesis (oh, that sounds grand) that ethical decisions need to be taken at all points in the analysis process, not just one.


Sexist algorithms

Can an algorithm* be sexist? Or racist? In my last post I said no, and ended up in a debate about it. Partly that was about semantics, what parts of the process we call an algorithm, where personal ethical responsibility lies, and so on.

Rather than heading down that rabbit hole, I thought it would be interesting to go further into the ethics of algorithmic use…  Please remember – I’m not a philosopher, and I’m offering this for discussion. But having said that, let’s go!

The model

To explore the idea, let’s do a thought experiment based on a parsimonious linear model from the O’Reilly Data Science Salary Survey (and you should really read that anyway!)

So, here it is:

70577 intercept
 +1467 age (per year above 18; e.g., 28 is +14,670)
 –8026 gender=Female
 +6536 industry=Software (incl. security, cloud services)
–15196 industry=Education
 -3468 company size: <500
  +401 company size: 2500+
+32003 upper management (director, VP, CxO)
 +7427 PhD
+15608 California
+12089 Northeast US
  –924 Canada
–20989 Latin America
–23292 Europe (except UK/I)
–25517 Asia

The model was built from data supplied by data scientists across the world, and is in USD.  As the authors state:

“We created a basic, parsimonious linear model using the lasso with R2 of 0.382.  Most features were excluded from the model as insignificant”

Let’s explore potential uses for the model, and see if, in each case, the algorithm behaves in a sexist way.  Note: it’s the same model! And the same data.

Use case 1: How are data scientists paid?

In this case we’re really interested in what the model is telling us about society (or rather the portion of society that incorporates data scientists).

This tells us a number of interesting things: older people get paid more, California is a great place, and women get paid less.

–8026 gender=Female

This isn’t good.

Back to the authors:

“Just as in the 2014 survey results, the model points to a huge discrepancy of earnings by gender, with women earning $8,026 less than men in the same locations at the same types of companies. Its magnitude is lower than last year’s coefficient of $13,000, although this may be attributed to the differences in the models (the lasso has a dampening effect on variables to prevent over-fitting), so it is hard to say whether this is any real improvement.”

The model has discovered something (or, more probably, confirmed something we had a strong suspicion about).  It has noticed, and represented, a bias in the data.

Use case 2: How much should I expect to be paid?

This use case seems fairly benign.  I take the model, and add my data. Or that of someone else (or data that I wish I had!).

I can imagine that if I moved to California I might be able to command an additional $15,000 or so. Which would be nice.
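The sums really are that trivial. A sketch, using a few of the published coefficients above and a made-up person (the remaining coefficients work the same way):

# Scoring one (made-up) person against the published coefficients
predicted_salary <- function(age, female, california) {
  70577 +
    1467 * (age - 18) +        # per year above 18
    -8026 * female +           # gender = Female
    15608 * california         # based in California
  # (remaining coefficients – industry, company size, etc. – omitted for brevity)
}

predicted_salary(age = 30, female = 0, california = 0)   # about $88,000
predicted_salary(age = 30, female = 0, california = 1)   # about $104,000
predicted_salary(age = 30, female = 1, california = 1)   # about $96,000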

Use case 3: How much should I pay someone?

On the other hand, this use case doesn't seem so good. I'm using the model to reinforce the bad practice it has uncovered. In some legal systems this might actually be illegal, as if I take the advice of the model I will be discriminating against women (I'm not a lawyer, so don't take this as legal advice: just don't do it).

Even if you aren’t aware of the formula, if you rely on this model to support your decisions, then you are in the same ethical position, which raises an interesting challenge in terms of ethics. The defence “I was just following the algorithm” is probably about as convincing as “I was just following orders”.  You have a duty to investigate.

But imagine the model was a random forest. Or a deep neural network. How could a layperson be expected to understand what was happening deep within the code? Or for that matter, how could an expert know?

The solution, of course, is to think carefully about the model, adjust the data inputs (let’s take gender out), and measure the output against test data. That last one is really important, because in the real world there are lots of proxies…
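A sketch of that last step, with invented toy data and variable names: drop gender, fit the model, and then check whether the predictions still differ by gender on held-out data. If they do, a proxy is doing the work.

# Toy data: salary depends on experience, and experience correlates with gender here
set.seed(1)
n <- 2000
gender     <- sample(c("F", "M"), n, replace = TRUE)
experience <- rnorm(n, 10, 3) + ifelse(gender == "M", 2, 0)   # experience acts as a proxy
salary     <- 30000 + 3000 * experience + rnorm(n, 0, 5000)
toy        <- data.frame(salary, experience, gender)

train_idx <- sample(n, 0.7 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

model <- lm(salary ~ experience, data = train)   # gender deliberately left out
test$pred <- predict(model, newdata = test)

tapply(test$pred, test$gender, mean)   # predictions still differ by gender
t.test(pred ~ gender, data = test)     # because the proxy is doing the work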

Use case 4: What salary level would a candidate accept?

And now we’re into really murky water. Imagine I’m a consultant, and I’m employed to advise an HR department. They’ve decided to make someone an offer of $X and they ask me “do you think they will accept it?”.

I could ignore the data I have available: that gender has an impact on salaries in the marketplace. But should I? My Marxist landlord (don’t ask) says: no – it would be perfectly reasonable to ignore the gender aspect, and say “You are offering above/below the typical salary”**. I think it’s more nuanced – I have a clash between professional ethics and societal ethics…

There are, of course, algorithmic ethics to be considered. We’re significantly repurposing the model. It was never built to do this (and, in fact, if you were going to build a model to do this kind of thing it might be very, very different).

Conclusions

It’s interesting to think that the same model can effectively be used in ways that are ethically very, very different. In all cases the model is discovering/uncovering something in the data, and – it could be argued – is embedding that fact. But the impact depends on how it is used, and that suggests to me that claiming the algorithm is sexist is (perhaps) a useful shorthand in some circumstances, but very misleading in others.

And in case we think that this sort of thing is going to go away, it's worth reading about how police forces are using algorithms to predict misconduct.

 

*Actually to be more correct I mean a trained model…

** His views are personal, and not necessarily a representation of Marxist thought in general.


The ethics of data science (some initial thoughts)

Last night I was lucky enough to attend a dinner hosted by TechUK and the Royal Statistical Society to discuss the ethics of big data. As I’m really not a fan of the term I’ll pretend it was about the ethics of data science.

Needless to say there was a lot of discussion around privacy, the DPA and European Data Directives (although the general feeling was against a legalistic approach), and the very real need for the UK to do something so that we don’t end up having an approach imposed from outside.

People first


Kant: not actually a data scientist, but something to say on ethics

Both Paul Maltby and I were really interested in the idea of a code of conduct for people working in data – a bottom-up approach that could inculcate a data-for-good culture. This is possibly the best time to do this – there are still relatively few people working in data science, and if we can get these people now…

With that in mind, I thought it would be useful to remind myself of the data-for-good pledge that I put together, and (unsuccessfully) launched:

  • I will be Aware of the outcome and impact of my analysis
  • I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
  • I will be an Agent for change: use my analytical powers for positive good
  • I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine

OK, way too much alliteration. But (other than the somewhat West Coast Awesomeness) basically a good start. 

The key thing here is that, as a data scientist, I can’t pretend that it’s just data. What I do has consequences.

Ethics in process

But another way of thinking about it is to consider the actual processes of data science – here adapted loosely from the CRISP-DM methodology.  If we think of things this way, then we can consider ethical issues around each part of the process:

  • Data collection and processing
  • Analysis and algorithms
  • Using and communicating the outputs
  • Measuring the results

Data collection and processing

What are the ethical issues here? Well: ensuring that you collect data with permission, or in a way that is transparent; the repurposing of data (especially important for data exhaust); thinking carefully about biases that may exist; and planning and thinking about end use.

Analysis and algorithms

I'll be honest – I don't believe that data science algorithms are racist or sexist, for a couple of reasons: firstly, those require free will (something that a random forest clearly doesn't have); secondly, that would require the algorithm to be able to distinguish between a set of numbers that encoded for (say) gender and another that coded for (say) days of the week. Now the input can contain data that is biased, and the target can be based on behaviours that are themselves racist, but that is a data issue, not an algorithm issue, and rightly belongs in another section.

But the choice of algorithm is important. As is the approach you take to analysis. And (as you can see from the pledge) an awareness that this represents people and that the outcome can have impact… although that leads neatly on to…

Using and communicating the outputs

Once you have your model and your scores, how do you communicate its strengths, and more importantly its weaknesses? How do you make sure that it is being used correctly and ethically? I would urge people to compare things against current processes rather than theoretical ideals. For example, the output may have a gender bias, but (assuming I can't actually remove it) is it less sexist than the current system? If so, it's a step forwards…

I only touched on communication, but really this is a key, key aspect. Let’s assume that most people aren’t really aware of the nature of probability. How can we educate people about the risks and the assumptions in a probabilistic model? How can we make sure that the people who take decisions based on that model (and they probably won’t be data scientists) are aware of the implications?  What if they’re building it into an automated system? Well in that case we need to think about the ethics of:

Measuring the results

And the first question would be, is it ethical to use a model where you don’t effectively measure the results? With controls?

This is surely somewhere where we can learn from both medicine (controls and placebos) and econometricians (natural experiments). But both require us to think through the implications of action and inaction.

Using Data for Evil IV: The Journey Home

If you’re interested in talking through ethics more (and perhaps from a different perspective) then all of this will be a useful background for the presentation that Fran Bennett and I will be giving at Strata in London in early June.  And to whet your appetite, here is the hell-cycle of evil data adoption from last year…



Donald Trump

Who hates Donald Trump the most?

By now you're probably aware of the UK Government petition about Donald Trump. It has currently been signed by over 500,000 people, making it the most successful petition in the UK…

But who actually hates him most? The site conveniently provides a map, so you can see where people dislike him… but constituencies aren’t all equal, and you always run the risk of doing a heatmap of population densities:


(Source xkcd.com/1138)

So I thought I’d explore further.  Firstly, which are the top constituencies by proportion of the population who hate Donald?

Constituency MP Signed Percentage
Bethnal Green and Bow Rushanara Ali MP 1874 1.50%
Bristol West Thangam Debbonaire MP 1779 1.43%
Brighton Pavilion Caroline Lucas MP 1405 1.36%
Hackney South and Shoreditch Meg Hillier MP 1496 1.27%
Islington North Jeremy Corbyn MP 1284 1.24%
Cities of London and Westminster Rt Hon Mark Field MP 1364 1.24%
Glasgow North Patrick Grady MP 876 1.22%
Hackney North and Stoke Newington Ms Diane Abbott MP 1556 1.22%
Islington South and Finsbury Emily Thornberry MP 1237 1.20%
Edinburgh North and Leith Deidre Brock MP 1279 1.20%

OK, so we see some of the usual faces there. Jeremy Corbyn, Diane Abbott, Caroline Lucas. Maybe it's just a political thing.  Let's look at the bottom constituencies (again by the proportion who are keen on the bewigged one):

Constituency MP Signed Percentage
Blaenau Gwent Nick Smith MP 92 0.13%
Barnsley East Michael Dugher MP 119 0.13%
Makerfield Yvonne Fovargue MP 126 0.13%
Boston and Skegness Matt Warman MP 129 0.13%
Wentworth and Dearne Rt Hon John Healey MP 119 0.12%
Walsall North Mr David Winnick MP 117 0.12%
Doncaster North Rt Hon Edward Miliband MP 123 0.12%
Easington Grahame Morris MP 102 0.12%
Kingston upon Hull East Karl Turner MP 112 0.12%
Cynon Valley Rt Hon Ann Clwyd MP 78 0.11%

Well I wouldn’t (necessarily) have expected to see Ed Miliband and Ann Clwyd’s constituencies there.

So let’s go slightly more data sciency. What is the correlation between Labour vote and proportion hating Donald?

Rather than just looking at that, let’s look at a whole range of stuff:

Correlation

Woah! What was that?  Well that’s the correlations between a whole load of stuff and the percentage who hate Donald. Percentage is the second column (or row), and a blue box means a positive correlation, and a red box a negative one.

So let’s look at some of the more interesting ones, and sort them from highest to lowest. Remember that these are at a Constituency level.

  • PopulationDensity 0.64
  • Green 0.55
  • FulltimeStudent 0.53
  • Muslim 0.37
  • LD 0.19
  • Lab 0.07
  • Con -0.11
  • Ethnicity White -0.48
  • Christian -0.60
  • UKIP -0.61
  • Retired -0.67

So it would seem that the strongest correlation isn’t with Muslim populations, it’s actually strongest with built up areas, then with 2015 Green voters, then with full time students, and only then with Muslim areas.

And it's probably not surprising that those least likely to hate Donald are constituencies with lots of retired people, a strong UKIP presence (some immigration is OK then, it would seem), a large number of people identifying as Christian, and white people.

Taking it one small step further, and running our old friend, linear regression, we see a model like this (Labour and Con voters removed to avoid tons of correlation, LD voters removed out of sympathy):

Term Estimate Std. Error t value Pr(>|t|)
(Intercept) .557 .0705 7.89 E-14
Green .0201 .0016 12.78 < 2e-16
Muslim .0036 .0012 3.04 0.0025
Density .0035 .0003 12.54 < 2e-16
White .0033 .0007 4.78 2.2E-6
Student .0023 .0011 2.05 0.041
Christian -.0021 .0007 -2.97 0.0031
Lab -.0036 .0003 -11.14 < 2e-16
UKIP -.0118 .0007 -16.85 < 2e-16
Retired -.0173 .0020 -8.55 < 2e-16

A slightly different story – but looking at the key stand outs: if you want to find a constituency that really hates Donald, first look for Green voters, then a densely populated area, with quite a sizeable Muslim and white community, but keep away from those UKIP voters, especially the retired ones.
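For anyone who wants to reproduce the mechanics, it's only a few lines of R. This is a sketch, and the file and column names are hypothetical stand-ins for the merged petition, census and election data:

# Sketch of the mechanics (hypothetical merged file, one row per constituency)
seats <- read.csv("petition_by_constituency.csv")
seats$pct_signed <- 100 * seats$signed / seats$population

vars <- c("pct_signed", "PopulationDensity", "Green", "FulltimeStudent", "Muslim",
          "LD", "Lab", "Con", "UKIP", "White", "Christian", "Retired")
round(cor(seats[, vars], use = "complete.obs")["pct_signed", ], 2)   # the correlations

# The regression (Con dropped for collinearity, LD out of sympathy)
fit <- lm(pct_signed ~ Green + Muslim + PopulationDensity + White +
            FulltimeStudent + Christian + Lab + UKIP + Retired, data = seats)
summary(fit)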

*Data for the petition harvested at about noon, 11 December 2015
** Other data from ONS based on the 2011 Census
*** Featured image Donald Trump by Gage Skidmore
**** Northern Ireland doesn’t provide breakdown of petition numbers

STS forum. The strangest technology conference you’ve never heard of

At the beginning of October I was in Kyoto (yes, I can hear the tiny violins) attending the STS Forum on behalf of my employers.

What is the STS Forum?  Well this was the 12th meeting of a group focused on linking universities, technology companies, and governments to address global problems. The full name is Science and Technology in Society.

And it’s a really high level kind of thing. The opening was addressed by three prime ministers. There are more university vice-chancellors/provosts/rectors than you could imagine.  If you aren’t a professor then you’d better be a minister. No Nobel prize?  Just a matter of time.

So it's senior.  But is it about technology?  Or at least the technology that I'm familiar with?

PM Abe addresses STS Forum

The usual players?

Well the first challenge is the sponsors.  A bunch of big companies. Huawei, Lockheed Martin, Saudi Aramco, Toyota, Hitachi, NTT, BAT, EDF.

All big, all important (I leave it up to you to decide if they’re good).  But are these really who you’d expect? Where are IBM?  Oracle? SAP? Even Siemens? Never mind Microsoft, Apple, or (dare I say it) LinkedIn, Facebook etc…

I daren’t even mention the world of big data: MongoDB, Cloudera or others.

Panels and topics

Then there are the panelists.  90% male. (In fact the median number of women on a panel is zero).  They are largely old.  None of them seem to be ‘real world’ experts – most are in Government and academia.

The topics are potentially interesting, but I’m nervous about the big data one. It’s not clear that there are any actual practitioners here (I will feed back later!)

Attendees and Ts

I have never been to a technology conference that is so suited. Even Gartner has a less uptight feel. Over 1000 people and not a single slogan. Wow. I feel quite daring wearing a pink shirt. And no tie.

What could they do?

I’m not saying it’s a bad conference. But I’m not sure it’s a technology conference, and I’m 100% certain it’s not a tech conference.

If they want it to be a tech conference then they need to take some serious action on diversity (especially gender and age)*.  They also need to think about inviting people who feel more comfortable in a T-shirt. The ones with slogans. And who know who xkcd is.

And this seems to be the biggest problem: the conference seems to highlight the gulf between the three components that they talk about (the triple helix) – universities, government, big business – and the markets where the theory hits the road. The innovators, the open source community, the disruptors.

On to the Big Data session

Well that was like a flashback to 2013. Lots of Vs, much confusion. Very doge.

It wasn’t clear what we were talking about big data for. Plenty of emphasis on HPC but not a single mention of Hadoop.

Some parts of the room seemed concerned about the possible impact of big data on society. Others wanted to explore if big data was relevant to science, and if so, how.  So, a lot of confusion, and not a lot of insight…

It’s not just the Hadron Collider that’s Large: super-colliders and super-papers

During most of my career in data science, I’ve been used to dealing with analysis where there is an objective correct answer. This is bread and butter to data mining: you create a model and test it against reality.  Your model is either good or bad (or sufficiently good!) and you can choose to use it or not.

But since joining THE I’ve been faced with another, and in some ways very different problem – building our new World University Rankings – a challenge where there isn’t an absolute right answer.

So what can you, as a data scientist, do to make sure that the answer you provide is as accurate as possible? Well it turns out (not surprisingly) that the answer is being as certain as possible about the quality of, and biases in, the input data.

Papers and citations

One of the key elements of our ranking is the ability of a University to generate valuable new knowledge.  There are several ways we evaluate that, but one of the most important is around new papers that are generated by researchers. Our source for these is Elsevier’s Scopus database – a great place to get information on academic papers.

We are interested in a few things: the number of papers generated by a University, the number of papers with international collaboration, and the average number of citations that papers from a University get.

Citations are key. They are an indication that the work has merit. Imagine that in my seminal paper "French philosophy in the late post-war period" I chose to cite Anindya Bhattacharyya's "Sets, Categories and Topoi: approaches to ontology in Badiou's later work". I am telling the world that he has done a good piece of research.  If we add up all the citations he has received we get an idea of the value of the work.

Unfortunately not all citations are equal. There are some areas of research where authors cite each other more highly than in others. To avoid this biasing our data in favour of Universities with large medical departments, and against those that specialise in French philosophy, we use a field weighted measure. Essentially we calculate an average number of citations for every* field of academic research, and then determine how a particular paper scores compared to that average.

These values are then rolled up to the University level so we can see how the research performed at one University compares to that of another.  We do this by allocating the weighted count to the University associated with an author of a paper.
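In toy form, the weighting is just a ratio of a paper's citations to its field average, rolled up by institution. The real calculation over Scopus is rather more involved, but the shape of it is this (made-up numbers):

# Toy sketch of field-weighted citation impact, rolled up to institutions
papers <- data.frame(
  university = c("A", "A", "B", "B", "B"),
  field      = c("Medicine", "Philosophy", "Medicine", "Philosophy", "Philosophy"),
  citations  = c(40, 3, 25, 6, 1)
)

field_avg <- tapply(papers$citations, papers$field, mean)       # average per field
papers$weighted <- papers$citations / field_avg[papers$field]   # 1.0 = field average

tapply(papers$weighted, papers$university, mean)                # roll up to institutions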

The Many Authors problem

But what about papers with multiple authors?  Had Anindya been joined by Prof Jon Agar for the paper, then both learned gentlemen’s institutions would have received credit. Dr Meg Tait also joins, so we have a third institution that gains credit and so on.

Whilst the number of authors remains small, that works quite well.  I can quite believe that Prof Agar, Dr Tait and Mr Bhattacharyya all participated in the work on Badiou.

At this point we must depart from the safe world of philosophy for the dangerous world of particle physics**. Here we have mega-experiments where the academic output is also mega. For perfectly sound reasons there are papers with thousands of authors. In fact “Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC” has 2932 authors.  

Did they all contribute to the experiment? Possibly. In fact, probably. But if we include the data in this form in our rankings it has some very strange results.  Universities are boosted hugely if a single researcher participated in the project.

I feel a bit text bound, so here is a graph of the distribution of papers with more than 100 authors.


Frequency of papers with more than 100 authors

Please note that the vast majority of the 11 million papers in the dataset aren’t shown!  In fact there are approximately 480 papers with more than 2000 authors.

Not all authors will have had the same impact on the research. It used to be assumed that there was a certain ordering to the way that authors were named, and this would allow the reduction of the count to only the significant authors. Unfortunately there is no consensus across academia about how this should happen, and no obvious way of automating the process of counting it.

Solutions

How to deal with this issue? Well for this year we’re taking a slightly crude, but effective solution. We’re simply not counting the papers with more than 1000 authors. 1000 is a somewhat arbitrary cut off point, but a visual inspection of the distribution suggests that this is a reasonable separation point between the regular distribution on the left, and the abnormal clusters on the right.

In the longer term there are two approaches that would be viable: one technical, one structural.  The technical approach is to use fractional counting (2932 authors? Well you each get 0.034% of the credit).  The structural approach is more of a long-term solution: to persuade the academic community to adopt metadata that adequately explains the relationship of individuals to the paper that they are 'authoring'.  Unfortunately I'm not holding my breath on that one.
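In toy form again, here is the difference between this year's cut-off and the fractional alternative (made-up numbers, deliberately crude):

# Toy sketch: this year's cut-off versus fractional counting
papers <- data.frame(
  paper    = c("p1", "p2", "p3"),
  authors  = c(3, 12, 2932),
  weighted = c(1.2, 0.8, 4.0)      # field-weighted citation score
)

# This year's approach: exclude the mega-author papers entirely
kept <- papers[papers$authors <= 1000, ]

# The fractional alternative: each author carries 1/n of the paper's credit,
# so each of p3's 2,932 authors gets roughly 0.034% of it
papers$share_per_author  <- 1 / papers$authors
papers$credit_per_author <- papers$weighted * papers$share_per_author
papers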

*well, most

**plot spoiler: the world didn’t end

Some things I learned at Teradata

Over the last three and a half years I have led a fantastic team of data scientists at Teradata. But now it’s time for me to move on… so what did I learn in my time? What are the key Data Science messages that I’m going to take with me?


A lot of people don’t get it

What makes a good data scientist? One definition is that it is someone who can code better than a statistician, and do better stats than a coder. Frankly that’s a terrible definition, which really says you want someone who is bad at two different things.

In reality the thing that makes a good data scientist is a particular world view. One that appreciates the insight that data provides, and who is aware of the challenges and joys of data. A good data scientist will always want to jump into the data and start working on finding new questions, answers, and insights.  A great data scientist will want to do that, but will start by thinking about the question instead! If you throw a number at a good data scientist you’ll get a bunch of questions back…

Many people don’t have that worldview. And no matter how good they get at coding in R they will never make a good data scientist.

Data science is the Second Circle of data.

It’s one for the problem, two for the data, three for the technique

One of my favourite dislikes is the algorithm fetishists. A key learning from working across different customers and industries is that when analytical projects fail it's very rarely because the algorithm was sub-optimal. Usually it has been because the problem wasn't right – or the data didn't match the problem.

Where choice of algorithm is important is in consideration of the use of the solution (and potentially in the productionisation of it) rather than in terms of simple measures of performance.

Don’t be afraid of the simple answer

Yes, you know how to run an n-path. Or do Markov chain analysis. Or build a random forest. But if the answer can be generated from a simple chart, why would you use those other techniques? To show how clever you are?

There is another side – being aware that the simple answer may be wrong, and that the lure of simplicity is dangerous in itself. But usually if you get it then you know about that…

And of course there is also something to be said about the idea that the best ideas seem simple, but only after you’ve found them.

Stories are powerful

When you’re trying to sell an analytical approach (or even analytical software or hardware) the story you tell is vital. And the story might not be where the actual value is. Because to tell the story best you often use the edge cases. The best example comes from some work a colleague was doing. The actual analysis was great, but the thing that sold it to the client was a one-off event (albeit one that was ongoing) of such astonishing stupidity that it instantly caught the imagination. Everyone could immediately see that it was both crazy, and also that it was bound to happen. And it had been found through analysis.

I really wish I could tell you what it was! Buy me a drink sometime and you might find out…

Some of you may say that you’re not selling analytics. But if you’re a data scientist you are – to your boss, your co-workers, people you want to impress… and if you’re selling analysis you need to tell stories.

You still need to munge that data

So much time is spent dealing with data. This is one of the reasons that so many data scientists still use SQL (and it’s also a reason why logical modelling is still more attractive than late binding – I’m lazy and want someone else to have done some of the work first).

I wish it wasn’t the case. And I wish that tools were better at it than they are.

Don’t look for data scientists, look for data science people

Remember that when you want to recruit (and retain) data scientists that they are people. I’ve been lucky at Teradata to work with some fantastic people – both in my team, in the wider company, and at our clients.

I have a concern, however, that we (the data science community) are undervaluing some people, and as a result overlooking fantastic talent. A recent survey on data science salaries by O’Reilly included a regression model, and one of the key findings was that if you were a woman your salary dropped by $13k. For no reason whatsoever.

This seems bizarre to me, as I have had the privilege to work with some fantastic women in data: Judy Bayer, Fran Bennett, Garance Legrand, Kaitlin Thaney, Yodit Stanton and many many more*.

Data Science can change the world

Teradata believe in data philanthropy – the idea that if more social organisations use data for decisions that they will make better decisions, and that tech companies can play a part in helping them achieve this. Because of this they have supported DataKind and DataKind UK.

This is really important – because there are a bunch of challenges in helping charities and not for profits when it comes to data. The last thing these organisations need is well intentioned, but damaging, solutionism being dumped on them by West Coast gurus. There is nothing wrong in Elon Musk working on big issues through things like Tesla, but there is a whole bunch more that can be achieved if we can find sensitive ways to work with the people who deal with social problems everyday.

In my work with DataKind I’ve seen what data can do for charities, and this, in turn, has made me a better data scientist.

Where am I going?

I’m about to start a new career leading the data team at Times Higher Education – where we produce the leading ranking of Universities across the world.  I’ve loved my time at Teradata, and I’ve learnt some important stuff, but it’s time for a change!

*sorry if I didn’t mention you here…

Why David Cameron is wrong: the maths

Yesterday the UK Government suggested that an unnamed internet company could have prevented the murder of a soldier by taking action based on a message sent by the murderer.

It’s assumed that the company in question was Facebook.

The problem is that the maths tells us that this is simply wrong. It couldn’t have, and the reason why takes us to a graveyard near Old Street.


Buried in Bunhill Fields is Thomas Bayes, a non-conformist preacher and part-time statistician who died in 1761. He devised a theorem (now known as Bayes Theorem) that helps us to understand the real probability of infrequent events, based on what are called prior probabilities. And (thank God) events like this murder are infrequent.

For the sake of argument let's imagine that Facebook can devise a technical way to scan through messages and posts and determine if the content is linked to a terrorist action. This, in itself, isn't trivial. It requires a lot of text analytics: understanding idiom, distinguishing a throwaway "I could kill them" from a literal "I could kill them", and so on.

But Facebook has some clever analysts, so lets assume that they build a test. And let’s be generous: it’s 98% accurate. I’d be very happy if I could write a text analytics model that was that accurate, but they are the best. Actually let’s make it 99% accurate. Heck! Let’s make it 99.9% accurate!

So now we should be 99.9% likely to catch events like this before they happen?

Wrong.

So let’s look at what Bayes and the numbers tell us.

The first number of interest is the actual number of terrorists in the UK. The number is certainly small. This is the only recent event.

But recently the Home Secretary, Theresa May, told us that 44 terrorist events have been stopped in the UK by the security services. I will take her at her word. Now let’s assume that this means there have been 100 actual terrorists. Again, you can move that number up or down, as you see fit, but it’s certainly true that there aren’t very many.

The second number is the number of people in the UK. There are (give or take) 60 million.

(I’m going to assume that terrorists are just as likely, or unlikely, as the population as a whole to use Facebook. This may not be true, but it’s a workable hypothesis.)

So what happens when I apply my very accurate model?

Well the good news is that I identify all of my terrorists – or at least I identify 99.9 of them. Pretty good.

But the bad news is that I also identify 60,000 non-terrorists as terrorists. These are the false positives that my model throws up.

The actual chance that a person identified as a terrorist really is one is just 0.17%.

Now this is surely a huge advance over where we were – but imagine what would happen if we suddenly dropped 60,000 leads on the police. How would they be able to investigate? How would the legal system cope with generating these 60,000 warrants (yes, you would still need a warrant to see the material)?

And let's be clear; if we're more pessimistic about the model accuracy things get worse, fast. A 99% accurate model (still amazingly good) drops the chance that a flagged person is a real terrorist to 0.017%. At 98% it's 0.008%, and at a plausible 90% it would be 0.0015%. The maths is clear. Thank you Rev Bayes.
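If you want to check the sums yourself, the whole thing fits in a few lines of R, using the same assumed numbers as above (100 terrorists, 60 million people):

# Bayes Theorem with the numbers used above
posterior <- function(accuracy, terrorists = 100, population = 60e6) {
  true_pos  <- accuracy * terrorists                        # terrorists correctly flagged
  false_pos <- (1 - accuracy) * (population - terrorists)   # innocents wrongly flagged
  true_pos / (true_pos + false_pos)   # P(actually a terrorist | flagged by the model)
}

posterior(0.999)   # ~0.0017  (0.17%), with ~60,000 false positives
posterior(0.99)    # ~0.00017 (0.017%)
posterior(0.98)    # ~0.00008 (0.008%)
posterior(0.90)    # ~0.000015 (0.0015%)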