Thoughts on the numbers: Uber

Uber lost its licence.

It is, in many ways, an unpleasant company – making itself less unpleasant only by taking on deeply unpleasant vested interests. (To be transparent: I’m mildly on Uber’s side in this – I have had fewer unpleasant journeys in Uber than in Black Cabs. The worst Uber trip I’ve had has been safe and boring, the worst Black Cab trip I’ve had has involved racist commentary… But in the spirit of data, let’s just say that n is too small to draw definite conclusions from this).

But there are a couple of arguments going around that are, at best, unhelpful.

Uber doesn’t have 40,000 drivers

A number of people (letters in the Guardian, Twitter) are saying that it is ‘widely accepted’ that Uber doesn’t have 40,000 drivers as it claims. Surely that’s a ridiculously high number?

But in 2015 there were over 120,000 taxi drivers in London – including about 27,000 Black Cab drivers. We know that Uber’s position is to dominate a marketplace, so it doesn’t seem unlikely that there could be 40,000. And until someone comes up with a sourced alternative, it seems bizarre to claim that the number is wrong because.

If they do have 40,000 it’s because most of them work part time…

That might be true. But it’s only relevant from a data perspective if the balance of part time Uber drivers is different than the balance for any other group of taxi drivers. And as we will see it might not have an impact on one of the other major claims.

Uber isn’t safe

There are a number of claims about this, including basic misunderstandings, such as the idea that Uber has no licence, or that its drivers aren’t licensed.

However, we do have a bit of data to help answer the question of sexual assault, although it is, at best, partial: the Guardian reports that the Met Police are looking into 32 complaints of rape or sexual assault associated with Uber. Now to be clear, no sexual assault is acceptable.

But the claim is that Uber is dangerous and other taxis are safe. And that claim doesn’t seem to be supported by the data. Because the Guardian also reports that the total number of complaints that are associated with taxi drivers is 154.

So Uber are responsible for 20% of complaints, whilst having 33% of drivers.

These numbers are (of course) open to be challenged if better data becomes available.
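For anyone who wants to check the arithmetic, here is a minimal sketch using only the figures quoted above. Two assumptions to flag: it takes Uber’s claimed 40,000 drivers at face value, and it treats "over 120,000" as exactly 120,000. The variable names are mine.

```python
# Figures quoted in the post (Guardian reports and the 2015 driver counts).
uber_complaints = 32
all_taxi_complaints = 154
uber_drivers = 40_000        # Uber's own claim, taken at face value
all_taxi_drivers = 120_000   # "over 120,000", treated here as a round figure

# Uber's share of complaints versus its share of drivers.
complaint_share = uber_complaints / all_taxi_complaints
driver_share = uber_drivers / all_taxi_drivers

# Complaints per 1,000 drivers, for Uber and for everyone else.
uber_rate = 1000 * uber_complaints / uber_drivers
other_rate = (
    1000
    * (all_taxi_complaints - uber_complaints)
    / (all_taxi_drivers - uber_drivers)
)

print(f"Uber share of complaints: {complaint_share:.1%}")
print(f"Uber share of drivers:    {driver_share:.1%}")
print(f"Complaints per 1,000 drivers: Uber {uber_rate:.2f}, others {other_rate:.2f}")
```

The per-driver rate is the same comparison seen from another angle. A per-journey rate would need journey counts, which we don’t have.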

If the claim that many Uber drivers work only part time is true, then a better measure, such as complaints per journey rather than per driver, might be more relevant. And it is quite likely that this data will come out in court. But it seems capricious to deny numbers that do exist in favour of numbers that don’t (yet).

Where is the data?

Of course one of the obvious conclusions is that TfL should do a far far better job of publishing data that is important to the public in London.


A brief* analysis** of Labour’s NEC Candidates

One of the frustrations of the Labour NEC elections is that you never really know who you’re voting for. Well, obviously you know their names, and they’re happy to give you an impressive list of the Committees they have been elected to (and they are always Committees, not committees).

But what is their stance? Who do they support? Are they Traitorous Blairite Scum™ or Vile Entryist Trots®? How do we know? No one actually says: “I’m backing Eagle. She’s great” or “Corbyn for the Win!”. So we’re left guessing, or poring through their candidate statements, or worse still, relying on the various lobby groups to give us a clue to the slates etc.

So, instead of that, because, frankly, it’s boring, I decided to do a little bit of word frequency analysis, to see if there were words that were unique, or at least more likely to come up in one group or another.

To help me categorise the candidates, I followed a helpful guide from Totnes Labour Party. That gave me the two major slates: Progress/Labour First*** vs the something something Grassroots Alliance***.

So, all I need to do is go to the Labour Party site, download the candidate statements, and away I go. Except, of course, the Party website is bloody useless at giving actual information.

So once more it’s back to the Totnes Labour website, supported by Google. I can’t find statements from every candidate, but I get most of them. Enough for this, anyway.

Normally I would fire up R, load the data, and away I go. But actually there is so little of it that it’s time for some simple Excel kludging.


First I put all the text from each slate into a single document. Remove case sensitivity, and work out the frequency of each word (with common stop words removed, natch).

I know that this means I can’t play with tf-idf (sorry, Chris), but really there weren’t enough documents to call this a corpus anyway.

Now I create a common list of words, with the frequency in each slate. This isn’t as easy as it sounds, because a straightforward stemming approach isn’t really going to cut it. These documents are in SocialistSpeak (Centre-left social democratic/democratic socialist sub variant), and so need handcrafting. There is, for example, a world of difference between inclusive and inclusion.

So once that tedious task is out of the way, I took the frequency with which each word appeared in each slate, and simply subtracted one from the other.
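I did this in Excel, but the same steps can be sketched in a few lines of Python. This is a rough illustration, not what I actually ran: the stop-word list is a tiny made-up one, the two "slates" are invented sentences, and the function names are mine. It also skips the handcrafted word grouping described above.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "is",
              "for", "on", "we", "our"}


def word_shares(text):
    """Return each non-stop word's share of all non-stop words in the text."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def share_difference(text_a, text_b):
    """Share in slate A minus share in slate B, for every word in either."""
    a, b = word_shares(text_a), word_shares(text_b)
    return {w: a.get(w, 0.0) - b.get(w, 0.0) for w in set(a) | set(b)}


# Made-up stand-ins for the two slates' combined statements.
slate_a = "I have worked for the party and my union for years"
slate_b = "We need to win and the party will need to fight"

diff = share_difference(slate_a, slate_b)
# Most "slate A" words first, most "slate B" words last.
for word, d in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {d:+.1%}")
```

Words used equally by both sides come out at zero, which is why the shared vocabulary drops out of the per-slate lists.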


Firstly, which are the most frequently used words across both slates?

Word Avg Freq
PARTY 2.3%
NEC 1.5%
WORK 1.1%
NEED 1.1%
YEARS 0.9%
WIN 0.8%
CLP 0.8%
WILL 0.7%
WALES 0.6%

These words appear frequently in both camps. You can easily imagine statements containing them:

“Since I was elected to the NEC in 2010 I have spent six years supporting members of the Labour Party. Sadly Jeremy can’t win in Wales/Only Jeremy can win an election against this Government”

I don’t really know why Wales pops up.

But let’s look at each slate independently.

Team M

Firstly, Team Momentum****** (or the Centre-Left Grassroots Alliance (It isn’t clear where the apostrophe should go (or indeed, if there should be one))).


These are the words most likely to be in a Momentum candidate’s statement and not in a Progress candidate’s statement, in order of momentum (sorry, couldn’t resist):

WALES 1.2%
YEARS 1.2%
CLP 1.1%
JOBS 0.5%
WORK 1.3%
CLEAR 0.4%
UNION 0.7%
GIVE 0.4%

I think we’ve found where Wales comes from: a proud Welsh candidate on the Momentum slate.

They are proud of Jeremy, are probably active in their union, have been a party secretary.

Team P/L1

How about Team Progress? (Or the less impressively websited Labour First)


NEED 1.7%
WILL 1.1%
NEC 1.8%
WIN 1.2%
PARTY 2.6%
THINK 0.6%
FIGHT 0.7%
PUT 0.7%
LOCAL 0.8%
VOTE 0.4%
NEW 0.7%
MEAN 0.3%

Woah! That looks very different. Without wanting to judge either group’s politics, one looks (at first) like a very technocratic list of achievements, and the other a very wide, forward-looking approach. This may simply be reflecting the fact that the Leader of the Party is on the Momentum side, so they have more things to look at and say “this is our record”, whereas the opposition have to look to a future.

Team WTF?

Are there any oddities?

Well, Blair and Brown (and Iraq) are mentioned by Progress, and not by Momentum, which is perhaps strange. Afghanistan is mentioned by Momentum.

Britain, England, and Scotland are mentioned far more by Momentum (as is Wales, but let’s ignore that for now).

Everyone is claiming to be a socialist. But not very much. (There was one use of socialist on each side. Of course maybe everyone assumes that everyone else knows they are a socialist).

Privatisation is mentioned by Momentum, renationalisation by Progress.

The word used most by Momentum but not at all by Progress is commitment.

The words used most by Progress and not at all by Momentum are country and think (it’s a tie).

But surely…?

What should I really have done to make this a proper analysis?

  • Firstly I should have got all the data.
  • Then I should probably have thought about some more systematic stemming, and looking at n-grams (there is a huge difference between Labour and New Labour).
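To show what the n-gram point means in practice, here is a small sketch. The sentence is made up and the ngrams helper is my own. Counting single words lumps every "labour" together, while counting pairs keeps "new labour" separate.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return a Counter of n-word sequences found in the text."""
    words = text.lower().split()
    return Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )


# Made-up sentence, purely for illustration.
sample = "new labour was not old labour and labour is not new labour"

unigrams = ngrams(sample, 1)
bigrams = ngrams(sample, 2)

print(unigrams["labour"])      # every mention of 'labour', whatever precedes it
print(bigrams["new labour"])   # only the mentions that follow 'new'
```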

There are a whole range of interesting things you can do with text analysis, or there are word clouds.


*Very brief

**Only just, but more later!

***And this is why there is a problem. No one wants to call themselves: No Progress at All, or Labour Last, and everyone thinks that they represent the grassroots. I’ve yet to see a group that markets itself as the home of the elitist intellectuals****

****Well maybe the Fabians. But I didn’t say it.

*****This seems to over dignify it. But heck.

******They deny this.


DataScience post Brexit


This is not an analysis of how people voted, or why. If you’re interested then look at some of John Burn-Murdoch’s analysis in the FT, or Lord Ashcroft’s polling on reasons.

It is, however, an attempt to start to think about some of the implications for DataScience and data scientists in the new world.

Of course, we don’t really know what the new world will be, but here we go:

Big companies

The big employers are still going to need data science and data scientists. Where they are pretty much stuck (telcos, for example) then data science will continue. If there is a recession then it may impact investment, but it’s just as likely to encourage more data science.

Of course it’s a bit different for multi-nationals. Especially for banks (investment, not retail). These are companies with a lot of money, and big investments in data science. Initial suggestions are that London may lose certain financial rights that come with being in the EU, and that 30-70,000 jobs may move.

Some of these will, undoubtedly, be data science roles.


But what about startups? There are a few obvious things: regulation, access to markets, and investment.

We may find that the UK becomes a country with less red tape. I somehow doubt it. But even if we do, in order to trade with Europe we will need to comply (more of that later).

Access to markets is another factor. And it is one that suggests that startups might locate, or relocate, in the EU.

Investment is also key. Will VCs be willing to risk their money in the UK? There is some initial evidence that they are less keen on doing so, with news that some deals have been postponed or cancelled. Time will tell.

In many ways startups are the businesses most able to move. But the one thing that might, just might keep them in the UK is the availability of data scientists. After the US the UK is probably the next most developed market for data scientists.

Data regulation

At the moment the UK complies with EU data provisions using the creaking Data Protection Act 1998. That was due to be replaced by the new data directive. But from the UK’s perspective that may no longer happen.

That brings different challenges. Firstly the UK was a key part of the team arguing for looser regulation, so it’s likely that the final data directive may be stronger than it would otherwise have been.

Secondly, the data directive may simply not apply. But in that case what happens to movement of data from the EU to the UK? Safe Harbor, the regime that allowed EU companies to send data to the US, has been successfully challenged in the courts, so it is unlikely that a similar approach would fly.

What then? Would we try to ignore the data directive? Would we create our own, uniquely English data environment? Would we hope that the DPA would still be usable 20, 30 or 40 years after it was written?


Data scientists, at the moment, are rare. When (or if) jobs move, then data scientists will be free to move with them. They will almost always be able to demonstrate that they have the qualities wanted by other countries.

And many of our great UK data scientists are actually great EU data scientists. Will they want to stay? They were drawn here by our vibrant data science community… but if that starts to wither?


Who hates Donald Trump the most?

By now you’re probably aware of the UK Government petition about Donald Trump. It has currently been signed by over 500,000 people, making it the most successful petition in the UK…

But who actually hates him most? The site conveniently provides a map, so you can see where people dislike him… but constituencies aren’t all equal, and you always run the risk of doing a heatmap of population densities:



So I thought I’d explore further.  Firstly, which are the top constituencies by proportion of the population who hate Donald?

Constituency                       MP                     Signed  Percentage
Bethnal Green and Bow              Rushanara Ali MP         1874  1.50%
Bristol West                       Thangam Debbonaire MP    1779  1.43%
Brighton Pavilion                  Caroline Lucas MP        1405  1.36%
Hackney South and Shoreditch       Meg Hillier MP           1496  1.27%
Islington North                    Jeremy Corbyn MP         1284  1.24%
Cities of London and Westminster   Rt Hon Mark Field MP     1364  1.24%
Glasgow North                      Patrick Grady MP          876  1.22%
Hackney North and Stoke Newington  Ms Diane Abbott MP       1556  1.22%
Islington South and Finsbury       Emily Thornberry MP      1237  1.20%
Edinburgh North and Leith          Deidre Brock MP          1279  1.20%

OK, so we see some of the usual faces there. Jeremy Corbyn, Diane Abbott, Caroline Lucas. Maybe it’s just a political thing. Let’s look at the bottom constituencies (again by the proportion who are keen on the bewigged one):

Constituency             MP                         Signed  Percentage
Blaenau Gwent            Nick Smith MP                  92  0.13%
Barnsley East            Michael Dugher MP             119  0.13%
Makerfield               Yvonne Fovargue MP            126  0.13%
Boston and Skegness      Matt Warman MP                129  0.13%
Wentworth and Dearne     Rt Hon John Healey MP         119  0.12%
Walsall North            Mr David Winnick MP           117  0.12%
Doncaster North          Rt Hon Edward Miliband MP     123  0.12%
Easington                Grahame Morris MP             102  0.12%
Kingston upon Hull East  Karl Turner MP                112  0.12%
Cynon Valley             Rt Hon Ann Clwyd MP            78  0.11%

Well I wouldn’t (necessarily) have expected to see Ed Miliband and Ann Clwyd’s constituencies there.
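As a sanity check on why raw signature counts mislead, here is a rough sketch that backs out the population implied by a few rows of the two tables above. The rows are copied from the tables; the implied_population helper is mine, and because the published percentages are rounded the results are only approximate.

```python
# (constituency, signatures, percentage) copied from the two tables above.
rows = [
    ("Bethnal Green and Bow", 1874, 1.50),
    ("Glasgow North", 876, 1.22),
    ("Blaenau Gwent", 92, 0.13),
    ("Cynon Valley", 78, 0.11),
]


def implied_population(signed, percentage):
    """Population implied by a signature count and the percentage it represents."""
    return signed / (percentage / 100)


for name, signed, pct in rows:
    people = implied_population(signed, pct)
    print(f"{name:24s} {signed:5d} signatures, ~{people:,.0f} people")
```

Glasgow North and Blaenau Gwent come out at a broadly similar size, yet one has nearly ten times the signatures of the other. That gap is what the percentage column captures and a raw-count heatmap would blur.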

So let’s go slightly more data sciency. What is the correlation between Labour vote and proportion hating Donald?

Rather than just looking at that, let’s look at a whole range of stuff:


Woah! What was that?  Well that’s the correlations between a whole load of stuff and the percentage who hate Donald. Percentage is the second column (or row), and a blue box means a positive correlation, and a red box a negative one.

So let’s look at some of the more interesting ones, and sort them from highest to lowest. Remember that these are at a Constituency level.

  • PopulationDensity   0.64
  • Green               0.55
  • FulltimeStudent     0.53
  • Muslim              0.37
  • LD                  0.19
  • Lab                 0.07
  • Con                -0.11
  • Ethnicity White    -0.48
  • Christian          -0.60
  • UKIP               -0.61
  • Retired            -0.67

So it would seem that the strongest correlation isn’t with Muslim populations, it’s actually strongest with built up areas, then with 2015 Green voters, then with full time students, and only then with Muslim areas.
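For anyone unsure what those numbers are, each one is a Pearson correlation between the petition percentage and one other constituency-level variable. Here is a minimal sketch of the calculation. The five data points are invented for illustration, not real constituencies, and the pearson function is my own rather than whatever produced the figures above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values for five imaginary constituencies.
signed_pct = [0.12, 0.30, 0.55, 0.90, 1.40]   # % of population signing
density = [5, 20, 35, 60, 110]                # people per hectare (made up)
retired_pct = [24, 20, 17, 12, 8]             # % retired (made up)

print(f"signed vs density: {pearson(signed_pct, density):+.2f}")
print(f"signed vs retired: {pearson(signed_pct, retired_pct):+.2f}")
```

A value near +1 means the two rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.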

And it’s probably not surprising that those least likely to hate Donald are constituencies with lots of retired people, a strong UKIP presence (some immigration is OK then, it would seem), a large number of people identifying as Christian, and white people.

Taking it one small step further, and running our old friend, linear regression, we see a model like this (Labour and Con voters removed to avoid tons of correlation, LD voters removed out of sympathy):

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.557     0.0705       7.89  E-14
Green          0.0201    0.0016      12.78  < 2e-16
Muslim         0.0036    0.0012       3.04  0.0025
Density        0.0035    0.0003      12.54  < 2e-16
White          0.0033    0.0007       4.78  2.2e-6
Student        0.0023    0.0011       2.05  0.041
Christian     -0.0021    0.0007      -2.97  0.0031
Lab           -0.0036    0.0003     -11.14  < 2e-16
UKIP          -0.0118    0.0007     -16.85  < 2e-16
Retired       -0.0173    0.0020      -8.55  < 2e-16

A slightly different story – but looking at the key stand outs: if you want to find a constituency that really hates Donald, first look for Green voters, then a densely populated area, with quite a sizeable Muslim and white community, but keep away from those UKIP voters, especially the retired ones.
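One way to read the table is to ask how much the predicted value moves when a single variable goes up by ten units and everything else stays put. The sketch below does just that with the coefficients copied from the table. The helper name is mine, and the units of each variable are whatever the model used, which I’m deliberately not assuming here.

```python
# Coefficients copied from the regression table above (intercept left out,
# since it does not change when a predictor changes).
COEFFICIENTS = {
    "Green": 0.0201,
    "Muslim": 0.0036,
    "Density": 0.0035,
    "White": 0.0033,
    "Student": 0.0023,
    "Christian": -0.0021,
    "Lab": -0.0036,
    "UKIP": -0.0118,
    "Retired": -0.0173,
}


def effect_of_change(name, delta):
    """Change in the prediction when one variable moves by delta units,
    holding every other variable fixed."""
    return COEFFICIENTS[name] * delta


# Largest upward push first, largest downward push last.
for name in sorted(COEFFICIENTS, key=COEFFICIENTS.get, reverse=True):
    print(f"{name:10s} {effect_of_change(name, 10):+.3f}")
```

Bear in mind that a coefficient’s size depends on the scale of its variable, so the t values are a better guide to which effects are robust.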

*Data for the petition harvested at about noon, 11 December 2015
** Other data from ONS based on the 2011 Census
*** Featured image Donald Trump by Gage Skidmore
**** Northern Ireland doesn’t provide breakdown of petition numbers