How in love with AI are you?

AI is a problematic term at the moment. There is an awful lot of conflation between true existential/ubiquitous computing/end of the world AI on the one hand, and a nerd in a basement programming a decision tree in R on the other.

Which makes for some nice headlines. But isn’t really helpful to people who are trying to work out how to make the most (and best) of the new world of data opportunities.

So to help me, at least, I have devised something I call the LaundryCulture Continuum. It helps me to understand how comfortable you are with data and analytics.

(Because I realise that the hashtag #LaundryCulture might confuse, I’ve also coined the alternative #StrossMBanks Spectrum).

So here are the ends of the Continuum, and a great opportunity to promote two of my favourite writers.

In the beautiful, elegant and restrained corner sits The Culture. Truly super-intelligent AI minds look benevolently on us lesser mortals, in a post-scarcity economy. This is the corner of the AI zealots.

(Cover image: The Player of Games, Iain M. Banks)

In the squamous corner sits The Laundry, protecting us from eldritch horrors that can be summoned by any incompetent with a maths degree and a laptop. This is the home of the AI haters.

(Cover image: The Atrocity Archives, Charles Stross)

Where do people sit? Well it’s pretty clear that Elon Musk sits towards The Culture end of the Continuum. Naming his SpaceX landing barges Just Read The Instructions and Of Course I Still Love You is a pretty big clue.

The Guardian/Observer nexus is hovering nearer The Laundry, judging by most of its recent output.

Others are more difficult… But if I’m talking to you about AI, or even humble data mining, I would like to know where you stand…

In defence of algorithms

I was going to write a blog about how algorithms* can be fair. But if 2016 was the year in which politics went crazy and decided that foreigners were the source of all problems, it looks like 2017 has already decided that the real problem is that foreigners are being assisted by evil algorithms.

So let’s be clear. In the current climate, people who believe that data can make the world a better place need to stand up and say so. We can’t let misinformation and Luddism wreck the opportunities for the world going forward.

And there is a world of misinformation!

For example, there is currently a huge amount of noise about algorithmic fairness (Nesta here, The Guardian here, et al). I’ve already blogged a number of times about this (1, 2, 3), but decided (given the noise) that it was time to gather my thoughts together.


(Most of) RoboCop’s prime directives (image from RoboCop, 1987)

tl;dr: Don’t believe the hype, and don’t rule out things that are fairer than what happens at the moment.

Three key concepts

So here are some concepts that I would suggest we bear in mind:

  1. The real world is mainly made up of non-algorithmic decisions, and we know that these are frequently racist, sexist, and generally unfair.
  2. Intersectionality is rife, and in data terms this means multicollinearity. All over the place. (See the sketch after this list.)
  3. No one has a particularly good definition of what fairness might look like. Not even lawyers (there are a number of laws about disproportionate impact, but even there it gets tricky).
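To make point 2 concrete, here is a minimal sketch in R with entirely made-up attributes (nothing here is real data): when two attributes largely overlap, a model struggles to say which of them is actually driving the outcome.

```r
# Hypothetical example: two heavily overlapping attributes (think intersecting
# demographic indicators). The outcome is driven by x1 alone, but because x2
# largely mirrors x1 the model cannot cleanly attribute the effect.
set.seed(3)
n  <- 2000
x1 <- rbinom(n, 1, 0.3)                                 # attribute A
x2 <- ifelse(runif(n) < 0.95, x1, rbinom(n, 1, 0.3))    # attribute B, mostly mirroring A
y  <- 1 + 0.5 * x1 + rnorm(n)                           # outcome depends on A only

cor(x1, x2)                 # ~0.95: the attributes are heavily entangled
summary(lm(y ~ x1))         # attribute A alone: a tight, clear estimate
summary(lm(y ~ x1 + x2))    # A and B together: the standard errors balloon
```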

On the other side, what are the campaigners for algorithmic fairness demanding? And what are their claims?

Claim 1: if you feed an algorithm racist data it will become racist.

At the simplest level, yes. But (unlike in at least one claim) it takes more than a single racist image for this to happen. In fact I would suggest that, generally speaking, machine learning is not good at spotting weak cases: this is the challenge of the ‘black swan’. If you present a single racist example then ML will almost certainly ignore it. In fact, if racism is in the minority in your examples, it will probably be downplayed further by the algorithm: the algorithm will be less racist than reality.
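As a toy sketch in R (invented data, default settings throughout, not a claim about any real system): train a bog-standard decision tree on a thousand historical decisions of which exactly one is biased, and the tree never learns the bias, because a single case cannot justify a split.

```r
# Toy example: 1,000 hypothetical hiring decisions, all driven by experience,
# except for ONE biased rejection of a group-B candidate. A decision tree with
# default settings ignores the lone biased case entirely.
library(rpart)

set.seed(42)
n <- 1000
experience <- round(runif(n, 0, 20), 1)
group      <- factor(sample(c("A", "B"), n, replace = TRUE))

hired <- experience >= 10                      # the "fair" historical rule
biased_row <- which(group == "B" & hired)[1]   # pick one group-B candidate...
hired[biased_row] <- FALSE                     # ...and record a biased rejection

fit <- rpart(factor(hired) ~ experience + group, method = "class")
print(fit)   # the tree splits on experience only; the single biased case is ignored
```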

If there are more racist cases than non-racist cases then either you have made a terrible data selection decision (possible), or the real problem is with society, not with the algorithm. Focus on fixing society first.

Claim 2: algorithmic unfairness is worse/more prevalent than human unfairness

Algorithmic unfairness is a first world problem. It’s even smaller scale than that: it’s primarily a minority concern even in the first world. Yes, there are examples in the courts in the USA, and in policing. But if you think that the problems of algorithms are the most challenging ones facing poor and BAME communities in the judicial system then you haven’t been paying attention.

Claim 3: to solve the problem people should disclose the algorithm that is used

Um, this gets technical. Firstly, what do you mean by the algorithm? I can easily show you the code used to build a model; it’s probably taken from CRAN or GitHub anyway. But the actual model? Well, if I’ve used a sophisticated technique, a neural network or a random forest etc., it’s probably not going to be sensibly interpretable.
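As a quick illustration (an off-the-shelf CRAN package and a built-in toy dataset, nothing exotic): the code below is short enough to publish in full, but the fitted model it produces is hundreds of trees, not something you can hand to a layperson, or a regulator, as an explanation.

```r
# The code is short and shareable; the model itself is not meaningfully readable.
library(randomForest)   # off-the-shelf CRAN package

fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)              # error rates and a confusion matrix, not a scoring rule
# getTree(fit, k = 1)   # any one of the 500 trees is readable, but no single
                        # tree is "the model"
```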

So what do you mean? Share the data? For people data you are going to run headlong into data protection issues. For other data you are going to hit the fact that it will probably be a trade secret.

So why not just do what we do with human decisions? We examine the actual effect. At this point learned judges (and juries, though bear in mind Bayes) can determine whether the outcome was illegal.
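(The Bayes point deserves a number. A toy calculation with made-up figures: a forensic match with a 1-in-10,000 false positive rate sounds damning, but if the match is the only evidence and there are a million plausible suspects, the probability of guilt is only around 1%.)

```r
# Toy prosecutor's-fallacy calculation with invented numbers.
population          <- 1e6        # plausible pool of suspects (assumption)
p_match_if_innocent <- 1 / 10000  # false positive rate of the forensic test
prior_guilty        <- 1 / population

# Bayes' theorem, assuming the true culprit always matches
p_match          <- 1 * prior_guilty + p_match_if_innocent * (1 - prior_guilty)
posterior_guilty <- (1 * prior_guilty) / p_match
posterior_guilty   # ~0.01: the match alone is nowhere near proof
```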

And in terms of creation? Can we stop bad algorithms from being created? Probably not. But we can do what we do with humans: make sure that the people teaching them are qualified and understand how to make sensible decisions. That’s where people like the Royal Statistical Society can come in…

Final thoughts

People will say “you’re ignoring real world examples of racism/sexism in algorithms”. Yes, I am. Plenty of people are commenting on those, and yes, they need fixing. But very, very rarely do people compare the algorithm with the pre-existing approach. That is terrible practice. Don’t give human bias a free pass.

And most of those examples have been down to (frankly) beginners’ mistakes. Or misuse. None of which is especially unique to the world of ML.

So let’s stand up for algorithms, but at the same time remember that we need to do our best to make them fair when we deploy them, so that they can go on beating humans.

 

* no, I really can’t be bothered to get into an argument about what is, and what is not an algorithm. Let’s just accept this as shorthand for anything like predictive analytics, stats, AI etc…


DataScience post-Brexit

(Image: the EU flag)

This is not an analysis of how people voted, or why. If you’re interested then look at some of John Burn-Murdoch’s analysis in the FT, or Lord Ashcroft’s polling on reasons.

It is, however, an attempt to start to think about some of the implications for DataScience and data scientists in the new world.

Of course, we don’t really know what the new world will be, but here we go:

Big companies

The big employers are still going to need data science and data scientists. Where they are pretty much stuck here (telcos, for example), data science will continue. If there is a recession then it may impact investment, but it’s just as likely to encourage more data science.

Of course it’s a bit different for multinationals, especially for banks (investment, not retail). These are companies with a lot of money, and big investments in data science. Initial suggestions are that London may lose certain financial rights that come with being in the EU, and that 30,000-70,000 jobs may move.

Some of these will, undoubtedly, be data science roles.

Startups

But what about startups? There are a few obvious things: regulation, access to markets, and investment.

We may find that the UK becomes a less regulated country with less red tape. I somehow doubt it. But even if we do, in order to trade with Europe we will need to comply (more on that later).

Access to markets is another factor. And it is one that suggests that startups might locate, or relocate, in the EU.

Investment is also key. Will VCs be willing to risk their money in the UK? There is some initial evidence that they are less keen on doing so: news that some deals have been postponed or cancelled. Time will tell.

In many ways startups are the businesses most able to move. But the one thing that might, just might, keep them in the UK is the availability of data scientists. After the US, the UK is probably the most developed market for data scientists.

Data regulation

At the moment the UK complies with EU data provisions using the creaking Data Protection Act 1998. That was due to be replaced by the new data directive. But from the UK’s perspective that may no longer happen.

That brings different challenges. Firstly the UK was a key part of the team arguing for looser regulation, so it’s likely that the final data directive may be stronger than it would otherwise have been.

Secondly, the data directive may simply not apply. But in that case what happens to movement of data from the EU to the UK? Safe Harbor, the regime that allowed EU companies to send data to the US, has been successfully challenged in the courts, so it is unlikely that a similar approach would fly.

What then? Would we try to ignore the data directive? Would we create our own, uniquely English data environment? Would we hope that the DPA would still be usable 20, 30 or 40 years after it was written?

People

Data scientists, at the moment, are rare. When (or if) jobs move, then data scientists will be free to move with them. They will almost always be able to demonstrate that they have the qualities wanted by other countries.

And many of our great UK data scientists are actually great EU data scientists. Will they want to stay? They were drawn here by our vibrant data science community… but if that starts to wither?

 

The day the (medical) data broke free…


Today is a good day for data – at least in healthcare. At last the data from the NHS is being set free.

For my international colleagues and friends it’s worth pointing out some things about the NHS. The UK National Health Service* is actually a very large and complex organisation that cares for the nation’s health needs. The main arms are the GP services and the hospital services. GPs are self-employed and effectively contracted by the NHS. Hospitals are islands unto themselves within regional groupings. Above all lie funding and commissioning structures. Sounds complex? From a data perspective it certainly is. The data generated by the system is often hand-written, frequently trapped in isolated systems, and barely usable for joined-up services, never mind research.

On the positive side, it’s free** at point of use, and generally does a good job.

There have been signs for a while that the NHS has been starting to think about data:

  • Dr Carl Reynolds (@drcjar) at http://openhealthcare.org.uk/ has been leading the way on doing good things with health data, including running NHS hack days. If you want to get involved, the next one is in Liverpool on 22-23 September.
  • The UK set up the BioBank project, aimed at providing a longitudinal study of people who aren’t necessarily ill. If you think about it, it’s fairly obvious that most people who go to their doctor are ill – BioBank aims to understand which factors in people’s lives were the same as, or different from, other people’s, before and after they became ill.
  • Dr Ben Goldacre (@bengoldacre) has been leading a crusade to get clinical research data (even from trials that are abandoned or not published) into the public domain so that it can be used to compare outcomes.

But now the Government has gone much, much further and has created the Clinical Practice Research Datalink. In addition to having a funky website, this aims to bring together data from across the NHS so that this vast dataset can be used to improve health outcomes.

Of course there is a very, very big cloud hanging over this: how do you anonymise patient data so that it is still useful? Simply removing names and addresses won’t deal with the issue, as Ross Anderson of Cambridge University identifies (the Guardian again – don’t say they aren’t fair and balanced!).
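A quick simulation (invented patients, invented postcode districts) shows why. Strip out the names and addresses, keep a handful of innocuous-looking fields, and a surprising proportion of records are still unique:

```r
# Simulated records: no names, no addresses, just three "harmless" fields.
set.seed(1)
n <- 10000
records <- data.frame(
  birth_year        = sample(1920:2010, n, replace = TRUE),
  sex               = sample(c("F", "M"), n, replace = TRUE),
  postcode_district = sample(paste0("B", 1:100), n, replace = TRUE)  # hypothetical districts
)

# What proportion of patients are the only person with their combination of fields?
combo <- interaction(records$birth_year, records$sex, records$postcode_district, drop = TRUE)
sum(table(combo) == 1) / n   # more than half of these simulated patients are unique
```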

But I think, on balance, I disagree with Ross. I’ve come to the conclusion that we can’t rely on privacy, and that giving up a guarantee of privacy in exchange for free medical care is probably reasonable in itself. Especially as the guarantee isn’t really worth much these days. When you add to this the potential benefits to research, the answer is even more obvious. How many people would be happy to give up their privacy if they knew that one day they, or their kids, might be relying on the treatment that resulted?

*Actually there are three: NHS England and Wales, NHS Scotland, and NHS Northern Ireland – but let’s assume they are the same thing for this argument. NHS E&W is by far the largest.

**Nearly.

Data *insert profession here*

In thinking about our data revolution, and pondering on my self-declared job title (always the best kind of job title if you ask me), it occurred to me: how would professions change if we put the word “data” in front of the title? Would it make them better, or worse? Would data actually improve the way that the professions work?

Of course, this has already happened for data science, and increasingly for data journalism – and the two reflect different approaches. In the first case it is applying science to data; in the second it is applying data to journalism.

But if you assume that Data *insert profession here* is about applying data science approaches to <profession>, then what could it mean? Would it make the world better? Let’s try it and see…


A DataKind data dive: more on that later

Data Journalism – leading the way

Assuming we all have our opinions about data science, let’s look instead at Data Journalism. Go and examine The Guardian’s data section (led by the excellent @simonrogers).  Here you will find stories developed from public data, data being used to test the truth of other stories, and the crowdsourcing of journalism.

Famously, when the UK Government tried to convince everyone that BlackBerry (remember them?) Messenger and Twitter were the cause of the riots in the UK, The Guardian dug up the data that proved that messages followed incidents, not vice versa.

It’s exciting, and it’s not clear where it will all end up.  Simon would probably tell you the same.  But it does change the way that journalists work and interact with sources.

So if it works for Data Journalism, where else might it work?

Data Politician

How many politicians have a science background?  The answer is very few.  In the UK there was my MP, Dr Lynne Jones, but she retired at the last election.   Actually there are sites that will tell you that the number is 27 (although being me I looked further and found that was only based on a sample of 430).

It’s still not many.

Could a scientific approach to politics, using data, help? How would we feel if politicians actively set up experiments to validate their policies? Let’s be clear: current policy ‘trials’ tend to get the results that politicians want, and tend to be neither controlled nor statistically significant. I’ve got to wonder if that is because politicians as a whole are unfamiliar with the words “control”, “statistical” or “significant”.
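For a sense of what a properly run trial actually demands, here is a back-of-the-envelope calculation in base R (illustrative numbers, not a comment on any particular policy): to detect a modest effect of 0.2 standard deviations at the conventional 5% significance level with 80% power, you need roughly 400 participants in each arm.

```r
# How big does a controlled policy trial need to be to detect a modest effect?
# delta = 0.2 standard deviations, 5% significance level, 80% power.
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8)
# n comes out at roughly 394 per group, before you even think about dropouts
```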

How would we react to experiments?  Would we be willing to tolerate being in one? And how would we treat a politician who did an experiment and changed their position as a result? Would we just shout “U-turn” at them?

Ironically, politicians are already using the results of experimentation in their marketing/election efforts. Obama has a large team of predictive modellers whose task is to identify and target likely voters, and I’m sure Romney does too.

If only they would apply this to their policies!

Data Judge

Another area where data could surely add value is the policing and criminal justice system. We can predict vulnerabilities, identify mitigating strategies, and even try to modify punishments using data and experimentation.

Does the death penalty reduce the murder rate? What programmes reduce reoffending? What are the causes of criminality, and can they be reduced?

Sadly we seem to be heading in exactly the opposite direction in the UK. In my opinion, given the importance of statistical understanding in modern forensic evidence, any judge who can’t do basic statistics should immediately recuse themselves from any case.

Data Philanthropist

Now to my mind this is the biggest, and one that already has traction in the US.  Jake Porway (@jakeporway) has been leading the field with his wonderful DataKind (@datakind).  The idea is golden: if we can do cool things with data for business/industry, why can’t we do cool things with data for charity/not-for-profits?

And it’s coming to the UK. On the 29th and 30th of September we hope there will be a DataKind data dive in London. A first chance for UK not-for-profits to get free insight into their data. A first chance for data scientists to try their hands at data philanthropy.

If you know a charity that could benefit, get in touch. If you want to be involved get in touch too.

The death of Cartography

(Image: OpenStreetMap view of Cambridge)

So it appears that Apple have decided it’s time to ditch Google Maps in favour of their own maps, built on TomTom data.

Many people have commented on the business wisdom of this, the relative amounts of money that Google make from Maps compared with Android (about four times as much!), and the relative strengths and weaknesses of the different platforms.

What few people have commented on, which is surprising, is the death of the map maker’s art.

Not in the press release

I accept that this wasn’t one of the things that Apple chose to highlight, but it’s there nevertheless: in the comment that they will access “anonymous real-time crowdsourced data from our iOS users to keep this up to date.”

In layman’s terms, they will be using your input to make the maps more accurate.  And near real time.

Of course this isn’t the first time this has been done. OpenStreetMap has done this explicitly for a while now, and in some countries is as reliable as the official maps. So why not go with them, Apple? Well, firstly, because the data is openly licensed, Apple would have to release the data back into the wild. And secondly because Apple need something with a consistent minimum standard now – not in three months’ time.

Another difference is that OpenStreetMap requires active participation. With the right analytics behind it, and a far bigger community, Apple’s maps could do so much more.

Number of OpenStreetMap users: 600,000

Number of iPhones: 100,000,000+

Issues: privacy and unemployed cartographers

As the folks from OpenPaths have identified, your phone’s geography tells people a lot about you, and at least in the US there is a question about whether police need a warrant to get at your phone data. This is a whole lot more accurate data, gathered in a similar way to Waze.

And what about the cartographers? Well, if there is still a place for them it may be as curators of the information, rather like Wikipedia senior editors. If not…

“Data Scientists are hardcore coders…”: discuss

Yesterday I almost had a heart attack when an esteemed colleague (who shall remain nameless) came out with the statement: “Data Scientists are people who are hardcore Hadoop coders”…  had I totally misunderstood him?  Or was I so out of step with the world that I had totally misunderstood data science?

This is all the more important for two reasons:

  1. My job title (full disclosure, I made it up) is Director Data Science
  2. I’m busy trying to recruit data scientists for my team.

Well, to be honest, I could probably live with being wrong about 1.  LinkedIn will never find out, so that’s OK.

But whilst I’ve been engaged in recruitment I have had to decide what it is I’m looking for in candidates. So here it is… in descending order of importance, a data scientist will be:

Curious

The first, and most important trait is curiosity.  Insane curiosity.  In many walks of life evolution selects against the kind of person who decides to find out what happens “if I push that button”.  In Data Science it selects for it.

In my own analytical experience nothing has come close to the feeling when you discover something new (even if other people have already been there). In the fifth form, working out how to prove what the square root of -9 was. At university… well, too much to drink there, but at work, discovering that we could push complex analytics onto an MPP system. That complex things (cars) failed with the same distribution as simple things (their components). That social networks could be used to predict some things, and that they couldn’t be used to predict others.

And that last one is important too: the joy of disproving something!

Analytical

I expect any data scientist to have a background in, and an understanding of, complex analytics. I don’t mean reporting. I’ve nothing against reporting, it’s important and someone has to do it. But not a data scientist. I’m after people who can build a model that predicts something, or who can cluster data, who know the tricks of creating a good dataset, and when a model result is too good. And importantly people who can tell me if the result is statistically significant or just one of those things.

When it comes to Big Data “those things” will come up more and more often as our data gets bigger.
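Here is a small sketch in R of what “those things” look like: correlate one random outcome against a couple of hundred columns of pure noise and, at the conventional 5% level, a handful of them will look ‘significant’ purely by chance. The more columns your data has, the more of these you will find.

```r
# 200 predictors of pure noise, one random outcome, and still ~10 "significant" hits.
set.seed(7)
outcome    <- rnorm(1000)
predictors <- matrix(rnorm(1000 * 200), ncol = 200)

p_values <- apply(predictors, 2, function(x) cor.test(x, outcome)$p.value)
sum(p_values < 0.05)   # roughly 5% of 200, i.e. around 10 false discoveries
```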

Communicative

I have no use for people who are unable to communicate with non-specialists. It’s hard enough discussing these topics within the community – we need people who can explain them to those outside the community: the users of the services we will provide.

Of course communication is two way, and the data scientist needs to listen too.

Novel

The data scientist needs to provide additional value above and beyond what’s happening already.  You can provide a fantastic new way of predicting churn that will only cost $1 million and uses data sources that are already in use?  And it doesn’t outperform the existing methods.  Hmmm.

Novelty, either in ways of thinking, or in terms of the data and approaches to be used, is vital.

Business focused

Obviously by business I mean “focused on the overall objectives of the organisation you’re working with or in”, but that’s a bit long-winded. Again, data scientists need to get their heads out of the algorithms and into the business problems. If you can tell me the correct parameterisation for a novel take on SVM, but can’t tell me the top three issues for a business (and how big data can help fix them), then you aren’t a data scientist, you’re an academic.

A coder

Last and least. Yes, it would be nice if you can code Hadoop. Or C#. Or R. But this is a passing phase brought on by a lack of good interfaces, it’s not a permanent state of affairs. So, if you have this skill, good for you. But if you only have this skill it’s time to get out into the world.