The ethics of data science (some initial thoughts)

Last night I was lucky enough to attend a dinner hosted by TechUK and the Royal Statistical Society to discuss the ethics of big data. As I’m really not a fan of the term I’ll pretend it was about the ethics of data science.

Needless to say there was a lot of discussion around privacy, the DPA and European Data Directives (although the general feeling was against a legalistic approach), and the very real need for the UK to do something so that we don’t end up having an approach imposed from outside.

People first


Kant: not actually a data scientist, but something to say on ethics

Both Paul Maltby and I were really interested in the idea of a code of conduct for people working in data – a bottom-up approach that could inculcate a data-for-good culture. This is possibly the best time to do this – there are still relatively few people working in data science, and if we can get these people now…

With that in mind, I thought it would be useful to remind myself of the data-for-good pledge that I put together, and (unsuccessfully) launched:

  • I will be Aware of the outcome and impact of my analysis
  • I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
  • I will be an Agent for change: use my analytical powers for positive good
  • I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine

OK, way too much alliteration. But (other than the somewhat West Coast Awesomeness) basically a good start. 

The key thing here is that, as a data scientist, I can’t pretend that it’s just data. What I do has consequences.

Ethics in process

But another way of thinking about it is to consider the actual processes of data science – here adapted loosely from the CRISP-DM methodology.  If we think of things this way, then we can consider ethical issues around each part of the process:

  • Data collection and processing
  • Analysis and algorithms
  • Using and communicating the outputs
  • Measuring the results

Data collection and processing

What are the ethical issues here? Well: ensuring that you collect data with permission, or at least in a way that is transparent; taking care when repurposing data (especially important for data exhaust); thinking carefully about biases that may already exist; and planning ahead for how the data will actually be used.

Analysis and algorithms

I’ll be honest – I don’t believe that data science algorithms are racist or sexist, for a couple of reasons. Firstly, those terms require free will (something that a random forest clearly doesn’t have). Secondly, they would require the algorithm to be able to distinguish between a set of numbers that encodes (say) gender and another that encodes (say) days of the week. Now, the input can contain data that is biased, and the target can be based on behaviours that are themselves racist – but that is a data issue, not an algorithm issue, and it rightly belongs in another section.

But the choice of algorithm is important. As is the approach you take to analysis. And (as you can see from the pledge) an awareness that this represents people and that the outcome can have impact… although that leads neatly on to…

Using and communicating the outputs

Once you have your model and your scores, how do you communicate their strengths and, more importantly, their weaknesses? How do you make sure the model is being used correctly and ethically? I would urge people to compare things against current processes rather than theoretical ideals. For example, the output may have a gender bias, but (assuming I can’t actually remove it) is it less sexist than the current system? If so, it’s a step forwards…
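The “compare against the current process, not a theoretical ideal” point is easy to make concrete. Here is a minimal sketch, with entirely invented data, using a simple selection-rate ratio as the (assumed) bias measure:

```python
def selection_rate_ratio(decisions, groups, group_a="F", group_b="M"):
    """Ratio of positive-decision rates between two groups (1.0 = parity)."""
    def rate(g):
        positives = sum(d for d, grp in zip(decisions, groups) if grp == g)
        total = sum(1 for grp in groups if grp == g)
        return positives / total
    return rate(group_a) / rate(group_b)

# Invented decisions (1 = positive outcome) for the same eight people
groups  = ["F", "M", "F", "M", "F", "M", "F", "M"]
current = [1, 1, 0, 1, 1, 1, 0, 1]   # the existing process
model   = [1, 1, 1, 1, 1, 1, 0, 1]   # the proposed model

print(selection_rate_ratio(current, groups))  # 0.5  – strongly skewed
print(selection_rate_ratio(model, groups))    # 0.75 – still biased, but less so
```

A ratio of 1.0 would be parity. In this toy example the model is still biased – but it is measurably less biased than the incumbent process, which is exactly the comparison I’m arguing for.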

I only touched on communication, but really this is a key, key aspect. Let’s assume that most people aren’t really aware of the nature of probability. How can we educate people about the risks and the assumptions in a probabilistic model? How can we make sure that the people who take decisions based on that model (and they probably won’t be data scientists) are aware of the implications? What if they’re building it into an automated system? Well, in that case we need to think about the ethics of:

Measuring the results

And the first question would be: is it ethical to use a model where you don’t effectively measure the results? With controls?

This is surely somewhere where we can learn from both medicine (controls and placebos) and econometricians (natural experiments). But both require us to think through the implications of action and inaction.
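A minimal sketch of what “measuring with controls” could look like in practice: randomly hold back a control group on the old process, then compare outcome rates – here with a standard two-proportion z-test. All numbers below are invented:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-statistic using the pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Invented trial: 540/1000 good outcomes with the model, 500/1000 in control
z = two_proportion_z(540, 1000, 500, 1000)
print(round(z, 2))  # 1.79 – suggestive, but shy of the usual 1.96 threshold
```

The point isn’t the particular test – it’s that without a held-back control you simply cannot say whether the model made things better or worse.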

Using Data for Evil IV: The Journey Home

If you’re interested in talking through ethics more (and perhaps from a different perspective) then all of this will be a useful background for the presentation that Fran Bennett and I will be giving at Strata in London in early June.  And to whet your appetite, here is the hell-cycle of evil data adoption from last year…

HellCycle


STS Forum: the strangest technology conference you’ve never heard of

At the beginning of October I was in Kyoto (yes, I can hear the tiny violins) attending the STS Forum on behalf of my employers.

What is the STS Forum?  Well this was the 12th meeting of a group focused on linking universities, technology companies, and governments to address global problems. The full name is Science and Technology in Society.

And it’s a really high level kind of thing. The opening was addressed by three prime ministers. There are more university vice-chancellors/provosts/rectors than you could imagine.  If you aren’t a professor then you’d better be a minister. No Nobel prize?  Just a matter of time.

So it’s senior. But is it about technology? Or at least the technology that I’m familiar with?

PM Abe addresses STS Forum

The usual players?

Well the first challenge is the sponsors.  A bunch of big companies. Huawei, Lockheed Martin, Saudi Aramco, Toyota, Hitachi, NTT, BAT, EDF.

All big, all important (I leave it up to you to decide if they’re good).  But are these really who you’d expect? Where are IBM?  Oracle? SAP? Even Siemens? Never mind Microsoft, Apple, or (dare I say it) LinkedIn, Facebook etc…

I daren’t even mention the world of big data: MongoDB, Cloudera or others.

Panels and topics

Then there are the panelists.  90% male. (In fact the median number of women on a panel is zero).  They are largely old.  None of them seem to be ‘real world’ experts – most are in Government and academia.

The topics are potentially interesting, but I’m nervous about the big data one. It’s not clear that there are any actual practitioners here (I will feed back later!)

Attendees and Ts

I have never been to a technology conference that is so suited. Even Gartner has a less uptight feel. Over 1000 people and not a single slogan. Wow. I feel quite daring wearing a pink shirt. And no tie.

What could they do?

I’m not saying it’s a bad conference. But I’m not sure it’s a technology conference, and I’m 100% certain it’s not a tech conference.

If they want it to be a tech conference then they need to take some serious action on diversity (especially gender and age)*.  They also need to think about inviting people who feel more comfortable in a T-shirt. The ones with slogans. And who know who xkcd is.

And this seems to be the biggest problem: the conference seems to highlight the gulf between the three components that they talk about (the triple helix) – universities, government, big business – and the markets where the theory hits the road. The innovators, the open source community, the disruptors.

On to the Big Data session

Well that was like a flashback to 2013. Lots of Vs, much confusion. Very doge.

It wasn’t clear what we were talking about big data for. Plenty of emphasis on HPC but not a single mention of Hadoop.

Some parts of the room seemed concerned about the possible impact of big data on society. Others wanted to explore if big data was relevant to science, and if so, how.  So, a lot of confusion, and not a lot of insight…

The Conference Conundrum

I’m here at the Teradata Partners conference in Dallas – one of my favourite conferences (full disclosure: I’m employed by Teradata) – and enjoying myself immensely.

Of course there are always a few problems with these big conferences:

  1. The air-con is set to arctic
  2. The coffee in breakfast is terrible
  3. I always want to go to sessions that clash

I’ve long since given up on the air-con and the coffee.  It seems these are pretty much immutable laws of conferences.  But what about the scheduling?  Surely there is a (and I hesitate to say it) big data approach to making the scheduling better?

I have no* evidence, but I suspect that current scheduling approaches essentially revolve around avoiding having the same person speak in two places at the same time, and making sure that your ‘big’ speakers are in the biggest halls.

But we’ve all been to presentations in aircraft hangars with three people in the audience, and we’ve all been to the broom-closet with a thousand people straining to hear the presenter.

And, above all, we’ve all been hanging around in the corridor trying to decide which of the three clashing sessions we should go to next.

The long walk

So maybe, just maybe, there is a better way.

How? Well this year’s Partners Conference gave us the ability to use an app or an online tool to choose which sessions we wanted to see.  So I did it.  Two minutes in BZZZZZZZ – you have a clash!  But I wanted to see both of those sessions!  Tough.  Choose one.

But.  What if they had asked me what I wanted to see before they had allocated sessions to time slots and rooms?

They would have ended up with a dataset that would allow someone to optimise the sessions for every attendee. That would really change the game: we’d be moving from an organiser-designed system to a user-designed system.

But wait! There’s more!

Are you tired of having to walk 500m between sessions? We could also optimise for distance walked. And we could make a better guess at which sessions need the aircraft hangar, and which would be just fine in the broom closet. And we could use collaborative filtering to suggest sessions that other people were interested in…

And guess what?  We have the technology to do this.
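To make that less hand-wavy, here’s a toy sketch (all session names and preferences invented) of the core idea: collect wish-lists first, then assign sessions to timeslots so that sessions frequently wanted by the same attendee avoid sharing a slot. A real implementation would also weigh room sizes and walking distance, and a proper solver (integer programming, say) would beat this greedy pass:

```python
from collections import Counter
from itertools import combinations

# Hypothetical wish-lists gathered before the schedule is fixed
wishlists = {
    "ann":  ["hadoop", "ethics", "viz"],
    "bob":  ["hadoop", "ethics"],
    "cara": ["viz", "ethics"],
}
n_slots = 2

# Weight each pair of sessions by how many attendees want both
clash_weight = Counter()
for wants in wishlists.values():
    for pair in combinations(sorted(wants), 2):
        clash_weight[pair] += 1

def clash_cost(session, slot_sessions):
    """Total clash weight if `session` joins the sessions already in a slot."""
    return sum(clash_weight[tuple(sorted((session, other)))]
               for other in slot_sessions)

# Greedy pass: place each session in the slot where it clashes least
schedule = {t: [] for t in range(n_slots)}
for session in sorted({s for w in wishlists.values() for s in w}):
    best = min(range(n_slots), key=lambda t: clash_cost(session, schedule[t]))
    schedule[best].append(session)

print(schedule)  # → {0: ['ethics'], 1: ['hadoop', 'viz']}
```

Here the one unavoidable clash (ann wants all three sessions but there are only two slots) is pushed onto the least-requested pair – which is precisely the sort of trade-off an organiser can’t make by hand for a thousand attendees.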

Next year, Partners?