Last night I was lucky enough to attend a dinner hosted by TechUK and the Royal Statistical Society to discuss the ethics of big data. As I’m really not a fan of the term I’ll pretend it was about the ethics of data science.
Needless to say, there was a lot of discussion around privacy, the DPA and European Data Directives (although the general feeling was against a legalistic approach), and the very real need for the UK to do something so that we don’t end up having an approach imposed from outside.
Both Paul Maltby and I were really interested in the idea of a code of conduct for people working in data – a bottom-up approach that could inculcate a data-for-good culture. This is possibly the best time to do this – there are still relatively few people working in data science, and if we can get these people now…
With that in mind, I thought it would be useful to remind myself of the data-for-good pledge that I put together, and (unsuccessfully) launched:
- I will be Aware of the outcome and impact of my analysis
- I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
- I will be an Agent for change: use my analytical powers for positive good
- I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine
OK, way too much alliteration. But (other than the somewhat West Coast Awesomeness) basically a good start.
The key thing here is that, as a data scientist, I can’t pretend that it’s just data. What I do has consequences.
Ethics in process
But another way of thinking about it is to consider the actual processes of data science – here adapted loosely from the CRISP-DM methodology. If we think of things this way, then we can consider ethical issues around each part of the process:
- Data collection and processing
- Analysis and algorithms
- Using and communicating the outputs
- Measuring the results
Data collection and processing
What are the ethical issues here? Ensuring that you collect data with permission, or at least in a way that is transparent; taking care when repurposing data (especially important for data exhaust); thinking carefully about biases that may already exist in the data; and planning ahead for the end use.
Analysis and algorithms
I’ll be honest – I don’t believe that data science algorithms are racist or sexist, for a couple of reasons. Firstly, those words imply free will (something that a random forest clearly doesn’t have); secondly, they would require the algorithm to be able to distinguish between a set of numbers that encodes (say) gender and another that encodes (say) days of the week. Now, the input can contain data that is biased, and the target can be based on behaviours that are themselves racist, but that is a data issue, not an algorithm issue, and rightly belongs in another section.
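The "it only sees numbers" point can be made concrete with a toy sketch (mine, not from the post; the classifier and data are invented for illustration). A tiny nearest-centroid model trained on biased historical labels happily reproduces the bias – and relabelling the protected column as "day of week" would change nothing, because the algorithm never knew what the numbers meant:

```python
# Toy sketch: an algorithm only sees numbers, so any bias it reproduces
# must have arrived via the training data, not via the algorithm itself.

def centroid_score(train, labels, point):
    """Nearest-centroid 'model': score = distance to the positive centroid
    minus distance to the negative centroid (lower = more likely positive)."""
    pos = [x for x, y in zip(train, labels) if y == 1]
    neg = [x for x, y in zip(train, labels) if y == 0]

    def centroid(points):
        return [sum(col) / len(col) for col in zip(*points)]

    def dist(c, p):
        return sum((a - b) ** 2 for a, b in zip(c, p)) ** 0.5

    return dist(centroid(pos), point) - dist(centroid(neg), point)

# Column 0 is a protected attribute (0/1). The historical labels are biased:
# group 1 was always rejected, regardless of the second feature.
train = [[0, 5.0], [0, 6.0], [1, 5.0], [1, 6.0]]
biased_labels = [1, 1, 0, 0]

# Two identical candidates, differing only in column 0:
a = centroid_score(train, biased_labels, [0, 5.5])
b = centroid_score(train, biased_labels, [1, 5.5])
print(a < b)  # True: the model favours group 0 -- the bias came in via the labels
```

Nothing in `centroid_score` mentions gender; swap the column's meaning and the output is identical. That is why the fix belongs in the data-collection section.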
But the choice of algorithm is important. As is the approach you take to analysis. And (as you can see from the pledge) an awareness that this represents people and that the outcome can have impact… although that leads neatly on to…
Using and communicating the outputs
Once you have your model and your scores, how do you communicate its strengths and, more importantly, its weaknesses? How do you make sure that it is being used correctly and ethically? I would urge people to compare against current processes rather than theoretical ideals. For example, the output may have a gender bias, but (assuming I can’t actually remove it) is it less sexist than the current system? If so, it’s a step forwards…
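That comparison can be made mechanical. A minimal sketch (all numbers invented for illustration): compute the gap in selection rates between groups for the incumbent process and for the candidate model, and ask which is smaller:

```python
# Hypothetical comparison: judge the model against the current process,
# not against a perfect zero-bias ideal. Decisions and groups are made up.

def selection_rate(decisions, groups, g):
    """Fraction of cases in group g that received a positive decision."""
    picked = [d for d, grp in zip(decisions, groups) if grp == g]
    return sum(picked) / len(picked)

def gap(decisions, groups):
    """Absolute difference in selection rates between the two groups."""
    return abs(selection_rate(decisions, groups, 'a')
               - selection_rate(decisions, groups, 'b'))

groups  = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
current = [1, 1, 1, 0, 1, 0, 0, 0]   # the existing (human) process
model   = [1, 1, 0, 0, 1, 1, 0, 0]   # the candidate model's decisions

print(gap(current, groups))  # 0.5 -- the incumbent process
print(gap(model, groups))    # 0.0 -- still imperfect elsewhere, but less biased
```

A single metric like this is crude – it says nothing about accuracy or individual fairness – but it turns "is it less sexist than what we do today?" into a question you can actually answer.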
I have only touched on communication, but this really is a key aspect. Let’s assume that most people aren’t really aware of the nature of probability. How can we educate people about the risks and the assumptions in a probabilistic model? How can we make sure that the people who take decisions based on that model (and they probably won’t be data scientists) are aware of the implications? What if they’re building it into an automated system? Well, in that case we need to think about the ethics of:
Measuring the results
And the first question would be: is it ethical to use a model where you don’t effectively measure the results? With controls?
This is surely somewhere we can learn from both medicine (controls and placebos) and econometricians (natural experiments). But both require us to think through the implications of action and inaction.
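In practice the medical analogy often means a holdout group. A minimal sketch of the setup (my assumed design, not from the post): randomly hold back a fraction of cases that bypass the model and go through the existing process, so the two can later be compared:

```python
# Sketch of a holdout control group for a deployed model: some cases are
# randomly held back and handled by the existing process instead, so the
# model's real-world effect can be measured. Numbers are illustrative.
import random

random.seed(0)  # fixed seed so the split is reproducible

def assign(case_ids, control_fraction=0.1):
    """Randomly hold back a control group that bypasses the model."""
    treated, control = [], []
    for cid in case_ids:
        (control if random.random() < control_fraction else treated).append(cid)
    return treated, control

treated, control = assign(range(1000))
print(len(control))  # roughly 100 of 1,000 cases held back as controls
```

The ethical tension is exactly the medical one: the control group is denied the (presumed) benefit of the model, but without it you cannot know whether the model helps or harms – inaction has implications too.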
Using Data for Evil IV: The Journey Home
If you’re interested in talking through ethics more (and perhaps from a different perspective) then all of this will be a useful background for the presentation that Fran Bennett and I will be giving at Strata in London in early June. And to whet your appetite, here is the hell-cycle of evil data adoption from last year…