The ethics of data science (some initial thoughts)

Last night I was lucky enough to attend a dinner hosted by TechUK and the Royal Statistical Society to discuss the ethics of big data. As I’m really not a fan of the term I’ll pretend it was about the ethics of data science.

Needless to say there was a lot of discussion around privacy, the DPA and European Data Directives (although the general feeling was against a legalistic approach), and the very real need for the UK to do something so that we don’t end up having an approach imposed from outside.

People first


Kant: not actually a data scientist, but something to say on ethics

Both Paul Maltby and I were really interested in the idea of a code of conduct for people working in data – a bottom-up approach that could inculcate a data-for-good culture. This is possibly the best time to do this – there are still relatively few people working in data science, and if we can get these people now…

With that in mind, I thought it would be useful to remind myself of the data-for-good pledge that I put together, and (unsuccessfully) launched:

  • I will be Aware of the outcome and impact of my analysis
  • I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
  • I will be an Agent for change: use my analytical powers for positive good
  • I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine

OK, way too much alliteration. But (other than the somewhat West Coast Awesomeness) basically a good start. 

The key thing here is that, as a data scientist, I can’t pretend that it’s just data. What I do has consequences.

Ethics in process

But another way of thinking about it is to consider the actual processes of data science – here adapted loosely from the CRISP-DM methodology.  If we think of things this way, then we can consider ethical issues around each part of the process:

  • Data collection and processing
  • Analysis and algorithms
  • Using and communicating the outputs
  • Measuring the results

Data collection and processing

What are the ethical issues here?  Well ensuring that you collect with permission, or in a way that is transparent, repurposing data (especially important for data exhaust), thinking carefully about biases that may exist, and planning and thinking about end use.

Analysis and algorithms

I’ll be honest – I don’t believe that data science algorithms are racist or sexist. For a couple of reasons: firstly those require free-will (something that a random forest clearly doesn’t have), secondly that would require the algorithm to be able to distinguish between a set of numbers that encoded for (say) gender and another that coded for (say) days of the week. Now the input can contain data that is biased, and the target can be based on behaviours that are themselves racist, but that is a data issue, not an algorithm issue, and rightly belongs in another section.

But the choice of algorithm is important. As is the approach you take to analysis. And (as you can see from the pledge) an awareness that this represents people and that the outcome can have impact… although that leads neatly on to…

Using and communicating the outputs

Once you have your model and your scores, how do you communicate its strengths, and more importantly its weaknesses. How do you make sure that it is being used correctly and ethically? I would urge people to compare things against current processes rather than theoretical ideals.  For example, the output may have a gender bias, but (assuming I can’t actually remove it) is it less sexist than the current system? If so, it’s a step forwards…

I only touched on communication, but really this is a key, key aspect. Let’s assume that most people aren’t really aware of the nature of probability. How can we educate people about the risks and the assumptions in a probabilistic model? How can we make sure that the people who take decisions based on that model (and they probably won’t be data scientists) are aware of the implications?  What if they’re building it into an automated system? Well in that case we need to think about the ethics of:

Measuring the results

And the first question would be, is it ethical to use a model where you don’t effectively measure the results? With controls?

This is surely somewhere where we can learn from both medicine (controls and placebos) and econometrists (natural experiments). But both require us to think through the implications of action and inaction.

Using Data for Evil IV: The Journey Home

If you’re interested in talking through ethics more (and perhaps from a different perspective) then all of this will be a useful background for the presentation that Fran Bennett and I will be giving at Strata in London in early June.  And to whet your appetite, here is the hell-cycle of evil data adoption from last year…






6 comments on “The ethics of data science (some initial thoughts)

  1. re: racist/sexist algos v data “problems” – what if your algo is a neural net or random forest trained on existing data trying to predict the future. Won’t the training data mean that the predictions will be sexist/racist/etc and basically just encode the existing bias? There have been (and probably still are) algos that classify based on race or sex as a feature, or use a culturally applicable proxy for it. Don’t these have problems too?

    • Yes, but I see that as a problem of data collection/problem specification rather than being inherent in the algorithm. In fact, I would probably go further and say that here the racism/sexism is in the designer of the analysis – they are the only one with free will and the ability to act… blaming the algorithm seems to be avoiding responsibility.

      • I’m not sure how to separate the data scrubbing/choosing/collection from the algorithm. I wonder if our problem here is one of definitions then.

        I don’t think blaming the algo is avoiding responsibility. I think examining the biases in the algo+data is a way of making sure that we aren’t carrying the legacy of earlier bad choices forward with us. I think it is taking on, correctly, *more* responsibility. I see to many people say “it can’t be wrong/bad/sexist/racist/etc, b/c that is what the numbers say and everyone knows numbers are true.”

  2. Probably! That’s one of the reasons I separated out the data collection and preparation stage. But intent is also important – the responsibility lies with people, not primarily with algorithms and data. Because without people the algorithms and data won’t do anything (until AIs come along, and then we may have other issues).

    • I agree that the responsibility lies with people. I just think that some of the algos encode earlier responsibility and should be examined and revised and that we should be careful in how we encode our current biases into our models.

      We can have a further discussion later about the material culture of models and how they have lives beyond us, but I’d want to get @cluttercup involved in that.

  3. Pingback: In defence of algorithms | duncan3ross

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s