How in love with AI are you?

AI is a problematic term at the moment. There is an awful lot of conflation between true existential/ubiquitous computing/end of the world AI on the one hand, and a nerd in a basement programming a decision tree in R on the other.

Which makes for some nice headlines. But isn’t really helpful to people who are trying to work out how to make the most (and best) of the new world of data opportunities.

So to help me, at least, I have devised something I call the LaundryCulture Continuum. It helps me to understand how comfortable you are with data and analytics.

(Because I realise that the hashtag #LaundryCulture might confuse, I’ve also coined the alternative #StrossMBanks Spectrum).

So here are the ends of the Continuum, and a great opportunity to promote two of my favourite writers.

In the beautiful, elegant and restrained corner, sits The Culture. Truly super-intelligent AI minds look on benevolently at us lesser mortals, in a post-scarcity economy. This is the corner of the AI zealots.

(Cover: The Player of Games, by Iain M. Banks)

In the squamous corner sits The Laundry, protecting us from eldritch horrors that can be summoned by any incompetent with a maths degree and a laptop. This is the home of the AI haters.

(Cover: The Atrocity Archives, by Charles Stross)

Where do people sit? Well it’s pretty clear that Elon Musk sits towards The Culture end of the Continuum. Naming his SpaceX landing barges Just Read The Instructions and Of Course I Still Love You is a pretty big clue.

The Guardian/Observer nexus is hovering nearer The Laundry, judging by most of its recent output.

Others are more difficult… But if I’m talking to you about AI, or even humble data mining, I would like to know where you stand…


In defence of algorithms

I was going to write a blog about how algorithms* can be fair. But if 2016 was the year in which politics went crazy and decided that foreigners were the source of all problems, it looks like 2017 has already decided that the real problem is that foreigners are being assisted by evil algorithms.

So let’s be clear. In the current climate, people who believe that data can make the world a better place need to stand up and say so. We can’t let misinformation and Luddism wreck the opportunities ahead of us.

And there is a world of misinformation!

For example, there is currently a huge amount of noise about algorithmic fairness (Nesta here, The Guardian here, et al). I’ve already blogged a number of times about this (1, 2, 3), but decided (given the noise) that it was time to gather my thoughts together.


(Most of) Robocop’s prime directives (Image from Robocop 1987)

tl;dr: Don’t believe the hype, and don’t rule out things that are fairer than what happens at the moment.

Three key concepts

So here are some concepts that I would suggest we bear in mind:

  1. The real world is mainly made up of non-algorithmic decisions, and we know that these are frequently racist, sexist, and generally unfair.
  2. Intersectionality is rife, and in data terms this means multicollinearity. All over the place.
  3. No one has a particularly good definition of what fairness might look like. Not even lawyers (and although there are a number of laws about disproportionate impact, even then it gets tricky).
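The multicollinearity point is worth making concrete. Here is a toy sketch (pure Python, all data invented, including the hypothetical `postcode_band` feature): when an apparently neutral feature is strongly correlated with a protected attribute, dropping the protected attribute from a model does not remove the signal.

```python
# Toy illustration: a "neutral" proxy feature can carry much of the
# same information as a protected attribute. All data is invented.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# 1 = member of a protected group, 0 = not; postcode_band is a
# hypothetical feature that happens to track group membership.
protected     = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
postcode_band = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

r = pearson(protected, postcode_band)
print(f"correlation between proxy and protected attribute: {r:.2f}")  # → 0.60
```

With correlations like this, a model trained without the protected attribute can still reconstruct it, which is why “we removed the sensitive field” is not, on its own, a fairness guarantee.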

On the other side, what are the campaigners for algorithmic fairness demanding? And what are their claims?

Claim 1: if you feed an algorithm racist data it will become racist.

At the simplest level, yes. But (despite at least one claim to the contrary) it takes more than a single racist image for this to happen. In fact I would suggest that, generally speaking, machine learning is not good at spotting weak cases: this is the challenge of the ‘black swan’. If you present a single racist example then ML will almost certainly ignore it. In fact, if racism is in the minority in your examples, then it will probably be downplayed further by the algorithm: the algorithm will be less racist than reality.

If there are more racist cases than non-racist cases then either you have made a terrible data selection decision (possible), or the real problem is with society, not with the algorithm. Focus on fixing society first.
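A minimal sketch of the “single bad example” point (invented data, and a deliberately crude model): a learner that predicts the most common label it saw for a given profile simply outvotes one biased training example.

```python
from collections import Counter

# Toy model: predict the majority label seen for each feature value.
# A lone biased training example gets outvoted. All data is invented.

def fit_majority(examples):
    """examples: list of (feature, label) -> {feature: majority label}."""
    by_feature = {}
    for feature, label in examples:
        by_feature.setdefault(feature, []).append(label)
    return {f: Counter(labels).most_common(1)[0][0]
            for f, labels in by_feature.items()}

# 99 fair decisions and 1 biased one for the same applicant profile.
training = [("profile_a", "approve")] * 99 + [("profile_a", "reject")]
model = fit_majority(training)
print(model["profile_a"])  # → approve: the lone biased label is outvoted
```

Real learners are more subtle than a majority vote, but the intuition carries: a pattern has to be reasonably well represented in the training data before most algorithms will act on it.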

Claim 2: algorithmic unfairness is worse/more prevalent than human unfairness

Algorithmic unfairness is a first world problem. It’s even smaller scale than that. It’s primarily a minority concern even in the first world. Yes, there are examples in the courts in the USA, and in policing. But if you think that the problems of algorithms are the most challenging ones that face the poor and BAME in the judicial system then you haven’t been paying attention.

Claim 3: to solve the problem people should disclose the algorithm that is used

Um, this gets technical. Firstly, what do you mean by the algorithm? I can easily show you the code used to build a model; it’s probably taken from CRAN or GitHub anyway. But the actual model? Well, if I’ve used a sophisticated technique, a neural network or a random forest etc, it’s probably not going to be sensibly interpretable.

So what do you mean? Share the data? For people data you are going to run headlong into data protection issues. For other data you are going to hit the fact that it will probably be a trade secret.

So why not just do what we do with human decisions? We examine the actual effect. At this point learned judges (and juries, but bear in mind Bayes) can determine if the outcome was illegal.
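“Examining the actual effect” already has a well-known operationalisation in US employment law: the EEOC’s “four-fifths rule”, under which a selection rate for any group below 80% of the highest group’s rate is treated as prima facie evidence of adverse impact. A minimal sketch (the group names and numbers are invented):

```python
# Minimal sketch of the US EEOC "four-fifths rule": compare selection
# rates between groups; a ratio below 0.8 flags possible adverse
# impact. Group names and numbers are invented.

def selection_rates(outcomes):
    """outcomes: {group: (selected, total)} -> {group: selection rate}."""
    return {g: sel / tot for g, (sel, tot) in outcomes.items()}

def adverse_impact_ratio(outcomes):
    """Ratio of the lowest group selection rate to the highest."""
    rates = selection_rates(outcomes)
    return min(rates.values()) / max(rates.values())

outcomes = {"group_a": (48, 100), "group_b": (30, 100)}
ratio = adverse_impact_ratio(outcomes)
print(f"impact ratio: {ratio:.2f} -> {'flag' if ratio < 0.8 else 'ok'}")
```

The attraction of outcome tests like this is that they apply identically whether the decisions came from a judge, an HR department, or a model, which is exactly the point being made here.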

And in terms of creation? Can we stop bad algorithms from being created? Probably not. But we can do what we do with humans: make sure that the people teaching them are qualified and understand how to make sensible decisions. That’s where people like the Royal Statistical Society can come in…

Final thoughts

People will say “you’re ignoring real world examples of racism/sexism in algorithms”. Yes, I am. Plenty of people are commenting on those, and yes, they need fixing. But very very rarely do people compare the algorithm with the pre-existing approach. That is terrible practice. Don’t give human bias a free pass.

And most of those examples have been because of (frankly) beginners’ mistakes. Or misuse. Neither of which is especially unique to the world of ML.

So let’s stand up for algorithms, but at the same time remember that we need to do our best to make them fair when we deploy them, so that they can go on beating humans.

 

* no, I really can’t be bothered to get into an argument about what is, and what is not an algorithm. Let’s just accept this as shorthand for anything like predictive analytics, stats, AI etc…


LonData III: the MoshiMonsters paradox

Are you familiar with MoshiMonsters?  It is an online pet game/junior social network developed by MindCandy.  If not, then you probably don’t have kids aged between 6 and 12…

At LonData III we were lucky enough to have a presentation from Toby Moore, CTO of MoshiMonsters, who took us through the world of data that the game generates, and how MindCandy got to where they are.

Toby took us through their aims: moving from no data, through big data, to right data, predictive data, and eventually strategic data.

At the beginning of their story they had lots of data, but no ETL, no reporting, and no analysis.  They realised they had to move forwards, and put in place a technology stack of:

  • MS SQL Server as an ETL platform
  • Hadoop for data storage
  • MS SQL for analysis/reporting

This still didn’t resolve their problems, and so they are moving to QlikView to give users direct access to their data.

So this is a Big Data play, right? Lots of data? Hadoop?  It must be!

Is this Big Data, or just big data?

There are lots of things that are great about this story – and let me be clear that none of my comments in any way take away from the amazing success of MoshiMonsters…

I like:

  • The fact that data is so important to them
  • The willingness to give end users direct access to data

But I think it fails to be Big Data because

  • They don’t try to experiment using the data
  • They don’t do predictive analysis (although they use six-sigma statistical approaches to identify issues)
  • There is very limited analysis

Data kills Creativity? Really?

In fact the most worrying issue was a C.P. Snow-like divide: Creativity on the one side, Data on the other.

This came up several times in the presentation – they would never burden their creative staff with data. They don’t think that segmenting their customers, or analysing their behaviour is the way to go. They don’t test out alternative strategies on the website.

Partly this is because they are extremely sensitive to the nature of their customers (young children) who aren’t the same as the people paying the bills (adults). They say they try to avoid pressuring their customers out of the freemium and into the paying segments*.

I’ve got to say, I really don’t believe this divide to be true. Yes, an anally retentive approach to analysis might kill creativity, but anyone that anal probably doesn’t understand the limits of their analysis. Analysis leaves many, many grey areas.  And on the other hand creativity cannot work in a vacuum.

I came away somewhat disturbed by their approach, whilst still being in admiration of their success and drive. I don’t believe that Big Data approaches can be separated from creativity!

The conclusion:

  1. Is Hadoop necessary for Big Data? Possibly, but it isn’t sufficient.
  2. Is volume necessary for Big Data? Not on an absolute scale, although it helps.
  3. Is attitude necessary for Big Data? Yes, absolutely!
  4. Is it creative? Hell Yes!


“Data Scientists are hardcore coders…”: discuss

Yesterday I almost had a heart attack when an esteemed colleague (who shall remain nameless) came out with the statement: “Data Scientists are people who are hardcore Hadoop coders”…  Had I misheard him?  Or was I so out of step with the world that I had totally misunderstood data science?

This is all the more important for two reasons:

  1. My job title (full disclosure, I made it up) is Director Data Science
  2. I’m busy trying to recruit data scientists for my team.

Well, to be honest, I could probably live with being wrong about 1.  LinkedIn will never find out, so that’s OK.

But whilst I’ve been engaged in recruitment I have had to decide what it is I’m looking for in candidates.  So here it is… in descending order a data scientist will be:

Curious

The first, and most important trait is curiosity.  Insane curiosity.  In many walks of life evolution selects against the kind of person who decides to find out what happens “if I push that button”.  In Data Science it selects for it.

In my own analytical experience nothing has come close to the feeling when you discover something new (even if other people have already been there).  In the 5th form, working out how to prove what the square root of -9 was.  At university… well, too much to drink there, but at work discovering that we could push complex analytics onto an MPP system.  That complex things (cars) failed with the same distribution as simple things (their components).  That social networks could be used to predict some things, and that they couldn’t be used to predict others.

And that last one is important too: the joy of disproving something!

Analytical

I expect any data scientist to have a background in, and an understanding of, complex analytics.  I don’t mean reporting.  I’ve nothing against reporting, it’s important and someone has to do it.  But not a data scientist.  I’m after people who can build a model that predicts something, or who can cluster data, who know the tricks of creating a good dataset, and when a model result is too good.  And importantly people who can tell me if the result is statistically relevant or just one of those things.

When it comes to Big Data “those things” will come up more and more often as our data gets bigger.
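Why “those things” multiply with scale can be shown with a toy simulation (invented setup): test enough signal-free features and some will clear a significance-style bar by pure chance. Here a “test” is just flipping 20 fair coins and asking for 15 or more heads, which is rare (roughly 2%) for any single feature but near-certain across a thousand.

```python
import random

# Toy multiple-testing sketch: each "feature" is pure noise (20 fair
# coin flips), and we "discover" it if it shows 15+ heads. Any single
# feature rarely passes; across many features, several will.

random.seed(42)  # fixed seed so the sketch is repeatable

def spurious_hits(n_features, flips=20, threshold=15):
    """Count noise-only features that clear the threshold by chance."""
    hits = 0
    for _ in range(n_features):
        heads = sum(random.random() < 0.5 for _ in range(flips))
        if heads >= threshold:
            hits += 1
    return hits

print(spurious_hits(1))      # usually 0: a single test rarely "passes"
print(spurious_hits(1000))   # many tests -> a crop of false "discoveries"
```

This is the statistician’s multiple-comparisons problem wearing Big Data clothes, and it is exactly why I want people who can tell a real result from “one of those things”.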

Communicative

I have no use for people who are unable to communicate with non-specialists.  It’s hard enough discussing these topics within the community – we need people who can explain them to those outside the community: the users of the services we will provide.

Of course communication is two way, and the data scientist needs to listen too.

Novel

The data scientist needs to provide additional value above and beyond what’s happening already.  You can provide a fantastic new way of predicting churn that will only cost $1 million and uses data sources that are already in use?  And it doesn’t outperform the existing methods.  Hmmm.

Novelty, either in ways of thinking, or in terms of the data and approaches to be used, is vital.

Business focused

Obviously by business I mean “focused on the overall objectives of the organisation you’re working with or in”, but that’s a bit long-winded.  Again, data scientists need to get their heads out of the algorithms and into the business problems.  If you can tell me the correct parameterisation for a novel take on SVM, but can’t tell me the top three issues for a business (and how big data can help fix them), then you aren’t a data scientist, you’re an academic.

A coder

Last, and least.  Yes, it would be nice if you can code for Hadoop.  Or in C#.  Or R.  But this is a passing phase brought on by a lack of good interfaces; it’s not a permanent state of affairs.  So, if you have this skill, good for you.  But if you only have this skill, it’s time to get out into the world.