The National Information Infrastructure – holes in the road(map)

In my work as part of the Open Data User Group I have come across a secret* document: the National Information Infrastructure.

The idea, which has come out of the Shakespeare Review, is to identify the Government datasets that need to be protected and (potentially) made open in the public interest.


Ignoring the inconvenient fact that two of the most significant datasets won’t fall within its remit (the Royal Mail’s Postal Address File, which was conveniently sold off with the Royal Mail; and Ordnance Survey’s MasterMap, which is never, ever going to be open), the idea seems sound. Data is increasingly important, and Government has a role to play in supporting and protecting it.

But there are some big holes in the road.

Firstly, much of the important data isn’t, and never will be, public open data. Much of the data we rely on is held by commercial organisations. This data is vital to the economic well-being of the country. In fact, much of it is necessary just to make things work!

Just imagine what would happen to the country if there were a significant loss of data at one of the major telecommunications companies. And bear in mind that telephony today is very much a data business. Or what if one of our banks had its data maliciously wiped? Most money is data, not pound coins. It would make the financial meltdown look trivial (don’t believe me? Then think: would you be willing to buy or sell things if you weren’t confident that the money you were using actually existed, or would continue to exist in ten minutes’ time?).

And it doesn’t take quite such a catastrophic event to cause problems. Fat finger incidents are already capable of causing significant problems. 

The second issue is the interlinking of physical and data assets. Yes, data is important. But, until the singularity, data sits somewhere. On servers. And it’s transferred via networks. And these are vulnerable to attacks. The attacks can be “friendly” (yes NSA, I’m giving you the benefit of the doubt) or malicious (exploiting Heartbleed, for example), but they can happen. And the cloud makes life more complex. Just where exactly is that data you were talking about? Whose jurisdiction does your national asset reside in?

And the third problem is legislative. What will the impact of legislation be on your national asset? Some will be beneficial (commitments to open data), others will be troublesome, or even damaging.  Best to think these through and highlight them upfront.

So, if we see the NII in its present form as an end point then it is a disappointing missed opportunity.  But, if we see it as the starting point for a recognition of the vital role of data in society, then it has promise… 

*Not really – the existence of the NII was made public last year.

Prescriptive analytics? My Twitter spat…

So at the Gartner BI Summit I got myself into a Twitter spat with the conference chair over the term “Prescriptive Analytics”.

Gartner have decided that the world of advanced analytics is split into four elements: Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. Two of those categories, Descriptive and Predictive, will be very familiar – there are clear technical and conceptual differences between these two types (perhaps most succinctly identified in the old neural network terms unsupervised and supervised).


Diagnostic and Prescriptive Analytics are a bit different though, and I’m struggling to see what they add that is significantly different from Descriptive or Predictive.

Gartner have an image that tries to take this further:

Image (c) Gartner

So here are my issues.

1) Descriptive vs Diagnostic

I’m not convinced that there is a real difference here. I don’t buy the idea that Descriptive analysis wouldn’t answer the question “Why did it happen?” or that Diagnostic analysis wouldn’t ask the question “What happened?”. In fact (of course) you also typically use techniques from predictive analysis to help you with both of these – Cox Proportional Hazards Modelling would be one approach that springs to mind. Technically it’s a two-target regression approach (a duration and an event), but it’s used to understand as much as to predict.
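
To make that concrete, here is a minimal sketch of the kind of thing I mean, using the Python lifelines library: a Cox model fitted not to score customers but to understand which factors are associated with leaving. The file and column names are purely hypothetical.

```python
# A minimal sketch: using Cox Proportional Hazards (lifelines) to ask
# "why did customers leave?" rather than just "who will leave next?".
# The file and column names here are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("churn.csv")  # one row per customer (hypothetical extract)

# duration_col = how long we observed them, event_col = 1 if they churned
cph = CoxPHFitter()
cph.fit(df[["tenure", "churned", "price_rise", "support_calls"]],
        duration_col="tenure", event_col="churned")

# The hazard ratios are the "diagnostic" output: which factors are
# associated with leaving, and how strongly.
cph.print_summary()
```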

2) Predictive vs Prescriptive

The apparent difference here is twofold: firstly Predictive doesn’t lead directly to action, but Prescriptive does.  This simply doesn’t hold water.  Predictive analysis can lead directly to action.  Many predictive algorithms are embedded in systems to do exactly that. And if you contend that even that involves some human intervention, then the same is absolutely true of Prescriptive analytics – someone has to create the logic that integrates the analysis into the business process.

3) Prescriptive involves techniques that are different from Predictive

The suggestion is that meta-techniques such as constraint-based optimisation and ensemble methods are qualitatively different and stand alone as a special category. I don’t agree. They don’t stand alone. You can do predictive analytics without descriptive, and descriptive without predictive. You can’t do ‘prescriptive’ analytics without predictive. It doesn’t stand on its own. I’d also argue that these are techniques that have always been used with predictive models: sometimes internally within the algorithms, sometimes by hand, and sometimes by software.
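
For what it’s worth, here is a minimal sketch of how that layering usually looks in practice (toy numbers, scipy’s linprog as the constraint-based optimiser): the “prescriptive” step is just an optimisation run over the outputs of a predictive model, and it has nothing to chew on until that model exists.

```python
# A minimal sketch of 'prescriptive' sitting on top of 'predictive':
# predicted response rates (they would come from a predictive model;
# hard-coded here) feed a constraint-based budget optimisation.
import numpy as np
from scipy.optimize import linprog

channels = ["email", "direct_mail", "call_centre"]
predicted_response = np.array([0.012, 0.031, 0.054])  # per contact, from the model
cost_per_contact = np.array([0.05, 0.80, 3.50])
budget = 50_000.0
max_contacts = np.array([400_000, 80_000, 10_000])    # operational limits

# Maximise expected responses  <=>  minimise their negative,
# subject to: total cost <= budget, 0 <= contacts <= channel capacity.
result = linprog(
    c=-predicted_response,
    A_ub=[cost_per_contact],
    b_ub=[budget],
    bounds=list(zip(np.zeros(3), max_contacts)),
)

for name, n in zip(channels, result.x):
    print(f"{name}: contact {n:,.0f} people")
print(f"Expected responses: {-result.fun:,.0f}")
```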

4) Only prescriptive analytics leads directly to decision and action

Without human intervention. This also just isn’t true. I dare anyone to build prescriptive analytics without involving people to build the business logic, validate the activities, or just oversee the process. Yet this is the claim. Data mining is a fundamentally human, business focused activity. Think otherwise and you’re in for a big fall.  And, yet again, productionising predictive models has a long tradition – this is nothing new.

But the final defence of Prescriptive Analytics is that it is a term that has been adopted by users.  Unfortunately this doesn’t seem to be the case. Gartner use it, but they need to sell new ideas. SAS and IBM also use it, but they are desperate to differentiate themselves from R. A few other organisations do use it, but when pressed will admit they use it because “Gartner use it and we wanted to jump on their bandwagon”. But I could be wrong, so I looked at Google.

Predictive analytics: 904,000 results

Prescriptive analytics: 36,000 results

Take out SAS/IBM: 17,500 results

The Conference Conundrum

I’m here at the Teradata Partners conference in Dallas (one of my favourite conferences; full disclosure: I’m employed by Teradata), and enjoying myself immensely.

Of course there are always a few problems with these big conferences:

  1. The air-con is set to arctic
  2. The coffee at breakfast is terrible
  3. I always want to go to sessions that clash

I’ve long since given up on the air-con and the coffee.  It seems these are pretty much immutable laws of conferences.  But what about the scheduling?  Surely there is a (and I hesitate to say it) big data approach to making the scheduling better?

I have no* evidence, but I suspect that current scheduling approaches essentially revolve around avoiding having the same person speak in two places at the same time, and making sure that your ‘big’ speakers are in the biggest halls.

But we’ve all been to presentations in aircraft hangars with three people in the audience, and we’ve all been to the broom-closet with a thousand people straining to hear the presenter.

And, above all, we’ve all been hanging around in the corridor trying to decide which of the three clashing sessions we should go to next.

The long walk

So maybe, just maybe, there is a better way.

How? Well, this year’s Partners Conference gave us the ability to use an app or an online tool to choose which sessions we wanted to see. So I did it. Two minutes in: BZZZZZZZ – you have a clash! But I wanted to see both of those sessions! Tough. Choose one.

But.  What if they had asked me what I wanted to see before they had allocated sessions to time slots and rooms?

They would have ended up with a dataset that would allow someone to optimise the sessions for every attendee. This would really change the game: we’d be moving from an organiser-designed system to a user-designed system.

But wait! There’s more!

Are you tired of having to walk 500m between sessions? We could also optimise for distances walked. And we could make a better guess at which sessions need the aircraft hangar, and which would be just fine in the broom closet. And we could do collaborative filtering and suggest sessions that other people were interested in…
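
To show this isn’t magic, here is a minimal sketch of the core idea with toy data: treat the attendees’ wish-lists as the dataset, count pairwise clashes, and let a simple greedy heuristic assign sessions to slots, while the raw interest counts tell you who gets the aircraft hangar. A real scheduler would add rooms, speakers, capacities and probably a proper solver, but the input is exactly the data the app already collects.

```python
# A minimal sketch of preference-driven scheduling (toy data, greedy heuristic).
from collections import defaultdict
from itertools import combinations

# Hypothetical wish-lists: attendee -> sessions they want to see.
wishlists = {
    "ann":   {"hadoop", "sql", "ethics"},
    "bob":   {"hadoop", "nosql", "ethics"},
    "carol": {"sql", "nosql", "viz"},
    "dave":  {"hadoop", "viz"},
}
n_slots = 3

# Clash cost: how many attendees want to see BOTH sessions.
clash = defaultdict(int)
for wants in wishlists.values():
    for a, b in combinations(sorted(wants), 2):
        clash[(a, b)] += 1

sessions = sorted({s for wants in wishlists.values() for s in wants})
slots = [[] for _ in range(n_slots)]

# Greedy: place each session in the slot where it clashes least with what's there.
for s in sessions:
    def cost(slot):
        return sum(clash[tuple(sorted((s, t)))] for t in slot)
    min(slots, key=cost).append(s)

for i, slot in enumerate(slots, 1):
    print(f"Slot {i}: {slot}")

# Room sizing: raw interest counts per session.
interest = defaultdict(int)
for wants in wishlists.values():
    for s in wants:
        interest[s] += 1
print("Interest counts:", dict(interest))
```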

And guess what?  We have the technology to do this.

Next year, Partners?

The worst big data article ever?

There are many, many bad articles on big data. It’s almost impossible to move without tripping over another pundit trying to rubbish the topic. Partly this is just the inevitable sound of people trying to make a name for themselves by being contrarian. It’s far easier to stand out when you’re fighting against the tide. Even if you end up getting very wet…

Fortunately St Nate of Silver has actually analysed the data and it’s clear that pundits are fundamentally useless.

But occasionally you come across something so egregiously crap that you have to comment.

This week’s crap-du-jour comes courtesy of Tom Leberecht and Fortune.

In it he decides to lump together almost every woe in the world and pile them at the feet of big data. So here are my rebuttals:

Big Data = Big Brother?

This hasn’t been a good couple of weeks for the field of data mining. The NSA scandal has caused sales of Nineteen Eighty-Four to rise, though not quite as fast as journalists’ use of the phrase “Big Brother”.

In his article Leberecht oddly passes this over, and instead mentions the evil of passing on data to private companies in a sideways swipe at quantified self.

Perhaps he forgets to mention it because it appears that people can also see the positive side? There are real issues around privacy, anonymity and data security, but pretending that the age of big data is the cause is rather odd.

Big data is not social

Well firstly, hasn’t he heard of Social Network Analysis? And secondly, he seems to be advancing the argument that the status of X (relationships) is threatened by allowing Y (data analysis). Sound familiar? Yes, that’s the argument against gay marriage. Somehow, if my gay friends get married, my marriage will be threatened.

Well for the record analysing data doesn’t mean that humanity will be diminished. Welcome to the world of science! Were we more human in the 17th century? Or the 13th? Because there was a hell of a lot less analysis then than in the 1950s…

Finally on this topic, what about the growing Data Philanthropy movement? Every week we see new initiatives where people want to apply big data to address real social issues in ways that couldn’t happen before.

Big data creates smaller worlds

Apparently big data filters our perception, and limits our openness to new ideas and cultures. Really? To go back to gay marriage – can we imagine this being a possibility 20 years ago? The ability to interact and identify unusual events and groups means that there is far more diversity than there ever was. A goal of marketers is to open people’s eyes to new things (and to get them to buy them). Leberecht seems to think that the collaborative filtering that Amazon famously use would only ever return you to the same book.
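
For what it’s worth, a minimal item-based collaborative filtering sketch (entirely made-up purchase data) shows why that can’t be right: the recommendations are, by construction, books the reader does not already own, surfaced via other people’s overlapping but different tastes.

```python
# A minimal sketch of item-based collaborative filtering on toy purchase data.
from collections import defaultdict

purchases = {
    "alice": {"1984", "Brave New World", "Dune"},
    "bob":   {"1984", "Dune", "Neuromancer"},
    "carol": {"Brave New World", "Middlemarch"},
    "dave":  {"Dune", "Neuromancer", "Snow Crash"},
}

# Co-occurrence counts: how often two books are bought by the same person.
co = defaultdict(int)
for basket in purchases.values():
    for a in basket:
        for b in basket:
            if a != b:
                co[(a, b)] += 1

def recommend(user, top_n=3):
    owned = purchases[user]
    scores = defaultdict(int)
    for a in owned:
        for (x, b), n in co.items():
            if x == a and b not in owned:
                scores[b] += n
    # Only books the user does NOT already own can appear here.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice"))  # e.g. ['Neuromancer', 'Middlemarch', 'Snow Crash']
```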

Big data – and opening yourself up to ideas that aren’t part of your narrow ‘intuition’ – will surely make your world bigger and more diverse…

Big data is smarter, not wiser

The article makes it clear that wisdom has a twofold definition in Leberecht’s world: it is based on intuition (guesswork) and it is slow. Oh, and it also rejects feedback. Well firstly, big data isn’t always fast. Believe me, Hadoop isn’t a solution suited to the rigours of rapid operationalisation. There are other things for that. But as a definition of wisdom this seems to be a disaster.

Not only should you take the risk of making the wrong decision (intuition is guesswork), but you should do it slowly, and without paying attention to any feedback you get.  Truly this is fail slow writ large.

Big data is too obvious

I think this is the heart of Leberecht’s argument. He didn’t think of big data.

His example (that the financial collapse was caused by measurement) is patently wrong. The problem with mortgage-backed securities was exactly the opposite: people failed to measure the risk and relied on the intuition that the boom would continue indefinitely.

Big data doesn’t give

And then finally we hit the sleeping policeman of C. P. Snow’s two cultures. You are either an artist or a scientist. A businessman or a data scientist. Creativity belongs to the former, sterile analysis to the latter.

I’ll let you guess what I think of that!

 

A big data glossary

Big data is serious. Really serious. But at Big Data Week it became clear that we need to be able to laugh at ourselves… So here is my initial attempt at an alternative glossary for big data. Thanks to the many contributors (intentional and not), and apologies to anyone who disagrees. Enjoy.


Agile analytics: Fast, iterative, experimental analysis often performed by people who aren’t.

Apache: Shorthand for the Apache Software Foundation, an open source collection of 100 projects. Not funny, but important.

Big brother: 1) Notional leader of Oceania in dystopian future novel Nineteen Eighty-Four. 2) Regular data related leader on front page of the Daily Mail in dystopian present 2013.

Big data: Any data problem that I’m working on. See also small data.

Cassandra: Open source, distributed data management system developed at Facebook. See Evil Empire.

Clickstream data: Data logged by client or web server as users navigate a website or software system. As in “In case of death please delete my clickstream”.

Data <insert profession>: Way of making a profession sound up to 22.7 times more sexy. Examples: data journalist, data scientist, data philanthropist. Not yet tried: data accountant, data politician.

Data journalist: Journalist who lets facts get in the way of a story.

Data model: the output of data modelling. Note that this is not a model who wants to sound more sexy.

Data modelling: Evil way of understanding how a process is reflected in data. Frowned upon.

Data miner: a data scientist who isn’t interested in a pay rise. Note that this is not a miner who wants to sound more sexy.

Data mining: The queen of sciences. Alternatively a business process that discovers patterns, or makes predictions from data sets using machine learning approaches.

Data scientist: A magical unicorn. With a good salary.

Dilbert test: Simple test for separating data scientists from IT people. If you take two equally funny cartoon strips, one Dilbert, the other XKCD, then a data scientist will prefer XKCD. An IT professional will prefer Dilbert. If anyone prefers Garfield they are in completely the wrong profession.

Elephants: Obligatory visual or textual reference by anyone involved in Hadoop. These are not the only animal in the zoo.

ETL: Extract, transform and load (ETL) – software and process used to move data.  Not to be confused with FTL, ELT, BLT, or more importantly DLT.

Evil Empire: large software/hardware/service vendor of choice. For example, Apple, Google, Facebook, IBM, Microsoft etc… The tag is independent of their evil doing capabilities.

Facebook: Privacy wrecking future Evil Empire.  Source of interesting graph data.

Fail fast: Low-risk, experimental approach to big data innovation where failure is seen as an opportunity to learn. Also excellent excuse when analysis goes terribly, terribly wrong.

Fail slow: Really, really bad idea.

Fork: Split in an open source project when two or more people stop talking to each other.

Google: Privacy wrecking future Evil Empire.  Source of interesting search data.

Google BigQuery: See Evil Empire.

Hadoop: Open source software controlled by the Apache Software Foundation that enables the distributed processing of large data sets across clusters of commodity servers. Often heard from senior managers: “Get me one of those Hadoops that everyone is talking about.”

Hadoop Hive: SQL interface to Hadoop MapReduce. Because SQL is evil. Except when it isn’t.

In-Memory Database: Trying to remember what you just did rather than writing it down. See also fail fast.

Internet of things: Stuff talking to other stuff. See singularity.

Java: Dominant programming language developed in the 90s at Sun Microsystems. It was later used to form Hadoop and other big data technologies.  More importantly a type of coffee.

Key value pairs: a way of avoiding evil data modelling.

MapReduce: Programming paradigm that enables scalability across multiple servers in Hadoop, essentially making it easier to process information on a vast scale. Sounds easy, doesn’t it?

MongoDB: NoSQL open source document-oriented database system developed and supported by 10gen. Its name derives from the word ‘humongous’. Seriously, where do they get these names?

NoSQL: No SQL. It’s evil. Do not use it, use something else instead.

NOSQL: Frequently (and understandably) confused with NoSQL. It actually means “not only SQL”, or that you forgot to take off caps lock.

No! SQL!: Phrase frequently heard as a badly written query takes down your database/drops your table.  See Bobby Tables.

Open data: Data, often from government, made freely available to the public. See also .pdf

Open source: an expensive way to not pay for something.

Python: 1) Dynamic programming language, first developed in the late 1980s. 2) Dynamic comedy team, first developed in the late 1960s.

Pig: High-level platform for creating MapReduce programmes used with Hadoop, also from Apache.

Real time: Illusory requirement for projects.

Relational database management systems (RDBMS): Big ol’ databases.

Singularity: When the stuff talking to other stuff finds our conversation boring.

Small data: (pejorative) Data that you are working on.

Social Media: Facebook, Twitter etc… Sources of trivial data in large volumes. Will save the world somehow.

Social Network: 1) A collection of people and how they interact. May also be on social media. 2) What data scientists really hope they will be part of.

SQL: Standard, structured query language specifically designed for managing data held in Big ol’ databases.

Twitter: Privacy wrecking future #EvilEmpire.  Source of interesting data in less than 140 characters.

Unstructured data: 1) The crap that just got handed to you with a deadline of tomorrow 2) The brilliant data source you just analysed way ahead of expectations. Often includes video and audio. You weren’t watching YouTube for fun.

V: Letter mysteriously associated with big data definitions. Usually comes in threes.  No one really knows why.

Variety: The thing you fail to get in big data definitions. “I didn’t see much variety in that definition!”

Velociraptors: Scary dinosaurs that really should be part of the definition of big data, terribly underused.

Velocity: The speed at which a speaker at a conference moves from the topic to praising their own company.

Volume: The degree to which a business analyst’s hair exceeds expectations.

XKCD: Important reference manual for big data.

ZooKeeper: Another open source Apache project, which provides a centralised infrastructure and services that enable synchronisation across a cluster. Given all the animals in big data you can see why this was needed.

7 Things from Strata Santa Clara

This is the fifth time I’ve made the pilgrimage to Strata – I was lucky enough to be at the very first event, here in Santa Clara, and that’s made me think about how things have changed over the last two years.

Two years ago big data was new and shiny. About 1600 people turned up at the south end of Silicon Valley to enthuse about what we could do with data.

Now we’re talking about legacy Hadoop systems, and data science (big data is so 2011), but what else has changed?

1)   Hadoop grew up

The talk this year wasn’t about this new shiny thing called Hadoop, it was about which distro was the best (with a new Intel distro being announced), and which company had the biggest number of coders working on the original open source code. Seriously, there were almost fistfights over the bragging rights.

As a mark of the new seriousness, the sports jacket to t-shirt ratio was up. But don’t worry, the PC to Mac ratio was still tending to zero (the battle was between Air and Pro).

2)   NoSQL is very much secondary to Hadoop

The show is extremely analytically oriented (in a database sense… but more of that later). The NoSQL vendors are there, but attract a fraction of the attention.

3)   SQL is back

Yes, really.  It turns out it is useful for something after all.

4)   Everyone is looking for ways to make it actually work (and work fast)

Hadoop isn’t perfect, and there are a wide range of companies trying to make it work better. Oddly, there is a fetishisation of speed. Odd because this is something that the data warehouse companies went through in the days when it was all about the TPC benchmarks. No, people, scaling doesn’t just mean big and fast. It means usable by lots of people, and a whole raft of other things.

Greenplum were trying to convince us that theirs was fastest. Intel told us that SAP HANA was faster, and more innovative. Really. And the list went on.

Rather worryingly there seems to be a group out there who want to try and reinvent the dead end of TPC* but for Hadoop.

5)   There’s a lot of talk about Bayes, but not many data miners 

I ran a session on Data Mining. Only a handful of people out of about 200 in the room would admit to being data miners. This is terrifying as data scientists are trying to do analytical tasks. Coding a random forest does not make you a data miner!

6)   Data philanthropy is (still) hot

We had a keynote from codeforamerica, and lots of talk about ethics, black hats etc… I ran a Birds of a Feather table on Doing Good With Data. A group of us were talking about the top secret PROJECT EPHASUS. 

7)   The Gartner Hype Cycle is in the trough of disillusionment

At least as far as big data is concerned. But the show sold out, and the excitement was still there. Data Science has arrived.


* For a fun review about the decision that Teradata made to withdraw from TPC benchmarks, try this by my esteemed colleague Martin Willcox.

Religion, politics and data mining

Three topics which are unlikely to get you invited to the best dinner parties. So how (if at all) are they related?

Let’s start with data mining.

Data mining

Every year at about this time, Karl Rexer sends out a survey to data miners asking them about the tools they use, and the problems they address. 

Over the years since 2007 there have been a few notable changes. Firstly, the rise in the number of people who completed the survey, from 300 to 1300 (in 2011). But also a change in the toolsets people are using.

If you go back to 2007 the most popular (and liked) tools were SAS and SPSS Clementine. In 2008, R appeared for the first time, and by 2010 R was the most popular tool, while Clementine (by then IBM SPSS Modeler) had almost disappeared.

Some aspects of this are hardly surprising. Clementine was always an expensive tool to purchase, and it has a limited set of algorithms (yes, I know you can add nodes, but…). R is free. There are algorithms galore, even if they can be tricky to find. Increasingly it’s the tool that students, especially those researching their PhDs, will use at university. It’s the data mining tool of the big data movement.

Religion

Back in the 1500s religion, specifically the tensions within religion, was deeply intertwined with the disruptive technology of the day: movable type.

The religious dispute was the Reformation of the Catholic Church, and one of the core aspects of this was the relationship between individuals and God. Did we need a priesthood or saints to intercede on our behalf? Couldn’t we read the word of God ourselves, and learn the lessons directly? The Bible was the big data of the 1500s, and the printing press was the technology that allowed people to translate the Bible and provide access to the truth.

Sound a bit familiar?

Politics

Yet here we are in the midst of the big data reformation and it appears that we are creating a priesthood, not removing it. What makes Clementine a great tool is that it is intensely and beautifully usable. It has an interface that is elegant and intuitive. This allowed data miners to focus on the meaning of the analysis, not the mechanics of the analysis.

R, for all its strengths, is not elegant or beautiful. When you use R it is like stepping back into the world of SAS procs in the 90s. Too much effort is spent getting the software to work, and not enough is spent on the bigger picture.

“Aha!” you say, “But I’m an expert – I can use R at lightning speeds, I don’t need an elegant GUI!”   And so the politics is born.

Do we believe that data mining and data science are the province of a priesthood? The select few, who will interpret the holy truths on behalf of the unwashed masses (let’s call them managers)? Or do we believe that data science should be democratised and made available to as many people as possible? Can they handle the truth?

Can we navigate the political waters, get over our religious differences, and deliver on the promise of analysis in the world of big data?

Doing good with analytics: the pledge

As a data professional I pledge:

  • I will be Aware of the outcome and impact of my analysis
  • I won’t be Arrogant – and I will avoid hubris: I won’t assume I should, just because I can
  • I will be an Agent for change: use my analytical powers for positive good
  • I will be Awesome: I will reach out to those who need me, and take their cause further than they could imagine

Sign the pledge online at http://www.causes.com/actions/1694321

Doing Good With Data: the case for the ethical Data Scientist

This post is designed to accompany my presentation to the Teradata Partners User Group, but hopefully some of the links will prove useful even if you couldn’t get to the presentation itself.

Needless to say, the most important part – the Pledge – is right at the bottom.  Feel free to skip to it if you like!

Law

The law (as it relates to data – well, actually pretty much all law) is complex, highly jurisdictional, and, most importantly of all, at least 10 years behind reality. Neither Judy nor I are lawyers, but hopefully these links provide some general background:

One of the first legal agreements was the OECD’s position on data transfers between countries. It dates from the early 70s, when flares were hot and digital watches were the dream of a young Douglas Adams: http://itlaw.wikia.com/wiki/Guidelines_on_the_Protection_of_Privacy_and_Transborder_Flows_of_Personal_Data

Much later the EU released the snappily titled Directive 95/46/EC – better known as the Data Protection Directive. The joy is that each country can implement it differently, resulting in confusion. There are currently proposals out for consultation on updating it too: http://en.wikipedia.org/wiki/Data_Protection_Directive

Of course the EU and the US occasionally come to different decisions, and for a brief discussion of some of the major differences between them you can try this: http://www.privireal.org/content/dp/usa.php

Don’t do evil

Google’s famous take on the Hippocratic oath can be simplified as ‘don’t do evil’. As we say in the presentation, this is necessary, but scarcely enough. It also has the disadvantage of being passive. In its expanded form it’s available here: http://investor.google.com/corporate/code-of-conduct.html

Doing Good With Data

Now for the fun bits!  For information on the UN Global Pulse initiative: http://www.unglobalpulse.org/ 

Data 4 Development – the Orange France Telecom initiative in Ivory Coast: http://www.d4d.orange.com/home

If you have a bent for European socialised medicine, then the NHS hack days are here: http://nhshackday.com/

DataKind

And our favourite – with a big thanks to Jake Porway and Craig Barowsky – is DataKind: http://datakind.org/ You can also follow @DataKind

To find out more about the UK charities mentioned check out Place2Be https://www.place2be.org.uk/ and KeyFund http://www.keyfund.org.uk/

Please take the time to register with DataKind, and keep your eyes open for other opportunities. We hope that DataKind will be open for business in the UK soon too!

The Pledge

Please go and look at the Pledge, and if you think you can, then sign up. If you have one of our printed cards, take it, sign it and put it on your cube wall (or your refrigerator – wherever it will remind you of your commitment). But sign the online one too. And once you’ve done that, let the world know! Get them to sign up. If you want a Word copy of the printable one, just drop me a line.

http://www.causes.com/actions/1694321

The day the (medical) data broke free…


Today is a good day for data – at least in healthcare. At last the data from the NHS is being set free.

For my international colleagues and friends it’s worth pointing out some things about the NHS. The UK National Health Service* is actually a very large and complex organisation that cares for health needs. The main arms are the GP services and the hospital services. GPs are self-employed and effectively contracted by the NHS. Hospitals are islands unto themselves within regional groupings. Above all lie funding and commissioning structures. Sounds complex? From a data perspective it certainly is. The data generated by the system is often hand-written, frequently held in isolated systems, and is barely there for joined-up services, never mind research.

On the positive side, it’s free** at point of use, and generally does a good job.

There have been signs for a while that the NHS has been starting to think about data.

  • Dr Carl Reynolds (@drcjar) at http://openhealthcare.org.uk/ has been leading the way on doing good things with health data, including running NHS hack days. If you want to get involved, the next one is in Liverpool on 22-23 September
  • The UK set up the BioBank project, aimed at providing a longitudinal study of people who aren’t necessarily ill. If you think about it, it’s fairly obvious that most people who go to their doctor are ill – BioBank aims to understand the factors in their lives that were the same as, or different from, other people’s before and after they became ill.
  • Dr Ben Goldacre (@bengoldacre) has been leading a crusade to get clinical research data (even from trials that are abandoned or not published) into the public domain so that it can be used to compare outcomes.

But now the Government has gone much, much further and has created the Clinical Practice Research Datalink. In addition to having a funky website, this aims to bring together data from the NHS so that this vast set of data can be used to improve health outcomes.

Of course there is a very, very big cloud hanging over this: how do you anonymise patient data so that it is still useful? Simply removing names and addresses won’t deal with the issue, as Ross Anderson of Cambridge University identifies (the Guardian again – don’t say they aren’t fair and balanced!).
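
To see why stripping direct identifiers isn’t enough, here is a minimal sketch of a k-anonymity style check: count how many records are unique on a handful of seemingly harmless quasi-identifiers. The file and column names are hypothetical.

```python
# A minimal sketch of a k-anonymity check: even with names and addresses
# removed, combinations of "harmless" fields can pinpoint individuals.
# The file and column names are hypothetical.
import pandas as pd

records = pd.read_csv("pseudonymised_episodes.csv")  # hypothetical extract
quasi_identifiers = ["year_of_birth", "sex", "postcode_district", "admission_month"]

group_sizes = records.groupby(quasi_identifiers).size()
unique_rows = int((group_sizes == 1).sum())
k = int(group_sizes.min())

print(f"{unique_rows} combinations identify exactly one patient")
print(f"worst-case k-anonymity: k = {k}")
```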

But I think, on balance, I disagree with Ross. I’ve come to the conclusion that we can’t rely on privacy, and that the exchange of a guarantee on privacy for free medical care is probably reasonable in itself. Especially as the guarantee isn’t really worth much these days.  When you add to this the potential benefits to research, then the answer is even more obvious. How many people would be happy to give up their privacy if they knew that one day they, or their kids, might be relying on the treatment that resulted?

*Actually there are three: NHS England and Wales, NHS Scotland, and NHS Northern Ireland, but let’s assume they are the same thing for this argument. NHS E&W is by far the largest.

**Nearly.