It’s not just the Hadron Collider that’s Large: super-colliders and super-papers

During most of my career in data science, I’ve been used to dealing with analyses where there is an objectively correct answer. This is the bread and butter of data mining: you create a model and test it against reality. Your model is either good or bad (or sufficiently good!) and you can choose to use it or not.

But since joining THE I’ve been faced with another, and in some ways very different, problem – building our new World University Rankings – a challenge where there isn’t an absolute right answer.

So what can you, as a data scientist, do to make sure that the answer you provide is as accurate as possible? Well, it turns out (not surprisingly) that the answer is being as certain as possible about the quality of, and the biases in, the input data.

Papers and citations

One of the key elements of our ranking is the ability of a University to generate valuable new knowledge. There are several ways we evaluate that, but one of the most important is the new papers produced by its researchers. Our source for these is Elsevier’s Scopus database – a great place to get information on academic papers.

We are interested in a few things: the number of papers generated by a University, the number of papers with international collaboration, and the average number of citations that papers from a University get.

Citations are key. They are an indication that the work has merit. Imagine that in my seminal paper “French philosophy in the late post-war period” I chose to cite Anindya Bhattacharyya’s “Sets, Categories and Topoi: approaches to ontology in Badiou’s later work“. I am telling the world that he has done a good piece of research. If we add up all the citations he has received, we get an idea of the value of the work.

Unfortunately not all citations are equal. There are some areas of research where authors cite each other far more heavily than in others. To avoid this biasing our data in favour of Universities with large medical departments, and against those that specialise in French philosophy, we use a field-weighted measure. Essentially we calculate an average number of citations for every* field of academic research, and then determine how a particular paper scores compared to the average for its field.

These values are then rolled up to the University level so we can see how the research performed at one University compares to that of another. We do this by allocating the weighted count to the University associated with each author of a paper.
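To make that concrete, here is a minimal sketch of the two steps in Python. The data, the field names and the choice to average each University’s weighted scores are all my own illustration of the approach described above – not THE’s production pipeline, and not the Scopus schema.

```python
from collections import defaultdict

# Toy papers: (field, citation count, Universities of the authors).
# Purely illustrative data.
papers = [
    ("medicine",          40, ["Uni A", "Uni B"]),
    ("medicine",          20, ["Uni A"]),
    ("french_philosophy",  2, ["Uni C"]),
    ("french_philosophy",  4, ["Uni C", "Uni A"]),
]

# Step 1: the average number of citations for every field.
totals = defaultdict(lambda: [0, 0])          # field -> [citations, papers]
for field, citations, _ in papers:
    totals[field][0] += citations
    totals[field][1] += 1
field_average = {f: c / n for f, (c, n) in totals.items()}

# Step 2: score each paper against its field's average, then allocate
# that weighted score to the University of each author and roll up.
uni_scores = defaultdict(list)
for field, citations, unis in papers:
    weighted = citations / field_average[field]  # 1.0 == field average
    for uni in unis:
        uni_scores[uni].append(weighted)

for uni, scores in sorted(uni_scores.items()):
    print(f"{uni}: {sum(scores) / len(scores):.2f}")
```

Note how the weighting does its job: the philosophy paper with 4 citations scores exactly as well against its field (4/3 ≈ 1.33) as the medical paper with 40 citations does against its own (40/30 ≈ 1.33).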

The Many Authors problem

But what about papers with multiple authors? Had Anindya been joined by Prof Jon Agar for the paper, then both learned gentlemen’s institutions would have received credit. If Dr Meg Tait also joins, then a third institution gains credit, and so on.

Whilst the number of authors remains small, that works quite well. I can quite believe that Prof Agar, Dr Tait and Mr Bhattacharyya all participated in the work on Badiou.

At this point we must depart from the safe world of philosophy for the dangerous world of particle physics**. Here we have mega-experiments where the academic output is also mega. For perfectly sound reasons there are papers with thousands of authors. In fact “Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC” has 2932 authors.  

Did they all contribute to the experiment? Possibly. In fact, probably. But if we include the data in this form in our rankings it produces some very strange results. A University is boosted hugely if even a single one of its researchers participated in the project.

I feel a bit text-bound, so here is a graph of the distribution of papers with more than 100 authors.

[Figure: Frequency of papers with more than 100 authors]

Please note that the vast majority of the 11 million papers in the dataset aren’t shown!  In fact there are approximately 480 papers with more than 2000 authors.

Not all authors will have had the same impact on the research. It used to be assumed that there was a certain ordering to the way authors were named, which would allow the count to be reduced to only the significant authors. Unfortunately there is no consensus across academia about how this ordering should work, and no obvious way of automating the process.

Solutions

How to deal with this issue? Well, for this year we’re taking a slightly crude, but effective, approach. We’re simply not counting the papers with more than 1000 authors. 1000 is a somewhat arbitrary cut-off point, but a visual inspection of the distribution suggests that it is a reasonable separation point between the regular distribution on the left and the abnormal clusters on the right.
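In code the cut-off is nothing more exotic than a filter. A sketch, assuming the papers sit in a pandas DataFrame with a precomputed author_count column (my naming, not anything from Scopus):

```python
import pandas as pd

# Hypothetical papers with a precomputed author count.
papers = pd.DataFrame({
    "title": ["Sets, Categories and Topoi", "ATLAS Higgs observation"],
    "author_count": [3, 2932],
})

MAX_AUTHORS = 1000  # the (admittedly arbitrary) cut-off discussed above
ranked = papers[papers["author_count"] <= MAX_AUTHORS]
print(ranked)  # the Higgs paper is excluded from the ranking
```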

In the longer term there are two viable approaches, one technical and one structural. The technical approach is to use fractional counting (2932 authors? Well, you each get 0.034% of the credit). The structural approach is more of a long-term solution: to persuade the academic community to adopt metadata that adequately explains the relationship of individuals to the paper that they are ‘authoring’. Unfortunately I’m not holding my breath on that one.
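Here is a minimal sketch of fractional counting, again using an illustrative data structure of my own rather than the Scopus schema. Each of a paper’s n authors carries 1/n of the credit back to their institution, so each of the Higgs paper’s 2932 authors would carry 1/2932 ≈ 0.034%.

```python
from collections import defaultdict

def fractional_credit(author_unis):
    """Split one paper's credit equally across its authors.

    author_unis: one entry per author, naming that author's University
    (an illustrative structure, not the Scopus schema).
    """
    share = 1 / len(author_unis)   # 2932 authors -> ~0.00034 each
    credit = defaultdict(float)
    for uni in author_unis:
        credit[uni] += share
    return dict(credit)

# A three-author paper split across two institutions:
print(fractional_credit(["Uni A", "Uni A", "Uni B"]))
# {'Uni A': 0.666..., 'Uni B': 0.333...}
```

One pleasant property of this scheme is that every paper contributes exactly 1.0 credit in total, however many authors it has, so a mega-collaboration no longer multiplies credit across thousands of institutions.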

*well, most

**plot spoiler: the world didn’t end


10 comments on “It’s not just the Hadron Collider that’s Large: super-colliders and super-papers”

  1. Duncan, your two outlier peaks are almost exclusively the CMS and ATLAS collaborations at CERN. This has been going on for a long time, such that a similar question—“what do these authorships mean, and how do we know this person’s productivity?”—routinely comes up in, e.g., academic promotion and tenure reviews.

    One straightforward solution is to realize that the ATLAS and CMS collaborations are, individually, larger than the *readership* of many specialized journals and subfields. These collaborations communicate internally via documents which (in quality of review, readership, and author-naming conventions) closely resemble academic papers. These are called “ATLAS notes” and “CMS notes”; some are public, some are collaboration-internal, but all of them get listed on CVs and are treated like publications for the purpose of thinking about individual productivity. You can see these documents at http://cds.cern.ch/collection/CERN%20Internal%20Notes

    CERN has a staff of librarians (http://library.web.cern.ch/) who may be able to supply you with appropriate metadata on this collection of notes.

    • Hi Benjamin, thanks, that is a very useful insight. Part of the problem I have at THE is that we aren’t trying to evaluate individuals, or even subjects, but overall Universities. When doing this, these papers cause a lot of noise, the effect of which is to imply that HEP is more valid knowledge than any other field of research (I’m not totally against that idea, but we need to try and be fair). I accept that what we plan to do this year is suboptimal, but would maintain that it is the least bad of a range of suboptimal solutions!

  2. You’re correct, of course, that the LHC’s standard journal articles are “more noise than signal” given their author lists. But I suggest that a non-noisy solution is to treat “CERN notes” as the equivalent of a notable, high-traffic, high-impact specialized journal, whose absence from Scopus is sort of a historical accident rather than a lack of impact. (I’m sure many other specialities would like to plead that their actual-research-impact is countable in another special non-Scopus-indexed way, but that’s the plea in our case!)

  3. In physics we call the publications with many coauthors the “brothers’ cemetery” or “brothers’ grave” – named after the Common/Military Cemetery term. 🙂

    Yes, many universities (and researchers) all over the world rely on such publications when trying to improve their ratings. For example, in South Africa we’ve got a bunch of people who are “working for CERN/LHC” but in practice they are just programmers and database coders who barely understand the physics itself.

      • My background and training are in mathematics (differential geometry and topology), and I’m just an outsider when it comes to experimental physics. I certainly appreciate the valuable work done by the giant research teams at CERN/LHC, and I certainly want the individual scientists to be rewarded for their contributions. But when it comes to authorship of papers, which in most cases are probably written by no more than 5-10 people, the crediting of authorship to hundreds or even thousands of people seems batty. I like the way you’ve thought about potential fixes, but none of them really seems to get around the basic problem, which is that the large majority of listed “authors” did not have any role in the paper itself (as opposed to the results it’s reporting). Ideally, a paper could be credited to a “Research Team 107” or something like that, but I’m not sure how enthusiastic the physicists are going to be about that idea!

  4. I fail to see how you are unable to come up with relatively easy and significantly less arbitrary solutions. If you look carefully at your data you will see that within the long list of authors, not all the institutions have the same number of people. So with large lists you could easily split the credit in proportion to the number of authors in each contributing institute. This would also make sense as the financial contributions to these mega experiments are also split in the same spirit.

    As for the location of the cutoff: it makes no sense at all. Your histogram shows a statistically significant peak right below 1000 – so how do we know that the papers in that peak agree with the rest of your distribution? Ironically, this sort of “not-blind” setting of a cutoff would be completely shunned in the field that you are cutting off, and would likely be rejected from publication as sloppy or prone to bias.

    I sincerely hope that this is more carefully revisited next year, as it is likely to harm one of the biggest research enterprises that we have attempted as a race. 😦

    • So you would recommend disadvantaging small Universities that put in a disproportionate effort compared to their size? We will work towards a better, and universal, solution for next year. Of course in most subjects a paper with thousands of authors would also be rejected for publication, simply on that basis.
