THE Ranking Cycle

Unusually for me, this blog concerns my (paid) work! A large part of that is putting together the various rankings for Times Higher Education. Now, from the inside these all make perfect sense, and we’re well aware of when they happen. But from the outside, maybe less so.

This blog is really designed for people interested either in submitting data or in using the outcomes of our rankings. And possibly a few rankings geeks too. But mainly it is designed as a handy reference point. As dates get closer I will update this blog to reflect the progress of the cycle.

As a brief reminder, THE produces two levels of ranking: Rankings (note the capitals) and editorial analysis. Our main focus as a data team is on the former, although we do support our friends in the magazine on editorial analysis too.

The Rankings are far more structured, and also more likely to be tied down in terms of dates. Publication dates are usually designed to coincide with one of our Summit series.

Within the Rankings there are two streams: the World University Ranking, and our Teaching Rankings – currently the Japan University Ranking and the US College Ranking.

The approximate dates for our next rankings are given below:

World University Ranking Series

Name               | Date           | Data collection cycle
Reputation         | 15/06/2017     | 2017
Europe             | 22/06/2017     | 2016
SE Asia            | 05/07/2017     | 2016
Latam              | 20/07/2017     | 2017
WUR                | 05/09/2017     | 2017
Subjects           | September 2017 | 2017
BRICS and Emerging | December 2018  | 2017
Asia               | February 2018  | 2017
Young              | Spring 2018    | 2017
Reputation         | Summer 2018    | 2018
SE Asia            | Summer 2018    | 2017


Teaching Series

Name              | Date        | Data collection cycle
US College        | 28/09/2017  | 2017
Japan University  | Spring 2018 | 2017
European Teaching | Summer 2018 | 2017


Editorial Analysis

This is (inevitably) more variable, as it depends on what the editorial team think is insightful and interesting to our readership, but it is likely to include the following:

Name               | Date           | Data collection cycle | Data source
Employer           | November 2017  | 2017                  | Survey
Liberal Arts       | Winter 2017/18 | 2017                  | USA
International      | Winter 2017/18 | 2017                  | WUR
UK Student         | Spring 2018    | 2017                  | Survey
Small Universities | Spring 2018    | 2017                  | WUR

It’s not just the Hadron Collider that’s Large: super-colliders and super-papers

During most of my career in data science, I’ve been used to dealing with analysis where there is an objectively correct answer. This is bread and butter to data mining: you create a model and test it against reality. Your model is either good or bad (or sufficiently good!) and you can choose to use it or not.

But since joining THE I’ve been faced with another, and in some ways very different, problem – building our new World University Rankings – a challenge where there isn’t an absolute right answer.

So what can you, as a data scientist, do to make sure that the answer you provide is as accurate as possible? Well, it turns out (not surprisingly) that the answer is being as certain as possible about the quality of, and biases in, the input data.

Papers and citations

One of the key elements of our ranking is the ability of a University to generate valuable new knowledge. There are several ways we evaluate that, but one of the most important is the new papers generated by its researchers. Our source for these is Elsevier’s Scopus database – a great place to get information on academic papers.

We are interested in a few things: the number of papers generated by a University, the number of papers with international collaboration, and the average number of citations that papers from a University get.

Citations are key. They are an indication that the work has merit. Imagine that in my seminal paper “French philosophy in the late post-war period” I chose to cite Anindya Bhattacharyya’s “Sets, Categories and Topoi: approaches to ontology in Badiou’s later work”. I am telling the world that he has done a good piece of research. If we add up all the citations he has received we get an idea of the value of the work.

Unfortunately not all citations are equal. There are some areas of research where authors cite each other more heavily than in others. To avoid this biasing our data in favour of Universities with large medical departments, and against those that specialise in French philosophy, we use a field-weighted measure. Essentially we calculate an average number of citations for every* field of academic research, and then determine how a particular paper scores compared to that average.

These values are then rolled up to the University level so we can see how the research performed at one University compares to that of another. We do this by allocating the weighted count to the University associated with each author of a paper.
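To make that concrete, here is a minimal sketch of the calculation in pandas. The data, column names and numbers are entirely invented for illustration – this is not our production pipeline, which works on Scopus data at a rather larger scale.

```python
import pandas as pd

# Toy data: one row per (paper, authoring university) pair. Entirely made up.
rows = pd.DataFrame({
    "paper_id":   [1, 1, 2, 3, 4],
    "university": ["A", "B", "A", "C", "B"],
    "field":      ["particle physics", "particle physics",
                   "philosophy", "particle physics", "philosophy"],
    "citations":  [40, 40, 3, 12, 5],
})

# Average citations per field, counting each paper once.
field_avg = (rows.drop_duplicates("paper_id")
                 .groupby("field")["citations"]
                 .mean())

# Field-weighted score: a paper's citations relative to its field's average.
rows["fw_score"] = rows["citations"] / rows["field"].map(field_avg)

# Roll up to University level: every university with an author on a paper
# receives that paper's weighted score, averaged across its papers.
university_scores = rows.groupby("university")["fw_score"].mean()
print(university_scores)
```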

The Many Authors problem

But what about papers with multiple authors? Had Anindya been joined by Prof Jon Agar for the paper, then both learned gentlemen’s institutions would have received credit. If Dr Meg Tait had also joined, a third institution would have gained credit, and so on.

Whilst the number of authors remains small, that works quite well. I can quite believe that Prof Agar, Dr Tait and Mr Bhattacharyya all participated in the work on Badiou.

At this point we must depart from the safe world of philosophy for the dangerous world of particle physics**. Here we have mega-experiments where the academic output is also mega. For perfectly sound reasons there are papers with thousands of authors. In fact “Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC” has 2932 authors.  

Did they all contribute to the experiment? Possibly. In fact, probably. But if we include the data in this form in our rankings it produces some very strange results. Universities are boosted hugely if even a single researcher participated in the project.

I feel a bit text-bound, so here is a graph of the distribution of papers with more than 100 authors.

[Figure: frequency of papers with more than 100 authors]

Please note that the vast majority of the 11 million papers in the dataset aren’t shown!  In fact there are approximately 480 papers with more than 2000 authors.
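For the curious, here is a rough sketch of how you might draw that sort of picture yourself, assuming a DataFrame with a (hypothetical) author_count column, one row per paper. It is illustrative rather than the actual code behind the chart above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical input: one row per paper with a pre-computed author count.
papers = pd.DataFrame({"author_count": [3, 5, 120, 250, 400, 2900, 3100]})

# Only look at the long tail of very collaborative papers.
tail = papers[papers["author_count"] > 100]

plt.hist(tail["author_count"], bins=50)
plt.xlabel("Number of authors")
plt.ylabel("Number of papers")
plt.title("Frequency of papers with more than 100 authors")
plt.show()
```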

Not all authors will have had the same impact on the research. It used to be assumed that there was a certain ordering to the way that authors were named, and that this would allow the count to be reduced to only the significant authors. Unfortunately there is no consensus across academia about how this should work, and no obvious way of automating the process.

Solutions

How to deal with this issue? Well, for this year we’re taking a slightly crude but effective approach: we’re simply not counting the papers with more than 1000 authors. 1000 is a somewhat arbitrary cut-off point, but a visual inspection of the distribution suggests that it is a reasonable separation point between the regular distribution on the left and the abnormal clusters on the right.
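In code the rule itself is trivial – something like the sketch below, again with an invented author_count column standing in for the real data:

```python
import pandas as pd

# Hypothetical toy data: author counts per paper.
papers = pd.DataFrame({
    "paper_id":     [1, 2, 3],
    "author_count": [4, 212, 2932],
})

# This year's rule: drop any paper with more than 1000 authors
# before the citation scores are calculated.
MAX_AUTHORS = 1000
kept = papers[papers["author_count"] <= MAX_AUTHORS]
print(kept)
```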

In the longer term there are two viable approaches, one technical and one structural. The technical approach is to use fractional counting (2932 authors? Well, you each get 0.034% of the credit). The structural approach is more of a long-term solution: to persuade the academic community to adopt metadata that adequately explains the relationship of individuals to the paper that they are ‘authoring’. Unfortunately I’m not holding my breath on that one.
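A quick sketch of what fractional counting might look like, with the usual caveat that the data and column names are invented:

```python
import pandas as pd

# One row per (paper, authoring university) pair; invented data.
rows = pd.DataFrame({
    "paper_id":   [1, 1, 1, 2],
    "university": ["A", "B", "C", "A"],
    "fw_score":   [1.5, 1.5, 1.5, 0.8],
})

# Each row gets 1/N of the paper's credit, where N is the number of
# contributing rows (for the ATLAS paper, 1/2932, or roughly 0.034%).
n_rows = rows.groupby("paper_id")["paper_id"].transform("count")
rows["credit"] = rows["fw_score"] / n_rows

fractional_scores = rows.groupby("university")["credit"].sum()
print(fractional_scores)
```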

*well, most

**plot spoiler: the world didn’t end