Tuesday, July 1, 2014

Can we measure scientific success? Should we?

My new paper.
Measures for scientific success have become a hot topic in the community. Many scientists have spoken out in view of the increasingly widespread use of these measures. They largely agree that the attempt to quantify, even predict, scientific success is undesirable if not outright flawed. In this blog’s archive, you’ll find me banging the same drum too.

Scientific quality assessment, so the argument goes, can’t be left to software crunching data. An individual’s promise can’t be summarized in a number. Success can’t be predicted from past achievements; look at all the historical counterexamples. Einstein already said... I’m sure he said something.

I’ve had a change of mind lately. I think science needs measures. Let me explain.

The problem with measures for scientific success has two aspects. One is that measures are used by people outside the community to rank institutions or even individuals for justification and accountability. That’s problematic because it’s questionable whether this leads to smart research investments, but I don’t think it’s the root of the problem.

The aspect that concerns me more, and that I think is the root of all evil, is that any measure for success feeds back into the system and affects the way science is conducted. The measure will be taken up by the researchers themselves. Rather than defining success individually, scientists are then encouraged to work towards an external definition of scientific achievement. They will compare themselves and others on these artificially created scales. So even if a quantifiable marker of scientific output was once an indicator for success, its predictive power will inevitably change as scientists work specifically towards it. What was meant to be a measure instead becomes a goal.

This has already happened in several cases. The most obvious examples are the number of publications or the number of research grants obtained. On average, both are plausibly correlated with scientific success. And yet a scientist who increases her paper output doesn’t necessarily increase the quality of her research, and employing more people to work on a certain project doesn’t necessarily mean its scientific relevance increases.

Correlation is not causation. If Einstein didn’t say that, he should have. And another truth that comes courtesy of my grandma is that too much of a good thing can be a bad thing. My daughter reminds me we’re not born with that wisdom. If sunlight falls on my screen and I close the blinds, she’ll declare that mommy is tired. Yesterday she poured a whole bottle of body lotion over herself.

Another example comes from Lee Smolin’s book “The Trouble with Physics”. Smolin argued that the number of single-authored papers is a good indicator of a young researcher’s promise. He’s not alone in this belief. Most young researchers are very aware that a single-authored paper will put a sparkle on their publication list. But maybe a researcher with many single-authored papers is just a bad collaborator.

Simple measures, too simple measures, are being used in the community. And this use affects what researchers strive for, distracting them from their actual task of doing good research.

So, yes, I too dislike attempts to measure scientific success. But if we all agree that it stinks, why are we breathing the stink? Why do not only funding agencies and other assessment ‘exercises’ use these measures, but scientists themselves?

Ask any scientist if they think the number of papers shows a candidate’s promise and they’ll probably say no. Ask if they think publications in high impact journals are indicators of scientific quality and they’ll probably say no. Look at what they do, however, and the length of the publication list and the occurrence of high impact journals on that list are suddenly remarkably predictive of their opinion. And then somebody will ask for the h-index. The very reason that politically savvy researchers tune their score on these scales is that, sadly, it does matter. Analogies to natural selection are not coincidental. Both are examples of complex adaptive systems.

The reason for the widespread use of oversimplified measures is that they’ve become necessary. They stink, all right, but they’re the smallest evil among the options we presently have. They’re the least stinky option.

The world has changed and the scientific community with it. Two decades ago you’d apply for jobs by carrying letters to the post office, grateful for the sponge so you wouldn’t have to lick all those stamps. Today you apply by uploading application documents within seconds all over the globe, and I’m not sure they still sell lickable stamps. This, together with increasing mobility and connectivity, has greatly inflated the number of places researchers apply to. And with that, the number of applications every place gets has skyrocketed.

Simplified measures are being used because it has become impossible to actually do the careful, individual assessment that everybody agrees would be optimal. And that has led me to think that instead of outright rejecting the idea of scientific measures, we have to accept them, improve them, and make them useful to our needs, not to those of bean counters.

Scientists, in hiring committees or on some funding agency’s review panel, have needs that presently just aren’t addressed by existing measures. Maybe one would like to know the overlap of a candidate’s research topics with those represented at a department. How often have they been named in acknowledgements? Do you share common collaborators? What administrative skills does the candidate bring? Is there somebody in my network who knows this person and could give me a firsthand assessment? Do they have experience with conference organization? What’s their h-index relative to the typical h-index in their field? What would you like to know?

You might complain these are not measures of scientific quality, and that’s correct. But science is done by humans. These aren’t measures of scientific quality; they’re indicators of how well a candidate might fit an open position and a new environment. And that, in turn, is relevant for both their success and that of the institution.

Today, personal relations are highly relevant for successful applications. That is a criterion which sparks interest and is being used in the absence of better alternatives. We can improve on that by offering ways to quantify, for example, the vicinity of research areas. This can provide a fast way to identify interesting candidates one might not have heard of before.
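For illustration, here is a minimal sketch of what such a quantification could look like. The keyword sets and the use of Jaccard similarity are my own assumptions, not an established metric; in practice the topics would have to be extracted from publication data.

    # Hypothetical sketch: quantify the "vicinity" of research areas as the
    # overlap between a candidate's topic keywords and a department's.
    def topic_vicinity(candidate_topics, department_topics):
        """Jaccard similarity of two sets of research topics, between 0 and 1."""
        a, b = set(candidate_topics), set(department_topics)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    # Illustrative keyword sets, e.g. taken from paper abstracts or arXiv categories.
    candidate = {"quantum gravity", "cosmology", "phenomenology"}
    department = {"cosmology", "astroparticle physics", "phenomenology", "string theory"}

    print(topic_vicinity(candidate, department))  # 2 shared topics out of 5 -> 0.4

A number like this says nothing about the quality of anybody’s work; it merely helps a committee decide whose applications to read more closely.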

And so I think “Can we measure scientific success?” is the wrong question to ask. We should ask instead what measures serve scientists in their profession. I’m aware that several alt-metrics are meanwhile on offer, but they don’t address the issue; they merely take into account more data sources to measure essentially the same thing.

That concerns the second aspect of the problem, the use of measures within the community. As far as the first aspect is concerned, the use of measures by accountants who are not scientists themselves: the reason they use certain measures for success or impact is that they believe scientists themselves regard them as useful. Administrators use these measures simply because they exist and because scientists, for lack of better alternatives, draw upon them to justify and account for their success or that of their institution. If you have argued that the value of your institute lies in the number of papers produced or conferences held, in the number of visitors pushed through or distinguished furniture bought, you’ve contributed to that problem. Yes, I’m talking about you. Yes, I know not using these numbers would just make matters worse. That’s my point: they’re a bad option, but still the best available one.

So what to do?

Feedback in complex systems and network dynamics have been studied extensively during the last decade. Dirk Helbing recently had a very readable brief review in Nature (pdf here), and I’ve tried to extract some lessons from it.
  •  No universal measures.
    Nobody has a recipe for scientific success. Picking a single measure bears a great risk of failure. We need a variety so that the pool remains heterogeneous. There is a trend towards standardized measures because people love ordered lists. But we should have a large number of different performance indicators.    

  •  Individualizable measures.
    It must be possible to individualize measures, so that they can take into account local and cultural differences as well as individual opinions and different purposes. You might want to give importance to the number of single-authored papers. I might want to give importance to science blogging. You might think patents are of central relevance. I might think a long-term vision is. Maybe your department needs somebody who is skilled in public outreach. Somebody once told me he wouldn’t hire a postdoc who doesn’t like Jazz. One size doesn’t fit all; a toy sketch of such a configurable measure follows after this list.
  •  Self-organized and network solutions.
    Measures should take into account locations and connections in the various scientific networks, be it social networks, coauthor networks, or networks based on research topics. If you’re not familiar with somebody’s research, can you find somebody who you trust to give you a frank assessment? Can I find a link to this person’s research plans?
  •  No measure is ever final.
    Since the use of measures feeds back into the system, they need to be constantly adapted and updated. This should be a design feature and not an afterthought.
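To make the second and fourth of these points concrete, here is a toy sketch, in Python, of what an individualizable and adaptable measure could look like. Every indicator name, weight, and number in it is an assumption for the sake of illustration, not a proposal for a standard.

    # Toy sketch of an individualizable measure: every committee picks its own
    # indicators and weights. All names and numbers are illustrative assumptions.
    def score(indicators, weights):
        """Weighted sum over whichever indicators the user chooses to weight."""
        return sum(weights.get(name, 0.0) * value for name, value in indicators.items())

    candidate = {
        "h_index_relative_to_field": 1.2,  # h-index divided by the field's typical value
        "single_authored_papers": 4,
        "outreach_activities": 7,          # e.g. blog posts and public talks
        "topic_vicinity": 0.4,             # overlap with the department, as sketched above
    }

    # Your weights need not be my weights, and both can be revised once a
    # measure's shortcomings become apparent.
    my_weights = {"h_index_relative_to_field": 1.0, "topic_vicinity": 2.0}
    your_weights = {"single_authored_papers": 0.5, "outreach_activities": 1.0}

    print(score(candidate, my_weights))    # 1.2 + 0.8 = 2.0
    print(score(candidate, your_weights))  # 2.0 + 7.0 = 9.0

The point is not the arithmetic but the interface: the weights belong to the user, and nothing prevents them from being updated as the system adapts.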

Some time between Pythagoras and Feynman, scientists had to realize that it had become impossible to check the accuracy of all experimental and theoretical knowledge that their own work depended upon. Instead they adopted a distributed approach in which scientists rely on the judgment of specialists for topics in which they are not specialists themselves; they rely on the integrity of their colleagues and the shared goal of understanding nature.

If humans lived forever and were infinitely patient, then every scientist could trace down and fact-check every detail that their work makes use of. But that’s not our reality. The use of measures to assess scientists and institutions represents a similar change towards a networked solution. Done the right way, I think that measures can make science fairer and more efficient.
