Analytic robustness could be a real problem

A recent article in Nature on the robustness of research findings in the social and behavioural sciences found that only 34% of re-analyses of the data yielded the same result as the original report. This sounds horrible. It sounds as if two-thirds of the research that social and behavioural scientists are doing is low-quality work that certainly does not deserve to be published. One might reasonably ask whether “confabulist” would be a better job title than “scientist”.

Unfortunately, the edifice of “robust research” has been built on foundations of sand. The research shares many of the weaknesses of another article recently published in Science Advances, which I discuss here. Little can be concluded from the research that could actually inform scientific practice or permit any observation about the quality or robustness of the original articles. It does, however, say something of interest to sociologists of science about the diversity of views that researchers hold about how to re-analyse data to address conceptual claims.

The procedure followed in the Nature article was described thus.

To explore the robustness of published claims, we selected a key claim from each of our 100 studies, in which the authors provided evidence for a (directional) effect. We presented each empirical claim to at least five analysts along with the original data and asked them to analyse the data to examine the claim, following their best judgement and report only their main result. The analysts were encouraged to analyse those studies where they saw the greatest relevance of their expertise.

The word “claim” here does a lot of work. One might reasonably argue that a scientific claim in a published article is a statement of finding in the context of the hypothesis, the model, the analytic process, and the results. But that is not what is meant here. That full scientific sense of a claim is closer to what the Centre for Open Science team use as a starting point for a separate article on “reproducible” research. In the context of this article, a “claim” is some vaguer statement of finding. It is a single, isolated claim; it has a direction of effect; and, critically, it is “phrased on a conceptual and not statistical level”.

The conceptual claim is closer to a vernacular claim. It is closer to the kind of thing you might say at a dinner party or read in the popular science section of a magazine. Something like, “did you hear that single female students report lower desired salaries when they think their classmates can see their preferences?” (Claim 025).

Under this framework, one should be able to abstract a full scientific claim into a conceptual claim, and if the conceptual claim is robust, independent scientists analysing the same data, making equally sensible choices about the analysis, will converge on the conceptual claim. The challenge is that your pool of independent and equally sensible scientists needs to agree (without consultation) on how that conceptual claim is to be translated into a scientific claim. Part of the science is deciding on the estimand for testing the claim, but the estimand is fixed by the analytic choices, not by the conceptual claim. If two scientists analyse the same dataset but target different estimands through their analytic choices, they are not converging on the same conceptual claim. Against all logic, an analytic schema that targets a different estimand, yet happens to produce an estimate close to the original paper’s, counts as support for the robustness of the paper.

The framework, therefore, has a double incoherence. First, divergence of estimates (between the original analysis and re-analysis) is misread as fragility when it may simply reflect different estimands—different scientists sensibly translating the conceptual claim into different scientific claims. Second, and more damaging, convergence is misread as robustness when it may be entirely spurious—two analysts targeting different estimands who happen to produce similar point estimates are not confirming each other. They’re producing agreement by accident, across questions that aren’t the same question.

So the framework is wrong in both directions simultaneously. It penalises legitimate scientific pluralism and rewards numerical coincidence. A study could score as highly robust because several analysts happened to get similar numbers while asking entirely different questions. A study could score as fragile because several analysts made defensible but divergent estimand-constituting choices that led to genuinely different answers to genuinely different questions.
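The estimand point can be made concrete with a toy simulation (my own illustration, not data or analyses from the Nature study). Two analysts test the same conceptual claim, “X raises Y”, on the same dataset. Analyst A runs an unadjusted regression, whose estimand is the total association of X with Y; Analyst B adjusts for a covariate Z, whose estimand is the association holding Z fixed. Both choices are defensible translations of the conceptual claim, yet they target different quantities and so give genuinely different answers.

```python
# Toy sketch: same data, same conceptual claim, different estimands.
# All numbers here are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                       # covariate; adjusting for it changes the estimand
x = 0.8 * z + rng.normal(size=n)             # "treatment" influenced by z
y = 0.5 * x + 0.7 * z + rng.normal(size=n)   # outcome depends on both

def ols_slopes(y, cols):
    """Least-squares coefficients (intercept first) via lstsq."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Analyst A: unadjusted -- estimand is the total association of x with y
# (in this simulation the theoretical value is about 0.84).
b_total = ols_slopes(y, [x])[1]

# Analyst B: adjusted for z -- estimand is the association of x with y
# holding z fixed (theoretical value 0.5).
b_direct = ols_slopes(y, [x, z])[1]

print(f"unadjusted slope: {b_total:.2f}")
print(f"adjusted slope:   {b_direct:.2f}")
# Both analyses are sensible readings of "X raises Y", but their estimates
# differ because the estimands differ; the disagreement is not fragility.
```

Under the framework criticised above, this disagreement would be scored as a failure to reproduce the claim, even though neither analyst has made an error.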

There is another and far more interesting reading of this paper, one with neither a click-bait quality nor an opportunity to remonstrate. Where the authors have identified fragility (or a lack of robustness), another reader could legitimately and positively see vitality and methodological pluralism. The social and behavioural sciences work in the messy space of self-referential agents actively interacting with and changing the environments in which they live and do science. It is hardly surprising that epistemic pluralism is a consequence. The 34% figure is not a scandal. It is valuable (and under-appreciated) data about the nature of social reality.


I did not have access to the published article which is behind the Springer-Nature paywall. Instead I relied on the publicly available preprint.