Tag Archives: reproducible research

Campbell and Stanley explained replication rates in 1963

Over 60 years ago, Donald Campbell and Julian Stanley published their classic, slim volume Experimental and Quasi-Experimental Designs for Research. One of their earliest observations concerns the trade-off between internal and external validity. Specifically, the more precisely one can establish a causal relationship, the less one can say about its generality. In recent work, I show that simultaneously maximising internal and external validity is not merely a practical limitation to be mitigated, but a structural impossibility. The relationship is analogous to the Heisenberg uncertainty principle that shows one cannot simultaneously know both the position and momentum of a particle with arbitrary precision. In the context of the social and behavioural sciences, the more precisely one identifies a cause, the narrower the domain to which that knowledge applies.

I reviewed this problem in terms of the so-called “replication crisis”, the difficulty researchers have encountered in replicating published causal findings. Shortly after posting that paper, Nature published a series of articles on research credibility, including a large-scale investigation of replicability in the social and behavioural sciences. The empirical effort is extraordinary, involving hundreds of researchers and a substantial coordination infrastructure. The methods, results, and theoretical framing are all of considerable interest. However, the study has also generated headline figures that are readily misinterpreted—an outcome encouraged both by editorial framing and by the structure of the paper itself.

The central difficulty lies in two under-specified concepts that drive the research: the “same question” and the “claim”. Whether a replication tests the “same question” is treated as a local, theory-laden judgement made by individual teams, yet sameness is assumed to be constant at two levels simultaneously. First, the multiple replications of a single study should all be replicating the same thing, as if each attempt stood in an identical relationship to the original. Second, across all the original studies, sameness should stand in an identical relationship between a replication and its target, regardless of which study is being replicated. If “same” does not mean the same thing within and between replications, the target drifts meaninglessly.

At the same time, replications are of “claims”: scientific claims reduced to directional empirical statements, detached from the estimands, models, and analytic pipelines. That is, the claim is detached from the scientific meaning that gave it purchase in the original study. The same problem with “claims” arose in the team’s Nature paper on analytic robustness. Abstracting scientific claims into more generic “claims” produces a mismatch between design and inference. Heterogeneous interpretations of what is actually being tested are collapsed into standardised statistical comparisons. Apparent agreement or disagreement may therefore reflect shifts in underlying targets rather than genuine replication or failure.

A related issue is that the study attempts to straddle internal and external validity without resolving their tension. It presents itself as assessing whether findings replicate, but in practice examines how results behave under modest variation in context, measurement, and implementation—something closer to robustness or transportability than strict replication. The use of multiple, non-equivalent metrics of “success” in the Nature article reinforces this ambiguity. Replication rates vary substantially depending on the criterion, yet a single headline figure is foregrounded: “Half of social-science studies fail replication test in years-long project”. The result is a study that is informative about the behaviour of findings (and researchers) under perturbation, but is easily—and predictably—read as making stronger claims about the reliability or truth of scientific results than its design can support.
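To see how much the choice of criterion matters, here is a minimal sketch (in Python, with invented effect sizes and standard errors; this is not the Nature team’s analysis) of how the same set of simulated replications can yield quite different headline “replication rates” under three common success criteria.

    # Purely illustrative: simulated original and replication estimates.
    # All distributions and standard errors below are invented.
    import numpy as np

    rng = np.random.default_rng(1)
    n_studies = 1000
    true_effect = rng.normal(0.3, 0.15, n_studies)   # hypothetical true effects
    se_orig, se_rep = 0.12, 0.10                     # hypothetical standard errors

    orig = rng.normal(true_effect, se_orig)          # original estimates
    rep = rng.normal(true_effect, se_rep)            # replication estimates

    # Criterion 1: replication significant (p < .05) in the original direction
    sig_same_dir = (np.abs(rep / se_rep) > 1.96) & (np.sign(rep) == np.sign(orig))
    # Criterion 2: replication estimate falls inside the original 95% CI
    inside_orig_ci = np.abs(rep - orig) < 1.96 * se_orig
    # Criterion 3: original estimate falls inside the replication 95% CI
    inside_rep_ci = np.abs(orig - rep) < 1.96 * se_rep

    for label, ok in [("significant, same direction", sig_same_dir),
                      ("replication inside original CI", inside_orig_ci),
                      ("original inside replication CI", inside_rep_ci)]:
        print(f"{label}: {ok.mean():.0%}")

The three rates answer three different questions; none of them, on its own, says how far the finding travels.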

Underlying both issues is a deeper disagreement about what replication is for. The paper’s opening paragraph explicitly reflects this tension. One reference is the National Academies of Sciences (NAS) report, which defines replication in procedural and statistical terms: collect new data using similar methods and assess whether results are consistent, typically via effect sizes and uncertainty intervals. The other reference is a 2020 PLoS Biology article by Nosek and Errington (the two senior authors of this Nature paper), who argue that the NAS definition is not merely imprecise but conceptually mistaken. On the Nosek-Errington account, determining that a study is a replication is a theoretical commitment: both confirming and disconfirming outcomes must be treated in advance as diagnostic of the original claim. The Nature paper adopts this language—replication teams were instructed to produce “good faith tests” of claims—but the article reports results entirely using metrics derived from the procedural-statistical tradition of the NAS. This is not a superficial inconsistency. The two frameworks imply different standards of success, different interpretations of failure, and different meanings for any aggregated replication rate. The headline figures that have circulated are products of the latter framework; whether they would survive translation into the former is not addressed.

It is here that Campbell and Stanley’s observation, and its formalisation, becomes decisive. The procedural-statistical approach implicitly treats internal validity as primary and assumes that external validity can be inferred from it. That is, if results are consistent, the finding travels. The structural trade-off shows that this assumption cannot hold. The very steps taken to secure internal validity constrain the scope of generalisation. A high replication rate under this framework may therefore be simultaneously informative and misleading. It indicates that a result can be reproduced under sufficiently similar conditions, while obscuring how narrow those conditions may be. The Nosek-Errington framework recognises the need for theoretical commitment, but without a principled account of causal structure it cannot resolve the tension either. What the Nature paper ultimately demonstrates—perhaps inadvertently—is that replicability is not a property of findings alone. It is a property of the relationship between a finding and the conditions under which it is tested. This underscores a Cartwrightian notion of relationships tied to particular material configurations: nomological machines. Until that relationship is made explicit, headline replication rates will continue to invite overconfident conclusions in both directions, along with admonitions for better methods.


I did not have access to the published article, which is behind the Springer Nature paywall; instead, I relied on the publicly available preprint.

Analytic robustness could be a real problem

A recent article in Nature on the robustness of research findings in the social and behavioural sciences found that only 34% of re-analyses of the data yielded the same result as the original report. This sounds horrible. It sounds as if two-thirds of the research that social and behavioural scientists are doing is low-quality work that certainly does not deserve to be published. One might reasonably ask whether “confabulist” would be a better job title than “scientist”.

Unfortunately, the edifice of “robust research” has been built on foundations of sand. The research shares many of the weaknesses of another article recently published in Science Advances, which I discuss here. Little can be concluded from the research that could actually inform scientific practice or permit any observation about the quality or robustness of the original articles. It does, however, say something of interest to sociologists of science about the diversity of views that researchers hold about how to re-analyse data to address conceptual claims.

The procedure followed in the Nature article was described thus.

To explore the robustness of published claims, we selected a key claim from each of our 100 studies, in which the authors provided evidence for a (directional) effect. We presented each empirical claim to at least five analysts along with the original data and asked them to analyse the data to examine the claim, following their best judgement and report only their main result. The analysts were encouraged to analyse those studies where they saw the greatest relevance of their expertise.

The word “claim” here does a lot of work. One might reasonably argue that a scientific claim in a published article is a statement of finding in the context of the hypothesis, the model, the analytic process, and the results. But this is not what is meant here. That full scientific sense of a claim is closer to what the Centre for Open Science team use as a starting point for a separate article on “reproducible” research. In the context of this article a “claim” is some vaguer statement of finding. It is an isolated single claim, has a direction of effect, and critically, is “phrased on a conceptual and not statistical level”.

The conceptual claim is closer to a vernacular claim. It is closer to the kind of thing you might say at a dinner party or read in the popular science section of a magazine. Something like, “did you hear that single female students report lower desired salaries when they think their classmates can see their preferences?” (Claim 025).

Under this framework, one should be able to abstract a full scientific claim into a conceptual claim, and if the conceptual claim is robust, independent scientists analysing the same data, making equally sensible choices about the analysis, will converge on the conceptual claim. The challenge is that the pool of independent and equally sensible scientists needs to agree (without consultation) on how that conceptual claim is to be translated into a scientific claim. Part of the science is deciding on the estimand for testing the claim, but the estimand is fixed by the analytic choices, not by the conceptual claim. If two scientists analyse the same dataset but target different estimands through their analytic choices, they are not converging on the same conceptual claim. Against all logic, an analytic schema that targets a different estimand but nonetheless produces an estimate close to that of the original paper counts as supporting the robustness of the paper.

The framework, therefore, has a double incoherence. First, divergence of estimates (between the original analysis and re-analysis) is misread as fragility when it may simply reflect different estimands—different scientists sensibly translating the conceptual claim into different scientific claims. Second, and more damaging, convergence is misread as robustness when it may be entirely spurious—two analysts targeting different estimands who happen to produce similar point estimates are not confirming each other. They’re producing agreement by accident, across questions that aren’t the same question.

So the framework is wrong in both directions simultaneously. It penalises legitimate scientific pluralism and rewards numerical coincidence. A study could score as highly robust because several analysts happened to get similar numbers while asking entirely different questions. A study could score as fragile because several analysts made defensible but divergent estimand-constituting choices that led to genuinely different answers to genuinely different questions.
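A toy simulation makes the second failure mode concrete. In the sketch below (Python, with an invented data-generating process and invented variable names), one analyst reports an unadjusted difference in means and another a covariate-adjusted regression coefficient: two different estimands estimated from the same dataset.

    # Illustrative only: two analysts, two estimands, one dataset.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 5000
    group = rng.binomial(1, 0.5, n)                        # a subgroup indicator
    treatment = rng.binomial(1, 0.3 + 0.4 * group, n)      # uptake differs by subgroup
    outcome = 0.5 * treatment + 1.0 * group + rng.normal(0, 1, n)

    # Analyst A: unadjusted difference in means (a marginal estimand)
    marginal = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

    # Analyst B: subgroup-adjusted regression coefficient (a conditional estimand)
    X = sm.add_constant(np.column_stack([treatment, group]))
    conditional = sm.OLS(outcome, X).fit().params[1]

    print(f"Analyst A (marginal):    {marginal:.2f}")
    print(f"Analyst B (conditional): {conditional:.2f}")

In this invented dataset the two numbers diverge; under a different confounding structure they could easily coincide. Neither outcome says anything about whether either analyst has tested the original claim.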

There is another, far more interesting reading of this paper, one that offers neither click-bait nor an opportunity to remonstrate. Where the authors have identified fragility (or a lack of robustness), another reader could legitimately and positively see vitality and methodological pluralism. The social and behavioural sciences work in the messy space of self-referential agents actively interacting with and changing the environments in which they live and do science. It is hardly surprising that epistemic pluralism is a consequence of this. The 34% figure is not a scandal. It is valuable (and under-appreciated) data about the nature of social reality.


I did not have access to the published article, which is behind the Springer Nature paywall; instead, I relied on the publicly available preprint.

Low-cost reproducible microscopy

There has been growing interest in reproducible research. The interest arises from the idea that scientific discoveries that are one-off, isolated, and never to be repeated have limited value. For research to inform future science, others must be able to reproduce the results. There are even courses on reproducible research. However, a look at the courses and a quick search of PubMed will reveal that when people refer to reproducible research, they often mean shared data or shared analytic code. And when I write “shared data”, I don’t really mean just any data; I mean electronic data … a spreadsheet, a database, etc. Of course, the Methodology sections of journal articles are supposed to support reproducible research, but these are often hints and teasers about what was done rather than a genuine “how to”.

Reproducibility becomes more challenging when all one has to work with is a one-off observation. How do I show you what I saw? The question became particularly relevant to me in a recent discussion with a colleague about reproducible microscopy. The obvious answer is “a photograph” (with a professional setup, one can achieve spectacular results), but what should be done in resource-poor settings where money and equipment are limited?

I am one of the investigators on a Wellcome Trust funded “Our Planet Our Health” award led by Rebekah Brown at the Monash Sustainable Development Institute. The “Revitalisation of Informal Settlements and their Environment” (RISE) study involves the collection of large quantities of diverse data from informal settlements in Makassar, Indonesia and Suva, Fiji. Most of the samples will be collected by in-country teams, and they will not have access to high-end equipment. Nonetheless, some of the samples will have to be examined under a microscope in Makassar and Suva. It is likely that we will have to rely on basic equipment and lab technicians with limited skills in microscope photography. From my SEACO experience, I would be looking for a low-cost solution that can be implemented with basic training. The solution “works” if the images are appropriate; that is, they capture the thing of scientific interest and are of sufficient fidelity that a researcher somewhere else in the world can interpret them appropriately. It is not necessary for the technician to be brilliant, just adequate.

I decided to play. I am neither a good photographer nor am I good at microscopy. I reasoned that if I could get something approximating a reasonable image, then a lab technician with some actual training would have no problems. The best low-cost camera solution is not to buy another piece of equipment at all. We are already committed to using a smartphone/tablet solution in the RISE project and plan to use them for capturing photographs, tagging photographs, and uploading them to a server. The only challenge was getting the smartphone camera to “peek” into the microscope. Fortunately, there is a broad range of solutions, and I opted for the very cheapest I could find on eBay. It cost me USD$5.99, brand new, including postage and handling.

The mount is straightforward to use, although my first attempt was pretty awful.  I found a weevil crawling around the kitchen (welcome to the tropics!) and that became the first portraiture subject.

A photograph of a weevil taken with a Google phone using a smartphone-microscope adapter

The original images are of much higher resolution than the versions I have posted here. I didn’t know what I was doing, and most of the image is taken up with the microscope surrounds rather than the subject of the photograph. I tried again the next day, this time using a peppercorn as the subject — it didn’t move as quickly.

A photograph of a peppercorn taken with a Google phone using a smartphone-microscope adapter

The only real difference in my approach was that this time I zoomed in slightly on the peppercorn. I will never look at peppercorns the same way. What appears (to my unqualified eyes) to be fungal mycelium is less than appealing. Nonetheless, it also seems that the general approach to capturing microscope images might be a reasonable one. As long as the technician knows what to photograph, the quality of the images is almost certainly good enough for others to view and interpret. This is potentially quite exciting because it allows science (and quite basic science) to be virtual and shared. A photograph of a microscopic image taken in Makassar could be shared with the world within hours, giving scientists anywhere an opportunity to look, think, interpret, question and suggest.


Guidelines for the reporting of COde Developed to Analyse daTA (CODATA)

I was reviewing an article recently for a journal in which the authors referenced a GitHub repository for the Stata code they had developed to support their analysis. I had a look at the repository. The code was there in a complex hierarchy of nested folders.  Each individual do-file was well commented, but there was no file that described the overall structure, the interlinking of the files, or how to use the code to actually run an analysis.

I have previously published code associated with some of my own analyses. The code for a recent paper on gender bias in clinical case reports was published here, and the code for the Bayesian classification of ethnicity based on names was published here. None of my code had anything like the complexity of the code referenced in the paper I was reviewing. It did get me thinking, however, about how the code for statistical analyses should be written. The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network has 360 separate guidelines for reporting research. These include guidelines for everything from randomised trials and observational studies through to diagnostic studies, economic evaluations and case reports. There is nothing on the reporting of code for the analysis of data.

On the back of the move towards making data available for re-analysis, and the reproducible research movement, it struck me that guidelines for the structuring of code for simultaneous publication with articles would be enormously beneficial.  I started to sketch it out on paper, and write the idea up as an article.  Ideally, I would be able to enrol some others as contributors.  In my head, the code should have good meta-data at the start describing the structure and interrelationship of the files.  I now tend to break my code up into separate files with one file describing the workflow: data importation, data cleaning, setting up factors, analysis.  And then I have separate files for each element of the workflow. My analysis is further divided into specific references to parts of papers. “This code refers to Table 1”.  I write the code this way for two reasons.  It makes it easier for collaborators to pick it up and use it, and I often have a secondary, teaching goal in mind.  If I can write the code nicely, it may persuade others to emulate the idea.  Having said that, I often use fairly unattractive ways to do things, because I don’t know any better; and I sometimes deliberately break an analytic process down into multiple inefficient steps simply to clarify the process — this is the anti-Perl strategy.
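As a sketch of the kind of structure I have in mind (the file names are illustrative and the language incidental; the same layout works equally well as a set of Stata do-files or R scripts), the master file might be little more than metadata and an ordered list of steps:

    # run_all.py -- master file: metadata describing the workflow, plus the
    # order in which the step files are run. All file names are illustrative.
    #
    #   01_import.py    read the raw data (spreadsheet / database export)
    #   02_clean.py     recode missing values, correct obvious entry errors
    #   03_factors.py   construct derived variables and factor levels
    #   04_table1.py    descriptive statistics -> Table 1 of the paper
    #   05_table2.py    regression models      -> Table 2 of the paper
    #
    import subprocess

    steps = ["01_import.py", "02_clean.py", "03_factors.py",
             "04_table1.py", "05_table2.py"]
    for step in steps:
        subprocess.run(["python", step], check=True)   # stop if any step fails

Each step file can still be read and run on its own; the master file exists so that a stranger can see the whole pipeline at a glance.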

I then started to review the literature and stumbled across a commentary written by Nick Barnes in 2010 in the journal Nature. He has completely persuaded me that my idea is silly.

It is not silly to hope that people will write intelligible, well-structured, well-commented code for the statistical analysis of data. It is not silly to hope that people will include this beautiful code in their papers. The problem with guidelines published by the EQUATOR Network is the way that journals require authors to comply with them. They become exactly the opposite of guidelines: they are rules, an ironic twist on the observation by Geoffrey Rush’s character, Hector Barbossa, in Pirates of the Caribbean.

Barnes wrote, “I want to share a trade secret with scientists: most professional computer software isn’t very good.”  Most academics/researchers feel embarrassed by their code.  I have collaborated with a very good Software Engineer in some of my work and spent large amounts of time apologising for my code.  We want to be judged for our science, not for our code.  The problem with that sense of embarrassment is that the perfect becomes the enemy of the good.

The Methods sections of most research articles make fairly vague allusions to how the data were actually managed and analysed. There may be references to statistical tests and theoretical distributions. For a reader to move from that to a re-analysis of the data is often not straightforward. The actual code, however, explains exactly what was done. “Ah! You dropped two cases, collapsed two factors, and used a particular version of an algorithm to perform a logistic regression analysis. And now I know why my results don’t quite match yours”.
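A hypothetical fragment (the file, variable names, and model below are invented, not drawn from any particular paper) shows how a few lines of code settle questions that the prose leaves open:

    # Invented example: the code states exactly what the prose only gestures at.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("trial_data.csv")

    # Exactly which cases were dropped, and why
    df = df[df["age"].notna()]                      # two cases with missing age

    # Exactly how the exposure categories were collapsed
    df["exposure"] = df["exposure"].replace({"former": "ever", "current": "ever"})

    # Exactly which model, with which covariates
    model = smf.logit("outcome ~ exposure + age + sex", data=df).fit()
    print(model.summary())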

It would be nice to have an agreed set of guidelines for the reporting of COde Developed to Analyse daTA (CODATA). It would be great if some authors followed the CODATA guidelines when they published. But it would be even better if everyone published their code, no matter how bad or inefficient it was.