Category Archives: Epidemiology

The study of the causes and distribution of disease. A methodological branch of the health sciences.

Campbell and Stanley explained replication rates in 1963

Over 60 years ago, Donald Campbell and Julian Stanley published their classic, slim volume Experimental and Quasi-Experimental Designs for Research. One of their earliest observations concerns the trade-off between internal and external validity: the more precisely one can establish a causal relationship, the less one can say about its generality. In recent work, I show that simultaneously maximising internal and external validity is not merely a practical difficulty to be mitigated but a structural impossibility. The relationship is analogous to the Heisenberg uncertainty principle, which holds that one cannot simultaneously know both the position and momentum of a particle with arbitrary precision. In the context of the social and behavioural sciences, the more precisely one identifies a cause, the narrower the domain to which that knowledge applies.

I reviewed this problem in terms of the so-called “replication crisis”: the difficulty researchers have encountered in replicating published causal findings. Shortly after I posted that paper, Nature published a series of articles on research credibility, including a large-scale investigation of replicability in the social and behavioural sciences. The empirical effort is extraordinary, involving hundreds of researchers and a substantial coordination infrastructure. The methods, results, and theoretical framing are all of considerable interest. However, the study has also generated headline figures that are readily misinterpreted—an outcome encouraged both by editorial framing and by the structure of the paper itself.

The central difficulty lies in two under-specified concepts that drive the research: the “same question” and the “claim”. Whether a replication tests the “same question” is treated as a local, theory-laden judgement made by individual teams. Yet sameness is assumed to be constant at two levels simultaneously. First, the multiple replications of a single study should be replicating the same thing, as if each attempt stood in an identical relationship to the original. Second, across all the original studies, sameness should stand in an identical relationship between a replication and its target, regardless of which study is being replicated. If “same” does not mean the equivalent thing within and between replications, the target drifts meaninglessly.

At the same time, replications are of “claims”: scientific claims reduced to directional empirical statements, detached from the estimands, models, and analytic pipelines of the original study. That is, the claim is detached from the scientific meaning that gave it purchase. The same problem with “claims” arose in the team’s Nature paper on analytic robustness. Abstracting scientific claims into more generic “claims” produces a mismatch between design and inference. Heterogeneous interpretations of what is actually being tested are collapsed into standardised statistical comparisons. Apparent agreement or disagreement may therefore reflect shifts in underlying targets rather than genuine replication or failure.

A related issue is that the study attempts to straddle internal and external validity without resolving their tension. It presents itself as assessing whether findings replicate, but in practice examines how results behave under modest variation in context, measurement, and implementation—something closer to robustness or transportability than strict replication. The use of multiple, non-equivalent metrics of “success” in the Nature article reinforces this ambiguity. Replication rates vary substantially depending on the criterion, yet a single headline figure is foregrounded: “Half of social-science studies fail replication test in years-long project”. The result is a study that is informative about the behaviour of findings (and researchers) under perturbation, but is easily—and predictably—read as making stronger claims about the reliability or truth of scientific results than its design can support.
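To see how much the criterion matters, consider a single original–replication pair scored three different ways. The sketch below is illustrative only: the function and the numbers are invented, and the three criteria are generic ones from the replication literature (directional significance, inclusion in the original confidence interval, and fixed-effect meta-analysis), not necessarily the metrics used in the Nature paper.

```python
import numpy as np
from scipy import stats

def replication_verdicts(b_orig, se_orig, b_rep, se_rep, alpha=0.05):
    """Score one replication under three common, non-equivalent criteria.

    Hypothetical helper for illustration; not the Nature paper's metrics.
    """
    z = stats.norm.ppf(1 - alpha / 2)

    # Criterion 1: the replication is significant and in the original direction.
    p_rep = 2 * (1 - stats.norm.cdf(abs(b_rep / se_rep)))
    sig_same_sign = (p_rep < alpha) and (np.sign(b_rep) == np.sign(b_orig))

    # Criterion 2: the replication estimate falls inside the original 95% CI.
    in_orig_ci = (b_orig - z * se_orig) <= b_rep <= (b_orig + z * se_orig)

    # Criterion 3: a fixed-effect meta-analysis of both estimates is significant.
    w_orig, w_rep = 1 / se_orig**2, 1 / se_rep**2
    b_meta = (w_orig * b_orig + w_rep * b_rep) / (w_orig + w_rep)
    se_meta = (1 / (w_orig + w_rep)) ** 0.5
    meta_sig = abs(b_meta / se_meta) > z

    return {"sig_same_sign": sig_same_sign,
            "in_orig_ci": in_orig_ci,
            "meta_sig": meta_sig}

# A directionally consistent but underpowered replication "fails" criterion 1
# while "passing" criteria 2 and 3: one study, three possible headline rates.
print(replication_verdicts(b_orig=0.40, se_orig=0.15, b_rep=0.25, se_rep=0.15))
```

The same pair of studies yields a failure or a success depending on which criterion is foregrounded, which is precisely why a single headline rate under-determines what was found.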

Underlying both issues is a deeper disagreement about what replication is for. The paper’s opening paragraph explicitly reflects this tension. One reference is the National Academies of Sciences (NAS) report, which defines replication in procedural and statistical terms: collect new data using similar methods and assess whether the results are consistent, typically via effect sizes and uncertainty intervals. The other reference is a 2020 PLoS Biology article by Nosek and Errington (the two senior authors of this Nature paper), who argue that the NAS definition is not merely imprecise but conceptually mistaken. On the Nosek-Errington account, determining that a study is a replication is a theoretical commitment. Both confirming and disconfirming outcomes must be treated in advance as diagnostic of the original claim. The Nature paper adopts this language—replication teams were instructed to produce “good faith tests” of claims—but the article reports results entirely using metrics derived from the procedural-statistical tradition of the NAS. This is not a superficial inconsistency. The two frameworks imply different standards of success, different interpretations of failure, and different meanings for any aggregated replication rate. The headline figures that have circulated are products of the latter framework; whether they would survive translation into the former is not addressed.

It is here that Campbell and Stanley’s observation, and its formalisation, becomes decisive. The procedural-statistical approach implicitly treats internal validity as primary and assumes that external validity can be inferred from it. That is, if results are consistent, the finding travels. The structural trade-off shows that this assumption cannot hold. The very steps taken to secure internal validity constrain the scope of generalisation. A high replication rate under this framework may therefore be simultaneously informative and misleading. It indicates that a result can be reproduced under sufficiently similar conditions, while obscuring how narrow those conditions may be. The Nosek-Errington framework recognises the need for theoretical commitment, but without a principled account of causal structure it cannot resolve the tension either. What the Nature paper ultimately demonstrates—perhaps inadvertently—is that replicability is not a property of findings alone. It is a property of the relationship between a finding and the conditions under which it is tested. This underscores the Cartwrightian notion of relationships tied to particular material configurations: what Nancy Cartwright calls nomological machines. Until that relationship is made explicit, headline replication rates will continue to invite overconfident conclusions in both directions, along with admonitions to adopt better methods.


I did not have access to the published article, which is behind the Springer Nature paywall. Instead, I relied on the publicly available preprint.

Analytic robustness could be a real problem

A recent article in Nature on the robustness of research findings in the social and behavioural sciences found that only 34% of re-analyses of the data yielded the same result as the original report. This sounds horrible. It sounds as though two-thirds of the research that social and behavioural scientists are doing is low-quality work that certainly does not deserve to be published. One might reasonably ask whether “confabulist” might not be a better job title than “scientist”.

Unfortunately, the edifice of “robust research” has been built on foundations of sand. The research shares many of the weaknesses of another article recently published in Science Advances, which I discuss here. There is little that can be concluded from the research that could actually inform scientific practice or permit any observation about the quality or robustness of the original articles. It does, however, say something of interest for sociologists of science about the diversity of views that researchers have about how to re-analyse data to address conceptual claims.

The procedure followed in the Nature article was described thus:

To explore the robustness of published claims, we selected a key claim from each of our 100 studies, in which the authors provided evidence for a (directional) effect. We presented each empirical claim to at least five analysts along with the original data and asked them to analyse the data to examine the claim, following their best judgement and report only their main result. The analysts were encouraged to analyse those studies where they saw the greatest relevance of their expertise.

The word “claim” here does a lot of work. One might reasonably argue that a scientific claim in a published article is a statement of finding in the context of the hypothesis, the model, the analytic process, and the results. But this is not what is meant here. That full scientific sense of a claim is closer to what the Centre for Open Science team used as a starting point for a separate article on “reproducible” research. In the context of this article, a “claim” is some vaguer statement of finding. It is a single, isolated claim; it has a direction of effect; and, critically, it is “phrased on a conceptual and not statistical level”.

The conceptual claim is closer to a vernacular claim. It is closer to the kind of thing you might say at a dinner party or read in the popular science section of a magazine. Something like, “did you hear that single female students report lower desired salaries when they think their classmates can see their preferences?” (Claim 025).

Under this framework, one should be able to abstract a full scientific claim into a conceptual claim, and if the conceptual claim is robust, independent scientists analysing the same data, making equally sensible choices about the analysis, will converge on it. The challenge is that the pool of independent and equally sensible scientists needs to agree, without consultation, on how that conceptual claim is to be translated into a scientific claim. Part of the science is deciding on the estimand for testing the claim, but the estimand is fixed by the analytic choice, not by the conceptual claim. If two scientists analyse the same dataset but target different estimands through their analytic choices, they are not converging on the same conceptual claim. Against all logic, an analytic scheme targeting a different estimand that nonetheless produces an estimate close to that of the original paper counts as supporting the robustness of the paper.

The framework, therefore, has a double incoherence. First, divergence of estimates (between the original analysis and re-analysis) is misread as fragility when it may simply reflect different estimands—different scientists sensibly translating the conceptual claim into different scientific claims. Second, and more damaging, convergence is misread as robustness when it may be entirely spurious—two analysts targeting different estimands who happen to produce similar point estimates are not confirming each other. They’re producing agreement by accident, across questions that aren’t the same question.

So the framework is wrong in both directions simultaneously. It penalises legitimate scientific pluralism and rewards numerical coincidence. A study could score as highly robust because several analysts happened to get similar numbers while asking entirely different questions. A study could score as fragile because several analysts made defensible but divergent estimand-constituting choices that led to genuinely different answers to genuinely different questions.
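The double incoherence is easy to reproduce in miniature. The simulation below is a minimal sketch, not drawn from the paper or its data: the variable names, effect sizes, and seed are all invented. Two analysts make defensible but different estimand choices on the same dataset and arrive at genuinely different answers, even though both can claim to be testing the same conceptual claim.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Invented dataset: a binary exposure whose true effect on the outcome
# differs between two subgroups (0.2 in the majority, 0.8 in the minority).
group = rng.integers(0, 2, n)             # 1 = minority subgroup
treated = rng.integers(0, 2, n)
true_effect = np.where(group == 1, 0.8, 0.2)
y = true_effect * treated + 0.5 * group + rng.normal(0, 1, n)

# Analyst A targets the population-average (marginal) effect.
est_a = y[treated == 1].mean() - y[treated == 0].mean()

# Analyst B, reading the same conceptual claim as being about the minority
# subgroup, targets the effect within that subgroup only.
sub = group == 1
est_b = y[sub & (treated == 1)].mean() - y[sub & (treated == 0)].mean()

print(f"Analyst A (marginal estimand): {est_a:.2f}")   # approx. 0.5
print(f"Analyst B (subgroup estimand): {est_b:.2f}")   # approx. 0.8
```

Scored against a single reference number, Analyst B looks like a failed re-analysis; had the two numbers happened to coincide, the same scoring would have counted pure coincidence as robustness. Neither verdict says anything about the quality of the original study.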

There is another, far more interesting reading of this paper, one with neither a click-bait quality nor an opportunity to remonstrate. Where the authors have identified fragility (or a lack of robustness), another reader could legitimately and positively see vitality and methodological pluralism. The social and behavioural sciences work in the messy space of self-referential agents actively interacting with and changing the environments in which they live and do science. It is hardly surprising that epistemic pluralism is a consequence of this. The 34% figure is not a scandal. It is valuable (and under-appreciated) data about the nature of social reality.


I did not have access to the published article, which is behind the Springer Nature paywall. Instead, I relied on the publicly available preprint.

Parsing the NIH Reform Debate

I was recently alerted to Martin Kulldorff’s Blueprint for NIH Reform — a document that’s stirred some intense reactions among my colleagues. A few view it as a needed critique of systemic inefficiencies. Most regard it as an ideological Trojan horse—an attack on science dressed as reform. So where does the truth lie?

The short answer is: it’s complicated—and the messenger matters.

Kulldorff, once a Harvard professor and biostatistician, became a polarising figure during the COVID-19 pandemic for promoting ideas widely dismissed by the mainstream scientific community, including opposition to lockdowns, masking, and even some aspects of vaccination policy. He was also a co-author of the controversial Great Barrington Declaration, which called for herd immunity through natural infection — a strategy many experts considered unscientific and dangerous at the time.

This background understandably colours how his recent proposals are received.

But here’s the nuance: the Blueprint itself raises a number of ideas that aren’t inherently fringe. Calls for reforming NIH grant structures, enhancing academic freedom, incentivising open science, and streamlining peer review are echoed by many researchers across disciplines — including those with no ties to politicised public health debates. Frustrations with bureaucratic inefficiencies and perverse incentives in scientific funding are real and shared.

Where it becomes tricky is in the framing. Kulldorff doesn’t just argue for reform — he implies that current structures are suppressing truth, and that controversial views (like his own during the pandemic) have been silenced not because they lack merit, but because of groupthink or institutional bias. That framing, for many, crosses the line from constructive critique into undermining the scientific process itself.

There’s also a risk that pushing for more “openness” in what research gets funded — while laudable in theory — could result in resources being diverted to low-evidence, high-noise pursuits. Or, as one colleague aptly put it, “sending the ferret down an empty warren.” Science thrives on curiosity, but it also requires discipline and evidence-based filters.

Venue choice also matters. If this proposal were intended as a serious intervention into science policy, it might have been published in a mainstream medical or policy journal where it could be openly debated across the full spectrum of scientific opinion. Instead, it was published in the Journal of the Academy of Public Health — a platform co-founded and edited by Kulldorff himself, with close ties to politically conservative and contrarian public health figures. That choice raises questions about whether the article is seeking reform through consensus, or carving out space for alternative narratives that have struggled to find support in mainstream science.

So how should we engage with this?

  • Acknowledge the valid points: There is room — and need — for reform in how science is funded, reviewed, and communicated.

  • Be vigilant about context: Not all calls for reform are neutral. Motivations and affiliations matter, especially when public trust is on the line.

  • Defend the integrity of science: We can advocate for better systems without abandoning the core principles of evidence, rigor, and accountability — including fair peer review and a balance of risk and reward.

In the end, this is not a binary question of “pro-science” vs “anti-science.” It’s about how science evolves, who gets to shape that evolution, and what values we prioritise along the way — openness, yes, but always in service of evidence and public good.


This is an independent submission, edited by D.D. Reidpath.

Donald Trump standing on a podium holding a board showing the new tariffs against different countries around the world.

The Great Trade Experiment

Last month I wrote about The Great Foreign Aid Experiment of the Trump administration. Foreign aid has not been without its critics, who charge that it is inefficient, promotes corruption, or forms part of an insidious program of neo-colonialism. The decision, however, by the US Government to put foreign aid “through the wood chipper” sets up a natural experiment to test whether aid saves lives—more precisely, whether the sudden removal of aid ends lives. Most people in global health believe that it will result in significant suffering, although some see a silver lining: deaths among the poor and vulnerable will mark the emergence of independent health systems in low-income countries that are more resilient and finally free of external interference.

Not content with one natural experiment at the expense of the global poor, on the 2nd of April 2025, Donald Trump announced the imposition of the highest rate of tariffs on US imports in almost 100 years. In effect, the government is dismantling the free-trade mechanism that has been operating since the mid-1990s and adopting a more isolationist market posture. Under this new theory of trade, wealth is not created; it is finite, and it is accrued by one country in order to dominate another.

The evidence has been pretty clear about the effects of poverty on health. Poor people are more likely to die than rich ones. Infant, child, and maternal mortality rates are significantly higher among the poor. Preventable and treatable diseases such as HIV, tuberculosis, and malaria also disproportionately infect and kill the poor. These poverty effects occur both within and between countries. Furthermore, they are not just biological outcomes—they are deeply social, economic, and political in nature. The conditions of poverty limit access to healthcare, nutrition, education, and safe living environments.

Over the last 75 years, in parallel with increasing life expectancy across the globe, wealth has also increased. The proportion of people living in extreme poverty today is much lower than it was 50, 20, or even 10 years ago. In fact, the sharpest global decline in extreme poverty occurred between 1995 and 2019; 2020 brought the COVID pandemic, which reversed a wide range of health and economic indicators.

Bill Clinton assumed the presidency of the United States in January 1993. He was supportive of free trade and the Uruguay Round of the General Agreement on Tariffs and Trade (GATT), which was completed in 1994. The successful conclusion of the round led to the creation of the World Trade Organization (WTO) in January 1995.

Following the liberalisation of trade, global extreme poverty rates fell from 36% to 10% between 1995 and 2018. In South and South-East Asia, extreme poverty rates fell from 41% to 10%. In Sub-Saharan Africa, they fell substantially, but without the same speed or depth as elsewhere: from 60% to 37%. The gains of trade liberalisation were also unevenly distributed across markets, particularly benefiting countries with cheap manufacturing capacity such as Bangladesh and Cambodia.

The sudden US reversal on tariffs will be punishing for those poor countries that have developed a manufacturing sector—particularly in shoes and garments—to provide cheap, volume goods based on low labour costs. Of course, the goods in the US need not be cheap, because there is considerable profit in branding.

If exports drop significantly, factories will want to cut staff numbers swiftly to retain their commercial viability. Poor households, particularly those reliant on a single manufacturing income, will likely be thrown backwards into extreme poverty. The global economic gains of the last 30 years could begin to reverse. A major drop in exports will have an immediate impact on the factories’ labour force, but there will be flow-on effects for the entire economies of poor countries. In Bangladesh, for example, garment manufacturing is the single biggest source of export revenue, and reductions here will mean reductions in the national tax revenue that supports health, education, and welfare services.

In other low- and middle-income countries (LMICs) that are less reliant on the global export market, shifts in tariffs will have a concomitantly smaller impact. Thus the two natural experiments will intersect: the impact of foreign aid on health and the impact of foreign trade on health will play out with interacting effects.

Needless to say, none of this was ever framed as an experiment. Cutting aid and raising tariffs was all done to “Make America Great Again”. It is a cruel, indifferent approach to trade and foreign policy. There will be no one in the Situation Room plotting a Kaplan-Meier survival curve. No policymaker will announce that the hypothesis has been confirmed or rejected: that wealth, when withdrawn or walled off, leaves people dead. Nonetheless, the data will tell its own story.

And when it does, it won’t speak in dollars or trade deficits. It will speak in the numbers of anaemic mothers, closed clinics, empty pharmacies, and missed meals. It will speak in children pulled from school to help at home. It will speak in lives shortened not by biology, but by policy.

The Great Trade Experiment, like the Great Aid Experiment, won’t just test theories in global health and economics. It will test people—millions of them. And the results, while statistically significant, will not be ethically neutral. Some experiments happen by accident. Others, by design.

This one was designed—by the President of the United States.