Category Archives: Higher Education

Campbell and Stanley explained replication rates in 1963

Over 60 years ago, Donald Campbell and Julian Stanley published their classic, slim volume Experimental and Quasi-Experimental Designs for Research. One of their earliest observations concerns the trade-off between internal and external validity. Specifically, the more precisely one can establish a causal relationship, the less one can say about its generality. In recent work, I show that simultaneously maximising internal and external validity is not merely a practical limitation to be mitigated, but a structural impossibility. The relationship is analogous to the Heisenberg uncertainty principle, which shows that one cannot simultaneously know both the position and momentum of a particle with arbitrary precision. In the context of the social and behavioural sciences, the more precisely one identifies a cause, the narrower the domain to which that knowledge applies.

I reviewed this problem in terms of the so-called “replication crisis”, the difficulty researchers have encountered in replicating published causal findings. Shortly after posting that paper, Nature published a series of articles on research credibility, including a large-scale investigation of replicability in the social and behavioural sciences. The empirical effort is extraordinary, involving hundreds of researchers and a substantial coordination infrastructure. The methods, results, and theoretical framing are all of considerable interest. However, the study has also generated headline figures that are readily misinterpreted—an outcome encouraged both by editorial framing and by the structure of the paper itself.

The central difficulty lies in two under-specified concepts that drive the research: a replication is of the “same question” and of the “claim”. Whether a replication tests the “same question” is treated as a local, theory-laden judgement made by individual teams, yet sameness is assumed constant at two levels simultaneously. First, the multiple replications of a single study should all be replicating the same thing, as if each attempt stood in an identical relationship to the original. Second, across all the original studies, the relationship between a replication and its target should be identical regardless of which study is being replicated. If “same” does not mean the same thing within and between replications, the target drifts meaninglessly.

At the same time, replications are of “claims”, scientific claims reduced to directional empirical statements, detached from the estimands, models, and analytic pipelines. That is, the claim is detached from the scientific meaning that gave it purchase in the original study. The same problem with “claims” arose in the team’s Nature paper on analytic robustness. Abstracting scientific claims into more generic “claims” produces a mismatch between design and inference. Heterogeneous interpretations of what is actually being tested are collapsed into standardised statistical comparisons. Apparent agreement or disagreement may therefore reflect shifts in underlying targets rather than genuine replication or failure.

A related issue is that the study attempts to straddle internal and external validity without resolving their tension. It presents itself as assessing whether findings replicate, but in practice examines how results behave under modest variation in context, measurement, and implementation—something closer to robustness or transportability than strict replication. The use of multiple, non-equivalent metrics of “success” in the Nature article reinforces this ambiguity. Replication rates vary substantially depending on the criterion, yet a single headline figure is foregrounded: “Half of social-science studies fail replication test in years-long project”. The result is a study that is informative about the behaviour of findings (and researchers) under perturbation, but is easily—and predictably—read as making stronger claims about the reliability or truth of scientific results than its design can support.
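The point about non-equivalent metrics can be made concrete with a small simulation. The sketch below is purely illustrative: all numbers are hypothetical and nothing is drawn from the Nature paper. It scores the same set of simulated original/replication pairs against three commonly used success criteria and produces three different “replication rates”.

```python
import numpy as np

# Toy simulation: the same studies, three different "success" criteria.
# All effect sizes and standard errors are hypothetical.
rng = np.random.default_rng(0)

n_studies = 200
true_effects = rng.normal(0.3, 0.15, n_studies)  # heterogeneous true effects
se_orig, se_rep = 0.12, 0.08                     # replications often have larger samples

orig = true_effects + rng.normal(0, se_orig, n_studies)  # original estimates
rep = true_effects + rng.normal(0, se_rep, n_studies)    # replication estimates

z = 1.96  # two-sided 5% criterion

# Criterion 1: replication significant, in the original's direction
sig_same_dir = (np.abs(rep) > z * se_rep) & (np.sign(rep) == np.sign(orig))

# Criterion 2: original estimate falls inside the replication's 95% CI
orig_in_rep_ci = np.abs(orig - rep) < z * se_rep

# Criterion 3: replication estimate falls inside the original's 95% CI
rep_in_orig_ci = np.abs(rep - orig) < z * se_orig

for name, ok in [("significant, same direction", sig_same_dir),
                 ("original inside replication CI", orig_in_rep_ci),
                 ("replication inside original CI", rep_in_orig_ci)]:
    print(f"{name}: {ok.mean():.0%}")
```

Even with no drift in what is being tested, the three criteria disagree, because they measure different relationships between the two estimates. A single headline rate necessarily privileges one of them.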

Underlying both issues is a deeper disagreement about what replication is for. The paper’s opening paragraph explicitly reflects this tension. One reference is the National Academies of Sciences (NAS) report, which defines replication in procedural and statistical terms: collect new data using similar methods and assess whether results are consistent, typically via effect sizes and uncertainty intervals. The other reference is a 2020 PLoS Biology article by Nosek and Errington (the two senior authors of this Nature paper), who argue that the NAS definition is not merely imprecise but conceptually mistaken. On the Nosek-Errington account, determining that a study is a replication is a theoretical commitment. Both confirming and disconfirming outcomes must be treated in advance as diagnostic of the original claim. The Nature paper adopts this language—replication teams were instructed to produce “good faith tests” of claims—but the article reports results entirely using metrics derived from the procedural-statistical tradition of the NAS report. This is not a superficial inconsistency. The two frameworks imply different standards of success, different interpretations of failure, and different meanings for any aggregated replication rate. The headline figures that have circulated are products of the latter framework; whether they would survive translation into the former is not addressed.

It is here that Campbell and Stanley’s observation, and its formalisation, becomes decisive. The procedural-statistical approach implicitly treats internal validity as primary and assumes that external validity can be inferred from it. That is, if results are consistent, the finding travels. The structural trade-off shows that this assumption cannot hold. The very steps taken to secure internal validity constrain the scope of generalisation. A high replication rate under this framework may therefore be simultaneously informative and misleading. It indicates that a result can be reproduced under sufficiently similar conditions, while obscuring how narrow those conditions may be. The Nosek-Errington framework recognises the need for theoretical commitment, but without a principled account of causal structure it cannot resolve the tension either. What the Nature paper ultimately demonstrates—perhaps inadvertently—is that replicability is not a property of findings alone. It is a property of the relationship between a finding and the conditions under which it is tested. This underscores a Cartwrightian notion of relationships tied to particular material configurations, what Cartwright calls nomological machines. Until that relationship is made explicit, headline replication rates will continue to invite overconfident conclusions in both directions and admonitions for better methods.


I did not have access to the published article, which is behind the Springer Nature paywall. Instead, I relied on the publicly available preprint.

A Christmas Story

In the last year of the reign of Biden, there was a ruler in Judea named Benyamin. He was a man of great cunning and greater cruelty.

In those days, Judea, though powerful, was a vassal state. Its strength was created through alliances with distant empires. It wielded its might with a fierce arm and harboured a deep hatred for its neighbours. Benyamin, fearing the loss of his power, sought to destroy the Philistines on that small strip of land called Gaza, and claim it for himself.

For over four hundred and forty days and nights, he commanded his armies to bomb their towns and villages, reducing them to rubble. The Philistines were corralled, trapped within walls and wire, with no escape. Benyamin promised them safety in Rafah and bombed the people there. He offered refuge in Jabalia, and bombed the people there.

In Gaza, there was no safety and there was no food.

Even as leaders wept for the Philistines, they sold weapons to Benyamin and lent him money to prosecute his war. Thus, the world watched in silence as the Philistines endured great suffering. Their cries rose up to heaven, seemingly unanswered.

And so it came to pass, in the last days of the last year of Biden, there was a humble Philistine named Yusouf born of the family of Dawoud. Before the war, Yusouf had been a mechanic. He worked hard each day fixing tires and carburetors, changing brake pads and exhaust systems. And at the end of each day, he would return home to his young wife, Mariam. The same Mariam, you may have heard of her, who was known for her inexhaustible cheerfulness.

That was before the war. Now Mariam was gaunt and tired, and heavy with child.

On the night of the winter solstice, in a dream, a messenger came to Yusouf. “Be not afraid, Yusouf”, the messenger said. “Be not afraid for yourself, for the wife you love so very much, or for your son—who will change the world. What will be, will be and was always meant to be”. Yusouf was troubled by this dream, and found himself torn between wonder, happiness, and fear. Mariam asked him why he looked troubled, but he said nothing and kept his own counsel.

The following night the same messenger visited Mariam in her dreams. Mariam was neither afraid nor troubled. The next morning she had a smile on her face that Yusouf had not seen for so long he had almost forgotten it. “It is time, Yusouf”, she said. “We have to go to the hospital in Beit Lahiya.”

Yusouf was troubled. Long ago he had learned to trust Mariam, but his motorbike had no fuel and it was a long walk. Too far for Mariam, and they were bombing Beit Lahiya. He remembered the words of the messenger in his dreams and he went from neighbour to neighbour. A teaspoon of fuel here, half a cup there. No one demanded payment. If they had any fuel, no one refused him. Having little, they shared what they had. It was the small act of kindness that binds communities. Yusouf wept for their generosity.

When he had gathered enough fuel, he had Mariam climb on the bike. Shadiah, the old sweet seller who had not made a sweet in over a year and could barely remember the smell of honey or rosewater, helped her onto the back.

Yusouf rode carefully. He weaved slowly around potholes and navigated bumps. In spite of his care, he could feel Mariam tense and grip him tighter. And then the motorbike stopped. A last gasping jerk and silence. The fuel was spent.

The late afternoon air was cooling as he helped Mariam walk towards the hospital. When they arrived at the gate, a porter stopped them. “They’re evacuating the hospital. You can’t go in”, the porter told them. Yusouf begged. “My wife, she is going to give birth,” he told the porter—who could plainly see this for himself. The porter looked at Mariam and took pity. “You can’t go in, but there is a small community clinic around the corner. It was bombed recently, but some of it, a room or two, is still standing. I’ll send a midwife.”

Yusouf gently guided Mariam to the clinic. He found an old mattress on a broken gurney and a blanket. He lay it on the floor and settled Mariam.

If there had been a midwife—if she had ever arrived… if she had ever got the porter’s message—she would have been eager to retell the story of the birth. Sharing a coffee, with a date-filled siwa, she would have painted the picture. Mariam’s face was one of grace. Yusouf anxiously held her hand. The baby came quickly, with a minimum of fuss, as if Mariam was having her fifth and not her first.

Yusouf quickly scooped up the baby as it began to vocalise its unhappiness with the shock of a cold Gaza night. He cut the cord crudely but effectively with his pocket knife. And it was only as he was passing the baby to Mariam that he looked confused. He did not have the son he was promised; he had a daughter. The moment was so fleeting that quantum physicists would have struggled to measure its duration, and Yusouf smiled at the messenger’s joke.

Because there was no midwife to witness this moment, we need to account for the witnesses who were present. There was a mangy dog with a limp looking for warmth. He watched patiently and, once the birth was completed, he found a place at Mariam’s feet. There were three rats that crawled out of the rubble looking for scraps. They gave a hopeful sniff of the night air and sat respectfully and companionably on a broken chair. As soon as the moment passed, they disappeared into the crevices afforded by broken brick and torn concrete. Finally, there was an unremarkable cat. In comfortable fellowship, they all watched the moment of birth knowing that, tomorrow or the next day, they would be mortal enemies, but tonight there was peace.

“Nasrin”, Yusouf whispered in Mariam’s ear as he kissed her forehead. “We’ll call her Nasrin.” The wild rose that grows and conquers impossible places.

There was a photojournalist called Weissman, who heard from the porter that there was a very pregnant woman at the clinic. “She’s about to pop”, the porter said. Weissman hurried to the bombed out clinic so that he could bear witness to this miracle in the midst of war.

He missed the birth. And when he arrived, he did not announce his presence. It seemed rude. An intrusion on a very private moment. It did not, however, stop him from taking photos for AAP.

He later shared those images with the world. Yusouf lay on the gurney mattress, propped against a half destroyed wall. Mariam was lying against him, exhausted, eyes closed, covered in a dirty blanket. The baby Nasrin was feeding quietly, just the top of her head with a shock of improbably thick dark hair peeking out. Yusouf stared through the broken roof at the stars in heaven. The blackness of a world without electricity made resplendent. He looked up with wonderment and contentment on his face. He was blessed, he thought. No. They were blessed. The messenger was right.

As Weissman picked his way in the dark towards the hospital gate, where he had last seen the porter, he shared the same hope that he had seen on Yusouf’s face. New life can change things.

The night sky lit up, brightening his path to the hospital. He turned back and was awed by a red flare descending slowly over the remains of the clinic as if announcing a new beginning to the world. A chance for something different was born here today.

The explosion shook the ground and Weissman fell. Cement and brick dust from where the clinic had stood rose sharply into the air. An avalanche of dust raced towards him.

UKRI got its A.I. policy half right


UKRI AI policy: Authors on the left. Assessors on the right (image generated by DALL.E)

When UKRI released its policy on using generative artificial intelligence (A.I.) in funding applications this September, I found myself nodding until I wasn’t. Like many in the research community, I’ve been watching the integration of A.I. tools into academic work with excitement and trepidation. In contrast, UKRI’s approach is a puzzling mix of Byzantine architecture and modern chic.

The modern chic, the half they got right, is on using A.I. in research proposal development. By adopting what amounts to a “don’t ask, don’t tell” policy, they have side-stepped endless debates that swirl about university circles. Do you want to use an A.I. to help structure your proposal? Go ahead. Do you prefer to use it for brainstorming or polishing your prose? That’s fine, too. Maybe you like to write your proposal on blank sheets of paper using an HB pencil. You’re a responsible adult—we’ll trust you, and please don’t tell us about it.

The approach is sensible. It recognises A.I. as just one of the many tools in the researcher’s arsenal. It is no different in principle from grammar-checkers or reference managers. UKRI has avoided creating artificial distinctions between AI-assisted work and “human work” by not requiring disclosure. Such a distinction also becomes increasingly meaningless as A.I. tools integrate into our daily workflows, often completely unknown to us.

Now let’s turn to the Byzantine—the half UKRI got wrong—the part dealing with assessors of grant proposals. And here, UKRI seems to have lost its nerve. The complete prohibition on using A.I. by assessors feels like a policy from a different era—sometime “Before ChatGPT” (B.C.), ChatGPT having been released in November 2022. The B.C. policy fails to recognise the enormous potential of A.I. to support and improve human assessors’ judgment.

You’re a senior researcher who’s agreed to review for UKRI. You have just submitted a proposal using an A.I. to clean, polish and improve the work. As an assessor, you are now juggling multiple complex proposals, each crossing traditional disciplinary boundaries (which is increasingly regarded as a positive). You’re probably doing this alongside your day job because that’s how senior researchers work. Wouldn’t it be helpful to have an A.I. assistant to organise key points, flag potential concerns, help clarify technical concepts outside your immediate expertise, act as a sounding board, or provide an intelligent search of the text?

The current policy says no. Assessors must perform every aspect of the review manually, potentially reducing the time they can spend on a deep evaluation of the proposal. The restriction becomes particularly problematic when considering international reviewers, especially those from the Global South. Many brilliant researchers who could offer valuable perspectives might struggle with English as a second language and miss some nuance without support. A.I. could help bridge this gap, but the policy forbids it.

The dual-use policy leads to an ironic situation. Applicants can use A.I. to write their proposals, but assessors can’t use it to support the evaluation of those proposals. It is like allowing Formula 1 teams to use bleeding-edge technology to design their racing cars while insisting that race officials use an hourglass and the naked eye to monitor the race.

Strategically, the situation worries me. Research funding is a global enterprise; other funding bodies are unlikely to maintain such a conservative stance for long. As other funders integrate A.I. into their assessment processes, they will develop best-practice approaches and more efficient workflows. UKRI will fall behind. This could affect the quality of assessments and UKRI’s ability to attract busy reviewers. Why would a busy senior researcher review for UKRI when other funders value their reviewers’ time and encourage efficiency and quality?

There is a path forward. UKRI could maintain its thoughtful approach to applicants while developing more nuanced guidelines for assessors. One approach would be a policy that clearly outlines appropriate A.I. use cases at different stages of assessment, from initial review to technical clarification to quality control. By adding transparency requirements, proper training, and regular policy reviews, UKRI could lead the way with approaches that both protect research integrity and embrace innovation.

If UKRI is nervous, they could start with a pilot program. Evaluate the impact of AI-assisted assessment. Compare it to a traditional approach. This would provide evidence-based insights for policy development while demonstrating leadership in research governance and funding.

The current policy feels half-baked. UKRI has shown they can craft sophisticated policy around A.I. use. The approach to applicants proves this. They need to extend that same thoughtfulness to the assessment process. The goal is not to use A.I. to replace human judgment but to enhance it. It would allow assessors to focus their expertise where it matters most.

This is about more than efficiency and keeping up with technology. It’s about creating the best possible system for identifying and supporting excellent research. If A.I. is a tool to support this process, we should celebrate. When we help assessors do their job more effectively, we help the entire research community.

The research landscape is changing rapidly. UKRI has taken an important first step in allowing A.I. to support the writing of funding grant applications. Now it’s time for the next one—using A.I. to support funding grant evaluation.