Category Archives: Education

Topics related to the education sector (usually the tertiary or Higher Education sector).

Campbell and Stanley explained replication rates in 1963

Over 60 years ago, Donald Campbell and Julian Stanley published their classic, slim volume Experimental and Quasi-Experimental Designs for Research. One of their earliest observations concerns the trade-off between internal and external validity. Specifically, the more precisely one can establish a causal relationship, the less one can say about its generality. In recent work, I show that simultaneously maximising internal and external validity is not merely a practical limitation to be mitigated, but a structural impossibility. The relationship is analogous to the Heisenberg uncertainty principle that shows one cannot simultaneously know both the position and momentum of a particle with arbitrary precision. In the context of the social and behavioural sciences, the more precisely one identifies a cause, the narrower the domain to which that knowledge applies.

I reviewed this problem in terms of the so-called “replication crisis”, the difficulty researchers have encountered in replicating published causal findings. Shortly after posting that paper, Nature published a series of articles on research credibility, including a large-scale investigation of replicability in the social and behavioural sciences. The empirical effort is extraordinary, involving hundreds of researchers and a substantial coordination infrastructure. The methods, results, and theoretical framing are all of considerable interest. However, the study has also generated headline figures that are readily misinterpreted—an outcome encouraged both by editorial framing and by the structure of the paper itself.

The central difficulty lies in two under-specified concepts that drive the research: the “same question” that a replication is supposed to test, and the “claim” that is being replicated. Whether a replication tests the “same question” is treated as a local, theory-laden judgement made by individual teams. Sameness, however, is treated as constant at two levels simultaneously. First, the multiple replications of a single study should be replicating the same thing, as if each attempt stood in an identical relationship to the original. Second, across all the original studies, the idea of sameness should stand in an identical relationship between a replication and its target, regardless of which study is being replicated. If “same” does not mean the equivalent thing within and between replications, the target drifts meaninglessly.

At the same time, replications are of “claims”, which are scientific claims reduced to directional empirical statements, detached from the estimands, models, and analytic pipelines. That is, the claim is detached from the scientific meaning that gave it purchase in the original study. The same problem with “claims” arose in the team’s Nature paper on analytic robustness. Abstracting scientific claims into more generic “claims” produces a mismatch between design and inference. Heterogeneous interpretations of what is actually being tested are collapsed into standardised statistical comparisons. Apparent agreement or disagreement may therefore reflect shifts in underlying targets rather than genuine replication or failure.

A related issue is that the study attempts to straddle internal and external validity without resolving their tension. It presents itself as assessing whether findings replicate, but in practice examines how results behave under modest variation in context, measurement, and implementation, which is something closer to robustness or transportability than strict replication. The use of multiple, non-equivalent metrics of “success” in the Nature article reinforces this ambiguity. Replication rates vary substantially depending on the criterion, yet a single headline figure is foregrounded: “Half of social-science studies fail replication test in years-long project”. The result is a study that is informative about the behaviour of findings (and researchers) under perturbation, but is easily (and predictably) read as making stronger claims about the reliability or truth of scientific results than its design can support.

Underlying both issues is a deeper disagreement about what replication is for. The paper’s opening paragraph explicitly reflects this tension. One reference is the National Academies of Sciences (NAS) report, which defines replication in procedural and statistical terms: collect new data using similar methods and assess whether results are consistent, typically via effect sizes and uncertainty intervals. The other reference is a 2020 PLoS Biology article by Nosek and Errington (the two senior authors of this Nature paper), who argue that the NAS definition is not merely imprecise but conceptually mistaken. On the Nosek-Errington account, determining that a study is a replication is a theoretical commitment. Both confirming and disconfirming outcomes must be treated in advance as diagnostic of the original claim. The Nature paper adopts this language (replication teams were instructed to produce “good faith tests” of claims), but the article reports results entirely using metrics derived from the procedural-statistical tradition of NAS. This is not a superficial inconsistency. The two frameworks imply different standards of success, different interpretations of failure, and different meanings for any aggregated replication rate. The headline figures that have circulated are products of the procedural-statistical framework; whether they would survive translation into the Nosek-Errington framework is not addressed.

It is here that Campbell and Stanley’s observation, and its formalisation, becomes decisive. The procedural-statistical approach implicitly treats internal validity as primary and assumes that external validity can be inferred from it. That is, if results are consistent, the finding travels. The structural trade-off shows that this assumption cannot hold. The very steps taken to secure internal validity constrain the scope of generalisation. A high replication rate under this framework may therefore be simultaneously informative and misleading. It indicates that a result can be reproduced under sufficiently similar conditions, while obscuring how narrow those conditions may be. The Nosek-Errington framework recognises the need for theoretical commitment, but without a principled account of causal structure it cannot resolve the tension either. What the Nature paper ultimately demonstrates—perhaps inadvertently—is that replicability is not a property of findings alone. It is a property of the relationship between a finding and the conditions under which it is tested. This underscores a Cartwrightian notion of relationships tied to particular material configurations–nomological machines. Until that relationship is made explicit, headline replication rates will continue to invite overconfident conclusions in both directions and admonitions for better methods.

I did not have access to the published article which is behind the Springer-Nature paywall. Instead I relied on the publicly available preprint.

Analytic robustness could be a real problem

A recent article in Nature on the robustness of research findings in the social and behavioural sciences found that only 34% of re-analyses of the data yielded the same result as the original report. This sounds horrible. It sounds like two-thirds of the research that social and behavioural scientists are doing is low quality work, and certainly does not deserve to be published. One might reasonably ask if “confabulist” rather than “scientist” might not be a better job title.

Unfortunately, the edifice of “robust research” has been built on foundations of sand. The research shares many of the weaknesses of another article recently published in Science Advances, which I discuss here. There is little that can be concluded from the research that could actually inform scientific practice or permit any observation about the quality or robustness of the original articles. It does, however, say something of interest for sociologists of science about the diversity of views that researchers have about how to re-analyse data to address conceptual claims.

The procedure followed in the Nature article was described thus.

To explore the robustness of published claims, we selected a key claim from each of our 100 studies, in which the authors provided evidence for a (directional) effect. We presented each empirical claim to at least five analysts along with the original data and asked them to analyse the data to examine the claim, following their best judgement and report only their main result. The analysts were encouraged to analyse those studies where they saw the greatest relevance of their expertise.

The word “claim” here does a lot of work. One might reasonably argue that a scientific claim in a published article is a statement of finding in the context of the hypothesis, the model, the analytic process, and the results. But this is not what is meant here. That full scientific sense of a claim is closer to what the Center for Open Science team use as a starting point for a separate article on “reproducible” research. In the context of this article, a “claim” is some vaguer statement of finding. It is an isolated single claim, has a direction of effect, and critically, is “phrased on a conceptual and not statistical level”.

The conceptual claim is closer to a vernacular claim. It is closer to the kind of thing you might say at a dinner party or read in the popular science section of a magazine. Something like, “did you hear that single female students report lower desired salaries when they think their classmates can see their preferences?” (Claim 025).

Under this framework, one should be able to abstract a full scientific claim into a conceptual claim, and if the conceptual claim is robust, independent scientists analysing the same data, making equally sensible choices about the analysis of the data, will converge on the conceptual claim. The challenge is that your pool of independent and equally sensible scientists needs to agree (without consultation) on how that conceptual claim is to be translated into a scientific claim. A part of the science is deciding on the estimand for testing the claim, but the estimand is fixed by the analytic choice, not by the conceptual claim. If two scientists analyse the same dataset but target different estimands through their analytic choices, they are not converging on the same conceptual claim. Against all logic, an analytic schema that targets a different estimand but nonetheless produces an estimate close to that of the original paper is counted as supporting the robustness of the paper.

The framework, therefore, has a double incoherence. First, divergence of estimates (between the original analysis and re-analysis) is misread as fragility when it may simply reflect different estimands—different scientists sensibly translating the conceptual claim into different scientific claims. Second, and more damaging, convergence is misread as robustness when it may be entirely spurious—two analysts targeting different estimands who happen to produce similar point estimates are not confirming each other. They’re producing agreement by accident, across questions that aren’t the same question.

So the framework is wrong in both directions simultaneously. It penalises legitimate scientific pluralism and rewards numerical coincidence. A study could score as highly robust because several analysts happened to get similar numbers while asking entirely different questions. A study could score as fragile because several analysts made defensible but divergent estimand-constituting choices that led to genuinely different answers to genuinely different questions.
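A toy illustration may make the two failure modes concrete. The sketch below is not the procedure used in the Nature article; it assumes two hypothetical analysts who translate the same conceptual claim ("the treated group scores higher") into different estimands, a difference in means and a difference in medians. The data, the function names, and the agreement tolerance are all invented for illustration.

```python
from statistics import mean, median


def mean_difference(treated, control):
    """Estimand A (hypothetical analyst 1): difference in means."""
    return mean(treated) - mean(control)


def median_difference(treated, control):
    """Estimand B (hypothetical analyst 2): difference in medians."""
    return median(treated) - median(control)


def counts_as_agreement(a, b, tolerance=0.5):
    """Toy agreement rule (an assumption): same sign and within a tolerance."""
    return (a > 0) == (b > 0) and abs(a - b) <= tolerance


# Invented data, for illustration only.
control = [1, 2, 3, 4, 5]
treated_symmetric = [2, 3, 4, 5, 6]   # a uniform shift of the control group
treated_skewed = [2, 3, 4, 5, 26]     # the same shift plus one extreme value

results = {}
for label, treated in [("symmetric", treated_symmetric),
                       ("skewed", treated_skewed)]:
    a = mean_difference(treated, control)
    b = median_difference(treated, control)
    results[label] = (a, b, counts_as_agreement(a, b))
    print(f"{label}: mean diff = {a}, median diff = {b}, "
          f"agree = {results[label][2]}")
```

In the symmetric dataset the two analysts "agree", but only because the shape of the data happens to make two different quantities coincide. In the skewed dataset they "disagree", although each has answered their own question correctly. Neither outcome says anything about the quality of the original analysis.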

There is another and far more interesting reading of this paper, which has neither a click-bait quality nor the opportunity to remonstrate. Where the authors have identified fragility (or a lack of robustness), another could legitimately and positively see vitality and methodological pluralism. The social and behavioural sciences work in the messy space of self-referential agents actively interacting with and changing the environments in which they live and do science. It is hardly surprising that epistemic pluralism is a consequence of this. The 34% figure is not a scandal. It is valuable (under-appreciated) data about the nature of social reality.

I did not have access to the published article which is behind the Springer-Nature paywall. Instead I relied on the publicly available preprint.

On becoming a decolonial scholar

I have observed some early, tentative steps of young academics to become world-class decolonial scholars in global health. This is a rich and rewarding area of endeavour that has real potential to launch a career without the baggage of narrow disciplinary boundaries, rigid methodological commitments, or premature demands for epistemic closure. When approached carefully, decolonial scholarship allows emerging researchers to engage critically with power, history, and knowledge while retaining considerable flexibility in analytic approach. What follows is offered as practical guidance for those who wish to navigate this space with confidence and coherence.

Decolonising global health has become a central ethical orientation for contemporary scholarship in the field. For early-career researchers, the challenge is not whether to adopt a decolonial stance, but how to do so convincingly within existing academic norms. You do not want a piece buried in the Malawi Medical Journal when global recognition can be found in The Lancet, PLOS, or BMJ. This brief guide offers practical advice on positioning oneself as a decolonising scholar, drawing on common techniques that are widely recognised as markers of both epistemic and moral alignment.

A successful decolonial paper begins with the scholar rather than the question. Reflexive positioning statements have become an essential opening move, allowing authors to locate themselves within global hierarchies of power, privilege, and complicity. Personal proximity to marginalisation is an asset. These declarations are most effective when they are candid yet indeterminate, signalling awareness without constraining interpretation. Once positionality has been established, subsequent analytic choices are more readily received as ethically grounded rather than methodologically contingent.

Research problems should then be framed in historical or structural terms rather than as empirical hypotheses. Questions that invite falsification risk reintroducing evaluative standards associated with dominant epistemic traditions linked to globalised practice. By contrast, concepts such as colonial residue, epistemic silencing, or inherited injustice resist closure and encourage moral elaboration. Where findings introduce ambiguity or contradiction, this can be interpreted not as analytic weakness but as evidence of the complexity produced by colonial histories themselves.

Within this framing, epistemic injustice can be treated as an outcome rather than a proposition requiring demonstration. The presence of asymmetry—between disciplines, regions, or forms of expertise—may be taken as sufficient indication of harm. Distinguishing between the unjust exclusion of valid insight and the justified rejection of erroneous claims is rarely necessary and may inadvertently reinscribe colonial distinctions between knowledge and belief. Moral recognition, once granted, does much of the epistemic work.

Lived experience occupies a privileged place in this literature and should be elevated accordingly. Personal and communal narratives can be used generously as data, though care should be taken to avoid subjecting them to processes such as validation, triangulation, or comparative assessment. Such techniques imply the possibility of error, which sits uneasily with commitments to epistemic plurality. Where accounts conflict, the tension may be presented as evidence of multiple ways of knowing rather than as a problem requiring resolution.

Ontological language offers particular flexibility. Early declaration of commitment to multiple ontologies allows scholars to accommodate divergent claims without adjudication. Later, when universal commitments are invoked—such as equity, justice, or health for all—these can be treated as ethical aspirations rather than propositions dependent on a shared reality. The absence of an explicit bridge between ontological plurality and universal goals rarely attracts critical scrutiny.

Power should be rendered visible throughout the paper, though preferably without becoming too specific. Abstractions such as “Western science”, “biomedicine”, or “the Global North” serve as effective explanatory devices while minimising the risk of implicating proximate institutions, funding structures, or professional incentives. Authorship practices, by contrast, provide a concrete and manageable site for decolonial intervention, often with greater symbolic return than methodological reform.

Papers should conclude with a call for transformation that exceeds immediate implementation. Appeals to reimagining, unsettling, or dismantling signal seriousness of intent, while the absence of operational detail preserves the moral horizon of the work. Evaluation frameworks, metrics, and timelines may be deferred as future tasks, once the appropriate epistemic shift has been achieved.

Finally, dissemination matters. Publishing in high-impact international journals ensures that critiques of epistemic dominance reach those best positioned to recognise them. Should access be restricted by paywalls, a brief acknowledgement of the irony is sufficient to demonstrate reflexive awareness.

In this way, decolonising global health can be practised as a scholarly orientation that aligns ethical seriousness with professional viability. The goal is not to resolve uncertainty or to determine what works, but to occupy the correct stance toward history and power. When that stance is convincingly performed, the work will speak for itself.

Building Research Capacity with AI

Over 25 years ago, the “10/90 gap” was used to illustrate the global imbalance in health research. Only 10% of global research benefited the regions where 90% of preventable deaths occurred. Since then, efforts to improve research capacity in low- and middle-income countries (LMICs), where 90% of avoidable deaths occurred, have made important gains; nonetheless, significant challenges remain. A quarter of a century later, there are still too few well-trained researchers in LMICs, and their research infrastructure and governance are also inadequate. The scope of the problem increased dramatically in 2025 when North American and European governments precipitously cut overseas development assistance (ODA, i.e., foreign aid). That aid, however inadequate, supported improvements in research capacity.

Traditional approaches to improving research capacity, such as training workshops and degree scholarship programs, have gone some way to address the expertise challenge. However, they fall short because they are not scalable. The relatively recent introduction of massive open online courses (MOOCs), such as TDR/WHO’s MOOCs in implementation research, goes a long way to overcoming that scalability problem, at least in instruction-based learning. Nonetheless, for many LMIC researchers, major bottlenecks remain because of poor or limited access to mentorship, one-off and quick advice, bespoke training, research assistance, and inter- and intra-disciplinary collaboration. The scalability problem can leave them at a persistent disadvantage compared to their high-income country counterparts. Research is not done well in isolation and ignorance.

The rise of large language model artificial intelligences (LLM-AIs) such as ChatGPT, Mistral, Gemini, Claude, and DeepSeek offers an unprecedented opportunity…and some additional risks. LLM-AIs are advanced AI models trained on vast amounts of text data to understand and generate human-like language. They are flexible, multilingual, and always available (24/7), offering researchers in LMICs immediate access to knowledge and assistance. If used correctly, LLM-AIs could revolutionise approaches to building research capacity and democratise access to skills, knowledge, and global scientific discourse. Many online educational providers already integrate LLM-AIs into their instructional pipelines as tutors and coaches.

Unfortunately, LMICs risk further entrenching or increasing the 10/90 gap if they cannot take advantage of the benefits of LLM-AIs.

AI as a game changer

Researchers in resource-limited settings can access an always-on, massively scalable assistant for the first time. By massively scalable, I mean that every researcher could have one or more decent research assistants, available 24/7, for a monthly subscription of less than $20. They offer scalability and flexibility that traditional human research assistants cannot (and should not) match. However, they are not human and may not fully replicate a human research assistant’s nuanced understanding and critical thinking, and they are certainly less fun to have a cup of coffee with. Furthermore, the effectiveness of LLM-AIs depends on the sophistication of the user, the task complexity, and the quality of input the user provides.

I read a recent post on LinkedIn by a UCLA professor decrying the inadequacies of LLM-AIs. However, a quick read of the post revealed that the professor had no idea how to engage appropriately with the technology.

Unfortunately, like all research assistants, senior researchers, and professors, LLM-AIs can be wrong. As with all tools, one needs to learn how to use them with sophistication.

In spite of any inadequacies, LLM-AIs can remove barriers to research participation by offering tutoring on complex concepts, assisting with literature reviews and data analysis, and supporting the writing and editing of manuscripts and grant proposals.

Reid Hoffman, the AI entrepreneur, described on a podcast how he used LLM-AIs to learn about complex ideas. He would upload a research paper onto the platform and ask, “Explain this paper as if to a 12-year-old”. Hoffman could then “chat” with the LLM-AI about the paper at that level. Once comfortable with the concepts, he would ask the LLM-AI to “explain this paper as if to a high school senior”. By iterating up in age and sophistication, he could use the LLM-AI as a personal tutor.
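For readers who like to see the pattern written down, the ladder of prompts can be sketched in a few lines. This is only an illustration of the iteration described above: it builds the prompt text and does not call any AI service. The first two audiences come from the anecdote; the last two, and the function name, are assumptions added here to show how the ladder might be extended.

```python
def tutoring_prompts(audiences):
    """Return one prompt per audience, ordered from simplest to most advanced."""
    return [f"Explain this paper as if to {audience}." for audience in audiences]


audiences = [
    "a 12-year-old",               # from the anecdote
    "a high school senior",        # from the anecdote
    "a final-year undergraduate",  # assumed extension
    "a doctoral researcher",       # assumed extension
]

prompts = tutoring_prompts(audiences)
for step, prompt in enumerate(prompts, start=1):
    print(f"Step {step}: {prompt}")
```

Each prompt would be pasted into the chat after uploading the paper, moving to the next step only once the current explanation feels comfortable.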

Researchers can also use the LLM-AIs to support the preparation of scientific papers. This is happening already because an explosion of generically dull (and sometimes fraudulent) scientific papers is hitting the market. This explosion has delighted the publishing houses and created existential ennui among the researchers. The problem is not the LLM-AIs—it is in their utilisation, and it will take time for the paper production cycle to settle.

While access to many LLMs requires a monthly subscription, some LLM-AIs, like DeepSeek, significantly lower costs and accessibility barriers by distributing “open weights models”. Researchers can download these open weights models freely and put them on personal or university computer infrastructure without paying a monthly subscription. They make AI-powered research assistance viable for most LMIC research settings, and universities and research institutes can potentially lower the costs further.

LLM-AIs allow researchers in LMICs to become less dependent on high-income countries for training and mentorship, shifting the balance towards scientific self-sufficiency. AI-powered tools could accelerate the development of a new generation of LMIC researchers, fostering homegrown expertise and leadership in relevant global science. They are no longer constrained by the curriculum and interests of high-income countries and can develop contextually relevant research expertise.

The Double-Edged Sword

Despite its positive potential, the entry of LLM-AIs into the research world could have significant downsides. Without careful implementation, existing inequalities could be exacerbated rather than alleviated. High-income countries are already harnessing LLM-AIs at scale, integrating them into research institutions, project pipelines, training, and funding systems. LMICs, lacking the same level of investment and infrastructure, risk being left behind—again. The AI revolution could widen the research gap rather than close it, entrenching the divide between well-resourced and under-resourced institutions.

There is also a danger in how researchers use LLM-AIs. They are the cheapest research assistants ever created, which raises a troubling question: will senior researchers begin to rely on AI to replace the need for training junior scientists? Suppose an LLM-AI can summarise the literature, draft proposals, and assist in the analysis. In that case, there is a real risk that senior researchers will neglect mentorship, training and hands-on learning. Instead of empowering a new generation of LMIC researchers, LLM-AIs could be used as a crutch to maintain existing hierarchies. If institutions see the LLM-AIs as a shortcut to productivity rather than an investment in building research capacity, it could stall the development of genuine human expertise.

Compounding these risks, AI is fallible. LLM-AIs can “hallucinate”, generating false information with complete confidence. They always write with confidence. I’ve never seen one write, “I think this is the answer, but I could be wrong”. They can fabricate references, misinterpret scientific data, and reflect biases embedded in their training data. If used uncritically, they could propagate misinformation and skew research findings.

The challenge of bias is not to be underestimated. LLM-AIs are trained on the corpus of material currently available on the web, reflecting all the biases of the web–who creates the content, what content they create, etc.

Furthermore, while tools like DeepSeek reduce cost barriers, commercial AI models still pose a financial challenge. LMIC institutions will need to negotiate sustainable access to AI tools or risk remaining locked out of their benefits, particularly those of the leading-edge models. The worst outcome would be a scenario where high-income countries (HICs) use AI to accelerate their research dominance while LMICs struggle to afford the very tools that could democratise access.

A Strategic Approach

To ensure LLM-AIs build rather than undermine research capacity in LMICs, they must be integrated strategically and equitably. Training researchers and students in AI literacy is paramount. Knowing how to ask the right questions, validate AI outputs, and integrate results into research workflows is essential. This is not a difficult task, but it takes time and effort, like all learning. The LLM-AIs can help with the task—effectively bootstrapping the learning curve.

Rather than replacing traditional research capacity building, LLM-AIs should be embedded into existing frameworks. MOOCs, mentorship programs, and research fellowships should incorporate LLM-AI-based tutoring, iterative feedback, and language support to enhance—not replace—human mentorship. The focus should be on areas where LLM-AI can offer the greatest immediate impact, such as brainstorming, editing, grant writing support, statistical assistance, and multilingual research dissemination.

Institutions in LMICs should also push for local, ethical LLM-AI development that considers regional needs. This push is easier said than done, particularly in a world of fracturing multilateralism. However, appropriately managed, LLM-AI models can be adapted to recognise and integrate local research priorities rather than merely reinforcing an existing scientific discourse. The fact that a research question is of no interest in high-income countries does not mean it is not critically urgent in an LMIC context.

Finally, securing affordable and sustainable access to AI tools will be essential. Governments, universities, and research institutions must lobby for cost-effective AI licensing models or explore open-source alternatives to prevent another digital divide. Disunited lobbying efforts are weak, but together, across national boundaries, they could have significant power.

An Equity Tipping Point

The LLM-AI revolution is a key juncture for building research capacity in LMICs. Harnessed correctly, LLM-AIs could break down long-standing barriers to participation in science, allowing LMIC researchers to compete on (a more) equal footing. The rise of models like DeepSeek suggests a future where AI is not necessarily a privilege of the few but a democratised resource for the many.

Fair access will not happen automatically. Without deliberate, ethical, and strategic intervention, LLM-AIs could reinforce existing research hierarchies. The key to harvesting the benefits of the technology lies in training researchers, integrating LLM-AIs into programs to build research capacity, and securing equitable access to the tools. Done well, LLM-AIs could be a transformative force, not just in scaling research capacity but in redefining who gets to lead global scientific discovery.

LLM-AIs offer an enormous opportunity. They could either empower LMIC researchers to chart their own scientific futures, or they could become another tool to push them further behind.

Acknowledgment: This blog builds upon insights from a draft concept note developed by me (Daniel D. Reidpath), Lucas Sempe, and Luciana Brondi from the Institute for Global Health and Development (Queen Margaret University, Edinburgh), and Anna Thorson from the TDR Research Capacity Strengthening Unit (WHO, Geneva). Our work on AI-driven research capacity strengthening in LMICs informed much of the discussion presented here.

The original draft concept note is accessible here.