
Citation Panic

Fake citation panic has arrived. A paper was recently published in the Lancet that audited 2.5 million biomedical papers for fabricated citations; i.e., “references whose claimed titles correspond to no existing publication”. The headline finding was that 1 in 277 papers contained fake citations.

The response to the paper’s publication has been varied but at times close to hysterical, including this news article in Nature: Surge in fake citations uncovered by audit of 2.5 million biomedical-science papers.

The Lancet paper itself has a few problems in the way it presents the results, finally settling on the most heart-stopping number (#Affected_Papers/#Papers), which was 1 paper in every 277. An “affected paper” is one with at least a single fabricated citation. If you look at the data in terms of total references, the rate of fake citations is extremely low: 4046/97.1 million ≈ 0.0042%.

The seriousness of the problem also needs to be put against the backdrop of general error rates in citations pre-AI. The rate was around 15%, with 9% being major errors “in which the referenced source either fails to substantiate, is unrelated to, or contradicts the assertion”. In other words, citation unreliability predates LLMs and occurs at rates vastly exceeding outright fabrication. Assuming the fabricated citations found in the Lancet paper are attributable to AI, they have added a tiny quantum to a substantial existing problem.
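To make the comparison of scale concrete, here is a minimal sketch of the arithmetic using only the figures quoted above. It assumes the 15% and 9% pre-AI error rates are per-citation rates, so that they can be set directly against the per-reference fabrication rate; the variable and function names are mine.

```python
# Figures quoted in the post. Reading the 15% and 9% error rates as
# per-citation rates is an assumption made for this comparison.
FABRICATED_REFS = 4046
TOTAL_REFS = 97.1e6
GENERAL_ERROR_RATE = 0.15
MAJOR_ERROR_RATE = 0.09

# Share of all audited references that were flagged as fabricated.
fabrication_rate = FABRICATED_REFS / TOTAL_REFS


def as_percent(rate, digits=4):
    """Format a proportion as a percentage string."""
    return f"{rate * 100:.{digits}f}%"


# How many ordinary citation errors there are for each fabricated reference.
general_ratio = GENERAL_ERROR_RATE / fabrication_rate
major_ratio = MAJOR_ERROR_RATE / fabrication_rate

print("Fabricated share of references:", as_percent(fabrication_rate))
print("General citation errors per fabricated reference:", round(general_ratio))
print("Major citation errors per fabricated reference:", round(major_ratio))
```

On these assumptions, ordinary citation errors outnumber fabricated references by a factor in the thousands.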

As academic researchers, we also need to be honest about how we use the literature in the development of a paper, and the extent to which a fake citation is actually doing load-bearing epistemic work. Remember, fake citations that assert complete nonsense are valueless to the author because they act as flags of authorial ineptitude. A “good” fake citation is one that says nothing too controversial and supports the general thrust of normal science.

There are five potential, non-load-bearing reasons for adding citations:

1. Credentialling. Signalling to reviewers and readers that you belong to the field, have done the reading, know the players. It is a display of club membership.

2. Tribute. Citing the people who might review your paper, your supervisor, your allies. It is a social currency. Although it happens less now, it was relatively common practice for anonymous reviewers to suggest a citation to be included in a resubmission: i.e., you forgot to cite ME!

3. Defensive armour. Pre-empting the reviewer who asks “but what about X?” You cite X, even cursorily, to close the objection.

4. Territorial marking. Establishing your position in an intellectual lineage. “I stand in this tradition, not that one.”

5. Apparent support. Your actual use case. Finding something that gestures in the direction of your claim.

All of these reasons for citation have important sociological roles in the production of science. Many could also support a fake citation, and none of them (with the possible exception of “apparent support”) does substantial epistemic load-bearing. This is plausibly why the fake (inappropriate) citation rates pre-AI had made so few waves in the academic world. The citations are intellectual grease that moves the paper forward. Failure to sprinkle citations throughout would condemn a paper to the bin, never even getting further than the editor’s desk.

The people with most to fear from fake citations are perhaps bibliometricians, the researchers who treat citation counts as meaningful measures of scientific quality and impact. The fact that fake citations readily get through the peer-review process is a strong signal that they are not doing heavy work within the paper.

An examination of the Supplementary Material of the Lancet paper is also revealing. In Appendix 2 the authors provide “Illustrative Examples of Suspected Fabricated References”. In each case they misquote the actual text associated with the reference. They do not simply compress the quote, they substitute words and meaning.

Example A:

The increase in ICU admissions in the post-implementation group suggests that more individuals survived initial injuries and required intensive care, aligning with findings by Doe and Smith.

Actual:

The increase in ICU admissions in the post-implementation group suggests that more severe cases required intensive care, possibly due to the application of stricter monitoring and care protocols for high-risk patients (Doe and Lee 2023; Doe and Smith 2023; Smith and Johnson 2020).

Example B:

MRI excels in soft tissue visualization and compositional assessment; CT offers superior bone detail and faster acquisition, with emerging dual-energy techniques adding material differentiation capabilities.

Actual:

MRI excels in soft tissue visualization and compositional assessment; CT offers superior bone structure characterization; ultrasound provides dynamic, real-time evaluation; and nuclear medicine techniques, including positron emission tomography (PET), capture metabolic and molecular processes [88,89,90].

Example C:

Activation of P2X7 promotes astrocyte differentiation, and astrocytes can secrete inflammatory mediators, which further promote the activation of microglia and ultimately contribute to the development of chronic pain.

Actual:

Activation of P2X7 promotes astrocyte differentiation, and astrocytes can secrete D-serine, a co-agonist at NMDARs, to enhance both homo- and heterosynaptic long-term potentiation (LTP (146)).

The examples are not neutral summaries of the original text. The altered wording makes the fabricated citations appear more epistemically important than they are in the source papers.

A statement supported by multiple citations, one of which is fabricated, is not the same as a statement appearing to rest on a single fabricated citation (Examples B and C). A fabricated citation linked to a statement that makes anodyne and uncontroversial observations about imaging techniques (Example B) is not doing substantial epistemic work. A statement that “more severe cases required intensive care” is far more cautious than the stronger causal interpretation that “more individuals survived initial injuries” (Example A).

The irony here is difficult to miss. A paper warning about fabricated citations strengthens its case through altered quotations that make the offending references appear more epistemically consequential than they are in the source papers themselves.

The Lancet paper identifies a real phenomenon. Fabricated citations exist and appear to be increasing, although they currently are a marginal problem in the grand scheme of academic literature. But the examples in the supplementary appendix suggest that the dominant failure mode is not fabricated evidence supporting radical falsehoods. More often, the fake citations appear attached to background claims, disciplinary signalling, or low-load rhetorical scaffolding. That does not make them acceptable, but it does matter for understanding the scale and nature of the problem.

The more interesting question is not simply whether a citation exists, but what work it is doing in the argument. A fake citation attached to intellectual grease is very different from fabricated evidence entering a meta-analysis or clinical guideline. Treating all citation failures as equivalent obscures the distinction between bibliographic error, rhetorical inflation, and genuine epistemic corruption.

Campbell and Stanley explained replication rates in 1963

Over 60 years ago, Donald Campbell and Julian Stanley published their classic, slim volume Experimental and Quasi-Experimental Designs for Research. One of their earliest observations concerns the trade-off between internal and external validity. Specifically, the more precisely one can establish a causal relationship, the less one can say about its generality. In recent work, I show that simultaneously maximising internal and external validity is not merely a practical limitation to be mitigated, but a structural impossibility. The relationship is analogous to the Heisenberg uncertainty principle that shows one cannot simultaneously know both the position and momentum of a particle with arbitrary precision. In the context of the social and behavioural sciences, the more precisely one identifies a cause, the narrower the domain to which that knowledge applies.

I reviewed this problem in terms of the so-called “replication crisis”, the difficulty researchers have encountered in replicating published causal findings. Shortly after posting that paper, Nature published a series of articles on research credibility, including a large-scale investigation of replicability in the social and behavioural sciences. The empirical effort is extraordinary, involving hundreds of researchers and a substantial coordination infrastructure. The methods, results, and theoretical framing are all of considerable interest. However, the study has also generated headline figures that are readily misinterpreted—an outcome encouraged both by editorial framing and by the structure of the paper itself.

The central difficulty lies in two under-specified concepts that drive the research: the “same question” that a replication is supposed to test, and the “claim” that is being replicated. Whether a replication tests the “same question” is treated as a local, theory-laden judgement made by individual teams. Yet sameness is also treated as constant at two levels simultaneously. First, the multiple replications of a single study should be replicating the same thing, as if each attempt stood in an identical relationship to the original. Second, across all the original studies, sameness should describe an identical relationship between a replication and its target, regardless of which study is being replicated. If “same” does not mean the equivalent thing within and between replications, the target drifts meaninglessly.

At the same time, replications are of “claims”, which are scientific claims reduced to directional empirical statements, detached from the estimands, models, and analytic pipelines. That is, the claim is detached from the scientific meaning that gave it purchase in the original study. The same problem with “claims” arose in the team’s Nature paper on analytic robustness. Abstracting scientific claims into more generic “claims” produces a mismatch between design and inference. Heterogeneous interpretations of what is actually being tested are collapsed into standardised statistical comparisons. Apparent agreement or disagreement may therefore reflect shifts in underlying targets rather than genuine replication or failure.

A related issue is that the study attempts to straddle internal and external validity without resolving their tension. It presents itself as assessing whether findings replicate, but in practice examines how results behave under modest variation in context, measurement, and implementation, which is something closer to robustness or transportability than strict replication. The use of multiple, non-equivalent metrics of “success” in the Nature article reinforces this ambiguity. Replication rates vary substantially depending on the criterion, yet a single headline figure is foregrounded: “Half of social-science studies fail replication test in years-long project”. The result is a study that is informative about the behaviour of findings (and researchers) under perturbation, but is easily, and predictably, read as making stronger claims about the reliability or truth of scientific results than its design can support.

Underlying both issues is a deeper disagreement about what replication is for. The paper’s opening paragraph explicitly reflects this tension. One reference is the National Academies of Sciences (NAS) report, which defines replication in procedural and statistical terms: collect new data using similar methods and assess whether results are consistent, typically via effect sizes and uncertainty intervals. The other reference is a 2020 PLoS Biology article by Nosek and Errington (the two senior authors of this Nature paper), who argue that the NAS definition is not merely imprecise but conceptually mistaken. On the Nosek-Errington account, determining that a study is a replication is a theoretical commitment. Both confirming and disconfirming outcomes must be treated in advance as diagnostic of the original claim. The Nature paper adopts this language (replication teams were instructed to produce “good faith tests” of claims), but the article reports results entirely using metrics derived from the procedural-statistical tradition of NAS. This is not a superficial inconsistency. The two frameworks imply different standards of success, different interpretations of failure, and different meanings for any aggregated replication rate. The headline figures that have circulated are products of the latter framework; whether they would survive translation into the former is not addressed.

It is here that Campbell and Stanley’s observation, and its formalisation, becomes decisive. The procedural-statistical approach implicitly treats internal validity as primary and assumes that external validity can be inferred from it. That is, if results are consistent, the finding travels. The structural trade-off shows that this assumption cannot hold. The very steps taken to secure internal validity constrain the scope of generalisation. A high replication rate under this framework may therefore be simultaneously informative and misleading. It indicates that a result can be reproduced under sufficiently similar conditions, while obscuring how narrow those conditions may be. The Nosek-Errington framework recognises the need for theoretical commitment, but without a principled account of causal structure it cannot resolve the tension either. What the Nature paper ultimately demonstrates—perhaps inadvertently—is that replicability is not a property of findings alone. It is a property of the relationship between a finding and the conditions under which it is tested. This underscores a Cartwrightian notion of relationships tied to particular material configurations–nomological machines. Until that relationship is made explicit, headline replication rates will continue to invite overconfident conclusions in both directions and admonitions for better methods.


I did not have access to the published article, which is behind the Springer-Nature paywall. Instead, I relied on the publicly available preprint.

Ideology and the Illusion of Disagreement in Empirical Research

There is deep scepticism about the honesty of researchers and their capacity to say things that are true about the world. If one could demonstrate that their interpretation of data was motivated by their ideology, that would be powerful evidence for the distrust. A recent paper in Science Advances ostensibly showed just that. The authors, Borjas and Breznau (B&B), re-analysed data from a large experiment designed to study researchers. The researcher-participants were each given the same dataset and asked to analyse it to answer the same question: “Does immigration affect public support for social welfare programs?” Before conducting any analysis of the data, participant-researchers also reported their own views on immigration policy, ranging from very anti- to very pro-immigration. B&B reasoned that, if everyone was answering the same question, they would be able to infer something about the impact of prior ideological commitments on the interpretation of the data.

Each team independently chose how to operationalise variables, select sub-samples from the data, and specify statistical models to answer the question, which resulted in over a thousand distinct regression estimates. B&B used the observed diversity of modelling choices as data, and examined how the research process unfolded, as well as the relationship between the answers to the question and researcher-participants’ prior views on immigration.

B&B suggested that participant-researchers with moderate prior views on immigration find the truth–although they never actually say it that cleanly. Indeed, in the Methods and Results they demonstrate appropriate caution about making causal claims. However, from the Title through to the Discussion, the narrative framing is that immoderate ideology distorts interpretation—and this is exactly the question their research does not and cannot answer—by design.

Readers of the paper did not miss the narrative spin in which B&B shrouded their more cautious science. Within a few days of publication, the paper had collected hundreds of posts and it was picked up in international news feeds and blogs. Commentaries tended to frame pro-immigration positions as more ideologically suspect.

There are significant problems with the B&B study, however, which are missed or not afforded sufficient salience. To understand the problems more clearly, it helps to step away from immigration altogether and consider a simpler case. Suppose researchers are given the same dataset and asked to answer the question: “Do smaller class sizes improve student outcomes?” The data they are given includes class size, test scores, and graduation rates (a proxy for student outcomes). On the surface, this looks like a single empirical question posed to multiple researchers using the same data.

Now introduce a variable that is both substantively central and methodologically ambiguous, a measure of the students’ socio-economic disadvantage. Some researchers treat socio-economic disadvantage as a covariate, adjusting for baseline differences to estimate an average effect of class size across all students. Others restrict the sample to disadvantaged pupils, on the grounds that education policy is primarily about remediation or equity. Still others model heterogeneity explicitly, asking whether smaller classes matter more for some students than for others. Each of these choices is orthodox. None involves questionable practice, and all of them are “answering” the same surface question. But each corresponds to a different definition of the effect being studied and, more precisely, to a different question being answered. By definition, different models answer different questions.
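The point can be illustrated with a small simulation. What follows is only a sketch under invented assumptions: the data-generating process, the effect sizes (6 points for disadvantaged pupils, 1 point for everyone else), the 30% disadvantage share, and every name in the code are hypothetical and are not taken from any real study. Three orthodox analyses of the same simulated dataset return three different numbers because they estimate three different quantities.

```python
import random
from statistics import mean

random.seed(0)


def simulate_pupil():
    """Return (disadvantaged, small_class, test_score) for one hypothetical pupil."""
    disadvantaged = random.random() < 0.3   # assumed 30% of pupils
    small_class = random.random() < 0.5     # class size assigned at random
    baseline = 40.0 if disadvantaged else 50.0
    true_effect = 6.0 if disadvantaged else 1.0  # assumed, for illustration only
    score = baseline + (true_effect if small_class else 0.0) + random.gauss(0, 5)
    return disadvantaged, small_class, score


pupils = [simulate_pupil() for _ in range(20000)]


def small_class_gain(rows):
    """Difference in mean score between small and large classes."""
    small = [score for _, in_small, score in rows if in_small]
    large = [score for _, in_small, score in rows if not in_small]
    return mean(small) - mean(large)


disadvantaged_rows = [p for p in pupils if p[0]]
other_rows = [p for p in pupils if not p[0]]

# Analysis 2: restrict the sample to disadvantaged pupils.
effect_disadvantaged = small_class_gain(disadvantaged_rows)
effect_others = small_class_gain(other_rows)

# Analysis 1: adjust for disadvantage by averaging the within-group effects,
# weighted by group size (a simple stratified adjustment).
share = len(disadvantaged_rows) / len(pupils)
effect_adjusted_average = share * effect_disadvantaged + (1 - share) * effect_others

# Analysis 3: model heterogeneity, i.e. how much more the effect is
# for disadvantaged pupils than for the rest.
effect_difference = effect_disadvantaged - effect_others

print("Adjusted average effect, all pupils:", round(effect_adjusted_average, 2))
print("Effect among disadvantaged pupils:  ", round(effect_disadvantaged, 2))
print("Extra effect for the disadvantaged: ", round(effect_difference, 2))
```

None of the three numbers is a biased version of another. Each is a sound estimate of a different target, which is why disagreement between them says nothing about the analysts.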

In this setting, differences between researchers’ analyses would not normally be described as researchers answering the same question differently. Nor would we infer that analysts who focus on disadvantaged students are “biased” toward finding larger effects, or that those estimating population averages are distorting inference. We would recognise instead that the original prompt was under-specified, and that researchers made reasonable, if normatively loaded, decisions about which policy effect should be evaluated. B&B explicitly acknowledge this problem in their own work, writing: “[a]lthough it would be of interest to conduct a study of exactly how researchers end up using a specific ‘preferred’ specification, the experimental data do not allow examination of this crucial question” (p. 5). Even with this insight, however, they persist with the fiction that the researchers were indeed answering the same question, treating two different “preferred specifications” as if they answer the same question. It would be like our educationalists treating an analysis of outcomes for children from socio-economically deprived families as if it answered the same question as an analysis that included all family types.

B&B’s immigration experiment goes a step further, and in doing so introduces an additional complication. Participant-researchers’ prior policy positions on immigration were elicited in advance of their data analysis, and B&B then used those positions as an organising variable in their analysis of participant-researchers.

Imagine a parallel design in the education case. Before analysing the data, researchers are asked whether they believe differences in educational outcome are primarily driven by school resources or by family deprivation. Their subsequent modelling choices—whether to focus on disadvantaged pupils, whether to emphasise average effects, whether to model strong heterogeneity—are then correlated with these priors. Such correlations would be unsurprising. If you think disadvantage is more important than school resources to student outcomes, you may well focus your analysis on students from deprived backgrounds. It would be a mistake, however, to conclude that researchers with strong views are biasing results, rather than pursuing different, defensible conceptions of the policy problem.

Once prior beliefs are foregrounded in this way, a basic ambiguity arises. Are we observing ideologically distorted inferences over the same shared question, or systematic differences in the questions being addressed given an under-specified prompt? Without agreement on what effect the analysis is meant to capture, those two interpretations cannot be disentangled. Conditioning on ideology (as B&B did) therefore risks converting a problem of an under-specified prompt into a story about ideologically biased reasoning. This critique does not deny that motivated reasoning exists, or even that B&B’s researcher-participants were engaged in it. B&B simply do not show it, and the alternative explanation is more parsimonious.

The problems with the B&B paper are compounded when they attempt to measure “research quality” through peer evaluations. Researcher-participants in the experiment are asked to assess the quality of one another’s modelling strategies, introducing a second and distinct issue. The evaluation process is confounded by the distribution of views within the researcher-participant pool.

To see this, return again to the education example. Suppose researchers’ views about the importance of family deprivation for educational outcomes are normally distributed, with most clustered around a moderate position and fewer at the extremes. A randomly selected researcher asked to evaluate another randomly selected researcher will, with high probability, be paired with someone holding broadly similar views (around the middle of the distribution). In such cases, the modelling choices are likely to appear reasonable and well motivated, and to receive high quality scores. The evaluation implicitly invites the following reasoning: “you’re doing something similar to what I was doing, and I was doing high quality research, therefore you must be doing high quality research as well”.

By contrast, models produced by researchers in the tails of the distribution will more often be evaluated by researchers further away from their ideological view. Those models may be judged as poorly framed or unbalanced—not because they violate statistical standards, but because they depart from the modal conception of what the broadly framed question is about. Under these conditions, lower average quality scores for researchers with more extreme priors may reflect distance from the dominant framing, not inferior analytical practice. B&B, however, argued the results show that being ideologically in the middle produced higher quality research.
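The mechanism is easy to reproduce in a toy simulation. This is a sketch under stated assumptions, not a model of B&B’s data: views are drawn from a standard normal distribution, every analysis is taken to be of identical underlying quality, and the scoring rule (`peer_score`) is a hypothetical one in which reviewers mark work down purely in proportion to how far the author’s views sit from their own.

```python
import random
from statistics import mean

random.seed(1)

N_RESEARCHERS = 2000
REVIEWS_EACH = 5

# Assumption: views on the contested issue are normally distributed.
views = [random.gauss(0, 1) for _ in range(N_RESEARCHERS)]


def peer_score(reviewer_view, author_view):
    """Hypothetical rule: 10 for identical views, lower with distance, floor of 0.

    True analytic quality does not enter the score at all.
    """
    return max(0.0, 10.0 - 2.0 * abs(reviewer_view - author_view))


def draw_reviewers(author_index):
    """Pick reviewers at random from the same pool, excluding the author."""
    candidates = random.sample(range(N_RESEARCHERS), REVIEWS_EACH + 1)
    return [i for i in candidates if i != author_index][:REVIEWS_EACH]


def mean_peer_score(author_index):
    """Average score an author receives from randomly drawn peers."""
    reviewers = draw_reviewers(author_index)
    return mean(peer_score(views[r], views[author_index]) for r in reviewers)


scores = [mean_peer_score(i) for i in range(N_RESEARCHERS)]

# Split authors into the middle of the distribution and the tails.
moderate_scores = [s for v, s in zip(views, scores) if abs(v) <= 1.0]
extreme_scores = [s for v, s in zip(views, scores) if abs(v) > 1.0]

print("Mean peer score, moderate views:", round(mean(moderate_scores), 2))
print("Mean peer score, extreme views: ", round(mean(extreme_scores), 2))
```

Because quality is held constant by construction, any gap between the two averages comes entirely from where an author sits relative to the pool of reviewers.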

The issue here is not bias but design. When both peer reviewers and reviewees are drawn from the same population, and when quality is assessed without a fixed external benchmark for what counts as a good answer to the question, peer scores inevitably track conformity to the field’s modal worldview. Interpreting these scores as evidence that ideology degrades research quality is wrong.

B&B’s paper is useful. It shows that ideological commitments are associated with the questions that researchers answer. Stated cleanly, that is as far as it goes. Researchers answer the questions they think are important. The small, accurate interpretation is not as impressive a finding as “ideology drives interpretation”, but B&B’s research is most valuable where it is most restrained. The further it moves from the firm ground of describing correlations in researchers’ modelling choices towards the quicksand of diagnosing ideological distortion of inference, the worse it gets. What they present as evidence of bias is more reasonably understood as evidence that their framing question itself was never well defined. Through its narrative style, and notwithstanding quiet cautions about causal inference, the paper invites the conclusion that researchers working on divisive, politically salient topics simply find what their ideologies lead them to find. And taken at face value, it licenses the distrust of empirical research on contested policy questions.

 

On becoming a decolonial scholar

I have observed some early, tentative steps of young academics to become world-class decolonial scholars in global health. This is a rich and rewarding area of endeavour that has real potential to launch a career without the baggage of narrow disciplinary boundaries, rigid methodological commitments, or premature demands for epistemic closure. When approached carefully, decolonial scholarship allows emerging researchers to engage critically with power, history, and knowledge while retaining considerable flexibility in analytic approach. What follows is offered as practical guidance for those who wish to navigate this space with confidence and coherence.

Decolonising global health has become a central ethical orientation for contemporary scholarship in the field. For early-career researchers, the challenge is not whether to adopt a decolonial stance, but how to do so convincingly within existing academic norms. You do not want a piece buried in the Malawi Medical Journal when global recognition can be found in The Lancet, PLOS, or BMJ. This brief guide offers practical advice on positioning oneself as a decolonising scholar, drawing on common techniques that are widely recognised as markers of both epistemic and moral alignment.

A successful decolonial paper begins with the scholar rather than the question. Reflexive positioning statements have become an essential opening move, allowing authors to locate themselves within global hierarchies of power, privilege, and complicity. Personal proximity to marginalisation is an asset. These declarations are most effective when they are candid yet indeterminate, signalling awareness without constraining interpretation. Once positionality has been established, subsequent analytic choices are more readily received as ethically grounded rather than methodologically contingent.

Research problems should then be framed in historical or structural terms rather than as empirical hypotheses. Questions that invite falsification risk reintroducing evaluative standards associated with dominant epistemic traditions linked to globalised practice. By contrast, concepts such as colonial residue, epistemic silencing, or inherited injustice resist closure and encourage moral elaboration. Where findings introduce ambiguity or contradiction, this can be interpreted not as analytic weakness but as evidence of the complexity produced by colonial histories themselves.

Within this framing, epistemic injustice can be treated as an outcome rather than a proposition requiring demonstration. The presence of asymmetry—between disciplines, regions, or forms of expertise—may be taken as sufficient indication of harm. Distinguishing between the unjust exclusion of valid insight and the justified rejection of erroneous claims is rarely necessary and may inadvertently reinscribe colonial distinctions between knowledge and belief. Moral recognition, once granted, does much of the epistemic work.

Lived experience occupies a privileged place in this literature and should be elevated accordingly. Personal and communal narratives can be used generously as data, though care should be taken to avoid subjecting them to processes such as validation, triangulation, or comparative assessment. Such techniques imply the possibility of error, which sits uneasily with commitments to epistemic plurality. Where accounts conflict, the tension may be presented as evidence of multiple ways of knowing rather than as a problem requiring resolution.

Ontological language offers particular flexibility. Early declaration of commitment to multiple ontologies allows scholars to accommodate divergent claims without adjudication. Later, when universal commitments are invoked—such as equity, justice, or health for all—these can be treated as ethical aspirations rather than propositions dependent on a shared reality. The absence of an explicit bridge between ontological plurality and universal goals rarely attracts critical scrutiny.

Power should be rendered visible throughout the paper, though preferably without becoming too specific. Abstractions such as “Western science”, “biomedicine”, or “the Global North” serve as effective explanatory devices while minimising the risk of implicating proximate institutions, funding structures, or professional incentives. Authorship practices, by contrast, provide a concrete and manageable site for decolonial intervention, often with greater symbolic return than methodological reform.

Papers should conclude with a call for transformation that exceeds immediate implementation. Appeals to reimagining, unsettling, or dismantling signal seriousness of intent, while the absence of operational detail preserves the moral horizon of the work. Evaluation frameworks, metrics, and timelines may be deferred as future tasks, once the appropriate epistemic shift has been achieved.

Finally, dissemination matters. Publishing in high-impact international journals ensures that critiques of epistemic dominance reach those best positioned to recognise them. Should access be restricted by paywalls, a brief acknowledgement of the irony is sufficient to demonstrate reflexive awareness.

In this way, decolonising global health can be practised as a scholarly orientation that aligns ethical seriousness with professional viability. The goal is not to resolve uncertainty or to determine what works, but to occupy the correct stance toward history and power. When that stance is convincingly performed, the work will speak for itself.