Category Archives: Epidemiology

The study of the causes and distribution of disease. A methodological branch of health sciences.

Guidelines for the reporting of COde Developed to Analyse daTA (CODATA)

I was reviewing an article recently for a journal in which the authors referenced a GitHub repository for the Stata code they had developed to support their analysis. I had a look at the repository. The code was there in a complex hierarchy of nested folders.  Each individual do-file was well commented, but there was no file that described the overall structure, the interlinking of the files, or how to use the code to actually run an analysis.

I have previously published code associated with some of my own analyses. The code for a recent paper on gender bias in clinical case reports was published here, and the code for the Bayesian classification of ethnicity based on names was published here. None of my code had anything like the complexity of the code referenced in the paper I was reviewing. It did get me thinking, however, about how the code for statistical analyses should be written. The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network has 360 separate guidelines for reporting research. These include guidelines for everything from randomised trials and observational studies through to diagnostic studies, economic evaluations and case reports. There is nothing on the reporting of code for the analysis of data.

On the back of the move towards making data available for re-analysis, and the reproducible research movement, it struck me that guidelines for the structuring of code for simultaneous publication with articles would be enormously beneficial.  I started to sketch it out on paper, and write the idea up as an article.  Ideally, I would be able to enrol some others as contributors.  In my head, the code should have good meta-data at the start describing the structure and interrelationship of the files.  I now tend to break my code up into separate files with one file describing the workflow: data importation, data cleaning, setting up factors, analysis.  And then I have separate files for each element of the workflow. My analysis is further divided into specific references to parts of papers. “This code refers to Table 1”.  I write the code this way for two reasons.  It makes it easier for collaborators to pick it up and use it, and I often have a secondary, teaching goal in mind.  If I can write the code nicely, it may persuade others to emulate the idea.  Having said that, I often use fairly unattractive ways to do things, because I don’t know any better; and I sometimes deliberately break an analytic process down into multiple inefficient steps simply to clarify the process — this is the anti-Perl strategy.
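A minimal sketch of that layout, written in Python purely for illustration rather than in Stata or R. The step names, the toy records and the content of "Table 1" are all hypothetical. In a real project each step would sit in its own file, with this workflow file tying them together.

```python
"""Workflow file: describes how the pieces of the analysis fit together.

Steps (each would normally live in a separate file):
  1. import_data      - read the raw data
  2. clean_data       - drop records that cannot be analysed
  3. set_factors      - recode categorical variables
  4. analyse_table_1  - produce the numbers reported in Table 1
"""

# Hypothetical toy records standing in for a real data file.
RAW_ROWS = [
    {"id": 1, "sex": "f", "age": 34},
    {"id": 2, "sex": "m", "age": None},
    {"id": 3, "sex": "m", "age": 51},
]


def import_data():
    """Step 1: data importation (here, copy the toy records)."""
    return [dict(row) for row in RAW_ROWS]


def clean_data(rows):
    """Step 2: data cleaning (drop records with a missing age)."""
    return [row for row in rows if row["age"] is not None]


def set_factors(rows):
    """Step 3: setting up factors (give the sex codes readable labels)."""
    labels = {"f": "Female", "m": "Male"}
    return [{**row, "sex": labels[row["sex"]]} for row in rows]


def analyse_table_1(rows):
    """Step 4: this code refers to Table 1 (counts by sex)."""
    counts = {}
    for row in rows:
        counts[row["sex"]] = counts.get(row["sex"], 0) + 1
    return counts


def run_workflow():
    """Run every step in order, so the whole analysis has one entry point."""
    rows = import_data()
    rows = clean_data(rows)
    rows = set_factors(rows)
    return analyse_table_1(rows)


table_1 = run_workflow()
print(table_1)
```

The steps are deliberately kept separate and a little long-winded, so that a reader can see which record was dropped and where.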

I then started to review the literature and stumbled across a commentary written by Nick Barnes in 2010 in the journal Nature. He has completely persuaded me that my idea is silly.

It is not silly to hope that people will write intelligible, well structured, well commented code for statistical analysis of data. It is not silly to hope that people will include this beautiful code in their papers. The problem with guidelines published by the EQUATOR Network is in the way that journals require authors to comply with them. They become exactly the opposite of guidelines; they are rules, an ironic twist on the observation by Geoffrey Rush’s character, Hector Barbossa, in Pirates of the Caribbean.

Barnes wrote, “I want to share a trade secret with scientists: most professional computer software isn’t very good.”  Most academics/researchers feel embarrassed by their code.  I have collaborated with a very good Software Engineer in some of my work and spent large amounts of time apologising for my code.  We want to be judged for our science, not for our code.  The problem with that sense of embarrassment is that the perfect becomes the enemy of the good.

The Methods sections of most research articles make fairly vague allusions to how the data were actually managed and analysed. One may make references to statistical tests and theoretical distributions. For a reader to move from that to a re-analysis of the data is often not straightforward. The actual code, however, explains exactly what was done. “Ah! You dropped two cases, collapsed two factors, and used a particular version of an algorithm to perform a logistic regression analysis. And now I know why my results don’t quite match yours”.

It would be nice to have an agreed set of guidelines for reporting COde Developed to Analyse daTA (CODATA). It would be great if some authors followed the CODATA guidelines when they published. But it would be even better if everyone published their code, no matter how bad or inefficient it was.


Babies have less than a 1 in 3 chance of recovery from a poor 1 minute Apgar score

We recently completed a study of 272,472 live, singleton, term births without congenital anomalies recorded in the Malaysian National Obstetrics Registry (NOR). We wanted to know what proportion of births had a poor 1 minute Apgar score (<4); and the likelihood that they would recover (Apgar score ≥7) by 5 minutes.

As we noted in the paper:

While the Apgar score at 5 minutes is a better predictor of later outcomes than the Apgar score at 1 minute, there is a necessary temporal process involved, and a neonate must pass through the first minute of life to reach the fifth. Understanding the factors associated with the transition from intrauterine to extrauterine life, particularly for neonates with 1 min Apgar scores <4, has the potential to improve care.

Surprisingly, to me at least, we could find no research looking at that 1 minute to 5 minute transition.  Ours was a first.

From the 270,000+ births, you can see (Figure 1) that the probability of a 5 minute Apgar score ≥7 rises dramatically as the 1 minute Apgar score increases. Between 1 minute Apgar scores of 1 and 6, the chance of a 5 minute Apgar score ≥7 rises in an almost straight line.

Fig 1: The probability (with 95% CI) of an Apgar score at 5 min (≥7) given any Apgar score at 1 minute

A 1 minute Apgar of 6 almost guarantees a 5 minute Apgar score ≥7; in contrast, a 1 minute Apgar of 3 has only a 50% chance of recovery, and a 1 minute Apgar of 1 has less than a 10% chance of recovery.

Fortunately, only 0.6% of births had poor Apgar scores (<4).  The type of delivery (Caesarean section, or vaginal delivery) and the staff conducting the delivery (Doctor or Midwife) were both significantly associated with the chance of recovery.  The challenge is working out the causal order.  Do certain kinds of delivery cause poor recovery, or are babies likely to have poor recovery delivered in particular ways?  Does the training of Doctors or Midwives exacerbate/improve the risks of poor recovery, or are babies likely to have poor recovery delivered by particular personnel?

Our study cannot answer the questions, but it does raise interesting points for future studies of actual labor room practice — questions not easily answered with registry type data.



Zika Causes Birth Defects In 1 In 10 Pregnancies

Well … Not really. But that was the misleading headline of an article I saw in the “healthy living” section of The Huffington Post. I then chased it up to its source, an article published by Reuters journalist Julie Steenhuysen.

There were 3,978,497 births in the US in 2015. Assuming similar numbers in 2017 (and no seasonal variation, which is unlikely), you would be looking at a whopping 400,000 births with a Zika virus-related birth defect. The usual rate of birth defects in the US from all causes is about 3 per 100, so with a cumulative total in excess of three times the current numbers one could anticipate a swift, dramatic (and possibly ineffective) response from the government.
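The arithmetic behind that reading of the headline can be checked in a few lines. This is a rough sketch in Python using only the figures quoted above. Like the paragraph, it treats births as a stand-in for pregnancies, which is an assumption.

```python
births_2015 = 3_978_497   # US births in 2015, as quoted in the text
headline_rate = 1 / 10    # the headline's "1 in 10 pregnancies", read literally
usual_rate = 3 / 100      # usual rate of birth defects from all causes

# Affected births implied by each rate, treating births as pregnancies.
headline_cases = births_2015 * headline_rate
usual_cases = births_2015 * usual_rate

# How many times larger the headline figure is than the usual figure.
ratio = headline_cases / usual_cases

print(f"Headline, read literally: about {headline_cases:,.0f} births")
print(f"Usual rate, all causes:   about {usual_cases:,.0f} births")
print(f"Headline / usual:         {ratio:.1f}")
```

On these numbers the headline figure alone is a little over three times the usual annual count.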

Moving down from the headline, however, a very different story is revealed:

About one in 10 pregnant women with confirmed Zika infections had a fetus or baby with birth defects, offering the clearest picture yet of the risk of Zika infection during pregnancy, U.S. researchers said on Tuesday.

No longer is it 1 in 10 pregnancies. It’s 1 in 10 pregnancies with Zika. The facts are not half as dramatic as the headline. What am I talking about? “Not half as dramatic”? The total number of pregnancies in the US Zika Pregnancy Register for 2016–2017 (on 8 April 2017) was 1,311. Fifty-six of the pregnancies resulted in liveborn infants with birth defects, and 7 of the pregnancies were associated with losses with birth defects. That just doesn’t sound as impressive a number as the headline suggested. Undoubtedly personally tragic, but far from as significant a population health issue.
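For scale, the register counts quoted above can be added up directly. This is a minimal sketch in Python. The share is taken over every pregnancy in the register, which is an assumption made here: the “1 in 10” in the Reuters report refers to confirmed infections, and that count is not given in this post.

```python
register_pregnancies = 1_311   # pregnancies in the register, as quoted
liveborn_with_defects = 56     # liveborn infants with birth defects
losses_with_defects = 7        # pregnancy losses with birth defects

# Total affected pregnancies and their share of the whole register.
affected = liveborn_with_defects + losses_with_defects
share_of_register = affected / register_pregnancies

print(f"Affected pregnancies: {affected}")
print(f"Share of all registered pregnancies: {share_of_register:.1%}")
```

That is 63 affected pregnancies, against the hundreds of thousands implied by a literal reading of the headline.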

Inequality of life expectancy between countries

A colleague of mine recently asked me if I knew of a citation for the narrowing in life expectancy between high-income countries (HICs) and low- and middle-income countries (LMICs).  I didn’t.  But the question did get me thinking.  Was there a narrowing between country-level life expectancy?  Probably … maybe … I didn’t know.

There are some very nice resources on life expectancy. I particularly liked Max Roser‘s post on the Our World in Data website.  None of the things I found, however, seemed to tackle the question of “the narrowing” in quite the way I wanted.  A longer search may have solved the problem, but it seemed just as easy to grab some data and have a look for myself.  While my colleague asked about a narrowing in the life expectancy gap according to the World Bank’s income classification (i.e., between HICs and LMICs), my interest was piqued by the broader question of the inequality in life expectancy between countries.

I decided to use the GapMinder data. For a “quick and dirty” look it suited my purposes: it’s readily available, and the googlesheets R-package makes it trivial to access the data for re-purposing. To simplify things, I calculated the deciles of life expectancy for the available countries in the gapminder data from 1870 to 2016.
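The decile calculation can be sketched as follows. This is an illustration in Python, not the R code used for the analysis. The life expectancies are made-up toy values, not gapminder figures. The choice of `method="inclusive"` is also an assumption, picked because it corresponds to the default quantile definition in R.

```python
from statistics import quantiles

# Toy life expectancies (years) for eleven imaginary countries per year.
# These are NOT gapminder values; they only illustrate the calculation.
toy_life_expectancy = {
    1870: [25, 26, 27, 28, 29, 30, 31, 32, 33, 35, 40],
    2016: [52, 61, 65, 68, 71, 73, 75, 77, 79, 81, 84],
}


def decile_cut_points(values):
    """Return the nine cut points that split the values into ten groups."""
    return quantiles(values, n=10, method="inclusive")


deciles_by_year = {
    year: decile_cut_points(values)
    for year, values in toy_life_expectancy.items()
}

for year, deciles in deciles_by_year.items():
    print(year, deciles)
```

Each year yields nine cut points, one for each of the 1st to 9th deciles. That is why a plot of deciles over time shows nine lines.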

I started with 1870 because in the years prior (from 1800) the gapminder data show nine largely unvarying parallel lines. Around 1870 you can see that the life expectancy of the top (9th decile) countries improves rapidly, moving away from the pack of the lowest performing 90% of countries. The divergence continues until the beginning of World War I, when life expectancy in the 9th decile countries begins to decline as Europe starts to implode. There is a sharp drop in life expectancy in all countries in 1918, marking the appearance of “Spanish Flu”. After 1918 life expectancy in deciles 6-9 starts to improve, taking a dip for World War II; and then after World War II, life expectancy in all the deciles begins to improve. The overall pattern is one of narrow and low life expectancies in 1870, then increasing disparity between the 1st and the 9th deciles, peaking around 1950, followed by a gradual narrowing.

I find it quite difficult to make those kinds of visual comparisons, so I calculated a simple measure of inequality, the difference in years between the life expectancy of the 9th-decile countries and the life expectancy of the 1st-decile countries.

This 9th/1st decile gap (mis-named in the graph titles) in life expectancy is much, much clearer. There is a relatively steady increase in the inequality, peaking around 1950. There is then a steady decline in the inequality until the 1990s, when it increases again, before it begins to decline once more in 2000. The narrowing inequality is thus a relatively recent phenomenon. In 2016 the difference between the life expectancies in countries of the 9th- and the 1st-decile was 20.1 years. In every year prior to 1909, the inequality was even lower. Of course the life expectancies were also much lower. In 2016 the life expectancies were 81.4 (9th-decile) and 61.3 (1st-decile); in 1909 they were 46.2 (9th-decile) and 26.0 (1st-decile).
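The gap itself is a single subtraction per year. A minimal Python sketch, using only the four decile values quoted above (the function and variable names are illustrative):

```python
# 9th- and 1st-decile life expectancies (years), as quoted in the text.
quoted_deciles = {
    2016: {"ninth": 81.4, "first": 61.3},
    1909: {"ninth": 46.2, "first": 26.0},
}


def decile_gap(ninth, first):
    """Difference in years between the 9th and 1st deciles, to one decimal."""
    return round(ninth - first, 1)


gaps = {
    year: decile_gap(values["ninth"], values["first"])
    for year, values in quoted_deciles.items()
}

print(gaps)  # {2016: 20.1, 1909: 20.2}
```

The two gaps are almost identical, even though both deciles sit roughly 35 years higher in 2016 than in 1909.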

The R code for the data extraction and plotting is posted as a “gist” on GitHub.


The data are not without their problems. For one, they are derived from multiple sources (some better than others). Another obvious problem is that a “country” is not static over time. Countries come and go and their borders change. To ask then about the life expectancy of a country is not straightforward. Imagine a country with significant regional disparities in life expectancy, and that country is then divided into two independent states along those same regional lines. Simply by division, an inequality in life expectancy arises. I did not try to account for this, nor to weight the analysis by the population size of the country. On the gapminder site you can find details of the data sources.
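The border example can be made concrete with invented numbers. This Python sketch makes two assumptions: the regional populations and life expectancies are hypothetical, and the national figure is approximated by a population-weighted average of the regions. A real life expectancy would come from life tables.

```python
# Hypothetical regions of one country: (population, life expectancy in years).
north = (6_000_000, 78.0)
south = (4_000_000, 62.0)


def pooled_life_expectancy(*regions):
    """Population-weighted average, a simplification of a true life table."""
    total_population = sum(population for population, _ in regions)
    weighted_sum = sum(population * years for population, years in regions)
    return weighted_sum / total_population


# Before the split: one country, one figure, no between-country gap.
before_split = pooled_life_expectancy(north, south)

# After the split: two countries, and a gap appears between them.
gap_after_split = north[1] - south[1]

print(f"One country:   {before_split:.1f} years")
print(f"Two countries: {north[1]:.1f} and {south[1]:.1f} years "
      f"(gap {gap_after_split:.1f})")
```

Nobody's life expectancy has changed, but a between-country comparison now records a 16-year gap that was previously hidden inside a single figure.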

Finally, the difference between the 9th-decile and the 1st-decile is only one among many ways to measure and understand inequality.