
Low cost reproducible microscopy

There has been growing interest in reproducible research.  The interest arises from the idea that scientific discoveries that are one-off, isolated and never to be repeated have limited value.  For research to inform future science, others must be able to reproduce the results. There are even courses on reproducible research. However, a look at the courses and a quick search of PubMed will reveal that when people refer to reproducible research, they often mean shared data or shared analytic code.   And when I write “shared data”, I don’t really mean any kind of data; I mean electronic data … a spreadsheet, a database, etc.  Of course, the Methodology sections of journal articles are supposed to support reproducible research, but these are often hints and teasers about what was done, rather than a genuine “how to”.

Reproducibility becomes more challenging when all one has to work with is a one-off observation.  How do I show you what I saw?  The question became particularly relevant to me in a recent discussion with a colleague about reproducible microscopy.  The obvious answer is, “a photograph” — and with a professional set-up, one can achieve spectacular results — but what should be done in resource-poor settings where money and equipment are limited?

I am one of the investigators on a Wellcome Trust funded “Our Planet Our Health” award led by Rebekah Brown at the Monash Sustainable Development Institute.  The “Revitalisation of Informal Settlements and their Environment” (RISE) study involves the collection of large quantities of diverse data from informal settlements in Makassar, Indonesia and Suva, Fiji.  Most of the samples will be collected by in-country teams, and they will not have access to high-end equipment.  Nonetheless, some of the samples will have to be examined under a microscope in Makassar and Suva.  It is likely that we will have to rely on basic equipment and Lab Technicians with limited skills in microscope photography.  From my SEACO experience, I would be looking for a low-cost solution that can be implemented with basic training. The solution “works” if the images are appropriate; that is, they capture the thing of scientific interest, and they are of sufficient fidelity that a researcher somewhere else in the world can interpret them appropriately.   It is not necessary for the technician to be brilliant, just adequate.

I decided to play.  I am neither a good photographer nor am I good at microscopy.  I reasoned that if I could get something approximating a reasonable image, then a Lab Technician with some actual training would have no problems.  The best low-cost camera solution is not to buy another piece of equipment at all. We are already committed to using a smartphone/tablet solution in the RISE project and plan to use the devices for capturing photographs, tagging them, and uploading them to a server.  The only challenge was getting the smartphone camera to “peek” into the microscope.  Fortunately, there is a broad range of mounting solutions, and I opted for the very cheapest I could find on eBay.  It cost me US$5.99, brand new, including postage and handling.

The mount is straightforward to use, although my first attempt was pretty awful.  I found a weevil crawling around the kitchen (welcome to the tropics!) and it became the first portrait subject.

A photograph of a weevil taken with a Google phone using a smartphone-microscope adapter

The original images are of much higher resolution than the versions I have posted here.  I didn’t know what I was doing, and most of the image is taken up with the microscope surrounds rather than the subject of the photograph.  I tried again the next day, this time using a peppercorn as the subject — it didn’t move as quickly.

A photograph of a peppercorn taken with a Google phone using a smartphone-microscope adapter

The only real difference in my approach was that this time I zoomed in slightly on the peppercorn.  I will never look at peppercorns the same way.  What appears (to my unqualified eye) to be fungal mycelium is less than appealing.  Nonetheless, it also seems that the general approach to capturing microscope images might be a reasonable one.  As long as the technician knows what to photograph, the quality of the images is almost certainly good enough for others to view and interpret.  This is potentially quite exciting because it allows science (and quite basic science) to be virtual and shared.  A photograph of a microscope image taken in Makassar could be shared with the world within hours, giving scientists anywhere an opportunity to look, think, interpret, question and suggest.


Guidelines for the reporting of COde Developed to Analyse daTA (CODATA)

I was reviewing an article recently for a journal in which the authors referenced a GitHub repository for the Stata code they had developed to support their analysis. I had a look at the repository. The code was there in a complex hierarchy of nested folders.  Each individual do-file was well commented, but there was no file that described the overall structure, the interlinking of the files, or how to use the code to actually run an analysis.

I have previously published code associated with some of my own analyses.  The code for a recent paper on gender bias in clinical case reports was published here, and the code for the Bayesian classification of ethnicity based on names was published here. None of my code had anything like the complexity of the code referenced in the paper I was reviewing.  It did, however, get me thinking about how the code for statistical analyses should be written. The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network has 360 separate guidelines for reporting research.  These include guidelines for everything from randomised trials and observational studies through to diagnostic studies, economic evaluations and case reports. There is nothing, however, on the reporting of code for the analysis of data.

On the back of the move towards making data available for re-analysis, and the reproducible research movement, it struck me that guidelines for structuring code for simultaneous publication with articles would be enormously beneficial.  I started to sketch the idea out on paper and to write it up as an article.  Ideally, I would be able to enrol some others as contributors.  In my head, the code should have good metadata at the start describing the structure and interrelationship of the files.  I now tend to break my code up into separate files, with one file describing the workflow: data importation, data cleaning, setting up factors, analysis.  I then have separate files for each element of the workflow. The analysis code is further divided with specific references to parts of the paper: “This code refers to Table 1”.  I write the code this way for two reasons.  It makes it easier for collaborators to pick it up and use it, and I often have a secondary, teaching goal in mind.  If I can write the code nicely, it may persuade others to emulate the idea.  Having said that, I often use fairly unattractive ways of doing things, because I don’t know any better; and I sometimes deliberately break an analytic process down into multiple inefficient steps simply to clarify the process — this is the anti-Perl strategy.
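To make the structure concrete, here is a minimal sketch of the kind of master do-file I have in mind.  It is written in Stata, since that is what the code I was reviewing used; the file names and the reference to Table 1 are hypothetical and purely illustrative.

    * master.do : overview of the analysis workflow (hypothetical file names)
    * 01_import.do  - read the raw data and save a working dataset
    * 02_clean.do   - data cleaning and derivation of new variables
    * 03_factors.do - set up factors and value labels
    * 04_table1.do  - the analysis reported in Table 1 of the paper

    version 15      // Stata version the code was written against
    clear all

    do "01_import.do"
    do "02_clean.do"
    do "03_factors.do"
    do "04_table1.do"

Run in the folder that holds the raw data and the four files, the master file reproduces the whole workflow from import through to the published table; the commented block at the top is the map that was missing from the repository I was reviewing.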

I then started to review the literature and stumbled across a commentary written by Nick Barnes in 2010 in the journal Nature. He has completely persuaded me that my idea is silly.

It is not silly to hope that people will write intelligible, well-structured, well-commented code for the statistical analysis of data.  It is not silly to hope that people will include this beautiful code with their papers.  The problem with guidelines published by the EQUATOR Network is in the way that journals require authors to comply with them. They become exactly the opposite of guidelines; they become rules — an ironic inversion of the observation by Geoffrey Rush’s character, Hector Barbossa, in Pirates of the Caribbean, that the code is “more what you’d call guidelines than actual rules”.

Barnes wrote, “I want to share a trade secret with scientists: most professional computer software isn’t very good.”  Most academics/researchers feel embarrassed by their code.  I have collaborated with a very good Software Engineer in some of my work and spent large amounts of time apologising for my code.  We want to be judged for our science, not for our code.  The problem with that sense of embarrassment is that the perfect becomes the enemy of the good.

The Methods sections of most research articles make fairly vague allusions to how the data were actually managed and analysed.  They may refer to statistical tests and theoretical distributions.  For a reader to move from that to a re-analysis of the data is often not straightforward.  The actual code, however, explains exactly what was done.  “Ah! You dropped two cases, collapsed two factors, and used a particular version of an algorithm to perform a logistic regression analysis.  And now I know why my results don’t quite match yours.”
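A hypothetical Stata fragment makes the point; the variable names, record identifiers and category codes are invented, but this is the level of detail that code carries and a Methods section rarely does.

    * hypothetical fragment: the decisions a Methods section glosses over
    drop if id == 1101 | id == 2048      // the two excluded cases, identified explicitly
    recode education (3 4 = 3)           // two factor levels collapsed into one
    logit outcome i.education age        // the logistic regression as actually specified

Four lines of code settle questions that a paragraph of prose would leave open.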

It would be nice to have an agreed set of guidelines for the reporting of COde Developed to Analyse daTA (CODATA).  It would be great if some authors followed the CODATA guidelines when they published.  But it would be even better if everyone published their code, no matter how bad or inefficient it was.