Careful Statistical Computing: Part 1

Posted: September 29th, 2010 | Filed under: Statistical Computing

by Vince Buffalo

Editors’ Note: Analyzing data at some point comes down to making conclusions. These may be simple conclusions that lead to new hypotheses and future projects, or far-reaching conclusions which lead to further medical trials, public or environmental policy, investment strategies, etc. As data hackers, we must minimize the risk of a conclusion being incorrect due to shoddy programming, the theme of this and coming articles by Vince Buffalo. Note: This article was originally posted on Vince’s personal site. The series will continue here.

A Personal Motivation

This past year I had the amazing opportunity to teach a graduate student some R. On the first day, I had planned on covering the basics of the interactive environment, vectors, and built-in functions. Shortly after beginning, I got off track and told her a personal story – one that defines me as a statistical programmer.

When I was in 6th grade, our class was given the routine middle school assignment to carry out a science experiment and present it at the school’s science fair. I loved science (much thanks to my father, who every Easter purchased me a science text, including Linus Pauling’s General Chemistry when I was 10, Kline’s Mathematics for the Non-Mathematician when I was 11, and so on…), but I had never carried out a full experiment.

I decided to test which brand of athlete’s foot medicine killed the most athlete’s foot fungus cultured on an agar plate. Gross, I know; however, I was determined to do the experiment correctly. The SF Library had some dermatology reference books that helped me identify athlete’s foot, which then allowed me to swab the infected feet of unsuspecting peers. Again, gross, I know.

I cultured the fungi, treated different groups with different medicines, and had a control. Later, I pasted various writeup sections on my poster board and packed up my petri dishes for curious types to see during presentation night. I set up my board, placed my petri dishes around the table, and began to wait for passers-by to ask me questions.

The first person to stop by was intrigued – she quickly identified herself as a dermatologist. I was excited: maybe she had done similar work? As I answered her first question, she began looking at all the petri dishes.

“Are these your cultured petri dishes from the experiment?”
“Yes!” I responded eagerly, “This one here was treated with the generic Walgreens bran-”
“Hmm, actually none of these fungi appear to be the same fungus that causes athlete’s foot.”

My stomach sank. I immediately thought, “…but I identified it – those people had athlete’s foot.”

“Yeah, none of these dishes have it. Oh, wait, this one has a bit of it – see this orange bit here? That’s it.”
“Oh,” my body started heating up and my leg began to quiver.
“Well, good try!” and she walked off.

That was a defining moment. I got it all wrong. No control, accurate measurement, or dedication (hell, I swabbed people’s feet for this project!) matters when something so basic is flawed. I was so distraught that my first encounter with experimental science had ended this way that I sat on the steps of the school for an hour. I thought, “How could I have prevented this?”

Noisy Statistical Computing

My experiment failed because I wasn’t careful in eliminating the early danger of contamination. Furthermore, the control was coincidentally more contaminated than other dishes, making it look like there was a treatment effect.

As I work on increasingly many scientific projects (through my job at the Bioinformatics Core, and helping friends on the side), I still dedicate a lot of thought to this idea of contamination. Statistics is becoming indistinguishable from statistical programming. And as in all programming, contamination is a huge risk; rather than airborne pathogens landing on agar plates, the contaminant of the statistical programmer is the bug hiding in one’s code.

I can’t emphasize enough how important it is to code carefully. A scientist could conduct an experiment perfectly, but in the hands of a rushed or clumsy statistical programmer all their efforts could go to waste. This is what I emphasized to the graduate student I was teaching R: as a student of the R language, you will make errors. You must stack the deck in your favor so that you see them.

If you’re reading this, I imagine you work with, code in, or have some interest in the R language. Do you unit test? Have you ever used a stopifnot() call in your code? How do you ensure what you do is correct? What’s at stake if your code is incorrect?
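For readers who have not used stopifnot() before, here is a minimal sketch of how it might be used. The function name standardize and the particular checks are hypothetical, chosen only for illustration; the point is that the function checks its input and its own result, and stops with an error rather than quietly returning nonsense.

```r
# Hypothetical helper: the name and the specific checks are illustrative only.
standardize <- function(x) {
  # Fail loudly on input that would otherwise produce quiet nonsense.
  stopifnot(is.numeric(x), length(x) > 1, !anyNA(x))

  s <- sd(x)
  stopifnot(s > 0)  # a constant vector cannot be standardized

  z <- (x - mean(x)) / s

  # Check the result too, not just the input.
  stopifnot(abs(mean(z)) < 1e-8, abs(sd(z) - 1) < 1e-8)
  z
}

z <- standardize(c(2, 4, 6, 8))   # passes every check
# standardize(c(1, NA, 3))        # would stop with an error instead of returning NAs
```

Without the first check, a single NA in the input would turn every element of the result into NA, and nothing would complain until much later, if at all.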

None of us can guarantee that our code is 100% correct. But we should all strive to put as many checks in our code as possible. Most of the data I get at work comes from a “nearly opaque box” – either from an outside researcher, or someone in my group upstream of the analysis. This further increases the probability of an error being made. In future posts, I will try to illustrate some of the ways in which we can minimize potentially project-shattering bugs.
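One way to guard against surprises from data that arrives out of an opaque box is to check it before any analysis touches it. The sketch below is only an assumption about what such a check might look like: the function name check_input, the column names (sample_id, group, value), and the allowed group labels are all made up for illustration.

```r
# Hypothetical checks for a data frame handed over from upstream.
# Column names and allowed group labels are invented for this example.
check_input <- function(d) {
  stopifnot(
    is.data.frame(d),
    all(c("sample_id", "group", "value") %in% names(d)),
    nrow(d) > 0,
    !anyDuplicated(d$sample_id),                  # each sample appears once
    is.numeric(d$value),
    !anyNA(d$value),
    all(d$group %in% c("control", "treatment"))   # no unexpected labels
  )
  invisible(d)
}

d <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  group     = c("control", "treatment", "treatment"),
  value     = c(0.8, 1.4, 1.1)
)
check_input(d)  # silent when every check passes
```

A duplicated sample ID or a mistyped group label stops the script at the door, which is much cheaper than finding the problem after the results have been passed along.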