The basic data science pipeline is on its way to becoming an open one. From Open Data, through an open source analysis, and ending up in results released as part of the Creative Commons, every step of data science can be performed openly.
The problems of releasing data openly are being overcome either aggressively, via sites such as Wikileaks, peacefully through movements such as OpenGov and data.gov.uk, or commercially via sites like Infochimps.
The concept of Open Source is now well known. Through programs from sed to Firefox, open source software is a thriving part of the software ecosystem. This is especially important when performing analysis on open data: why should we be trusted if we don’t tell everyone how we analyzed the data?
At the end of the pipeline, the Creative Commons is becoming more mainstream: For example, much of the image content on Flickr is CC licensed. Authors like Cory Doctorow are proving that creative people can build a career around releasing their work in the creative commons. Larry Lessig, in a brilliant interview with Stephen Colbert, shows how value can be added incrementally to a creative work without anyone losing out.
The central part of this pipeline – Open Analysis – has a basic problem: what’s the use of sharing analysis nobody can read or understand? It’s great that people put their analysis online for the world to see, but what’s the point if that analysis is written in dense code that can barely be read by its author?
This is still a problem even when your analysis code is beautifully laid out in a high-level scripting language and well commented. The chances are that the reader who is deeply moved by some statistical analysis of the latest Wikileaks release still can’t read, or critique, the code that generated it.
The technological problems of sharing code are now all but solved: sites like sourceforge and github allow the sharing, dissemination, and tracking of source code. Projects such as CRAN (for R Packages) and MLOSS (Machine Learning Open Source Software) allow the colocation of code, binaries, documentation, and package information, making finding the code an easy job. There have been several attempts at making the code itself easy to read. We’ve got beautiful documentation generators – but these require careful commenting, and all you really end up with are those comments pretty printed – not so great for expressing your modeling assumptions. Another attempt at readable code is Literate Programming, which encourages you to write the code and its documentation all at the same time but, again, is labour intensive. And this, I think, is at the heart of the problem of writing readable code. It’s just plain hard to do.
Who’s got the time to write a whole PDF every time you want to draw a bar chart? We’ve got press deadlines, conference deadlines, and a public attention span measurable in hours. Writing detailed comments is not only time consuming, it’s often a seriously complicated affair akin to writing an academic paper. And, unless we’re actually writing an academic paper, it’s mostly a thankless task.
My contention is this: nobody is going to consistently write readable code, ever. It’s simply too time consuming and the immediate rewards to the coder are negligible. Yet it’s important for others to be able to understand our analysis if they’re making decisions, as citizens or as subjects, based on this analysis. What is to be done?
The answer lies, I think, in convention. The web development community has nailed this with projects like Ruby On Rails and Django. If I’m working within Ruby on Rails, and I name my objects according to convention, then I get a lot of code for free – I actually save time by writing good code. This saving is not a projected – “you’re not going to be able to read that code in 2 years” saving – but an immediate and obvious one. If I abide by the Ruby on Rails structure, then I don’t have to build my databases from scratch. Web forms are automagically generated. My life is made considerably easier and, without trying, my code has a much better structure.
So do we have any data science conventions? My argument is ‘hell yes’: if I don’t abide by some strong data science conventions then I’ll get into well justified trouble. Are the raw data available? Have I made the preprocessing steps clear? Are my data properly normalized? Are my assumptions valid and openly expressed? Has my algorithm converged? Have the functions I have written been unit tested? Have I performed a proper cross validation of the results?
I think that projects like ProjectTemplate, which imposes a useful structure for a project written in R, is a great start. ProjectTemplate treads a fine line: not upsetting those who like to code close to the metal, whilst rewarding those who follow some simple conventions. ProjectTemplate coaxes us into writing well structured projects by saving us time. For example, it currently provides generic helper functions that read and format most common data files placed into the data/ folder, producing a well structured R object with virtually zero effort on the part of the coder.
A lot of code already exists to implement standard data science conventions. From cross validation packages to unit tests, our conventions are already well encapsulated. Collecting these tools together into a data science templating system would allow us to formalize best-practices and help with teaching the ‘carpentry‘ aspects of data science. Most importantly it would allow readers to get a clear view of the analysis, using well documented data science conventions as a navigational tool.
At a recent meeting in NYC a well-known data scientist said something like “is awk and grep the best we can do?” which, though a little incendiary, raised a serious question. Are we really destined, time and time again, to re-create a data science pipeline every time a new data set comes our way? Or could we come to some agreement that there is a set of common procedures that underly all our projects?
So I’m interested in hearing what the data science communities think our conventions are, and then in building these into software like ProjectTemplate. Please leave your ideas in the comments and, by automating these conventions, we can start to build more readable code structures. I’ll report on how these conventions evolve as I go along. Maybe we don’t have to reinvent the wheel over and over again – even if it does mean accepting some loose conventions. In return, we focus on the important aspects of analysis, and everyone else will find it much easier to trust what we have to say.