The Iraq War Diary – An Initial Grep

Posted: October 29th, 2010 | Filed under: Data Analysis, Philosophy of Data

Editors’ Note: While data itself is rarely a source of controversy, dataists supports pursuing a data-centric view of conflict. Here, Mike Dewar examines the Wikileaks Iraq war documents with sobering results. Hackers should note the link to his source code and methods at the end of the post.

“…any man’s death diminishes me, because I am involved in mankind, and therefore never send to know for whom the bell tolls; it tolls for thee.” – John Donne, Meditation XVII

The Iraq War logs recently released by Wikileaks are contained in a 371729121-byte CSV file. The file contains 390849 rows and 34 columns. The columns contain dates, locations, reports, category information and counts of the killed and wounded. The events span from 2004-11-06 12:37:00 to 2009-04-23 12:30:00 and are located within the bounding box defined by (22.5,22.4), (49.6,51.8). Row 4 describes a female walking into a crowd and detonating the explosives and ball bearings she was wrapped in, killing 35 and wounding 36. Searching for events mentioning ‘ball bearing’ returns 503 events.

There were 65349 Improvised Explosive Device (IED) explosions between the start of 2004 and the end of 2009. Of these 1794 had one enemy killed in action. The month that saw the highest number of explosions was May of 2007, when Iraq experienced 2080 IED explosions. During this month 693 civilians were killed, 85 enemies were killed and 93 friendlies were killed. The ratio of civilian deaths to combatant deaths is 3.89 civilians per combatant. On the first day of May there were 49 IED explosions in which 3 people were killed.
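The ratio quoted above can be reproduced from the May 2007 counts. Here is a minimal Python sketch of the arithmetic, assuming that “combatant” means enemy plus friendly deaths; the function name is illustrative and not taken from the original code.

```python
def civilians_per_combatant(civilians, enemies, friendlies):
    """Civilian deaths divided by combatant (enemy + friendly) deaths."""
    combatants = enemies + friendlies
    if combatants == 0:
        raise ValueError("no combatant deaths; the ratio is undefined")
    return civilians / combatants


# May 2007 counts as reported in the post.
ratio = civilians_per_combatant(civilians=693, enemies=85, friendlies=93)
print(round(ratio, 2))  # 3.89
```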

IED explosions in Iraq

Location of all IED explosions as reported in the Wikileaks Iraq War Diary

108 different categories are used to categorise all but 6 events. The category with the most events is ‘IED explosion’ with 65439 events, followed by ‘direct fire’ with 57815 events. The category ‘recon threat’ has 1 event, which occurred at 8am on the 17th of April, 2009, when 25 people were noticed with 6 cars in front of a police station in Basra. There are 325 ‘rock throwing’ events and 325 ‘assassination’ events.

There are 1211 mentions of the word ‘robot’, 4710 mentions of the word ‘UAV’, 1332 mentions of the word ‘predator’ and 443 mentions of the word ‘reaper’. The first appearance of one of these keywords is on the 3rd of October, 2006. There are 445 mentions of one or more of the words ‘contractor’, ‘blackwater’ or ‘triple canopy’.
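Keyword counts of this kind amount to a case-insensitive search over the report text. The sketch below shows one way to do it in Python; the column name `Summary` and the sample rows are made up for illustration and are not taken from the Wikileaks file or from the code behind this post.

```python
import csv
import io


def count_events_mentioning(rows, keywords, text_field="Summary"):
    """Count rows whose text field contains at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    count = 0
    for row in rows:
        text = (row.get(text_field) or "").lower()
        if any(k in text for k in lowered):
            count += 1
    return count


# Tiny made-up sample standing in for the real CSV file.
sample = io.StringIO(
    "Date,Summary\n"
    "2006-10-03,UAV observed two vehicles\n"
    "2007-05-01,Contractor convoy struck by IED\n"
    "2007-05-02,Blackwater contractor reported small arms fire\n"
    "2007-05-03,Routine patrol with nothing to report\n"
)
rows = list(csv.DictReader(sample))

print(count_events_mentioning(rows, ["uav", "predator", "reaper", "robot"]))  # 1
print(count_events_mentioning(rows, ["contractor", "blackwater", "triple canopy"]))  # 2
```

Plain substring matching is crude: ‘predator’ would also match ‘predators’, so counts produced this way are best read as approximate.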

drones and contractors in iraq

Density showing the distribution over time of events mentioning contractors and drones

The joint forces report that 108398 people lost their lives in Iraq during 2004-2010. 65650 civilians were killed, 15125 members of the host nation forces were killed, 23857 enemy combatants were killed, and 3766 friendly combatants were killed.
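As a quick sanity check, the four category counts above should add up to the reported total. This is a small Python sketch of that check; the dictionary keys are labels chosen here for readability.

```python
# Death counts by category, as quoted in the post.
deaths = {
    "civilian": 65650,
    "host nation": 15125,
    "enemy": 23857,
    "friendly": 3766,
}

total = sum(deaths.values())
print(total)  # 108398

# Share of each category as a fraction of all reported deaths.
shares = {name: count / total for name, count in deaths.items()}
```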

Deaths in Iraq over time

The number of deaths per month as reported in the Wikileaks Iraq War Diary

Please don’t believe any of this. Go instead to the data and have a look for yourself. All the code that has generated this post is available on GitHub. You can also see what others have been saying; for example, the Guardian and the New York Times have great write-ups.

What’s the use of sharing code nobody can read?

Posted: October 21st, 2010 | Filed under: Philosophy of Data

The basic data science pipeline is on its way to becoming an open one. From Open Data, through an open source analysis, and ending up in results released as part of the Creative Commons, every step of data science can be performed openly.

The problems of releasing data openly are being overcome aggressively, via sites such as Wikileaks; peacefully, through movements such as OpenGov; or commercially, via sites like Infochimps.

The concept of Open Source is now well known. Through programs from sed to Firefox, open source software is a thriving part of the software ecosystem. This is especially important when performing analysis on open data: why should we be trusted if we don’t tell everyone how we analyzed the data?

At the end of the pipeline, the Creative Commons is becoming more mainstream: For example, much of the image content on Flickr is CC licensed. Authors like Cory Doctorow are proving that creative people can build a career around releasing their work in the creative commons. Larry Lessig, in a brilliant interview with Stephen Colbert, shows how value can be added incrementally to a creative work without anyone losing out.

The central part of this pipeline – Open Analysis – has a basic problem: what’s the use of sharing analysis nobody can read or understand? It’s great that people put their analysis online for the world to see, but what’s the point if that analysis is written in dense code that can barely be read by its author?

This is still a problem even when your analysis code is beautifully laid out in a high-level scripting language and well commented. The chances are that the reader who is deeply moved by some statistical analysis of the latest Wikileaks release still can’t read, or critique, the code that generated it.

The technological problems of sharing code are now all but solved: sites like SourceForge and GitHub allow the sharing, dissemination, and tracking of source code. Projects such as CRAN (for R Packages) and MLOSS (Machine Learning Open Source Software) allow the colocation of code, binaries, documentation, and package information, making finding the code an easy job. There have been several attempts at making the code itself easy to read. We’ve got beautiful documentation generators – but these require careful commenting, and all you really end up with are those comments pretty-printed – not so great for expressing your modeling assumptions. Another attempt at readable code is Literate Programming, which encourages you to write the code and its documentation all at the same time but, again, is labour-intensive. And this, I think, is at the heart of the problem of writing readable code. It’s just plain hard to do.

Who’s got the time to write a whole PDF every time you want to draw a bar chart? We’ve got press deadlines, conference deadlines, and a public attention span measurable in hours. Writing detailed comments is not only time consuming, it’s often a seriously complicated affair akin to writing an academic paper. And, unless we’re actually writing an academic paper, it’s mostly a thankless task.

My contention is this: nobody is going to consistently write readable code, ever. It’s simply too time consuming and the immediate rewards to the coder are negligible. Yet it’s important for others to be able to understand our analysis if they’re making decisions, as citizens or as subjects, based on this analysis. What is to be done?

The answer lies, I think, in convention. The web development community has nailed this with projects like Ruby On Rails and Django. If I’m working within Ruby on Rails, and I name my objects according to convention, then I get a lot of code for free – I actually save time by writing good code. This saving is not a projected – “you’re not going to be able to read that code in 2 years” saving – but an immediate and obvious one. If I abide by the Ruby on Rails structure, then I don’t have to build my databases from scratch. Web forms are automagically generated. My life is made considerably easier and, without trying, my code has a much better structure.

So do we have any data science conventions? My argument is ‘hell yes’: if I don’t abide by some strong data science conventions then I’ll get into well justified trouble. Are the raw data available? Have I made the preprocessing steps clear? Are my data properly normalized? Are my assumptions valid and openly expressed? Has my algorithm converged? Have the functions I have written been unit tested? Have I performed a proper cross validation of the results?

I think that projects like ProjectTemplate, which imposes a useful structure on a project written in R, are a great start. ProjectTemplate treads a fine line: not upsetting those who like to code close to the metal, whilst rewarding those who follow some simple conventions. ProjectTemplate coaxes us into writing well structured projects by saving us time. For example, it currently provides generic helper functions that read and format most common data files placed into the data/ folder, producing a well structured R object with virtually zero effort on the part of the coder.

A lot of code already exists to implement standard data science conventions. From cross validation packages to unit tests, our conventions are already well encapsulated. Collecting these tools together into a data science templating system would allow us to formalize best practices and help with teaching the ‘carpentry’ aspects of data science. Most importantly it would allow readers to get a clear view of the analysis, using well documented data science conventions as a navigational tool.

At a recent meeting in NYC a well-known data scientist said something like “is awk and grep the best we can do?” which, though a little incendiary, raised a serious question. Are we really destined, time and time again, to re-create a data science pipeline every time a new data set comes our way? Or could we come to some agreement that there is a set of common procedures that underlie all our projects?

So I’m interested in hearing what the data science communities think our conventions are, and then in building these into software like ProjectTemplate. Please leave your ideas in the comments and, by automating these conventions, we can start to build more readable code structures. I’ll report on how these conventions evolve as I go along. Maybe we don’t have to reinvent the wheel over and over again – even if it does mean accepting some loose conventions. In return, we focus on the important aspects of analysis, and everyone else will find it much easier to trust what we have to say.

What Data Visualization Should Do: Simple Small Truth

Posted: October 14th, 2010 | Filed under: Philosophy of Data

Yesterday the good folks at IA Ventures asked me to lead off the discussion of data visualization at their Big Data Conference. I was rather misplaced among the high-profile venture capitalists and technologists in the room, but I welcome any opportunity to wax philosophical about the power and danger of conveying information visually.

I began my talk by referencing the infamous Afghanistan war PowerPoint slide because I believe it is a great example of spectacularly bad visualization, and of how good intentions can lead to disastrous results. As it turns out, the war in Afghanistan is actually very complicated. Therefore, by attempting to represent that complex problem in its entirety, much more is lost than gained. Sticking with that theme, yesterday I focused on three key things that, I think, data visualization should do:

  1. Make complex things simple
  2. Extract small information from large data
  3. Present truth, do not deceive

The emphasis is added to highlight the goal of all data visualization: to present an audience with simple small truth about whatever the data are measuring. To explore these ideas further, I provided a few examples.

As the Afghanistan war slide illustrates, networks are often the most poorly visualized data. This is frequently because those visualizing network data think it is a good idea to include all nodes and edges in the visualization. This, however, is not making a complex thing simple—rather—this is making a complex thing ugly.

Below is an example of exactly this problem. On the left is a relatively small network (V: ~2,220 and E:~4,400) with weighted edges. I have used edge thickness to illustrate weights, and used a basic force-directed algorithm in Gephi to position the nodes. This is a network hairball, and while it is possible to observe some structural properties in this example, many more subtle aspects of the data are lost in the mess.

Network Slide07.png

On the right are the same data, but I have used information contained in the data to simplify the visualization. First, I performed a k-core analysis to remove all pendants and pendant chains in the data, an extremely useful technique I have mentioned several times before. Next, I used the weighted in-degree of each node as a color scale for the edges, i.e., the darker the blue, the higher the in-degree of the node the edges connect to. Then, I simply dropped the nodes from the visualization entirely. Finally, I added a threshold weight for the edges so that any edges below the threshold are drawn with the lightest blue scale. Using these simple techniques the community structures are much more apparent; and more importantly, the means by which those communities are related are easily identified (note the single central node connecting nearly all communities).
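For readers who have not met k-core pruning, here is a minimal Python sketch of the idea. It assumes an undirected, unweighted edge list and k = 2, which is the value that strips pendants and pendant chains; it is not the procedure or tool used for the figure, and the toy graph is made up.

```python
from collections import defaultdict


def k_core(edges, k=2):
    """Return the edges of the k-core by repeatedly dropping nodes with degree < k."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Start with every node whose degree is already below k.
    to_remove = [node for node, nbrs in neighbors.items() if len(nbrs) < k]
    removed = set()
    while to_remove:
        node = to_remove.pop()
        if node in removed:
            continue
        removed.add(node)
        for other in neighbors[node]:
            neighbors[other].discard(node)
            # Removing a node can push its neighbors below k as well.
            if other not in removed and len(neighbors[other]) < k:
                to_remove.append(other)
        neighbors[node].clear()

    return [(a, b) for a, b in edges
            if a != b and a not in removed and b not in removed]


# A triangle (a, b, c) with a pendant chain c - d - e hanging off it.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
print(k_core(edges, k=2))  # [('a', 'b'), ('b', 'c'), ('c', 'a')]
```

Removing `e` drops `d` to degree one, so `d` goes too; that cascade is what clears whole pendant chains rather than only their tips.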

To discuss the importance of extracting small information from large data I used the visualization of the WikiLeaks Afghanistan War Diaries that I worked on this past summer. The original visualization is on the left, and while many people found it useful, its primary weakness is the inability to distinguish among the various attack types represented on the map. It is clear that activity gradually increased in specific areas over time; however, it is entirely unclear what activity was driving that evolution. A better approach is to focus on one attack type and attempt to glean information from that single dimension.

Original Afghanistan WL viz Only 'Enemy Explosive' events

On the right I have extracted only the ‘Explosive Hazard’ data from the full set and visualized it as before. Now, it is easy to see that the technology of IEDs was a primary force in the war and, as has been observed before, that the main highway in Afghanistan significantly restricted the operations of forces.

Finally, to show the danger of data deception I replicated a chart published at the Monkey Cage a few months ago on the sagging job market for political science professors. On the left is my version of the original chart published at the Monkey Cage. At first glance, the decline in available assistant professorships over time is quite alarming. The steep slope conveys a message of general collapse in the job market. This, however, is not representative of the truth.

Slide10.png Slide11.png

Note that in the visualization on the left the y-axis scale goes from 450 to 700, which happen to be the limits of the y-axis data. Many data visualization tools, including ggplot2, which is used here, will scale their axes by the data limits by default. Often this is desirable, hence the default behavior, but in this case it conveys a dishonest perspective on the job market decline. As you can see from the visualization on the right, by scaling the y-axis from zero the decline is much less dramatic, though still relatively troubling for those of us who will be going on the job market in the not-too-distant future.
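The effect of the axis choice can be put in numbers. The Python sketch below computes how much of the plot height a decline occupies under each scaling. It assumes, for illustration only, that the series falls from 700 to 450 (the post gives these only as the limits of the data), and the function name is hypothetical.

```python
def visible_drop_fraction(start, end, axis_min, axis_max):
    """Fraction of the y-axis height covered by a change from start to end."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    return abs(start - end) / (axis_max - axis_min)


# Axis scaled to the data limits: the decline fills the whole plot.
print(visible_drop_fraction(700, 450, axis_min=450, axis_max=700))  # 1.0

# Axis scaled from zero: the same decline covers about a third of the plot.
print(round(visible_drop_fraction(700, 450, axis_min=0, axis_max=700), 3))  # 0.357
```

The underlying numbers are identical in both cases; only the share of the canvas devoted to them changes, which is why the left chart reads as a collapse.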

These ideas are very straightforward, which is why I think they are so important to consider when doing your own visualizations. Thanks again to IA Ventures for providing me a small soap box in front of such a formidable crowd yesterday. As always, I welcome any comments or criticisms.

Cross-posted at Zero Intelligence Agents

Data Tools Contest Update

Posted: October 10th, 2010 | Filed under: R Explorations

At midnight this morning, Kaggle began accepting submissions for the data hacking contest that we announced on Thursday. Hopefully you’ve used the last few days to build predictions for the test data set. Once you submit your predictions, you’ll be able to see your position on the leaderboard. Good luck!

Using Data Tools to Find Data Tools, the Yo Dawg of Data Hacking

Posted: October 7th, 2010 | Filed under: R Explorations

by John Myles White and Drew Conway

Editors’ Note: One theme likely to recur on dataists is that data hackers love using their tools to analyze, visualize, and predict everything. Data hackers also love discovering and learning about new tools. So it should come as no surprise that Dataist contributors John Myles White and Drew Conway thought to develop a model that can predict which R packages a particular user would like. And in the spirit of friendly competition, they’re opening it up for others to participate!


A graphical visualization of packages’ “suggestion” relationships. Affectionately referred to as the R Flying Spaghetti Monster. More info below.

As part of the kickoff for dataists, we’re announcing a data hacking contest tailored to the statistical computing community. Contestants will build a recommendation engine for R packages. The contest is being administered in collaboration with Kaggle. If you’re interested in the details of the contest, please read on.

By sponsoring this contest, we’re hoping to encourage the data hacking community to use its skills to build a recommendation engine that will help R programmers to find the best packages on CRAN, the standard repository for R libraries. Like many data-driven projects, the question has evolved with the availability of data. We started with the question, “which packages are best?” and replaced it with the empirical question, “which packages are used most often?” This is quite a difficult question to answer as well, because the appropriate data set is neither readily available nor can it be easily acquired. For that reason, we’ve settled on the more manageable question, “which packages are most often installed by normal R users?”

This last question could potentially be answered in a variety of ways. Our current approach uses a convenience sample of installation data that we’ve collected from volunteers in the R community, who kindly agreed to send us a list of the packages they have on their systems. We’ve anonymized this data and compiled a set of metadata-based predictors that allow us to predict the installation probabilities quite well. We’re releasing all of our current work, including the data we have and all of the code we’ve used so far for our exploratory analyses. The contest itself will go live on Kaggle on Sunday and will end four months from Sunday on February 10, 2011. The rules, prizes and official data sets are all described below.

Rules and Prizes

To win the contest, you need to predict the probability that a user U has a package P installed on their system for every pair, (U, P). We’ll assess your performance using ROC methods, which will be evaluated against a held out test data set. The winning team will receive 3 UseR! books of their choosing. In order to win the contest, you’ll have to provide your analysis code to us by creating a fork of our GitHub repository. You’ll also be required to provide a written description of your approach. We’re asking for so much openness from the winning team because we want this contest to serve as a stepping stone for the R community. We’re also hoping that enterprising data hackers will extend the lessons learned through this contest to other programming languages.
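The rules above say only that “ROC methods” will be used. One common summary of a ROC curve is the area under it (AUC), and the sketch below computes that quantity in plain Python as an assumption about the kind of score involved; it is not the official scoring code, and the labels and predictions are made up.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = package installed, 0 = not installed.
installed = [1, 0, 1, 0, 1]
predicted = [0.9, 0.3, 0.6, 0.7, 0.8]
print(round(auc(installed, predicted), 3))  # 0.833
```

This pairwise form is quadratic in the number of rows, so it is only meant to show what the number means: the probability that a randomly chosen installed pair gets a higher score than a randomly chosen uninstalled one.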

Getting Started

To get started, you can go to GitHub to download the primary data sets and code. The sections below describe the data sets that you can download and the baseline model you should try to beat.

Data Sets

For this contest, there are really three data sets. At the start, you’ll want to download the heavily preprocessed data set that we’ll be providing to you through Kaggle. This data set is also available on GitHub, where it is labeled as training_data.csv. This file contains a matrix with roughly 100,000 rows and 16 columns, representing installation information for all existing R packages and 52 users. The test data set against which your performance will be evaluated contains approximately another 30,000 rows.

Each row of this matrix contains the following information:

  1. Package: The name of the current R package.
  2. DependencyCount: The number of other R packages that depend upon the current package.
  3. SuggestionCount: The number of other R packages that suggest the current package.
  4. ImportCount: The number of other R packages that import the current package.
  5. ViewsIncluding: The number of task views on CRAN that include the current package.
  6. CorePackage: A dummy variable indicating whether the current package is part of core R.
  7. RecommendedPackage: A dummy variable indicating whether the current package is a recommended R package.
  8. Maintainer: The name and e-mail address of the package’s maintainer.
  9. PackagesMaintaining: The number of other R packages that are being maintained by the current package’s maintainer.
  10. User: The numeric ID of the current user who may or may not have installed the current package.
  11. Installed: A dummy variable indicating whether the current package was installed by the current user.

In addition to these central predictors, we are including logarithmic transforms of the non-binary predictors as we find that this improves the model’s fit to the full data set. For that reason, the last five columns of our data set are:

  1. LogDependencyCount
  2. LogSuggestionCount
  3. LogImportCount
  4. LogViewsIncluding
  5. LogPackagesMaintaining
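To make the relationship between the count columns and their Log counterparts concrete, here is a hedged Python sketch. The post does not say how zero counts are handled, so the use of log(1 + x) here is an assumption, and the example row is invented.

```python
import math

# The five non-binary predictors named in the column list above.
COUNT_COLUMNS = [
    "DependencyCount",
    "SuggestionCount",
    "ImportCount",
    "ViewsIncluding",
    "PackagesMaintaining",
]


def add_log_columns(row):
    """Return a copy of row with a Log<Name> column for each count column."""
    out = dict(row)
    for name in COUNT_COLUMNS:
        # log1p(x) = log(1 + x); the +1 is an assumption so that zero counts stay finite.
        out["Log" + name] = math.log1p(float(row[name]))
    return out


# Made-up row; the package name and counts are for illustration only.
example = {
    "Package": "examplepkg",
    "DependencyCount": 0,
    "SuggestionCount": 3,
    "ImportCount": 1,
    "ViewsIncluding": 0,
    "PackagesMaintaining": 7,
}
transformed = add_log_columns(example)
print(transformed["LogDependencyCount"])  # 0.0
```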

The Kaggle data set is really the minimal amount of data you should use to build your model. For most users, you’ll quickly want to move on to the raw metadata that we’re providing on GitHub. This second-level data set is contained in several normalized CSV files inside of the data directory:

  1. core.csv: The R and base packages are listed here as core packages.
  2. depends.csv: The full dependency graph for CRAN as of 8/28/2010. An edge between A and B indicates that A depends upon B. For example, ggplot2 depends upon plyr, but plyr does not depend upon ggplot2.
  3. imports.csv: The full import graph for CRAN as of 8/28/2010. An edge between A and B indicates that A imports B.
  4. installations.csv: A list of the packages installed on 52 users’ systems. Each row indicates whether or not user A has installed package B.
  5. maintainers.csv: A list of the current maintainers for each package. We use this instead of the Author field because it is generally easier to parse.
  6. packages.csv: A list of all of the packages contained in CRAN on 8/28/2010.
  7. recommended.csv: A list of the packages recommended for installation by the R Core team.
  8. suggests.csv: The full suggestion graph for CRAN as of 8/28/2010. An edge between A and B indicates that A suggests B.
  9. views.csv: A list of all of the packages indicated in each of the task views on CRAN as of 8/28/2010.
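Predictors such as DependencyCount are in-degree counts over these edge lists. The Python sketch below shows the computation under the stated convention that an edge (A, B) means A depends upon B; only the ggplot2 and plyr edge comes from the description above, and the other package names are placeholders.

```python
from collections import Counter


def in_degree(edges):
    """Count, for each package B, how many edges (A, B) point at it."""
    return Counter(target for _source, target in edges)


# Edge (A, B) means "A depends upon B". The pkgA and pkgB edges are made up.
edges = [("ggplot2", "plyr"), ("pkgA", "plyr"), ("pkgA", "pkgB")]
counts = in_degree(edges)

print(counts["plyr"])     # 2
print(counts["ggplot2"])  # 0
```

A `Counter` returns zero for packages nothing points at, which matches the asymmetry in the example: ggplot2 depends upon plyr, but plyr does not depend upon ggplot2.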

To give you a taste of this richer data set we’re providing, we’ve built a visualization of the suggestions graph found in suggests.csv:

In the graph (above), the package names are sized and colored by in-degree centrality (i.e., larger sized and darker colored nodes have higher centrality), which you can think of as a very rough proxy for importance. If you’re interested in producing similar visualizations of this data, you can use Gephi to produce new graphics like this. To better explore the graph toggle to full-screen mode.

For those interested, we’re also providing the R scripts we used to generate the metadata predictors we’re providing, in case you’d like to use them as examples of how to work with the raw data from CRAN. The relevant scripts are:

  1. extract_graphs.R: Extracts the dependency, import and suggestion graphs from CRAN.
  2. get_maintainers.R: Extracts the package maintainers from CRAN.
  3. get_packages.R: Extracts the names of all of the packages on CRAN.
  4. get_views.rb: Extracts the packages that are contained in each of the task views on CRAN. This program is written in Ruby, not R.

All of the other data sets described earlier were compiled by hand.

Please note that these data sets are normalized, so we are also providing preprocessing scripts that build one large data frame that contains all of the information we’ve used to build our predictive model. The lib/preprocess_data.R script performs the relevant merging and transformation operations, including logarithmic transformations that we’ve found improve predictive accuracy. The result of this merging is the training data set that we’re providing through Kaggle.

For the truly dedicated, you should consider CRAN itself to be the raw data set for this contest. If you want to use predictors beyond those we’re giving you, you’ll want to download a copy of CRAN that you can work with locally. You can do this using the Perl script that we’re providing. To be kind to the CRAN maintainers, this download script sleeps for ten seconds between each step in the spidering process. Obviously you can change this, but please be considerate about the amount of bandwidth you use if you do make changes.
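The pause-between-requests pattern is easy to reproduce in any language. The following Python sketch is not the Perl script mentioned above; it is a hypothetical illustration in which the `fetch` and `sleep` functions are passed in, so the dry run below downloads nothing and waits for nothing.

```python
import time


def polite_fetch_all(urls, fetch, delay_seconds=10.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to limit server load."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait between steps, not before the first one
        results[url] = fetch(url)
    return results


# Dry run with a fake fetch function and a recording sleep.
pauses = []
pages = polite_fetch_all(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: "contents of " + url,
    sleep=pauses.append,
)
print(pauses)  # [10.0, 10.0]
```

Keeping the delay as a parameter makes it easy to change, but the default mirrors the ten-second pause described above.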

Please note: until you are familiar with the preprocessed data sets that we’re providing, we suggest that you do not download a copy of CRAN. For many users, working directly with a raw copy of CRAN will not be efficient.

Closing Remarks

We think this contest can help focus data hackers on an unsolved problem: using our current data tools to help us find the best data tools. We hope you’ll consider participating and even extending this work to new contexts. Happy hacking!

More Info

If you have further questions about this contest, please direct them to John Myles White.

Outlier: Code Quarterly

Posted: October 3rd, 2010 | Filed under: outliers

by Vince Buffalo

Outlier: Something a data hacker keeps their eye on.

Peter Seibel is a hacker extraordinaire. He is the author of the best-selling introduction to Common Lisp, Practical Common Lisp, as well as Coders at Work, a collection of interviews with 15 great programmers. His new project is a “Hackademic Journal” entitled Code Quarterly.

This is something to watch for in the future. The format is excitingly different; less up-to-date buzz, more technical, in-depth pieces. One particularly exciting article type is code reads: actual “guided tours” through beautiful code. Book reviews, Q&A interviews with prominent programmers, and technical explanations of concepts are also to be featured.

Currently Peter is looking for writers and future readers. If you’re interested in writing for Code Quarterly, email him at (duck sound removed) and visit the writing guidelines and some topics they’re looking for. If you’re just interested in following the project, complete the form on Code Quarterly’s website.

The Data Science Venn Diagram

Posted: September 30th, 2010 | Filed under: Philosophy of Data

by Drew Conway

Last Monday I humbly joined a group of NYC’s most sophisticated thinkers on all things data for a half-day unconference to help O’Reilly organize their upcoming Strata conference. The breakout sessions were fantastic, and the number of people in each allowed for outstanding, expert-driven discussions. One of the best sessions I attended focused on issues related to teaching data science, which inevitably led to a discussion on the skills needed to be a fully competent data scientist.

As I have said before, I think the term “data science” is a bit of a misnomer, but I was very hopeful after this discussion; mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits.

What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram.

Data science Venn diagram

How to read the Data Science Venn Diagram

The primary colors of data: hacking skills, math and stats knowledge, and substantive expertise

  • On Monday we spent a lot of time talking about “where” a course on data science might exist at a university. The conversation was largely rhetorical, as everyone was well aware of the inherent interdisciplinary nature of these skills; but then, why have I highlighted these three? First, none is discipline specific, but more importantly, each of these skills is on its own very valuable, but when combined with only one other is at best simply not data science, or at worst downright dangerous.
  • For better or worse, data is a commodity traded electronically; therefore, in order to be in this market you need to speak hacker. This, however, does not require a background in computer science—in fact—many of the most impressive hackers I have met never took a single CS course. Being able to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically; these are the hacking skills that make for a successful data hacker.
  • Once you have acquired and cleaned the data, the next step is to actually extract insight from it. In order to do this, you need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools. This is not to say that a PhD in statistics is required to be a competent data scientist, but it does require knowing what an ordinary least squares regression is and how to interpret it.
  • The third critical piece, substance, is where my thoughts on data science diverge from most of what has already been written on the topic. To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. On the flip-side, substantive expertise plus math and statistics knowledge is where most traditional researchers fall. Doctoral level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology. Part of this is the culture of academia, which does not reward researchers for understanding technology. That said, I have met many young academics and graduate students who are eager to buck that tradition.
  • Finally, a word on the hacking skills plus substantive expertise danger zone. This is where I place people who “know enough to be dangerous,” and it is the most problematic area of the diagram. In this area are people who are perfectly capable of extracting and structuring data, likely related to a field they know quite a bit about, and who probably even know enough R to run a linear regression and report the coefficients; but they lack any understanding of what those coefficients mean. It is from this part of the diagram that the phrase “lies, damned lies, and statistics” emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created. Fortunately, it requires near willful ignorance to acquire hacking skills and substantive expertise without also learning some math and statistics along the way. As such, the danger zone is sparsely populated; however, it does not take many to produce a lot of damage.

I hope this brief illustration has provided some clarity into what data science is and what it takes to get there. Considering these questions at a high level prevents the discussion from degrading into minutiae, such as specific tools or platforms, which I think hurts the conversation.

I am sure I have overlooked many important things, but again the purpose was not to be specific. As always, I welcome any and all comments.

Cross-posted at Zero Intelligence Agents

Careful Statistical Computing: Part 1

Posted: September 29th, 2010 | Author: | Filed under: Statistical Computing | 7 Comments »

by Vince Buffalo

Editors’ Note: Analyzing data at some point comes down to making conclusions. These may be simple conclusions that lead to new hypotheses and future projects, or far-reaching conclusions which lead to further medical trials, public or environmental policy, investment strategies, etc. As data hackers, we must minimize the risk of a conclusion being incorrect due to shoddy programming, the theme of this and coming articles by Vince Buffalo. Note: This article was originally posted on Vince’s personal site. The series will continue here.

A Personal Motivation

This past year I had the amazing opportunity to teach a graduate student some R. On the first day, I had planned on covering the basics of the interactive environment, vectors, and built-in functions. Shortly after beginning, I got off track and told her a personal story – one that defines me as a statistical programmer.

When I was in 6th grade, our class was given the routine middle school assignment to carry out a science experiment and present it at the school’s science fair. I loved science (much thanks to my father, who every Easter purchased me a science text, including Linus Pauling’s General Chemistry when I was 10, Kline’s Mathematics for the Non-Mathematician when I was 11, and so on…), but I had never carried out a full experiment.

I decided to test which brand of athlete’s foot medicine killed the most athlete’s foot fungus cultured in an agar plate. Gross, I know; however, I was determined to do the experiment correctly. The SF Library had some dermatology reference books that would help me identify athlete’s foot, which then allowed me to swab the infected feet of unsuspecting peers. Again, gross, I know.

I cultured the fungi, treated different groups with different medicines, and had a control. Later, I pasted various writeup sections on my poster board and packed up my petri dishes for curious types to see during presentation night. I set up my board, placed my petri dishes around the table, and began to wait for passers-by to ask me questions.

The first person to stop by was intrigued – she quickly identified herself as a dermatologist. I was excited: maybe she had done similar work? As I answered her first question, she began looking at all the petri dishes.

“Are these your cultured petri dishes from the experiment?”
“Yes!” I responded eagerly, “This one here was treated with the generic Walgreens bran-”
“Hmm, actually none of these fungi appear to be the same fungus that causes athlete’s foot”

My stomach sank. I thought immediately “…but I identified it – those people had athlete’s foot”

“Yea, none of these dishes have it. Oh, wait, this one has a bit of it – see this orange bit here? That’s it”
“Oh,” my body started heating up and my leg began to quiver.
“Well, good try!” and she walked off.

That was a defining moment. I got it all wrong. No control, accurate measurement, or dedication (hell, I swabbed people’s feet for this project!) matters when something so basic is flawed. I was so distraught that my first encounter with experimental science had ended this way that I sat on the steps of the school for an hour. I thought, “How could I have prevented this?”

Noisy Statistical Computing

My experiment failed because I wasn’t careful in eliminating the early danger of contamination. Furthermore, the control was coincidentally more contaminated than other dishes, making it look like there was a treatment effect.

As I work on increasingly many scientific projects (through my job at the Bioinformatics Core, and helping friends on the side), I still dedicate a lot of thought to this idea of contamination. Statistics is becoming indistinguishable from statistical programming. And as in all programming, contamination is a huge risk; rather than airborne pathogens landing on agar plates, the contaminant of the statistical programmer is the bug hiding in one’s code.

I can’t emphasize enough how important it is to code carefully. A scientist could conduct an experiment perfectly, but in the hands of a rushed or clumsy statistical programmer all their efforts could go to waste. This is what I emphasized to the graduate student I was teaching R: as a student of the R language, you will make errors. You must stack the deck in your favor so that you see them.

If you’re reading this, I imagine you work with, code in, or have some interest in the R language. Do you unit test? Have you ever used a stopifnot() call in your code? How do you ensure what you do is correct? What’s at stake if your code is incorrect?

None of us can guarantee our code 100%. But we should all strive to put as many checks in our code as possible. Most of the data I get at work comes from a “nearly opaque box” – either from an outside researcher, or someone in my group upstream of the analysis. This further increases the probability of an error being made. In future posts, I will try to illustrate some of the ways in which we can minimize potentially project-shattering bugs.
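To make the idea of putting checks in code concrete, here is a minimal sketch. The post is about R, where stopifnot() plays this role; the sketch uses Python's assert statement as a rough analogue. The function name, the particular checks, and the data are hypothetical and chosen only for illustration.

```python
def mean_treatment_effect(treated, control):
    """Difference in group means, guarded by explicit sanity checks."""
    treated, control = list(treated), list(control)
    # Fail loudly on inputs that would otherwise give a meaningless answer.
    assert len(treated) > 0 and len(control) > 0, "both groups must be non-empty"
    assert all(x is not None for x in treated + control), "missing values present"
    assert all(x >= 0 for x in treated + control), "measurements must be non-negative"
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical measurements from two groups of dishes.
effect = mean_treatment_effect([2.0, 4.0], [1.0, 1.0])
print(effect)
```

Note that Python skips assert statements when run with the -O flag, so checks that must always run are better written as explicit raise statements.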

A Taxonomy of Data Science

Posted: September 25th, 2010 | Author: | Filed under: Philosophy of Data | Tags: , , , | 100 Comments »
by Hilary Mason and Chris Wiggins

Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?

We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to `look at data’. All of these are partially true, so we thought it would be useful to propose one possible taxonomy — we call it the Snice* taxonomy — of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).

Different data scientists have different levels of expertise with each of these 5 areas, but ideally a data scientist should be at home with them all.

We describe each one of these steps briefly below:

  1. Obtain: pointing and clicking does not scale.

    Getting a list of numbers from a paper via PDF or from within your web browser via copy and paste rarely yields sufficient data to learn something `new’ by exploratory or predictive analytics. Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line, e.g., in a UN*X environment. Shell scripting does suffice for many tasks, but we recommend learning a programming or scripting language which can support automating the retrieval of data and add the ability to make calls asynchronously and manage the resulting data. Python is a current favorite at time of writing (Fall 2010). 

    APIs are standard interfaces for accessing web applications, and one should be familiar with how to manipulate them (and even identify hidden, ‘internal’ APIs that may be available but not advertised). Rich actions on web sites often use APIs underneath. You have probably generated thousands of API calls already today without even knowing it! APIs are a two-way street: someone has to have written an API — a syntax — for you to interact with it. Typically one then writes a program which can execute commands to obtain these data in a way which respects this syntax. For example, let’s say we wish to query the NYT archive of stories in bash. Here’s a command-line trick for doing so to find stories about Justin Bieber (and the resulting JSON): Now let’s look for stories with the word ‘data’ in the title, but in python:
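As a rough illustration of writing a program that respects an API's syntax, here is a minimal Python sketch using only the standard library. It is not the original example: the endpoint URL, the parameter names (query, api-key), and the shape of the JSON response (a results list of objects with a title field) are all assumptions made up for illustration, so check the documentation of whichever API you actually use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/search"


def build_query_url(term, api_key, base_url=BASE_URL):
    """Encode the search term and key into a query string."""
    return base_url + "?" + urlencode({"query": term, "api-key": api_key})


def titles_containing(payload, word):
    """Pull titles out of a JSON response, keeping those that mention `word`."""
    results = json.loads(payload).get("results", [])
    return [r["title"] for r in results
            if word.lower() in r.get("title", "").lower()]


def search(term, api_key):
    """Fetch one page of results as text (performs a network request)."""
    with urlopen(build_query_url(term, api_key)) as response:
        return response.read().decode("utf-8")


# Offline demonstration with a made-up response body.
sample = '{"results": [{"title": "Big Data Grows"}, {"title": "Pop News"}]}'
print(build_query_url("data science", "KEY"))
print(titles_containing(sample, "data"))
```

Separating URL construction and response parsing from the network call keeps both pieces easy to test without touching the remote service.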

  2. Scrub: the world is a messy place

    Whether provided by an experimentalist with missing data and inconsistent labels, or via a website with an awkward choice of data formatting, there will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these data is possible. As with Obtaining data, herein a little command line fu and simple scripting can be of great utility. Scrubbing data is the least sexy part of the analysis process, but often one that yields the greatest benefits. A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.

    The most basic form of scrubbing data is just making sure that it’s read cleanly, stripped of extraneous characters, and parsed into a usable format. Unfortunately, many data sets are complex and messy. Imagine that you decide to look at something as simple as the geographic distribution of Twitter users by self-reported location in their profile. Easy, right? Even people living in the same place may use different text to represent it. Values for people who live in New York City contain “New York, NY”, “NYC”, “New York City”, “Manhattan, NY”, and even more fanciful things like “The Big Apple”. This could be an entire blog post (and will be!), but how do you disambiguate it? (Example)

    Sed, awk, and grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest. Additional skills which may come into play are familiarity with databases, including their syntax for representing data (e.g., JSON, above) and for querying databases.
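The location example above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the alias table is hand-built from the variants listed in the text, the canonical label is our own choice, and real self-reported locations would need a far larger table or a proper gazetteer.

```python
import re
from collections import Counter

# Hand-built alias table taken from the examples in the text (an assumption:
# a real table would be much larger).
NYC_ALIASES = {"new york ny", "nyc", "new york city", "manhattan ny", "the big apple"}


def normalize_location(raw):
    """Lowercase, drop punctuation and extra whitespace, then map known aliases."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", raw.lower())
    cleaned = " ".join(cleaned.split())
    return "New York City" if cleaned in NYC_ALIASES else cleaned


def location_counts(raw_locations):
    """Tally locations after normalization."""
    return Counter(normalize_location(loc) for loc in raw_locations)


profiles = ["New York, NY", " NYC ", "The Big Apple!", "Boston, MA"]
print(location_counts(profiles))
```

Normalizing case, punctuation, and whitespace first keeps the alias table small, since each alias only has to be listed once in its cleaned form.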

  3. Explore: You can see a lot by looking

    Visualizing, clustering, performing dimensionality reduction: these are all part of `looking at data.’ These tasks are sometimes described as “exploratory” in that no hypothesis is being tested, no predictions are attempted. Wolfgang Pauli would call these techniques “not even wrong,” though they are hugely useful for getting to know your data. Often such methods inspire predictive analysis methods used later. Tricks to know:

    • more or less (though less is more): Yes, that more and less. You can see a lot by looking at your data. Zoom out if you need to, or use unix’s head to view the first few lines, or awk or cut to view the first few fields or characters.
    • Single-feature histograms visually render the range of single features and their distribution. Since histograms of real-valued data are contingent on choice of binning, we should remember that they are an art project rather than a form of analytics in themselves.
    • Similarly, simple feature-pair scatter plots can often reveal characteristics of the data that you miss when just looking at raw numbers.
    • Dimensionality reduction (MDS, SVD, PCA, PLS etc): Hugely useful for rendering high-dimensional data on the page. In most cases we are performing ‘unsupervised’ dimensionality reduction (as in PCA), in which we find two-dimensional shadows which capture as much variance of the data as possible. Occasionally, low-dimensional regression techniques can provide insight, for example in this review article describing the Netflix Prize which features a scatterplot of movies (Fig. 3) derived from a regression problem in which one wishes to predict users’ movie ratings.
    • Clustering: Unsupervised machine learning techniques for grouping observations; this can include grouping nodes of a graph into “modules” or “communities”, or inferring latent variable assignments in a generative model with latent structure (e.g., Gaussian mixture modeling, or K-means, which can be derived via a limiting case of Gaussian mixture modeling).
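
To see why the histogram tip above warns about binning, here is a small pure-Python sketch that bins the same values two ways. The data and the bin counts are made up for illustration; the helper assumes every value lies between lo and hi.

```python
def histogram(values, n_bins, lo, hi):
    """Count values in n_bins equal-width bins spanning [lo, hi].

    Assumes lo <= v <= hi for every value.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that v == hi lands in the last bin instead of overflowing.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


def render(counts):
    """Text rendering: one row of '#' characters per bin."""
    return "\n".join("#" * c for c in counts)


values = [1, 2, 3, 4, 6, 7, 8, 9]        # made-up data with a gap around 5
two_bins = histogram(values, 2, 0, 10)    # looks flat
three_bins = histogram(values, 3, 0, 10)  # shows a dip in the middle
print(render(two_bins))
print()
print(render(three_bins))
```

With two bins the data look uniform, while three bins suggest a dip in the middle; neither picture is wrong, which is why it pays to try several binnings before reading anything into the shape.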
  4. Models: always bad, sometimes ugly

    Whether in the natural sciences, in engineering, or in data-rich startups, often the ‘best’ model is the most predictive model. E.g., is it `better’ to fit one’s data to a straight line or a fifth-order polynomial? Should one combine a weighted sum of 10 rules or 10,000? One way of framing such questions of model selection is to remember why we build models in the first place: to predict and to interpret. While the latter is difficult to quantify, the former can be framed not only quantitatively but empirically. That is, armed with a corpus of data, one can leave out a fraction of the data (the “validation” data or “test set”), learn/optimize a model using the remaining data (the “learning” data or “training set”) by minimizing a chosen loss function (e.g., squared loss, hinge loss, or exponential loss), and evaluate this or another loss function on the validation data. Comparing the value of this loss function for models of differing complexity yields the model complexity which minimizes generalization error. The above process is sometimes called “empirical estimation of generalization error” but typically goes by its nickname: “cross validation.” Validation does not necessarily mean the model is “right.” As Box warned us, “all models are wrong, but some are useful”. Here, we are choosing from among a set of allowed models (the `hypothesis space’, e.g., the set of 3rd, 4th, and 5th order polynomials) which model complexity maximizes predictive power and is thus the least bad among our choices.
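
The procedure described above can be sketched in its simplest form, a single held-out split rather than repeated folds. The sketch below is pure Python and rests on several assumptions of ours: polynomials are fit by solving the normal equations (fine for low degrees on a small range, not numerically robust in general), the loss is mean squared error, and the synthetic data, seed, and function names are invented for illustration.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients (lowest order first) via the normal equations."""
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def squared_loss(coeffs, xs, ys):
    """Mean squared error of the polynomial on the given points."""
    def predict(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def validation_losses(xs, ys, degrees, holdout_fraction=0.3, seed=0):
    """Fit each degree on the training split; score it on the held-out split."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(len(xs) * holdout_fraction))
    val, train = idx[:n_val], idx[n_val:]
    losses = {}
    for d in degrees:
        coeffs = fit_polynomial([xs[i] for i in train], [ys[i] for i in train], d)
        losses[d] = squared_loss(coeffs, [xs[i] for i in val], [ys[i] for i in val])
    return losses


def best_degree(losses):
    """The 'least bad' complexity: the degree with the smallest validation loss."""
    return min(losses, key=losses.get)


# Synthetic data: a noisy straight line (made up for illustration).
rng = random.Random(42)
xs = [i / 40 for i in range(41)]
ys = [1 + 2 * x + rng.gauss(0, 0.1) for x in xs]
losses = validation_losses(xs, ys, degrees=[1, 3, 5])
for degree, loss in sorted(losses.items()):
    print(degree, loss)
print("chosen degree:", best_degree(losses))
```

A single split gives a noisy estimate of generalization error; averaging over several different splits is the usual next step.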

    Above we mentioned that models are built to predict and to interpret. While the former can be assessed quantitatively (`more predictive’ is `less bad’) the latter is a matter of which is less ugly, and is in the mind of the beholder. Which brings us to…

  5. iNterpret: “The purpose of computing is insight, not numbers.”

    Consider the task of automated digit recognition. The value of an algorithm which can predict ‘4’ and distinguish it from ‘5’ is assessed by its predictive power, not by its theoretical elegance; the goal of machine learning for digit recognition is not to build a theory of ‘3.’ However, in the natural sciences, the ability to predict complex phenomena is different from what most mean by ‘understanding’ or ‘interpreting.’

    The predictive power of a model lies in its ability to generalize in the quantitative sense: to make accurate quantitative predictions of data in new experiments. The interpretability of a model lies in its ability to generalize in the qualitative sense: to suggest to the modeler which would be the most interesting experiments to perform next.

    The world rarely hands us numbers; more often the world hands us clickstreams, text, graphs, or images. Interpretable modeling in data science begins with choosing a natural set of input features — e.g., choosing a representation of text in terms of a bag-of-words, rather than bag-of-letters; choosing a representation of a graph in terms of subgraphs rather than the spectrum of the Laplacian. In this step, domain expertise and intuition can be more important than technical or coding expertise. Next one chooses a hypothesis space, e.g., linear combinations of these features vs. exponentiated products of special functions or lossy hashes of these features’ values. Each of these might have advantages in terms of computational complexity vs interpretability. Finally one chooses a learning/optimization algorithm, sometimes including a “regularization” term (which penalizes model complexity but does not involve observed data). For example, interpretability can be aided by learning by boosting or with an L1 penalty to yield sparse models; in this case, models which can be described in terms of a comprehensible number of nonzero weights of, ideally, individually-interpretable features. Rest assured that interpretability in data science is not merely a desideratum for the natural scientist.
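
To illustrate how an L1 penalty yields sparse models, here is a minimal sketch of soft-thresholding. It relies on a simplifying assumption that is ours, not the text's: when the features are orthonormal, minimizing squared loss (with the conventional factor of one half) plus an L1 penalty of strength lambda amounts to shrinking each unpenalized weight toward zero by lambda and setting it to exactly zero if it is smaller than lambda in magnitude. The feature names and weights below are hypothetical.

```python
def soft_threshold(weight, penalty):
    """Shrink a weight toward zero by `penalty`, clipping at exactly zero."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


def sparsify(weights, penalty):
    """Soft-threshold every named weight and keep only the nonzero ones."""
    shrunk = {name: soft_threshold(w, penalty) for name, w in weights.items()}
    return {name: w for name, w in shrunk.items() if w != 0.0}


# Hypothetical unpenalized weights for a bag-of-words model.
weights = {"data": 2.0, "the": 0.25, "gossip": -1.25, "of": -0.125}
print(sparsify(weights, penalty=0.5))
```

Small weights are removed outright rather than merely reduced, which is what leaves a model with a comprehensible number of nonzero features.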

    Startups building products without the perspective of multi-year research cycles are often both exploring the data and constructing systems on the data at the same time. Interpretable models offer the benefits of producing useful products while at the same time suggesting which directions are best to explore next.

    For example, at, we recently completed a project to classify popular content by click patterns over time and topic. In most cases, topic identification was straightforward, e.g., identifying celebrity gossip (you can imagine those features!). One particular click pattern was difficult to interpret, however; with further exploration we realized that people were using links on images embedded in a page in order to study their own real-time metrics. Each page load counted as a ‘click’ (the page content itself was irrelevant), and we discovered a novel use case ‘in the wild’ for our product.

Deep thoughts:

Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for the analysis to be interpretable (that is, one needs to understand the domain in which the data were generated, but also the mathematical operations performed during the “learning” and “optimization”). It requires creative decisions and open-mindedness in a scientific context.

Our next post addresses how one goes about learning these skills, that is: “what does a data science curriculum look like?”

* named after Snice, our favorite NYC café, where this blog post was hatched.

Thanks to Mike Dewar for comments on an earlier draft of this.

Hello, World!

Posted: August 24th, 2010 | Author: | Filed under: Uncategorized | 2 Comments »

Welcome! is a collaborative blogging effort started by Hilary Mason and Vince Buffalo. Our goal is to bring well-written articles on big data processing, statistics and statistical programming, and machine learning to one place. There’s been a surge of collaborative effort across Twitter (i.e. with the #rstats hashtag), local data drinking groups and R user groups, hackathons, and StackOverflow; the next logical step is a collaborative data blog.

We still have some configuration to do with WordPress, so it may be a bit before we whip out posts more interesting than this introduction. In the meantime, there’s a lot of other great blogs out there!

Hilary and Vince