Posted: October 29th, 2010 | Author: mikedewar | Filed under: Data Analysis, Philosophy of Data | Tags: analysis, data, dataanalysis, iraq, wikileaks
Editors’ Note: While data itself is rarely a source of controversy, dataists.com supports pursuing a data-centric view of conflict. Here, Mike Dewar examines the Wikileaks Iraq war documents with sobering results. Hackers should note the link to his source code and methods at the end of the post.
“…any man’s death diminishes me, because I am involved in mankind, and therefore never send to know for whom the bell tolls; it tolls for thee.” — John Donne, Meditation XVII
The Iraq War logs recently released by Wikileaks are contained in a 371729121 byte CSV file. It contains 390849 rows and 34 columns. The columns contain dates, locations, reports, category information and counts of the killed and wounded. The date range of the events spans from 2004-11-06 12:37:00 to 2009-04-23 12:30:00, and the events are located within the bounding box defined by (22.5,22.4), (49.6,51.8). Row 4 describes a female walking into a crowd and detonating the explosives and ball bearings she was wrapped in, killing 35 and wounding 36. Searching for events mentioning `ball bearing’ returns 503 events.
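As a rough sketch of how counts like these might be reproduced, the snippet below tallies rows, columns and keyword matches using only Python's standard library. It is not the code behind this post: the sample rows are made up, and the column names (`date`, `category`, `summary`) are assumptions rather than the real header of the Wikileaks file.

```python
import csv
import io

# Tiny made-up sample standing in for the real CSV; the column names
# ("date", "category", "summary") are assumptions, not the actual header.
SAMPLE = """date,category,summary
2006-10-03 08:00:00,IED Explosion,Patrol struck by IED packed with ball bearings
2007-05-01 14:30:00,Direct Fire,Small arms fire reported near checkpoint
2007-05-02 09:15:00,IED Explosion,Vehicle damaged; ball bearing fragments found
"""


def summarise(handle, text_column, phrase):
    """Return (row count, column count, rows whose text mentions phrase)."""
    reader = csv.DictReader(handle)
    n_rows = 0
    n_hits = 0
    needle = phrase.lower()
    for row in reader:
        n_rows += 1
        # Case-insensitive substring match on the free-text column.
        if needle in (row.get(text_column) or "").lower():
            n_hits += 1
    n_cols = len(reader.fieldnames or [])
    return n_rows, n_cols, n_hits


rows, cols, hits = summarise(io.StringIO(SAMPLE), "summary", "ball bearing")
print(rows, cols, hits)
```

Pointing `summarise` at an open file handle for the real CSV, with the correct column name, would give the equivalent figures for the full data set. A plain substring match also counts plurals such as "ball bearings", which may or may not be how the figure above was computed.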
There were 65349 Improvised Explosive Device (IED) explosions between the start of 2004 and the end of 2009. Of these 1794 had one enemy killed in action. The month that saw the highest number of explosions was May of 2007, when Iraq experienced 2080 IED explosions. During this month 693 civilians were killed, 85 enemies were killed and 93 friendlies were killed. The ratio of civilian deaths to combatant deaths is 3.89 civilians per combatant. On the first day of May there were 49 IED explosions in which 3 people were killed.
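The ratio quoted for May 2007 can be checked directly from the three death counts given above. This sketch assumes that "combatant" means enemy plus friendly deaths, which is the reading that reproduces the quoted figure.

```python
# Death counts quoted above for May 2007.
civilians_killed = 693
enemies_killed = 85
friendlies_killed = 93

# Assumption: "combatant" covers both enemy and friendly deaths.
combatants_killed = enemies_killed + friendlies_killed
ratio = civilians_killed / combatants_killed

print(f"{ratio:.2f} civilians per combatant")
```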
Location of all IED explosions as reported in the Wikileaks Iraq War Diary
108 different categories are used to categorise all but 6 events. The category with the most events is `IED explosion’ with 65439 events, followed by `direct fire’ with 57815 events. The category `recon threat’ has 1 event which occurred at 8am on the 17th of April, 2009, where 25 people were noticed with 6 cars in front of a police station in Basra. There are 325 `rock throwing’ events and 325 `assassination’ events.
There are 1211 mentions of the word `robot’, 4710 mentions of the word `UAV’, 1332 mentions of the word `predator’ and 443 mentions of the word `reaper’. The first appearance of one of these keywords is on the 3rd of October, 2006. There are 445 mentions of one or more of the words “contractor”, “blackwater” or “triple canopy”.
Density showing the distribution over time of events mentioning contractors and drones
The joint forces report that 108398 people lost their lives in Iraq during 2004-2010. 65650 civilians were killed, 15125 members of the host nation forces were killed, 23857 enemy combatants were killed, and 3766 friendly combatants were killed.
The number of deaths per month as reported in the Wikileaks Iraq War Diary
Please don’t believe any of this. Go instead to the data and have a look for yourself. All the code that generated this post is available on GitHub at http://github.com/mikedewar/Iraq-War-Diary-Analysis. You can also see what others have been saying: for example, the Guardian and the New York Times have great write-ups.
Posted: October 21st, 2010 | Author: mikedewar | Filed under: Philosophy of Data | Tags: data, mikedewar, open data, open source
The basic data science pipeline is on its way to becoming an open one. From Open Data, through an open source analysis, and ending up in results released as part of the Creative Commons, every step of data science can be performed openly.
The problems of releasing data openly are being overcome in three ways: aggressively, via sites such as Wikileaks; peacefully, through movements such as OpenGov and data.gov.uk; or commercially, via sites like Infochimps.
The concept of Open Source is now well known. Through programs from sed to Firefox, open source software is a thriving part of the software ecosystem. This is especially important when performing analysis on open data: why should we be trusted if we don’t tell everyone how we analyzed the data?
At the end of the pipeline, the Creative Commons is becoming more mainstream: for example, much of the image content on Flickr is CC licensed. Authors like Cory Doctorow are proving that creative people can build a career around releasing their work under Creative Commons licenses. Larry Lessig, in a brilliant interview with Stephen Colbert, shows how value can be added incrementally to a creative work without anyone losing out.
The central part of this pipeline – Open Analysis – has a basic problem: what’s the use of sharing analysis nobody can read or understand? It’s great that people put their analysis online for the world to see, but what’s the point if that analysis is written in dense code that can barely be read by its author?
This is still a problem even when your analysis code is beautifully laid out in a high-level scripting language and well commented. The chances are that the reader who is deeply moved by some statistical analysis of the latest Wikileaks release still can’t read, or critique, the code that generated it.
The technological problems of sharing code are now all but solved: sites like SourceForge and GitHub allow the sharing, dissemination, and tracking of source code. Projects such as CRAN (for R Packages) and MLOSS (Machine Learning Open Source Software) allow the colocation of code, binaries, documentation, and package information, which makes finding the code an easy job. There have been several attempts at making the code itself easy to read. We’ve got beautiful documentation generators – but these require careful commenting, and all you really end up with are those comments pretty-printed – not so great for expressing your modeling assumptions. Another attempt at readable code is Literate Programming, which encourages you to write the code and its documentation all at the same time but, again, is labour intensive. And this, I think, is at the heart of the problem of writing readable code. It’s just plain hard to do.
Who’s got the time to write a whole PDF every time you want to draw a bar chart? We’ve got press deadlines, conference deadlines, and a public attention span measurable in hours. Writing detailed comments is not only time consuming, it’s often a seriously complicated affair akin to writing an academic paper. And, unless we’re actually writing an academic paper, it’s mostly a thankless task.
My contention is this: nobody is going to consistently write readable code, ever. It’s simply too time consuming and the immediate rewards to the coder are negligible. Yet it’s important for others to be able to understand our analysis if they’re making decisions, as citizens or as subjects, based on this analysis. What is to be done?
The answer lies, I think, in convention. The web development community has nailed this with projects like Ruby on Rails and Django. If I’m working within Ruby on Rails, and I name my objects according to convention, then I get a lot of code for free – I actually save time by writing good code. This saving is not a projected one – “you’re not going to be able to read that code in 2 years” – but an immediate and obvious one. If I abide by the Ruby on Rails structure, then I don’t have to build my databases from scratch. Web forms are automagically generated. My life is made considerably easier and, without trying, my code has a much better structure.
So do we have any data science conventions? My argument is ‘hell yes’: if I don’t abide by some strong data science conventions then I’ll get into well justified trouble. Are the raw data available? Have I made the preprocessing steps clear? Are my data properly normalized? Are my assumptions valid and openly expressed? Has my algorithm converged? Have the functions I have written been unit tested? Have I performed a proper cross validation of the results?
I think that projects like ProjectTemplate, which imposes a useful structure on a project written in R, are a great start. ProjectTemplate treads a fine line: not upsetting those who like to code close to the metal, whilst rewarding those who follow some simple conventions. ProjectTemplate coaxes us into writing well structured projects by saving us time. For example, it currently provides generic helper functions that read and format most common data files placed into the data/ folder, producing a well structured R object with virtually zero effort on the part of the coder.
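To make the idea of a convention-driven loader concrete, here is a minimal Python sketch of the same pattern: every recognised file dropped into a data/ folder is read automatically and exposed under its file name. This is an illustrative analogue only, not ProjectTemplate's API; the names `READERS` and `load_project_data`, and the sample files, are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def _read_csv(path):
    # Each CSV becomes a list of dicts, one per row.
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def _read_json(path):
    with path.open() as handle:
        return json.load(handle)


# The convention: the file extension picks the reader.
READERS = {".csv": _read_csv, ".json": _read_json}


def load_project_data(data_dir):
    """Load every recognised file in data_dir, keyed by file name stem."""
    datasets = {}
    for path in sorted(Path(data_dir).iterdir()):
        reader = READERS.get(path.suffix.lower())
        if reader is not None:
            datasets[path.stem] = reader(path)
    return datasets


# Usage with a throwaway directory and made-up files.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "events.csv").write_text("category,killed\nexample,0\n")
    (data_dir / "settings.json").write_text('{"normalise": true}')
    (data_dir / "notes.txt").write_text("ignored: no reader registered")
    datasets = load_project_data(data_dir)

print(sorted(datasets))
```

The coder writes no loading code at all: following the convention of putting files in data/ is what earns the loaded objects, and a reader who knows the convention knows where every data set came from.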
A lot of code already exists to implement standard data science conventions. From cross validation packages to unit tests, our conventions are already well encapsulated. Collecting these tools together into a data science templating system would allow us to formalize best practices and help with teaching the ‘carpentry’ aspects of data science. Most importantly it would allow readers to get a clear view of the analysis, using well documented data science conventions as a navigational tool.
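As one small illustration of how compactly such a convention can be encapsulated, here is a sketch of k-fold cross validation index splitting in plain Python. The function name `k_fold_indices` is hypothetical, and the sketch deliberately omits shuffling, which real cross validation packages usually offer and which matters whenever the rows are ordered.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k contiguous folds.

    Fold sizes differ by at most one. No shuffling is done here, so
    shuffle the data first if its order carries information.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds absorb the remainder, one item each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), len(test))
```

Each item lands in exactly one test fold and is never in the training set for that fold, which is the property a reader would want to be able to take for granted when a project says it was cross validated.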
At a recent meeting in NYC a well-known data scientist said something like “is awk and grep the best we can do?” which, though a little incendiary, raised a serious question. Are we really destined, time and time again, to re-create a data science pipeline every time a new data set comes our way? Or could we come to some agreement that there is a set of common procedures that underlie all our projects?
So I’m interested in hearing what the data science communities think our conventions are, and then in building these into software like ProjectTemplate. Please leave your ideas in the comments and, by automating these conventions, we can start to build more readable code structures. I’ll report on how these conventions evolve as I go along. Maybe we don’t have to reinvent the wheel over and over again – even if it does mean accepting some loose conventions. In return, we focus on the important aspects of analysis, and everyone else will find it much easier to trust what we have to say.