Live stream the Strata NY Data Science Conference!

Posted: September 19th, 2011 | Filed under: Data Analysis | 1 Comment

Strata New York 2011 has just begun, and you can view the livestream here:

Best Boxes in a Super Bowl Pool

Posted: February 9th, 2011 | Filed under: Data Analysis | 17 Comments

While I am fairly confident I am the only member of the dataists who is a sports fan (Hilary has developed a sports filter for her Twitter feed!), I am certain I am the only American football fan. For those of you like me, this past weekend marked the conclusion of another NFL season and the beginning of the worst stretch of the year for sports (31 days until Selection Sunday for March Madness).

To celebrate the conclusion of the season I, like many others, went to a Super Bowl party. As is the case with many Super Bowl parties, the one I attended had a pool for betting on the score after each quarter. For those unfamiliar with the Super Bowl pool, the basic idea is that anyone can participate by purchasing boxes on a 10×10 grid. This works as follows: when a box is purchased, the buyer writes his or her name in any available box. Once all boxes have been filled in, the rows and columns of the grid are labeled 0 through 9 in random order along the horizontal and vertical axes. These numbers correspond to score tallies for the home and away teams in the game. Below is an example of such a pool grid from last year’s Super Bowl.

Super Bowl pool example

After the end of each quarter of the game, whichever box corresponds to the trailing digit of each team’s score wins a portion of the pot in the pool. For example, at the end of the first quarter of Super Bowl XLIV the Colts led the Saints 10 to 0. From the above example, Charles Newkirk would have won the first quarter portion of the pot for having the box on zeroes for both axes. Likewise, the final score of last year’s Super Bowl was New Orleans, 31; Indianapolis, 17. In this case, Mike Taylor would have won the pool for the final score in the above grid with the box for 1 on the Saints and 7 on the Colts.
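The rule above can be sketched in a few lines of Python. This is only an illustration of the trailing-digit rule, not code from the original analysis, and the function name `winning_box` is made up for the example.

```python
def winning_box(score_a, score_b):
    """Return the pair of trailing digits that identifies the winning box."""
    # The trailing digit of a score is its remainder after dividing by 10.
    return (score_a % 10, score_b % 10)


# First quarter of Super Bowl XLIV: Colts 10, Saints 0.
print(winning_box(10, 0))   # (0, 0)

# Final score: Saints 31, Colts 17.
print(winning_box(31, 17))  # (1, 7)
```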

This weekend, as I watched the pool grid get filled out and the row and column numbers drawn from a hat, I wondered: what combinations give me the highest probability of winning at the end of each quarter? Thankfully, there’s data to help answer this question, and I set out to analyze it. To calculate these probabilities I scraped the box scores from all 45 Super Bowls on Wikipedia, and then created heat maps for the probabilities of winning in each box based on this historical data. Below are the results of this analysis.

Heat maps of win probabilities: First Quarter, Half Time, Third Quarter, Final

The results are an interesting study in Super Bowl scoring as the game progresses. You have the highest chance of winning the first quarter portion of the pool if you have a zero box for either team, and the highest overall chance of winning anything if you have both zeroes. This makes sense, as it is common for one team to still be scoreless after the first quarter. After the first quarter, however, your winning chances become significantly diluted.

At half time, having a zero box is good, but having a seven box gives you nearly the same chance of winning. Interestingly, at the end of the third quarter it is best to have a seven box for the home team, while everything else is basically a wash. With the final score, everything is basically a wash, as teams have had more opportunities to score, which adds variance to the counts for each trailing digit. That said, by a very slight margin, having a seven box for either the home or away team provides a better chance of winning the final pool.

So, next year, when you are watching the numbers get assigned to the grid, cross your fingers for either a zero or a seven box. If you happen to draw a two, five, or six, consider your wager all but lost. Incidentally, one could argue that a better analysis would have used all historic NFL scoring. Perhaps, though I think most sports analysts would agree that the Super Bowl is a unique game situation. Better, therefore, to focus only on those games, despite the small N.

Finally, most of the heavy lifting in this analysis was data wrangling: scraping the data from Wikipedia, then slicing and counting the trailing digits of the box scores to produce the probabilities for each cell. For those interested, the code is available on my github repository.
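To make the counting step concrete, here is a minimal Python sketch of how trailing digits could be tallied into per-box probabilities. It is not the code from the repository; the function name `box_probabilities` and the sample scores are invented for illustration and are not real Super Bowl data.

```python
from collections import Counter


def box_probabilities(scores):
    """Map each (home_digit, away_digit) box to its share of the games.

    `scores` is a list of (home_score, away_score) pairs, all taken at the
    same point in the game (for example, the end of the first quarter).
    """
    counts = Counter((home % 10, away % 10) for home, away in scores)
    total = sum(counts.values())
    return {box: n / total for box, n in counts.items()}


# Made-up first-quarter scores, purely to show the shape of the result.
sample = [(10, 0), (7, 0), (0, 0), (3, 7)]
probabilities = box_probabilities(sample)
print(probabilities[(0, 0)])  # 0.5
```

Boxes that never occur are simply absent from the result, so a heat map built from it would treat missing boxes as zero.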

There are two R functions used in this analysis, however, that may be of general interest. First, a function that converts an integer into its Roman Numeral equivalent, which is useful when scraping Wikipedia for Super Bowl data (graciously inspired by the Dive Into Python example).
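The original R function is not reproduced here. As a rough stand-in, the sketch below shows the usual greedy conversion in Python; the name `int_to_roman` and the 1 to 3999 range check are choices made for this example rather than details from the post.

```python
def int_to_roman(number):
    """Convert an integer from 1 to 3999 into a Roman numeral string."""
    if not 0 < number < 4000:
        raise ValueError("number must be between 1 and 3999")
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        # Use each symbol as many times as it fits, then carry the remainder.
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


print(int_to_roman(44))  # XLIV
print(int_to_roman(45))  # XLV
```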

Second, the function used to perform the web scraping. Note that R’s XML package, together with the good fortune that the Wikipedia editors use a uniform format for Super Bowl box scores, makes this process very easy.
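The R scraping function itself is not shown here. The Python sketch below illustrates the same idea with the standard library only, under several assumptions: that each game's page lives at a URL ending in `Super_Bowl_` plus the Roman numeral, and that the box score is a table with one row per team holding the points scored in each of four quarters followed by a total. The sample HTML is a hand-written stand-in for that layout, not a copy of the Wikipedia markup, and its per-quarter split for Super Bowl XLIV is supplied for illustration (the post itself gives only the first-quarter and final scores).

```python
from html.parser import HTMLParser
from itertools import accumulate


def super_bowl_url(roman):
    """Build the assumed Wikipedia URL for a Super Bowl, e.g. 'XLIV'."""
    return "https://en.wikipedia.org/wiki/Super_Bowl_" + roman


class TableRows(HTMLParser):
    """Collect every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._row.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def quarter_scores(html):
    """Return {team: running score after each of the four quarters}."""
    parser = TableRows()
    parser.feed(html)
    scores = {}
    for row in parser.rows:
        if len(row) != 6:
            continue  # expect team, four quarters, total
        team, *cells = row
        if not all(cell.isdigit() for cell in cells):
            continue  # header row or other non-numeric row
        # Box scores list points per quarter; the pool needs running totals.
        scores[team] = list(accumulate(int(cell) for cell in cells[:4]))
    return scores


SAMPLE_HTML = """
<table>
  <tr><th></th><th>1</th><th>2</th><th>3</th><th>4</th><th>Total</th></tr>
  <tr><td>Saints</td><td>0</td><td>6</td><td>10</td><td>15</td><td>31</td></tr>
  <tr><td>Colts</td><td>10</td><td>0</td><td>7</td><td>0</td><td>17</td></tr>
</table>
"""

print(quarter_scores(SAMPLE_HTML))
```

Fetching the live page and locating the right table are left out on purpose, since both depend on markup that may have changed since the post was written.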

R Packages used:

The Iraq War Diary – An Initial Grep

Posted: October 29th, 2010 | Filed under: Data Analysis, Philosophy of Data | 16 Comments

Editors’ Note: While data itself is rarely a source of controversy, supports pursuing a data-centric view of conflict. Here, Mike Dewar examines the Wikileaks Iraq war documents with sobering results. Hackers should note the link to his source code and methods at the end of the post.

“…any man’s death diminishes me, because I am involved in mankind, and therefore never send to know for whom the bell tolls; it tolls for thee.” -- John Donne, Meditation XVII

The Iraq War logs recently released by Wikileaks are contained in a 371729121-byte CSV file. It contains 390849 rows and 34 columns. The columns contain dates, locations, reports, category information and counts of the killed and wounded. The date range of the events spans from 2004-11-06 12:37:00 to 2009-04-23 12:30:00, and the events are located within the bounding box defined by (22.5,22.4), (49.6,51.8). Row 4 describes a female walking into a crowd and detonating the explosives and ball bearings she was wrapped in, killing 35 and wounding 36. Searching for events mentioning ‘ball bearing’ returns 503 events.

There were 65349 Improvised Explosive Device (IED) explosions between the start of 2004 and the end of 2009. Of these, 1794 had one enemy killed in action. The month that saw the highest number of explosions was May of 2007, when Iraq experienced 2080 IED explosions. During this month 693 civilians were killed, 85 enemies were killed and 93 friendlies were killed. The ratio of civilian deaths to combatant deaths for that month is 3.89 civilians per combatant. On the first day of May there were 49 IED explosions in which 3 people were killed.
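The 3.89 figure can be reproduced from the May 2007 counts above, assuming that "combatant" here means enemy plus friendly deaths. That reading is an inference from the arithmetic, not something the post states, and the function name below is invented for the example.

```python
def civilians_per_combatant(civilians, enemies, friendlies):
    """Ratio of civilian deaths to combined enemy and friendly deaths."""
    return civilians / (enemies + friendlies)


# May 2007 counts quoted in the text: 693 civilians, 85 enemies, 93 friendlies.
ratio = civilians_per_combatant(693, 85, 93)
print(round(ratio, 2))  # 3.89
```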

IED explosions in Iraq

Location of all IED explosions as reported in the Wikileaks Iraq War Diary

108 different categories are used to categorise all but 6 events. The category with the most events is ‘IED explosion’ with 65439 events, followed by ‘direct fire’ with 57815 events. The category ‘recon threat’ has 1 event which occurred at 8am on the 17th of April, 2009, where 25 people were noticed with 6 cars in front of a police station in Basra. There are 325 ‘rock throwing’ events and 325 ‘assassination’ events.

There are 1211 mentions of the word ‘robot’, 4710 mentions of the word ‘UAV’, 1332 mentions of the word ‘predator’ and 443 mentions of the word ‘reaper’. The first appearance of one of these keywords is on the 3rd of October, 2006. There are 445 mentions of one or more of the words ‘contractor’, ‘blackwater’ or ‘triple canopy’.
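A keyword count of this kind might be sketched as follows in Python. This is not the code behind the post: it assumes a "mention" means an event whose report text contains the keyword, matched case-insensitively, and the report strings below are invented placeholders rather than rows from the data.

```python
import re


def count_events_mentioning(reports, keywords):
    """Count reports containing at least one keyword, ignoring case."""
    pattern = re.compile(
        "|".join(re.escape(keyword) for keyword in keywords),
        re.IGNORECASE,
    )
    return sum(1 for text in reports if pattern.search(text))


# Invented placeholder reports, only to show how the function is called.
reports = [
    "UAV observed a vehicle near the checkpoint",
    "Patrol returned to base",
    "Blackwater contractor convoy delayed",
]
print(count_events_mentioning(reports, ["contractor", "blackwater", "triple canopy"]))  # 1
print(count_events_mentioning(reports, ["uav"]))  # 1
```

Counting matching events once each, rather than every occurrence of a word, keeps a report that names both ‘contractor’ and ‘blackwater’ from being counted twice.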

drones and contractors in iraq

Density showing the distribution over time of events mentioning contractors and drones

The joint forces report that 108398 people lost their lives in Iraq during 2004-2010. 65650 civilians were killed, 15125 members of the host nation forces were killed, 23857 enemy combatants were killed, and 3766 friendly combatants were killed.

Deaths in Iraq over time

The number of deaths per month as reported in the Wikileaks Iraq War Diary

Please don’t believe any of this. Go instead to the data and have a look for yourself. All the code that generated this post is available on github. You can also see what others have been saying; for example, the Guardian and the New York Times have great write-ups.