Posted: February 15th, 2011  Author: Hilary Mason  Filed under: Snippets  Tags: data science, magazine, publication, science  2 Comments »
The February edition of Science offers a special collection of articles from scientists in a variety of fields on the challenges and opportunities of working with large amounts of data.
The overwhelming theme seems to be a need for tools, visualizations, and a common vocabulary for expressing, exploring, and working with data across disciplines.
Thanks to Chris Wiggins for the pointer.
Posted: February 9th, 2011  Author: drewconway  Filed under: Data Analysis  Tags: R, sports  17 Comments »
While I am fairly confident I am the only member of the dataists who is a sports fan (Hilary has developed a sports filter for her Twitter feed!), I am certain I am the only American football fan. For those of you like me, this past weekend marked the conclusion of another NFL season and the beginning of the worst stretch of the year for sports (31 days until Selection Sunday for March Madness).
To celebrate the conclusion of the season I, like many others, went to a Super Bowl party. As is the case with many Super Bowl parties, the one I attended had a pool for betting on the score after each quarter. For those unfamiliar with the Super Bowl pool, the basic idea is that anyone can participate in the pool by purchasing boxes on a 10×10 grid. This works as follows: when a box is purchased, the buyer writes his or her name in any available box. Once all boxes have been filled in, the grid is labeled 0 through 9 in random order along the horizontal and vertical axes. These numbers correspond to the trailing digits of the scores for the home and away teams in the game. Below is an example of such a pool grid from last year’s Super Bowl.
After the end of each quarter of the game, whichever box corresponds to the trailing digit of the scores for each team wins a portion of the pot in the pool. For example, at the end of the first quarter of Super Bowl XLIV the Colts led the Saints 10 to 0. From the above example, Charles Newkirk would have won the first quarter portion of the pot for having the box on zeroes for both axes. Likewise, the final score of last year’s Super Bowl was New Orleans, 31; Indianapolis, 17. In this case, Mike Taylor would have won the pool for the final score in the above grid with the box for 7 on the Colts and 1 for the Saints.
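The winning-box rule described above comes down to taking each score modulo 10. The post's own code is in R and is not reproduced here; the following is a minimal Python sketch, and the function name `winning_box` is a hypothetical helper, not something from the original analysis.

```python
def winning_box(team_a_score, team_b_score):
    """Return the (team A digit, team B digit) box that wins for a given score.

    Hypothetical helper: the pool only looks at the trailing digit of each
    team's score, which is the score modulo 10.
    """
    return team_a_score % 10, team_b_score % 10


# Scores quoted in the post for Super Bowl XLIV, written as (Colts, Saints).
first_quarter_box = winning_box(10, 0)
final_box = winning_box(17, 31)
print(first_quarter_box, final_box)
```

With the scores quoted in the post, the first quarter lands on the double-zero box, and the final score lands on 7 for the Colts and 1 for the Saints.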
This weekend, as I watched the pool grid get filled out and the row and column numbers drawn from a hat, I wondered: what combinations give me the highest probability of winning at the end of each quarter? Thankfully, there’s data to help answer this question, and I set out to analyze it. To calculate these probabilities I scraped the box scores from all 45 Super Bowls on Wikipedia, and then created heat maps for the probabilities of winning in each box based on this historical data. Below are the results of this analysis.


[Figures: heat maps of the probability of winning in each box, one panel each for First Quarter, Half Time, Third Quarter, and Final]

The results are an interesting study in Super Bowl scoring as the game progresses. You have the highest chance of winning the first quarter portion of the pool if you have a zero box for either team, and the highest overall chance of winning anything if you have both zeroes. This makes sense, as it is common for one team to still be scoreless after the first quarter. After the first quarter, however, your winning chances become significantly diluted.
Going into half time, having a zero box is good, but having a seven box gives you nearly the same chance of winning. Interestingly, going into the third quarter it is best to have a seven box for the home team, while everything else is basically a wash. With the final score, everything is basically a wash, as teams have had more opportunities to score, which adds variance to the counts for each trailing digit. That said, by a very slight margin, having a seven box for either the home or away team provides a better chance of winning the final pool.
So, next year, when you are watching the numbers get assigned to the grid, cross your fingers for either a zero or a seven box. If you happen to draw a two, five, or six, consider your wager all but lost. Incidentally, one could argue that a better analysis would have used all historic NFL scoring. Perhaps, though I think most sports analysts would agree that the Super Bowl is a unique game situation. Better, therefore, to focus only on those games, despite the small N.
Finally, most of the heavy lifting in this analysis was data wrangling: scraping the data from Wikipedia, then slicing and counting the trailing digits of the box scores to produce the probabilities for each cell. For those interested, the code is available in my GitHub repository.
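The slicing and counting step can be sketched as follows. This is not the author's R code, only a minimal Python illustration under stated assumptions: box scores are assumed to arrive as points scored per quarter, the helper names `cumulative_scores` and `box_probabilities` are hypothetical, and `toy_games` holds made-up numbers rather than real Super Bowl results.

```python
from collections import Counter


def cumulative_scores(quarter_points):
    """Turn points scored in each quarter into the running score after each quarter."""
    totals = []
    running = 0
    for points in quarter_points:
        running += points
        totals.append(running)
    return totals


def box_probabilities(games, quarter):
    """Estimate the chance of each box winning at the end of one quarter.

    games   -- list of (team_a_quarter_points, team_b_quarter_points) pairs
    quarter -- 0-based quarter index (0 = first quarter, 3 = final)
    Returns a dict mapping (team A digit, team B digit) to its share of games.
    """
    counts = Counter()
    for a_points, b_points in games:
        a_total = cumulative_scores(a_points)[quarter]
        b_total = cumulative_scores(b_points)[quarter]
        counts[(a_total % 10, b_total % 10)] += 1
    n_games = len(games)
    return {box: count / n_games for box, count in counts.items()}


# Made-up box scores for illustration only (NOT real Super Bowl data).
toy_games = [
    ([7, 3, 0, 7], [0, 7, 3, 3]),
    ([0, 10, 7, 0], [0, 3, 0, 14]),
    ([3, 7, 7, 3], [0, 0, 7, 7]),
    ([7, 0, 3, 10], [0, 6, 0, 8]),
]

first_quarter = box_probabilities(toy_games, 0)
final = box_probabilities(toy_games, 3)
print(first_quarter)
print(final)
```

Each returned dictionary holds only the boxes that actually occurred. A full 10×10 heat map would fill the remaining cells with zero.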
Two R functions used in this analysis, however, may be of general interest. First, a function that converts an integer into its Roman numeral equivalent, which is useful when scraping Wikipedia for Super Bowl data (graciously inspired by the Dive Into Python example).
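The R function itself is not shown here. As a stand-in, here is a minimal Python sketch of the usual subtractive approach to the same conversion. The name `to_roman`, the lookup table, and the 1 to 3999 range limit are assumptions made for this illustration.

```python
# Numeral/value pairs in descending order, including the subtractive forms.
ROMAN_NUMERAL_MAP = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)


def to_roman(n):
    """Convert an integer in the range 1..3999 to a Roman numeral string."""
    if not isinstance(n, int) or not 0 < n < 4000:
        raise ValueError("n must be an integer between 1 and 3999")
    parts = []
    for numeral, value in ROMAN_NUMERAL_MAP:
        # Emit the largest numeral that still fits, then reduce n by its value.
        while n >= value:
            parts.append(numeral)
            n -= value
    return "".join(parts)


# Super Bowl numbers 1 through 45, as they would appear in article titles.
super_bowl_numerals = [to_roman(i) for i in range(1, 46)]
print(super_bowl_numerals[-2:])
```

Generating the numerals for 1 through 45 gives the suffixes needed to address each Super Bowl article in turn.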
Second, the function used to perform the web scraping. Note that R’s XML package, together with the good fortune that the Wikipedia editors use a uniform format for Super Bowl box scores, makes this process very easy.
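The author's scraper uses R's XML package against Wikipedia and is not reproduced here. To give a rough idea of the parsing step only, here is a Python sketch using the standard library's `html.parser`. It runs on an inline, made-up table: the layout in `SAMPLE_HTML`, the team names, the scores, and the helper names `TableRowParser` and `parse_box_score` are all assumptions, and real Wikipedia markup will differ.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_box_score(html):
    """Return {team name: [points per quarter]} from a simplified box score table."""
    parser = TableRowParser()
    parser.feed(html)
    scores = {}
    for row in parser.rows[1:]:  # skip the header row
        team, *cells = row
        scores[team] = [int(cell) for cell in cells[:4]]  # four quarters, ignore the total
    return scores


# Hypothetical, simplified box score markup (not copied from Wikipedia).
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>1</th><th>2</th><th>3</th><th>4</th><th>Total</th></tr>
  <tr><td>Team A</td><td>7</td><td>3</td><td>0</td><td>7</td><td>17</td></tr>
  <tr><td>Team B</td><td>0</td><td>7</td><td>3</td><td>3</td><td>13</td></tr>
</table>
"""

box_score = parse_box_score(SAMPLE_HTML)
print(box_score)
```

A real scraper would also need to fetch each page and locate the box score table within it, which this sketch leaves out.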
R Packages used:
Posted: February 5th, 2011  Author: Hilary Mason  Filed under: Snippets  Tags: contest, health, healthcare, snippet  9 Comments »
Our friends at Kaggle are hosting the Heritage Health Prize. Launching April 4, the competition is seeking an algorithm that can predict patients at high risk for hospital admissions.
It’s difficult to do meaningful work with health data due to a variety of policy, legal, and technical challenges. The success of this contest will be something we can all point to as an indicator that we need to make more mindful decisions about how health data is managed and analyzed.
Who’s up for a dataists team entry?