Live stream the Strata NY Data Science Conference!

Posted: September 19th, 2011 | Author: | Filed under: Data Analysis | Tags: , | 1 Comment »

Strata New York 2011 has just begun, and you can view the livestream here:

Snippet: Where the F**k Was I?

Posted: June 24th, 2011 | Author: | Filed under: Data Visualization, Snippets | 4 Comments »

James Bridle had an interesting reaction to the revelation that his iPhone was tracking his location: he made a book!

He describes his reaction to his phone’s data collection habits rather poetically:

I love its hunger for new places, the inquisitive sensor blooming in new areas of the city, the way it stripes the streets of Sydney and Udaipur; new to me, new to the machine. It is opening its eyes and looking around, walking the streets beside me with the same surprise.

His book is documented on his site and on Flickr.

Machine Learning for Complex Language Entry

Posted: April 15th, 2011 | Author: | Filed under: Machine Learning in the Real World | Tags: , | 21 Comments »

Editor’s note: We’d like to invite people with interesting machine learning and data analysis applications to explain the techniques that are working for them in the real world on real data. The project described in this post is an open-source browser add-on that uses machine learning techniques to make it easier for people around the world to communicate.

Authors: Kevin Scannell and Michael Schade

Many languages around the world use the familiar Latin alphabet (A-Z), but in order to represent the sounds of the language accurately, their writing systems employ diacritical marks and other special characters.    For example:

  • Vietnamese (Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm),
  • Hawaiian  (Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo),
  • Ewe (Amesiame kpɔ mɔ abu tame le eɖokui si eye wòaɖe eƒe susu agblɔ faa mɔxexe manɔmee),
  • and hundreds of others.

Speakers of these languages have difficulty entering text into a computer because keyboards are often not available, and even when they are, typing special characters can be slow and cumbersome.    Also, in many cases, speakers may not be completely familiar with the “correct” writing system and may not always know where the special characters belong.   The end result is that for many languages, the texts people type in emails, blogs, and social networking sites are left as plain ASCII, omitting any special characters, and leading to ambiguities and confusion.

To solve this problem, we have created a free and open source Firefox add-on that allows users to type texts in plain ASCII, and then automatically adds all diacritics and special characters in the correct places, a process we call “Unicodification”. The add-on uses a machine learning approach, employing both character-level and word-level models trained on data crawled from the web for more than 100 languages.

It is easiest to describe our algorithm with an example.   Let’s say a user is typing Irish (Gaelic), and they enter the phrase nios mo muinteoiri fiorchliste with no diacritics.   For each word in the input, we check to see if it is an “ascii-fied” version of a word that was seen during training.

  • In our example, for two of the words, there is exactly one candidate unicodification in the training data: nios is the asciification of the word níos which is very common in our Irish data, and muinteoiri is the asciification of múinteoirí, also very common.   As there are no other candidates, we take níos and múinteoirí as the unicodifications.
  • There are two possibilities for mo; it could be correct as is, or it could be the asciification of mó.   When there is an ambiguity of this kind, we rely on standard word-level n-gram language modeling; in this case, the training data contains many instances of the set phrase níos mó, and no examples of níos mo, so mó is chosen as the correct answer.
  • Finally, the word fiorchliste doesn’t appear at all in our training data, so we resort to a character-level model, treating each character that could admit a diacritic as a classification problem.  For each language, we train a naive Bayes classifier using trigrams (three character sequences) in a neighborhood of the ambiguous character as features.   In this case, the model classifies the first “i” as needing an acute accent, and leaves all other characters as plain ASCII, thereby (correctly) restoring fiorchliste to fíorchliste.
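The word-level part of the walkthrough above can be sketched in a few lines of Python. This is a minimal illustration and not the add-on's actual code: the function names, the toy corpus, and the tie-breaking rule are all assumptions made for the example, and `asciify` only handles accents that Unicode can decompose into a base letter plus combining marks.

```python
import unicodedata
from collections import Counter, defaultdict


def asciify(word):
    """Strip combining marks so that 'níos' becomes 'nios'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences):
    """Count words and adjacent word pairs, and index words by ASCII form."""
    unigrams, bigrams = Counter(), Counter()
    candidates = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        for w in words:
            candidates[asciify(w)].add(w)
    return unigrams, bigrams, candidates


def restore(text, unigrams, bigrams, candidates):
    """Pick one candidate per word; None marks words unseen in training."""
    output, previous = [], None
    for word in text.split():
        options = candidates.get(word)
        if not options:
            # A full system would hand this word to a character-level model.
            output.append(None)
            previous = word
            continue
        # Prefer the candidate seen most often after the previous word,
        # breaking ties by overall frequency.
        best = max(options, key=lambda c: (bigrams[(previous, c)], unigrams[c], c))
        output.append(best)
        previous = best
    return output


# Toy corpus, invented for illustration only.
corpus = ["tá sé níos mó", "níos mó múinteoirí", "mo theach", "níos mó ná sin"]
unigrams, bigrams, candidates = train(corpus)
restored = restore("nios mo muinteoiri fiorchliste", unigrams, bigrams, candidates)
print(restored)
```

With this toy corpus, mo is ambiguous and is resolved by the bigram counts, while fiorchliste has no candidate and is left for a character-level step.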

The example above illustrates the ability of the character-level models to handle never-before-seen words; in this particular case fíorchliste is a compound word, and the character sequences in the two pieces fíor and chliste are relatively common in the training data. The character-level approach is also an effective way of handling morphologically complex languages, where there can be thousands or even millions of forms of any given root word, so many that one is lucky to see even a small fraction of them in a training corpus. The chances of seeing individual morphemes are much higher, and these are captured reasonably well by the character-level models.
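To make the character-level idea concrete, here is a small naive Bayes sketch in Python. It is an assumption-laden toy, not the authors' implementation: the window radius, the offset-tagged trigram features, the add-one smoothing, the `CharModel` class, and the tiny Irish word list are all choices made for this example.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def asciify(text):
    """Drop combining marks, e.g. 'fíor' -> 'fior'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigram_features(plain, i, radius=2):
    """Trigrams centred on each position within `radius` of index i."""
    pad = radius + 1
    padded = "^" * pad + plain + "$" * pad
    centre = i + pad
    return [(off, padded[centre + off - 1:centre + off + 2])
            for off in range(-radius, radius + 1)]


class CharModel:
    def __init__(self):
        self.class_counts = defaultdict(Counter)    # ASCII char -> original chars
        self.feature_counts = defaultdict(Counter)  # (ASCII, original) -> features
        self.vocabulary = set()

    def train(self, words):
        for word in words:
            word = unicodedata.normalize("NFC", word)
            plain = asciify(word)
            if len(plain) != len(word):
                continue  # skip words where positions would not line up
            for i, (a, o) in enumerate(zip(plain, word)):
                self.class_counts[a][o] += 1
                for feature in trigram_features(plain, i):
                    self.feature_counts[(a, o)][feature] += 1
                    self.vocabulary.add(feature)

    def classify(self, plain, i):
        """Return the most probable original character at position i."""
        classes = self.class_counts.get(plain[i])
        if not classes:
            return plain[i]  # character never seen in training
        total = sum(classes.values())
        best, best_score = plain[i], -math.inf
        for original, n in classes.items():
            counts = self.feature_counts[(plain[i], original)]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n / total)  # class prior
            for feature in trigram_features(plain, i):
                score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best, best_score = original, score
        return best

    def restore(self, plain):
        return "".join(self.classify(plain, i) for i in range(len(plain)))


# Toy training words, chosen for illustration only.
model = CharModel()
model.train(["fíor", "fíorghlan", "síor", "chliste", "cliste", "glic", "iris", "mise"])
restored_word = model.restore("fiorchliste")
print(restored_word)
```

Because the trigrams around the first "i" resemble those in fíor and the trigrams around the second resemble those in chliste, the toy model accents only the first one, even though the compound itself never appears in the training list.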

We are far from the first to have studied this problem from the machine learning point of view (full references are given in our paper), but this is the first time that models have been trained for so many languages, and made available in a form that will allow widespread adoption in many language communities.

We have done a detailed evaluation of the performance of the software for all of the languages (all the numbers are in the paper) and this raised a number of interesting issues.

First, we were only able to do this on such a large scale because of the availability of training text on the web in so many languages. But experience has shown that web texts are much noisier than texts found in traditional corpora. Does this have an impact on the performance of statistical systems? The short answer appears to be “yes,” at least for the problem of unicodification. In cases where we had access to high quality corpora of books and newspaper texts, we achieved substantially better performance.

Second, it is probably no surprise that some languages are much harder than others. A simple baseline algorithm is to leave everything as plain ASCII, and this performs quite well for languages like Dutch which have only a small number of words containing diacritics (this baseline gets 99.3% of words correct for Dutch). In Figure 1 we plot the word-level accuracy of our system against this baseline.

But recall there are really two models at play, and we could ask about the relative contribution of, say, the character-level model to the performance of the system. With this in mind, we introduce a second “baseline” which omits the character-level model entirely. More precisely, given an ASCII word as input, it chooses the most common unicodification that was seen in the training data, and leaves the word as ASCII if there were no candidate unicodifications in the training data. In Figure 2 we plot the word-level accuracy of our system against this improved baseline. We see that the contribution of the character model is really quite small in most cases, and not surprisingly several of the languages where it helps the most are morphologically quite complex, like Hungarian and Turkish (though Vietnamese is not). In quite a few cases, the character model actually hurts performance, although our analyses show that this is generally due to noise in the training data: a lot of noise in web texts is English (and hence almost pure ASCII), so the baseline will outperform any algorithm that tries to add diacritics.
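The two baselines and the word-level accuracy measure are simple enough to sketch directly. The Python below is illustrative only: the function names and the tiny training and test texts are invented for the example, and the resulting numbers say nothing about the real evaluation in the paper.

```python
import unicodedata
from collections import Counter, defaultdict


def nfc(word):
    """Normalize to composed form so string comparisons are reliable."""
    return unicodedata.normalize("NFC", word)


def asciify(word):
    """Strip combining marks so that 'mó' becomes 'mo'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_baseline(plain_words):
    """First baseline: leave every word as plain ASCII."""
    return list(plain_words)


def train_most_frequent(words):
    """Map each ASCII form to its most common form in the training words."""
    forms = defaultdict(Counter)
    for w in words:
        forms[asciify(w)][nfc(w)] += 1
    return {plain: seen.most_common(1)[0][0] for plain, seen in forms.items()}


def most_frequent_baseline(plain_words, table):
    """Second baseline: most common known form, else the ASCII word itself."""
    return [table.get(w, w) for w in plain_words]


def word_accuracy(predicted, reference):
    """Fraction of words restored exactly."""
    pairs = list(zip(predicted, reference))
    return sum(p == r for p, r in pairs) / len(pairs)


# Toy data, invented for illustration only.
training = "níos mó níos mó mo múinteoirí".split()
reference = [nfc(w) for w in "tá mo theach níos mó".split()]
plain = [asciify(w) for w in reference]

table = train_most_frequent(training)
acc_ascii = word_accuracy(ascii_baseline(plain), reference)
acc_frequent = word_accuracy(most_frequent_baseline(plain, table), reference)
print(acc_ascii, acc_frequent)
```

In this toy case the second baseline fixes níos and one mó but wrongly accents the genuine mo, which is the kind of ambiguity the word-level n-gram model is meant to resolve.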

The Firefox add-on works by communicating with the web service via its stable API, and we have a number of other clients including a vim plugin (written by fellow St. Louisan Bill Odom) and Perl, Python, and Haskell implementations.    We hope that developers interested in supporting language communities around the world will consider integrating this service in their own software.

Please feel free to contact us with any questions, comments, or suggestions.

Snippet: The Popularity of Data Analysis Software

Posted: April 5th, 2011 | Author: | Filed under: Snippets | Tags: , , | 6 Comments »

We’re often asked what our tool stack looks like. Robert Muenchen over at r4stats has a study of the most popular data analysis software.

He looks at factors as varied as traffic on the language mailing lists, number of search results and web site popularity, sales, and finally surveys of use. For example:

mailing list traffic over time

It’s interesting to think about which of these factors indicate greater adoption. Don’t let me spoil it for you, but R comes out looking good across the board.

Snippet: Science special collection on Dealing with Data

Posted: February 15th, 2011 | Author: | Filed under: Snippets | Tags: , , , | 2 Comments »

The February edition of Science offers a special collection of articles from scientists in a variety of fields on the challenges and opportunities of working with large amounts of data.

The overwhelming theme seems to be a need for tools, visualizations, and a common vocabulary for expressing, exploring, and working with data across disciplines.

Thanks to Chris Wiggins for the pointer.

Best Boxes in a Super Bowl Pool

Posted: February 9th, 2011 | Author: | Filed under: Data Analysis | Tags: , | 17 Comments »

While I am fairly confident I am the only member of the dataists who is a sports fan (Hilary has developed a sports filter for her Twitter feed!), I am certain I am the only American football fan. For those of you like me, this past weekend marked the conclusion of another NFL season and the beginning of the worst stretch of the year for sports (31 days until Selection Sunday for March Madness).

To celebrate the conclusion of the season I, like many others, went to a Super Bowl party. As is the case with many Super Bowl parties, the one I attended had a pool for betting on the score after each quarter. For those unfamiliar with the Super Bowl pool, the basic idea is that anyone can participate in the pool by purchasing boxes on a 10×10 grid. This works as follows: when a box is purchased the buyer writes his or her name in any available box. Once all boxes have been filled in, the grid is then labeled 0 through 9 in random order along the horizontal and vertical axes. These numbers correspond to score tallies for both the home and away teams in the game. Below is an example of such a pool grid from last year’s Super Bowl.

Super Bowl pool example

After the end of each quarter of the game, whichever box corresponds to the trailing digit of the scores for each team wins a portion of the pot in the pool. For example, at the end of the first quarter of Super Bowl XLIV the Colts led the Saints 10 to 0. From the above example, Charles Newkirk would have won the first quarter portion of the pot for having the box on zeroes for both axes. Likewise, the final score of last year’s Super Bowl was New Orleans, 31; Indianapolis, 17. In this case, Mike Taylor would have won the pool for the final score in the above grid with the box for 1 on the Colts and 7 for the Saints.

This weekend, as I watched the pool grid get filled out and the row and column numbers drawn from a hat, I wondered: what combinations give me the highest probability of winning at the end of each quarter? Thankfully, there’s data to help answer this question, and I set out to analyze it. To calculate these probabilities I scraped the box scores from all 45 Super Bowls on Wikipedia, and then created heat maps for the probabilities of winning in each box based on this historical data. Below are the results of this analysis.

[Figures: heat maps of win probabilities for the first quarter, half time, third quarter, and final score]


The results are an interesting study in Super Bowl scoring as the game progresses. You have the highest chance of winning the first quarter portion of the pool if you have a zero box for either team, and the highest overall chance of winning anything if you have both zeroes. This makes sense, as it is common for one team to go scoreless after the first quarter. After the first quarter, however, your winning chances become significantly diluted.

Going into half time, having a zero box is good, but having a seven box gives you nearly the same chance of winning. Interestingly, going into the third quarter it is best to have a seven box for the home team, while everything else is basically a wash. With the final score, everything is basically a wash, as teams are given more opportunity to score, which adds variance to the counts for each trailing digit. That said, by a very slight margin having a seven box for either the home or away team provides a better chance of winning the final pool.

So, next year, when you are watching the numbers get assigned to the grid, cross your fingers for either a zero or a seven box. If you happen to draw a two, five, or six, consider your wager all but lost. Incidentally, one could argue that a better analysis would have used all historic NFL scoring. Perhaps, though I think most sports analysts would agree that the Super Bowl is a unique game situation. Better, therefore, to focus only on those games, despite the small N.

Finally, most of the heavy lifting in this analysis was data wrangling: scraping the data from Wikipedia, then slicing and counting the trailing digits of the box scores to produce the probabilities for each cell. For those interested, the code is available on my github repository.
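The counting step itself is small. The following Python sketch shows one way to turn a list of end-of-quarter scores into a 10×10 grid of win probabilities; it is not the code from the repository, the function names are made up for the example, and the score list is placeholder data rather than real Super Bowl results.

```python
from collections import Counter


def winning_box(home_score, away_score):
    """The winning box is given by the trailing digit of each team's score."""
    return home_score % 10, away_score % 10


def box_probabilities(scores):
    """Return a 10x10 grid: grid[h][a] is the share of games won by box (h, a)."""
    counts = Counter(winning_box(home, away) for home, away in scores)
    total = sum(counts.values())
    return [[counts[(h, a)] / total for a in range(10)] for h in range(10)]


# Placeholder (home, away) scores, invented for illustration only.
sample_scores = [(10, 0), (7, 0), (0, 3), (14, 7)]
grid = box_probabilities(sample_scores)

for row in grid:
    print(" ".join(f"{p:.2f}" for p in row))
```

Feeding the real per-quarter scores into `box_probabilities` once per quarter would give the four grids that the heat maps visualize.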

There are two R functions used in this analysis, however, that may be of general interest. First, a function that converts an integer into its Roman Numeral equivalent, which is useful when scraping Wikipedia for Super Bowl data (graciously inspired by the Dive Into Python example).
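The original R function is not reproduced here, so the following is a Python sketch of the same idea, using the standard greedy conversion over value and numeral pairs. The name `to_roman` and the 1 to 3999 range check are choices made for this example.

```python
# Value/numeral pairs in descending order, including subtractive forms.
ROMAN_NUMERALS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """Convert an integer between 1 and 3999 to a Roman numeral string."""
    if not 0 < number < 4000:
        raise ValueError("number must be between 1 and 3999")
    parts = []
    for value, numeral in ROMAN_NUMERALS:
        count, number = divmod(number, value)
        parts.append(numeral * count)
    return "".join(parts)


# Numerals for the 45 Super Bowls covered by the analysis.
super_bowl_numerals = [to_roman(n) for n in range(1, 46)]
print(super_bowl_numerals[-2:])
```

The resulting strings can then be used to build the title of each Super Bowl's Wikipedia page.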

Second, the function used to perform the web scraping. Note that R’s XML package, and the good fortune that the Wikipedia editors have a uniform format for Super Bowl box scores, makes this process very easy.

R Packages used:

Analyze data, save lives, win $3 million

Posted: February 5th, 2011 | Author: | Filed under: Snippets | Tags: , , , | 9 Comments »

Our friends at Kaggle are hosting the Heritage Health Prize. Launching April 4, the competition is seeking an algorithm that can predict patients at high risk for hospital admissions.

It’s difficult to do meaningful work with health data due to a variety of policy, legal, and technical challenges. The success of this contest will be something we can all point to as an indicator that we need to make more mindful decisions about how health data is managed and analyzed.

Who’s up for a dataists team entry?

Our Predictions and Hopes for Data Science in 2011

Posted: January 3rd, 2011 | Author: | Filed under: Opinion | Tags: , , , | 5 Comments »

Happy New Year! 2010 was an amazing year for data science, and we believe that 2011 will truly be the year that data science grows up.

We have a lot to look forward to this year, so without further blather I present to you our top predictions, hopes, and dreams for data science in 2011:

  1. New tools will make data analysis accessible to everyone.

    You currently have to be able to swing some fly command line fu to really get your hands dirty. We’re already starting to see more libraries that make it easier for programmers to analyze data and more visual and non-programming oriented toolkits.

  2. There will be more public data to play with.

    More companies and government organizations will see the value in sharing data, perhaps through contests like Yahoo!’s Learn to Rank Challenge. Individuals will also have more access to data as tools for scraping web data become more accessible and sensors and other hardware become more affordable and easy to use.

  3. There will be progress in tools and techniques for cleaning data.

    As tools become easier to use and more data becomes available, there will be more attention paid to developing focused tools and techniques for the tedious process of cleaning data for analysis.

  4. Educational resources will improve.

    Data science books, courses and online resources will encourage a wider participation in all things data. Hopefully more open source examples of the practice of data science will make such analysis more approachable to first-time data hackers.

  5. As the tools become more sophisticated, the focus will shift from technology toward discovery.

    Much of what was written about data science in 2010 focused on the marvels of modern technology that allow for the analysis of massive stores of data. As these technologies become ubiquitous, more concern will be on the methods of analysis and presentation of findings.

  6. There will be massive growth in data science jobs.

    We’ve already seen a huge demand for people with data analysis skills in the last part of 2010, and we expect this to continue into 2011.

Let’s make 2011 a great year!

This post was collaboratively written by Vince Buffalo, Drew Conway, Mike Dewar, Hilary Mason, and John Myles White.

Ranking the popularity of programming languages

Posted: December 9th, 2010 | Author: | Filed under: outliers | Tags: , , | 73 Comments »

How would you rank the popularity of a programming language? People often discuss which languages are the best, or which are on the rise, but how do we actually measure that?

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has its advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish languages with a vocal minority from those that actually have large communities.

Solution: measure both, and compare.

This week John Myles White and I set out to gather data that measured both the number of projects using various languages, as well as their community sizes. While neither metric has a straightforward means of collection, we decided to exploit data on Github and StackOverflow to measure each respectively. Github provides a popularity ranking for each language based on the number of projects, and using the below R function we were able to collect the number of questions tagged for each language on StackOverflow.

The above chart shows the results of this data collection, where high rank values indicate greater popularity, i.e., the most popular languages on each dimension are in the upper-right of the chart. Even with this simple comparison there are several items of note:

  • Metrics are highly correlated: perhaps unsurprisingly, we find that these ranks have a correlation of almost 0.8. Much less clear, however, is whether extensive use leads to large communities, or vice versa.
  • Popularity is tiered: for those languages that conform to the linear fit, there appears to be clear separation among tiers of popularity, from the “super-popular” cluster in the upper-right, to the more specialized languages in the second tier, and then the niche and deprecated languages in the lower-left.
  • What’s up with VimL and Delphi?: The presence of severe outliers may be an indication of weakness in these measures, but they are interesting to consider nonetheless. How is it possible that VimL could be the 10th most popular language on Github, but have almost no questions on StackOverflow? Is the StackOverflow measure actually picking up the opaqueness of languages rather than the size of their community? That might explain the position of R.
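For readers who want to try this kind of comparison on their own data, a correlation between two rankings can be computed as the Pearson correlation of the ranks (Spearman's rho). The Python sketch below is not the analysis code behind the chart: the function names are ours, and the project and question counts are invented placeholders, not Github or StackOverflow figures.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest) upward, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder counts for five hypothetical languages, for illustration only.
project_counts = [5000, 3000, 1200, 800, 90]
question_counts = [52000, 40000, 9000, 300, 1500]
rho = spearman(project_counts, question_counts)
print(round(rho, 3))
```

Working on ranks rather than raw counts keeps a handful of very large languages from dominating the result.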

We Dataists have a much more shallow language toolkit than is represented in this graph. Having worked with my co-authors a few times, I know we primarily stick to the shell, Python, R stack; and to a lesser extent C, Perl and Ruby, so it is difficult to provide insight as to the position of many of these languages. If you see your favorite language and have a comment, please let us know.

Raw ranking data available here.

Dataists Team Answering Questions on Reddit

Posted: November 8th, 2010 | Author: | Filed under: Opinion | Tags: | 3 Comments »

One of the moderators at the machine learning sub-reddit has asked the Dataists team to do a Q&A with the community. The thread is now live, and we welcome everyone to add questions and comments.

Should be fun!