Strata New York 2011 has just begun, and you can view the livestream here:
James Bridle had an interesting reaction to the revelation that his iPhone was tracking his location: he made a book!
He describes his reaction to his phone’s data collection habits rather poetically:
I love its hunger for new places, the inquisitive sensor blooming in new areas of the city, the way it stripes the streets of Sydney and Udaipur; new to me, new to the machine. It is opening its eyes and looking around, walking the streets beside me with the same surprise.
His book is documented on his site and on flickr.
Editors note: We’d like to invite people with interesting machine learning and data analysis applications to explain the techniques that are working for them in the real world on real data. Accentuate.us is an open-source browser addon that uses machine learning techniques to make it easier for people around the world to communicate.
Many languages around the world use the familiar Latin alphabet (A-Z), but in order to represent the sounds of the language accurately, their writing systems employ diacritical marks and other special characters. For example:
- Vietnamese (Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm),
- Hawaiian (Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo),
- Ewe (Amesiame kpɔ mɔ abu tame le eɖokui si eye wòaɖe eƒe susu agblɔ faa mɔxexe manɔmee),
- and hundreds of others.
Speakers of these languages have difficulty entering text into a computer because keyboards are often not available, and even when they are, typing special characters can be slow and cumbersome. Also, in many cases, speakers may not be completely familiar with the “correct” writing system and may not always know where the special characters belong. The end result is that for many languages, the texts people type in emails, blogs, and social networking sites are left as plain ASCII, omitting any special characters, and leading to ambiguities and confusion.
To solve this problem, we have created a free and open source Firefox add-on called Accentuate.us that allows users to type texts in plain ASCII, and then automatically adds all diacritics and special characters in the correct places–a process we call “Unicodification”. Accentuate.us uses a machine learning approach, employing both character-level and word-level models trained on data crawled from the web for more than 100 languages.
It is easiest to describe our algorithm with an example. Let’s say a user is typing Irish (Gaelic), and they enter the phrase nios mo muinteoiri fiorchliste with no diacritics. For each word in the input, we check to see if it is an “ascii-fied” version of a word that was seen during training.
- In our example, for two of the words, there is exactly one candidate unicodification in the training data: nios is the asciification of the word níos which is very common in our Irish data, and muinteoiri is the asciification of múinteoirí, also very common. As there are no other candidates, we take níos and múinteoirí as the unicodifications.
- There are two possibilities for mo; it could be correct as is, or it could be the asciification of mó. When there is an ambiguity of this kind, we rely on standard word-level n-gram language modeling; in this case, the training data contains many instances of the set phrase níos mó, and no examples of níos mo, so mó is chosen as the correct answer.
- Finally, the word fiorchliste doesn’t appear at all in our training data, so we resort to a character-level model, treating each character that could admit a diacritic as a classification problem. For each language, we train a naive Bayes classifier using trigrams (three character sequences) in a neighborhood of the ambiguous character as features. In this case, the model classifies the first “i” as needing an acute accent, and leaves all other characters as plain ASCII, thereby (correctly) restoring fiorchliste to fíorchliste.
The example above illustrates the ability of the character-level models to handle never-before-seen words; in this particular case fíorchliste is a compound word, and the character sequences in the two pieces fíor and chliste are relatively common in the training data. It is also an effective way of handling morphologically complex languages, where there can be thousands or even millions of forms of any given root word, so many that one is lucky to see even a small fraction of them in a training corpus. But the chances of seeing individual morphemes is much higher, and these are captured reasonably well by the character-level models.
We are far from the first to have studied this problem from the machine learning point of view (full references are given in our paper), but this is the first time that models have been trained for so many languages, and made available in a form that will allow widespread adoption in many language communities.
We have done a detailed evaluation of the performance of the software for all of the languages (all the numbers are in the paper) and this raised a number of interesting issues.
First, we were only able to do this on such a large scale because of the availability of training text on the web in so many languages. But experience has shown that web texts are much noisier than texts found in traditional corpora–does this have an impact on the performance of a statistical systems? The short answer appears to be “yes,” at least for the problem of unicodification. In cases where we had access to high quality corpora of books and newspaper texts, we achieved substantially better performance.
Second, it is probably no surprise that some languages are much harder than others. A simple baseline algorithm is to simply leave everything as plain ASCII, and this performs quite well for languages like Dutch which have only a small number of words containing diacritics (this baseline get 99.3% of words correct for Dutch). In Figure 1 we plot the word-level accuracy of Accentuate.us against this baseline.
But recall there are really two models at play, and we could ask about the relative contribution of, say, the character-level model to the performance of the system. With this in mind, we introduce a second “baseline” which omits the character-level model entirely. More precisely, given an ASCII word as input, it chooses the most common unicodification that was seen in the training data, and leaves the word as ASCII if there were no candidate unicodifications in the training data. In Figure 2 we plot the word-level accuracy of Accentuate.us against this improved baseline. We see that the contribution of the character model is really quite small in most cases, and not surprisingly several of the languages where it helps the most are morphologically quite complex, like Hungarian and Turkish (though Vietnamese is not). In quite a few cases, the character model actually hurts performance, although our analyses show that this is generally due to noise in the training data: a lot of noise in web texts is English (and hence almost pure ASCII) so the baseline will outperform any algorithm that tries to add diacritics.
The Firefox add-on works by communicating with the Accentuate.us web service via its stable API, and we have a number of other clients including a vim plugin (written by fellow St. Louisan Bill Odom) and Perl, Python, and Haskell implementations. We hope that developers interested in supporting language communities around the world will consider integrating this service in their own software.
Please feel free to contact us with any questions, comments, or suggestions.
He looks at factors as varied as traffic on the language mailing lists, number of search results and web site popularity, sales, and finally surveys of use. For example:
It’s interesting to think which of these factors indicate greater adoption. Don’t let me spoil it for you, but R comes out looking good across the board.
The overwhelming theme seems to be a need for tools, visualizations, and a common vocabulary for expressing, exploring, and working with data across disciplines.
Thanks to Chris Wiggins for the pointer.
It’s difficult to do meaningful work with health data due to a variety of policy, legal, and technical challenges. The success of this contest will be something we can all point to as an indicator that we need to make more mindful decisions about how health data is managed and analyzed.
Who’s up for a dataists team entry?
Happy New Year! 2010 was an amazing year for data science, and we believe that 2011 will truly be the year that data science grows up.
We have a lot to look forward to this year, so without further blather I present to you our top
predictions hopes and dreams for data science in 2011:
- New tools will make data analysis accessible to everyone.
You currently have to be able to swing some fly command line fu to really get your hands dirty. We’re already starting to see more libraries that make it easier for programmers to analyze data and more visual and non-programming oriented toolkits.
- There will be more public data to play with.
More companies and government organizations will see the value in sharing data, perhaps through contests like Yahoo!’s Learn to Rank Challenge.Individuals will also have more access to data as tools for scraping web data become more accessible and sensors and other hardware become more affordable and easy to use.
- There will be progress in tools and techniques for cleaning data.
As tools become easier to use and more data becomes available, there will be more attention paid to developing focused tools and techniques for the tedious process of cleaning data for analysis.
- Educational resources will improve.
Data science books, courses and online resources will encourage a wider participation in all things data. Hopefully more open source examples of the practice of data science will make such analysis more approachable to first-time data hackers.
- As the tools become more sophisticated, the focus will shift from technology toward discovery.
Much of what was written about data science in 2010 focused on the marvels of modern technology that allow for the analysis of massive stores of data. As these technologies become ubiquitous, more concern will be on the methods of analysis and presentation of findings.
- There will be massive growth in data science jobs.
We’ve already seen a huge demand for people with data analysis skills in the last part of 2010, and we expect this to continue into 2011.
Let’s make 2011 a great year!
Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?
We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to `look at data’. All of these are partially true, so we thought it would be useful to propose one possible taxonomy — we call it the Snice* taxonomy — of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).
Different data scientists have different levels of expertise with each of these 5 areas, but ideally a data scientist should be at home with them all.
We describe each one of these steps briefly below:
- Obtain: pointing and clicking does not scale.
Getting a list of numbers from a paper via PDF or from within your web browser via copy and paste rarely yields sufficient data to learn something `new’ by exploratory or predictive analytics. Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line, e.g., in a UN*X environment. Shell scripting does suffice for many tasks, but we recommend learning a programming or scripting language which can support automating the retrieval of data and add the ability to make calls asynchronously and manage the resulting data. Python is a current favorite at time of writing (Fall 2010).
APIs are standard interfaces for accessing web applications, and one should be familiar with how to manipulate them (and even identify hidden, ‘internal’ APIs that may be available but not advertised). Rich actions on web sites often use APIs underneath. You have probably generated thousands of API calls already today without even knowing it! APIs are a two-way street: someone has to have written an API — a syntax — for you to interact with it. Typically one then writes a program which can execute commands to obtain these data in a way which respects this syntax. For example, let’s say we wish to query the NYT archive of stories in bash. Here’s a command-line trick for doing so to find stories about Justin Beiber (and the resulting JSON): Now let’s look for stories with the word ‘data’ in the title, but in python:
- Scrub: the world is a messy place
Whether provided by an experimentalist with missing data and inconsistent labels, or via a website with an awkward choice of data formatting, there will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these data is possible. As with Obtaining data, herein a little command line fu and simple scripting can be of great utility. Scrubbing data is the least sexy part of the analysis process, but often one that yields the greatest benefits. A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.
The most basic form of scrubbing data is just making sure that it’s read cleanly, stripped of extraneous characters, and parsed into a usable format. Unfortunately, many data sets are complex and messy. Imagine that you decide to look at something as simple as the geographic distribution of twitter users by self-reported location in their profile. Easy, right? Even people living in the same place may use different text to represent it. Values for people who live in New York City contain “New York, NY”, “NYC”, “New York City”, “Manhattan, NY”, and even more fanciful things like “The Big Apple”. This could be an entire blog post (and will!), but how do you disambiguate it? (Example)
Sed, awk, grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest. Additional skills which may come to play are familiarity with databases, including their syntax for representing data (e.g., JSON, above) and for querying databases.
- Explore: You can see a lot by looking
Visualizing, clustering, performing dimensionality reduction: these are all part of `looking at data.’ These tasks are sometimes described as “exploratory” in that no hypothesis is being tested, no predictions are attempted. Wolfgang Pauli would call these techniques “not even wrong,” though they are hugely useful for getting to know your data. Often such methods inspire predictive analysis methods used later. Tricks to know:
- more or less (though less is more): Yes, that more and less. You can see a lot by looking at your data. Zoom out if you need to, or use unix’s head to view the first few lines, or awk or cut to view the first few fields or characters.
- Single-feature histograms visually render the range of single features and their distribution. Since histograms of real-valued data are contingent on choice of binning, we should remember that they an art project rather than a form of analytics in themselves.
- Similarly, simple feature-pair scatter plots can often reveal characteristics of the data that you miss when just looking at raw numbers.
- Dimensionality reduction (MDS, SVD, PCA, PLS etc): Hugely useful for rendering high-demensional data on the page. In most cases we are performing ‘unsupervised’ dimensionality reduction (as in PCA), in which we find two-dimensional shadows which capture as much variance of the data as possible. Occasionally, low-dimensional regression techniques can provide insight, for example in this review article describing the Netflix Prize which features a scatterplot of movies (Fig. 3) derived from a regression problem in which one wishes to predict users’ movie ratings.
- Clustering: Unsupervised machine learning techniques for grouping observations; this can include grouping nodes of a graph into “modules” or “communities”, or inferring latent variable assignments in a generative model with latent structure (e.g., Gaussian mixture modeling, or K-means, which can be derived via a limiting case of Gaussian mixture modeling).
- Models: always bad, sometimes ugly
Whether in the natural sciences, in engineering, or in data-rich startups, often the ‘best’ model is the most predictive model. E.g., is it `better’ to fit one’s data to a straight line or a fifth-order polynomial? Should one combine a weighted sum of 10 rules or 10,000? One way of framing such questions of model selection is to remember why we build models in the first place: to predict and to interpret. While the latter is difficult to quantify, the former can be framed not only quantitatively but empirically. That is, armed with a corpus of data, one can leave out a fraction of the data (the “validation” data or “test set”), learn/optimize a model using the remaining data (the “learning” data or “training set”) by minimizing a chosen loss function (e.g., squared loss, hinge loss, or exponential loss), and evaluate this or another loss function on the validation data. Comparing the value of this loss function for models of differing complexity yields the model complexity which minimizes generalization error. The above process is sometimes called “empirical estimation of generalization error” but typically goes by its nickname: “cross validation.” Validation does not necessarily mean the model is “right.” As Box warned us, “all models are wrong, but some are useful”. Here, we are choosing from among a set of allowed models (the `hypothesis space’, e.g., the set of 3rd, 4th, and 5th order polynomials) which model complexity maximizes predictive power and is thus the least bad among our choices.
Above we mentioned that models are built to predict and to interpret. While the former can be assessed quantitatively (`more predictive’ is `less bad’) the latter is a matter of which is less ugly, and is in the mind of the beholder. Which brings us to…
- iNterpret: “The purpose of computing is insight, not numbers.”
Consider the task of automated digit recognition. The value of an algorithm which can predict ‘4’ and distinguish from ‘5’ is assessed by its predictive power, not on theoretical elegance; the goal of machine learning for digit recognition is not to build a theory of ‘3.’ However, in the natural sciences, the ability to predict complex phenomena is different from what most mean by ‘understanding’ or ‘interpreting.’
The predictive power of a model lies in its ability to generalize in the quantitative sense: to make accurate quantitative predictions of data in new experiments. The interpretability of a model lies in its ability to generalize in the qualitative sense: to suggest to the modeler which would be the most interesting experiments to perform next.
The world rarely hands us numbers; more often the world hands us clickstreams, text, graphs, or images. Interpretable modeling in data science begins with choosing a natural set of input features — e.g., choosing a representation of text in terms of a bag-of-words, rather than bag-of-letters; choosing a representation of a graph in terms of subgraphs rather than the spectrum of the Laplacian. In this step, domain expertise and intuition can be more important than technical or coding expertise. Next one chooses a hypothesis space, e.g., linear combinations of these features vs. exponentiated products of special functions or lossy hashes of these features’ values. Each of these might have advantages in terms of computational complexity vs interpretability. Finally one chooses a learning/optimization algorithm, sometimes including a “regularization” term (which penalizes model complexity but does not involve observed data). For example, interpretability can be aided by learning by boosting or with an L1 penalty to yield sparse models; in this case, models which can be described in terms of a comprehensible number of nonzero weights of, ideally, individually-interpretable features. Rest assured that interpretability in data science is not merely a desideratum for the natural scientist.
Startups building products without the perspective of multi-year research cycles are often both exploring the data and constructing systems on the data at the same time. Interpretable models offer the benefits of producing useful products while at the same time suggesting which directions are best to explore next.
For example, at bit.ly, we recently completed a project to classify popular content by click patterns over time and topic. In most cases, topic identification was straightforward, e.g., identifying celebrity gossip (you can imagine those features!). One particular click pattern was difficult to interpret, however; with further exploration we realized that people were using bit.ly links on images embedded in a page in order to study their own real-time metrics. Each page load counted as a ‘click’ (the page content itself was irrelevant), and we discovered a novel use case ‘in the wild’ for our product.
Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for the analysis to be interpretable (that is, one needs to understand the domain in which the data were generated, but also the mathematical operations performed during the “learning” and “optimization”). It requires creative decisions and open-mindedness in a scientific context.
Our next post addresses how one goes about learning these skills, that is: “what does a data science curriculum look like?”
Thanks to Mike Dewar for comments on an earlier draft of this.