Snippet: Science special collection on Dealing with Data

Posted: February 15th, 2011 | Filed under: Snippets

The February edition of Science offers a special collection of articles from scientists in a variety of fields on the challenges and opportunities of working with large amounts of data.

The overwhelming theme seems to be a need for tools, visualizations, and a common vocabulary for expressing, exploring, and working with data across disciplines.

Thanks to Chris Wiggins for the pointer.


The Data Science Venn Diagram

Posted: September 30th, 2010 | Filed under: Philosophy of Data

by Drew Conway

Last Monday I humbly joined a group of NYC’s most sophisticated thinkers on all things data for a half-day unconference to help O’Reilly organize their upcoming Strata conference. The breakout sessions were fantastic, and the number of people in each allowed for outstanding, expert-driven discussions. One of the best sessions I attended focused on issues related to teaching data science, which inevitably led to a discussion on the skills needed to be a fully competent data scientist.

As I have said before, I think the term “data science” is a bit of a misnomer, but I was very hopeful after this discussion; mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits.

What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram.

Data science Venn diagram

How to read the Data Science Venn Diagram

The primary colors of data: hacking skills, math and stats knowledge, and substantive expertise

  • On Monday we spent a lot of time talking about “where” a course on data science might exist at a university. The conversation was largely rhetorical, as everyone was well aware of the inherently interdisciplinary nature of these skills; but then, why have I highlighted these three? First, none is discipline specific. More importantly, each of these skills is very valuable on its own, but when combined with only one other it is at best simply not data science, or at worst downright dangerous.
  • For better or worse, data is a commodity traded electronically; therefore, in order to be in this market you need to speak hacker. This, however, does not require a background in computer science; in fact, many of the most impressive hackers I have met never took a single CS course. Being able to manipulate text files at the command line, understanding vectorized operations, thinking algorithmically: these are the hacking skills that make for a successful data hacker.
  • Once you have acquired and cleaned the data, the next step is to actually extract insight from it. In order to do this, you need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools. This is not to say that a PhD in statistics is required to be a competent data scientist, but it does require knowing what an ordinary least squares regression is and how to interpret it.
  • The third critical piece, substance, is where my thoughts on data science diverge from most of what has already been written on the topic. To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. On the flip side, substantive expertise plus math and statistics knowledge is where most traditional researchers fall. Doctoral-level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology. Part of this is the culture of academia, which does not reward researchers for understanding technology. That said, I have met many young academics and graduate students who are eager to buck that tradition.
  • Finally, a word on the hacking skills plus substantive expertise danger zone. This is where I place people who “know enough to be dangerous,” and it is the most problematic area of the diagram. In this area are people who are perfectly capable of extracting and structuring data, likely related to a field they know quite a bit about, and who probably even know enough R to run a linear regression and report the coefficients, but who lack any understanding of what those coefficients mean. It is from this part of the diagram that the phrase “lies, damned lies, and statistics” emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created. Fortunately, it requires near-willful ignorance to acquire hacking skills and substantive expertise without also learning some math and statistics along the way. As such, the danger zone is sparsely populated; however, it does not take many to produce a lot of damage.
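
To make the point about ordinary least squares a little more concrete, here is a minimal sketch of a one-variable fit in plain Python. The toy data and the function name `ols_fit` are invented for illustration and are not from the discussion above.

```python
def ols_fit(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

intercept, slope = ols_fit(hours, scores)
print(f"score = {intercept:.2f} + {slope:.2f} * hours")
```

Reporting the two numbers is the easy part. The interpretation is that, in this made-up sample, each additional hour is associated with about `slope` more points on average, which is a statement about association in these data and not, by itself, about cause.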

I hope this brief illustration has provided some clarity into what data science is and what it takes to get there. Considering these questions at a high level prevents the discussion from degrading into minutiae, such as specific tools or platforms, which I think hurts the conversation.

I am sure I have overlooked many important things, but again the purpose was not to be specific. As always, I welcome any and all comments.

Cross-posted at Zero Intelligence Agents


A Taxonomy of Data Science

Posted: September 25th, 2010 | Filed under: Philosophy of Data
by Hilary Mason and Chris Wiggins

Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?

We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to ‘look at data’. All of these are partially true, so we thought it would be useful to propose one possible taxonomy (we call it the Snice* taxonomy) of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).

Different data scientists have different levels of expertise in each of these five areas, but ideally a data scientist should be at home with them all.

We describe each one of these steps briefly below:

  1. Obtain: pointing and clicking does not scale.

    Getting a list of numbers from a paper via PDF or from within your web browser via copy and paste rarely yields sufficient data to learn something ‘new’ by exploratory or predictive analytics. Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line, e.g., in a UN*X environment. Shell scripting does suffice for many tasks, but we recommend learning a programming or scripting language which can support automating the retrieval of data and add the ability to make calls asynchronously and manage the resulting data. Python is a current favorite at time of writing (Fall 2010).

    APIs are standard interfaces for accessing web applications, and one should be familiar with how to manipulate them (and even identify hidden, ‘internal’ APIs that may be available but not advertised). Rich actions on web sites often use APIs underneath. You have probably generated thousands of API calls already today without even knowing it! APIs are a two-way street: someone has to have written an API (a syntax) for you to interact with it. Typically one then writes a program which can execute commands to obtain these data in a way which respects this syntax. For example, one might query the NYT archive of stories from bash, using a command-line one-liner to find stories about Justin Bieber and returning the results as JSON, or look for stories with the word ‘data’ in the title from Python.
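
As a rough illustration of the pattern described above (build a query that respects the API's syntax, fetch it, then manage the resulting JSON), here is a sketch in Python. The endpoint `https://api.example.com/search`, the parameter names `q` and `api-key`, and the response layout are all hypothetical placeholders and are not the NYT API; a real service documents its own URL, parameters, and key handling.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/search"


def build_query_url(query, api_key, base_url=BASE_URL):
    """Encode the search terms into a URL that follows the (assumed) API syntax."""
    return base_url + "?" + urlencode({"q": query, "api-key": api_key})


def fetch_json(url):
    """Perform the HTTP GET and decode the JSON body (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def titles_containing(payload, word):
    """Pick titles mentioning `word` from an assumed {"results": [{"title": ...}]} layout."""
    return [
        item["title"]
        for item in payload.get("results", [])
        if word.lower() in item.get("title", "").lower()
    ]


# Offline demonstration with a canned response instead of a live call.
sample = json.loads(
    '{"results": [{"title": "Big Data Comes to Town"}, {"title": "Local News"}]}'
)
print(build_query_url("data science", "YOUR_KEY"))
print(titles_containing(sample, "data"))
```

Keeping URL construction, fetching, and filtering in separate functions means the parsing logic can be checked against saved responses without hitting the service on every run.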

  2. Scrub: the world is a messy place

    Whether provided by an experimentalist with missing data and inconsistent labels, or via a website with an awkward choice of data formatting, there will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these data is possible. As with Obtaining data, herein a little command line fu and simple scripting can be of great utility. Scrubbing data is the least sexy part of the analysis process, but often one that yields the greatest benefits. A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.

    The most basic form of scrubbing data is just making sure that it’s read cleanly, stripped of extraneous characters, and parsed into a usable format. Unfortunately, many data sets are complex and messy. Imagine that you decide to look at something as simple as the geographic distribution of Twitter users by self-reported location in their profile. Easy, right? Even people living in the same place may use different text to represent it. Values for people who live in New York City include “New York, NY”, “NYC”, “New York City”, “Manhattan, NY”, and even more fanciful things like “The Big Apple”. This could be an entire blog post (and will!), but how do you disambiguate it?

    Sed, awk, and grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest. Additional skills which may come into play are familiarity with databases, including their syntax for representing data (e.g., JSON, above) and for querying databases.
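
One very simple way to approach the New York City example is to normalize each string and then look it up in a hand-built alias table. The sketch below assumes exactly that; the alias set and the function name `normalize_location` are illustrative, and a real disambiguation effort would need a much larger table or a gazetteer.

```python
import re
from collections import Counter

# Hypothetical alias table; a real one would be far larger and curated by hand.
NYC_ALIASES = {"new york ny", "nyc", "new york city", "manhattan ny", "the big apple"}


def normalize_location(raw):
    """Lowercase, drop punctuation, collapse whitespace, then map known aliases."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", raw.lower())
    cleaned = " ".join(cleaned.split())
    return "New York City" if cleaned in NYC_ALIASES else cleaned


raw_locations = [
    "New York, NY",
    "NYC",
    "  new york city ",
    "Manhattan, NY",
    "The Big Apple",
    "Boston, MA",
]
counts = Counter(normalize_location(loc) for loc in raw_locations)
print(counts)
```

Anything not in the alias table passes through in its cleaned form, so unmatched values stay visible for inspection instead of being silently dropped.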

  3. Explore: You can see a lot by looking

    Visualizing, clustering, performing dimensionality reduction: these are all part of ‘looking at data.’ These tasks are sometimes described as “exploratory” in that no hypothesis is being tested, no predictions are attempted. Wolfgang Pauli would call these techniques “not even wrong,” though they are hugely useful for getting to know your data. Often such methods inspire predictive analysis methods used later. Tricks to know:

    • more or less (though less is more): Yes, that more and less. You can see a lot by looking at your data. Zoom out if you need to, or use Unix’s head to view the first few lines, or awk or cut to view the first few fields or characters.
    • Single-feature histograms visually render the range of single features and their distribution. Since histograms of real-valued data are contingent on choice of binning, we should remember that they are an art project rather than a form of analytics in themselves.
    • Similarly, simple feature-pair scatter plots can often reveal characteristics of the data that you miss when just looking at raw numbers.
    • Dimensionality reduction (MDS, SVD, PCA, PLS, etc.): Hugely useful for rendering high-dimensional data on the page. In most cases we are performing ‘unsupervised’ dimensionality reduction (as in PCA), in which we find two-dimensional shadows which capture as much variance of the data as possible. Occasionally, low-dimensional regression techniques can provide insight, for example in this review article describing the Netflix Prize which features a scatterplot of movies (Fig. 3) derived from a regression problem in which one wishes to predict users’ movie ratings.
    • Clustering: Unsupervised machine learning techniques for grouping observations; this can include grouping nodes of a graph into “modules” or “communities”, or inferring latent variable assignments in a generative model with latent structure (e.g., Gaussian mixture modeling, or K-means, which can be derived via a limiting case of Gaussian mixture modeling).
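
The remark above that histograms depend on the choice of binning can be seen with a few lines of Python. This is a sketch on made-up numbers; the helper `histogram` is written here for illustration and is not a library function.

```python
def histogram(values, n_bins, lo, hi):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that v == hi lands in the last bin rather than out of range.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up sample with two clumps, one near 1 and one near 3.
data = [0.9, 1.0, 1.1, 1.2, 2.9, 3.0, 3.1, 3.2]

for n_bins in (1, 2, 4):
    counts = histogram(data, n_bins, lo=0.0, hi=4.0)
    print(n_bins, counts, " ".join("#" * c for c in counts))
```

The same eight numbers produce three different pictures depending only on `n_bins`, which is why it helps to try several binnings before reading much into any one histogram.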
  4. Models: always bad, sometimes ugly

    Whether in the natural sciences, in engineering, or in data-rich startups, often the ‘best’ model is the most predictive model. For example, is it ‘better’ to fit one’s data to a straight line or a fifth-order polynomial? Should one combine a weighted sum of 10 rules or 10,000? One way of framing such questions of model selection is to remember why we build models in the first place: to predict and to interpret. While the latter is difficult to quantify, the former can be framed not only quantitatively but empirically. That is, armed with a corpus of data, one can leave out a fraction of the data (the “validation” data or “test set”), learn/optimize a model using the remaining data (the “learning” data or “training set”) by minimizing a chosen loss function (e.g., squared loss, hinge loss, or exponential loss), and evaluate this or another loss function on the validation data. Comparing the value of this loss function for models of differing complexity yields the model complexity which minimizes generalization error. The above process is sometimes called “empirical estimation of generalization error” but typically goes by its nickname: “cross validation.” Validation does not necessarily mean the model is “right.” As Box warned us, “all models are wrong, but some are useful”. Here, we are choosing from among a set of allowed models (the ‘hypothesis space’, e.g., the set of 3rd, 4th, and 5th order polynomials) which model complexity maximizes predictive power and is thus the least bad among our choices.

    Above we mentioned that models are built to predict and to interpret. While the former can be assessed quantitatively (‘more predictive’ is ‘less bad’) the latter is a matter of which is less ugly, and is in the mind of the beholder. Which brings us to…
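
The held-out procedure described above can be sketched in plain Python for the polynomial example. Everything specific here is an assumption made for the demo: the synthetic quadratic data, the noise level, the single fixed train/validation split, and the small normal-equations solver in `fit_polynomial` (a numerical library would normally do the fitting).

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (fine for tiny problems)."""
    size = degree + 1
    # Augmented matrix [A^T A | A^T y], where A[i][j] = xs[i] ** j.
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(size)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(size):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][size] / rows[i][i] for i in range(size)]


def squared_loss(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    predictions = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)


# Synthetic data: a quadratic plus Gaussian noise (an assumption for this demo).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(60)]
ys = [1.0 + 2.0 * x - 3.0 * x ** 2 + rng.gauss(0.0, 0.1) for x in xs]

# Hold out one third of the data as the validation set.
train_x, train_y = xs[:40], ys[:40]
valid_x, valid_y = xs[40:], ys[40:]

validation_loss = {}
for degree in range(6):
    coeffs = fit_polynomial(train_x, train_y, degree)
    validation_loss[degree] = squared_loss(coeffs, valid_x, valid_y)
    print(degree, round(validation_loss[degree], 4))

best_degree = min(validation_loss, key=validation_loss.get)
print("degree with lowest validation loss:", best_degree)
```

The hypothesis space here is polynomials of degree 0 through 5, and the selected degree is simply the one with the smallest loss on data the fit never saw. Repeating this over several different splits and averaging the losses gives a steadier estimate than the single split shown.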

  5. iNterpret: “The purpose of computing is insight, not numbers.”

    Consider the task of automated digit recognition. The value of an algorithm which can predict ‘4’ and distinguish it from ‘5’ is assessed by its predictive power, not on theoretical elegance; the goal of machine learning for digit recognition is not to build a theory of ‘3.’ However, in the natural sciences, the ability to predict complex phenomena is different from what most mean by ‘understanding’ or ‘interpreting.’

    The predictive power of a model lies in its ability to generalize in the quantitative sense: to make accurate quantitative predictions of data in new experiments. The interpretability of a model lies in its ability to generalize in the qualitative sense: to suggest to the modeler which would be the most interesting experiments to perform next.

    The world rarely hands us numbers; more often the world hands us clickstreams, text, graphs, or images. Interpretable modeling in data science begins with choosing a natural set of input features — e.g., choosing a representation of text in terms of a bag-of-words, rather than bag-of-letters; choosing a representation of a graph in terms of subgraphs rather than the spectrum of the Laplacian. In this step, domain expertise and intuition can be more important than technical or coding expertise. Next one chooses a hypothesis space, e.g., linear combinations of these features vs. exponentiated products of special functions or lossy hashes of these features’ values. Each of these might have advantages in terms of computational complexity vs interpretability. Finally one chooses a learning/optimization algorithm, sometimes including a “regularization” term (which penalizes model complexity but does not involve observed data). For example, interpretability can be aided by learning by boosting or with an L1 penalty to yield sparse models; in this case, models which can be described in terms of a comprehensible number of nonzero weights of, ideally, individually-interpretable features. Rest assured that interpretability in data science is not merely a desideratum for the natural scientist.
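
The choice between a bag-of-words and a bag-of-letters representation mentioned above can be illustrated with a short sketch. The three sentences and both helper functions are made up for this example; real text pipelines would also handle punctuation, stemming, and so on.

```python
from collections import Counter


def bag_of_words(text):
    """Represent text as counts of lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def bag_of_letters(text):
    """Represent text as counts of alphabetic characters, ignoring word boundaries."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


doc_a = "the dog bit the man"
doc_b = "the man bit the dog"
doc_c = "the god bit the man"

# Both representations discard order, so doc_a and doc_b are indistinguishable.
print(bag_of_words(doc_a) == bag_of_words(doc_b))      # True
# Letter counts cannot tell "dog" from "god" ...
print(bag_of_letters(doc_a) == bag_of_letters(doc_c))  # True
# ... while word counts can.
print(bag_of_words(doc_a) == bag_of_words(doc_c))      # False
```

A weight learned on the feature `dog` is something a person can read and reason about, whereas a weight on the letter `d` rarely is, which is one sense in which the input representation shapes how interpretable the resulting model can be.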

    Startups building products without the perspective of multi-year research cycles are often both exploring the data and constructing systems on the data at the same time. Interpretable models offer the benefits of producing useful products while at the same time suggesting which directions are best to explore next.

    For example, at bit.ly, we recently completed a project to classify popular content by click patterns over time and topic. In most cases, topic identification was straightforward, e.g., identifying celebrity gossip (you can imagine those features!). One particular click pattern was difficult to interpret, however; with further exploration we realized that people were using bit.ly links on images embedded in a page in order to study their own real-time metrics. Each page load counted as a ‘click’ (the page content itself was irrelevant), and we discovered a novel use case ‘in the wild’ for our product.

Deep thoughts:

Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for the analysis to be interpretable (that is, one needs to understand the domain in which the data were generated, but also the mathematical operations performed during the “learning” and “optimization”). It requires creative decisions and open-mindedness in a scientific context.

Our next post addresses how one goes about learning these skills, that is: “what does a data science curriculum look like?”

* named after Snice, our favorite NYC café, where this blog post was hatched.

Thanks to Mike Dewar for comments on an earlier draft of this.