A Taxonomy of Data Science

Posted: September 25th, 2010 | Filed under: Philosophy of Data | 73 Comments
by Hilary Mason and Chris Wiggins

Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?

We’ve variously heard it said that data science requires some command-line fu for data procurement and preprocessing, or that one needs to know some machine learning or stats, or that one should know how to `look at data’. All of these are partially true, so we thought it would be useful to propose one possible taxonomy — we call it the Snice* taxonomy — of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).

Different data scientists have different levels of expertise with each of these 5 areas, but ideally a data scientist should be at home with them all.

We describe each one of these steps briefly below:

  1. Obtain: pointing and clicking does not scale.

    Getting a list of numbers from a paper via PDF or from within your web browser via copy and paste rarely yields sufficient data to learn something `new’ by exploratory or predictive analytics. Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line, e.g., in a UN*X environment. Shell scripting does suffice for many tasks, but we recommend learning a programming or scripting language which can support automating the retrieval of data and add the ability to make calls asynchronously and manage the resulting data. Python is a current favorite at time of writing (Fall 2010). 

    APIs are standard interfaces for accessing web applications, and one should be familiar with how to manipulate them (and even identify hidden, ‘internal’ APIs that may be available but not advertised). Rich actions on web sites often use APIs underneath. You have probably generated thousands of API calls already today without even knowing it! APIs are a two-way street: someone has to have written an API (a syntax) for you to interact with it. Typically one then writes a program which can execute commands to obtain these data in a way which respects this syntax. For example, let’s say we wish to query the NYT archive of stories in bash. Here’s a command-line trick for doing so to find stories about Justin Bieber (and the resulting JSON):

    Now let’s look for stories with the word ‘data’ in the title, but in Python:
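    As a rough illustration of the pattern (not the original example from this post), here is a minimal sketch using only the Python standard library. The endpoint URL, the `api-key` parameter name, and the `results`/`title` fields are all hypothetical placeholders; a real API will document its own query syntax and response layout.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; substitute the real API's URL.
BASE_URL = "https://api.example.com/articles/search"


def build_query_url(query, api_key, base_url=BASE_URL):
    """Encode the search terms into a URL that respects the API's syntax."""
    return base_url + "?" + urlencode({"q": query, "api-key": api_key})


def titles_containing(raw_json, word):
    """Parse a JSON response and keep titles containing `word` (substring match)."""
    payload = json.loads(raw_json)
    return [
        item["title"]
        for item in payload.get("results", [])
        if word.lower() in item.get("title", "").lower()
    ]


# Made-up response body standing in for what a server might return.
sample_response = json.dumps(
    {"results": [{"title": "Big Data Grows Up"}, {"title": "Local Election Results"}]}
)

url = build_query_url("data", "YOUR_KEY")
matches = titles_containing(sample_response, "data")
```

    In practice the response body would come from something like `urllib.request.urlopen(url).read()`; the canned string above just keeps the sketch runnable offline.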

  2. Scrub: the world is a messy place

    Whether provided by an experimentalist with missing data and inconsistent labels, or via a website with an awkward choice of data formatting, there will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these data is possible. As with Obtaining data, herein a little command line fu and simple scripting can be of great utility. Scrubbing data is the least sexy part of the analysis process, but often one that yields the greatest benefits. A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.

    The most basic form of scrubbing data is just making sure that it’s read cleanly, stripped of extraneous characters, and parsed into a usable format. Unfortunately, many data sets are complex and messy. Imagine that you decide to look at something as simple as the geographic distribution of Twitter users by self-reported location in their profile. Easy, right? Not quite: even people living in the same place may use different text to represent it. Values for people who live in New York City include “New York, NY”, “NYC”, “New York City”, “Manhattan, NY”, and even more fanciful things like “The Big Apple”. This could be an entire blog post (and will be!), but how do you disambiguate it?

    Sed, awk, and grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest. Additional skills which may come into play are familiarity with databases, including their syntax for representing data (e.g., JSON, above) and for querying databases.
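    To make the New York example concrete, here is a minimal sketch of one simple approach: strip punctuation and case, then map known aliases onto a canonical label. The alias table is hand-written and covers only the strings quoted above, which is an assumption for illustration; real disambiguation needs a far larger table or a geocoding service.

```python
import re
from collections import Counter

# Hand-built alias table (illustrative only; real data needs many more entries).
NYC_ALIASES = {
    "new york ny",
    "nyc",
    "new york city",
    "manhattan ny",
    "the big apple",
}


def normalize_location(raw):
    """Map a free-text location to a canonical label when we recognize it."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())  # punctuation -> spaces
    cleaned = " ".join(cleaned.split())             # collapse whitespace
    if cleaned in NYC_ALIASES:
        return "New York City"
    return raw.strip()  # unknown values pass through, trimmed


def count_locations(raw_values):
    """Tally users per normalized location."""
    return Counter(normalize_location(value) for value in raw_values)


counts = count_locations(["NYC", "New York, NY", "The Big Apple", "Boston, MA"])
```

    Unrecognized values are passed through rather than guessed at, so the remaining mess stays visible in the tallies instead of being silently merged.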

  3. Explore: You can see a lot by looking

    Visualizing, clustering, performing dimensionality reduction: these are all part of `looking at data.’ These tasks are sometimes described as “exploratory” in that no hypothesis is being tested, no predictions are attempted. Wolfgang Pauli would call these techniques “not even wrong,” though they are hugely useful for getting to know your data. Often such methods inspire predictive analysis methods used later. Tricks to know:

    • more or less (though less is more): Yes, that more and less. You can see a lot by looking at your data. Zoom out if you need to, or use unix’s head to view the first few lines, or awk or cut to view the first few fields or characters.
    • Single-feature histograms visually render the range of single features and their distribution. Since histograms of real-valued data are contingent on choice of binning, we should remember that they are an art project rather than a form of analytics in themselves.
    • Similarly, simple feature-pair scatter plots can often reveal characteristics of the data that you miss when just looking at raw numbers.
    • Dimensionality reduction (MDS, SVD, PCA, PLS, etc.): Hugely useful for rendering high-dimensional data on the page. In most cases we are performing ‘unsupervised’ dimensionality reduction (as in PCA), in which we find two-dimensional shadows which capture as much variance of the data as possible. Occasionally, low-dimensional regression techniques can provide insight, for example in this review article describing the Netflix Prize which features a scatterplot of movies (Fig. 3) derived from a regression problem in which one wishes to predict users’ movie ratings.
    • Clustering: Unsupervised machine learning techniques for grouping observations; this can include grouping nodes of a graph into “modules” or “communities”, or inferring latent variable assignments in a generative model with latent structure (e.g., Gaussian mixture modeling, or K-means, which can be derived via a limiting case of Gaussian mixture modeling).
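    The point about binning in the histogram item above is easy to see in a few lines. This is a minimal sketch with equal-width bins and made-up toy values; the `histogram` helper is our own illustration, not a library function.

```python
def histogram(values, n_bins):
    """Count values in `n_bins` equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value falls in the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Toy data with a gap between 3 and 7.
values = [1, 2, 2, 3, 3, 3, 7, 8, 8, 9]

coarse = histogram(values, 2)  # two wide bins
fine = histogram(values, 4)    # four narrower bins
```

    With two bins the data look like two adjacent lumps; with four bins an empty bin appears and the gap in the data becomes visible. Same data, different picture.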

  4. Models: always bad, sometimes ugly

    Whether in the natural sciences, in engineering, or in data-rich startups, often the ‘best’ model is the most predictive model. E.g., is it `better’ to fit one’s data to a straight line or a fifth-order polynomial? Should one combine a weighted sum of 10 rules or 10,000? One way of framing such questions of model selection is to remember why we build models in the first place: to predict and to interpret. While the latter is difficult to quantify, the former can be framed not only quantitatively but empirically. That is, armed with a corpus of data, one can leave out a fraction of the data (the “validation” data or “test set”), learn/optimize a model using the remaining data (the “learning” data or “training set”) by minimizing a chosen loss function (e.g., squared loss, hinge loss, or exponential loss), and evaluate this or another loss function on the validation data. Comparing the value of this loss function for models of differing complexity yields the model complexity which minimizes generalization error. The above process is sometimes called “empirical estimation of generalization error” but typically goes by its nickname: “cross validation.” Validation does not necessarily mean the model is “right.” As Box warned us, “all models are wrong, but some are useful”. Here, we are choosing from among a set of allowed models (the `hypothesis space’, e.g., the set of 3rd, 4th, and 5th order polynomials) which model complexity maximizes predictive power and is thus the least bad among our choices.
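    The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions: a single random hold-out split (rather than repeated folds), squared loss for both fitting and evaluation, polynomials fit via the normal equations, and noise-free toy data from a parabola. All function names here are our own.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients (lowest order first) via the normal equations."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def squared_loss(coeffs, xs, ys):
    """Mean squared error of the polynomial on the given points."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def holdout_losses(xs, ys, degrees, validation_fraction=0.3, seed=0):
    """Validation loss for each candidate degree, using one random split."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(xs) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    val_x = [xs[i] for i in val_idx]
    val_y = [ys[i] for i in val_idx]
    return {
        d: squared_loss(fit_polynomial(train_x, train_y, d), val_x, val_y)
        for d in degrees
    }


# Toy data: a noise-free parabola, so a straight line underfits.
xs = [i / 10 - 1 for i in range(21)]
ys = [x * x for x in xs]
losses = holdout_losses(xs, ys, degrees=[1, 2, 3])
best_degree = min(losses, key=losses.get)
```

    On this toy data the straight line has a clearly nonzero validation loss while the quadratic and cubic are essentially exact. With noisy real data the higher-order fits would start to lose on the held-out points, which is the trade-off the validation loss is meant to expose.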

    Above we mentioned that models are built to predict and to interpret. While the former can be assessed quantitatively (`more predictive’ is `less bad’) the latter is a matter of which is less ugly, and is in the mind of the beholder. Which brings us to…

  5. iNterpret: “The purpose of computing is insight, not numbers.”

    Consider the task of automated digit recognition. The value of an algorithm which can predict ‘4’ and distinguish it from ‘5’ is assessed by its predictive power, not its theoretical elegance; the goal of machine learning for digit recognition is not to build a theory of ‘3.’ However, in the natural sciences, the ability to predict complex phenomena is different from what most mean by ‘understanding’ or ‘interpreting.’

    The predictive power of a model lies in its ability to generalize in the quantitative sense: to make accurate quantitative predictions of data in new experiments. The interpretability of a model lies in its ability to generalize in the qualitative sense: to suggest to the modeler which would be the most interesting experiments to perform next.

    The world rarely hands us numbers; more often the world hands us clickstreams, text, graphs, or images. Interpretable modeling in data science begins with choosing a natural set of input features — e.g., choosing a representation of text in terms of a bag-of-words, rather than bag-of-letters; choosing a representation of a graph in terms of subgraphs rather than the spectrum of the Laplacian. In this step, domain expertise and intuition can be more important than technical or coding expertise. Next one chooses a hypothesis space, e.g., linear combinations of these features vs. exponentiated products of special functions or lossy hashes of these features’ values. Each of these might have advantages in terms of computational complexity vs interpretability. Finally one chooses a learning/optimization algorithm, sometimes including a “regularization” term (which penalizes model complexity but does not involve observed data). For example, interpretability can be aided by learning by boosting or with an L1 penalty to yield sparse models; in this case, models which can be described in terms of a comprehensible number of nonzero weights of, ideally, individually-interpretable features. Rest assured that interpretability in data science is not merely a desideratum for the natural scientist.
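    One way to see why an L1 penalty yields sparse models is the soft-thresholding operator. For the one-dimensional problem of minimizing 0.5 * (w - z)^2 + penalty * |w| over w, the minimizer shrinks z toward zero by `penalty` and sets it exactly to zero when |z| is at most `penalty`. The sketch below applies that operator to a made-up weight vector; the weights and the penalty value are illustrative assumptions, and a full L1-regularized fit would apply this step repeatedly inside an optimizer.

```python
def soft_threshold(z, penalty):
    """Minimizer of 0.5 * (w - z)**2 + penalty * abs(w) over w."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0  # small weights are set exactly to zero


# Made-up unpenalized weights for four features.
weights = [2.0, 0.3, -0.1, -1.5]
sparse_weights = [soft_threshold(w, 0.5) for w in weights]
nonzero_features = [i for i, w in enumerate(sparse_weights) if w != 0.0]
```

    Only the features whose weights survive the threshold remain in the model, which is what makes the result describable in terms of a comprehensible number of nonzero weights.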

    Startups building products without the perspective of multi-year research cycles are often both exploring the data and constructing systems on the data at the same time. Interpretable models offer the benefits of producing useful products while at the same time suggesting which directions are best to explore next.

    For example, at bit.ly, we recently completed a project to classify popular content by click patterns over time and topic. In most cases, topic identification was straightforward, e.g., identifying celebrity gossip (you can imagine those features!). One particular click pattern was difficult to interpret, however; with further exploration we realized that people were using bit.ly links on images embedded in a page in order to study their own real-time metrics. Each page load counted as a ‘click’ (the page content itself was irrelevant), and we discovered a novel use case ‘in the wild’ for our product.

Deep thoughts:

Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for the analysis to be interpretable (that is, one needs to understand the domain in which the data were generated, but also the mathematical operations performed during the “learning” and “optimization”). It requires creative decisions and open-mindedness in a scientific context.

Our next post addresses how one goes about learning these skills, that is: “what does a data science curriculum look like?”

* named after Snice, our favorite NYC café, where this blog post was hatched.

Thanks to Mike Dewar for comments on an earlier draft of this.


  • Mary Z

    This is awesome and informative! Can’t wait to read the next blog entry about skills.

  • http://topsy.com/www.dataists.com/2010/09/a-taxonomy-of-data-science/?utm_source=pingback&utm_campaign=L2 Tweets that mention dataists » Blog Archive » A Taxonomy of Data Science — Topsy.com

    [...] This post was mentioned on Twitter by Golan Levin, Hilary Mason, Mike Olson, Harlan Harris, electronic max and others. electronic max said: RT @golan: mandatory for infovizzers: http://dataists.com/2010/09/a-taxonomy-of-data-science (compare "7 stages" in http://benfry.com/phd) [...]

  • A.A.

    Thanks for this post! It’s a great read, and I’m looking forward to the next ones.

    One thing that bugs me though – the compressed link urls. I like to hover over a link and see where it leads, so I can decide whether it’s worth my time.

  • http://bit.ly/5bip23 kaes

    very interesting article.

    but please don’t loop all your in-text links through bit.ly again, that way people can’t see where they go by hovering on them, which makes for a very bad user experience.

    i don’t see the reason for shortening links to wikipedia in a blog post, anyway?

  • http://blog.stodden.net/2010/09/26/startups-awash-in-data-quantitative-thinkers-needed/ Startups Awash in Data: Quantitative Thinkers Needed « Victoria Stodden

    [...] and Chris created an excellent guideline for data-driven analysis in the startup context, “A Taxonomy of Data Science:” Obtain, Scrub, Explore, Model, and iNterpret. These data are often measuring phenomena in [...]

  • http://www.blog.arghh.net/aj/?p=622 pinboard September 26, 2010 — arghh.net

    [...] dataists » Blog Archive » A Taxonomy of Data Science [...]

  • http://mike.teczno.com Michal Migurski

    Such a great list!

    I know you work there and all, but a blog post full of bit.ly URLs is a bit jarring. I have no idea what I’m clicking on or even a sense of whether it will be interesting.

  • http://www.netcrema.com/?p=52953 dataists » Blog Archive » A Taxonomy of Data Science « Netcrema – creme de la social news via digg + delicious + stumpleupon + reddit

    [...] dataists » Blog Archive » A Taxonomy of Data Sciencedataists.com [...]

  • http://datalysed.com Łukasz

    Great read, thanks! It’s nice to find a new source of Data Science insight.

  • http://talsafran.com Tal Safran

    This is great.

    MAWR!

  • Safran Newport

    Resonates with life-saving advice given to me by a good friend (which has been invaluable – thanks) 99% of the insight can come from using command line tools and simple scripts.

    http://www.pandamatak.com/people/anand/blog/2008/01/on_data_analysis.html

    Safran

  • http://mshook.appspot.com/z/d4m.htm?/mshook Michael

    A.A. & kaes-

    They’re not using bit.ly to shorten the links, they’re using it to collect data!

  • http://www.neverreadpassively.com Jason Brownlee

    Back at school we called it the “KDD Process”. Google gives me this link: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html

  • http://www.google-buzz.com/four-short-links-27-september-2010.html Four short links: 27 September 2010

    [...] A Taxonomy of Data Science — great first post on a new blog by data practitioners. [...]

  • wwwald

    Nice article, but I agree with the commenters before me: the bit.ly links don’t provide any advantage for your readers…

    See also this: http://royal.pingdom.com/2010/09/22/is-the-web-heading-toward-redirect-hell/

  • http://infovore.org/archives/2010/09/28/links-for-september-27th-2/ Infovore » Links for September 27th through September 28th

    [...] dataists » Blog Archive » A Taxonomy of Data Science "Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science?" Great starting point; looking forward to more from the blog. (tags: data machinelearning datascience blog ) [...]

  • http://had.co.nz Hadley Wickham

    This is a very similar breakdown to what I teach my students, but I think you’ve missed an important principle: iteration. An analysis never flows smoothly from collection to conclusions – you will often repeat many parts, discovering problems with your data when you model it, thinking of better models after you’ve looked at residuals, …

    When thinking about a curriculum, you could do much worse to start with those laid out by Deb Nolan and Duncan Temple Lang in “Computing in the Statistics Curriculum”, http://www.mosaic-web.org/KickOff/Readings/nolantas2010.pdf.

  • http://kool-gadgets.com/2010/09/28/closer-look-rise-of-the-data-scientist/ kool-gadgets.com » Closer Look: Rise of the Data Scientist

    [...] the recent blog post A Taxonomy of Data Science the notion of where "hack" fits is presented as being part of a larger mix of areas of interest. [...]

  • Ryan

    Congratulations, you’ve just invented ETL (http://en.wikipedia.org/wiki/Etl), and have then gone on to describe more processes common in Business Intelligence.

    You’d be best investigating Pentaho’s Data Integration (Kettle), which is a powerful and open source ETL tool which is also fairly easy to use.

  • Adriaan

    Nice ideas but you seem to confuse yourself between ‘data science’ and technology. I shudder when I see someone try to analyse with tools, you should analyse with concepts and the tools simple give you the values. Whether you use Perl, scripts, C# or Excel makes no difference, its the values you are after

  • Tim

    This is NOT a Taxonomy. Maybe it could help people to develop taxonomies.

    This is more of a process similar to the ETL (Extract Transform Load) metaphor from data warehousing.

    I was expecting things more along the line of hierarchical, relational, aggregation, etc. and hopefully a few different twists.

  • http://scientopia.org/blogs/bookoftrogool/2010/09/29/tidbits-29-september-2010/ Tidbits, 29 September 2010 | Book of Trogool

    [...] From the brand-new “dataists” blog: A Taxonomy of Data Science. And Alex Holcombe asks What should scientists have to say about where the data’s [...]

  • http://www.vincebuffalo.com vinceb

    Heh, the statistics and machine learning steps are “E” and “M”. Appropriate.

  • http://www.revveal.com Jason

    Nice article on – can’t wait for the next. For those who were bugged by the bit.ly links, use Chrome and you’ll see the links when hovering over them.

    Two comments:

    “Additional skills which may come to play are familiarity with databases, including their syntax for representing data (e.g., JSON, above) and for querying databases.”

    I would say that this is a required skill and not something you should be “familiar” with. Understanding databases and querying data is hugely important.

    Also, I would venture to say that any programming language could be used to access data.

  • Joe H.

    I’d agree with Tim — this defines a couple of terms, it’s barely a controlled vocabulary ( http://scientopia.org/blogs/bookoftrogool/2010/08/05/librariansplaining-the-controlled-vocabulary/ )

    I saw the title of the post from another blog and was hoping to see relationships between the different types of data processing (as calibration, reduction, visualizing, etc., can be a wide variety of activities), and there are terms that mean different types of data activities in different fields (‘sample’ in earth science is data collection, while ‘sample’ in heliophysics is data reduction)

  • Bubnoff

    Someone above mentions excel …WTF?

    Isn’t it time we stopped shoehorning excel in to doing things better left to databases and programming/scripting languages? I think tools do matter, though your point about using concepts is valid. But come on …Excel?

    Maybe for display/report purposes …for others who’re technophobic or techno-challenged to fiddle with, but … anyway

  • http://kaythaney.wordpress.com/2010/10/03/sunday-morning-linkfest/ sunday morning linkfest « kay

    [...] on dataists, Hilary Mason and Chris Wiggins list 5 areas that data scientists should be comfortable in (and IMHO, are spot on). – A Taxonomy [...]

  • http://www.dataists.com/2010/09/the-data-science-venn-diagram/ dataists » Blog Archive » The Data Science Venn Diagram

    [...] the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn [...]

  • http://techrights.org/2010/10/04/palm-mansion-rumoured/ Links 4/10/2010: Codenames Needed for Fedora 15 , Linux-based Palm ‘Mansion’ Rumoured | Techrights

    [...] A Taxonomy of Data Science Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for the analysis to be interpretable (that is, one needs to understand the domain in which the data were generated, but also the mathematical operations performed during the “learning” and “optimization”). It requires creative decisions and open-mindedness in a scientific context. [...]

  • solon

    Just to add to what others have already mentioned: this seems like an updated walk-through of the issues worked out in the data mining literature and professional associations.

    There’s the CRoss Industry Standard Process
    for Data Mining (CRISP-DM) that’s commonly used in DM coursework: http://www.crisp-dm.org/Process/index.htm There’s actually a very interesting history to the development of these process models.

    And then there’s “Data Mining Curriculum: A Proposal” by the Intensive Working Group of ACM SIGKDD: http://www.sigkdd.org/curriculum/index.html This might already address some of the concerns in your forthcoming post…

  • http://doctordata.wordpress.com/2010/10/10/state-of-data-last-week-oct-09/ State of Data Last Week – Oct 09 « Dr Data's Blog

    [...] ‘Data Scientist’ does – Obtain; Scrub; Explore; Model; and [...]

  • http://twitter.com/KCanini Kevin Canini

    Disappointing… this is definitely not a taxonomy.

  • http://gigaom.com/2010/12/16/wanted-data-scientists-to-turn-information-into-gold/ Wanted: Data Scientists To Turn Information Into Gold: Tech News «

    [...] we move on, what is a data scientist? Hilary Mason, a data scientist at Bit.ly, has a good definition. It’s someone who can obtain, scrub, explore, model and interpret data, blending hacking, [...]

  • http://thenoisychannel.com/2011/01/04/so-you-like-big-data/ So You Like Big Data…

    [...] Of course, those jobs aren’t for everyone. To get an idea of the necessary qualifications, I suggest you read the answers on Quora for “How do I become a data scientist?” to get an idea of the requisite math and computer science skills. I’m also a fan of Hilary Mason‘s definition which was cited in Ryan Kim’s “Wanted: Data Scientists to Turn Information Into Gold“: a data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. You can see Hilary’s full explanation in a blog post she co-authored with Chris Wiggins, entitled “A Taxonomy of Data Science“. [...]

  • http://twitter.com/TonySearl TonySearl

    Hilary Mason and Chris Wiggins know their data. #LAK11

  • http://www.datanalytics.com/blog/2011/02/01/data-scientist/ datanalytics » Data Scientist…

    [...] And this other one with a suggestive title, to this other more specific one: [...]

  • http://www.walterjessen.com/data-scientists-dealing-with-data/ Data Scientists Dealing with Data | WalterJessen.com

    [...] model and interpret data (a taxonomy proposed by Hilary Mason, a data scientist at Bit.ly, in A Taxonomy of Data Science). A data scientist blends several skills, including hacking, statistics, mathematics, machine [...]

  • http://news.dice.com/2011/01/13/how-hacking-skills-can-help-you-land-a-data-science-job/ How Hacking Skills Can Help You Land a Data Science Job | Dice Blog Network

    [...] student in political science at New York University. Hilary Mason, a data scientist at Bit.ly, says a data scientist is someone who can “obtain, scrub, explore, model and interpret data” [...]

  • http://www.michaeldhealy.com Michael D. Healy

    Sweet #DataScience article from Hilary Mason and Chris Wiggins

  • http://www.seanlawson.net/?p=1045 From the Listening Post… 04/12/2011 (p.m.) « Sean Lawson, Ph.D.

    [...] A Taxonomy of Data Science [...]

  • http://www.seanlawson.net/?p=1046 From the Listening Post… 04/13/2011 (a.m.) « Sean Lawson, Ph.D.

    [...] A Taxonomy of Data Science [...]

  • http://www.quora.com/What-are-some-good-method-for-data-pre-processing-in-machine-learning#ans520514 Quora

    What are some good method for data pre-processing in machine learning?…

    I agree with Irene Ros, manual or semi-manual manipulations in Google Refine or Excel is probably the most practical thing one can recommend, that is as long as you data size and complexity doesn’t not exceed the capacity of those tools (and your pati…

  • http://imlab.cc/whale/?p=2504 Twitter Weekly Updates for 2011-05-01 | 鲸男 – Lei Gao

    [...] » A Taxonomy of Data Science: http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ [...]

  • Joel Lagan

     This is the direction that the world is moving… every organization’s leaders need to be in touch with what the Dataist has to say. http://www.stac.biz

  • http://www.hilarymason.com/presentations-2/an-introduction-to-machine-learning-with-web-data-is-now-available/ » An Introduction to Machine Learning with Web Data is now available! hilarymason.com

    [...] thinking about this material and how best to present it, particularly Chris Wiggins who co-authored A Taxonomy of Data Science and Andrew, Dennis, Jan, Jesse, and Julie, the members of the studio audience for the class (who [...]

  • http://www.facebook.com/brian.dalessandro Brian Dalessandro

    Very nice to see a concise write up on the subject. When I look to hire data scientists I essentially am looking for at least moderate ability in each of the skills you mention. Finding the complete skill set is not always easy.

    Also, I’d like to re-emphasize the intuitive element of data science. Coding and stats theory are crucial to the job, but interpreting your output (and knowing whether its right or wrong) is driven in large part by your creativity and intimate knowledge of the domain in question. This skill is harder to teach but very valuable when found.

  • Dataist

    I like that the Machine Learning steps are “E-M”, how appropriate!

  • Mike

    OSEMN rhymes with possum, but it is a homophone for awesome.

  • http://www.harlan.harris.name/2011/09/data-science-moores-law-and-moneyball/ Somethink to Chew On » Data Science, Moore’s Law, and Moneyball

    [...] is defined as what “Data Scientists” do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of [...]

  • http://drskippy.net/ Scott Hendrickson

    Thanks for a great post.

    Another activity I find useful to consider is Iteration.  It often takes many runs at a problem to get to a solution.  Becoming efficient at iteration is a great skill. Making iterations shorter and getting to simple or partial answers quickly helps move analysis in a good direction earlier in the process. 

    While it is a waste of time to formalize much exploration, I find myself regretting when I make decisions that don’t allow me to do what I did yesterday again quickly, with a slight change.  A couple of tricks I use all of the time ‘history | tail > my.bash’ or use ‘head -q -n100’ instead of ‘cat’ to sample a small data set until I am ready for the final answer. I am sure people have used many others.

  • http://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science/ A Very Short History of Data Science | What's The Big Data?

    [...] 2010  Hilary Mason and Chris Wiggins write in “A Taxonomy of Data Science”:  “…we thought it would be useful to propose one possible taxonomy… of what a data [...]

  • http://data-science.info/?p=32 data-science.info » Data Science: a literature review

    [...] it all”), Mike Loukides (“Data science enables the creation of data products”), Hilary Mason (“Data science is clearly a blend of the hackers”), Drew Conway (“The Data [...]

  • http://www.quora.com/Career-Advice/Whats-the-most-desirable-background-for-a-job-analyzing-social-networks-mathematically#ans1325955 Quora

    What’s the most desirable background for a job analyzing social networks mathematically?…

    I suggest you read some of Daniel Tunkelang’s blog posts and interviews. As principal data scientist at LinkedIn, he probably analyzes social networks mathematically. See the interview at [1], for example, where he mentions desired skills for data sci…

  • http://blogs.oreilly.com/radar2/2011/04/data-hand-tools.html Data hand tools – O'Reilly Radar

    [...] an essential part of a data scientist’s toolkit. Hilary Mason and Chris Wiggins wrote that “Sed, awk, grep are enough for most small tasks,” and there’s a layer of tools below sed, awk, and grep that are equally useful. Hilary has [...]

  • http://radar.oreilly.com/2010/12/strata-gems-sed-and-awk.html Strata Gems: The timeless utility of sed and awk – O'Reilly Radar

    [...] command line utilities sed and awk are useful tools for cleaning up and manipulating data. In their Taxonomy of Data Science, Hilary Mason and Chris Wiggins note that when cleaning data, “Sed, awk, grep are enough for [...]

  • http://kldavenport.com/ Kevin Davenport

    Years later still a great find.

  • ScottRobinett

    nice article