The Data Science Venn Diagram

Posted: September 30th, 2010 | Filed under: Philosophy of Data | 31 Comments

by Drew Conway

Last Monday I humbly joined a group of NYC’s most sophisticated thinkers on all things data for a half-day unconference to help O’Reilly organize their upcoming Strata conference. The breakout sessions were fantastic, and the number of people in each allowed for outstanding, expert-driven discussions. One of the best sessions I attended focused on issues related to teaching data science, which inevitably led to a discussion on the skills needed to be a fully competent data scientist.

As I have said before, I think the term “data science” is a bit of a misnomer, but I was very hopeful after this discussion, mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, and their overlaps, or where data science fits.

What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram.

Data science Venn diagram

How to read the Data Science Venn Diagram

The primary colors of data: hacking skills, math and stats knowledge, and substantive expertise

  • On Monday we spent a lot of time talking about “where” a course on data science might exist at a university. The conversation was largely rhetorical, as everyone was well aware of the inherently interdisciplinary nature of these skills; but then, why have I highlighted these three? First, none is discipline specific. More importantly, each of these skills is very valuable on its own, but when combined with only one other, the result is at best simply not data science, and at worst downright dangerous.
  • For better or worse, data is a commodity traded electronically; therefore, in order to be in this market you need to speak hacker. This, however, does not require a background in computer science. In fact, many of the most impressive hackers I have met never took a single CS course. Being able to manipulate text files at the command line, understanding vectorized operations, thinking algorithmically: these are the hacking skills that make for a successful data hacker.
  • Once you have acquired and cleaned the data, the next step is to actually extract insight from it. In order to do this, you need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools. This is not to say that a PhD in statistics is required to be a competent data scientist, but it does require knowing what an ordinary least squares regression is and how to interpret it.
  • The third critical piece, substance, is where my thoughts on data science diverge from most of what has already been written on the topic. To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. On the flip side, substantive expertise plus math and statistics knowledge is where most traditional research falls. Doctoral-level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology. Part of this is the culture of academia, which does not reward researchers for understanding technology. That said, I have met many young academics and graduate students who are eager to buck that tradition.
  • Finally, a word on the hacking skills plus substantive expertise danger zone. This is where I place people who “know enough to be dangerous,” and it is the most problematic area of the diagram. In this area are people who are perfectly capable of extracting and structuring data, likely related to a field they know quite a bit about, and who probably even know enough R to run a linear regression and report the coefficients; but they lack any understanding of what those coefficients mean. It is from this part of the diagram that the phrase “lies, damned lies, and statistics” emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created. Fortunately, it requires near-willful ignorance to acquire hacking skills and substantive expertise without also learning some math and statistics along the way. As such, the danger zone is sparsely populated; however, it does not take many to produce a lot of damage.
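
To make the regression point above a little more concrete, here is a minimal sketch, in Python rather than the R mentioned in the post, of fitting an ordinary least squares line by hand and reading off the coefficients. The data are made up, and the names `fit_ols`, `r_squared`, `hours`, and `scores` are hypothetical, chosen only for illustration.

```python
from statistics import mean


def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y accounted for by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: hours studied and exam scores (invented for this sketch).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

intercept, slope = fit_ols(hours, scores)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"R^2 = {r_squared(hours, scores, intercept, slope):.3f}")
```

Running the fit is the easy part. The interpretation is where the math and statistics knowledge comes in: in this invented sample the slope says each additional hour is associated with roughly four more points, which is not a claim about causation, and the intercept is an extrapolation to zero hours that the data may not support.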

I hope this brief illustration has provided some clarity into what data science is and what it takes to get there. Considering these questions at a high level prevents the discussion from degrading into minutiae, such as specific tools or platforms, which I think hurts the conversation.

I am sure I have overlooked many important things, but again the purpose was not to be specific. As always, I welcome any and all comments.

Cross-posted at Zero Intelligence Agents

  • Petr

    Just want to say thanks for great article!

  • M. Edward (Ed) Borasky

    I would add, however, that too often “data science” equals “Here is the story I want you to tell. Go gather data supporting it and make a compelling infographic”. As John Tukey so eloquently put it, “Let the data speak for themselves.”

  • zyxo

    Nice diagram. But I feel a bit uncomfortable about “data science”. It’s not about the data, but the information that’s buried in the data. “Information discovery science” seems a better term to me.
    The same problem exists with the term “data mining”. Like mining for gold, data miners mine to find information, not data.

  • drewconway

    @borasky: That would seem to be the definition of “bad data science,” or analysis with an agenda. This is not unique to analysis of large data sets, and is a persistent problem in all of science.

    @zyxo: I share your feelings, and have waxed on the semantic point at length before.

  • Vulpecula

    [...] an article by Conway that uses a Venn diagram to describe the skills needed to become a data scientist. [...]

  • Quora

    How can a non-technical corporate finance type get into Big Data?…

    Learn R[1] and start using it instead of excel for nearly everything. Don’t just learn MapReduce[2] and Hadoop[3], use it. Get an interesting dataset or invent one to practice on Amazon’s Elastic MapReduce[4]. Take a look at Pig[5]. Get _really_ good…

  • Juan Pablo

    In a world of black-box solutions, this phrase is perfect:
    “What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist.”

  • How do I become a data scientist?

    [...] Some really interesting nuggets there.  I can’t help but point to Neil Conway’s data science Venn diagram [...]

  • Danger Zone! of Data Science « LingPipe Blog

    [...] posts, but I’m still chuckling at the “Danger Zone!” quadrant from the post The Data Science Venn Diagram on the Dataists [...]

  • The Data Science Venn Diagram – Post « Another Word For It

    [...] The Data Science Venn Diagram by Drew Conway is a must see! [...]

  • ScrappyKid

    So the interesting question is, who is doing data science– extracting value from the messy, unstructured data on the web, without biasing analysis toward an agenda?

    Surely the data-as-service providers are doing some of the work, and the Facebooks and Googles and twitters are doing another part of it. But the hard, low-visibility problem of manipulating and extracting and cleaning the data remains difficult.

    The good news is, the university isn’t the only place to learn these sorts of skills. A broad and deep liberal arts education that still challenges people to acquire analytical and technical skills (but as skills/tools, not as a lens to look at the world), is incredibly valuable. I suspect that it’s a lot easier to figure out whether someone can acquire technical skills than it is to ask someone to acquire attention to detail and creative thinking.

  • #LAK11 Data Science and Analytics: the Good, the Bad and the Ugly | A Chronicle of a Learning Journey

    [...] of the naughty list is from Drew Conway’s original definition of danger zone and from George Siemens’ 10 concerns.  Drew’s reason [...]

  • 5 Predictions for Online Data in 2011 | Digi Marketing Pet

    [...] three years ago. Data scientists are officially the hot new hire of choice even though their particular mix of formal skills is still rare…” 5 Predictions for Online Data in [...]

  • Data Science | Sphaerula

    [...] link Smith provides is to a post from last fall by Drew Conway, who created a Data Science Venn Diagram that defines, for Conway, where data science falls in the intersection of “hackers, [...]

  • MOOC newbie Voice – Week 2 Big Data… must be important… it’s big! | Semasajaya News Directory – Page 1

    [...] about how to interpret data… I really like this blog post from the extra resources list. A beginners guide to figuring out what the charts might mean and connected to a bunch of other [...]

  • Data Science « AOS
  • Quora

    Who are some of the most prominent data scientists today and what is their claim to fame?…

    As the description of this question suggests, data science is nebulous term. Sort of like hacker, its a buzzword that means multiple things to multiple people and completely lacks any standardized definition or overseeing body (ie IEEE, AIA, ADA etc). …

  • Data Enthusiast

    [...] a field, but a skillset) and even Mathematics. For a good reference see Hillary Mason’s post “The Data Science Venn Diagram.” The bigger impact that these newly discovered “Data Scientists” have in the short term [...]

  • Claudia Perlich

    Having taught data mining before and teaching it again, this is a great way of putting things into perspective. I could not agree more with the observation that science can only arise when you bring all three of them together!

    PS: May I please use it to warn my students what they have signed up for?

  • El científico de datos | Soraya Paniagua

    [...] Conway, in his post The data science venn diagram, gives us his graphical idea of the knowledge and skills that make up the Science of [...]

  • La Ciencia de los Datos | Soraya Paniagua

    [...] As a preliminary step in organizing Strata, O’Reilly invited a select group of researchers connected to the world of data to an unconference to determine the conference’s topics and speakers. Among them was Drew Conway, who has left us this great post, The data science venn diagram. [...]

  • The Importance of Thinking Comparatively « Dart-Throwing Chimp

    [...] stories of extreme child abuse and neglect are the disturbing part of the article, but the “data scientist” in me was also annoyed by the sloppy logic that seeks to hold the Pearls accountable for [...]

  • datanalytics » ¿Qué es un “data scientist”?

    [...] from dataists the following [...]

  • So you call yourself a data scientist? | Anne Z.

    [...] domain knowledge, data analysis capability, and a hacker’s mindset (see Drew Conway’s Venn diagram of data science reproduced here). Any term that only incorporates one or two of these circles doesn’t really [...]

  • David Douglas

    I will definitely reuse this (with proper attribution). Thank you!

  • Big Data and Data Science: Love at First Byte | Analytics | DATAVERSITY

    [...] There are no Data Science classes offered at universities, nor are there any books on the subject – not yet anyway, though they are probably on the way. Data Science does not fulfill the General Ed science requirement, but there are many Data Science websites available with lots of information about this emerging field, and rather auspiciously enough people are talking about it to get a more definitive idea of where it is heading. Data science is the practice of “translating massive data into predictive insights that lead to results.” This involves a data scientist skill package that uses what Drew Conway calls the Data Science Venn Diagram: [...]

  • Embasado – Perigo, Will Robinson!

    [...] Drew Conway reminds us that there is more to data analysis than having access to cool tools. [...]

  • Don’t Judge Your Ecuadorian Cook or How to Learn Statistics (Part 1) | Unnatural Consequences

    [...] The Tools While discussing the subject, I purposely avoid terms like Machine Learning, Data Mining, and Data Science.  A lot has been written about the subtle differences.  If you care about that, here is one discussion thread.  Also, see a related post from Drew Conway here. [...]

  • Pedro

    “To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods.”

    I would strongly disagree with your statement that Machine Learners do not form and test hypotheses. Also, how do you distinguish a data miner from a data scientist? Seems like just a simple rebranding. I’m sure that data miners have even dealt with big data long before “data science” was conceived.

  • Shubham Makharia

    Warm Regards!
    Hey Drew,
    The article seems to be very helpful so far. I have one request: can you please fix the Venn diagram file? The file is missing from the host server, and we cannot access it.

    Much Thanks

  • Chad Caulkins

    Just wanted to throw in a clarification to @zyxo: data mining is not “mining to find data”; rather, it is mining the data for information.
    @Pedro is right, “spot on” if you will, in that Machine Learning is much more, just as Data Mining and KDD were before Data Science, and Artificial Intelligence was a different term that could be used in conjunction with Big Data. After all, what is BD exactly? :)