One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has its advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish languages with a vocal minority from those that actually have large communities.
Solution: measure both, and compare.
This week John Myles White and I set out to gather data that measured both the number of projects using various languages and the sizes of their communities. While neither metric has a straightforward means of collection, we decided to exploit data on Github and StackOverflow to measure each, respectively. Github provides a popularity ranking for each language based on the number of projects, and using the below R function we were able to collect the number of questions tagged for each language on StackOverflow.
The above chart shows the results of this data collection, where high rank values indicate greater popularity, i.e., the most popular languages on each dimension are in the upper-right of the chart. Even with this simple comparison there are several items of note:
- Metrics are highly correlated: perhaps unsurprisingly, we find that these ranks have a correlation of almost 0.8. Much less clear, however, is whether extensive use leads to large communities, or vice versa.
- Popularity is tiered: for those languages that conform to the linear fit, there appears to be clear separation among tiers of popularity, from the “super-popular” cluster in the upper-right, to the more specialized languages in the second tier, and then the niche and deprecated languages in the lower-left.
- What’s up with VimL and Delphi?: The presence of severe outliers may be an indication of weakness in these measures, but they are interesting to consider nonetheless. How is it possible that VimL could be the 10th most popular language on Github, but have almost no questions on StackOverflow? Is the StackOverflow measure actually picking up the opaqueness of languages rather than the size of their community? That might explain the position of R.
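
For readers who want to try this kind of comparison themselves, here is a minimal Python sketch of the idea: turn two sets of counts into ranks, correlate the ranks, and list the languages whose two ranks disagree the most. It is not the code we used for the chart. The function names are made up for illustration, the language names and counts are placeholder values rather than real Github or StackOverflow figures, and ties are not handled specially.

```python
from statistics import mean


def rank_ascending(counts):
    """Rank languages so that 1 = smallest count; a higher rank means more popular.

    Ties are broken by input order (no averaging), which is a simplification.
    """
    ordered = sorted(counts, key=counts.get)
    return {lang: i + 1 for i, lang in enumerate(ordered)}


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(projects, questions):
    """Correlation of the two rankings, over languages present in both sources."""
    langs = sorted(set(projects) & set(questions))
    p_rank = rank_ascending({l: projects[l] for l in langs})
    q_rank = rank_ascending({l: questions[l] for l in langs})
    return pearson([p_rank[l] for l in langs], [q_rank[l] for l in langs])


def rank_gaps(projects, questions):
    """(language, project rank - question rank) pairs, largest disagreement first."""
    langs = sorted(set(projects) & set(questions))
    p_rank = rank_ascending({l: projects[l] for l in langs})
    q_rank = rank_ascending({l: questions[l] for l in langs})
    gaps = [(l, p_rank[l] - q_rank[l]) for l in langs]
    return sorted(gaps, key=lambda pair: -abs(pair[1]))


# Placeholder counts for hypothetical languages, NOT real data.
projects = {"lang_a": 500, "lang_b": 300, "lang_c": 200, "lang_d": 50}
questions = {"lang_a": 900, "lang_b": 10, "lang_c": 400, "lang_d": 20}

print(rank_correlation(projects, questions))
print(rank_gaps(projects, questions))
```

A large positive gap flags a language with many projects but few questions, the pattern VimL shows in our chart; a large negative gap flags the reverse.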
We Dataists have a much shallower language toolkit than is represented in this graph. Having worked with my co-authors a few times, I know we primarily stick to the shell, Python, and R stack, and to a lesser extent C, Perl, and Ruby, so it is difficult to provide insight as to the position of many of these languages. If you see your favorite language and have a comment, please let us know.
Raw ranking data available here.