Transitioning from academia to industry in Data Science can be a daunting task. The sheer breadth of available tools can make it difficult to figure out where to start with learning Data Science skills, and which are most important to learn. To help guide learners on their Data Science journey, we thought it would be useful to rank the most popular tools Data Scientists use. These tools are popular for several reasons, including their capability and usability, and skill with these tools is in high demand for Data Science professionals.
At The Data Incubator, we train students at every level on the latest in-demand tools and technologies for Data Science. We train students at the post-graduate level for positions in the Data Science industry through our free Fellowship program, and we train students at the corporate level to increase organizations’ Data Science capabilities. Our curriculum is based on feedback from our corporate and government partners, but we wanted to develop a more data-driven approach to what we should be teaching. So, we decided to analyze the popularity of certain tools and packages, on the assumption that widespread use of a tool is a reasonable signal of its capability and usability.
Results and Discussion
d3.js and its derivatives dominate the field
d3.js is at least four standard deviations above the mean on all calculated metrics. d3.js offers users full control of all aspects of their data visualizations. With this power comes a trade-off: d3.js does not come with built-in charts, and making a simple bar graph can become quite time-consuming. For this reason, dozens of reusable charting packages have been built upon d3.js. d3.js derivatives with premade components make up six of the top 20 packages on our list. These include: plottable (4), plotly.js (5), britecharts (7), c3 (9), recharts (15), and dc.js (18). These derivatives tend to provide charting options for bar, line, and scatter plots. For more specialized visualizations, such as maps and networks, additional packages are necessary.
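As a rough illustration of what "standard deviations above the mean" means for a single metric, the sketch below standardizes a set of values into z-scores. The function name `z_scores`, the package names, and the star counts are all made up for illustration; they are not the article's data, and the article does not state whether a population or sample standard deviation was used (the population form is assumed here).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return how many standard deviations each value lies from the mean.

    `values` maps a package name to one metric (for example, GitHub stars).
    Uses the population standard deviation, which is an assumption.
    """
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        # All values identical: nothing stands out from the mean.
        return {name: 0.0 for name in values}
    return {name: (v - mu) / sigma for name, v in values.items()}


# Hypothetical star counts, for illustration only.
stars = {"pkg_a": 900, "pkg_b": 100, "pkg_c": 120, "pkg_d": 80}
scores = z_scores(stars)
top_package = max(scores, key=scores.get)
```

A package whose score is positive on every metric sits above the mean on all of them, which is the sense in which the rankings describe packages as scoring "above the mean".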
leaflet.js is the most popular map visualization package
leaflet.js (6) is the only package dedicated to mapping to break into the top 20 on our list, with scores above the mean on all of our metrics. In addition to specializing in interactive maps, leaflet.js is lightweight (38 KB of JS) and mobile-friendly. cesium (27) is the highest-ranking package to offer 3D globes and maps. cartodb (29), rickshaw (37), and datamaps (46) also offer powerful geospatial/mapping visualizations.
sigma.js beats cytoscape for the most popular graph/network visualization package
britecharts has the largest growth rate for 2017
With so many data visualization options (we ranked 110), one might think it would be hard for a new charting package to gain a following. britecharts, a reusable charting library based on d3.js and created by Eventbrite, was first made publicly available less than two years ago. britecharts earned the number 7 spot in our overall rankings and had the highest compound monthly growth rate (110%) over the last six months. The next closest package is graphael, with a 56% growth rate.
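For readers unfamiliar with the term, a compound monthly growth rate can be sketched as follows. The article does not give its exact formula, so this assumes the common definition based on the first and last monthly totals; the function name and the download figures are hypothetical.

```python
def compound_monthly_growth_rate(monthly_downloads):
    """Compound monthly growth rate from the first to the last month.

    Assumed definition: (last / first) ** (1 / periods) - 1, where
    `periods` is the number of month-to-month steps in the series.
    """
    if len(monthly_downloads) < 2:
        raise ValueError("need at least two monthly totals")
    first, last = monthly_downloads[0], monthly_downloads[-1]
    if first <= 0:
        raise ValueError("first month must have a positive download count")
    periods = len(monthly_downloads) - 1
    return (last / first) ** (1 / periods) - 1


# Hypothetical six months of downloads that double every month,
# so the growth rate is about 1.0 (that is, 100% per month).
downloads = [100, 200, 400, 800, 1600, 3200]
rate = compound_monthly_growth_rate(downloads)
```

Under this definition, a rate above 100% means downloads more than doubled each month on average over the period.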
There’s a place near the top for both flot and flotr2
In addition, packages that have been around longer will naturally tend to have higher metrics, and therefore a higher ranking. The Stack Overflow and GitHub metrics are not adjusted for package age. The download metrics are restricted to the past six months.
The data presented a few difficulties:
All source code and data are on our GitHub page.
We first generated a list of 141 Data Science packages from these four sources, and then collected metrics for all of them, to come up with the ranking. GitHub data is based on both stars and forks, while Stack Overflow data is based on tags and questions containing the package name. Downloads data is from npmjs. Downloads were totaled over a six-month period, and the compound monthly growth rate was calculated over the same period. After scraping other sites for JS visualization package names, we had gathered over 200 package names. Many of them were aliases for the same packages (d3, D3JS). If the first result of a GitHub search returned the same repo as another package, we treated them as the same package, but saved the aliases to search Stack Overflow questions.
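The alias-merging step described above can be sketched as grouping names by the repository they resolve to. This is only an illustration: the `first_hit` lookup is hard-coded here in place of a real GitHub search, and the function name `merge_aliases` is hypothetical rather than taken from the authors' code.

```python
from collections import defaultdict


def merge_aliases(first_repo_by_name):
    """Group package names whose first search result is the same repo.

    `first_repo_by_name` maps a scraped package name to the "owner/repo"
    string of its first search hit. Returns a dict mapping each repo
    (lowercased) to the list of names that resolved to it.
    """
    groups = defaultdict(list)
    for name, repo in first_repo_by_name.items():
        groups[repo.lower()].append(name)
    return dict(groups)


# Hard-coded stand-in for search results, for illustration only.
first_hit = {
    "d3": "d3/d3",
    "D3JS": "d3/d3",
    "leaflet": "Leaflet/Leaflet",
}
merged = merge_aliases(first_hit)
# Each list of aliases can then be reused as Stack Overflow search terms.
d3_aliases = merged["d3/d3"]
```

Keeping every alias, rather than only a canonical name, matters because Stack Overflow questions may mention a package under any of its spellings.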
A few other notes:
All data was downloaded on August 6, 2017.
About the Authors
Rachel Allen is Lead Scientist at Booz Allen Hamilton. Prior to joining Booz, she worked as an instructor at The Data Incubator after graduating from their free eight-week fellowship. Rachel holds a PhD in Systems Neuroscience from University of Virginia and completed her post-doctoral research at the National Children’s Health System.
Michael Li is the founder and CEO of The Data Incubator, a company he started to help organizations hire and train professional Data Scientists. As a data scientist, he has worked at NASA, Google, Foursquare, and Andreessen Horowitz. He is a regular contributor to VentureBeat, The Next Web, and Harvard Business Review.