Transitioning from academia to industry in Data Science can be a daunting task. The sheer breadth of available tools can make it difficult to figure out where to start with learning Data Science skills, and which are most important to learn. To help guide learners on their Data Science journey, we thought it would be useful to rank the most popular tools Data Scientists use. These tools are popular for several reasons, including their capability and usability, and skill with these tools is in high demand for Data Science professionals.
At The Data Incubator, we train students at every level on the latest in-demand tools and technologies for Data Science – ultimately training students at the post-graduate level for positions in the Data Science industry through our free Fellowship program, and training students at the corporate level to increase organizations’ Data Science capabilities. Our curriculum is based on feedback from our corporate and government partners, but we wanted to develop a more data-driven approach to what we should be teaching. So, we decided to analyze the popularity of certain tools and packages, with the understanding that the more people who use a tool, the more capable and usable it must be.
In this article, we rank the top distributed computing packages for Data Science. First we describe our methodology, and then we list the top 20 packages, in order.
Below is a ranking of the top 20 of 140 distributed computing packages that are useful for Data Science, based on GitHub and Stack Overflow activity, as well as Google Search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Apache Hadoop is 6.6 standard deviations above average in Stack Overflow activity, while Apache Flink is close to average. See below for methods.
Results and Discussion
The package ranking weights its three components equally: GitHub (stars and forks), Stack Overflow (tags and questions), and number of Google search results. These were obtained using available APIs. Coming up with a comprehensive list of distributed computing packages was tricky – in the end, we scraped three different lists that we thought were representative. We chose to focus on 140 frameworks and distributed programming packages (see methods below for details). Computing standardized scores for each metric allows us to see which packages stand out in each category. The full ranking is here, while the raw data is here.
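To make the scoring concrete, here is a minimal sketch of how standardized scores and an equally weighted ranking could be computed. It is an illustration rather than the code used for this article: the function names, the use of the population standard deviation, and the sample numbers are all assumptions, and the package names are placeholders.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation.

    Assumption: the population standard deviation is used; the article
    does not say which variant was applied.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every package has the same value, so none stands out.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def rank_packages(metrics):
    """Rank packages by the unweighted mean of their standardized metrics.

    `metrics` maps a package name to a tuple of raw values, one per
    component (here: GitHub, Stack Overflow, Google results).
    """
    names = list(metrics)
    # Transpose so that each column holds one metric for all packages.
    columns = list(zip(*(metrics[name] for name in names)))
    z_columns = [standardize(column) for column in columns]
    overall = {
        name: mean(z[i] for z in z_columns)
        for i, name in enumerate(names)
    }
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


# Made-up raw values for three placeholder packages.
example = {
    "pkg_a": (900, 500, 300),
    "pkg_b": (100, 700, 200),
    "pkg_c": (200, 300, 100),
}
ranking = rank_packages(example)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Because every metric is put on the same scale before averaging, a package with an extreme value in one component (such as a very large Stack Overflow count) can still be compared fairly against packages that are strong in a different component.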
Apache Spark and Apache Hadoop are in a class of their own
Apache Spark (1) is an incredibly popular open source distributed computing framework. Apache Spark dominated the GitHub activity metric, with its numbers of forks and stars more than eight standard deviations above the mean. Apache Spark utilizes in-memory data processing, which makes it faster than its predecessors and well suited to machine learning. It also offers an interactive console in either Scala or, more popular among data scientists, Python. Although Apache Spark was initially designed for the Hadoop ecosystem, it can run on its own using one of many different file management systems. Apache Hadoop (2) outperformed Apache Spark in Stack Overflow activity. The disconnect between Hadoop’s Stack Overflow activity and the other two metrics is likely due to the fact that the meaning of Apache Hadoop has evolved over time. Rather than referring to just the framework, the term “Hadoop” can also mean all Hadoop-related projects that make up the ecosystem. This results in a somewhat inflated Stack Overflow score. Nevertheless, most of the frameworks and engines on our list have Apache Hadoop integrations. It also measured at least two standard deviations above the mean on all our metrics, solidifying its number two spot.
Apache Storm and Apache Flink are popular alternative frameworks, especially for streaming
Apache Storm (4), initially touted as the Apache Hadoop of real-time, is a stream-only framework best for near real-time distributed computing. It performed above average on all of our metrics. While Apache Storm processes stream data at scale, it is frequently used with Apache Kafka (3), a platform that processes the raw messages from real-time data feeds at scale. Similar to Apache Spark, Apache Flink (8) is a framework capable of both batch and stream processing. However, Apache Spark bills itself as a batch processor that can handle streaming, while Apache Flink is suited for heavy stream processing with some batch tasks.
Stratio Crossdata is the highest ranked data hub and fastest growing package
Stratio Crossdata (6) extends the capabilities of Apache Spark by providing a unified way to access multiple datastores. Stratio Crossdata uses a SQL-like language and just one API to access multiple datastores of different kinds, like Apache Cassandra, ElasticSearch, Avro, or MongoDB. The number of Google search results for Stratio Crossdata has increased by 400% from the last quarter, which is the largest growth rate out of all 140 packages on our list.
Two of the top 10 were developed by Twitter
The more popular of the two Twitter projects on our list, Apache Storm (4), was open-sourced by Twitter in 2011 and later donated to the Apache Software Foundation. Twitter Heron (9) is a direct successor to Apache Storm, released in June 2016. Twitter Heron offers improved real-time, fault-tolerant stream processing with higher throughput than Storm. Twitter Heron had the fifth largest quarterly growth rate, with an increase of 180%. It will be interesting to see if Twitter Heron can climb further up the ranks with time.
The Hadoop Ecosystem dominates
The Hadoop Ecosystem projects are the most prevalent and widely adopted distributed computing frameworks and interfaces. Seventeen of the top 20 packages we ranked are part of the Hadoop Ecosystem or designed to integrate with Apache Spark or Apache Hadoop (including HDFS). Outside of the Hadoop Ecosystem, three packages performed well on our metrics: Hazelcast (10), an in-memory data grid; Google BigQuery (12), a cloud-based big data analytics web service using a SQL-like syntax; and Metamarkets Druid (15), a framework for real-time analysis of large datasets.
Naturally, some libraries that have been around longer will have higher metrics, and therefore higher ranking. The only metric that takes this into account is the Google search quarterly growth rate.
The data presented a few difficulties:
All source code and data are on our GitHub page.
We first generated a list of 140 distributed computing packages from these four sources, and then collected metrics for all of them to come up with the ranking. GitHub data is based on both stars and forks, Stack Overflow data is based on tags and questions containing the package name, and Google results are based on the total number of Google search results over the last five years and the quarterly growth rate of results, calculated over the past three months as compared to the prior three months.
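The quarterly growth rate described above can be read as a simple percent change between two three-month windows. The sketch below assumes that reading; the function name and the counts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def quarterly_growth_rate(recent_count, prior_count):
    """Percent change in search results between two three-month windows.

    Assumption: growth is (recent - prior) / prior, expressed as a
    percentage. The article does not spell out the exact formula.
    """
    if prior_count <= 0:
        raise ValueError("prior_count must be positive to compute a growth rate")
    return (recent_count - prior_count) * 100.0 / prior_count


# Hypothetical counts: 100 results in the prior three months,
# 500 results in the most recent three months.
growth = quarterly_growth_rate(recent_count=500, prior_count=100)
print(f"Quarterly growth: {growth:.0f}%")
```

Under this definition, a 400% increase means the most recent window had five times as many results as the window before it, and a 180% increase means it had 2.8 times as many.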
A few other notes:
About the Authors
Rachel Allen is Lead Scientist at Booz Allen Hamilton. Prior to joining Booz, she worked as an instructor at The Data Incubator after graduating from their free eight-week fellowship. Rachel holds a PhD in Systems Neuroscience from University of Virginia and completed her post-doctoral research at the National Children’s Health System.
Michael Li is the founder and CEO of The Data Incubator, a company he started to help organizations hire and train professional Data Scientists. As a data scientist, he has worked at NASA, Google, Foursquare, and Andreessen Horowitz. He is a regular contributor to VentureBeat, The Next Web, and Harvard Business Review.