While R has traditionally been the programming language of choice for data scientists, it is quickly ceding ground to Python.
There are several reasons for the shift, but perhaps the biggest is that Python is general purpose and comparatively easy to learn, whereas R remains a somewhat complex programming environment to master.
In a world increasingly dependent on data and starved for data scientists, “easy” wins.
Part of the reason people struggle to learn R is that it’s not really a programming language. As R expert John Cook points out, R “is an interactive environment for doing statistics.” In his words, “I find it more helpful to think of R as having a programming language than being a programming language.”
R, then, doesn’t really look like a traditional programming language, which makes it that much harder for would-be R developers to grasp its nuances.
But R is hard even for those well-versed in statistical tools like SAS and SPSS, as Bob Muenchen highlights. R arguably reduces complexity for such analysts because it incorporates the macro and matrix languages, among other things, that tools like SPSS require you to master. Still, those who expect R to function like Stata are in for a disappointment.
Across the board, R is … different. Which makes it hard.
Python, however, is much more approachable. For one thing, all sorts of developers are familiar with Python and use it for a wide array of applications. Unlike R, which is used almost exclusively for data analysis, Python is a language a developer might first encounter when scripting her website or building any number of other applications.
As enterprises struggle to put data to work, they’re also struggling to find qualified data scientists. Often, however, would-be data scientists already work for them and likely have some familiarity with Python. Given the importance of asking the right questions of one’s data, training up homegrown talent on the right Big Data technologies is much more effective than training new-hire data scientists on the complexities of one’s business, as Gartner’s Svetlana Sicular posits.
Beyond tapping into a ready-made Python developer pool, however, one of the biggest benefits of doing data science in Python is the added efficiency of using one programming language across different applications. University of Texas at Austin research associate Tal Yarkoni explains:
It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself, say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python…
Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up … All of this overhead vanishes as soon as you move to a single language.
This can’t be overstated. As much as we may laud the niche technology that solves one problem really, really well, the technologies that tend to win out are the general-purpose tools that solve an array of problems. As AppNexus director of Optimization and Analytics David Himrod notes, “One of the biggest challenges that [AppNexus] faces is how to get a diverse set of employees working on the same technology stack. Python provides employees with different backgrounds—notably engineers, mathematicians and analysts—a common, easy-to-understand language that can be used to prototype new functionality for the company.”
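The handoff overhead Yarkoni describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records and their field names (`subject`, `score`) are invented, and the helper `to_csv` is hypothetical rather than anything from his projects. It shows the extra serialization step needed before another tool such as R could read the data, next to the same summary computed without leaving Python.

```python
import csv
import io
import json

# Hypothetical parsed records; the field names are invented for illustration.
records = [
    {"subject": "s01", "score": 0.82},
    {"subject": "s02", "score": 0.64},
]


def to_csv(rows):
    """Serialize a list of dicts to CSV text that another tool could load."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Handoff route: export the data so a second language can pick it up.
csv_text = to_csv(records)
json_text = json.dumps(records)

# Single-language route: keep working on the in-memory records directly.
# Note len(records), not records.length, as in the quoted passage.
mean_score = sum(row["score"] for row in records) / len(records)
```

Neither route is difficult on its own; the point of the quoted passage is that the export step, and the matching import step on the other side, recur every time data crosses a language boundary.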
Python still lacks some of R’s richness for data analytics, but it is closing the gap fast. And remember: key to Python’s success is not necessarily its ability to tackle the more arcane functions of R or any other analytics tool, but rather its approachability and general-purpose nature. Data science is moving out of the realm of the alpha geeks, something that was evident at O’Reilly’s Strata conference in New York last month. PhDs used to haunt its halls. This time, mortal business analysts and others, tasked by their enterprises to figure out Big Data, made up the majority of attendees.
This new, early majority of “data scientists” is far more likely to use Python than R. Python is comparatively simple to use, and they’ve likely been able to use it in another project already. As in other markets, the tool you already know or can easily learn is far more likely to win than the powerful-but-complex tool you’d really rather avoid if possible.