This author will present at Predictive Analytics World, Oct 29 – Nov 2 in New York. This article is excerpted from his book, Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. The book delivers a fresh overview of big data with an emphasis on the intriguing insights revealed by Google search trends. It also offers a new perspective on the power and peril of deployed machine learning (calling it “doppelganger discovery”). Click here for information about Seth’s upcoming PAW New York presentation and book signing.
Here’s how Bill Simmons, a sportswriter and passionate Boston Red Sox fan, described what was happening in the early months of the 2009 season: “It’s clear that David Ortiz no longer excels at baseball. . . . Beefy sluggers are like porn stars, wrestlers, NBA centers and trophy wives: When it goes, it goes.” Great sports fans trust their eyes, and Simmons’s eyes told him Ortiz was finished. In fact, Simmons predicted he would be benched or released shortly.
Was Ortiz really finished? If you’re the Boston general manager, in 2009, do you cut him? More generally, how can we predict how a baseball player will perform in the future? Even more generally, how can we use Big Data to predict what people will do in the future?
A theory that will get you far in data science is this: Look at what sabermetricians (those who have used data to study baseball) have done and expect it to spread out to other areas of data science. Baseball was among the first fields with comprehensive datasets on just about everything, and an army of smart people willing to devote their lives to making sense of that data. Now, just about every field is there or getting there. Baseball comes first; every other field follows. Sabermetrics eats the world.
The simplest way to predict a baseball player’s future is to assume he will continue performing as he currently is. If a player has struggled for the past 1.5 years, you might guess that he will struggle for the next 1.5 years.
By this methodology, Boston should have cut David Ortiz.
However, there might be more relevant information. In the 1980s, Bill James, whom most consider the founder of sabermetrics, emphasized the importance of age. Baseball players, James found, peaked early, at around the age of twenty-seven. Teams tended to ignore just how much players decline as they age. They overpaid for aging players.
By this more advanced methodology, Boston should definitely have cut David Ortiz.
But this age adjustment might miss something. Not all players follow the same path through life. Some players might peak at twenty-three, others at thirty-two. Short players may age differently from tall players, fat players from skinny players. Baseball statisticians found that there were types of players, each following a different aging path. This story was even worse for Ortiz; “beefy sluggers” indeed do, on average, peak early and collapse shortly past thirty.
If Boston considered his recent past, his age, and his size, they should, without a doubt, have cut David Ortiz.
Then, in 2003, statistician Nate Silver introduced a new model, which he called PECOTA, to predict player performance. It proved to be the best and, also, the coolest. Silver searched for players’ doppelgangers. Here’s how it works. Build a database of every Major League Baseball player ever, more than 18,000 men. And include everything you know about those players: their height, age, and position; their home runs, batting average, walks, and strikeouts for each year of their careers. Now, find the twenty ballplayers who look most similar to Ortiz right up until that point in his career: those who played like he did when he was 24, 25, 26, 27, 28, 29, 30, 31, 32, and 33. In other words, find his doppelgangers. Then see how Ortiz’s doppelgangers’ careers progressed.
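For readers who want to see the idea in code, the search described above can be sketched as a nearest-neighbor lookup. This is only a minimal illustration, not PECOTA itself, whose actual similarity formula is not given here. The function names, the choice of Euclidean distance on standardized features, and the four made-up players are all assumptions for the sake of the example.

```python
import math

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1,
    so that no single stat dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column has no spread, to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def find_doppelgangers(target, candidates, k=20):
    """Return the names of the k candidates closest to the target.

    target: list of numeric features for the player of interest.
    candidates: dict mapping player name -> list of the same features.
    """
    names = list(candidates)
    scaled = standardize([target] + [candidates[n] for n in names])
    target_scaled, rest = scaled[0], scaled[1:]
    distances = [(math.dist(target_scaled, row), name)
                 for row, name in zip(rest, names)]
    return [name for _, name in sorted(distances)[:k]]

# Invented players and numbers: [height_in, age, home_runs, batting_avg]
players = {
    "Player A": [76, 33, 28, 0.265],
    "Player B": [70, 24, 5, 0.310],
    "Player C": [75, 32, 31, 0.258],
    "Player D": [72, 29, 12, 0.280],
}
target = [76, 33, 30, 0.260]
print(find_doppelgangers(target, players, k=2))
```

With these invented numbers, the two closest matches are Player A and Player C, the tall sluggers in their early thirties. A real system would compare whole career trajectories, year by year, rather than a single row of stats.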
A doppelganger search is another example of zooming in. It zooms in on the small subset of people most similar to a given person. And, as with all zooming in, it gets better the more data you have. It turns out, Ortiz’s doppelgangers gave a very different prediction for Ortiz’s future. Ortiz’s doppelgangers included Jorge Posada and Jim Thome. These players started their careers a bit slow; had amazing bursts in their late twenties, with world-class power; and then struggled in their early thirties.
Silver then predicted how Ortiz would do based on how these doppelgangers ended up doing. And here’s what he found: They regained their power. For trophy wives, Simmons may be right; when it goes, it goes. But for Ortiz’s doppelgangers, when it went, it came back.
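The prediction step can be sketched in the same spirit: once you have the doppelgangers, average what they did next, giving closer matches more say. This is an assumption about how such a projection might be computed, not a description of Silver's actual method. The function name, the similarity weights, and the home run totals below are hypothetical.

```python
def project_from_doppelgangers(comparables):
    """Similarity-weighted average of what the comparables did next.

    comparables: list of (similarity, next_season_value) pairs, where
    similarity is a positive weight (higher means more alike).
    """
    total_weight = sum(sim for sim, _ in comparables)
    if total_weight <= 0:
        raise ValueError("need at least one comparable with positive similarity")
    return sum(sim * value for sim, value in comparables) / total_weight

# Invented numbers: (similarity, home runs in the following season)
comparables = [(0.9, 30), (0.8, 34), (0.5, 22)]
projection = project_from_doppelgangers(comparables)
print(round(projection, 1))
```

The least similar comparable, with 22 home runs, pulls the projection down less than it would in a plain average.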
The doppelganger search, the best methodology ever used to predict baseball player performance, said Boston should be patient with Ortiz. And Boston indeed was patient with their aging slugger. In 2010, Ortiz’s average rose to .270. He hit 32 home runs and made the All-Star team. This began a string of four consecutive All-Star games for Ortiz. In 2013, batting in his traditional third spot in the lineup, at the age of thirty-seven, Ortiz batted .688 as Boston defeated St. Louis, 4 games to 2, in the World Series. Ortiz was voted World Series MVP.
As soon as I finished reading about Nate Silver’s approach to predicting the trajectory of ballplayers, I began thinking about whether I might have a doppelganger, too.
Doppelganger searches are promising in many fields, not just athletics. Could I find the person who shares the most interests with me? Maybe if I found the person most similar to me, we could hang out. Maybe he would know some restaurants we would like. Maybe he could introduce me to things I had no idea I might have an affinity for.
A doppelganger search zooms in on individuals and even on the traits of individuals. And, as with all zooming in, it gets sharper the more data you have. Suppose I searched for my doppelganger in a dataset of ten or so people. I might find someone who shared my interest in books. Suppose I searched for my doppelganger in a dataset of a thousand or so people. I might find someone who had a thing for popular physics books. But suppose I searched for my doppelganger in a dataset of hundreds of millions of people. Then I might be able to find someone who was really, truly similar to me. One day, I went doppelganger hunting on social media. Using the entire corpus of Twitter profiles, I looked for the people on the planet who have the most interests in common with me.
You can certainly tell a lot about my interests from whom I follow on my Twitter account. Overall, I follow some 250 people, showing my passions for sports, politics, comedy, science, and morose Jewish folksingers.
So is there anybody out there in the universe who follows all 250 of these accounts, my Twitter twin? Of course not. Doppelgangers aren’t identical to us, only similar. Nor is there anybody who follows 200 of the accounts I follow. Or even 150.
However, I did eventually find an account that followed an amazing 100 of the accounts I follow: Country Music Radio Today. Huh? It turns out, Country Music Radio Today was a bot (it no longer exists) that followed 750,000 Twitter profiles in the hope that they would follow back.
I have an ex-girlfriend who I suspect would get a kick out of this result. She once told me I was more like a robot than a human being.
All joking aside, my initial finding that my doppelganger was a bot that followed 750,000 random accounts does make an important point about doppelganger searches. For a doppelganger search to be truly accurate, you don’t want to find someone who merely likes the same things you like. You also want to find someone who dislikes the things you dislike.
My interests are apparent not just from the accounts I follow but from those I choose not to follow. I am interested in sports, politics, comedy, and science but not food, fashion, or theater. My follows show that I like Bernie Sanders but not Elizabeth Warren, Sarah Silverman but not Amy Schumer, the New Yorker but not the Atlantic, my friends Noah Popp, Emily Sands, and Josh Gottlieb but not my friend Sam Asher. (Sorry, Sam. But your Twitter feed is a snooze.)
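One way to make this concrete is to compare two similarity scores on sets of followed accounts. The sketch below is an illustration only. The text does not say which measure was actually used, so the Jaccard index (shared follows divided by all accounts either person follows) is an assumption, and the account names are invented.

```python
def overlap_count(a, b):
    """Number of accounts both users follow."""
    return len(a & b)

def jaccard(a, b):
    """Shared follows divided by all accounts either user follows.
    Following lots of accounts the other user ignores lowers the score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented follow lists
me = {"sports", "politics", "comedy", "science"}
bot = me | {f"account_{i}" for i in range(1000)}  # follows nearly everything
fan = {"sports", "politics", "comedy", "theater"}

# Raw overlap ranks the bot first; Jaccard ranks the selective fan first.
print(overlap_count(me, bot), overlap_count(me, fan))
print(jaccard(me, bot), jaccard(me, fan))
```

By raw overlap the bot wins, since it follows everything I do. Under the Jaccard score it loses badly, because the thousand extra accounts it follows count against it, which is the point about shared dislikes.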
Of all 200 million people on Twitter, who has the most similar profile to me? It turns out my doppelganger is Vox writer Dylan Matthews. This was kind of a letdown, for the purposes of improving my media consumption, as I already follow Matthews on Twitter and Facebook and compulsively read his Vox posts. So learning he was my doppelganger hasn’t really changed my life. But it’s still pretty cool to know the person most similar to you in the world, especially if it’s someone you admire. And when I finish this book and stop being a hermit, maybe Matthews and I can hang out and discuss the writings of James Surowiecki.
About the Author:
Seth Stephens-Davidowitz is a New York Times op-ed contributor, a visiting lecturer at The Wharton School, and a former Google data scientist. He received a BA in philosophy from Stanford, where he graduated Phi Beta Kappa, and a PhD in economics from Harvard. His research—which uses new, big data sources to uncover hidden behaviors and attitudes—has appeared in the Journal of Public Economics and other prestigious publications. He lives in New York City.