Machine Learning Times
Machine Learning Times
EXCLUSIVE HIGHLIGHTS
Visualizing Decision Trees with Pybaobabdt
 Originally published in Towards Data Science, Dec 14, 2021....
Correspondence Analysis: From Raw Data to Visualizing Relationships
 Isn’t it satisfying to find a tool that makes...
Podcast: Four Things the Machine Learning Industry Must Learn from Self-Driving Cars
    Welcome to the next episode of The Machine...
A Refresher on Continuous Versus Discrete Input Variables
 How many times have I heard that the most...
SHARE THIS:

2 months ago
Supercharging A/B Testing at Uber

 
Originally published in Uber Engineering, July 21, 2022.

Introduction

“Immensely laborious calculations on inferior data may increase the yield from 95 to 100 percent. A gain of 5 percent, of perhaps a small total. A competent overhauling of the process of collection, or of the experimental design, may often increase the yield ten- or twelve-fold, for the same cost in time and labor. To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. To utilize this kind of experience he must be induced to use his imagination, and to foresee in advance the difficulties and uncertainties with which, if they are not foreseen, his investigations will be beset.” (R. A. Fisher’s Presidential address to the 1st Indian Statistical Congress)

While the statistical underpinnings of A/B testing are a century old, building a correct and reliable A/B testing platform and culture at a large scale is still a massive challenge. Mirroring Fisher’s observation above, carefully constructing the building blocks of an A/B platform and ensuring the data collected is correct is critical to guaranteeing correctness of experiment results, but it’s easy to get wrong. Uber went through a similar journey and this blog post describes why and how we rebuilt the A/B testing platform we had at Uber.

Uber had an experimentation platform, called Morpheus, that was built 7+ years ago in the early days to do both feature flagging and A/B testing. Uber outgrew Morpheus significantly since then in terms of scale, users, use cases, etc.

In early 2020, we took a deeper look at this ecosystem. We discovered that a large percentage of the experiments had fatal problems and often needed to be rerun. Obtaining high-quality results required an expert-level understanding of experimentation and statistics, and an inordinate amount of toil (custom analysis, pipelining, etc.). This slowed down decision-making, and re-running poorly conducted experiments was common.

To continue reading this article, click here.

Leave a Reply