The Machine Learning Reproducibility Crisis

 

Originally published in Pete Warden’s Blog, March 19, 2018

For today’s leading machine learning methods and technology, attend the conference and training workshops at Predictive Analytics World Las Vegas, June 3-7, 2018.

I was recently chatting to a friend whose startup’s machine learning models were so disorganized that it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.

When I started out programming professionally in the mid-90s, the standard for keeping track of and collaborating on source code was Microsoft’s Visual SourceSafe. To give you a flavor of the experience, it didn’t have atomic check-ins, so multiple people couldn’t work on the same file, the network copy required nightly scans to avoid mysterious corruption, and even that was no guarantee the database would be intact in the morning. I felt lucky though; one of the places I interviewed at just had a wall of post-it notes, one for each file in the tree, and coders would take them down when they were modifying files and return them when they were done!

This is all to say, I’m no shrinking violet when it comes to version control. I’ve toughed my way through some terrible systems, and I can still monkey together a solution using rsync and chicken wire if I have to. Even with all that behind me, I can say with my hand on my heart that machine learning is by far the worst environment I’ve ever found for collaborating and keeping track of changes.

To explain why, here’s a typical life cycle of a machine learning model:

  • A researcher decides to try a new image classification architecture.
  • She copies and pastes some code from a previous project to handle the input of the dataset she’s using.
  • This dataset lives in one of her folders on the network. It’s probably one of the ImageNet downloads, but it isn’t clear which one. At some point, someone may have removed some of the images that aren’t actually JPEGs, or made other minor modifications, but there’s no history of that.
  • She tries out a lot of slightly different ideas, fixing bugs and tweaking the algorithms. These changes are happening on her local machine, and she may just do a mass file copy of the source code to her GPU cluster when she wants to kick off a full training run.
  • She executes a lot of different training runs, often changing the code on her local machine while jobs are in progress, since they take days or weeks to complete.
  • If a bug surfaces towards the end of a run on a large cluster, she might modify the code in one file and copy that change out to all the machines before resuming the job.
  • She may take the partially-trained weights from one run, and use them as the starting point for a new run with different code.
  • She keeps around the model weights and evaluation scores for all her runs, and picks which weights to release as the final model once she’s out of time to run more experiments. These weights can be from any of the runs, and may have been produced by very different code than what she currently has on her development machine.
  • She probably checks in her final code to source control, but in a personal folder.
  • She publishes her results, with code and the trained weights.
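None of those steps records which code, which data, and which parameters actually produced a given set of weights. As a rough illustration of the gap, here is a minimal sketch of the kind of per-run bookkeeping that would at least make a run traceable: capture the current git commit, whether there are uncommitted edits, a fingerprint of the dataset folder, and the hyperparameters, and write them next to the checkpoints. The helper names, file layout, and parameters below are hypothetical, not taken from the article.

import hashlib
import json
import subprocess
import time
from pathlib import Path

def dataset_fingerprint(data_dir):
    """Hash the file names and sizes in the dataset folder, so later runs
    can at least detect that images were added, removed, or replaced."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(str(path.stat().st_size).encode())
    return digest.hexdigest()

def record_run(run_dir, data_dir, hyperparams):
    """Write a small provenance manifest next to the checkpoints for this run."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
        dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"], text=True).strip())
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit, dirty = "unknown", True  # not running inside a git repo

    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "git_commit": commit,
        "uncommitted_changes": dirty,   # flags the "edited mid-run" case
        "dataset_dir": str(data_dir),
        "dataset_fingerprint": dataset_fingerprint(data_dir),
        "hyperparameters": hyperparams,
    }
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "run_manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

# Hypothetical usage before kicking off a training job:
# record_run("runs/exp_042", "/data/imagenet_subset",
#            {"lr": 1e-3, "batch_size": 64, "seed": 1234})

Even a manifest this simple would flag a couple of the failure modes above: code edited mid-run shows up as uncommitted_changes, and a quietly pruned dataset changes the fingerprint.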

To continue reading this article at Pete Warden’s Blog, click here.

About the Author:

Pete Warden is an engineer, CTO of Jetpac Inc, author of The Public Data Handbook and The Big Data Glossary for O’Reilly, and builder of OpenHeatMap, the Data Science Toolkit, and other open source projects.
