If you search for information on ML monitoring online, there is a good chance you’ll come across approaches that put data drift at the center of the monitoring solution.
While data drift detection is indeed a key component of a healthy monitoring workflow, we found that it is not the most important one. Data drift and its siblings, target drift and prediction drift, can misrepresent the state of an ML model in production.
The purpose of this blog post is to demonstrate that not all data drift impacts model performance, which makes drift methods hard to trust since they tend to produce a large number of false alarms. To illustrate this point, we will train an ML model using a real-world dataset, monitor the distribution of the model’s features in production, and report any data drift that occurs.
Afterwards, we will present a new algorithm developed by NannyML that significantly reduces these false alarms.
So, without further ado, let’s check the dataset used in this post.
Power consumption dataset
We use the Power Consumption of Tetouan City dataset, a real, open-source dataset. This data was collected by the supervisory control and data acquisition (SCADA) system of Amendis, a public service operator in charge of distributing drinking water and electricity in Morocco.