Machine Learning Times
Machine Learning Times
EXCLUSIVE HIGHLIGHTS
Survey: Machine Learning Projects Still Routinely Fail to Deploy
 Originally published in KDnuggets. Eric Siegel highlights the chronic...
Three Best Practices for Unilever’s Global Analytics Initiatives
    This article from Morgan Vawter, Global Vice...
Getting Machine Learning Projects from Idea to Execution
 Originally published in Harvard Business Review Machine learning might...
Eric Siegel on Bloomberg Businessweek
  Listen to Eric Siegel, former Columbia University Professor,...
SHARE THIS:

5 years ago
Common Mistakes When Carrying Out Machine Learning and Data Science

 

Originally published in KDnuggets, December 2018

For today’s leading deep learning methods and technology, attend the conference and training workshops at Deep Learning World, June 16-19, 2019 in Las Vegas.

This is part two of this series, find part one here – How to build a data science project from scratch.

After scraping or getting the data, there are many steps to accomplish before applying a machine learning model.

You need to visualize each of the variables to see distributions, find the outliers, and understand why there are such outliers.

What can you do with missing values in certain features?

What would be the best way to convert categorical features into numerical ones?

There are many such questions, but I will give some details on the ones where the majority of beginners encounter mistakes.

Visualization

Firstly, you should visualize the distribution of the continuous features to get a feeling if there are many outliers, what the distribution would be, and if it makes sense.

There are many ways to visualize it, for example box plotshistogramscumulative distribution functions, and violin plots. However, one should pick the plot that will give the most information about the data.

To see the distribution (if it is normal, or bimodal), the histograms will be the most helpful. Although histograms are a good starting point, the box plots might be superior in identifying the number of outliers and seeing where the median quartiles lie.

Based on the plots, the most interesting question would be: do you see what you expected to see? Answering this question will help you either in finding insights or finding bugs in the data.

To get inspired and understand what plot will give the most value, I frequently referred to the Python’s seaborn gallery. Another good source of inspiration for the visualization and finding insights are kernels on Kaggle. Here is my kaggle kernel of the in-depth visualization of the titanic dataset.

To continue reading this article on KDnuggets, click here.

About the Author

 

Jekaterina Kokatjuhha is a Research Engineer at Zalando in the Berlin area in Germany.

Leave a Reply