Machine Learning Times
Machine Learning Times
EXCLUSIVE HIGHLIGHTS
Effective Machine Learning Needs Leadership — Not AI Hype
 Originally published in BigThink, Feb 12, 2024.  Excerpted from The...
Today’s AI Won’t Radically Transform Society, But It’s Already Reshaping Business
 Originally published in Fast Company, Jan 5, 2024. Eric...
Calculating Customer Potential with Share of Wallet
 No question about it: We, as consumers have our...
A University Curriculum Supplement to Teach a Business Framework for ML Deployment
    In 2023, as a visiting analytics professor...
SHARE THIS:

5 years ago
Common Mistakes When Carrying Out Machine Learning and Data Science

 

Originally published in KDnuggets, December 2018

For today’s leading deep learning methods and technology, attend the conference and training workshops at Deep Learning World, June 16-19, 2019 in Las Vegas.

This is part two of this series, find part one here – How to build a data science project from scratch.

After scraping or getting the data, there are many steps to accomplish before applying a machine learning model.

You need to visualize each of the variables to see distributions, find the outliers, and understand why there are such outliers.

What can you do with missing values in certain features?

What would be the best way to convert categorical features into numerical ones?

There are many such questions, but I will give some details on the ones where the majority of beginners encounter mistakes.

Visualization

Firstly, you should visualize the distribution of the continuous features to get a feeling if there are many outliers, what the distribution would be, and if it makes sense.

There are many ways to visualize it, for example box plotshistogramscumulative distribution functions, and violin plots. However, one should pick the plot that will give the most information about the data.

To see the distribution (if it is normal, or bimodal), the histograms will be the most helpful. Although histograms are a good starting point, the box plots might be superior in identifying the number of outliers and seeing where the median quartiles lie.

Based on the plots, the most interesting question would be: do you see what you expected to see? Answering this question will help you either in finding insights or finding bugs in the data.

To get inspired and understand what plot will give the most value, I frequently referred to the Python’s seaborn gallery. Another good source of inspiration for the visualization and finding insights are kernels on Kaggle. Here is my kaggle kernel of the in-depth visualization of the titanic dataset.

To continue reading this article on KDnuggets, click here.

About the Author

 

Jekaterina Kokatjuhha is a Research Engineer at Zalando in the Berlin area in Germany.

Leave a Reply