Common Mistakes When Carrying Out Machine Learning and Data Science

Dec 8, 2018

Originally published in KDnuggets, December 2018


This is part two of this series. Part one is "How to build a data science project from scratch."

After scraping or otherwise obtaining the data, there are many steps to complete before applying a machine learning model.

You need to visualize each of the variables to see distributions, find the outliers, and understand why there are such outliers.

What can you do with missing values in certain features?

What would be the best way to convert categorical features into numerical ones?

There are many such questions, but I will go into detail on the ones where most beginners make mistakes.

Visualization

First, visualize the distribution of each continuous feature to get a feeling for how many outliers there are, what shape the distribution has, and whether it makes sense.

There are many ways to visualize a distribution, for example box plots, histograms, cumulative distribution functions, and violin plots. Pick the plot that gives the most information about the data.
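As a rough illustration of the plot types listed above, the following sketch draws all four for the same feature. It uses matplotlib (the library that seaborn builds on) rather than seaborn itself, and the data is synthetic: the `values` list, its parameters, and the three extreme points are assumptions made up for this example, not data from the article.

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

rng = random.Random(42)
# Hypothetical continuous feature: mostly well-behaved values plus a few extremes.
values = [rng.gauss(50, 10) for _ in range(500)] + [150.0, 160.0, 175.0]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Histogram: shows the overall shape of the distribution.
axes[0].hist(values, bins=30)
axes[0].set_title("Histogram")

# Box plot: shows median, quartiles, and points beyond the whiskers.
axes[1].boxplot(values)
axes[1].set_title("Box plot")

# Empirical cumulative distribution function, built by sorting the values.
sorted_values = sorted(values)
cumulative = [(i + 1) / len(sorted_values) for i in range(len(sorted_values))]
axes[2].step(sorted_values, cumulative, where="post")
axes[2].set_title("Empirical CDF")

# Violin plot: a smoothed density estimate mirrored around a central axis.
axes[3].violinplot(values, showmedians=True)
axes[3].set_title("Violin plot")

fig.tight_layout()
```

Calling `plt.show()` in an interactive session, or `fig.savefig(...)` with a path of your choice, displays or stores the figure. Putting the panels side by side makes it easier to judge which one says the most about a given feature.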

To see the shape of the distribution (for example, whether it is normal or bimodal), histograms are the most helpful. Although histograms are a good starting point, box plots can be better for identifying the number of outliers and for seeing where the median and quartiles lie.
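To make concrete what a box plot reports, here is a minimal sketch that computes the same numbers by hand: the quartiles, plus the points lying beyond 1.5 times the interquartile range, which is the common whisker convention. The function name `box_plot_summary`, the `whisker` parameter, and the sample values are hypothetical and chosen only for illustration.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Return the quartiles, whisker fences, and outliers a box plot would show."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    # Anything outside the fences would be drawn as an individual outlier point.
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical feature with one suspicious value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = box_plot_summary(sample)
print(summary["median"], summary["outliers"])
```

Once the outliers are listed, the next step is the one the article stresses: work out why those values are there before deciding what to do with them.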

Based on the plots, the most interesting question would be: do you see what you expected to see? Answering this question will help you either in finding insights or finding bugs in the data.

To get inspired and to understand which plot will give the most value, I frequently refer to the gallery of Python's seaborn library. Another good source of inspiration for visualization and for finding insights is the kernels on Kaggle. Here is my Kaggle kernel with an in-depth visualization of the Titanic dataset.

To continue reading this article on KDnuggets, click here.

About the Author

Jekaterina Kokatjuhha is a Research Engineer at Zalando in the Berlin area in Germany.

The Machine Learning Times © 2020 • 1221 State Street • Suite 12, 91940 • Santa Barbara, CA 93190
Produced by: Rising Media & Prediction Impact