- Dec 8, 2018

Common Mistakes When Carrying Out Machine Learning and Data Science

*Originally published in KDnuggets, December 2018*


This is part two of the series. Part one is **How to build a data science project from scratch**.

After scraping or otherwise obtaining the data, there are many steps to complete **before** applying a machine learning model.

You need to visualize each variable to see its distribution, find the outliers, and understand why those outliers exist.

What can you do with missing values in certain features?

What would be the best way to convert categorical features into numerical ones?

There are many such questions, but I will go into detail on the ones where most beginners make mistakes.

**Visualization**

First, visualize the distribution of each continuous feature to get a feel for whether there are many outliers, what shape the distribution has, and whether it makes sense.

There are many ways to visualize it, for example box plots, histograms, cumulative distribution functions, and violin plots. However, one should pick the plot that will give the most information about the data.

To see the shape of the distribution (whether it is normal or bimodal, for example), histograms are the most helpful. Although histograms are a good starting point, box plots can be better for identifying the number of outliers and seeing where the median and quartiles lie.
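As a rough illustration of what these two plots summarize, here is a minimal sketch using only Python's standard library. The sample ages, the function names `box_plot_summary` and `text_histogram`, and the bin width are all hypothetical. The 1.5 × IQR fence is the convention most box plot implementations use by default, which is an assumption here rather than something the article specifies.

```python
import statistics
from collections import Counter


def box_plot_summary(values):
    """Return the numbers a box plot draws: quartiles and 1.5*IQR outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


def text_histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample: passenger ages with one suspicious entry (120).
ages = [22, 25, 27, 29, 30, 31, 33, 35, 38, 40, 120]

summary = box_plot_summary(ages)
histogram = text_histogram(ages, bin_width=10)

# Print a crude text histogram, one row per non-empty bin.
for lower_edge, count in histogram.items():
    print(f"{lower_edge:>3}-{lower_edge + 9:<3} {'#' * count}")

print("median:", summary["median"], "outliers:", summary["outliers"])
```

The histogram rows show where the bulk of the values sit, while the summary flags the single value beyond the fences. With seaborn installed, `seaborn.histplot` and `seaborn.boxplot` draw the corresponding plots directly.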

Based on the plots, the most interesting question is: **do you see what you expected to see?** Answering this question will help you find either insights or bugs in the data.
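One way to make that question concrete is to write your expectations down and check the data against them. The sketch below is only an illustration: the records, the column names, the allowed ranges, and the helper `check_expectations` are all hypothetical and not taken from the article.

```python
def check_expectations(rows, expectations):
    """Return (column, value) pairs that violate the stated expectation."""
    violations = []
    for row in rows:
        for column, is_expected in expectations.items():
            value = row.get(column)
            # Missing values are skipped here; they need separate handling.
            if value is not None and not is_expected(value):
                violations.append((column, value))
    return violations


# Hypothetical records; two of the ages look like data-entry bugs.
passengers = [
    {"age": 29, "fare": 7.25},
    {"age": -1, "fare": 71.28},
    {"age": 54, "fare": 0.0},
    {"age": 120, "fare": 8.05},
]

# Hypothetical expectations, one predicate per column.
expectations = {
    "age": lambda a: 0 <= a <= 100,
    "fare": lambda f: f >= 0,
}

violations = check_expectations(passengers, expectations)
print(violations)
```

Each violation is either a bug to fix or a surprise worth investigating, which is the same split between bugs and insights that the plots are meant to reveal.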

To get inspired and understand which plot will give the most value, I frequently refer to the gallery of Python's seaborn library. Another good source of inspiration for visualization and finding insights is the kernels on Kaggle. Here is my Kaggle kernel with an in-depth visualization of the Titanic dataset.

**To continue reading this article on KDnuggets, click here.**

**About the Author**

Jekaterina Kokatjuhha is a Research Engineer at Zalando in the Berlin area in Germany.