Workshop – The Advanced Data Preparation Bootcamp: Whip your Data into Shape
Monday, June 4, 2018 in Las Vegas
Full-day: 8:30am – 4:30pm
Room: Pompeian IV
- Practitioners: Analysts who have worked with data and want a deeper understanding of what machine learning algorithms assume about data and how to improve data quality for predictive modeling.
- Technical Managers: Project leaders and managers who are responsible for developing predictive analytics solutions and who want to understand key principles of data preparation.
Knowledge Level: Familiar with the basics of predictive modeling and statistics.
As crucial as it is, data preparation is perhaps the most under-taught part of the predictive analytics (machine learning) process, even though we spend 60%, 70%, even up to 90% of our time on data preparation steps. This workshop will cover the most important aspects of data preparation. Each topic will be described and connected to the specific modeling algorithms that benefit from that data preparation step. Topics include:
- Data cleaning: outlier detection and “fixing”, and which algorithms care about outliers
- Missing value imputation: the simple approaches and more complex and complete methods
- Feature creation: why we do it, which algorithms are helped most by which kinds of features, and how to automate building different kinds of continuous-valued and categorical features
- Feature selection: why it’s important to many algorithms
- Sampling: what kind of sampling we should do, how large the samples should be, should we (ever) stratify samples, and how to sample small data sets to improve model robustness
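As a rough illustration of two of the topics above (simple missing value imputation and outlier detection), here is a minimal Python sketch. It is not taken from the workshop materials: the function names, the `ages` data, and the choice of median imputation and the 1.5 × IQR rule are all assumptions made for this example.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Flag values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical column with one missing value and one implausible entry.
ages = [23, 25, None, 31, 29, 27, 240, 26]

filled = impute_median(ages)       # the None is replaced by the median, 27
flags = iqr_outlier_flags(filled)  # only the value 240 is flagged
```

The median is used here rather than the mean because it is not pulled toward the extreme value 240; whether a flagged value should then be removed, capped, or kept depends on the algorithm and the problem.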
Examples and demonstrations will be provided at each stage, showing how the data preparation steps improve the models.
While predictive modeling techniques will not be covered in this workshop, we will build predictive models (regression and decision trees) to demonstrate the effects of data preparation on final models. Attendees are encouraged to ask questions throughout the workshop to clarify concepts and connect them to their own work experiences.
Participants are expected to know basic principles of statistics, such as mean, standard deviation, and what constitutes missing data. This workshop will include demonstrations of techniques using top-end open source analytics tools (actual tools subject to the discretion of the instructor). All code and workflows demonstrated in the workshop will be made available to participants so they may follow along during the workshop, or repeat the analyses on their own.
Course Notes and Free Textbook:
Course notes, code or workflows, and all data needed for the workshop will be provided on a USB drive at the workshop. These materials will also be made available via an Internet link. Paper copies of the workshop notebook will be distributed to attendees upon arrival. All attendees will also receive a paperback copy of Dean’s book, Applied Predictive Analytics (Wiley, 2014).
While the majority of concepts covered during this workshop apply to all predictive analytics projects – regardless of the particular software employed – attendees of this workshop can gain additional insight by following along with the demonstrations using analytics software. Mr. Abbott will be conducting demos using the open source software KNIME but, as time allows, may also show examples using R and/or Python.
Attendees will be able to try the techniques using KNIME during the workshop using their own laptops. Your laptop may run KNIME using Windows, Macintosh, or Linux operating systems (please consult http://www.knime.org for minimum requirements). We recommend you download and install KNIME prior to the workshop because Internet bandwidth at the workshop site is not guaranteed to be fast enough for a timely download of the software.
Attendees may receive an official certificate of completion upon request at the completion of the workshop.
- Software installation assistance, if needed, at 8:30am
- Workshop starts at 9:00am
- Morning Coffee Break at 10:30am – 11:00am
- Lunch provided at 12:30pm – 1:15pm
- Afternoon Coffee Break at 3:00pm – 3:30pm
- End of the Workshop: 4:30pm
Coffee breaks and lunch are included.
Dean Abbott, President, Abbott Analytics
Dean Abbott is Co-Founder and Chief Data Scientist of SmarterHQ, and President of Abbott Analytics in San Diego, California. Mr. Abbott is an internationally recognized data mining and predictive analytics expert with over three decades of experience applying advanced data mining algorithms, data preparation techniques, and data visualization methods to real-world problems, including fraud detection, risk modeling, text mining, personality assessment, response modeling, survey analysis, planned giving, and predictive toxicology.
Mr. Abbott is the author of Applied Predictive Analytics (Wiley, 2014) and co-author of IBM SPSS Modeler Cookbook (Packt Publishing, 2013). He is a highly regarded and popular speaker at Predictive Analytics and Data Mining conferences and meetups, and is on the Advisory Boards for the UC/Irvine Predictive Analytics Certificate as well as the UCSD Data Mining Certificate programs.
He has a B.S. in Mathematics of Computation from Rensselaer (1985) and a Master of Applied Mathematics from the University of Virginia (1987).