Winning with Data Science is a compelling and comprehensive guide for customers of data science. It teaches readers how to work with data scientists by emphasizing real-world business applications and focusing on how to collaborate productively with data science teams. The book takes a narrative approach with fictitious characters embracing new projects involving data science teams.
In this adapted excerpt, Brett, a lead data scientist working for Shu Financial, partners with Steve, a newly minted MBA who is halfway through his two-year rotation through the Fraud Department, the Recoveries Department, and the Real Estate Division, to improve the Fraud Department’s detection of application fraud. Brett and Steve collaborate throughout Winning with Data Science along with other members of their teams.
Steve’s rotation within Shu Financial was now taking him to a new department, Fraud.
The goal of the Fraud Department is to protect the company from losses and to assure Shu Financial customers that their private information is secure. While not a profit center like some other parts of the business, the Fraud Department is considered vital to the viability of the company.
Descriptive statistics calculated by the data analysts have shown that most of the company’s fraud losses are due to application fraud, which involves creating a new, fraudulent account by applying for a credit card in someone else’s name or taking over an existing account by changing key account information like the name and address on the account.
The company already has one supervised machine learning model that predicts the probability that an application is fraudulent. But a key question that keeps arising is whether there are different types of application fraud. If so, it might be advantageous to identify these distinct types and target them with different predictive models.
Steve was brimming with ideas but uncertain as to what steps to take first, so he arranged a meeting with the data science team about his challenge.
After hearing the problem description, the lead data scientist, Brett, nodded his head.
“Sounds like a classic unsupervised machine learning problem. The first thing we would like to do is find a good way to visualize the customer data. This will help us see if there are obvious groupings of fraudulent applications. At this stage, we are not trying to predict the probability of fraud. We just want to see if we can detect patterns among the frauds. Unsupervised machine learning is useful not only for finding subpopulations but also as a very good tool for exploratory data analysis in general, where we can find patterns in the data that we aren’t currently aware of or don’t use.”
Steve understood the goal, but the actual steps seemed unclear. “There are so many variables for each application: variables on the application itself, comparisons to other applications in that zip code, information from the credit report, the transactions on the account, and the online and telephone activity. How can you possibly graph all those variables and make any sense of them? Are you going to make hundreds of scatterplots?”
Brett explained, “Unsupervised machine learning will quickly surpass anything you would do on an Excel spreadsheet. You are right; there are huge numbers of possible variables, and we will need to be clever about doing the data exploration. Luckily, there is a great method called principal components analysis that can help us make sense of situations where there are a lot of features. We’ll do that as the first step.”
Principal components analysis is called a dimensionality-reduction method because it tries to reduce the number of dimensions a user must consider in the analysis. Essentially, it maps the original data set into a new data set that carries the same information but views the data through a different, transformed lens.
Steve and Brett worked side by side to examine the amount of variance explained by the different components and agreed that they would focus initially on the first three components, since they explained the majority of the variance in the data.
This partnership between the business unit, which provides domain-specific knowledge, and the data science team is critical. The business unit helps provide insight that can illustrate the interpretation of not only the principal components but also the clusters themselves. The stronger this partnership, the more useful the final product. When this partnership is weak, often the data science products do not meet the business needs, resulting in wasted time and money.
By reducing the data set from a large set of features about the accounts to just these three principal components, they have achieved dimensionality reduction—having fewer dimensions to work with will make the problem much easier to understand analytically and lead to faster solutions. A key in performing this dimensionality reduction is to be certain you don’t discard too much critical information that you will want in the analysis.
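The reduction Brett describes can be sketched in a few lines of Python. This is only an illustration, not the team’s actual pipeline: the data here is simulated, and scikit-learn’s `PCA` is one common implementation of principal components analysis. The explained-variance ratios show how much information each component retains, which is how Steve and Brett judged that three components were enough.

```python
# Hypothetical sketch: PCA on a matrix of application features,
# keeping the first three principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # simulated: 500 applications, 20 features

# Standardize first, so no single feature dominates the components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance each of the three components explains,
# and the total captured by all three
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
print(X_reduced.shape)  # twenty features reduced to three dimensions
```

On real application data, you would inspect `explained_variance_ratio_` (often as a scree plot) and keep adding components until the cumulative total is high enough that little critical information is discarded.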
The second component had large positive coefficients related to how often the customer checked the account online in the first few days after receiving the card and how many gas station purchases they made in the first few days. Steve understood that these were criminals who wanted to make sure the card was still active and who preferred anonymous transactions, like testing the card at a gas station. This group was cautious about how they used the card, avoiding interactions in a store or in front of people.
A series of scatterplots helped the team quickly identify a few clusters within the fraudulent applications. The first component, which captured applications with good credit scores where the criminal immediately tried committing large frauds, was actually composed of three groups, based on the speed and types of transactions the fraudsters attempted. They called these groups the bold, the hesitant, and those “riding the fence.”
With the different groups of fraudulent applications identified, the data science team planned to build separate models to target the different clusters. For each model, the team would provide reason codes to the fraud investigators so they would understand why the account was flagged by the system, as well as what specific style of application fraud the models thought was most likely. This information helps streamline the investigators’ work and leads to further operational improvements.
With a predictive model, we can look at the accuracy of the predictions to decide whether the model is sufficiently good. With unsupervised machine learning, we do not have a target variable we are trying to predict. This means we cannot state how well the model performed. Rather, we need to see how well the output of the model can be described and used in a practical sense.
Cluster profiling refers to the process of generating unique descriptions for each cluster using the input variables. The profiles are usually generated by looking at the average characteristics in each cluster and labeling each cluster based on a few variables that are the most extreme (large or small) in that specific cluster.
Often subject-matter experts will be called in to review the clusters to assess whether they make sense and are readily distinguishable. In Steve’s case, he can ask members of the Fraud Department to review the cluster analysis and give their opinions on whether these clusters are distinguishable and represent subpopulations in the real world. That consultation also opens the conversation around how they can use these clusters to improve their investigations.
Let’s see how things worked out for Steve in his fraudulent application example.
Brett began the project debrief. “Here’s the rundown of the clustering project. We started by doing a principal components analysis, which showed us that there was definitely going to be some separation between the different fraudulent applications, though it was unclear exactly how many clusters to use. We selected the three top principal components, which accounted for 90 percent of the variance, and then began our cluster analysis. The components were all standardized; some initial exploration showed us that range standardization worked best. We used K-means clustering to test between two and ten clusters. The final number of clusters was set at six, though we could have made pretty good arguments for any number between four and seven. Once we had our final number of clusters, we reviewed the average input values of each cluster and named them.”
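The steps Brett walks through can be sketched as follows. The data is simulated, and scikit-learn’s `KMeans` and `MinMaxScaler` (one way to implement range standardization) are assumptions rather than the team’s actual tooling; in practice you would compare metrics such as inertia or silhouette scores across candidate values of k.

```python
# Hypothetical sketch of the clustering step: range-standardize the top
# three principal components, then run K-means for k = 2..10 and compare
# inertia (within-cluster sum of squares) to choose a final k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)
components = rng.normal(size=(400, 3))  # stand-in for the top 3 PCs

# Range standardization: rescale each component to [0, 1]
scaled = MinMaxScaler().fit_transform(components)

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = km.inertia_

# Inertia always falls as k grows; analysts look for an "elbow"
# where adding clusters stops helping much
for k, val in inertias.items():
    print(k, round(val, 1))
```

Because inertia decreases monotonically with k, the choice of six clusters over, say, five or seven is a judgment call, which matches Brett’s remark that several values of k were defensible.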
Steve nodded his head. “Yes, I remember all these steps. Of course, my manager is going to want to know what value we got out of doing this clustering. She is very focused on making sure I can answer the question ‘So how does this translate into Shu Financial stopping fraud losses?’”
Brett smiled. “Yes, many times she has interrupted me with that sharp question. But we are prepared to answer it. There are a few ways we used this cluster analysis to help the company. The first is that we used to have only one model to predict the probability of application fraud. The cluster analysis helped us develop other behavior-based predictive models that target some of the specific features of the different clusters. These separate models are far more accurate at predicting fraud than our single model was, so our fraud detection rate is now much higher, meaning we can prevent more fraud losses and lower our operating expenses, since investigators’ time is better targeted.”
Steve interrupted. “So, the unsupervised machine learning actually helped us build new supervised machine learning models to help detect fraudsters more quickly.”
Brett continued. “Yes, in fact this happens often. Unsupervised machine learning and supervised machine learning often are partners in solving real-world problems. In addition to the new models, we use cluster analysis directly by scoring each application and assigning it to a cluster. We send the information about which cluster the application belongs to over to the investigators along with some of the application’s distinguishing features. These features serve as reason codes that inform the investigators as to what makes an application unusual and worthy of attention. This cluster and reason code information helps the investigative unit where the employees have divided themselves into different teams to focus more efficiently on different clusters.”
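The scoring step Brett describes, assigning each new application to a cluster and attaching reason codes, might look like the following. The feature names and the “most-deviant feature” heuristic for reason codes are invented for illustration; real reason-code logic would be tailored to the fraud models.

```python
# Hypothetical sketch of scoring: fit K-means once, assign each new
# application to its nearest cluster, and report the feature that
# deviates most from that cluster's center as a simple reason code.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
feature_names = ["credit_score", "online_checks", "gas_purchases"]
X_train = rng.normal(size=(300, 3))  # simulated historical applications

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_train)

# Score one new application
new_app = rng.normal(size=(1, 3))
cluster = int(km.predict(new_app)[0])

# Reason code: the feature where this application differs most
# from its assigned cluster's center
center = km.cluster_centers_[cluster]
deviations = np.abs(new_app[0] - center)
reason = feature_names[int(np.argmax(deviations))]
print(cluster, reason)
```

In production, the cluster label and reason code would be routed to the investigative team specializing in that cluster, which is the operational efficiency Brett mentions.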
“This means that our cluster analysis helped us not only gain insight into our data but also create new models and improve the operational efficiency of our units.”
Excerpted from Winning with Data Science by Howard Steven Friedman and Akshay Swaminathan, published by Columbia Business School Publishing. Copyright (c) 2024 Howard Steven Friedman and Akshay Swaminathan. Used by arrangement with the Publisher. All rights reserved.
More information about Winning with Data Science is available online.
About the Authors
Howard Steven Friedman is a data scientist, health economist, and writer with decades of experience leading data modeling teams in the private sector, the public sector, and academia. He is an adjunct professor at Columbia University, teaching data science, statistics, and program evaluation, and has authored or coauthored over 100 scientific articles and book chapters in the areas of applied statistics, health economics, and politics. His previous books include Ultimate Price and Measure of a Nation, which Jared Diamond called the best book of 2012.
Akshay Swaminathan is a data scientist who works on strengthening health systems. He has more than forty peer-reviewed publications, and his work has been featured in the New York Times and STAT. Formerly at Flatiron Health, he now leads the data science team at Cerebral and is a Knight-Hennessy scholar at Stanford University School of Medicine.