As I have stated in previous articles, the most difficult challenge in building predictive models is creating the analytical file. Typically, this accounts for 80%-90% of the data scientist’s time, with the remaining 10%-20% spent on the actual runs of the different mathematical/statistical algorithms. In designing the analytical file, the two key elements are the development of the target variable and the development of the independent variables, or potential predictor variables.
Data challenges are a reality in creating the right analytical file. Yet with certain models, such as fraud models, the level of complexity increases. With most models, the target variable or objective function is relatively straightforward: a credit risk model, for example, predicts whether a borrower defaults, and a response model predicts whether a customer responds to an offer.
With fraud models, though, there is the additional challenge of defining the target variable, or the action that typifies fraud, since many organizations have difficulty capturing all fraudulent behaviour. The definition is not as straightforward as in the examples above.
In my early days of data science at American Express, after having developed a number of credit risk models, predicting fraud became another objective of our department. Our business objective was to determine whether we could identify a transaction (i.e., the target variable) as being fraudulent. In the American Express case, this was relatively straightforward: we identified lost or stolen cards and labelled as fraudulent any transactions that occurred between the time the loss was reported and the time the card was cancelled. We then built models that differentiated these transactions from non-fraudulent ones. During our work, the analysis indicated that there were key upfront fraud segments, and that separate models should be developed for each segment. Not surprisingly, the overall results indicated that “out of pattern” behaviour drives fraudulent activity. Increased activity, in terms of both the number of transactions and the transaction amounts, represented a key component of this “out of pattern” activity. Transactions occurring outside the cardholder’s normal geographic area, or in unusual spend categories, were other components of this out of pattern spending.
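The labelling rule described above can be sketched in a few lines. This is a minimal illustration, not American Express’s actual system; all field names are assumptions for the example.

```python
from datetime import datetime

def label_fraud(transactions, reported_at, cancelled_at):
    """Tag each transaction with a binary fraud target (1 = fraudulent)."""
    labelled = []
    for txn in transactions:
        # Fraudulent if the transaction occurred after the loss was
        # reported and before the card was cancelled.
        is_fraud = reported_at <= txn["timestamp"] <= cancelled_at
        labelled.append({**txn, "fraud": int(is_fraud)})
    return labelled
```

The resulting 0/1 column then serves as the target variable when building the model that separates fraudulent from non-fraudulent transactions.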
Yet in many other cases outside of credit card spend, the actual fraud cases that would serve as the target variable are not readily apparent, or the number of observable fraud cases is simply too sparse to support robust models. Here the data scientist and the domain knowledge expert need to be creative in arriving at definitions of what is deemed fraudulent. In many cases, the target variable may actually be out of pattern behaviour itself. Essentially, the data scientist and domain knowledge expert are establishing criteria for a “quasi” definition of fraud without knowing with certainty that a particular instance or event is in fact fraud.
Good examples of this type of fraudulent activity arise in the insurance sector, particularly in claims processing. Because of the difficulty of actually proving that a claim is fraudulent, many genuine fraud cases go unreported. Analytics would therefore explore the data for out of pattern behaviour, such as multiple claims occurring against the same policy within a very short period of time. Another approach might be to look for large spikes in claim frequency within very short time intervals. Once this quasi-fraud behaviour has been identified, predictive models can be built to examine what these claims look like and which characteristics they share.
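A simple version of the “multiple claims in a short window” rule might look like the sketch below. The field names, the 30-day window, and the claim-count threshold are illustrative assumptions; in practice these would come from the domain knowledge expert.

```python
from collections import defaultdict
from datetime import date, timedelta

def flag_out_of_pattern(claims, max_claims=2, window=timedelta(days=30)):
    """Return policy ids with more than `max_claims` claims in any `window`."""
    by_policy = defaultdict(list)
    for claim in claims:
        by_policy[claim["policy_id"]].append(claim["date"])
    flagged = set()
    for policy_id, dates in by_policy.items():
        dates.sort()
        for i, start in enumerate(dates):
            # Count claims filed within `window` of claim i.
            n = sum(1 for d in dates[i:] if d - start <= window)
            if n > max_claims:
                flagged.add(policy_id)
                break
    return flagged
```

The flagged policies become the quasi-fraud positives; a predictive model is then built to learn what distinguishes these claims from the rest.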
Other sectors where fraud is difficult to identify include industries that rely on agents and/or advisors, such as the mortgage broker industry and the wealth management industry. In the mortgage broker case, we examine the processing of applications; in wealth management, the processing of deposits. In both cases, we attempt to identify current activities by an agent/advisor that are atypical compared with that same agent/advisor’s own history of processed activities. Furthermore, is this out of pattern behaviour extremely different from what we observe across all agents/advisors within the industry? Predictive models can then be used to determine what these out of pattern agents/advisors look like and where they are located.
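The two comparisons above, against an agent’s own history and against all peers, can be expressed as a pair of z-scores. This is a minimal sketch under assumed inputs (e.g., weekly deposit counts); real systems would use richer features and thresholds set by the domain expert.

```python
import statistics

def out_of_pattern_scores(current, own_history, peer_values):
    """Z-scores of `current` vs. the agent's own history and vs. all peers."""
    def z(x, values):
        mu = statistics.mean(values)
        sd = statistics.pstdev(values) or 1.0  # guard against zero variance
        return (x - mu) / sd
    return {"vs_self": z(current, own_history),
            "vs_peers": z(current, peer_values)}
```

An agent scoring high on both dimensions is out of pattern relative to their own past and to the industry, making them a candidate for investigation.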
Fraud models can certainly prioritize activities by their likelihood of fraud. However, these models must be deployed with caution, particularly when the target variable is based on out of pattern behaviour rather than on observed fraud events. Given these limitations, the real benefit of these models is to prioritize investigative efforts, with initial efforts focussed on gathering more information. For example, by concentrating effort on the insurance claims predicted to be most fraudulent, one might find that specific institutions are associated with the processing of these claims. Further investigation might uncover business practices within these institutions that leave them more exposed to fraudulent behaviour. As well, the model’s profile characteristics might identify certain types of claims that are predicted to be more fraudulent, and the insurance company can then allocate more resources to processing that type of claim.
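The investigative triage described above amounts to scoring claims with the model and tallying which institutions appear most often among the high-risk cases. A hedged sketch, where the score field and the 0.8 cut-off are purely illustrative:

```python
from collections import Counter

def top_institutions(scored_claims, threshold=0.8, k=3):
    """Institutions most common among claims scored above `threshold`."""
    high_risk = (c for c in scored_claims if c["score"] >= threshold)
    counts = Counter(c["institution"] for c in high_risk)
    return counts.most_common(k)
```

The output is a ranked shortlist of institutions on which to focus the initial information-gathering effort.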
Using the same approach as in the insurance example above, but within wealth management, could we identify deposits that appear to be fraudulent? More importantly, are there certain investment agents/advisors associated with these types of deposits? Might we be able to prevent another Bernie Madoff Ponzi scheme?
Despite the unique challenges in developing fraud models, being pre-emptive rather than reactive in addressing fraud can yield tremendous benefits, and predictive models make that pre-emptive stance possible. Think of the dollars saved by introducing new business practices at certain hospitals before they incur more fraudulent claims, or the dollars saved for investors by shutting down Ponzi schemes in their early stages. Predictive analytics techniques represent the critical link in crafting these preventive fraud strategies.
About the Author:
Richard Boire, B.Sc. (McGill), MBA (Concordia), is the founding partner of the Boire Filler Group, a nationally recognized leader in the database and data analytics industry, and is among the top experts in this field in Canada, with unique expertise and background experience. The Boire Filler Group was recently acquired by Environics Analytics, where Mr. Boire is currently a senior vice-president.
Mr. Boire’s mathematical and technical expertise is complemented by experience working at, and with, clients in both the B2C and B2B environments. He has worked at or with clients such as Reader’s Digest, American Express, Loyalty Group, and Petro-Canada, among many others, establishing his top-notch credentials.
After 12 years of progressive data mining and analytical experience, Mr. Boire established his own consulting company, Boire Direct Marketing, in 1994. He writes numerous articles for industry publications, is a sought-after speaker on data mining, and works closely with the Canadian Marketing Association in a number of areas, including education and the Database and Technology councils.