By: Richard Boire, Founding Partner, Boire Filler Group
As we all know, predictive analytics is a discipline used to identify patterns or trends that can explain future events or outcomes. Yet these tools perform abysmally if rare events or situations have not been properly handled within the analytical file. These rare events can be handled in a variety of ways. For example, if we are building risk models, we may discover that a certain area of the city has a much higher rate of property claims due to fraud. The records that relate to the fraud cases would then be dropped from the analytical file and not used in developing a predictive model.
On the severity side, we may discover that there are certain claims with extremely large loss amounts. Under this scenario, the records do not have to be dropped, but the extreme loss amounts would be capped. One approach is to sort records by loss amount from highest to lowest into 100 centiles. Variance analysis is then done within the top centile group (the highest loss amount interval) to determine what this cap should be. This rare event scenario differs from the fraud case above in that we are essentially dealing with outlier values for a specific variable, which is a very common scenario for practitioners working with any continuous variable. In the fraud case, more exploratory analysis is required in order to identify the records that need to be excluded, whereas in outlier analysis records are never excluded but merely capped on their extreme values.
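The capping step described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure: it assumes the cap is simply the lowest loss amount in the top centile group, whereas the text leaves the final cap to a variance analysis within that group. The function name `cap_losses` and the loss amounts are hypothetical and synthetic.

```python
def cap_losses(losses, centile=99):
    """Cap losses at the lower boundary of the top centile group.

    Assumption: the boundary itself is used as the cap. In practice the cap
    would come from a variance analysis inside the top centile.
    Expects a non-empty list of loss amounts.
    """
    ranked = sorted(losses)
    # Index of the first record that falls in the top centile group.
    cut = min(len(ranked) * centile // 100, len(ranked) - 1)
    cap = ranked[cut]
    # Records are kept; only values above the cap are pulled down to it.
    return [min(x, cap) for x in losses], cap


# Synthetic loss amounts: 199 ordinary claims plus one extreme claim.
losses = [1000 * i for i in range(1, 200)] + [500000]
capped, cap = cap_losses(losses)
print(cap, max(capped), len(capped))
```

Note that the number of records is unchanged after capping, which is the key difference from the fraud case, where records are removed.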
But what if the objective is to use data mining and predictive analytics to actually predict these rare events? For example, in the insurance sector, the ability to predict fraud is a significant challenge for many organizations. On the severity side, particularly with Accident Benefit claims, there are loss amounts that can be in excess of $500K relative to an average loss amount of $30,000. The ability to better predict these rare events has always been a lofty and challenging goal, but one that can accrue significant benefits to insurers. Yet the fundamental issue, as in any model, is whether or not this behavior is indeed predictable. In other words, can we identify trends or patterns amongst these rare events, or is the behavior simply noise? In building models for rare events, the key is validation. Of course, validation has always been a critical component of the predictive analytics process, but in attempting to predict rare events, validation of the model needs to be even more rigorous, with multiple validation datasets.
Before validation, though, the first challenge is to extract the data. Do we have enough of these rare events to build our target variable? In the era of Big Data, this scenario of sparse data is becoming less of an issue. In building models, a rule of thumb is to obtain a minimum number of rare events (1’s), which is typically 1,000. In some cases, I have used 500 actual occurrences and have obtained results where the top decile has a performance lift of 5 times over the bottom decile.
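To make the decile lift figure concrete, here is a rough sketch of how lift between the top and bottom deciles might be computed from model scores and observed outcomes. The function names and the data are hypothetical; the synthetic outcomes are constructed so that the top decile has a higher event rate than the bottom one, and they are not results from any real model.

```python
def decile_event_rates(scores, outcomes):
    """Return the observed event rate per score decile, highest scores first.

    Assumes at least 10 records so that every decile is non-empty.
    """
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rates = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        rates.append(sum(y for _, y in chunk) / len(chunk))
    return rates


def top_to_bottom_lift(rates):
    """Ratio of the top decile event rate to the bottom decile event rate."""
    if rates[-1] == 0:
        return float("inf")  # no events at all in the bottom decile
    return rates[0] / rates[-1]


# Synthetic data: 1,000 records whose event rate rises with the score.
scores = list(range(1000))
outcomes = [
    1 if (s >= 900 and s % 2 == 0)
    or (100 <= s < 900 and s % 5 == 0)
    or (s < 100 and s % 10 == 0)
    else 0
    for s in scores
]
rates = decile_event_rates(scores, outcomes)
lift = top_to_bottom_lift(rates)
print(rates[0], rates[-1], lift)
```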
Once the analytical file is created, the analyst then needs to consider multiple validation scenarios. Let’s consider these scenarios. The first task before actual development of the model is to split the analytical file into a development and a validation sample. Some practitioners will use a split of 50/50 while others might vary the proportions, such as 30% development and 70% validation. The development file is then stratified so that all the rare events (1’s) and an equal portion of the non-rare events (0’s) are extracted. The validation file is stratified in an identical manner. The model is then built on the stratified sample arising from the development file. The first validation is to apply the model to the stratified sample drawn from the validation file and to observe its performance. AUC (Area Under the Curve) results under this scenario may be difficult to use in validation, as the stratified sample does not represent the real world. However, the results should at least yield rank ordering of performance between the top decile and the bottom decile. Obviously, minimal rank ordering of results would dictate that more work needs to be done before applying the model to further validation datasets.
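The split and stratification steps above can be sketched as follows. This is only an illustration under stated assumptions: "an equal portion" is interpreted here as the same number of 0’s as 1’s, the split is a simple random one, and the record layout (`id`, `target`) and function names are hypothetical.

```python
import random


def split_development_validation(records, dev_fraction=0.5, seed=0):
    """Randomly split records into development and validation samples."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]


def stratify(records, seed=0):
    """Keep every rare event (1) plus an equally sized random sample of 0's."""
    rng = random.Random(seed)
    events = [r for r in records if r["target"] == 1]
    non_events = [r for r in records if r["target"] == 0]
    sampled = rng.sample(non_events, min(len(events), len(non_events)))
    return events + sampled


# Synthetic analytical file: 10,000 records with a 2% event rate.
records = [{"id": i, "target": 1 if i % 50 == 0 else 0} for i in range(10000)]
development, validation = split_development_validation(records, dev_fraction=0.5)

dev_stratified = stratify(development)   # used to build the model
val_stratified = stratify(validation)    # first validation
val_unstratified = validation            # kept untouched for later validation
print(len(dev_stratified), len(val_stratified), len(val_unstratified))
```

Keeping the untouched validation file alongside its stratified version means the same holdout records can serve both the first validation and any later validation on real-world event rates.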
The second validation, though, would be conducted against the unstratified data of the validation sample. Here AUC results can be used to assess model performance, since the data does mimic the real world. But the validation process is still not complete. A third validation is done using a period of time that is completely different from the period of time used in the original analytical file. The model is applied to this so-called out-of-period holdout sample. Here the AUC results are produced, and the expectation is to observe results that are similar to those of the second (unstratified) validation sample. Other validations would comprise different out-of-sample time periods. Some practitioners opt to use one out-of-sample time period that occurs after the time period of the analytical file and another that occurs prior to it. Here the practitioner is simply trying to create different validation scenarios that will yield more or less consistent results.
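As a rough sketch of the comparison described above, the following computes AUC with the rank-sum (Mann-Whitney) formulation and checks whether AUC values from different validation samples fall within a tolerance of one another. The tolerance value, the function names, and the tiny score lists are hypothetical and purely illustrative; what counts as "more or less consistent" is a judgment call for the practitioner.

```python
def auc(scores, outcomes):
    """Area under the ROC curve via average ranks (ties share a rank).

    Assumes outcomes are 0/1 and that both classes are present.
    """
    pairs = sorted(zip(scores, outcomes), key=lambda p: p[0])
    n = len(pairs)
    rank_sum_events = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_events += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    n_pos = sum(outcomes)
    n_neg = n - n_pos
    return (rank_sum_events - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def aucs_are_consistent(auc_values, tolerance=0.05):
    """True if the spread of AUC values stays within the (assumed) tolerance."""
    return max(auc_values) - min(auc_values) <= tolerance


# Illustrative scores only: unstratified validation vs. out-of-period holdout.
in_period_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
out_of_period_auc = auc([0.2, 0.3, 0.6, 0.7, 0.5], [0, 1, 0, 1, 1])
print(in_period_auc, out_of_period_auc,
      aucs_are_consistent([in_period_auc, out_of_period_auc]))
```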
One question that arises is why this rigor and discipline is not adopted in all predictive analytics exercises. In fact, this approach is adopted in many cases, provided that time and data are not constraints. But this type of rigor is often not required, as one typical (unstratified) validation sample can produce very reliable results for non-rare outcomes.
The objective of building stable and robust rare event models is challenging within an environment where volatility and noise are the norm. In some cases, the decision might be to not go forward with implementing the model, as validation has yielded inconsistent results across multiple validation datasets. Yet it is only through a rigorous validation process that success or failure can be determined when building these rare event models.