Machine Learning Times

2 months ago
Sampling For Your Analysis


So you have a mailing campaign you are about to conduct. Your goal is to increase both response rates and sales volume, so a customer targeting methodology is crafted. Nothing elaborate, but response and sales models will be developed. You have results from a previous program and are prepared to aggregate the data so that it can be mined. We have 974,232 individuals tagged as mailed and 11,418 flagged as responders, a response rate of about 1.17%.

Response models can be developed in the standard way. We may use a sample of mailed households, with associated responders, and construct a model. Similarly, we retrieve our 11,418 responders (those who made a purchase) and formulate a sales model. While this is certainly a typical approach for marketers, there is a thorny question that must be addressed: how do we use these models? Granted, the response model can be deployed against the new mailing population. But the sales model was developed on responders only. How do we take a model constructed on responders only and apply it to the larger universe, the mailed population? Modeling based on non-randomly selected samples can lead to erroneous results.

When we apply machine learning algorithms to our sales prediction problem, we may very well encounter a problem of selection bias. Spending (sales) is said to be a censored variable: if you didn’t respond, you have NO sales, so many records carry zero values. Researchers have observed that if this is not considered in the analysis, the model may well produce biased results.
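A small simulation can make this bias concrete. The sketch below is illustrative only: the data-generating numbers (a $100 baseline, a $20 lift per unit of a predictor `x`, a 0.7 correlation between the unobserved drivers of response and of spending) and the function names are assumptions invented for the example, not figures from the campaign above. It fits the same straight line twice, once on everyone’s hypothetical spending and once on responders only.

```python
import numpy as np

def simulate_campaign(n=200_000, rho=0.7, seed=0):
    """Simulate a mailing in which response and spending share unobserved drivers.

    Every coefficient here is a made-up assumption for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                      # a predictor, e.g. a wealth index
    u = rng.normal(size=n)                      # unobserved driver of response
    noise = rng.normal(size=n)
    e = rho * u + np.sqrt(1 - rho**2) * noise   # unobserved driver of spending
    responded = (-1.0 + x + u) > 0              # who answers the mailing
    potential_sales = 100 + 20 * x + 30 * e     # what each person would spend
    return x, responded, potential_sales

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

x, responded, potential_sales = simulate_campaign()
full_fit = fit_line(x, potential_sales)
responder_fit = fit_line(x[responded], potential_sales[responded])
print("fit on everyone:        intercept=%.1f slope=%.1f" % full_fit)
print("fit on responders only: intercept=%.1f slope=%.1f" % responder_fit)
```

Under these assumptions the responders-only line comes out with an inflated intercept and a flattened slope, so scoring the full mailed population with it would misstate spending for exactly the people who were never observed buying.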

Other similar sample selection issues are quite prevalent. Take, for example, the bank that uses sophisticated modeling algorithms to determine who should be approved for a mortgage loan. The analytic process is statistical and ‘objective’ in nature. Or is it? The models typically build a profile of previous applicants who have been good borrowers, that is, those who have repaid their loans according to the agreed-upon terms.

But wait a minute. The statistical models employed to assess new applicants are developed from historical loan data. The sample used to construct the model consists of applicants who were already approved and are currently the bank’s customers! And here’s the issue: you cannot develop a model on a bank’s customers and deploy those results to the entire world! The bank and the world are different universes.

The above credit issue is addressed through various means, specifically through a process known as reject inference. Much has been written on this topic, and I leave it to the interested reader to further investigate the details.

The details of the recently issued Apple credit card, however, are not as clear. Claims that men were assigned higher credit lines than women were confirmed by several sources. Does that make the credit grantor biased?

Leo Kelion, a technology desk editor, suggests that the difference in credit line assignment may be due to how the algorithms involved were developed: they may have been trained on a data set in which women indeed posed a greater financial risk than men. This could cause the software to spit out lower credit limits for women in general, even if the assumption it is based on is not true for the population at large. Again, we observe a sample bias, one that has caused much commotion in the industry.

Another often-quoted illustration of sample bias involves a problem that, unfortunately, many managers do not consider.

Take the researcher interested in comparing the incomes of those who attended college with the incomes of those who did not. The straightforward procedure would be to compare the average income of college attendees with the mean income of those who did not attend. Simple enough, you think. But wait a minute, again. It could very well be that those attending college possess additional attributes that may not be present to the same degree in the non-college population. For example, patience or concentration may be characteristics that impact income whether or not an individual attends college. These additional characteristics may very well distort the comparison of the two groups.

Let’s briefly return to the response/sales models referred to earlier, in an attempt to assess the situation. First, a brief recap.

Marketers have developed response models to target audiences most likely to respond. This has often proved to be the bread and butter of marketing campaigns. But hold on. These response models may be very good, but is the highly responsive segment the most profitable? Are they purchasing the most? The fact is that response models can often locate the most responsive customers who actually spend the least! To minimize the potential to target responsive, low-spending individuals, researchers have extended the response model scenario to include response AND spending algorithms. This, as mentioned above, results in a real selection bias. Models can be developed for spending. That’s good. But spending implies response. As we proceed to operationalize the model, we are applying the results to the full contact universe, not only the responders!

There are a number of approaches that have been suggested to address this problem. These tactics may very well not address the underlying problems, but in practice, at least some of them, some of the time, appear to do a credible job at confronting our dilemma.

  1. Use just a response model. Perhaps the response model correlates well with sales, and one analysis is adequate.
  2. We can ignore the problem. Even though, theoretically, this is not appropriate, in practice it may meet the marketer’s objective. Build two models, each on its appropriate universe, and select the winning model based on the criterion that it performs well on the initial mailing population. Use expected value as a criterion. Expected value is nothing more than the result achieved by multiplying the response score by the sales score. Sort by this value, and select your mailing population.
  3. Construct only a sales model. Skip over the response effort. Perhaps response will correlate with the ten sales deciles.
  4. Fill the non-responders with a ‘0’ sales value. After all, they didn’t spend!
  5. Fill the non-responders with random values (very low values).
  6. Use what is referred to as the Heckman correction, a tool developed to address sample selection bias.
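The expected value criterion described in item 2 can be sketched in a few lines. This is a minimal illustration, not a production scoring routine: the `Prospect` fields, the function names, and the three example prospects are all hypothetical, and the two scores are assumed to come from response and sales models that have already been fitted.

```python
from dataclasses import dataclass

@dataclass
class Prospect:
    customer_id: str
    p_response: float       # response model score (probability of responding)
    predicted_sales: float  # sales model score (dollars, given a response)

def expected_value(prospect):
    """Expected sales per piece mailed: response score times sales score."""
    return prospect.p_response * prospect.predicted_sales

def select_mailing(prospects, mail_quantity):
    """Rank prospects by expected value and keep the top mail_quantity."""
    ranked = sorted(prospects, key=expected_value, reverse=True)
    return ranked[:mail_quantity]

# Hypothetical prospects: A responds most often, B spends the most.
prospects = [
    Prospect("A", p_response=0.05, predicted_sales=100.0),
    Prospect("B", p_response=0.01, predicted_sales=800.0),
    Prospect("C", p_response=0.03, predicted_sales=120.0),
]
for p in select_mailing(prospects, mail_quantity=2):
    print(p.customer_id, round(expected_value(p), 2))
```

In this toy example the most responsive prospect does not rank first, which is the situation the combined criterion is meant to catch. Note that the sales score here still comes from a model built on responders only, so the selection issue discussed above remains.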

Let me share some quick thoughts on Items 1 and 6 above.

  1. Use just response analysis

The marketer’s response model consisted of typical predictors including wealth indicators, current retailer’s performance in geographies, and family composition. The final algorithm, generated through a logistic regression, produced the following results.

DECILE   Response Rate
     1           5.86%
     2           2.60%
     3           1.52%
     4           0.86%
     5           0.48%
     6           0.26%
     7           0.10%
     8           0.04%
     9           0.01%
    10           0.00%
 Total           1.17%


Results were better than expected. Let’s now add sales to the above decile table. Remember, for non-responders, sales are all ‘0’.

DECILE   Response Rate     Sales
     1           5.86%   $136.33
     2           2.60%   $120.64
     3           1.52%   $114.96
     4           0.86%   $112.11
     5           0.48%   $106.33
     6           0.26%   $171.82
     7           0.10%   $175.40
     8           0.04%    $81.57
     9           0.01%    $74.39
    10           0.00%    $80.15
 Total           1.17%   $123.83


While there certainly appears to be a relationship between response and sales, it is also evident that there isn’t much of a distinction in sales volume across the first six or so deciles. While this approach is not really satisfying, it nevertheless does provide increased response rates and ‘better’ than average sales estimates. Beware, this is not always the case. I have seen performance reports showing fairly constant spending throughout the ten deciles.

In any event, this procedure does not directly address the sample selection bias.
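A decile report of this kind can be produced along the following lines. This is a rough sketch with hypothetical names; in particular, it assumes the sales column is the average sale among the responders in each decile, which the tables above do not state explicitly.

```python
def decile_report(records):
    """Build a decile table from (score, responded, sales) tuples.

    Records are ranked by model score, best first, and cut into ten
    equal-sized groups. Sales are averaged over responders only, which
    is an assumption about how such a table is usually built.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    n = len(ranked)
    report = []
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        responders = [r for r in group if r[1]]
        rate = len(responders) / len(group) if group else 0.0
        avg_sales = (
            sum(r[2] for r in responders) / len(responders) if responders else 0.0
        )
        report.append({
            "decile": d + 1,
            "mailed": len(group),
            "response_rate": rate,
            "avg_sales_per_responder": avg_sales,
        })
    return report

# Tiny made-up file: 20 mailed records, only the two best-scored ones responded.
records = [(20 - i, i < 2, [100.0, 200.0][i] if i < 2 else 0.0) for i in range(20)]
for row in decile_report(records):
    print(row)
```

Ranking on the response score alone, as here, is what allows a decile to show a high response rate alongside unremarkable sales.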

  6. Use the Heckman correction

James Heckman, in a landmark study (Heckman, J. (1979). “Sample selection bias as a specification error”. Econometrica 47 (1): 153–61), proposed a two-stage estimation procedure to tackle the selection bias problem. In the first step, a regression analysis is performed to model response. With a bit more statistical juggling, the output of this first-step regression is then incorporated as an additional explanatory variable in the spending regression model. This tactic, popularly referred to as the Heckman correction, should not be considered the ultimate solution. It is not magical. It doesn’t always produce the results you may be looking for. So be wary.
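To show the mechanics of the two steps, here is a minimal NumPy sketch run on simulated data. It is not the analysis behind the tables above, and it is no substitute for a tested statistical package. The function names, the simulated coefficients, and the variable `w` (assumed to influence response but not spending) are all inventions for the example. The first stage is a probit response model, and the "additional explanatory variable" is the inverse Mills ratio computed from it.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(z):
    return np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2)))

def fit_probit(X, y, iterations=15):
    """Stage 1: probit response model, fitted by Fisher scoring."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        z = X @ beta
        cdf = np.clip(norm_cdf(z), 1e-10, 1 - 1e-10)
        pdf = norm_pdf(z)
        denom = cdf * (1 - cdf)
        score = X.T @ (pdf * (y - cdf) / denom)
        info = X.T @ (X * (pdf**2 / denom)[:, None])
        beta = beta + np.linalg.solve(info, score)
    return beta

def heckman_two_step(X_response, responded, X_sales, sales):
    """Stage 2: OLS on responders with the inverse Mills ratio appended.

    Returns the probit coefficients and the sales coefficients; the last
    sales coefficient belongs to the inverse Mills ratio.
    """
    gamma = fit_probit(X_response, responded.astype(float))
    index = X_response[responded] @ gamma
    inverse_mills = norm_pdf(index) / norm_cdf(index)
    design = np.column_stack([X_sales[responded], inverse_mills])
    coef, *_ = np.linalg.lstsq(design, sales[responded], rcond=None)
    return gamma, coef

def simulate(n=40_000, rho=0.7, seed=1):
    """Made-up data: true sales equation is 100 + 20 * x plus noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    w = rng.normal(size=n)  # assumed to affect response only
    u = rng.normal(size=n)
    e = rho * u + math.sqrt(1 - rho**2) * rng.normal(size=n)
    responded = (-0.5 + x + w + u) > 0
    sales = 100 + 20 * x + 30 * e  # only used where responded is True
    return x, w, responded, sales

x, w, responded, sales = simulate()
ones = np.ones_like(x)
X_response = np.column_stack([ones, x, w])
X_sales = np.column_stack([ones, x])
gamma, coef = heckman_two_step(X_response, responded, X_sales, sales)
naive, *_ = np.linalg.lstsq(X_sales[responded], sales[responded], rcond=None)
print("naive responders-only fit:", np.round(naive, 2))
print("two-step fit (last = IMR):", np.round(coef, 2))
```

The sketch deliberately includes `w` in the response equation but not in the sales equation. Without some variable of that kind, the inverse Mills ratio tends to be nearly collinear with the other predictors, which is one reason the correction can disappoint in practice. The simulation also assumes jointly normal errors, the setting in which the two-step estimator is derived.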

While there are numerous software packages designed to perform a Heckman analysis, three that I am aware of are:

  • R package sampleSelection
  • Stata command xtheckman
  • QLIM procedure in SAS

Constructing carefully developed samples is a necessary ingredient in building predictive models and in performing objective analyses. But analysts must be vigilant and careful in constructing those subsets without bias. These samples must characterize the universe as a whole if inferences gleaned from the sample are to be correctly deployed to records outside of the sample. While there are techniques that may be used to mitigate the issues, none is really foolproof. Careful and deliberate construction of data for model building, and communication about the potential issues with these data, can make the job of the analyst and the marketer somewhat easier, but still complex.

About the Author

Sam Koslowsky serves as Senior Analytic Consultant for Harte Hanks. Sam’s responsibilities include developing quantitative and analytic solutions for a wide variety of firms. Sam is a frequent speaker at industry conferences, a contributor to many analytic related publications, and has taught at Columbia and New York Universities. He has an undergraduate degree in mathematics, an MBA in finance from New York University, and has completed post-graduate work in statistics and operations research.

Harte Hanks is a global marketing services firm specializing in multichannel marketing solutions that connect our clients with their customers in powerful ways. Experts in defining, executing and optimizing the customer journey, Harte Hanks offers end-to-end marketing services including consulting, strategic assessment, data, analytics, digital, social, mobile, print, direct mail and contact center. From visionary thinking to tactical execution, Harte Hanks delivers smarter customer interactions for some of the world’s leading brands. Harte Hanks’ 5,000+ employees are located in North America, Asia-Pacific and Europe.
