Machine Learning Times
Keeping Data Inclusivity Without Diluting Your Results

Originally published January 17, 2020

Let’s say you are surveying 100 people out of 10,000. You want to analyze the data from your sample of 100 to get answers about the likely behaviors and preferences of the overall 10,000-person population.

Part of your project focuses on equity among sexual orientations. You don’t want to leave anyone out, and you know that a sexual orientation question where people can select only ‘heterosexual or homosexual’ isn’t inclusive enough. You consult experts and the local community and decide to include ‘Heterosexual, Gay, Lesbian, Bisexual, Pansexual, or Asexual’ as options in that question.

Once your responses have come in, you have data from respondents across each of those categories; however, only a few respondents identified as bisexual, and only one person each identified as pansexual and asexual. When you try to analyze the data to represent the responses of all these orientations, you realize that some categories have so little data that you can’t say anything statistically relevant about them: you can’t extrapolate the preferences and likely opinions of all asexually identifying people in your population of 10,000 from one person’s data.
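The intuition above can be made concrete with confidence intervals: a proportion estimated from one respondent has an interval spanning most of the 0–1 range, while the same question answered by 80 respondents yields a usefully narrow one. A minimal sketch, using the standard Wilson score interval; the category names and counts below are illustrative, not from the article:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# Hypothetical survey: how many respondents in each category answered
# "yes" to some preference question (counts are made up for illustration).
samples = {
    "Heterosexual": (40, 80),  # 40 of 80 said yes
    "Bisexual":     (2, 3),
    "Asexual":      (1, 1),    # a single respondent
}
for group, (yes, n) in samples.items():
    lo, hi = wilson_interval(yes, n)
    print(f"{group}: n={n}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

With n = 80 the interval is roughly 0.39–0.61; with n = 1 it stretches from about 0.21 to 1.00, which is why a single response supports no statistically relevant claim about the wider population.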

Rather than completely discount the categories with very few responses, you decide it’s better to combine them into an amalgamated category so that they can be better represented. When you publish your findings, you frame your results as Heterosexual, Homosexual, and Other: the very thing you were trying to avoid. People are hurt and angry that they aren’t well represented and feel lumped into an ‘other’ category. Respondents who took your survey feel cheated, having been asked detailed questions whose answers you combined anyway.

This kind of ‘collapsing’ or ‘amalgamating’ of data categories happens all the time, and not just with sexual orientation. Almost all demographic questions are susceptible to being limited in the survey or condensed in the analysis: race, ethnicity, gender, language, and so on. Imagine how difficult, and how statistically useless, it would be to list every possible spoken language as an option on a survey. How can we be inclusive without making minority categories so small that only the majority data has statistical relevance?
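The collapsing step itself is usually a simple minimum-count rule. A minimal sketch of that mechanic; the function name, threshold, and response counts are hypothetical, and the point is that the `min_count` parameter is exactly the trade-off the article describes: raising it buys statistical power per category at the cost of lumping more people into ‘Other’:

```python
from collections import Counter

def collapse_small_categories(responses, min_count=5, other_label="Other"):
    """Merge categories with fewer than min_count responses into one bucket.

    A higher min_count gives more statistical power per remaining
    category, but lumps more respondents into the catch-all label.
    """
    counts = Counter(responses)
    collapsed = Counter()
    for category, n in counts.items():
        if n >= min_count:
            collapsed[category] += n
        else:
            collapsed[other_label] += n
    return dict(collapsed)

# Illustrative responses, not real survey data.
responses = (["Heterosexual"] * 80 + ["Gay"] * 9 + ["Lesbian"] * 6
             + ["Bisexual"] * 3 + ["Pansexual"] + ["Asexual"])
print(collapse_small_categories(responses, min_count=5))
# → {'Heterosexual': 80, 'Gay': 9, 'Lesbian': 6, 'Other': 5}
```

Note that with this threshold the three smallest categories vanish into ‘Other’ even though respondents were asked, and answered, a far more detailed question.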

Competing Priorities:

  1. It’s important that the diversity among your respondents is respected.
  2. It’s important that the results you show be statistically meaningful.

To continue reading this article, click here.
