Machine Learning Times
Keeping Data Inclusivity Without Diluting your Results

Originally published in WeAllCount.com, January 17, 2020

Let’s say you are surveying 100 people out of 10,000. You want to analyze the data from your sample of 100 to get answers about the likely behaviors and preferences of the overall 10,000-person population.

Part of your project focuses on equity among sexual orientations. You don’t want to leave anyone out, and you know that a question about sexual orientation where people select ‘heterosexual or homosexual’ isn’t inclusive enough. You consult experts and the local community and decide to offer ‘Heterosexual, Gay, Lesbian, Bisexual, Pansexual, or Asexual’ as options for that question.

Once your responses have come in, you have data from respondents across each of those categories. However, only a few respondents identified as bisexual, and only one person each identified as pansexual and asexual. When you try to analyze the data to represent all of these orientations, you realize that some categories contain so little data that you can’t say anything statistically meaningful about them: you can’t extrapolate the preferences and likely opinions of all asexually identifying people in your population of 10,000 from one person’s data.
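To see why one respondent can’t stand in for a whole category, look at the confidence interval around a proportion estimated from a tiny cell. Here is a minimal sketch in plain Python using the standard Wilson score interval; the function name and the example counts are illustrative, not from the article:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion.

    Unlike the simple normal approximation, the Wilson interval
    stays sensible (just very wide) even when n is tiny.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# One respondent in a cell, answering "yes": the interval spans
# roughly (0.21, 1.0) -- nearly the whole possible range.
lo, hi = wilson_ci(1, 1)

# Sixty respondents in a larger cell, 45 answering "yes": the
# interval tightens to roughly (0.63, 0.84) -- actually informative.
lo2, hi2 = wilson_ci(45, 60)
```

With a cell of one, almost any population value is consistent with the data, which is exactly what “you can’t say anything statistically meaningful” cashes out to.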

Rather than completely discount the categories with very few responses, you decide it’s better to combine them into an amalgamated category so that they can be better represented. When you publish your findings, you frame your results as Heterosexual, Homosexual, and Other: the very thing you were trying to avoid. People are hurt and angry that they aren’t well represented and feel lumped into an ‘other’ category. Respondents who took your survey feel cheated, having been asked detailed questions whose answers you combined anyway.

This kind of ‘collapsing’ or ‘amalgamating’ of data categories happens all the time, and not just with sexual orientation. Almost all demographic questions are susceptible to being limited in the survey or condensed in the analysis: race, ethnicity, gender, language, and so on. Imagine how difficult, and how statistically useless, it would be to list every possible spoken language as an option on a survey. How can we be inclusive without making minority categories so small that only the majority data has statistical relevance?
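The collapsing step itself is mechanically trivial, which is part of why it happens so often. A minimal sketch with made-up tallies and a hypothetical minimum cell size of 5 (neither number comes from the article):

```python
from collections import Counter

# Hypothetical tallies from the survey's orientation question.
counts = Counter({
    "Heterosexual": 62, "Gay": 18, "Lesbian": 11,
    "Bisexual": 6, "Pansexual": 1, "Asexual": 1,
})

MIN_CELL = 5  # hypothetical minimum cell size for reporting

collapsed = Counter()
for category, n in counts.items():
    # Categories below the threshold are lumped into "Other" --
    # exactly the amalgamation the article warns can feel erasing.
    collapsed[category if n >= MIN_CELL else "Other"] += n

print(dict(collapsed))
# {'Heterosexual': 62, 'Gay': 18, 'Lesbian': 11, 'Bisexual': 6, 'Other': 2}
```

One line of code quietly trades away the inclusivity the survey question worked to establish, which is why the threshold and the grouping deserve the same community consultation as the question wording.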

Competing Priorities:

  1. It’s important that the diversity among your respondents is given respect.
  2. It’s important that the results you show be statistically meaningful.

To continue reading this article, click here.
