Minority Voices ‘Filtered’ Out of Google Natural Language Processing Models

Originally published in Unite.AI, Sept 24, 2021.

According to new research, one of the largest Natural Language Processing (NLP) datasets available has been extensively ‘filtered’ in a way that removes Black and Hispanic authors, material related to gay and lesbian identities, and source data that deals with a number of other marginal or minority identities.

The dataset was used to train Google’s Switch Transformer and T5 model, and was curated by Google AI itself.

The report examines the Colossal Clean Crawled Corpus (‘C4’) dataset, a subset of the massive Common Crawl scraped database that contains 156 billion tokens drawn from about 365 million web documents. It asserts that C4 has been extensively (algorithmically) filtered to exclude ‘offensive’ and ‘toxic’ content, and that the filters used to distill C4 have effectively targeted content and discussion from minority groups.

The report states:

‘Our examination of the excluded data suggests that documents associated with Black and Hispanic authors and documents mentioning sexual orientations are significantly more likely to be excluded by C4.EN’s blocklist filtering, and that many excluded documents contained non-offensive or non-sexual content (e.g., legislative discussions of same-sex marriage, scientific and medical content).’
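
To make the mechanism concrete, here is a minimal Python sketch of whole-document blocklist filtering, the general kind of filter the report describes. It is an illustration under stated assumptions, not the pipeline used to build C4: the `BLOCKLIST` contents, the tokenizer, and the function names `is_excluded` and `filter_corpus` are all hypothetical. It shows how a rule that drops any document containing a listed word can also drop a non-offensive document, such as a sentence about same-sex marriage legislation.

```python
import re

# Hypothetical stand-in blocklist for illustration only;
# this is NOT the word list used to build C4.
BLOCKLIST = {"sex", "badword"}


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_excluded(text, blocklist=BLOCKLIST):
    """Return True if any token of the document appears on the blocklist."""
    return any(tok in blocklist for tok in tokens(text))


def filter_corpus(docs, blocklist=BLOCKLIST):
    """Split documents into (kept, excluded) lists.

    A single blocklisted token removes the entire document,
    regardless of the context in which the token appears.
    """
    kept, excluded = [], []
    for doc in docs:
        (excluded if is_excluded(doc, blocklist) else kept).append(doc)
    return kept, excluded


docs = [
    "The legislature debated a bill on same-sex marriage.",
    "A recipe for lentil soup with cumin.",
    "This comment contains badword and nothing else.",
]
kept, excluded = filter_corpus(docs)
```

In this sketch the legislative sentence is excluded alongside the abusive one, because the tokenizer splits ‘same-sex’ into two words and the rule has no notion of context. Only the recipe survives.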



The Machine Learning Times © 2026 • 1221 State Street • Suite 12, 91940 • Santa Barbara, CA 93190
Produced by: Rising Media & Prediction Impact
