{"id":12307,"date":"2021-10-07T08:44:44","date_gmt":"2021-10-07T12:44:44","guid":{"rendered":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/?p=12307"},"modified":"2021-10-07T08:44:44","modified_gmt":"2021-10-07T12:44:44","slug":"minority-voices-filtered-out-of-google-natural-language-processing-models","status":"publish","type":"post","link":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/minority-voices-filtered-out-of-google-natural-language-processing-models\/12307\/","title":{"rendered":"Minority Voices \u2018Filtered\u2019 Out of Google Natural Language Processing Models"},"content":{"rendered":"Originally published in United AI, Sept 24, 2021. According to new research, one of the largest\u00a0Natural Language Processing\u00a0(NLP) datasets available has been extensively \u2018filtered\u2019 to remove black and Hispanic authors, as well as material related to gay and lesbian identities, and source data that deals with a number of other marginal or minority identities. The dataset was used to train Google\u2019s\u00a0Switch Transformer\u00a0and\u00a0T5 model, and was curated by Google AI itself. The report asserts that the\u00a0Colossal Clean Crawled Corpus\u00a0(\u2018C4\u2019) dataset, which contains 156 billion tokens scraped from more than 365 million internet domains, and is a subset of the <a href=\"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/minority-voices-filtered-out-of-google-natural-language-processing-models\/12307\/\" class=\"more-link\">(more&hellip;)<\/a>","protected":false},"excerpt":{"rendered":"<p>Originally published in United AI, Sept 24, 2021. According to new research, one of the largest\u00a0Natural Language Processing\u00a0(NLP) datasets available has been extensively \u2018filtered\u2019 to remove black and Hispanic authors, as well as material related to gay and lesbian identities, and source data that deals with a number of other marginal or minority identities. The [&hellip;]<\/p>\n","protected":false},"author":72,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[11,48],"tags":[879,368,230,243,1170,8],"class_list":["post-12307","post","type-post","status-publish","format-standard","hentry","category-industry-news","category-left-hand","tag-ai","tag-artificial-intelligence","tag-data-analytics","tag-machine-learning","tag-natural-language-processing","tag-predictive-analytics"],"_links":{"self":[{"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/posts\/12307","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/users\/72"}],"replies":[{"embeddable":true,"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/comments?post=12307"}],"version-history":[{"count":1,"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/posts\/12307\/revisions"}],"predecessor-version":[{"id":12308,"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/posts\/12307\/revisions\/12308"}],"wp:attachment":[{"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/media?parent=12307"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/categories?post=12307"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.predictiveanalyticsworld.com\/machinelearningtimes\/wp-json\/wp\/v2\/tags?post=12307"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}