Following a dispute over some emails and a research paper Wednesday, AI ethics pioneer and research scientist Timnit Gebru no longer works at Google. According to a draft copy of the unpublished paper obtained by VentureBeat, the research paper surrounding her exit questions the wisdom of building large language models, who benefits from them, who is impacted by the negative consequences of their deployment, and whether there’s such a thing as a language model that’s too big.
Gebru’s research has been influential on algorithmic fairness, bias, and facial recognition. In an email to Google researchers Thursday, Google AI chief Jeff Dean said he accepted Gebru’s resignation following a disagreement about the paper, but Gebru said she never offered to resign.
“… most language technology is in fact built first and foremost to serve the needs of those who already have the most privilege in society,” the paper reads. “A methodology that relies on datasets too large to document is therefore inherently risky. While documentation allows for potential accountability, similar to how we can hold authors accountable for their produced text, undocumented training data perpetuates harm without recourse. If the training data is considered too large to document, one cannot try to understand its characteristics in order to mitigate some of these documented issues or even unknown ones.”
In the paper, titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?,” the authors say risks associated with deploying large language models range from environmental racism, with AI’s carbon footprint impacting marginalized communities more than others, to the way models absorb a “hegemonic world view from the training data.” There’s also the risk that AI can perpetuate abusive language, hate speech, microaggressions, stereotypes, and other forms of language that can dehumanize some groups of people.
There’s also the concern that the costs associated with training large language models can create a barrier to entry for deep learning research, and that such models increase the chance people will trust their predictions without questioning the results.
Gebru is listed as first author of the paper alongside Google researcher Emily Denton. Other authors include Google AI co-lead Meg Mitchell, and Google researchers Ben Hutchinson, Mark Diaz, and Vinodkumar Prabhakaran, as well as University of Washington PhD student Angelina McMillan-Major.
On Thursday, Denton joined more than 230 Googlers and more than 200 supporters from academia, industry, and civil society in signing a letter with a series of demands, including a transparent evaluation of who was involved in determining that Denton and Gebru should withdraw their research for the general public and Google users.
“This has become a matter of public concern, and there needs to be public accountability to ensure any trust in Google Research going forward,” the letter reads.
Google AI chief Jeff Dean criticized the paper in an email to Google researchers Thursday, saying a review process found that the paper “ignored too much relevant research” about large language models and did not take into account recent research into mitigation of bias in language models.
A trend toward creating language models with more parameters and training data was triggered by a move toward use of the Transformer architecture and massive amounts of training data scraped from the web or from sites like Reddit or Wikipedia.
Google’s BERT and variations like ALBERT and XLNet led the way in that trend alongside models like Nvidia’s Megatron and OpenAI’s GPT-2 and GPT-3. Whereas Google’s BERT had 340 million parameters, Megatron has 8.3 billion parameters, Microsoft’s T-NLG has 17 billion parameters, and GPT-3, introduced in May by OpenAI and the largest language model to date, has 175 billion parameters. With growth in size, large models achieved higher scores in tasks like question answering and reading comprehension.
Numerous studies have found forms of bias in large pretrained language models. This spring, for example, NLP researchers introduced the StereoSet dataset, benchmark, and leaderboard and found that virtually all popular pretrained language models today exhibit bias based on ethnicity, race, and sex.
The coauthors suggest language models be evaluated based on other metrics, like energy efficiency and the estimated CO2 emissions involved in training a model, rather than on benchmarks like GLUE, which measure performance on a series of tasks.
They argue that the trend toward large pretrained language models also has the potential to mislead AI researchers and the general public into mistaking text generated by large language models like OpenAI’s GPT-3 for meaningful language.
“If a large language model, endowed with hundreds of billions of parameters and trained on a very large dataset, can manipulate linguistic form well enough to cheat its way through tests meant to require language understanding, have we learned anything of value about how to build machine language understanding or have we been led down the garden path?” the paper reads. “In summary, we advocate for an approach to research that centers the people who stand to be affected by the resulting technology, with a broad view on the possible ways that technology can affect people.”
The paper recommends solutions like working with impacted communities, value sensitive design, improved data documentation, and adoption of frameworks such as Bender’s data statements for NLP or the datasheets for datasets approach coauthored by Gebru while at Microsoft Research.
In the vein of the report’s conclusions, a McKinsey survey of business leaders conducted earlier this year found that little progress has been made in mitigating 10 major risks associated with deploying AI models.
Criticism of large models trained using massive datasets scraped from the web has been a marked AI research trend in 2020.
University of Washington linguist Emily Bender coauthored an award-winning paper that urges NLP researchers to question the hype surrounding large language models being capable of understanding. In an interview with VentureBeat, she stressed the need for better testing methods and lamented a culture in language model research that overfits models to benchmark tasks, a pursuit she says can stand in the way of “good science.”
In computer vision, an audit released this summer of 80 Million Tiny Images, a large image dataset, revealed the inclusion of racist, sexist, and pornographic content. As a result, instead of taking recommended steps to change the dataset, its creators from MIT and NYU opted to stop using it and delete existing copies.
Last month, researchers analyzed papers published at conferences and found that elite universities and Big Tech companies enjoy a competitive advantage in the age of deep learning, one that has created a compute divide that concentrates power in the hands of a few and accelerates inequality.