The trouble with large language models

Yesterday I summarized the first two articles in a series about algorithms and AI by Hayden Field, a technology journalist at Morning Brew. Today I’ll finish out the series.

The third article, “This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why,” explores what lay behind the volatile situation surrounding the firing of Timnit Gebru (in December 2020) and, later, Margaret Mitchell (in February 2021) from Google’s Ethical AI unit. Both women are highly respected and experienced AI researchers. Mitchell founded the team in 2017.

Central to the situation is a criticism of large language models, laid out in a March 2021 paper (“On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”) co-authored by Gebru, Mitchell, and two researchers at the University of Washington. The biggest current example of such a model is GPT-3, previously covered in several posts here.

“Models this big require an unthinkable amount of data; the entirety of English-language Wikipedia makes up just 0.6% of GPT-3’s training data.”

—”This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why”

The Morning Brew article sums up the very recent and very big improvements in large language models that have come about thanks to new algorithms and faster computer hardware (GPUs running in parallel). It highlights BERT, “the model that now underpins Google Search,” which builds on the Transformer architecture that Google researchers introduced in 2017. A good at-the-time article about GPT-3’s release was published in July 2020 in MIT’s Technology Review: “OpenAI first described GPT-3 in a research paper published in May [2020].”

One point being — Google fired Timnit Gebru very soon after news and discussion of large language models (GPT-3 especially, but remember Google’s investment in BERT too) ramped up — way up. Her criticism of a previously obscure AI technology (not obscure among NLP researchers, but in the wider world) might have been seen as increasingly inconvenient for Google. Morning Brew summarizes the criticism (not attributed to Gebru): “Because large language models often scrape data from most of the internet, racism, sexism, homophobia, and other toxic content inevitably filter in.”

“Once the barrier to create AI tools and generate text is lower, people could just use it to create misinformation at scale, and having that data coupled with certain other platforms can just be a very disastrous situation.”

—Sandhini Agarwal, AI policy researcher, OpenAI

The Morning Brew article goes well beyond Google’s dismissal of Gebru and Mitchell, bringing in a lot of clear, easy-to-understand explanation of what large language models require (for example, significant energy resources), what they’re being used for, and even the English-centric nature of such models — lacking a gigantic corpus of digitized text in a given human language, you can’t create a large model in that language.

The turmoil in Google’s Ethical AI unit is covered in more detail in this May 2021 article, also by Hayden Field.

It’s easy to find articles that discuss “scary things GPT-3 can do and does” and especially the bias issues; it’s much harder to find information about some of the other aspects covered here. It’s also not just about GPT-3. I appreciated insights from an interview with Emily M. Bender, first author on the “Stochastic Parrots” paper. I also liked the explicit statement that many useful NLP tasks can be done well without a large language model. With smaller datasets, finding and accounting for toxic content can be more manageable.

“Do we need this at all? What’s the actual value proposition of the technology? … Who is paying the environmental price for us doing this, and is this fair?”

—Emily M. Bender, professor and director, Professional MS in Computational Linguistics, University of Washington

Finally, in a recap of Morning Brew’s “Demystifying Algorithms” event, editor Dan McCarthy summarized two AI researchers’ answers to one of my favorite questions: What can an algorithm actually know?

An AI system’s ability to generalize — to transfer learning from one domain to another — is still a wide-open frontier, according to Mark Riedl, a computer science professor at Georgia Tech. This is something I remind my students of over and over — what’s called “general intelligence” is still a long way off for artificial intelligence. Riedl works on aspects of storytelling to test whether an AI system is able to “make something new” out of what it has ingested.

Saška Mojsilović, head of Trusted AI Foundations at IBM Research, made a similar point — and also emphasized that “narrow AI” (which is all the AI we’ve ever had, up to now and for the foreseeable future) is not nothing.

She suggested: “We may want to take a pause from obsessing over artificial general intelligence and maybe think about how we create AI solutions for these kinds of problems” — for example, narrow domains such as drug discovery (e.g. new antibiotics) and creation of new molecules. These are extraordinary accomplishments within the capabilities of today’s AI.

This is a half-hour conversation with those two experts:

Thanks to the video, I learned about the Lovelace 2.0 Test, which Riedl proposed in 2014 as an alternative to the Turing Test: instead of judging whether a machine can pass as human in conversation, it challenges the machine to create a novel artifact (a story, for example) that satisfies constraints set by a human evaluator.

Mojsilović talked about the perceptions that arise when we use the word intelligence when talking about machines. “The reality is that many things that we call AI today are the same old models that we used to call data science maybe five or six years ago,” she said (at 21:55). She also talked about the need for collaboration between AI researchers and experts in entirely separate fields: “Because we can’t create solutions for the problems that we don’t understand” (at 29:24).


Identifying toxic comments with AI

The basic idea: Immediately detect and remove hateful or dangerous posts in social media and other online forums. With advances in natural language processing (NLP), identification of harmful speech becomes more accurate and more practical.

In this essay published in Scientific American (2021), researchers from the private company Unitary (see their public Detoxify code on GitHub) discuss the challenges in rating the level of toxicity or harmfulness in text content. One aspect is what is considered harmful: profanity is easy to detect; misinformation is complicated. Another aspect: Terms describing gender, race, or ethnicity can be used hatefully or as (non-toxic) self-description.

(I’ve written before about machine learning used in comment moderation, which is a large concern in media companies that permit users to post comments on articles and blog posts.)

Jigsaw, a Google division, “released two public data sets containing over one million toxic and non-toxic comments from Wikipedia and a service called Civil Comments.” Each comment was labeled with a rating such as “Toxic” or “Very Toxic.” The data sets were used as training data in three competitions hosted on Kaggle (Google’s data science competition platform), in which AI researchers could enter their trained models, see how they compared to others, and win prize money. The three “Jigsaw challenges” (one per year):

Toxic Comment Classification Challenge (2018)
Jigsaw Unintended Bias in Toxicity Classification (2019)
Jigsaw Multilingual Toxic Comment Classification (2020)
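
For a concrete sense of what those labels look like, here is a minimal sketch of loading the Civil Comments data and reading its toxicity scores. It assumes the data set is mirrored on the Hugging Face hub under the ID civil_comments, with text and toxicity fields; those names come from that mirror, not from the article.

    # Rough sketch: peek at a few Civil Comments records and their labels.
    # Assumes: pip install datasets, and that the Jigsaw/Civil Comments data
    # is mirrored on the Hugging Face hub as "civil_comments".
    from datasets import load_dataset

    # The first call downloads the data; "train" is the largest split.
    dataset = load_dataset("civil_comments", split="train")

    # Each record pairs a comment's text with annotator-derived scores;
    # "toxicity" is a value between 0 and 1 (roughly, the fraction of
    # annotators who flagged the comment as toxic).
    for record in dataset.select(range(5)):
        label = "Toxic" if record["toxicity"] >= 0.5 else "Non-toxic"
        print(f'{record["toxicity"]:.2f} ({label}): {record["text"][:80]}')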

“We decided to take inspiration from the best Kaggle solutions and train our own algorithms with the specific intent of releasing them publicly.”

— Unitary researchers

The Unitary researchers describe Detoxify, “an open-source, user-friendly comment detection library,” which is intended “to help researchers and practitioners identify potential toxic comments.” The library includes three separate models, one for each Jigsaw challenge. These models can be fine-tuned using additional data sets.

One particular limitation pointed out by the researchers is that a high toxicity score does not always indicate actually toxic content: “As an example, the sentence ‘I am tired of writing this stupid essay’ will give a toxicity score of 99.7 percent, while removing the word ‘stupid’ will change the score to 0.05 percent.”
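
The library’s Python interface makes that kind of check easy to try. Here is a minimal usage sketch based on Detoxify’s published API (pip install detoxify); exact scores will vary with the model version.

    # Minimal usage sketch of the Detoxify library.
    # Assumes: pip install detoxify (pretrained weights download on first use).
    from detoxify import Detoxify

    # One pretrained model per Jigsaw challenge:
    # "original", "unbiased", or "multilingual"
    model = Detoxify("original")

    comments = [
        "I am tired of writing this stupid essay",
        "I am tired of writing this essay",
    ]

    # predict() accepts a string or a list of strings and returns a dict
    # mapping category names (such as "toxicity") to scores between 0 and 1.
    scores = model.predict(comments)

    for comment, toxicity in zip(comments, scores["toxicity"]):
        print(f"{toxicity:.3f}  {comment}")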

There’s still a long way to go before harmful comments and social media posts can be instantly removed from platforms.

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.
