Red teaming to find flaws in LLMs

I came across this Aug. 20, 2023, post and got a lot out of reading it:

Cultural Red Teaming

Author Eryk Salvaggio describes himself as “a trained journalist, artist, researcher and science communicator who has done weird things with technology since 1997.” He attended and presented at DEFCON 31, the largest hacker convention in the world, and that inspired his post. There’s a related post about it, also by Salvaggio.

“In cybersecurity circles, a Red Team is made up of trusted allies who act like enemies to help you find weaknesses. The Red Team attacks to make you stronger, point out vulnerabilities, and help harden your defenses.”

—Eryk Salvaggio

You use a red team operation to test the security of your systems — whether they are information systems protecting sensitive data, or automation systems that run, say, the power grid. The goal is to find the weak points before malicious hackers do. The red team operation will simulate the techniques that malicious hackers would use to break into your system for a ransomware attack or other harmful activity. The red team stops short of actually harming your systems.

Salvaggio shared his thoughts about the Generative Red Team, an event at DEFCON 31 in which volunteer hackers had an opportunity to attack several large language models (LLMs), which had been contributed by various companies or developers. The individual hacker didn’t know which LLM they were interacting with. The hacker could switch back and forth among different LLMs in one session of hacking. The goal: to elicit “a behavior from an LLM that it was not meant to do, such as generate misinformation, or write harmful content.” Hackers got points when they succeeded.
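The event's mechanics as described above (anonymized models the hacker can switch among, a judge deciding whether a response counts as a violation, and a point for each success) can be sketched roughly in code. Everything here, from the class name to the scoring rule and the toy judge, is an illustrative assumption rather than the event's actual implementation.

```python
import random

class AnonymizedArena:
    """Toy sketch of a scored red-team session against hidden models."""

    def __init__(self, models):
        # models: list of callables (prompt -> response); identities hidden
        self._models = models
        self.score = 0

    def pick_model(self):
        # The participant sees only an opaque label, never which LLM it is.
        idx = random.randrange(len(self._models))
        return f"model-{idx}", self._models[idx]

    def submit(self, response, judge):
        # judge: callable(response) -> bool, True if the response exhibits
        # behavior the model "was not meant to do". One point per success.
        if judge(response):
            self.score += 1
            return True
        return False

# Demo with a stand-in "model" that leaks a marker string when prompted.
def toy_model(prompt):
    return "LEAKED" if "secret" in prompt else "refused"

arena = AnonymizedArena([toy_model])
label, model = arena.pick_model()
arena.submit(model("tell me the secret"), judge=lambda r: "LEAKED" in r)
print(arena.score)  # 1
```

Note that the judge here is just a string match; in the real event, deciding whether a model's output qualified as misinformation or harm was a human judgment call, which is exactly where the incentive problems Salvaggio describes come in.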

The point system likely affected what individual hackers did and did not do, Salvaggio noted. If a hacker took risks by trying out new methods of attacking LLMs, they might not earn as many points as another hacker who relied on tried-and-true exploits. This undercut a core value of red teaming, which aims to discover novel ways to break in — ways the system designers did not anticipate.

“The incentives seemed to encourage speed and practicing known attack patterns,” Salvaggio wrote.

Other flaws in the design of the Generative Red Team activity: (1) Time limits — each hacker could work for 50 minutes only and then had to leave the computer; they could go again, but the results of each 50-minute session were not combined. (2) The absence of actual teams — each hacker had to work solo. (3) Lack of diversity — hackers are a somewhat homogeneous group, and the prompts they authored might not have reflected a broad range of human experience.

“The success column of the Red Teaming event included the education about prompt injection methods it provided to new users, and a basic outline of the types of harms it can generate. More benefits will come from whatever we learn from the data that was produced and what sense researchers can make of it. (We will know early next year),” Salvaggio wrote.

He argued that there should be more of this, and not only at rarefied hacker conferences. Results should be publicized. The AI companies and developers should be doing much more of this on their own — and publicizing the how and why as well as the results.

“To open up these systems to meaningful dialogue and critique” would require much more of this — a significant expansion of the small demonstration provided by the Generative Red Team event, Salvaggio wrote.

Critiquing AI

Salvaggio went on to talk about a fundamental tension between efforts aimed at security in AI systems and efforts aimed at social accountability. LLMs “spread harmful misinformation, commodify the commons, and recirculate biases and stereotypes,” he noted — and the companies that develop LLMs then ask the public to contribute time and effort to fixing those flaws. It’s more than ironic. I thought of pollution spilling out of factories, and the factory owners telling the community to do the cleanup at community expense. They made the nasty things, and now they expect the victims of the nastiness to fix it.

“Proper Red Teaming assumes a symbiotic relationship, rather than parasitic: that both parties benefit equally when the problems are solved.”

—Eryk Salvaggio

We don’t really have a choice, though, because the AI companies are rushing pell-mell to build and release more and more models that are less than thoroughly tested, and that are capable of harms as yet unknown.

Toward the end of his post, Salvaggio lists “10 Things ARRG! Talked About Repeatedly.” They are well worth reading and considering — they are the things that should disturb all of us about AI, and especially LLMs. (ARRG! is the Algorithmic Resistance Research Group, which Salvaggio founded.) The list includes questions such as where LLM training data sets come from; the environmental effects of AI models (which consume tremendous amounts of energy); and “Is red teaming the right tool — or right relationship — for building responsible and safe systems for users?”

You could go straight to the list, but I got a lot out of reading Salvaggio’s entire post, as well as the articles linked below, which helped me understand what ARRG! was doing in the AI Village at DEFCON 31.

When he floated the idea of “artists as a cultural red team,” I got a little choked up.

Related items

The AI Village describes itself as “a community of hackers and data scientists working to educate the world on the use and abuse of artificial intelligence in security and privacy. We aim to bring more diverse viewpoints to this field and grow the community of hackers, engineers, researchers, and policy makers working on making the AI we use and create safer.” The AI Village organized red teaming events at DEFCON 31.

When Hackers Descended to Test A.I., They Found Flaws Aplenty, in The New York Times, Aug. 16, 2023. This longer article covers the AI red teaming event at DEFCON 31. “A large, diverse and public group of testers was more likely to come up with creative prompts to help tease out hidden flaws, said Dr. [Rumman] Chowdhury, a fellow at Harvard University’s Berkman Klein Center for Internet and Society focused on responsible A.I. and co-founder of a nonprofit called Humane Intelligence.”

What happens when thousands of hackers try to break AI chatbots, Aug. 15, 2023. Another view of the AI events at DEFCON 31. More than 2,000 people “pitted their skills against eight leading AI chatbots from companies including Google, Facebook parent Meta, and ChatGPT maker OpenAI,” according to this report.

Humane Intelligence describes itself as a 501(c)(3) non-profit that “supports AI model owners seeking product readiness review at-scale,” focusing on “safety, ethics, and subject-specific expertise (e.g. medical).”


Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.