Symbolic AI: Good old-fashioned AI

The distinction between symbolic (explicit, rule-based) artificial intelligence and subsymbolic (e.g. neural networks that learn) artificial intelligence was somewhat challenging to convey to non–computer science students. At first I wasn’t sure how much we needed to dwell on it, but as the semester went on and we got deeper into the differences among types of neural networks, it was very useful to keep reminding the students that many of the things neural nets are doing today would simply be impossible with symbolic AI.

The difficulty lies in the shallow math/science background of many communications students. They might have studied logic problems/puzzles, but their memory of how those problems work might be very dim. Most of my students have not learned anything about computer programming, so they don’t come to me with an understanding of how instructions are written in a program.

This post by Ben Dickson at his TechTalks blog offers a very nice summary of symbolic AI, which is sometimes referred to as good old-fashioned AI (or GOFAI, pronounced GO-fie). This is the AI from the early years of AI, and early attempts to explore subsymbolic AI were ridiculed by the stalwart champions of the old school.

The requirements of symbolic AI are that someone — or several someones — needs to be able to specify all the rules necessary to solve the problem. This isn’t always possible, and even when it is, the result might be too verbose to be practical. As many people have said, things that are easy for humans are hard for computers — like recognizing an oddly shaped chair as a chair, or distinguishing a large upholstered chair from a small couch. Things we do almost without thinking are very hard to encode into rules a computer can follow.

“Symbolic artificial intelligence is very convenient for settings where the rules are very clear cut, and you can easily obtain input and transform it into symbols.”

—Ben Dickson

Subsymbolic AI does not use symbols, or rules that need symbols. It stems from attempts to write software operations that mimic the human brain. Not copy the way the brain works — we still don’t know enough about how the brain works to do that. Mimic is the word usually used because a subsymbolic AI system is going to take in data and form connections on its own, and that’s what our brains do as we live and grow and have experiences.

Dickson uses an image-recognition example: How would you program specific rules to tell a symbolic system to recognize a cat in a photo? You can’t write rules like “Has four legs,” or “Has pointy ears,” because it’s a photo. Your rules would need to be about pixels and edges and clusters of contrasting shades. Your rules would also need to account for infinite variations in photos of cats.

“You can’t define rules for the messy data that exists in the real world.”

—Ben Dickson

Thus “messy” problems such as image recognition are ideally handled by neural networks — subsymbolic AI.

Problems that can be drawn as a flow chart, with every variable accounted for, are well suited to symbolic AI. But scale is always an issue. Dickson mentions expert systems, a classic application of symbolic AI, and notes that “they require a huge amount of effort by domain experts and software engineers and only work in very narrow use cases.” On top of that, the knowledge base is likely to require continual updating.

An early, much-praised expert system (called MYCIN) was designed to help doctors determine treatment for patients with blood diseases. In spite of years of investment, it remained a research project — an experimental system. It was not sold to hospitals or clinics. It was not used in day-to-day practice by any doctors diagnosing patients in a clinical setting.

“I have never done a calculation of the number of man-years of labor that went into the project, so I can’t tell you for sure how much time was involved … it is such a major chore to build up a real-world expert system.”

—Edward H. Shortliffe, principal developer of the MYCIN expert system (source)

Even though expert systems are impractical for the most part, there are other useful applications for symbolic AI. Dickson mentions “efforts to combine neural networks and symbolic AI” near the end of his post. He points out that symbolic systems are not “opaque” the way neural nets are — you can backtrack through a decision or prediction and see how it was made.
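To make the "backtrack through a decision" point concrete, here is a minimal sketch of a rule-based classifier that records which rules it checked. The rules, the feature names (`seats`, `has_back`), and the labels are all invented for illustration; they do not come from MYCIN or from Dickson's post, and a real expert system would need far more rules than this.

```python
# Toy symbolic (rule-based) classifier with a readable decision trace.
# All rules and feature names below are hypothetical examples.

RULES = [
    # (rule name, condition on the facts, conclusion)
    ("R1: seats more than one person",
     lambda f: f["seats"] > 1, "couch"),
    ("R2: has a back and seats one person",
     lambda f: f["has_back"] and f["seats"] == 1, "chair"),
    ("R3: no back and seats one person",
     lambda f: not f["has_back"] and f["seats"] == 1, "stool"),
]


def classify(facts):
    """Return (label, trace).

    The label is the conclusion of the first rule that matches, or
    "unknown" if none match. The trace lists every rule that was checked,
    so a person can see exactly how the answer was reached.
    """
    trace = []
    for name, condition, conclusion in RULES:
        matched = condition(facts)
        trace.append(f"{name} -> {'matched' if matched else 'no match'}")
        if matched:
            return conclusion, trace
    return "unknown", trace


label, trace = classify({"seats": 1, "has_back": True})
print(label)
for step in trace:
    print(step)
```

The trace is the whole explanation: every step is a rule a human wrote. It also shows the weakness discussed above, since an object that fits none of the rules simply comes back as "unknown."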


Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.


The trouble with large language models

Yesterday I summarized the first two articles in a series about algorithms and AI by Hayden Field, a technology journalist at Morning Brew. Today I’ll finish out the series.

The third article, This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why, explores the basis of the volatile situation around the firing of Timnit Gebru and later Margaret Mitchell from Google’s Ethical AI unit earlier this year. Both women are highly respected and experienced AI researchers. Mitchell founded the team in 2017.

Central to the situation is a criticism of large language models and a March 2021 paper (On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?) co-authored by Gebru, Mitchell, and two researchers at the University of Washington. The biggest current example is GPT-3, previously covered in several posts here.

“Models this big require an unthinkable amount of data; the entirety of English-language Wikipedia makes up just 0.6% of GPT-3’s training data.”

—“This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why”

The Morning Brew article sums up the very recent and very big improvements in large language models that have come about thanks to new algorithms and faster computer hardware (GPUs running in parallel). It highlights BERT, “the model that now underpins Google Search,” which came out of the research that resulted in the first Transformer. A good at-the-time article about GPT-3’s release was published in July 2020 in MIT’s Technology Review: “OpenAI first described GPT-3 in a research paper published in May [2020].”

One point being — Google fired Timnit Gebru very soon after news and discussion of large language models (GPT-3 especially, but remember Google’s investment in BERT too) ramped up — way up. Her criticism of a previously obscure AI technology (not obscure among NLP researchers, but in the wider world) might have been seen as increasingly inconvenient for Google. Morning Brew summarizes the criticism (not attributed to Gebru): “Because large language models often scrape data from most of the internet, racism, sexism, homophobia, and other toxic content inevitably filter in.”

“Once the barrier to create AI tools and generate text is lower, people could just use it to create misinformation at scale, and having that data coupled with certain other platforms can just be a very disastrous situation.”

—Sandhini Agarwal, AI policy researcher, OpenAI

The Morning Brew article goes well beyond Google’s dismissal of Gebru and Mitchell, bringing in a lot of clear, easy-to-understand explanation of what large language models require (for example, significant energy resources), what they’re being used for, and even the English-centric nature of such models — lacking a gigantic corpus of digitized text in a given human language, you can’t create a large model in that language.

The turmoil in Google’s Ethical AI unit is covered in more detail in this May 2021 article, also by Hayden Field.

It’s easy to find articles that discuss “scary things GPT-3 can do and does” and especially the bias issues; it’s much harder to find information about some of the other aspects covered here. It’s also not just about GPT-3. I appreciated insights from an interview with Emily M. Bender, first author on the “Stochastic Parrots” article. I also liked the explicit statement that many useful NLP tasks can be done well without a large language model. In smaller datasets, finding and accounting for toxic content can be more manageable.

“Do we need this at all? What’s the actual value proposition of the technology? … Who is paying the environmental price for us doing this, and is this fair?”

—Emily M. Bender, professor and director, Professional MS in Computational Linguistics, University of Washington

Finally, in a recap of Morning Brew’s “Demystifying Algorithms” event, editor Dan McCarthy summarized two AI researchers’ answers to one of my favorite questions: What can an algorithm actually know?

An AI system’s ability to generalize — to transfer learning from one domain to another — is still a wide-open frontier, according to Mark Riedl, a computer science professor at Georgia Tech. This is something I remind my students of over and over — what’s called “general intelligence” is still a long way off for artificial intelligence. Riedl works on aspects of storytelling to test whether an AI system is able to “make something new” out of what it has ingested.

Saška Mojsilović, head of Trusted AI Foundations at IBM Research, made a similar point — and also emphasized that “narrow AI” (which is all the AI we’ve ever had, up to now and for the foreseeable future) is not nothing.

She suggested: “We may want to take a pause from obsessing over artificial general intelligence and maybe think about how we create AI solutions for these kinds of problems” — for example, narrow domains such as drug discovery (e.g. new antibiotics) and creation of new molecules. These are extraordinary accomplishments within the capabilities of today’s AI.

This is a half-hour conversation with those two experts:

Thanks to the video, I learned about the Lovelace 2.0 Test, which Riedl developed in 2014. It’s an alternative to the Turing Test.

Mojsilović talked about the perceptions that arise when we use the word intelligence when talking about machines. “The reality is that many things that we call AI today are the same old models that we used to call data science maybe five or six years ago,” she said (at 21:55). She also talked about the need for collaboration between AI researchers and experts in entirely separate fields: “Because we can’t create solutions for the problems that we don’t understand” (at 29:24).


Summary of the challenges facing algorithms, AI

Hayden Field, a technology journalist at Morning Brew, published a series of articles about algorithms and AI earlier this year, and they’ve been on my TBR list.

First up was Nine Experts on the Single Biggest Obstacle Facing AI and Algorithms in the Next Five Years. Experts: Drago Anguelov (Waymo); Kathy Baxter (Salesforce); David Cox (IBM Watson); Natasha Crampton (Microsoft); Mark Diaz (Ethical AI at Google); Charles Isbell (professor and dean, College of Computing, Georgia Institute of Technology); Peter Lofgren (Stripe); Andrew Ng (co-founder and former head, Google Brain); Cathy O’Neil (author, Weapons of Math Destruction).

Predictably, ethics was noted as a big challenge — O’Neil asked what we will do about unfairness in decisions made by algorithms. Diaz pointed to the need for involving “experts from a wide range of disciplines, including non-technical disciplines,” in the development process, long before an end product emerges. This intersects with ethics and fairness, as the absence of experts and stakeholders opens the door wide to omissions and errors. Baxter was explicit about systemic racism that is embedded in both training data and models. She listed “medical care decisions, hiring recommendations, access to housing and social programs, visa application approvals, school exam results, hate speech detection, dynamic pricing algorithms for ride hailing services, and even dating apps” — as well as face recognition and predictive policing.

“In essence, problems that are not purely technical require solutions that are not purely technical.”

—Mark Diaz, Ethical AI at Google

Isbell spoke of systematic solutions that can be widely applied. “We cannot treat minority groups as exceptions and edge cases,” he said. Cox highlighted transparency and explainability, as well as ethics and bias. He also alluded to adversarial attacks as well as the non-adversarial errors that surprise researchers (possibly due to overfitting). He grouped all this under trust. Crampton also focused on fairness and referred to diversity in teams, similar to Diaz’s and Isbell’s concerns.

Anguelov explained the need for reliable simulations so that systems can scale up to real-world use. He’s talking about the Long Tail problem: the real world throws up too many unexpected situations. Simulations allow testing in ways that don’t risk human lives (think self-driving cars). Lofgren also talked about scale, but in terms of personalization — his example is detecting credit card fraud in real time based on Big Data that detects abusing IP addresses and then drills down to the individual cards being used. Ng talked about the difficulty in making dependable commercial AI products — basically off-the-shelf solutions.

“We will often need to make hard decisions based on competing priorities, including decisions to not build or deploy a system for certain purposes.”

—Natasha Crampton, Microsoft

Second in the series is titled Amex’s Fraud Detection AI Was Ready to Go Live. Then Covid Hit. This article starts with the idea that large AI models in the field will still need adjustments as unforeseen problems crop up. This echoes the concerns about scale raised by Anguelov and Lofgren in the first article in the series.

The challenge thrown by COVID-19 was that all existing models had been developed and adjusted in a non-pandemic world. Then the world changed.

Amex’s fraud-detecting systems are a blend of old-school rule-based systems and newer machine learning techniques. A team of about 30 decision scientists monitors the system round-the-clock and updates it when necessary, at least once a year. The pandemic came at a bad time for Amex, just as they were rolling out a new model.

“Since each generation of a gradient-boosting ML model is typically developed on data from earlier that same year, many of the model’s assumptions no longer made sense” in 2020.

—“Amex’s Fraud Detection AI Was Ready to Go Live. Then Covid Hit”

This is a really interesting article — although I’d read others about issues caused for AI models by pandemic changes, most of those had to do with either healthcare or travel.

Because of increased online traffic in 2020 — more people online, every day, as the pandemic drove work-from-home and stay-at-home schooling — demands on Amazon Web Services (providing servers and processing power to millions of commercial clients such as Amex) grew enormously. This “dwindling cloud capacity” meant testing new solutions for Amex’s model took much longer than usual. The team had to run new simulations that took our new way of life into account, and those simulations required lots of processor juice.

In the end, Amex’s rollout was successful — but it came months later than originally planned. This was a really neat case study and could be discussed in a lot of different contexts.

I’m going to look at the other articles in the series in tomorrow’s post.


Attention, in machine learning and NLP

Let’s begin at the beginning, with Attention Is All You Need (Vaswani et al., 2017). This is a conference paper with eight authors, six of whom then worked at Google. They contended that neither recurrent neural networks nor convolutional neural networks are necessary for machine translation of languages, and hence the Transformer, “a new simple network architecture,” was born. (Note: It relies on feed-forward neural networks.)

Transformers are the basis for machine translation and other tasks relying on language models. GPT-3 has recently become infamous; another is BERT (from Google). ELMo is often mentioned alongside them, although it is built on LSTMs rather than on the Transformer.

Before the work by Vaswani and his co-equal co-authors, progress in NLP was limited (although it had advanced a lot since 2012) because of the ways in which RNN models depend on the sequence and position of words in a text. Transformers eliminate those limitations. With recurrent neural networks, there are impediments to parallel processing. Other researchers had previously cracked that nut using ConvNets, but then other limitations were inherent (the number of computational operations needed to relate two positions grows with the distance between them). Transformers also eliminate those limitations.

So the Transformer was a first in NLP, a breakthrough. For machine translation, the paper claimed “a new state of the art” (p. 10).

I had learned that an encoder and a decoder connected by an attention module is a standard architecture for machine language translation, e.g. Google Translate. This was true before 2017, so what is the difference effected by the Transformer? It eliminates RNNs and ConvNets from the architecture, yes (“our model contains no recurrence and no convolution”) — but what else?

Attention used in a new way

“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key” (Vaswani et al., 2017, p. 3). I’m okay with that, although I doubt I would be able to explain it to my non–computer science students. (I do explain weights and features when I introduce neural nets to them, and I explain word vectors when we start NLP. The trouble is they don’t know how to write a program, and they certainly don’t understand what a function is.)

There are different attention functions that could be used. One is additive attention; another is dot-product attention, which is multiplicative rather than additive. Dot-product is “much faster and more space-efficient in practice.” Vaswani et al. used a scaled dot-product attention function (p. 4). They also used multi-head attention, meaning the model uses eight parallel attention layers, or heads. The explanation was a bit beyond me, but the gist is that the model can look at multiple things at the same time, like juggling more balls simultaneously.
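The quoted description of an attention function can be written out in a few lines. What follows is a minimal sketch of scaled dot-product attention for a single query, in plain Python with no libraries. The paper works with whole matrices of queries, keys, and values at once, and the tiny two-dimensional vectors here are made up for illustration, so treat this as a sketch of the idea and not as the authors' implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    The compatibility of the query with each key is their dot product,
    divided by sqrt(d_k). Softmax turns those scores into weights, and
    the output is the weighted sum of the value vectors.
    """
    d_k = len(query)
    scores = [dot(query, key) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return output, weights


# Hypothetical toy data: three key-value pairs and one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

output, weights = attention(query, keys, values)
print(weights)
print(output)
```

The query points in the same direction as the first key, so the first value gets more weight than the second. Multi-head attention would run several copies of this in parallel, each on its own learned projection of the queries, keys, and values, which this sketch leaves out.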

Multi-head attention — plus the freedom of no-sequence, no-position — enables the Transformer to look at all the context for a word, and do it for multiple words at the same time.

With my rudimentary understanding of recurrent neural nets, I have a fuzzy idea of how this use of attention functions produces better results, mainly by being able to take in and compare more of the text, a little closer to the way human brains hold an entire conversation even though it’s not a literal “recording” of the exact conversation. The way we comprehend meaning when we read has to do with millions of associations built up over a lifetime, as well as many associations within that present text. We are not processing separate little slices of a sentence — our brains handle a text more holistically.

A Transformer does use word embeddings to convert the tokens (both input and output) to vectors (Vaswani et al., 2017, p. 5). It uses softmax but no LSTMs (because, again, “no recurrence”).

Please help me, YouTube

I found a video (13:04) that helped me in my struggle to understand the Transformer architecture:

It was still a tough climb for me, but this video was particularly helpful with how multi-head attention improves the process. (Obviously the speed improvement is huge.)

Another helpful video (5:33) does a nice job summing up the sequence-based limitations of RNNs: “In general it’s easier for [RNNs] to capture relationships between points that are close to each other than it is to capture relationships between points that are very far from each other — say, several thousand points in the sequence.” In the paper, this is called “path length between long-range dependencies in the network” (Vaswani et al., 2017, p. 6) and identified as one of three motivations for developing the self-attention layers in Transformer.

In fact this second video is much better than the one above, but I liked that one when I watched it first, and maybe (haha!!) the order in which I watched them had an effect. The diagrams for self-attention in this shorter video are very good!

Back to Vaswani et al.

Speaking of self-attention — it was interesting that the authors thought it “could yield more interpretable models.” As in any hidden layer in any neural network, features are determined and weights set by the system itself, not by the human programmers. This is the “learning” in machine learning. The authors noted that the “individual attention heads clearly learn to perform different tasks,” and that many of them “appear to exhibit behavior related to the syntactic and semantic structure of the sentences” (p. 7; my italics).

Cool.

The results section of the paper describes performance using BLEU scores on two different NLP tasks (WMT 2014 English-to-German translation; WMT 2014 English-to-French translation) — reported as best-ever at that time — as well as record-breaking lower training costs, which means time to train the model factored by processor power used (number of GPUs, estimate of the number of floating-point operations).

The successor to the code on which this seminal paper was based is Trax, available on GitHub.

At the end of the paper (pages 13–15) there are math-free visualizations that illustrate what the attention mechanism does. These are well worth a look.


How to educate the public about AI

Two new items related to educating the general public about artificial intelligence:

The A–Z guide comes from the Oxford Internet Institute and Google. It’s slick, pretty, and animated. It consists of exactly 26 short items, one for each letter of the alphabet: artificial intelligence, bias, climate, datasets, ethics, fakes, etc. The aim is to provide answers in a not-overwhelming way.

I love the idea, but I’m not in love with the execution. For example, the neural networks piece tells us that neural nets “attempt to mimic the structure of the brain,” but they “cannot ‘think’ like humans.” That’s great — clear and accurate. We could quibble about “attempt to mimic the structure,” but we can also let that slide. But then:

“AI design teams can assign each piece of a network to recognizing one of many characteristics. The sections of the network then work as one to build an understanding of the relationships and correlations between those elements — working out how they typically fit together and influence each other.”

To me, that seems misleading. It sounds as if the layers of the neural net are directed by specifically programmed instructions, but all my reading has indicated that the layers determine on their own which features they are detecting. (I’m thinking specifically about image recognition and supervised learning here.) This is important because it contributes to the “black box” problem of machine learning systems.

I also dislike phrases such as “build an understanding,” because that implies more intentionality than these networks actually have.

Giving people short, understandable explanations of specific aspects of AI is a wonderful idea, but the explanations need to be both straightforward and true.

The second education item I linked above comes from MIT’s news office. It describes a “new cross-disciplinary research initiative … to promote the understanding and use of AI across all segments of society.”

“People need to be AI-literate to understand the responsible use of AI and create things with it at individual, community, and societal levels.”

—Cynthia Breazeal, MIT professor, director of Responsible AI for Social Empowerment and Education (RAISE)

This sentiment is becoming more widely voiced as claims for the benefits of AI increase in the media. The idea behind RAISE is good and admirable — yes, people in all walks of life should have some understanding of AI, at least as much as they have an understanding of what makes airplanes fly and what makes computers able to store and retrieve our vacation photos.

Oh, wait.

In the United States, the average person’s understanding of any process involving physics or electronics might not be very good. Many students with stellar high-school grades don’t have a solid grasp of how their laptops or phones work at a basic level. I’m not talking about the students who attend MIT, but I am talking about those who can manage high SAT scores and gain admission to top public universities.

The RAISE initiative has identified four strategic areas for research, education, and outreach:

  • Diversity and inclusion in AI
  • AI literacy in pre-K–12 education
  • AI workforce training
  • AI-supported learning

But let’s go back to the A–Z guide and look at the segment about binary code, Zeros & Ones. It tells us that 0’s and 1’s are “the foundational language of computers.” It tells us that a particular long sequence of 0’s and 1’s means “Hello” to a computer. In one sense, that is true — but it really explains nothing to a layman. A computer system doesn’t know what “Hello” is (or means) any more than a rock does.
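A short sketch shows what "Hello" as 0's and 1's amounts to. It assumes ASCII encoding, which is my assumption; other encodings could produce different bits.

```python
# "Hello" as bytes and bits, assuming ASCII encoding (an assumption here).
text = "Hello"
data = text.encode("ascii")

# Each character becomes one number between 0 and 127 ...
print(list(data))

# ... and each number can be written as eight binary digits.
bits = " ".join(format(byte, "08b") for byte in data)
print(bits)

# The very same bytes can just as well be read as one large integer.
# Nothing in the bits themselves says which reading is the "right" one.
as_number = int.from_bytes(data, "big")
print(as_number)
```

The bits are "Hello" only because people agreed on a table that maps numbers to letters. The computer stores and copies the pattern; the meaning stays with us.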

To accomplish AI literacy, we need to accomplish computer literacy. We need to teach and explain — clearly and accurately — to students at all levels what computers can and cannot do, how they are programmed, and how AI is different from, say, writing a game program that plays tic-tac-toe as well as any human can. I can write and run a winning tic-tac-toe program on an average laptop if I know which algorithms to use in my code — but there’s nothing remotely like intelligence in that program.
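For anyone curious what such a program looks like, here is a minimal sketch of an unbeatable tic-tac-toe player using minimax, one classic algorithm for this job (there are others). The board is a 9-character string, and the function and variable names are my own choices for illustration.

```python
from functools import lru_cache

# The eight ways to get three in a row on a board indexed 0-8:
#   0 1 2
#   3 4 5
#   6 7 8
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, move) for `player`, who is about to move.

    The score is from X's point of view: +1 if X can force a win,
    -1 if O can force a win, 0 if best play ends in a draw.
    The move is a board index, or None if the game is already over.
    """
    won = winner(board)
    if won == "X":
        return 1, None
    if won == "O":
        return -1, None
    if " " not in board:
        return 0, None

    other = "O" if player == "X" else "X"
    best_score, best_move = None, None
    for i, cell in enumerate(board):
        if cell != " ":
            continue
        child = board[:i] + player + board[i + 1:]
        score, _ = minimax(child, other)
        better = (best_score is None
                  or (player == "X" and score > best_score)
                  or (player == "O" and score < best_score))
        if better:
            best_score, best_move = score, i
    return best_score, best_move


# X has two in the top row; the program finds the winning square.
print(minimax("XX OO    ", "X"))
```

The program never loses, yet all it does is try every possible continuation and pick the best outcome. It is brute-force bookkeeping, with no learning and no understanding involved.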

We need to add caveats every time we say something like “the computer learns,” or “the system understands.”

It will be fantastic if RAISE (and other outreach programs) can raise the level of computer literacy among Americans. It’s an important goal in this era of AI hype and euphoric claims, because it will be so much easier for people to be duped, exploited, mistreated, sidelined, marginalized, and/or denied jobs, loans, mortgages, healthcare, or admission to universities if they don’t understand what AI is and how it works.


Multiple facets of ethics in AI

The Center for Responsible AI at New York University has published a free online course titled “AI Ethics: Global Perspectives.”

The course consists of a series of videos produced by many different people in countries around the world. The instructors include computer science and engineering professors as well as researchers in various fields, including government, health care, and the humanities. These are the lectures I intend to watch:

Lectures still to come:

  • Renee Cummings, a U.S. criminologist and consultant, will discuss “Bias in Data and AI: Myth, Mistrust, and Myopia.”
  • Susan Scott-Parker will discuss “AI Powered Disability Discrimination: How Do You Lip Read a Robot Recruiter?”


Figuring It Out: Transformers for NLP

It was a challenge for me to figure out how to teach non–computer science students about word vectors. I wanted them to have a clear idea of how words and their meanings are represented for use in an AI system — otherwise, I worried they would assume something like a written dictionary with text and definitions. I also wanted them to know that it wasn’t something simple like “each word has a numerical code assigned to it.” So we spent some time talking about what a vector is and what “n-dimensional space” means.

Slide above by Mindy McAdams (copyright © 2021)
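One way to show that a word vector is more than a numerical code is to compare vectors by direction. The sketch below uses made-up three-dimensional vectors; real embeddings have hundreds of dimensions, and their values are learned from text instead of typed in by hand. Cosine similarity is a common way to compare them, though not the only one.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
VECTORS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors.

    Values near 1 mean the vectors point in nearly the same direction;
    values near 0 mean they have little in common.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(VECTORS["cat"], VECTORS["dog"]))
print(cosine_similarity(VECTORS["cat"], VECTORS["car"]))
```

With these toy numbers, "cat" sits much closer to "dog" than to "car." A simple numerical code per word could not express that kind of nearness.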

Now I need to work out how to teach them about transformers. I found a surprisingly clear article at Orange.com (formerly France Télécom), on their Hello Future website about research and innovation. I’m going to quote a large section from that article:

“Originally, in 2013, word embeddings (such as Word2Vec, Glove, or Fasttext) were able to capture representations of words in the form of vectors taking into account the context of neighboring words in large volumes of text. Two words appearing in similar contexts were ‘embedded’ into N-dimensional space, to neighboring points in this space. This approach has led to significant advances in the field of NLP, but also has its limitations. From 2018 a new way of generating these word vectors emerged. Rather than selecting the vector of a word in a previously learnt static ‘dictionary,’ a model is responsible for dynamically generating the vector representation of a word. A word is thus projected to a vector not only according to its prior meaning, but also according to the context in which it appears. The models for effective realization of these contextual projections (BERT, ELMO and derivatives, GPT and its successors) are based on a simple yet powerful architecture called Transformer.” (Spelling and punctuation edited for American English.)

I know that paragraph might not make sense if you haven’t already learned about word vectors. The key is that transformers are able to build on and enhance the machine accuracy of what a word or sentence means by taking into account its context in the current data. So you do have a language model, previously trained on a large corpus, but the transformer analyzes the present text input in a more holistic way, transforming the vectors as it goes.

Again quoting from the Orange.com article: “While previous approaches … could model contextual dependencies, they were always constrained by referencing words by their positions [in the sentence]. Attention is about referencing by content. Instead of looking for relationships with other words in the context at given positions, attention allows you to search for relationships with all words in the context, and through a very effective implementation, it allows you to rely on the most similar words to improve prediction, whatever their position in context.”

The role of the attention module is explained in a 2017 paper that, according to Google Scholar, has been cited more than 20,000 times: Attention Is All You Need. See the PDF for diagrams of the Transformer network architecture.

Language models produced by transformers include BERT (developed by Google, and which powers Google searches) and GPT-3. ELMo is often grouped with them, although it is built on LSTMs rather than on a transformer. These so-called large language models have raised many concerns, particularly around ethics, as their interior processes are a black box, and their immense training data has included biased and toxic texts. The Orange.com article includes two charts that illustrate differences among BERT, ELMo, and three generations of GPT.

An important aspect of transformers is that they produce these large language models from unlabeled data, and when developing applications based on transformers and such models, good results can be obtained with only a small amount of additional training data (“few-shot learning”).

Orange — like many other companies — is using large language models for classification and information-extraction tasks such as: “sentiment analysis, personal data detection, detection and identification of named entities, syntactic dependency analysis, semantic parsing, co-reference resolution,” and question answering. These tasks involve customer-service applications as well as internal data analysis.

Much of this post is based on the article The GPT-3 language model, revolution or evolution? (February 2021).

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.

Identifying toxic comments with AI

The basic idea: Immediately detect and remove hateful or dangerous posts in social media and other online forums. With advances in natural language processing (NLP), identification of harmful speech becomes more accurate and more practical.

In this essay published in Scientific American (2021), researchers from the private company Unitary (see their public Detoxify code on GitHub) discuss the challenges in rating the level of toxicity or harmfulness in text content. One aspect is what is considered harmful: profanity is easy to detect; misinformation is complicated. Another aspect: Terms describing gender, race, or ethnicity can be used hatefully or as (non-toxic) self-description.

(I’ve written before about machine learning used in comment moderation, which is a large concern in media companies that permit users to post comments on articles and blog posts.)

Jigsaw, a Google division, “released two public data sets containing over one million toxic and non-toxic comments from Wikipedia and a service called Civil Comments.” Each comment was labeled with a rating such as “Toxic” or “Very Toxic.” The data sets were used as training data in three competitions, hosted on Kaggle (which is owned by Google), in which AI researchers could enter their trained models and see how they compared to others (and win money). The three “Jigsaw challenges” (one per year):

.

“We decided to take inspiration from the best Kaggle solutions and train our own algorithms with the specific intent of releasing them publicly.”

— Unitary researchers

The Unitary researchers describe Detoxify, “an open-source, user-friendly comment detection library,” which is intended “to help researchers and practitioners identify potential toxic comments.” The library includes three separate models, one for each Jigsaw challenge. These models can be fine-tuned using additional data sets.

One particular limitation pointed out by the researchers is that a high toxicity score does not always indicate actually toxic content: “As an example, the sentence ‘I am tired of writing this stupid essay’ will give a toxicity score of 99.7 percent, while removing the word ‘stupid’ will change the score to 0.05 percent.”
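To see how a single word can swing a score that far, here is a deliberately naive sketch. It is NOT how Detoxify works (its models are far more sophisticated); it is a bag-of-words scorer with weights I made up, pushed through a logistic function. The names WORD_WEIGHTS, BIAS, and toxicity_score are all hypothetical.

```python
import math

# Hypothetical per-word weights; a trained model would learn these from labeled comments.
WORD_WEIGHTS = {"stupid": 6.0, "idiot": 7.0, "tired": 0.5}
BIAS = -5.0  # hypothetical baseline: most comments are assumed not toxic

def toxicity_score(text):
    """Return a score between 0 and 1 from a logistic function over word weights."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    total = BIAS + sum(WORD_WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-total))

with_word = toxicity_score("I am tired of writing this stupid essay")
without_word = toxicity_score("I am tired of writing this essay")

print(f"with 'stupid':    {with_word:.3f}")
print(f"without 'stupid': {without_word:.3f}")
```

Neither sentence attacks anyone, yet the first lands well above 0.5 and the second near zero, purely because of one heavily weighted word. A model that has mostly seen “stupid” in insults can pick up a similar shortcut.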

There’s still a long way to go before harmful comments and social media posts can be instantly removed from platforms.

Pastries, cancer cells, and neural networks

The system described in this wonderful New Yorker article from March 2021 is NOT a neural network, and that’s one of the things that make it fascinating. I’ve written before about ImageNet and how neural networks, trained on humongous datasets of labeled digital images, are able to very accurately say what is in a photograph that the system has never “seen” before.

This is different.

This system, developed by a small company in Japan, does not require hundreds or thousands of images of each object it needs to identify, precisely because it doesn’t use a neural network. The technologies it uses can be called good old-fashioned AI (GOFAI). Essentially it consists of a collection of manually constructed algorithms.

Above: BakeryScan at work: Screen capture from video (2017)

The system also “learns,” but not in the typical black-box sense of today’s machine learning systems. It is widely used in the checkout systems of Japanese bakeries, which offer a bewilderingly large assortment of pastries and small bread items, many of which look quite similar to one another. BakeryScan was released in 2013; it was 15 years in development.

More recently, the bakery system has been adapted to recognize specific types of cancer cells. The new system is able to “look at an entire microscope slide and identify the cells that might be cancerous” (source: The New Yorker article).

Rather than summarizing the article further, I’m just going to urge you to read it. It’s very much worth your time.

‘Ground truth’ and labeled data

Cassie Kozyrkov, who wrote this article, is head of decision intelligence at Google. It starts out with what looks like a standard explanation of an image-recognition system, which she deprecatingly refers to as the “cat/not-cat task.” But don’t be fooled: Kozyrkov communicates with clear, sharp precision, and very quickly she asks us to consider circumstances in which we would want a tiger to be considered a cat and those in which we would want it to be not-cat.

This leads to a discussion of ground truth. This is “an ideal expected result” — but for whom? Well, for the people who originally built the system. Kozyrkov notes that ground truth is NOT an objective, perfect truth like something studied in a philosophy class (Truth with a capital T). It’s whether a tiger is a cat in your reality or not-cat in mine.

I am reminded of one of my favorite lines in the rock opera Jesus Christ Superstar: “But what is truth? Is truth unchanging law? We both have truths. Are mine the same as yours?”

“When such a dataset is used to train ML/AI systems, systems based on it will inherit and amplify the implicit values of the people who decided what the ideal system behavior looked like to them.”

— Cassie Kozyrkov

It also brings to mind the practice of testing for intercoder reliability — standard practice in research that relies on qualitative data. (More about that here.)
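For readers who haven’t run an intercoder reliability test, here is a small sketch of one common statistic, Cohen’s kappa, for two coders. Kozyrkov’s article doesn’t mention kappa; I’m using it only as a familiar example, and the two lists of labels are invented. Kappa compares how often the coders actually agree with how often they would agree by chance, given how frequently each one uses each label.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two coders, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(labels_a)
    # Share of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from how often each coder used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented labels from two coders looking at the same eight images.
coder_1 = ["cat", "cat", "not-cat", "cat", "not-cat", "not-cat", "cat", "not-cat"]
coder_2 = ["cat", "not-cat", "not-cat", "cat", "not-cat", "cat", "cat", "not-cat"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

Here the coders agree on six of eight images (75 percent), but because chance alone would produce 50 percent agreement, kappa works out to only 0.5. Two people looking at the same pictures do not automatically share a ground truth.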

Say you are using an existing labeled dataset — not one you yourself have created — which is often the case. The labels attached to the data items are the ground truth for that dataset. If it’s a dataset of images, and some labels applied to photos of people are racist, then that’s the ground truth in that dataset. If it’s a dataset for sentiment analysis, and a lot of toxic comments are labeled “not toxic,” then that’s the ground truth you’re adopting.
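A tiny sketch makes the point: the same model output gets a different score depending on whose labels you treat as ground truth. The file names, the predictions, and the two label sets below are all invented for illustration.

```python
def accuracy(predictions, ground_truth):
    """Share of items where the prediction matches the dataset's label."""
    matches = sum(predictions[item] == label for item, label in ground_truth.items())
    return matches / len(ground_truth)

# Invented output from some image classifier.
predictions = {
    "tabby.jpg": "cat",
    "tiger.jpg": "cat",
    "lion.jpg": "cat",
    "dog.jpg": "not-cat",
}

# Two invented label sets for the same four images.
labels_zoologist = {
    "tabby.jpg": "cat", "tiger.jpg": "cat", "lion.jpg": "cat", "dog.jpg": "not-cat",
}
labels_pet_store = {
    "tabby.jpg": "cat", "tiger.jpg": "not-cat", "lion.jpg": "not-cat", "dog.jpg": "not-cat",
}

score_zoologist = accuracy(predictions, labels_zoologist)
score_pet_store = accuracy(predictions, labels_pet_store)
print(score_zoologist, score_pet_store)
```

Nothing about the model changed between the two scores. Only the labels did, and an accuracy number can’t tell you whether the labels themselves deserve your trust.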

It’s essential for developers to test systems extensively to uncover these flaws in ground truth.

“You wouldn’t want to fall victim to a myopic fraud detection system with sloppy definitions of what financial fraud looks like, especially if such a system is allowed to falsely accuse people without giving them an easy way to prove their innocence.”

— Cassie Kozyrkov

In a video embedded in the same article, Kozyrkov pithily proclaims: “There are only actually two real lines there. Here’s what they are: This objective. That data set.” (At 9:16.) Of course there’s a ton more code than that (she’s talking about the programming of the system that creates the model), but in terms of what you want the system to be able to do, that’s it in a nutshell: How have you framed your objective? And what’s in your dataset? More important, in many cases, is what’s NOT in your dataset.

She says this is where the core danger in AI lies, because in traditional programming “it might take 10,000 lines of code, a hundred thousand lines of code maybe, and some human being has to worry about every single one of those lines, agonize over it.” With supervised machine learning, you’ve only got the objective and the (gigantic) dataset, and the question is, Have enough people with expertise really agonized over each of those things?
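Here is a toy version of “this objective, that data set.” It is my own illustration, not code from the video: the “model” is a single number chosen to fit an invented dataset, and the only thing that changes between the two runs is the objective (squared error versus absolute error).

```python
def fit_constant(dataset, objective, candidates):
    """Pick the candidate number with the lowest total loss on the dataset."""
    return min(candidates, key=lambda guess: sum(objective(guess, y) for y in dataset))

def squared_error(guess, actual):
    return (guess - actual) ** 2

def absolute_error(guess, actual):
    return abs(guess - actual)

dataset = [1, 2, 3, 4, 40]   # invented values, with one outlier
candidates = range(0, 41)    # whole numbers the "model" is allowed to choose

model_squared = fit_constant(dataset, squared_error, candidates)
model_absolute = fit_constant(dataset, absolute_error, candidates)
print(model_squared, model_absolute)
```

Same data, same few lines of code, and the two objectives still disagree: squared error settles on the mean (10), pulled upward by the outlier, while absolute error settles on the median (3). Someone has to decide which answer is the “right” one, and that decision is where the human values get in.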

My other favorite bits from the video:

  • “A system that is built and designed for one purpose may not work for a different purpose.” (6:17)
  • “Remember that the objective is subjective.” (6:31)
  • “And if you take those two parts really seriously, that is how you are going to build a safe and effective and kind AI system.” (20:16)
