Encoding language for a machine learning system

The vocabulary of medicine is different from the vocabulary of physics. If you’re building a vocabulary for use in machine learning, you need to start with a corpus — a collection of text — that suits your project. A general-purpose vocabulary in English might be derived from, say, 6 million articles from Google News. From this, you could build a vocabulary of, say, the 1 million most common words.

Although I surely do not understand all the math, last week I read “Efficient Estimation of Word Representations in Vector Space,” a 2013 research article written by four Google engineers. They described their work on a then-new, more efficient way of accurately predicting word meanings — the outcome being word2vec, a tool to produce a set of word vectors.

After publishing a related post last week, I knew I still didn’t have a clear picture in my mind of where the word vectors fit into various uses of machine learning. And how do the word vectors get made, anyhow? While word2vec is not the only system you can use to get word vectors, it is well known and widely used. (Other systems: fastText, GloVe.)

How the vocabulary is created

First, the corpus: You might choose a corpus that suits your project (such as a collection of medical texts, or a set of research papers about physics) and feed it into word2vec (or one of the other systems). The corpus should be a very large collection. At the end you will have a file — a dataset.

Alternatively, you might use a dataset that already exists — such as 3 million words and phrases, each represented by a 300-dimensional vector, trained on a Google News dataset of about 100 billion words (linked on the word2vec homepage): GoogleNews-vectors-negative300. This is a file you can download and use with a neural network or other programs or code libraries. The size of the file is 1.5 gigabytes.
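Here is a minimal sketch of what using that file looks like in Python, assuming you have downloaded it and installed the gensim library (my choice for illustration; any library that reads the word2vec format would do):

```python
# A minimal sketch, assuming the gensim library and the downloaded
# GoogleNews-vectors-negative300.bin.gz file in the working directory.
from gensim.models import KeyedVectors

# Loading the pretrained vectors takes a while and needs several
# gigabytes of RAM; the file holds 3 million words and phrases.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

print(vectors["medicine"][:10])  # the first 10 of 300 values for one word
```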

What word2vec does is compute the vector representations of words. What word2vec produces is a single computer file that contains those words and a list of vector values for each word (or phrase).

As an alternative to Google News, you might use the full text of Wikipedia as your corpus, if you wanted a general English-language vocabulary.

The breakthrough of word2vec

Back to that (surprisingly readable) paper by the Google engineers: They set out to solve a problem of scale. There were already systems that ingested a corpus and produced word vectors, but they were limited. Tomas Mikolov and his colleagues at Google wanted to use a bigger corpus (billions of words) to produce a bigger vocabulary (millions of words) with high-quality vectors, which meant more dimensions, e.g. 300 instead of 50 to 100.

“Because of the much lower computational complexity, it is possible to compute very accurate high-dimensional word vectors from a much larger data set.”

—Mikolov et al., 2013

With more dimensions per word vector, the vocabulary represents not only that bigger is related to big and biggest but also that big is to bigger as small is to smaller. Algebra can be used on the vector representations to return a correct answer (often, not always) — leading to a powerful discovery that substitutes for language understanding: Take the vector for king, subtract the vector for man, and add the vector for woman. What is the answer returned? It is the vector for queen.
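As a sketch in Python, again assuming the gensim library and the pretrained vectors loaded above:

```python
# king - man + woman: gensim does the vector arithmetic, then returns
# the vocabulary word closest to the resulting point in the space.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', 0.71...)]; correct often, not always
```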

Algebraic equations are used to test the quality of the vectors. Some imperfections can be seen in the table below.

From Mikolov et al., 2013; color and circle added

Mikolov and his colleagues wanted to reduce the time required for training the system that assigns the vectors to words. If you’re using only one computer, and the corpus is very large, training on a neural network could take days or even weeks. They tested various models and concluded that simpler models (not neural networks) could be trained faster, thus allowing them to use a larger corpus and more vectors (more dimensions).

How do you know if the vectors are good?

The researchers defined a test set consisting of 8,869 semantic questions and 10,675 syntactic questions. Each question begins with a pair of associated words, as seen in the highlighted “Relationship” column in the table above. The circled answer, small: larger, is a wrong answer; synonyms are not good enough. The authors noted that “reaching 100% accuracy is likely to be impossible,” but even so, a high percentage of answers are correct.

According to the paper, a question counts as correctly answered only if the closest word to the computed vector is exactly the same as the expected word; synonyms are therefore counted as mistakes.
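Conveniently, the gensim library bundles a copy of the Mikolov question set, so a sketch of running the test on the vectors loaded earlier is short:

```python
# Score the loaded vectors against the bundled analogy question set.
from gensim.test.utils import datapath

score, sections = vectors.evaluate_word_analogies(datapath("questions-words.txt"))
print(score)  # the overall fraction of analogy questions answered correctly
```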

Mikolov et al. compared word vectors obtained from two simpler architectures, CBOW and Skip-gram, with word vectors obtained from two types of neural networks. One neural net model was superior to the other. CBOW was superior on syntactic tasks and “about the same” as the better neural net on the semantic task. Skip-gram was “slightly worse on the syntactic task” than CBOW (but still better than the neural nets) and “much better on the semantic part of the test than all the other models.”

CBOW and Skip-gram are described in the paper.
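If you use gensim to train your own model, choosing between the two architectures is a single parameter. A minimal sketch, assuming gensim 4.x and a corpus already split into tokenized sentences:

```python
# Train a small model of our own; sg=0 selects CBOW (the default),
# sg=1 selects Skip-gram.
from gensim.models import Word2Vec

corpus = [
    ["the", "patient", "received", "an", "implant"],
    ["the", "device", "was", "recalled"],
]  # a real corpus would contain millions of sentences

model = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=1)
model.wv.save_word2vec_format("my_vectors.txt")  # the file of words + vectors
```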

Another way to test a model for semantic accuracy is to use the data from the Microsoft Research Sentence Completion Challenge. It provides 1,040 sentences, each with one word omitted. The task is to choose the correct missing word from five candidates: the original word plus four wrong ones (“impostor words”).

Summary

A word2vec model is trained using a text corpus. The final model exists as a file, which you can use in various language-related machine learning tasks. The file contains words and phrases — likely more than 1 million words and phrases — together with a unique vector (a list of values) for each word.

The vectors represent coordinates for the word. Words that are close to one another in the vector space are related either semantically or syntactically. If you use a popular already-trained model, the vectors have been rigorously tested. If you use word2vec to build your own model, then you need to do the testing.

The model — this collection of word embeddings — is human-language knowledge for a computer to use. It’s (obviously) not the same as humans’ knowledge of human language, but it’s proved to be good enough to function well in many different applications.


Imagining words as numbers in n-dimensional space

The vocabulary of a neural network is represented as vectors — which I wrote about yesterday. This enables many related words to be “close to” one another, which is how the network perceives similarity and difference. This is as near as a computer comes to understanding meaning — which is not very near at all, but good enough for a lot of practical applications of natural language processing.

A previous way of representing vocabulary for a neural network was to assign just one number to each word. If the neural net had a vocabulary of 20,000 words, that meant it had 20,000 separate inputs in the first layer — the input layer. (I discussed neural nets in an earlier post here.) For each word, only one input was activated. This is called “one-hot encoding.”

Representing words as vectors (instead of with a single number) means that each number in the array for one word is an input for the neural net. Among the many possible inputs, several or many are “hot,” not just one.
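A toy comparison of the two encodings, with made-up numbers (real dense vectors have hundreds of values):

```python
# One-hot vs. dense encoding for the same word, using invented values.
import numpy as np

vocab = ["big", "bigger", "small", "cat"]  # a real vocabulary: tens of thousands

# One-hot: one input per vocabulary word, and exactly one is "hot."
one_hot_big = np.zeros(len(vocab))
one_hot_big[vocab.index("big")] = 1.0
print(one_hot_big)  # [1. 0. 0. 0.]

# Dense vector: a short list of continuous values, most of them nonzero.
dense_big = np.array([0.21, -0.47, 0.03, 0.88])
print(dense_big)
```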


As I was sorting this in my mind today, reading and thinking, I had to consider how to convey this idea of words to my students (who might have no computer science background at all). The word itself doesn’t exist. The word is represented in the system as a list of numbers. The numbers have meaning; they locate the word-object in a mathematical space, for which computers are ideally suited. But there is no word.

Long ago in school I learned about the signifier and the signified. Together, they create a sign. Language is our way of representing the world in speech and in writing. The word is not the thing itself; the map is not the territory. And here we are, building a representation of human language in code, where a vocabulary of tens of thousands of human words exists in an imaginary space consisting of numbers — because numbers are the only things a computer can use.

I had a much easier time understanding the concepts of image recognition than I am having with NLP.


How does machine learning understand sentiment?

Sometimes I come across a video on YouTube that’s almost too simple — and that’s exactly what makes it great. Andy Kim, a junior at the elite prep school Deerfield Academy in Massachusetts, gave a local TED Talk about sentiment analysis, and I think it’s really perfect for anyone who’s spent a little time on understanding image recognition, but who has not yet studied much about natural language processing.

Your first thought might be that detecting the sentiment of a tweet, a movie review, or a response to customer service is just a matter of word definitions. Love is a positive word; hate is a negative word.

But as Melanie Mitchell wrote in Artificial Intelligence: A Guide for Thinking Humans (2019): “Looking at single words or short sequences in isolation is generally not sufficient to glean the overall sentiment; it’s necessary to capture the semantics of words in the context of the whole sentence” (p. 183; my emphasis).

Kim, in his TED Talk, does a good job of explaining how words are represented as vectors, and how this enables complex associations with similar or related terms. He doesn’t use a diagram of three-dimensional space (which I find helpful for conceptualizing this in my own mind); instead he refers to “an n dimensional space,” which I think my journalism students might not instantly visualize.

“These word vectors can span from 25 up to a thousand components. Now, conveniently, as these vectors are still simply a list of numbers, they can be plotted on an n dimensional space …”

—Andy Kim

In computer programming, a vector is a list of values, which you can think of as points or coordinates. In a two-dimensional space, you might have x and y, with the value of x representing the point’s position on a horizontal line, and the value of y representing the point’s position on a vertical line. Add a third dimension, and you have a third coordinate, z.

To simulate more dimensions, we add even more values to the list. A single word will have a list of many values, and those values signify its relations to other words in the collection of all words in the system.
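One common way to measure how “close” two of those lists are is cosine similarity. A toy sketch with invented four-dimensional vectors:

```python
# Cosine similarity: 1.0 means the vectors point the same way;
# values near 0 (or negative) mean the words are far apart.
import numpy as np

love = np.array([0.8, 0.1, -0.3, 0.5])
adore = np.array([0.7, 0.2, -0.2, 0.4])
hate = np.array([-0.6, 0.3, 0.9, -0.1])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(love, adore))  # high: the words are "near" each other
print(cosine_similarity(love, hate))   # much lower: the words are "far apart"
```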

At about the middle of his talk, Kim makes it perfectly clear why so many dimensions are needed to represent relationships among terms that have multiple meanings.

Kim goes on to talk about the labeled data for training a system to detect, or recognize, sentiment in text. He used a freely available dataset from Kaggle, probably the Sentiment140 dataset with 1.6 million tweets. (Another widely used dataset for sentiment analysis training is the IMDB Dataset of 50K Movie Reviews.) Kim also demonstrates cleaning the Twitter data so that usernames, hashtags and stop words are eliminated.
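That kind of cleaning can be sketched in a few lines of Python. The patterns and the (abbreviated) stop-word list below are my own illustration, not Kim’s code:

```python
# Remove usernames, hashtags, links, and stop words from raw tweet text.
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and"}  # a real list is much longer

def clean_tweet(text):
    text = re.sub(r"@\w+", "", text)     # remove usernames
    text = re.sub(r"#\w+", "", text)     # remove hashtags
    text = re.sub(r"http\S+", "", text)  # remove links
    words = text.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)

print(clean_tweet("@user I love the new phone! #android http://example.com"))
# -> "i love new phone!"
```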

Kim used the GloVe algorithm to construct vectors for the words in his dataset, but he skips over the details of the training and just tells us that he wasn’t very successful; his model only reached a 60 percent accuracy level. He closes by summarizing some of the uses of sentiment analysis.


How to start learning about algorithms

After writing yesterday’s post, I was thinking about how much students should know about algorithms if they are to have a basic understanding of how AI works. Is it enough to tell them an algorithm is a set of instructions?

So I turned, as I often do, to Khan Academy — a free online learning site that often helps me through my lack of a mathematics background. I found a set of three short lessons, starting with a video.

Screenshot from Khan Academy video

In the introductory video, “What is an algorithm and why should you care?”, we see various practical uses of algorithms, followed by the statement above, and a brief description of how route finding works — what Google Maps does when it gives you directions. Route finding is often used as an example of accepting a “good enough” output for the sake of speed (that is, efficiency).

Watching the animation, we comprehend that the computer is following a set of instructions to determine a good route for a delivery truck with 25 stops to make. We see the process of the algorithm at work, rather than seeing formulas and equations.

I love that the video also shows us, with animation, how the efficiency of an algorithm is calculated.

The second lesson, “A guessing game,” demonstrates binary search (an algorithm) by allowing you to discover it interactively. Wonderful!
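Binary search itself fits in a few lines. A sketch of the guessing game, where every guess eliminates half of the remaining numbers:

```python
# Guess a secret number by repeatedly halving the range of candidates.
def guess_number(low, high, secret):
    guesses = 0
    while low <= high:
        guesses += 1
        mid = (low + high) // 2
        if mid == secret:
            return mid, guesses
        elif mid < secret:
            low = mid + 1
        else:
            high = mid - 1

print(guess_number(1, 100, 73))  # finds 73 in at most 7 guesses
```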

The third lesson, “Route-finding,” is much more reading intensive. It explains the algorithm in terms of solving a maze. Without knowing the exact path to solve the maze, the algorithm can “know” which choice for its next step takes it closer to the goal (the center of the maze). I don’t consider this lesson very helpful, but that’s because I saw a much better explanation of maze-solving algorithms here:

Start video at 54:35 for demo of the greedy best-first search algorithm
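For reference, here is a sketch of greedy best-first search on a toy grid maze (my own illustration, not the code from the video). The algorithm always expands whichever open cell looks closest to the goal, measured by Manhattan distance, which makes it fast but not guaranteed to find the shortest path:

```python
# Greedy best-first search on a small grid: 0 = open cell, 1 = wall.
import heapq

MAZE = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
START, GOAL = (0, 0), (3, 3)

def manhattan(cell, goal):
    return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

def greedy_best_first(maze, start, goal):
    frontier = [(manhattan(start, goal), start)]  # priority queue
    came_from = {start: None}
    while frontier:
        _, (r, c) = heapq.heappop(frontier)
        if (r, c) == goal:
            break
        for step in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = step
            if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
                    and maze[nr][nc] == 0 and step not in came_from):
                came_from[step] = (r, c)
                heapq.heappush(frontier, (manhattan(step, goal), step))
    path, cell = [], goal  # walk backward from the goal to rebuild the path
    while cell is not None:
        path.append(cell)
        cell = came_from[cell]
    return path[::-1]

print(greedy_best_first(MAZE, START, GOAL))
```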

I am continually amazed and humbled by the variety of ways in which people teach these concepts. More important, I realize how some ways of explaining a concept are not at all effective — for me, at least — and another way of explaining makes it clear as crystal.

So, how much should students know about algorithms, if they are to have a general understanding of AI? I think a good start would be to watch and discuss the introductory Khan Academy video, and also to see a further visual (probably animated) representation of another kind of algorithm at work.


What do we talk about when we talk about algorithms?

Mashable recently published a series about algorithms.

  1. What is an algorithm, anyway?
  2. Algorithms control your online life. Here’s how to reduce their influence.
  3. It’s almost impossible to avoid triggering content on TikTok
  4. The algorithms defining sexuality suck. Here’s how to make them better.
  5. Why it’s impossible to forecast the weather too far into the future (The Dominance of Chaos)
  6. 12 unexpected ways algorithms control your life
  7. People are fighting algorithms for a more just and equitable future. You can, too.
  8. How to escape your social media bubble before the election
  9. An open letter to the most disappointing algorithms in my life

The first post, “What is an algorithm, anyway?”, addresses the fact that the word algorithm is often bandied about as if it means a mysterious, possibly evil, machine-embedded power.

But an algorithm doesn’t need to have anything to do with computers. An algorithm is a set of instructions for how to solve a problem. A recipe for a cake is an algorithm.


And yes, of course, computer software is full of algorithms. The programs that make machine learning and artificial intelligence work are full of algorithms. So algorithms are not magical, and they are not good or bad by nature. Also, they are not perfect.

We went through a period — maybe five years, maybe more — when there were a ton of articles about algorithms, and the word became almost common in nonfiction book titles. Now I see a shift toward the term AI — or artificial intelligence, or machine learning — substituting for algorithms in provocative headlines.

Too many articles, though, don’t make much of an effort to differentiate, to explain what they’re really talking about. They may as well just say computers, or software.

An algorithm is real. It is constructed by a person, or people, to do a certain task. Algorithms are often combined, so that inside one algorithm, another algorithm is followed. Thus algorithms can be components of other algorithms.


I’m often reminded of a book I read three years ago, Algorithms to Live By: The Computer Science of Human Decisions, by Brian Christian and Tom Griffiths. It was fun to read, but it was hardly the breezy self-help type of thing the cover blurbs might lead one to believe. The authors describe and explain a number of established algorithms used widely in various fields and applications — and they apply each one to everyday life.

Stories about the people who discovered (authored) many of the algorithms are woven in. I appreciated seeing how someone working on one problem sometimes ended up solving another. I also saw how an algorithm built for one use gets repurposed for other ends. Best of all, I understood what many of the algorithms are meant to do — as well as how they do it.

What I’d like to see in general articles about algorithms is a little more of what Christian and Griffiths managed to do in their book.


Would you let AI create a recipe for you?

On Fridays I try to find something to write about that’s a little less heavy than explanations of neural networks and examinations of embedded biases in AI systems. I call it Friday AI Fun.

The BBC recently wrote about a mobile app that uses AI to help you concoct a meal from the ingredients you already have at home. Plant Jammer is available for both iOS and Android, and it doesn’t merely take your ingredients and find an existing recipe for you — it actually creates a new recipe.

According to BBC journalist Nell Mackenzie, the results are not always delicious. She made some veggie burgers that came out tasting like oatmeal.

I was interested in how the app uses AI, and this is what I found: The team behind Plant Jammer consists of 15 chefs and data scientists, based in Copenhagen, Denmark. They admit that “AI is only a fraction” of what powers the app, framing that as a positive because the app incorporates “gastronomical learnings from chefs.”

Image from Plant Jammer

The app includes multiple databases, including one of complete recipes. An aspect of the AI is a recommender system, which they compare to Netflix’s. As Plant Jammer learns more about you, it will improve at creating recipes you like, based on “people like you.”

“We asked the chefs which ingredients are umami, and how umami they are. This part reflects the ‘human intelligence’ we used to build our system, a great ‘engine’ that has led to very interesting findings.”

—Michael Haase, CEO, Plant Jammer

My searches led me to an interview with Michael Haase, Plant Jammer’s CEO, in which he described the “gastro-wheel” feature in the app. The wheel encourages you to find balance in your ingredients among a base, something fresh, umami, crunch, sweet-spicy-bitter, and something that ties the ingredients together in harmony.

I’ve downloaded the app but, unlike Mackenzie, I haven’t been brave enough yet to let it create a recipe for me. Exploring some of the recommended recipes in the app, I did find the ability to select any ingredient and instantly see substitutions for it — that could come in handy!

Mackenzie’s article for the BBC also describes other AI-powered food and beverage successes, such as media agency Tiny Giant using AI to help clients “find new combinations of flavors for cupcakes and cocktails.”


Using Super Mario to understand neural networks

I doubt I will ever program a neural network, but I’m trying to understand how they work — and how they are trained — well enough to make sound assumptions about the systems I encounter. What I want to be able to do is raise questions when I hear about a new-to-me AI system. I don’t want to take it on faith that a system is safe and likely to function well.

Ultimately I want to help my journalism and communications students understand this too.

Last week I discussed here a video about how neural networks work. Some time before I found that video, I had watched this one a couple of times. It’s from 2015 and it’s only 6 minutes long. It’s been viewed on YouTube more than 9 million times; in fact, it’s pretty close to 10 million views!

Video game designer Seth Bling demonstrates a fully trained neural network that plays Mario expertly. Then he shows us how the system looks at the start, when the Mario character just stands in one place and dies every time. This is the untrained neural network, when it “knows” nothing.

Unlike the example in my earlier post — where the input to the neural network was an image of a handwritten number, and the output was the number (thereby “reading” the image) — here the input is the game state, which changes by the split second. The game state is a simplified digital representation of the Mario character, the surfaces he can run on or jump to, and any obstacles or rewards that are present. The output is which button should be pressed — holding down right continuously makes Mario run toward the right without stopping.

So the output layer in this neural network is the set of all possible actions Mario can take. For a human playing the game, these would be the buttons on the game controller.

In the training, Mario has a “fitness level,” which is a number. When Mario is dying all the time, that number stays around 2. When Mario reaches the end of the level without dying (but without scoring extra points), his fitness is 528. So by “looking at” the fitness level, the neural net assesses success. If the number has increased, then keep doing the same thing.

“The more lines and neurons you have, the more nuanced the decisions can be.”

—Seth Bling

Of course there are more actions than only moving right. Training the neural net to make Mario jump and perform more actions required many generations of neural nets, and only the best-performing ones were selected for the next generation. After 34 generations, the fitness level reached 4,000.
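The generational loop can be sketched in a simplified form. This is not NEAT itself, which also evolves each network’s structure; the “networks” below are just lists of weights, and play_level() is a toy stand-in for running a network through the actual game:

```python
# A toy select-and-mutate loop: score every network, keep the best,
# refill the population with mutated copies of the survivors.
import random

def play_level(network):
    # Stand-in fitness; the real system runs the game and returns how
    # far right Mario traveled before dying.
    return sum(network)

def mutate(network):
    return [w + random.gauss(0, 0.1) for w in network]

population = [[random.random() for _ in range(10)] for _ in range(50)]

for generation in range(34):
    scored = sorted(population, key=play_level, reverse=True)
    best, survivors = scored[0], scored[:10]
    population = [mutate(random.choice(survivors)) for _ in range(50)]
    print(f"generation {generation}: best fitness {play_level(best):.2f}")
```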

One thing I especially like about this video is the simultaneous visual of real Mario running in the real game level, along with a representation of the neural net showing its pathways in green and red. There is no code and no math in this video, and so while watching it, you are only thinking about how the connections come to be made and reinforced.

The method used is called NeuroEvolution of Augmenting Topologies (NEAT), which I’ve read almost nothing about — but apparently it enables the neural net to grow itself, essentially. Which is kind of mind blowing.

Bling shared his code here; it’s written in the Lua language.


What’s the use of machine learning?

I’m interested in applications of machine learning in journalism. This is natural, as my field is journalism. In the field of computer science, however, accolades and honors tend to favor research on new algorithms or procedures, or new network architectures. Applications are practical uses of algorithms, networks, etc., to solve real-world problems — and developing them often doesn’t garner the acclaim that researchers need to advance their careers.

Hannah Kerner, a professor and machine learning researcher at the University of Maryland, wrote about this in the MIT Technology Review. Her essay is aptly titled “Too many AI researchers think real-world problems are not relevant.”

“The first image of a black hole was produced using machine learning. The most accurate predictions of protein structures, an important step for drug discovery, are made using machine learning.”

—Hannah Kerner

Noting that applications of machine learning are making real contributions to science in fields outside computer science, Kerner (who works on machine learning solutions for NASA’s food security and agriculture program) asks how much is lost because of the priorities set by the journals and conferences in the machine learning field.

She also ties this focus on ML research for the sake of advancing ML to the seepage of bias out from widely used datasets into the mainstream — the most famous cases being in face recognition, with systems (machine learning models) built on flawed datasets that disproportionately skew toward white and male faces.

“When studies on real-world applications of machine learning are excluded from the mainstream, it’s difficult for researchers to see the impact of their biased models, making it far less likely that they will work to solve these problems.”

—Hannah Kerner

Machine learning is rarely plug-and-play. In creating an application that will be used to perform useful work — to make new discoveries, perhaps, or to make medical diagnoses more accurate — the machine learning researchers will do substantial new work, even when they use existing models. Just think, for a moment, about the data needed to produce an image of a black hole. Then think about the data needed to make predictions of protein structures. You’re not going to handle those in exactly the same way.

I imagine the work is quite demanding when a number of non–ML experts (say, the biologists who work on protein structures) get together with a bunch of ML experts. But either group working separately from the other is unlikely to come up with a robust new ML application. Kerner linked to this 2018 news report about a flawed cancer-detection system — leaked documents said that “instead of feeding real patient data into the software,” the system was trained on data about hypothetical patients. (OMG, I thought — you can’t train a system on fake data and then use it on real people!)

Judging from what Kerner has written, machine learning researchers might be caught in a loop, where they work on pristine and long-used datasets (instead of dirty, chaotic real-world data) to perfect speed and efficiency of algorithms that perhaps become less adaptable in the process.

It’s not that applications aren’t getting made — they are. The difficulty lies in the priorities for research, which might dissuade early-career ML researchers in particular from work on solving interesting and even vital real-world problems — and wrestling with the problems posed by messy real-world data.

I was reminded of something I’ve often heard from data journalists: If you’re taught by a statistics professor, you’ll be given pre-cleaned datasets to work with. (The reason being: She just wants you to learn statistics.) If you’re taught by a journalist, you’ll be given real dirty data, and the first step will be learning how to clean it properly — because that’s what you have to do with real data and a real problem.

So the next time you read about some breakthrough in machine learning, consider whether it is part of a practical application, or instead, more of a laboratory experiment performed in isolation, using a tried-and-true dataset instead of wild data.


Journalists use machine learning to examine medical device records

Some investigations in the public interest require journalists to search through large quantities of official documents. Often the set of documents is very diverse — that is, the format, structure, and even language of the documents might vary greatly.

One of the more impressive investigations I know of is the ongoing Implant Files project, conducted originally by 250 journalists in 36 countries. The purpose: To examine how medical devices (specifically, those implanted into human bodies) are “tested, approved, marketed, and monitored” (source). I’ve heard this project discussed at conferences, and I’m full of admiration for the editors and reporters involved, led by the International Consortium of Investigative Journalists (ICIJ).

At the heart of the investigation, with its first results published in 2018, was “an analysis of more than 8 million device-related health records, including death and injury reports and recalls.”

“The entire process involved text mining, clustering, feature selection, association rules and classification algorithms to identify events not always described consistently in different parts of the data.”

—“How ICIJ Used Machine Learning to Help Find Medical Device Issues”

These implanted devices — hip replacements, defibrillators, breast implants, intraocular lenses, and more — are used all around the world. When something goes wrong and a product recall is issued, however, the news might not spread to all the locations where the devices continue to be used in new surgeries for new patients. Moreover, people who already have a faulty implant might not be notified. This is why a global investigation was sorely needed.

Above: An ICIJ video summarizes how patients who receive implants are left unprotected

In 2018, ICIJ shared “a publicly searchable database of more than 70,000 recalls and safety warnings in 11 countries.” The project has continued since then, and the database now contains “more than 120,000 recalls, safety alerts and field safety notices” for medical devices. Throughout 2019, thousands more records were added.

A December 2018 post details the team’s data methodology for the Implant Files. First, journalists had to get the records — and often, their legitimate requests for public records were denied. Of the 8 million device-related records they managed to obtain, 5.4 million came from the U.S. Food and Drug Administration.

The records “describe cases where a device is suspected to have caused or contributed to a serious injury or death or has experienced a malfunction that would likely lead to harm if it were to recur.”

The value in these records was in the connections — connections among cases, and connections among devices. The ICIJ analysis concluded that “devices that broke, misfired, corroded, ruptured or otherwise malfunctioned after implantation or use were linked to more than 1.7 million injuries and nearly 83,000 deaths” in just one decade.

To identify the records that involved a patient’s death, it was necessary for humans to determine various terms and phrasing used instead of the word “death” in the documents. Eventually they developed “a set of more than 3,400 key phrases” that were used to train the machine learning system. After using that model to extract the relevant records, it was necessary to run them through another algorithm configured to determine whether the implant device had contributed to the death.
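The first step, flagging records that match any key phrase, might look something like this sketch. The phrases and the sample record are invented; ICIJ’s actual list held more than 3,400 phrases:

```python
# Flag records whose narrative text contains any known "death" phrase.
DEATH_PHRASES = ["patient expired", "fatal outcome", "pronounced dead"]

def flag_possible_death(record_text):
    text = record_text.lower()
    return any(phrase in text for phrase in DEATH_PHRASES)

record = "Device fractured during procedure; patient expired two days later."
print(flag_possible_death(record))  # True: send on for further classification
```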

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.
