Who labels the data for AI?

In yesterday’s post, I referred to the labels that are required for supervised machine learning. To train a model — which enables an AI system to correctly identify or sort images or documents or iris flowers (and so much more) — each data record must include one or more labels. For an image of a dog, for example, the labels might be dog and Great Dane. For an iris flower, the label is the name of the exact species of that individual flower.
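To make the idea of a labeled record concrete, here is a minimal sketch in Python. The class name `LabeledRecord`, its fields, and the file paths are all made up for illustration; real datasets store their labels in many different formats.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LabeledRecord:
    """One training example: a pointer to the raw data plus its human-assigned labels."""
    data: str                                        # hypothetical path or ID for the item
    labels: List[str] = field(default_factory=list)  # one or more labels per record


# Hypothetical records echoing the dog and iris examples in the text.
records = [
    LabeledRecord(data="photos/dog_001.jpg", labels=["dog", "Great Dane"]),
    LabeledRecord(data="iris/flower_017", labels=["Iris setosa"]),
]

for record in records:
    print(record.data, "->", ", ".join(record.labels))
```

Each record pairs one item with the labels a person gave it, and that pairing is what a supervised model learns from.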

Nowadays there are people all around the world sitting at computers and labeling data.

In the 6-minute video above, BBC journalist Dave Lee travels to Kenya, where about 2,000 people work in a Nairobi office for Samasource, which produces training data for use in machine learning.

You’ll see exactly how every single item in one video frame is marked and tagged — this is what a vision system for a self-driving car needs if it is to avoid crashing into mailboxes or people.

In the Nairobi office, 52 percent of the workers are women. The pay is terribly low by Silicon Valley standards, but high for Kenya. Lee doesn’t gloss over this aspect of the story — in fact, it’s central to the telling.

Financial Times journalist Madhumita Murgia wrote about Samasource in July 2019. Her story also covers iMerit, a similar company with offices in Kolkata, India, as well as California and Louisiana.

“An hour of video takes eight hours to annotate. In fact, a McKinsey report from 2018 listed data labeling as the biggest obstacle to AI adoption in industry.”

—Financial Times

Some very large and widely used datasets such as ImageNet were labeled by self-employed workers for extremely low rates of pay — often through the Amazon-owned Mechanical Turk crowdsourcing website (which also offers up far worse tasks for similarly low compensation). In contrast, Samasource’s CEO Leila Janah told Murgia that the company’s pay rate is “almost quadruple” the previous income of their workers in developing countries.

Janah also pointed out that these workers are not just labeling cats and dogs. They have been trained, for example, to label diseased cells in photos of cross-sections of plants for one particular project. They are providing real human intelligence that is specialized to very particular problem sets.

Fortune journalist Jeremy Kahn wrote about other companies that also provide data-labeling services for top multinational firms. Labelbox and Scale AI have received heaps of funding from venture capitalists, but I couldn’t find any information about their workers who label the data. Is this something we should be concerned about? Probably so.

Both Samasource and iMerit are upfront about who their workers are and where they do the work (this might have changed since the spread of COVID-19 in early 2020). Are the dozens of other companies supplying labeled data to corporations and universities in wealthy countries paying their workers a living wage?

“Often companies have a need for both general and more expert labeling and employ a combination of outsourcing firms, freelancers, and in-house experts to affix these annotations.”

—Fortune

Labelbox, in fact, doesn’t employ people who do the labeling work, according to Fortune. It provides “a tool for managing labeling projects and data across different contract labelers, who often work for large outsourcing firms.”

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.


ImageNet and labels for data

Supervised learning is a type of machine learning in which a model is trained using labeled data. You begin with a collection of labeled data, usually a very large one. (In the case of ImageNet, the data were all digital images. For the Iris Data Set, the data all refer to individual iris flowers, which can be divided into three related species. For the MNIST dataset, the data are about 70,000 images of handwritten digits, 0 through 9.)

You divide the dataset into two parts, the training data and the test data. The split might be 70/30, or 80/20. You don’t hand-pick which records go into which group; they are assigned at random. Then you run the training data through the code many, many, many times, adjusting certain parameters along the way, until the code consistently returns good results. That is, the thing the code identifies (an object in an image, an iris species, a number) matches the label, which the code does not see when it makes its guess.
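The split itself can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the seed, and the stand-in dataset are assumptions, and most machine learning libraries ship their own helper for this step.

```python
import random


def split_dataset(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records, then cut it into training and test lists."""
    shuffled = list(records)
    # Shuffling puts the records in random order, so nobody hand-picks the groups.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in dataset: 100 (item, label) pairs with made-up labels.
dataset = [(i, "even" if i % 2 == 0 else "odd") for i in range(100)]

train, test = split_dataset(dataset, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

Fixing the seed makes the shuffle repeatable, which helps when you want to compare two training runs on the same split.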

At that point, you have a trained model. You feed the test data set to it and see whether the accuracy rate is also high. (It’s important that none of the test data were used to train the model.) Again, the proof is in the labels.
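Here is a rough sketch of that check in Python. Everything in it is an assumption made for illustration: the data are randomly generated points standing in for two invented "species," and the model is a deliberately simple nearest-centroid classifier, not a neural network. Unlike the repeated training runs described above, it learns in a single pass by averaging the features for each label.

```python
import random

rng = random.Random(0)


def make_points(label, center, n=50):
    """Generate made-up (length, width) measurements scattered around a center."""
    return [((rng.gauss(center[0], 0.3), rng.gauss(center[1], 0.3)), label)
            for _ in range(n)]


data = make_points("species_a", (1.5, 0.3)) + make_points("species_b", (4.5, 1.4))
rng.shuffle(data)
train, test = data[:80], data[80:]  # 80/20 split; test records are never used in training


def train_centroids(records):
    """Training step: compute the average feature values for each label."""
    sums = {}
    for (x, y), label in records:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, features):
    """Guess the label whose centroid lies closest to the features."""
    x, y = features
    return min(centroids,
               key=lambda lab: (x - centroids[lab][0]) ** 2 + (y - centroids[lab][1]) ** 2)


def accuracy(centroids, records):
    """Fraction of records where the guess matches the human-assigned label."""
    correct = sum(1 for features, label in records
                  if predict(centroids, features) == label)
    return correct / len(records)


model = train_centroids(train)
print("training accuracy:", accuracy(model, train))
print("test accuracy:", accuracy(model, test))
```

The labels do double duty here: they shape the model during training, and they are the answer key when the model is scored on test records it has never seen.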

In a later post I will discuss how data come to be labeled. (Hint: It’s not elves.) In this post, I will discuss bad labels. Specifically, I want to highlight the work that AI researcher Kate Crawford and artist-researcher Trevor Paglen did around the famous ImageNet dataset.

In the video above, Crawford and Paglen present this work and show a lot of great examples. They also published a long article about the work, if you’d rather read than watch.

ImageNet is a huge collection of labeled images, more than 14 million of them. They were labeled according to a set of categories and synonym groupings from WordNet, an English-language lexical database. The images were labeled by humans.

And that, it seems, is at the root of the problem.

Crawford and Paglen were interested in the ImageNet photos of people. Person is a category in WordNet. Within the category, there are many descriptive terms for people, such as “cheerleaders, scuba divers, welders, Boy Scouts, fire walkers, and flower girls.” So the photos of people in ImageNet are labeled with these terms. However, not all terms are neutral.

“A young man drinking beer is categorized as an ‘alcoholic, alky, dipsomaniac, boozer, lush, soaker, souse.’ A child wearing sunglasses is classified as a ‘failure, loser, non-starter, unsuccessful person.’”

—Crawford and Paglen

You might say, well, where’s the harm? They are only labels in a database, after all.

The ImageNet database has been used to train many convolutional neural networks used in image-recognition software.

When you feed a photo of yourself into an image-recognition application, you might be surprised at the labels that are applied to you. For example, an image of Paglen (a white man with a shaved head) was labeled as “Klansman, Ku Kluxer.”

Paglen built a web app called ImageNet Roulette so that anyone could upload a photo of themselves or a friend and see what labels were applied. (The app is no longer online.) It became clear that perfectly innocuous people in photos were being labeled as criminals or dangerous, or with racist or sexist terms.

About 952,000 of ImageNet’s 14 million images were in the person category as of 2010. Many of those images, along with their labels, were removed after the opening of Crawford and Paglen’s art exhibition, Training Humans, in Milan in September 2019.

ImageNet has been used to train countless image-recognition systems since 2010.

Additional information:

Leading online database to remove 600,000 images after art project reveals its racist bias (September 2019), The Art Newspaper.
