Who labels the data for AI?

In yesterday’s post, I referred to the labels that are required for supervised machine learning. To train a model — which enables an AI system to correctly identify or sort images or documents or iris flowers (and so much more) — each data record must include one or more labels. For an image of a dog, for example, the labels might be dog and Great Dane. For an iris flower, the label is the name of the exact species of that individual flower.
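
To make "one or more labels" concrete, here is a minimal Python sketch of what labeled records could look like. The field names (`features`, `labels`), the file name, and the measurement values are my own illustrative assumptions, not taken from any particular dataset or labeling tool.

```python
# Hypothetical labeled records for supervised learning.
# File names, field names, and measurements are made up for illustration.
records = [
    {"features": "photo_001.jpg", "labels": ["dog", "Great Dane"]},
    {"features": [5.1, 3.5, 1.4, 0.2], "labels": ["Iris setosa"]},
]


def split_features_and_labels(records):
    """Separate the model inputs from the human-supplied labels."""
    inputs = [r["features"] for r in records]
    targets = [r["labels"] for r in records]
    return inputs, targets


inputs, targets = split_features_and_labels(records)
```

The point of the split is that a model is trained on `inputs` while being told the matching `targets`, and someone has to supply those targets.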

Today, people all around the world sit at computers and label data.

In the 6-minute video above, BBC journalist Dave Lee travels to Kenya, where about 2,000 people work in a Nairobi office for Samasource, which produces training data for use in machine learning.

You’ll see exactly how every single item in one video frame is marked and tagged — this is what a vision system for a self-driving car needs if it is to avoid crashing into mailboxes or people.
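
As a rough sketch of what "marked and tagged" might produce for a single frame, the Python below stores each item as a labeled rectangle. The `BoxAnnotation` schema, the pixel coordinates, and the object names are hypothetical. Real annotation tools may use polygons or other formats, and nothing here reflects the actual format used in the video.

```python
from dataclasses import dataclass


@dataclass
class BoxAnnotation:
    """One tagged object in a frame; this schema is hypothetical."""
    label: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


# Made-up annotations for a single video frame.
frame_annotations = [
    BoxAnnotation("mailbox", x=40, y=210, width=30, height=60),
    BoxAnnotation("pedestrian", x=300, y=180, width=45, height=120),
    BoxAnnotation("car", x=500, y=200, width=160, height=90),
]


def labels_in_frame(annotations):
    """Return the sorted, de-duplicated object labels present in one frame."""
    return sorted({a.label for a in annotations})


print(labels_in_frame(frame_annotations))
```

Every rectangle in a list like this is something a person drew and named by hand, for every frame.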

In the Nairobi office, 52 percent of the workers are women. The pay is terribly low by Silicon Valley standards, but high for Kenya. Lee doesn’t gloss over this aspect of the story — in fact, it’s central to the telling.

Financial Times journalist Madhumita Murgia wrote about Samasource in July 2019. Her story also covers iMerit, a similar company with offices in Kolkata, India, as well as California and Louisiana.

“An hour of video takes eight hours to annotate. In fact, a McKinsey report from 2018 listed data labeling as the biggest obstacle to AI adoption in industry.”

—Financial Times

Some very large and widely used datasets such as ImageNet were labeled by self-employed workers for extremely low rates of pay, often through the Amazon-owned Mechanical Turk crowdsourcing website (which also offers up far worse tasks for similarly low compensation). In contrast, Samasource’s CEO Leila Janah told Murgia that the company’s pay rate is “almost quadruple” the previous income of its workers in developing countries.

Janah also pointed out that these workers are not just labeling cats and dogs. They have been trained, for example, to label diseased cells in photos of cross-sections of plants for one particular project. They are providing real human intelligence that is specialized to very particular problem sets.

Fortune journalist Jeremy Kahn wrote about other companies that also provide data-labeling services for top multinational firms. Labelbox and Scale AI have received heaps of funding from venture capitalists, but I couldn’t find any information about their workers who label the data. Is this something we should be concerned about? Probably so.

Both Samasource and iMerit are upfront about who their workers are and where they do the work (this might have changed since the spread of COVID-19 in early 2020). Are the dozens of other companies that supply labeled data to corporations and universities in wealthy countries paying their workers a living wage?

“Often companies have a need for both general and more expert labeling and employ a combination of outsourcing firms, freelancers, and in-house experts to affix these annotations.”

—Fortune

Labelbox, in fact, doesn’t employ people who do the labeling work, according to Fortune. It provides “a tool for managing labeling projects and data across different contract labelers, who often work for large outsourcing firms.”

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.
