‘Ground truth’ and labeled data – AI in Media and Society

Cassie Kozyrkov, who wrote this article, is head of decision intelligence at Google. It starts out with what looks like a standard explanation of an image-recognition system — which she deprecatingly refers to as the “the cat/not-cat task.” But don’t be fooled — Kozyrkov communicates with clear, sharp precision, and very quickly she asks us to consider circumstances in which we would want a tiger to be considered a cat and those in which we would want it to be not-cat.

This leads to a discussion of ground truth. This is “an ideal expected result” — but for whom? Well, for the people who originally built the system. Kozyrkov notes that ground truth is NOT an objective, perfect truth like something studied in a philosophy class (Truth with a capital T). It’s whether a tiger is a cat in your reality or not-cat in mine.

I am reminded of one of my favorite lines in the rock opera Jesus Christ Superstar: “But what is truth? Is truth unchanging law? We both have truths. Are mine the same as yours?”

“When such a dataset is used to train ML/AI systems, systems based on it will inherit and amplify the implicit values of the people who decided what the ideal system behavior looked like to them.”
— Cassie Kozyrkov

It also brings to mind the practice of testing for intercoder reliability — standard practice in research that relies on qualitative data. (More about that here.)

Say you are using an existing labeled dataset — not one you yourself have created — which is often the case. The labels attached to the data items are the ground truth for that dataset. If it’s a dataset of images, and some labels applied to photos of people are racist, then that’s the ground truth in that dataset. If it’s a dataset for sentiment analysis, and a lot of toxic comments are labeled “not toxic,” then that’s the ground truth you’re adopting.

It’s essential for developers to test systems extensively to uncover these flaws in ground truth.

“You wouldn’t want to fall victim to a myopic fraud detection system with sloppy definitions of what financial fraud looks like, especially if such a system is allowed to falsely accuse people without giving them an easy way to prove their innocence.”
— Cassie Kozyrkov

In a video embedded in the same article, Kozyrkov pithily proclaims: “There are only actually two real lines there. Here’s what they are: This objective. That data set.” (At 9:16.) Of course there’s a ton more code than that (she’s talking about the programming of the system that creates the model), but in terms of what you want the system to be able to do, that’s it in a nutshell: How have you framed your objective? And what’s in your dataset? More important, in many cases, is what’s NOT in your dataset.

She says this is where the core danger in AI lies, because in traditional programming “it might take 10,000 lines of code, a hundred thousand lines of code maybe, and some human being has to worry about every single one of those lines, agonize over it.” With supervised machine learning, you’ve only got the objective and the (gigantic) dataset, and the question is, Have enough people with expertise really agonized over each of those things?

My other favorite bits from the video:

“A system that is built and designed for one purpose may not work for a different purpose.” (6:17)
“Remember that the objective is subjective.” (6:31)
“And if you take those two parts really seriously, that is how you are going to build a safe and effective and kind AI system.” (20:16)

AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.