For Friday AI Fun, let’s look at an oldie but goodie: Google’s Quick, Draw!
You are given a word, such as whale, or bandage, and then you need to draw that in 20 seconds or less.
Thanks to this game, Google has labeled data for 50 million drawings made by humans. The drawings “taught” the system what people draw to represent those words. Now the system uses that “knowledge” to tell you what you are drawing — really fast! Often it identifies your subject before you finish.
It is possible to stump the system, even though you’re trying to draw what it asked for. My drawing of a sleeping bag is apparently an outlier. My drawings of the Mona Lisa and a rhinoceros were good enough — although I doubt any human would have named them as such!
Google’s AI thought my sleeping bag might be a shoe, or a steak, or a wine bottle.
The system has “learned” to identify only 345 specific things. These are called its categories.
You can look at the data the system has stored — for example, here are a lot of drawings of beard.
The most wonderful thing about YouTube is you can use it to learn just about anything.
One of the 10,000 annoying things about YouTube is finding a good, satisfying version of the lesson you want to learn can take hours of searching. This is especially true of videos about technical aspects of machine learning. Of course there are one- and two-hour recordings of course lectures by computer science professors. But I’ve been seeking out shorter videos with more animations and illustrations of concepts.
Understanding what a neural network is and how it processes data is necessary to demystifying machine learning. Data goes in, results come out — but in between is a “black box” consisting of code and hardware. It sort of works like a human brain, and yet, it really doesn’t.
So here at last is a painless, math-free video that walks us through a neural network. The particular example shown uses the MNIST dataset, which consists of 70,000 images of handwritten digits, 0–9. So the task being performed is the recognition of those digits. (This kind of system can be used to sort mail using postal codes, for example.)
What you’ll see is how the first layer (a vertical line of circles on the left side) represents the input. If each of the MNIST images is 28 pixels wide by 28 pixels high, then that first layer has to represent 784 pixels and each of their color values — which is a number. (One image is the input — only one at a time.)
The final vertical layer, all the way to right side, is the output of the neural network. In this example, the output tells us which digit was in the input — 0, 1, 2, etc. To see the value in this, go back to the mail-sorting idea. If a system can read postal codes, it recognizes several numbers and then transmits them to another system that “knows” which postal code goes to which geographical location. My letter gets sorted into the Florida bin and yours into the bin for your home.
In between the input and the output are the vertical “hidden” layers, and that’s where the real work gets done. In the video you’ll see that the number of circles — often called neurons, but they can also be called just units — in a hidden layer might well be less than the number of units in the input layer. The number of units in the output layer can also differ from the numbers in other layers.
Beautifully, during an animation, our teacher Grant Sanderson explains and shows that the weights exist not in or on the units (the “neurons”) but in fact in or on the connectionsbetween the units.
Okay, I lied a little. There is some math shown here. The weight assigned to the connection is multiplied by the value of the unit to the left. The results are all summed, for all left-side units, and that sum is assigned to the unit to the right (meaning the right side of that one connection).
The video bogs down just a bit between the Sigmoid squishification function and applying the bias, but all you really need to grasp is that the value of the right-side unit shows whether or not that little region of the image (in this case, it’s an image) has a significant difference. The math is there to determine if the color, the amount of color, is significant enough to count. And how much it should count.
I know — math, right?
But seriously, watch the video. It’s excellent.
“And that’s a lot to think about! With this hidden layer of 16 neurons, that’s a total of 784 times 16 weights, along with 16 biases. And all of that is just the connections from the first layer to the second.”
—Grant Sanderson, But what is a neural network? (video)
Sanderson doesn’t burden us with the details of the additional layers. Once you’ve seen the animations for that first step — from the input layer through the connections to the first hidden layer — you’ll have a real appreciation for what’s happening under the hood in a neural network.
In the final 6 minutes of this 19-minute video, you’ll also learn how the “learning” takes place in machine learning when a neural net is involved. All those weights and bias values? They are not determined by humans.
“Digging into what the weights and biases are doing is a good way to challenge your assumptions and really expose the full space of possible solutions.”
—Grant Sanderson, But what is a neural network? (video)
I confess it does get rather mathy at the end, but hang on through the parts that are beyond your personal math background and listen to what Sanderson is telling us. You can get a lot out of it even if the equation itself is like hieroglyphics to you.
The video content ends at 16:26, followed by the usual “subscribe to my channel” message. More info about Sanderson and his excellent videos is on his website, 3Blue1Brown.
I was surprised when I watched this video about how most face detection works. Granted, this is not face recognition (identifying the specific person). Face detection looks at an image or video and can almost instantly point out all the human faces. In a consumer camera, this is part of the code that puts a rectangle around each person’s face while you’re framing your shot.
What’s wonderful in the video is how the Viola–Jones object detection framework is illustrated and explained so that even we non-math types can understand it.
Like the game cases I wrote about yesterday, this is a case where tried-and-true algorithms are used, but deep neural networks are not.
As is typical with AI, there is a model. How does the code identify a human face? It “knows” some things about the shape and proportions of human faces. But it knows these attributes (features) not as noses and eyes and mouths — as we humans do. Instead, it knows them as rectangular shapes that map very well to the pixels in a digital image.
Make sure you stay with the video until 3:30, when Mike Pound begins to draw on paper. (This drawing-by-hand is a large part of why I love the videos from Computerphile!) At 8:30 he begins drawing a face to show how the algorithm analyzes that segment of an image.
The one part that might not be clear (depending on how much time you spend thinking about pixels in images) is that the numbers in the grid he draws represent values of lightness or darkness in the image. In all cases, computers require knowledge to be represented as numbers. When dealing with images, numbers represent differences. To compare sections of an image with other sections, the numeric values for one section are added up and compared with the sum of numeric values from another section.
The animations in the final three minutes of the video provide an awesomely clear explanation of how the regions of the image are assessed and quickly discarded as “not a face” or retained for further examination.
Computers are lightning-fast at these kinds of calculations. This method is so efficient, it runs rapidly even on simple hardware — which is why this method of face detection has been in use since 2002.
To see for yourself the product, or end results, of an AI system, check out the Visual Chatbot online. It’s free. It’s fun.
This app invites you to upload any image of your choice. It then generates a caption for that image. As you see above, the caption is not always 100 percent accurate. Yes, there is a dog in the photo, but there is no statue. There is a live person, who happens to be a soldier and a woman.
You can then have a conversation about the photo with the chatbot. The chatbot’s answer to my first question, “What color is the dog?”, was spot-on. Further questions, however, reveal limits that persist in most of today’s image-recognition systems.
In spite of the mistakes the chatbot makes in its answers to questions about this image, it serves as a nice demonstration of how today’s chatbots do not need to follow a set script. Earlier chatbots were programmed with rules that stepped through a tree or flowchart of choices — if the human’s question contains x, then reply with y.
Supervised learning is a type of machine learning in which a model is trained using labeled data. You begin with a very large collection of labeled data. (In the case of ImageNet, the data were all digital images. For the Iris Data Set, the data all refer to individual iris flowers, which can be divided into three related species. For the MNIST dataset, the data are images of about 70,000 handwritten numbers, 0 through 9.)
You divide the dataset into two parts, the training data and the test data. The split might be 70/30, or 80/20. You don’t choose which data goes into which group. Then you run the training data many, many, many times, adjusting certain parameters in the code along the way, until the code consistently returns good results — that is, the thing the code identifies (an object in an image, an iris species, a number) matches the label (which is hidden from the code).
At that point, you have a trained model. You feed the test data set to it and see whether the accuracy rate is also high. (It’s important that none of the test data were used to train the model.) Again, the proof is in the labels.
In a later post I will discuss how data come to be labeled. (Hint: It’s not elves.) In this post, I will discuss bad labels. Specifically, I want to highlight the work that AI researcher Kate Crawford and artist-researcher Trevor Paglen did around the famous ImageNet dataset.
In the video above, Crawford and Paglen present this work and show a lot of great examples. They also published a long article about the work, if you’d rather read than watch.
ImageNet is a huge collection of labeled images. More than 14 million images. They were labeled according to a set of categories and synonym groupings from WordNet, an English-language lexical database. The images were labeled by humans.
And that, it seems, is at the root of the problem.
Crawford and Paglen were interested in the ImageNet photos of people. Person is a category in WordNet. Within the category, there are many descriptive terms for people, such as “cheerleaders, scuba divers, welders, Boy Scouts, fire walkers, and flower girls.” So the photos of people in ImageNet are labeled with these terms. However, not all terms are neutral.
“A young man drinking beer is categorized as an ‘alcoholic, alky, dipsomaniac, boozer, lush, soaker, souse.’ A child wearing sunglasses is classified as a ‘failure, loser, non-starter, unsuccessful person.’”
—Crawford and Paglen
You might say, well, where’s the harm? They are only labels in a database, after all.
The ImageNet database has been used to train many convolutional neural networks used in image-recognition software.
When you feed a photo of yourself into an image-recognition application, you might be surprised at the labels that are applied to you. For example, an image of Paglen (a white man with a shaved head) was labeled as “Klansman, Ku Kluxer.”
Paglen built a web app called ImageNet Roulette so that anyone could upload a photo of themselves or a friend and see what labels were applied. (The app is no longer online.) It became clear that perfectly innocuous people in photos were being labeled as criminals or dangerous, or with racist or sexist terms.
About 952,000 of ImageNet’s 14 million images were in the person category as of 2010 (source). Many of those images — with their labels — were removed after the opening of Crawford and Paglen’s art exhibition, Training Humans, in Milan in September 2019.
ImageNet has been used to train countless image-recognition systems since 2010.
Different AI systems do different things when they attempt to identify humans. Everyone has heard about face recognition (a k a facial recognition), which you might expect would return a name and other personal data about a person whose face is “seen” with a camera.
No, not always.
A system that analyzes human faces might simply try to return information about the person that you or I would tag in our minds when we see a stranger. The person’s gender, for example. That’s relatively easy to do most of the time for most humans — but it turns out to be tricky for machines.
Machines often get it wrong when trying to identify the gender of a trans person. But machines also misidentify the gender of people of color. In particular, they have a big problem recognizing Black women as women.
A short and good article about this ran in Time magazine in 2019, and the accompanying video is well worth watching. It shows various face recognition software systems at work.
Another serious problem concerns differentiating among people of Asian descent. When apartment buildings and other housing developments have installed face recognition as a security system — to open for residents and stay locked for others — the Asian residents can find themselves locked out of their own home. The doors can also open for Asian people who don’t live there.
You can find a lot of articles about this widespread and very serious problem with AI technology, including the deservedly famous mug shots test by the American Civil Liberties Union.
“While it is usually incorrect to make statements across algorithms, we found empirical evidence for the existence of demographic differentials in the majority of the face recognition algorithms we studied.”
—Patrick Grother, NIST computer scientist
So how does this happen? How do companies with almost infinite resources deploy products that are so seriously — and even dangerously — flawed?
Yesterday I wrote a little about training data for object-detection AI. To identify any image, or any part of an image, an AI system is usually trained on an immense set of images. If you want to identify human faces, you feed the system hundreds of thousands, or even millions, of pictures of human faces. If you’re using supervised learning to train the system, the images are labeled: Man, woman. Black, white. Old, young. Convicted criminal. Sex offender. Psychopath.
Who is in the images? How are those images labeled?
This is part of how the whole thing goes sideways. There’s more to it, though. Before a system is marketed, or released to the public, its developers are going to test it. They’re going to test the hell out of it. This can be compared with when an AI is developed that plays a particular game, like Go, or chess. After the system has been trained, you test it. To test the system, you’re going to have it play, and see if it can win — consistently. So when developers create a face recognition system, and they’ve tested it extensively, and they say, great, now it’s ready for the public, it’s ready for commercial use — ask yourself how they missed these glaring flaws.
Ask yourself how they missed the fact that the system can’t differentiate between various Asian faces.
Ask yourself how they missed the fact that the system identifies Black women as men.
Fortunately, in just the past year these flaws have received so much attention that a number of large firms (Amazon, IBM, Microsoft) have pulled back on commercial deployments of face recognition technologies. Whether they will be able to build more trustworthy systems remains to be seen.
Timnit Gebru: Machine learning, bias, and product design (March 2018) — although this interview is older, I like how Gebru speaks about design: “We need more people who think about design working in AI, because oftentimes what’s happening is little things. … think about Siri or Alexa or these personal assistants who are women — what does that do to society, just portraying that stereotype?”
If a computer can correctly identify an object (an apple, a tricycle) or an animal such as a zebra, can it produce a drawing of that object or animal? This is something most people can do, even if their drawing skills are minimal. After all, almost anyone can play Pictionary.
This 8-minute video shows us what happened when a programmer-artist reversed the process of an AI that recognizes objects and animals in digital images. I really admire the deft storytelling here.
Object recognition has improved amazingly in the past 10 years, but that does not mean these AI systems see the same way as a human does. In some cases, that might not matter at all. In other cases, it can mean the difference between life and death.
In yesterday’s post I mentioned the way a convolutional neural network (part of a machine learning system) processes an image through many stacked layers of detection units (sometimes called neurons), identifying edges and shapes that eventually lead to a conclusion that the image is likely to contain such-and-such an object, animal, or person. Today’s video shows a bit more about the training process that an AI goes through before it can perform these identifications.
Training is necessary in the type of machine learning called supervised learning. The training data (in this case, digital images of objects and animals) must be labeled in advance. That is, the system receives thousands of images labeled “tiger” before it is able to recognize a tiger in a random photo or video. If a system can identify 20 different animals, that system was trained on thousands of images of each animal.
If the system was never trained on tigers, it cannot recognize a tiger.
So today’s video gives us a nice glimpse into how and why that training works, and what its limitations are. What’s really fascinating to me, though, are the images produced by programmer-artist Tom White‘s system.
“I have created a drawing system that allows neural networks to produce abstract ink prints that reveal their visual concepts. Surprisingly, these prints are recognized not only by the neural networks that created them, but also universally across most AI systems which have been trained to recognize the same objects.”
In the video, you’ll see that humans cannot recognize what the AI drew. The rendering is too abstract, too unlike what we see and what we would draw ourselves. Note what White says, though, about other AI systems: they can recognize the object in these AI-produced drawings.
This is, I think, related to what is called adversarial AI, which I’ll discuss in a future post.
I am fascinated by image recognition. I read about how ImageNet changed the whole universe of machine “vision” in 2009 in the excellent book Artificial Intelligence: A Guide for Thinking Humans, but I’m not going to discuss ImageNet in this post. (I will get to it eventually.)
To think about how a machine sees requires us first to think about human eyes vs. cameras. The machine doesn’t have a biological eyeball and an optic nerve and a brain. The machine might have one or more cameras to allow it to take in visual information.
Whether the machine has cameras or not, the images it receives are the same: digital images, made up entirely of pixels. This is true even if the visual inputs are video. The machine will need to sample that video, taking discrete frames from it to process and analyze.
So the first thing to absorb, as you begin to understand how a machine sees, is that it receives a grid of pixels. If it’s video, then there are a lot of separate grids. If it’s one still image, there is one grid. And how does the machine process that grid? It analyzes the differences between groups of pixels.
This 4-minute video, from an artist and programmer named Gene Kogan, helped me a lot.
Most people have an idea (possibly vague) of how the human brain works, with neurons kind of “wired together” in a network. When we imagine a computer neural network, most of us probably factor in that mental image of a brain full of neurons. This is both semi-accurate and wildly inaccurate.
In his video, Kogan points out that an image-recognition system uses a convolutional neural network, and this network has many, many layers.
When he’s clicking down the list in his video, Kogan is showing us what the different layers are “paying attention to” as the video is continuously chopped into one-frame segments. The mind-blowing thing (to me) is that the layers feed forward and backward to each other — ultimately producing the result he shows near the end, when he can hold a water bottle in front of his webcam, and the software says it sees a water bottle.
Notice too, that “water bottle” is the machine’s top guess at that moment. Its number 2 guess is “bow tie.” Its confidence in “water bottle” is not very high, as shown by the red bar to the left of the label. However, the machine’s confidence in “water bottle” is much higher than all the other things it determines it might be seeing in that frame.
After watching this video, I understood why super-fast graphics-processing hardware is so important to image recognition and machine vision.
In tomorrow’s post, I’m going to say a bit more about these ideas and share a completely different video that also helped me a lot in my attempt to understand how machines see.