I had not been all that interested to learn about perceptrons, even though the perceptron is known as an ancestor of present-day machine learning.
That changed when I read an account that said the big names in AI in the 1960s were convinced that symbolic AI was the road to glory — and their misplaced confidence smothered the development of the first systems that learned and modified their own code.
Symbolic AI is built with strictly programmed rules. It is also known as “good old-fashioned AI,” or GOFAI, and the main applications you can produce with it are expert systems.
The original perceptron was conceived and programmed by Frank Rosenblatt, who earned his Ph.D. in 1956. A huge IBM computer running his code was touted by the U.S. Office of Naval Research in 1958 as “capable of receiving, recognizing and identifying its surroundings without any human training or control,” according to a New York Times article published on July 8, 1958. That was hype, but the perceptron actually did receive visual information from the environment and learn from it, in much the same way as today’s ML systems do.
“At the time, he didn’t know how to train networks with multiple layers. But in hindsight, his algorithm is still fundamental to how we’re training deep networks today.”
After leading AI researchers Marvin Minsky and Seymour Papert, both of MIT, published a book in 1969 that essentially said perceptrons were a dead end, all the attention — and pretty much all the funding — went to symbolic AI projects and research. Symbolic AI was the real dead end, but it took 50 years for that truth to be fully accepted.
Frank Rosenblatt died in a boating accident on his 43rd birthday, according to his obituary in The New York Times. It was 1971. Had he lived, he might have trained dozens of AI researchers who could have gone on to change the field much sooner.
Descriptions of machine learning are often centered on training a model. Not having a background in math or statistics, I was puzzled by this the first time I encountered it. What is the model?
This 10-minute video first describes how you select labeled data for training. You examine the features in the data, so you know what’s available to you (such as color and alcohol content of beers and wines). Then the next step is choosing the model that you will train.
In the video, Yufeng Guo chooses a small linear model without much explanation as to why. For those of us with an impoverished math background, this choice is completely mysterious. (Guo does point out that some models are better suited for image data, while others might be better suited for text data, and so on.) But wait, there’s help. You can read various short or long explanations about the kinds of models available.
It’s important for the outsider to grasp that this is all code. The model is an algorithm, or a set of algorithms (not a graph). But this is not the final model. This is a model you will train, using the data.
What are you doing while training? You are — or rather, the system is — adjusting numbers known as weights and biases. At the outset, these numbers are randomly selected. They have no meaning and no reason for being the numbers they are. As the data go into the algorithm, the weights and biases are used with the data to produce a result, a prediction. Early predictions are bad. Wine is called beer, and beer is called wine.
The output (the prediction) is compared to the “correct answer” (it is wine, or it is beer). The weights and biases are adjusted by the system. The predictions get better as the training data are run again and again and again. Running all the data through the system once is called an epoch. In the simplest setup, the weights and biases are not adjusted until after all the data have run through once (in practice, many systems adjust them after each small batch of examples within an epoch). Then the adjustment. Then run the data again. Epoch 2: adjust, repeat. Many epochs are required before the predictions become good.
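For readers who want to see that loop as code, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the eight data points are invented, the model is a single-unit logistic classifier (not the one from the video), and the names `predict` and `train` are my own. It follows the simplest picture, with one adjustment per epoch.

```python
import math
import random

# Invented toy data: (color darkness 0..1, alcohol fraction 0..1) -> label.
# Label 1 means wine, 0 means beer. The numbers are made up for illustration.
DATA = [
    ((0.80, 0.13), 1), ((0.90, 0.14), 1), ((0.70, 0.12), 1), ((0.85, 0.135), 1),
    ((0.30, 0.05), 0), ((0.20, 0.04), 0), ((0.40, 0.06), 0), ((0.25, 0.045), 0),
]

def predict(weights, bias, features):
    """Return a number between 0 and 1: the model's estimate that this is wine."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def train(data, epochs=2000, learning_rate=1.0, seed=0):
    rng = random.Random(seed)
    # Weights and bias start out as meaningless random numbers.
    weights = [rng.uniform(-1, 1) for _ in range(2)]
    bias = rng.uniform(-1, 1)
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        # One epoch: run every example through and compare with its label.
        for features, label in data:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        # Adjust once, after the whole pass (full-batch gradient descent).
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

weights, bias = train(DATA)
correct = sum((predict(weights, bias, f) >= 0.5) == bool(label) for f, label in DATA)
print(f"{correct} of {len(DATA)} training examples classified correctly")
```

The only things the loop ever changes are `weights` and `bias`. The code that uses them stays the same from the first epoch to the last.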
After the predictions are good for the training data, it’s time to evaluate the model using data that were set aside and not used for training. These “test data” (or “evaluation data”) have never run through the system before.
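Setting data aside can be sketched in a few lines of Python. The examples, the 20 percent test fraction, and the helper names `split_train_test` and `accuracy` are all assumptions for illustration, and `toy_model` is a hand-written rule standing in for a trained model.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and set a fraction aside for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, examples):
    """Fraction of examples for which the model's answer matches the label."""
    hits = sum(model(features) == label for features, label in examples)
    return hits / len(examples)

# Invented examples: (color darkness, alcohol fraction) -> "wine" or "beer".
examples = [
    ((0.1 * i, 0.04 + 0.01 * i), "wine" if i >= 5 else "beer")
    for i in range(10)
]

train_set, test_set = split_train_test(examples)

def toy_model(features):
    """Stand-in for a trained model: a hand-written rule, not a learned one."""
    color, _alcohol = features
    return "wine" if color >= 0.45 else "beer"

print("training examples:", len(train_set))
print("held-out accuracy:", accuracy(toy_model, test_set))
```

The point of the split is that nothing in `test_set` is ever shown to the system during training, so the score on it says something about data the model has not seen.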
The results from the evaluation using the test data can be used to further fine-tune the system, which is done by the programmers, not by the code. This is called adjusting the hyperparameters and affects the learning process (e.g., how fast it runs; how the weights are initialized). These adjustments have been called “a ‘black art’ that requires expert experience, unwritten rules of thumb, or sometimes brute-force search” (Snoek et al., 2012).
And now, what you have is a trained model. This model is ready to be used on data similar to the data it was trained on. Say it’s a model for machine vision that’s part of a robot assembling cars in a factory — it’s ready to go into all the robots in all the car factories. It will see what it has been trained to see and send its prediction along to another system that turns the screw or welds the door or — whatever.
And it’s still just — code. It can be copied and sent to another computer, uploaded and downloaded, and further modified.
For Friday AI Fun, let’s look at an oldie but goodie: Google’s Quick, Draw!
You are given a word, such as whale, or bandage, and then you need to draw that in 20 seconds or less.
Thanks to this game, Google has labeled data for 50 million drawings made by humans. The drawings “taught” the system what people draw to represent those words. Now the system uses that “knowledge” to tell you what you are drawing — really fast! Often it identifies your subject before you finish.
It is possible to stump the system, even though you’re trying to draw what it asked for. My drawing of a sleeping bag is apparently an outlier. My drawings of the Mona Lisa and a rhinoceros were good enough — although I doubt any human would have named them as such!
Google’s AI thought my sleeping bag might be a shoe, or a steak, or a wine bottle.
The system has “learned” to identify only 345 specific things. These are called its categories.
You can look at the data the system has stored — for example, here are a lot of drawings of beard.
I doubt I will ever program a neural network, but I’m trying to understand how they work — and how they are trained — well enough to make assumptions about how the systems work. What I want to be able to do is raise questions when I hear about a new-to-me AI system. I don’t want to take it on faith that a system is safe and likely to function well.
Ultimately I want to help my journalism and communications students understand this too.
Last week I discussed here a video about how neural networks work. Some time before I found that video, I had watched this one a couple of times. It’s from 2015 and it’s only 6 minutes long. It’s been viewed on YouTube more than 9 million times. In fact, it’s pretty close to 10 million views!
Video game designer Seth Bling demonstrates a fully trained neural network that plays Mario expertly. Then he shows us how the system looks at the start, when the Mario character just stands in one place and dies every time. This is the untrained neural network, when it “knows” nothing.
Unlike the example in my earlier post — where the input to the neural network was an image of a handwritten number, and the output was the number (thereby “reading” the image) — here the input is the game state, which changes by the split second. The game state is a simplified digital representation of the Mario character, the surfaces he can run on or jump to, and any obstacles or rewards that are present. The output is which button should be pressed — holding down right continuously makes Mario run toward the right without stopping.
So the output layer in this neural network is the set of all possible actions Mario can take. For a human playing the game, these would be the buttons on the game controller.
In the training, Mario has a “fitness level,” which is a number. When Mario is dying all the time, that number stays around 2. When Mario reaches the end of the level without dying (but without scoring extra points), his fitness is 528. So by “looking at” the fitness level, the neural net assesses success. If the number has increased, then keep doing the same thing.
“The more lines and neurons you have, the more nuanced the decisions can be.”
Of course there are more actions than only moving right. Training the neural net to make Mario jump and perform more actions required many generations of neural nets, and only the best-performing ones were selected for the next generation. After 34 generations, the fitness level reached 4,000.
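The idea of keeping only the best performers from each generation can be sketched in Python. This is not the code from the video and it is not NEAT: here each candidate is just a short list of numbers, and the `fitness` function, the `TARGET` values, and the population sizes are invented for illustration.

```python
import random

rng = random.Random(1)

# Invented "ideal" settings; the search only ever sees the fitness score.
TARGET = [0.5, -0.2, 0.8]

def fitness(genome):
    """Higher is better. Stands in for 'how far Mario got' (invented)."""
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def mutate(genome, scale=0.1):
    """Return a copy of a genome with small random changes."""
    return [g + rng.gauss(0, scale) for g in genome]

def evolve(generations=34, population_size=20, survivors=5):
    population = [
        [rng.uniform(-1, 1) for _ in TARGET] for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        parents = ranked[:survivors]  # only the best performers go on
        # Keep the parents unchanged and fill the rest with mutated copies.
        children = [
            mutate(rng.choice(parents))
            for _ in range(population_size - survivors)
        ]
        population = parents + children
    return best_per_generation

history = evolve()
print("best fitness, first generation:", history[0])
print("best fitness, last generation:", history[-1])
```

Because the best candidates are carried over unchanged, the best fitness in this sketch can never go down from one generation to the next.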
One thing I especially like about this video is the simultaneous visual of real Mario running in the real game level, along with a representation of the neural net showing its pathways in green and red. There is no code and no math in this video, and so while watching it, you are only thinking about how the connections come to be made and reinforced.
The method used is called NeuroEvolution of Augmenting Topologies (NEAT), which I’ve read almost nothing about — but apparently it enables the neural net to grow itself, essentially. Which is kind of mind blowing.
I’m interested in applications of machine learning in journalism. This is natural, as my field is journalism. In the field of computer science, however, accolades and honors tend to favor research on new algorithms or procedures, or new network architectures. Applications are practical uses of algorithms, networks, etc., to solve real-world problems — and developing them often doesn’t garner the acclaim that researchers need to advance their careers.
“The first image of a black hole was produced using machine learning. The most accurate predictions of protein structures, an important step for drug discovery, are made using machine learning.”
Noting that applications of machine learning are making real contributions to science in fields outside computer science, Kerner (who works on machine learning solutions for NASA’s food security and agriculture program) asks how much is lost because of the priorities set by the journals and conferences in the machine learning field.
She also ties this focus on ML research for the sake of advancing ML to the seepage of bias out from widely used datasets into the mainstream — the most famous cases being in face recognition, with systems (machine learning models) built on flawed datasets that disproportionately skew toward white and male faces.
“When studies on real-world applications of machine learning are excluded from the mainstream, it’s difficult for researchers to see the impact of their biased models, making it far less likely that they will work to solve these problems.”
Machine learning is rarely plug-and-play. In creating an application that will be used to perform useful work — to make new discoveries, perhaps, or to make medical diagnoses more accurate — the machine learning researchers will do substantial new work, even when they use existing models. Just think, for a moment, about the data needed to produce an image of a black hole. Then think about the data needed to make predictions of protein structures. You’re not going to handle those in exactly the same way.
I imagine the work is quite demanding when a number of non–ML experts (say, the biologists who work on protein structures) get together with a bunch of ML experts. But either group working separately from the other is unlikely to come up with a robust new ML application. Kerner linked to this 2018 news report about a flawed cancer-detection system — leaked documents said that “instead of feeding real patient data into the software,” the system was trained on data about hypothetical patients. (OMG, I thought — you can’t train a system on fake data and then use it on real people!)
Judging from what Kerner has written, machine learning researchers might be caught in a loop, where they work on pristine and long-used datasets (instead of dirty, chaotic real-world data) to perfect speed and efficiency of algorithms that perhaps become less adaptable in the process.
It’s not that applications aren’t getting made — they are. The difficulty lies in the priorities for research, which might dissuade early-career ML researchers in particular from work on solving interesting and even vital real-world problems — and wrestling with the problems posed by messy real-world data.
I was reminded of something I’ve often heard from data journalists: If you’re taught by a statistics professor, you’ll be given pre-cleaned datasets to work with. (The reason being: She just wants you to learn statistics.) If you’re taught by a journalist, you’ll be given real dirty data, and the first step will be learning how to clean it properly — because that’s what you have to do with real data and a real problem.
So the next time you read about some breakthrough in machine learning, consider whether it is part of a practical application, or instead, more of a laboratory experiment performed in isolation, using a tried-and-true dataset instead of wild data.
The most wonderful thing about YouTube is you can use it to learn just about anything.
One of the 10,000 annoying things about YouTube is that finding a good, satisfying version of the lesson you want to learn can take hours of searching. This is especially true of videos about technical aspects of machine learning. Of course there are one- and two-hour recordings of course lectures by computer science professors. But I’ve been seeking out shorter videos with more animations and illustrations of concepts.
Understanding what a neural network is and how it processes data is necessary to demystifying machine learning. Data goes in, results come out — but in between is a “black box” consisting of code and hardware. It sort of works like a human brain, and yet, it really doesn’t.
So here at last is a painless, math-free video that walks us through a neural network. The particular example shown uses the MNIST dataset, which consists of 70,000 images of handwritten digits, 0–9. So the task being performed is the recognition of those digits. (This kind of system can be used to sort mail using postal codes, for example.)
What you’ll see is how the first layer (a vertical line of circles on the left side) represents the input. If each of the MNIST images is 28 pixels wide by 28 pixels high, then that first layer has to represent 784 pixels and each of their color values — which is a number. (One image is the input — only one at a time.)
The final vertical layer, all the way to the right side, is the output of the neural network. In this example, the output tells us which digit was in the input: 0, 1, 2, etc. To see the value in this, go back to the mail-sorting idea. If a system can read postal codes, it recognizes several numbers and then transmits them to another system that “knows” which postal code goes to which geographical location. My letter gets sorted into the Florida bin and yours into the bin for your home.
In between the input and the output are the vertical “hidden” layers, and that’s where the real work gets done. In the video you’ll see that the number of circles — often called neurons, but they can also be called just units — in a hidden layer might well be less than the number of units in the input layer. The number of units in the output layer can also differ from the numbers in other layers.
Beautifully, during an animation, our teacher Grant Sanderson explains and shows that the weights exist not in or on the units (the “neurons”) but in fact in or on the connections between the units.
Okay, I lied a little. There is some math shown here. The weight assigned to the connection is multiplied by the value of the unit to the left. The results are all summed, for all left-side units, and that sum is assigned to the unit to the right (meaning the right side of that one connection).
The video bogs down just a bit between the Sigmoid squishification function and applying the bias, but all you really need to grasp is that the value of the right-side unit shows whether or not that little region of the image (in this case, it’s an image) has a significant difference. The math is there to determine if the color, the amount of color, is significant enough to count. And how much it should count.
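If it helps to see the multiply, sum, add-bias, and squish steps written out, here is a minimal sketch in plain Python. The "image" is random numbers rather than a real MNIST digit, the weights and biases are untrained random values in an arbitrarily chosen range, and the function names are my own.

```python
import math
import random

def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Compute the values of one layer of units from the layer to its left.

    weights[j][i] is the weight on the connection from left unit i
    to right unit j; biases[j] belongs to right unit j.
    """
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(unit_weights, inputs))
        outputs.append(sigmoid(weighted_sum + bias))
    return outputs

rng = random.Random(0)
n_inputs, n_hidden = 784, 16  # 28 x 28 pixels in, 16 hidden units

# A made-up "image": 784 brightness values between 0 and 1.
image = [rng.random() for _ in range(n_inputs)]
# Untrained weights and biases (the range is an arbitrary assumption).
weights = [
    [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)
]
biases = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]

hidden = layer_forward(image, weights, biases)
print(len(hidden), "hidden-unit values, each between 0 and 1")
```

Notice that the weights live in a table with one entry per connection, while each unit on the right holds only a single value and a single bias.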
I know — math, right?
But seriously, watch the video. It’s excellent.
“And that’s a lot to think about! With this hidden layer of 16 neurons, that’s a total of 784 times 16 weights, along with 16 biases. And all of that is just the connections from the first layer to the second.”
—Grant Sanderson, But what is a neural network? (video)
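The arithmetic in that quote is easy to check. The sketch below counts weights and biases for any list of layer sizes. The helper name `count_parameters` is my own, and the full list of sizes (784, 16, 16, 10) is an assumption based on the network as I understand it from the video.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases) for a fully connected network.

    Each pair of neighboring layers contributes left * right weights,
    and every unit after the input layer has one bias.
    """
    weights = sum(
        left * right for left, right in zip(layer_sizes, layer_sizes[1:])
    )
    biases = sum(layer_sizes[1:])
    return weights, biases

# First step only: 784 input units connected to 16 hidden units.
print(count_parameters([784, 16]))  # (12544, 16)

# Assumed full network: two hidden layers of 16 units, then 10 output units.
weights, biases = count_parameters([784, 16, 16, 10])
print(weights + biases)  # 13002
```

Even this small network has roughly thirteen thousand adjustable numbers, which is why nobody sets them by hand.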
Sanderson doesn’t burden us with the details of the additional layers. Once you’ve seen the animations for that first step — from the input layer through the connections to the first hidden layer — you’ll have a real appreciation for what’s happening under the hood in a neural network.
In the final 6 minutes of this 19-minute video, you’ll also learn how the “learning” takes place in machine learning when a neural net is involved. All those weights and bias values? They are not determined by humans.
“Digging into what the weights and biases are doing is a good way to challenge your assumptions and really expose the full space of possible solutions.”
—Grant Sanderson, But what is a neural network? (video)
I confess it does get rather mathy at the end, but hang on through the parts that are beyond your personal math background and listen to what Sanderson is telling us. You can get a lot out of it even if the equation itself is like hieroglyphics to you.
The video content ends at 16:26, followed by the usual “subscribe to my channel” message. More info about Sanderson and his excellent videos is on his website, 3Blue1Brown.
Reading course descriptions and degree plans has helped me understand more about the fields of artificial intelligence and data science. I think some universities have whipped up a program in one of these hot fields of study just to put something on the books. It’s quite unfair to students if this is just a collection of existing courses and not a deliberate, well structured path to learning.
I came across this page from Northeastern University that attempts to explain the “difference” between artificial intelligence and machine learning. (I use those quotation marks because machine learning is a subset of artificial intelligence.) The university has two different master’s degree programs for artificial intelligence; neither one has “machine learning” in its name — but read on!
One of the two programs does not require a computer science undergraduate degree. It covers data science, robotics, and machine learning.
The other master’s program is for students who do have a background in computer science. It covers “robotic science and systems, natural language processing, machine learning, and special topics in artificial intelligence.”
I noticed that data science is in the program for those without a computer science background, while it’s not mentioned in the other program. This makes sense if we understand that data science and machine learning really go hand in hand nowadays. A data scientist likely will not develop any new machine learning systems, but she will almost certainly use machine learning to solve some problems. Training in statistics is necessary so that one can select the best algorithm for use in machine learning for solving a particular problem.
Graduates of the other program, with their prior experience in computer science, should be ready to break ground with new and original AI work. They are not going to analyze data for firms and organizations. Instead, they are going to develop new systems that handle data in new ways.
The distinction between these two degree programs highlights a point that perhaps a lot of people don’t yet understand: people (like journalists who have code experience) are training models — using machine learning systems through writing code to control them — and yet they are not people who create new machine learning systems.
Separately there are developers who create new AI software systems, and engineers who create new AI hardware systems. In other words, there are many different roles in the AI field.
Finally, there are so-called AI systems sold to banks and insurance companies, and many other types of firms, for which the people using the system do not write code at all. Using them requires data to be entered, and results are generated (such as whose insurance rates will go up next year). The workers who use these systems don’t write code any more than an accountant writes code. Moreover, they can’t explain how the system works — they need only know what goes in and what comes out.
Continuing my summary of the lessons in Introduction to Machine Learning from the Google News Initiative, today I’m looking at Lesson 5 of 8, “Training your Machine Learning model.” Previous lessons were covered here and here.
Now we get into the real “how it works” details — but still without looking at any code or computer languages.
The “lesson” (actually just a text) covers a common case for news organizations: comment moderation. If you permit people to comment on articles on your site, machine learning can be used to identify offensive comments and flag them so that human editors can review them.
With supervised learning (one of three approaches included in machine learning; see previous post here), you need labeled data. In this case, that means complete comments — real ones — that have already been labeled by humans as offensive or not. You need an equally large number of both kinds of comments. Creating this dataset of comments is discussed more fully in the lesson.
You will also need to choose a machine learning algorithm. Comments are text, obviously, so you’ll select among the existing algorithms that process language (rather than those that handle images and video). There are many from which to choose. As the lesson comes from Google, it suggests you use a Google algorithm.
In all AI courses and training modules I’ve looked at, this step is boiled down to “Here, we’ll use this one,” without providing a comparison of the options available. This is something I would expect an experienced ML practitioner to be able to explain — why are they using X algorithm instead of Y algorithm for this particular job? Certainly there are reasons why one text-analysis algorithm might be better for analyzing comments on news articles than another one.
What is the algorithm doing? It is creating and refining a model. The more accurate the final model is, the better it will be at predicting whether a comment is offensive. Note that the model doesn’t actually know anything. It is a computer’s representation of a “world” of comments in which some — with particular features or attributes perceived in the training data — are rated as offensive, and others — which lack a sufficient quantity of those features or attributes — are rated as not likely to be offensive.
The lesson goes on to discuss false positives and false negatives, which are possibly unavoidable — but the fewer, the better. We especially want to eliminate false negatives, which are offensive comments not flagged by the system.
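Counting these outcomes is simple enough to sketch in Python. The six labels and flags below are invented, and the function name `confusion_counts` is my own; the lesson itself shows no code.

```python
def confusion_counts(true_labels, flagged):
    """Count the four outcomes for a comment-flagging system.

    true_labels[i] is True if humans judged comment i offensive;
    flagged[i] is True if the system flagged it.
    """
    counts = {
        "true_positive": 0, "false_positive": 0,
        "true_negative": 0, "false_negative": 0,
    }
    for is_offensive, was_flagged in zip(true_labels, flagged):
        if was_flagged and is_offensive:
            counts["true_positive"] += 1
        elif was_flagged and not is_offensive:
            counts["false_positive"] += 1  # harmless comment flagged
        elif not was_flagged and is_offensive:
            counts["false_negative"] += 1  # offensive comment missed
        else:
            counts["true_negative"] += 1
    return counts

# Invented example: six comments, human labels vs. system flags.
human_labels = [True, True, False, False, True, False]
system_flags = [True, False, False, True, True, False]

counts = confusion_counts(human_labels, system_flags)
caught = counts["true_positive"] / (
    counts["true_positive"] + counts["false_negative"]
)
print(counts)
print("share of offensive comments caught:", caught)
```

In this made-up run the system misses one of the three offensive comments, which is the kind of error a newsroom would most want to drive down.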
“The most common reason for bias creeping in is when your training data isn’t truly representative of the population that your model is making predictions on.”
—Lesson 6, Bias in Machine Learning
Lesson 6 in the course covers bias in machine learning. A quick way to understand how ML systems come to be biased is to consider the comment-moderation example above. What if the labeled data (real comments) included a lot of comments offensive to women — but all of the labels were created by a team of men, with no women on the team? Surely the men would miss some offensive comments that women team members would have caught. The training data are flawed because a significant number of comments are labeled incorrectly.
There’s a pretty good video attached to this lesson. It’s only 2.5 minutes, and it illustrates interaction bias, latent bias, and selection bias.
Lesson 6 also includes a list of questions you should ask to help you recognize potential bias in your dataset.
It was interesting to me that the lesson omits a discussion of how the accuracy of labels is really just as important as having representative data for training and testing in supervised learning. This issue is covered in ImageNet and labels for data, an earlier post here.
The separation of machine learning into three different approaches — supervised learning, unsupervised learning, and reinforcement learning — is standard (Lesson 3). In keeping with the course’s focus on journalism applications of ML, the example given for supervised learning is The Atlanta Journal-Constitution‘s deservedly famous investigative story about sex abuse of patients by doctors. Supervised learning was used to sort more than 100,000 disciplinary reports on doctors.
The example of unsupervised learning is one I hadn’t seen before. It’s an investigation of short-term rentals (such as Airbnb rentals) in Austin, Texas. The investigator used locality-sensitive hashing (LSH) to group property records in a set of about 1 million documents, looking for instances of tax evasion.
The main example given for reinforcement learning is AlphaGo (previously covered in this blog), but an example from The New York Times — How The New York Times Is Experimenting with Recommendation Algorithms — is also offered. Reinforcement learning is typically applied when a clear “reward” can be identified, which is why it’s useful in training an AI system to play a game (winning the game is a clear reward). It can also be used to train a physical robot to perform specified actions, such as pouring a liquid into a container without spilling any.
Also in Lesson 3, we find a very brief description of deep learning (it doesn’t mention layers and weights), and just a mention of neural networks.
“What you should retain from this lesson is fairly simple: Different problems require different solutions and different ML approaches to be tackled successfully.”
—Lesson 3, Different approaches to Machine Learning
The examples in this lesson are really good, so maybe you should just read it directly. You’ll learn about a variety of unusual stories that could only be told when journalists used machine learning to augment their reporting.
“Machine learning is not magic. You might even say that it can’t do anything you couldn’t do — if you just had a thousand tireless interns working for you.”
Note (added April 4, 2022): The two links above to Quartz AI Studio content have been updated. The original domain, qz-dot-ai, was given up when, at renewal time, the price of all dot-ai domains had skyrocketed. Unfortunately, all the images have been lost, according to a personal communication from Merrill.