Mindy McAdams – Page 3 – AI in Media and Society

Explaining common misconceptions about AI

August 27, 2021August 27, 2021 Mindy McAdams

Sometimes people make a statement that an artificial intelligence system is a computer system that learns, or that learns on its own.

That is inaccurate. Machine learning is a subset of artificial intelligence, not the whole field. Machine learning systems are computer systems that learn from data. Other AI systems do not. Various systems are wholly programmed by humans to follow explicit rules and do not generate any code or instructions on their own.

The error probably arises from the fact that many of the exciting advances in AI since 2012 have involved some form of machine learning.

The recent successes of machine learning have much to do with neural networks, each of which is a system of algorithms that (in some respects) mimics the way neurons work in the brains of humans and other animals — but only in some respects. In other words, a neural network shares some features with human brains, but is not extremely similar to a human brain in all its complexity.

Advances in neural networks have been made possible not only by new algorithms (written by humans) but also by new computer hardware that did not exist in the earlier decades of AI development. The main advance concerns graphical processing units, commonly called GPUs. If you’ve noticed how computer games have evolved from simple flat pixel blocks (e.g. Pac-Man) to vast 3D worlds through which the player can fly or run at high speed, turning in different directions to view vast new landscapes, you can extrapolate how the advanced hardware has increased the speed of processing of graphical information by many orders of magnitude.

Without today’s GPUs, you can’t create a neural network that runs multiple algorithms in parallel fast enough to achieve the amazing things that AI systems have achieved. To be clear, the GPUs are just engines, powering the code that creates a neural network.

More about the role of GPUs in today’s AI: Computational Power and the Social Impact of Artificial Intelligence (2018), by Tim Hwang.

Another reason why AI has leapt onto the public stage recently is Big Data. Headlines alerted us to the existence and importance of Big Data a few years ago, and it’s tied to AI because how else could we process that ginormous quantity of data? If all we were doing with Big Data was adding sums, well, that’s no big deal. What businesses and governments and the military really want from Big Data, though, is insights. Predictions. They want to analyze very, very large datasets and discover information there that helps them control populations, make greater profits, manage assets, etc.

Big Data became available to businesses, governments, the military, etc., because so much that used to be stored on paper is now digital. As the general population embraced digital devices for everyday use (fitness, driving cars, entertainment, social media), we contributed even more data than we ever had before.

Very large language models (an aspect of AI that contributes to Google Translate, automatic subtitles on YouTube videos, and more) are made possible by very, very large collections of text that are necessary to train those models. Something I read recently that made an impression on me: For languages that do not have such extensive text corpuses, it can be difficult or even impossible to train an effective model. The availability of a sufficiently enormous amount of data is a prerequisite for creating much of the AI we hear and read about today.

If you ever wonder where all the data comes from — don’t forget that a lot of it comes from you and me, as we use our digital devices.

Perhaps the biggest misconception about AI is that machines will soon become as intelligent as humans, or even more intelligent than all of us. As a common feature in science fiction books and movies, the idea of a super-intelligent computer or robot holds a rock-solid place in our minds — but not in the real world. Not a single one of the AI systems that have achieved impressive results is actually intelligent in the way humans (even baby humans!) are intelligent.

The difference is that we learn from experience, and we are driven by curiosity and the satisfaction we get from experiencing new things — from not being bored. Every AI system is programmed to perform particular tasks on the data that is fed to it. No AI system can go and find new kinds of data. No AI system even has a desire to do so. If a system is given a new kind of data — say, we feed all of Wikipedia’s text to a face-recognition AI system — it has no capability to produce meaningful outputs from that new kind of input.

AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.

Book notes: Atlas of AI, by Kate Crawford

August 8, 2021August 8, 2021 Mindy McAdams

Published earlier this year by Yale University Press, Atlas of AI carries the subtitle “Power, Politics, and the Planetary Costs of Artificial Intelligence.” This is a remarkably accurate subtitle — or maybe I should say the book fulfills the promise of the subtitle better than many other books do.

Planetary costs are explained in chapter 1, “Earth,” which discusses not only the environment-destroying batteries required by both giant data centers and electric cars but also the immense electrical power requirements of training large language models and others with deep-learning architectures. Extraction is a theme Crawford returns to more than once; here it’s about the extraction of rare earth minerals. Right away we can see in the end notes that this is no breezy “technology of the moment” nonfiction book; the wealth of cited works could feed my curiosity for years of reading.

Photo: Book cover and cat on a porch — *Photo copyright © 2021 Mindy McAdams*

Crawford comes back to the idea of depleting resources in the Coda, titled “Space,” which follows the book’s conclusion. There she discusses the mineral-extraction ambitions of Jeff Bezos (and other billionaires) as they build their own rockets — they don’t want only to fly into space for their own pleasure and amusement; they also want to pillage it like 16th– to 19th–century Europeans pillaged Africa and the Americas.

Politics are a focus in chapter 6, “State,” and in the conclusion, “Power” — politics not of any political party or platform but rather the politics of domination, of capitalism, of the massive financial resources of Bezos and Silicon Valley. Crawford has done a great job of laying the groundwork for these final chapters without stating the same arguments in the earlier chapters, which is a big peeve of mine when reading many books about the progress of technologies — that is, the author has told me the same thing so many times before the conclusion that I am already bored with the ideas. That’s not what happened here.

Chapter 2, “Labor,” focuses on low pay, surveillance of workers, deskilling, and time in particular. It’s a bit of “how the sausage gets made,” which is nothing new to me because I’ve been interested for a while already in how data gets labeled by a distributed global workforce. I like how Crawford frames it, in part, as not being about robots who will take our skilled jobs — in fact, that tired old trope is ignored in this book. The more real concern is that like the minerals being extracted to feed the growing AI industrial complex, the labor of many, many humans is required to enable the AI industrial complex to function. Workers’ time at work is increasingly monitored down to the second, and using analysis of massive datasets, companies such as Amazon can track and penalize anyone whose output falls below the optimum. The practice of “faking AI” with human labor is likened to Potemkin villages (see Sadowski, 2018), and we should think about how many of those so-called AI-powered customer service systems (and even decision-support systems) are really “Potemkin AI.” (See also “The Automation Charade”: Taylor, 2018.) Crawford reminds us of the decades of time-and-motion research aimed at getting more value out of workers in factories and fast-food restaurants. This is a particularly rich chapter.

“Ultimately, ‘data’ has become a bloodless word; it disguises both its material origins and its ends.”
—Crawford, p. 113

In “Data,” the third chapter, Crawford looks at where images of faces have come from — the raw material of face recognition systems. Mug shots, of course, but also scraping all those family photos that moms and dads have posted to social media platforms. This goes beyond face recognition and on to all the data about us that is collected or scraped or bought and sold by the tech firms that build and profit from the AI that uses it as training data to develop systems that in turn can be used to monitor us and our lives. Once again, we’re looking at extraction. Crawford doesn’t discuss ImageNet as much as I expected here (which is fine; it comes around again in the next chapter). She covers the collection of voice data and the quantities of text needed to train large language models, detailing some earlier (1980s–90s) NLP efforts with which I was not familiar. In the section subheaded “The End of Consent,” Crawford covers various cases of the unauthorized capture or collection of people’s faces and images — it got me thinking about how the tech firms never ask permission, and there is no informed consent. Another disturbing point about datasets and the AI systems that consume them: Researchers might brush off criticism by saying they don’t know how their work will be used. (This and similar ethical concerns were detailed in a wonderful New Yorker article earlier this year.)

I’m not sure whether chapter 3 is the first time she mention the commons, but she does, and it will come up again. Even though the publicly available data remains available, she says the collection and mining and classification of public data centers the value of it in private hands. It’s not literally enclosure, but it’s as good as, she argues.

“Every dataset … contains a worldview.”
—Crawford, p. 135

Chapter 4, “Classification,” is very much about power. When you name a thing, you have power over it. When you assign labels to the items in a dataset, you exclude possible interpretations at the same time. Labeling images for race, ethnicity, or gender is as dangerous as labeling human skulls for phrenology. The ground truth is constructed, not pristine, and never free of biases. Here Crawford talks more about ImageNet and the language data, WordNet, on which it was built. I made a margin note here: “boundaries, boxes, centers/margins.” At the end of the chapter, Crawford points out that we can examine training datasets when they are made public, like the UTKFace dataset — but the datasets underlying systems being used on us today by Facebook, TikTok, Google, and Baidu are proprietary and therefore not open to scrutiny.

The chapter I enjoyed most was “Affect,” chapter 5, because it covers lots of unfamiliar territory. A researcher named Paul Ekman (apparently widely known, but unknown to me) figures prominently in the story of how psychologists and others came to believe we can discern a person’s feelings and emotions from the expression on their face. At first you think, yes, that makes sense. But then you learn about how people were asked to “perform” an expression of happiness, or sadness, or fear, etc., and then photographs were made of them pulling those expressions. Based on such photos, machine learning models have been trained. Uh-oh! Yes, you see where this goes. But it gets worse. Based on your facial expression, you might be tagged as a potential shoplifter in a store. Or as a terrorist about to board a plane. “Affect recognition is being built into several facial recognition platforms,” we learn on page 153. Guess where early funding for this research came from? The U.S. Advanced Research Projects Agency (ARPA), back in the 1960s. Now called Defense Advanced Research Projects Agency (DARPA), this agency gets massive funding for research on ways to spy on and undermine the governments of other countries. Classifying types of facial expressions? Just think about it.

In chapter 6, “State,” which I’ve already mentioned, Crawford reminds us that what starts out as expensive, top-secret, high-end military technology later migrates to state and governments and local police for use against our own citizens. Much of this has to do with surveillance, and of course Edward Snowden and his leaked files are mentioned more than once. The ideas of threats and targets are discussed. We recall the chapter about classification. Crawford also brings up the paradox that huge multinationals (Amazon, Apple, Facebook, Google, IBM, Microsoft) suddenly transform into patriotic all–American firms when it comes to developing top-secret surveillance tech that we would not want to share with China, Iran, or Russia. Riiight. There’s a description of the DoD’s Project Maven (which Wired magazine covered in 2018), anchoring a discussion of drone targets. This chapter alerted me to an article titled “Algorithmic warfare and the reinvention of accuracy” (Suchman, 2020). The chapter also includes a long section about Palantir, one of the more creepy data/surveillance/intelligence companies (subject of a long Vox article in 2020). Lots about refugees, ICE, etc., in this chapter. Ring doorbell surveillance. Social credit scores — and not in China! It boils down to domestic eye-in-the-sky stuff, countries tracking their own citizens under the guise of safety and order but in fact setting up ways to deprive the poorest and most vulnerable people even further.

This book is short, only 244 pages before the end notes and reference list — but it’s very well thought-out and well focused. I wish more books about technology topics were this good, with real value in each chapter and a comprehensive conclusion at the end that brings it all together. Also — awesome references! I applaud all the research assistants!

Intro to Machine Learning course

August 1, 2021 Mindy McAdams

A couple of days ago, I wrote about Kaggle’s free introductory Python course. Then I started the next free course in the series: Intro to Machine Learning. The course consists of seven modules; the final module, like the last module in the Python course, shows you how to enter a Kaggle competition using the skills from the course.

The first module, “How Models Work,” begins with a simple decision tree, which is nice because (I think) everyone can grasp how that works, and how you add complexity to the tree to get more accurate answers. The dataset is housing data from Melbourne, Australia; it includes the type of housing unit, the number of bedrooms, and most important, the selling price (and other data too). The data have already been cleaned.

In the second module, we load the Python Pandas library and the Melbourne CSV file. We call one basic statistics function that is built into Pandas — describe() — and get a quick explanation of the output: count, mean, std (standard deviation), min, max, and the three quartiles: 25%, 50% (median), 75%.

When you do the exercise for the module, you can copy and paste the code from the lesson into the learner’s notebook.

The third module, “Your First Machine Learning Model,” introduces the Pandas columns attribute for the dataframe and shows us how to make a subset of column headings — thus excluding any data we don’t need to analyze. We use the dropna() method to eliminate rows that have missing data (this is not explained). Then we set the prediction target (y) — here it will be the Price column from the housing data. This should make sense to the learner, given the earlier illustration of the small decision tree.

y = df.Price

We use the previously created list of selected column headings (named features) to create X, the features of each house that will go into the decision tree model (such as the number of rooms, and the size of the lot).

X = df[features]

Then we build a model using Python’s scikit-learn library. Up to now, this will all be familiar to anyone who’s had an intro-to-Pandas course, particularly if the focus was data science or data journalism. I do like the list of steps given (building and using a model):

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like.
Evaluate: Determine how accurate the model’s predictions are. (List quoted from Kaggle course.)

Since fit() and predict() are commands in scikit-learn, it begins to look like machine learning is just a walk in the park! And since we are fitting and predicting on the same data, the predictions are perfect! Never fear, that bubble will burst in module 4, “Model Validation,” in which the standard practice of splitting your data into a training set and a test set is explained.

First, though, we learn about predictive accuracy. Out of all the various metrics for summarizing model quality, we will use one called Mean Absolute Error (MAE). This is explained nicely using the housing prices, which is what we are attempting to predict: If the house sold for $150,000 and we predicted it would sell for $100,000, then the error is $150,000 minus $100,000, or $50,000. The function for MAE sums up all the errors and returns the mean.

This is where the lesson says, “Uh-oh! We need to split our data!” We use scikit-learn’s train_test_split() method, and all is well.

MAE shows us our model is pretty much crap, though. In the fifth module, “Underfitting and Overfitting,” we get a good explanation of the title topic and learn how to limit the number of leaf nodes at the end of our decision tree — DecisionTreeRegressor(max_leaf_nodes).

After all that, our model’s predictions are still crap — because a decision tree model is “not very sophisticated by modern machine learning standards,” the module text drolly explains. That leads us to the sixth module, “Random Forests,” which is nice for two reasons: (1) The explanation of a random forest model should make sense to most learners who have worked through the previous modules; and (2) We get to see that using a different model from scikit-learn is as simple as changing

my_model = DecisionTreeRegressor(random_state=1)

my_model = RandomForestRegressor(random_state=1)

Overall I found this a helpful course, and I think a lot of beginners could benefit from taking it — depending on their prior level of understanding. I would assume at least a familiarity with datasets as CSV files and a bit more than beginner-level Python knowledge.

Free courses at Kaggle

July 30, 2021 Mindy McAdams

I recently found out that Kaggle has a set of free courses for learning AI skills.

The first course is an introduction to Python, and these are the course modules:

Hello, Python: A quick introduction to Python syntax, variable assignment, and numbers
Functions and Getting Help: Calling functions and defining our own, and using Python’s builtin documentation
Booleans and Conditionals: Using booleans for branching logic
Lists: Lists and the things you can do with them. Includes indexing, slicing and mutating
Loops and List Comprehensions: For and while loops, and a much-loved Python feature: list comprehensions
Strings and Dictionaries: Working with strings and dictionaries, two fundamental Python data types
Working with External Libraries: Imports, operator overloading, and survival tips for venturing into the world of external libraries

Even though I’m an intermediate Python coder, I skimmed all the materials and completed the seven problem sets to see how they are teaching Python. The problems were challenging but reasonable, but the module on functions is not going to suffice for anyone who has little prior experience with programming languages. I see this in a lot of so-called introductory materials — functions are glossed over with some ready-made examples, and then learners have no clue how returns work, or arguments, etc.

At the end of the course, the learner is encouraged to join a Kaggle competition using the Titanic passengers dataset. However, the learner is hardly prepared to analyze the Titanic data at this point, so really this is just an introduction to how to use files provided in a competition, name your notebook, save your work, and submit multiple attempts. The tutorial gives you all the code to run a basic model with the data, so it’s really more a demo than a tutorial.

My main interest is in the machine learning course, which I’ll begin looking at today.

Rules and ethics for use of AI by governments

June 19, 2021June 19, 2021 Mindy McAdams

The governments of British Columbia and Yukon, in Canada, have jointly issued a report (June 2021) about ethical use of AI in the public sector. It’s interesting to me as it covers issues of privacy and fairness, and in particular, the rights of people to question decisions derived from AI systems. The report notes that the public increasingly expects services provided by governments to be as fast and as personalized as services provided by online platforms such as Amazon — and this leads or will lead to increasing adoption of AI systems to aid in delivery of government services to members of the public.

The report’s concluding recommendations (pages 47–48) cover eight points (edited):

Establish guiding principles for AI use: “Each public authority should make a public commitment to guiding principles for the use of AI that incorporate transparency, accountability, legality, procedural fairness and protection of privacy.”
Inform the public: “If an ADS [automated decision system] is used to make a decision about an individual, public authorities must notify and describe how that system operates to the individual in a way that is understandable.”
Provide human accountability: “Identify individuals within the public authority who are responsible for engineering, maintaining, and overseeing the design, operation, testing and updating of any ADS.”
Ensure that auditing and transparency are possible: “All ADS should include robust and open auditing functionality with enhanced transparency measures for closed-source, proprietary datasets used to develop and update any ADS.”
Protect privacy of individuals: “Wherever possible, public authorities should use synthetic or de-identified data in any ADS.” See synthetic data definition, below.
Build capacity and increase education (for understanding of AI): This point covers “public education initiatives to improve general knowledge of the impact of AI and other emerging technologies on the public, on organizations that serve the public,” etc.; “subject-matter knowledge and expertise on AI across government ministries”; “knowledge sharing and expertise between government and AI developers and vendors”; development of “open-source, high-quality data sets for training and testing ADS”; “ongoing training of ADS administrators” within government agencies.
Amend privacy legislation to include: “an Artificial Intelligence Fairness and Privacy Impact Assessment for all existing and future AI programs”; “the right to notification that ADS is used, an explanation of the reasons and criteria used, and the ability to object to the use of ADS”; “explicit inclusion of service providers to the same obligations as public authorities”; “stronger enforcement powers in both the public and private sector …”; “special rules or restrictions for the processing of highly sensitive information by ADS”; “shorter legislative review periods of 4 years.”
Review legislation to make sure “oversight bodies are able to review AIFPIAs [see item 7 above] and conduct investigations regarding the use of ADS alone or in collaboration with other oversight bodies.”

Synthetic data is defined (on page 51) as: “A type of anonymized data used as a filter for information that would otherwise compromise the confidentiality of certain aspects of data. Personal information is removed by a process of synthesis, ensuring the data retains its statistical significance. To create synthetic data, techniques from both the fields of cryptography and statistics are used to render data safe against current re-identification attacks.”

The report uses the term automated decision systems (ADS) in view of the Government of Canada’s Directive on Automated Decision Making, which defines them as: “Any technology that either assists or replaces the judgement of human decision-makers.”

Book notes: Hello World, by Hannah Fry

June 5, 2021 Mindy McAdams

I finished reading this book back in April, and I’d like to revisit it before I read a couple of new books I just got. This was published in 2018, but that’s no detriment. The author, Hannah Fry, is a “mathematician, science presenter and all-round badass,” according to her website. She’s also a professor at University College London. Her bio at UCL says: “She was trained as a mathematician with a first degree in mathematics and theoretical physics, followed by a PhD in fluid dynamics.”

The complete title, Hello World: Being Human in the Age of Algorithms, doesn’t sound like this is a book about artificial intelligence. She refers to control, and “the boundary between controller and controlled,” from the very first pages, and this reflects the link between “just” talking about algorithms and talking about AI. Software is made of algorithms, and AI is made of software, so there we go.

In just over 200 pages and seven chapters simply titled Power, Data, Justice, Medicine, Cars, Crime, and Art, this author organizes primary areas of concern for the question of “Are we in control?” and provides examples in each area.

Power. I felt disappointed when I saw this chapter starts with Deep Blue beating world chess champion Garry Kasparov in 1997 — but my spirits soon lifted as I saw how she framed this example as the way we perceive a computer system affects how we interact with it (shades of Sherry Turkle and Reeves & Nass). She discusses machine learning and image recognition here, briefly. She talks about people trusting GPS map directions and search engines. She explains a 2012 ACLU lawsuit involving Medicaid assistance, bad code, and unwarranted trust in code. Intuition tells us when something seems “off,” and that’s a critical difference between us and the machines.

Algorithms “are what makes computer science an actual science.”
—Hannah Fry, p. 8

Data. Sensibly, this chapter begins with Facebook and the devil’s bargain most of us have made in giving away our personal information. Fry talks about the first customer loyalty cards at supermarkets. The pregnant teenager/Target story is told. In explaining how data brokers operate, Fry describes how companies buy access to you via your interests and your past behaviors (not only online). She summarizes a 2017 DEFCON presentation that showed how supposedly anonymous browsing data is easily converted into real names, and the dastardly Cambridge Analytica exploit. I especially liked how she explains how small the effects of newsfeed manipulation are likely to be (based on research) and then adds — a small margin might be enough to win an election. This chapter wraps up with China’s citizen rating system (Black Mirror in reality) and the toothlessness of GDPR.

Justice. First up is inequality in sentences for crimes, using two U.K. examples. Fry then surveys studies where multiple judges ruled on the same hypothetical cases and inconsistencies abounded. Then the issues with sentencing guidelines (why judges need to be able to exercise discretion). So we arrive at calculating the probability that a person will “re-offend”: the risk assessment. Fry includes a nice, simple decision-tree graphic here. She neatly explains the idea of combining multiple decision trees into an ensemble, used to average the results of all the trees (the random forest algorithm is one example). More examples from research; the COMPAS product and the 2016 ProPublica investigation. This leads to a really nice discussion of bias (pp. 65–71 in the U.S. paperback edition).

Medicine. Although image recognition was mentioned very briefly earlier, here Fry gets more deeply into the topic, starting off with the idea of pattern recognition — and what pattern, exactly, is being recognized? Classifying and detecting anomalies in biopsy slides doesn’t have perfect results when humans do it, so this is one of the promising frontiers for machine learning. Fry describes neural networks here. She gets into specifics about a system trained to detect breast cancer. But image recognition is not necessarily the killer app for medical diagnosis. Fry describes a study of 678 nuns (which previously I’d never heard about) in which it was learned that essays the nuns had written before taking vows could be used to predict which nuns would have dementia later in life. The idea is that an analysis of more data about women (not only their mammograms) could be a better predictor of malignancy.

“Even when our detailed medical histories are stored in a single place (which they often aren’t), the data itself can take so many forms that it’s virtually impossible to connect … in a way that’s useful to an algorithm.”
—Hannah Fry, p. 103

The Medicine chapter also mentions IBM Watson; challenges with labeling data; diabetic retinopathy; lack of coordination among hospitals, doctor’s offices, etc., that lead to missed clues; privacy of medical records. Fry zeroes in on DNA data in particular, noting that all those “find your ancestors” companies now have a goldmine of data to work with. Fry ends with a caution about profit — whatever medical systems might be developed in the future, there will always be people who stand to gain and others who will lose.

Cars. I’m a little burnt out of the topic of self-driving cars, having already read a lot about them. I liked that Fry starts with DARPA and the U.S. military’s longstanding interest in autonomous vehicles. I can’t agree with her that “the future of transportation is driverless” (p. 115). After discussing LiDAR and the flaws of GPS and conflicting signals from different systems in one car, Fry takes a moment to explain Bayes’ theorem, saying it “offers a systematic way to update your belief in a hypothesis on the basis of evidence,” and giving a nice real-world example of probabilistic inference. And of course, the trolley problem. She brings up something I don’t recall seeing before: Humans are going to prank autonomous vehicles. That opens a whole ‘nother box of trouble. Her anecdote under the heading “The company baby” leads to a warning: Always flying on autopilot can have unintended consequences when the time comes to fly manually.

Crime. This chapter begins with a compelling anecdote, followed by a neat historical case from France in the 1820s, and then turns to predictive policing and all its woes. I hadn’t read about the balance between the buffer zone and distance decay in tracking serial criminals, so that was interesting — it’s called the geoprofiling algorithm. I also didn’t know about Jack Maple, a New York City police officer, and his “Charts of the Future” depicting stations of the city’s subway system, which evolved into a data tool named CompStat. I enjoyed learning what burglaries and earthquakes have in common. And then — PredPol. There have been thousands of articles about this since its debut in 2011, as Fry points out. Her summary of the issues related to how police use predictive policing data is quite good, compact and clear. PredPol is one specific product, and not the only one. It is, Fry says, “a proprietary algorithm, so the code isn’t available to the public and no one knows exactly how it works” (p. 157).

“The [PredPol] algorithm can’t actually tell the future. … It can only predict the risk of future events, not the events themselves — and that’s a subtle but important difference.”
—Hannah Fry, p. 153

Face recognition is covered in the Crime chapter, which makes perfect sense. Fry offers a case where a white man was arrested based on incorrect identification of him from CCTV footage at a bank robbery. The consequences of being the person arrested by police can be injury or death, as we all know — not to mention the legal expenses as you try to clear your name after the erroneous arrest. Even though accuracy rates are rising, the chances that you will match a face that isn’t yours remains worrying.

“How do you decide on that trade-off between privacy and protection, fairness and safety?”
—Hannah Fry, p. 172

Art. Here we have “a famous experiment” I’d never heard of — Music Lab, where thousands of music fans logged into a music player app, listened to songs, rated them, and chose what to download (back when we downloaded music). The results showed that for all but the very best and very worst songs, the ratings by other people had a huge influence on what was downloaded in different segments of the app. A song that became a massive hit in one “world” was dead and buried in another. This leads us to recommendation engines such as those used by Netflix and Amazon. Predicting how well movies would do at the box office, turned out to be badly unreliable. The trouble is the lack of an objective measure of quality — it’s not “This is cancer/This is not cancer.” Beauty in the eye of the beholder and all that. A recommendation engine is different because it’s not using a quality score — it’s matching similarity. You liked these 10 movies; I like eight of those; chances are I might like the other two.

Fry goes on to discuss programs that create original (or seemingly original) works of art. A system may produce a new musical or visual composition, but it doesn’t come from any emotional basis. It doesn’t indicate a desire to communicate with others, to touch them in any way.

In her Conclusion, Fry returns to the questions about bias, fairness, mistaken identity, privacy — and the idea of the control we give up when we trust the algorithms. People aren’t perfect, and neither are algorithms. Taking the human consequences of machine errors into account at every stage is a step toward accountability. Building in the capability to backtrack and explain decisions, predictions, outputs, is a step toward transparency.

For details about categories of algorithms based on tasks they perform (prioritization, classification, association, filtering; rule-based vs. machine learning), see the Power chapter (pp. 8–13 in the U.S. paperback edition).

Symbolic AI: Good old-fashioned AI

June 3, 2021 Mindy McAdams

The distinction between symbolic (explicit, rule-based) artificial intelligence and subsymbolic (e.g. neural networks that learn) artificial intelligence was somewhat challenging to convey to non–computer science students. At first I wasn’t sure how much we needed to dwell on it, but as the semester went on and we got deeper into the differences among types of neural networks, it was very useful to keep reminding the students that many of the things neural nets are doing today would simply be impossible with symbolic AI.

The difficulty lies in the shallow math/science background of many communications students. They might have studied logic problems/puzzles, but their memory of how those problems work might be very dim. Most of my students have not learned anything about computer programming, so they don’t come to me with an understanding of how instructions are written in a program.

This post by Ben Dickson at his TechTalks blog offers a very nice summary of symbolic AI, which is sometimes referred to as good old-fashioned AI (or GOFAI, pronounced GO-fie). This is the AI from the early years of AI, and early attempts to explore subsymbolic AI were ridiculed by the stalwart champions of the old school.

The requirements of symbolic AI are that someone — or several someones — needs to be able to specify all the rules necessary to solve the problem. This isn’t always possible, and even when it is, the result might be too verbose to be practical. As many people have said, things that are easy for humans are hard for computers — like recognizing an oddly shaped chair as a chair, or distinguishing a large upholstered chair from a small couch. Things we do almost without thinking are very hard to encode into rules a computer can follow.

“Symbolic artificial intelligence is very convenient for settings where the rules are very clear cut, and you can easily obtain input and transform it into symbols.”
—Ben Dickson

Subsymbolic AI does not use symbols, or rules that need symbols. It stems from attempts to write software operations that mimic the human brain. Not copy the way the brain works — we still don’t know enough about how the brain works to do that. Mimic is the word usually used because a subsymbolic AI system is going to take in data and form connections on its own, and that’s what our brains do as we live and grow and have experiences.

Dickson uses an image-recognition example: How would you program specific rules to tell a symbolic system to recognize a cat in a photo? You can’t write rules like “Has four legs,” or “Has pointy ears,” because it’s a photo. Your rules would need to be about pixels and edges and clusters of contrasting shades. Your rules would also need to account for infinite variations in photos of cats.

“You can’t define rules for the messy data that exists in the real world.”
—Ben Dickson

Thus “messy” problems such as image recognition are ideally handled by neural networks — subsymbolic AI.

Problems that can be drawn as a flow chart, with every variable accounted for, are well suited to symbolic AI. But scale is always an issue. Dickson mentions expert systems, a classic application of symbolic AI, and notes that “they require a huge amount of effort by domain experts and software engineers and only work in very narrow use cases.” On top of that, the knowledge base is likely to require continual updating.

An early, much-praised expert system (called MYCIN) was designed to help doctors determine treatment for patients with blood diseases. In spite of years of investment, it remained a research project — an experimental system. It was not sold to hospitals or clinics. It was not used in day-to-day practice by any doctors diagnosing patients in a clinical setting.

“I have never done a calculation of the number of man-years of labor that went into the project, so I can’t tell you for sure how much time was involved … it is such a major chore to build up a real-world expert system.”
—Edward H. Shortliffe, principal developer of the MYCIN expert system (source)

Even though expert systems are impractical for the most part, there are other useful applications for symbolic AI. Dickson mentions “efforts to combine neural networks and symbolic AI” near the end of his post. He points out that symbolic systems are not “opaque” the way neural nets are — you can backtrack through a decision or prediction and see how it was made.

The trouble with large language models

June 2, 2021June 2, 2021 Mindy McAdams

Yesterday I summarized the first two articles in a series about algorithms and AI by Hayden Field, a technology journalist at Morning Brew. Today I’ll finish out the series.

The third article, This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why, explores the basis of the volatile situation around the firing of Timnit Gebru and later Margaret Mitchell from Google’s Ethical AI unit earlier this year. Both women are highly respected and experienced AI researchers. Mitchell founded the team in 2017.

Central to the situation is a criticism of large language models and a March 2021 paper (On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?) co-authored by Gebru, Mitchell, and two researchers at the University of Washington. The biggest current example is GPT-3, previously covered in several posts here.

“Models this big require an unthinkable amount of data; the entirety of English-language Wikipedia makes up just 0.6% of GPT-3’s training data.”
—”This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why”

The Morning Brew article sums up the very recent and very big improvements in large language models that have come about thanks to new algorithms and faster computer hardware (GPUs running in parallel). It highlights BERT, “the model that now underpins Google Search,” which came out of the research that resulted in the first Transformer. A good at-the-time article about GPT-3’s release was published in July 2020 in MIT’s Technology Review: “OpenAI first described GPT-3 in a research paper published in May [2020].”

One point being — Google fired Timnit Gebru very soon after news and discussion of large language models (GPT-3 especially, but remember Google’s investment in BERT too) ramped up — way up. Her criticism of a previously obscure AI technology (not obscure among NLP researchers, but in the wider world) might have been seen as increasingly inconvenient for Google. Morning Brew summarizes the criticism (not attributed to Gebru): “Because large language models often scrape data from most of the internet, racism, sexism, homophobia, and other toxic content inevitably filter in.”

“Once the barrier to create AI tools and generate text is lower, people could just use it to create misinformation at scale, and having that data coupled with certain other platforms can just be a very disastrous situation.”
—Sandhini Agarwal, AI policy researcher, OpenAi

The Morning Brew article goes well beyond Google’s dismissal of Gebru and Mitchell, bringing in a lot of clear, easy-to-understand explanation of what large language models require (for example, significant energy resources), what they’re being used for, and even the English-centric nature of such models — lacking a gigantic corpus of digitized text in a given human language, you can’t create a large model in that language.

The turmoil in Google’s Ethical AI unit is covered in more detail in this May 2021 article, also by Hayden Field.

It’s easy to find articles that discuss “scary things GPT-3 can do and does” and especially the bias issues; it’s much harder to find information about some of the other aspects covered here. It’s also not just about GPT-3. I appreciated insights from an interview with Emily M. Bender, first author on the “Stochastic Parrots” article. I also liked the explicit statement that many useful NLP tasks can be done well without a large language model. In smaller datasets, finding and accounting for toxic content can be more manageable.

“Do we need this at all? What’s the actual value proposition of the technology? … Who is paying the environmental price for us doing this, and is this fair?”
—Emily M. Bender, professor and director, Professional MS in Computational Linguistics, University of Washington

Finally, in a recap of Morning Brew’s “Demystifying Algorithms” event, editor Dan McCarthy summarized two AI researchers’ answers to one of my favorite questions: What can an algorithm actually know?

An AI system’s ability to generalize — to transfer learning from one domain to another — is still a wide-open frontier, according to Mark Riedl, a computer science professor at Georgia Tech. This is something I remind my students of over and over — what’s called “general intelligence” is still a long way off for artificial intelligence. Riedl works on aspects of storytelling to test whether an AI system is able to “make something new” out of what it has ingested.

Saška Mojsilović, head of Trusted AI Foundations at IBM Research, made a similar point — and also emphasized that “narrow AI” (which is all the AI we’ve ever had, up to now and for the foreseeable future) is not nothing.

She suggested: “We may want to take a pause from obsessing over artificial general intelligence and maybe think about how we create AI solutions for these kinds of problems” — for example, narrow domains such as drug discovery (e.g. new antibiotics) and creation of new molecules. These are extraordinary accomplishments within the capabilities of today’s AI.

This is a half-hour conversation with those two experts:

Thanks to the video, I learned about the Lovelace 2.0 Test, which Riedl developed in 2014. It’s an alternative to the Turing Test.

Mojsilović talked about the perceptions that arise when we use the word intelligence when talking about machines. “The reality is that many things that we call AI today are the same old models that we used to call data science maybe five or six years ago,” she said (at 21:55). She also talked about the need for collaboration between AI researchers and experts in entirely separate fields: “Because we can’t create solutions for the problems that we don’t understand” (at 29:24).

Summary of the challenges facing algorithms, AI

June 1, 2021May 30, 2021 Mindy McAdams

Hayden Field, a technology journalist at Morning Brew, published a series of articles about algorithms and AI earlier this year, and they’ve been on my TBR list.

First up was Nine Experts on the Single Biggest Obstacle Facing AI and Algorithms in the Next Five Years. Experts: Drago Anguelov (Waymo); Kathy Baxter (Salesforce); David Cox (IBM Watson); Natasha Crampton (Microsoft); Mark Diaz (Ethical AI at Google); Charles Isbell (professor and dean, College of Computing, Georgia Institute of Technology); Peter Lofgren (Stripe); Andrew Ng (co-founder and former head, Google Brain); Cathy O’Neil (author, Weapons of Math Destruction).

Predictably, ethics was noted as a big challenge — O’Neil asked what we will do about unfairness in decisions made by algorithms. Diaz pointed to the need for involving “experts from a wide range of disciplines, including non-technical disciplines,” in the development process, long before an end product emerges. This intersects with ethics and fairness, as the absence of experts and stakeholders opens the door wide to omissions and errors. Baxter was explicit about systemic racism that is embedded in both training data and models. She listed “medical care decisions, hiring recommendations, access to housing and social programs, visa application approvals, school exam results, hate speech detection, dynamic pricing algorithms for ride hailing services, and even dating apps” — as well as face recognition and predictive policing.

“In essence, problems that are not purely technical require solutions that are not purely technical.”
—Mark Diaz, Ethical AI at Google

Isbell spoke of systematic solutions that can be widely applied. “We cannot treat minority groups as exceptions and edge cases,” he said. Cox highlighted transparency and explainability, as well as ethics and bias. He also alluded to adversarial attacks as well as the non-adversarial errors that surprise researchers (possibly due to overfitting). He grouped all this under trust. Crampton also focused on fairness and referred to diversity in teams, similar to Diaz’s and Isbell’s concerns.

Anguelov explained the need for reliable simulations so that systems can scale up to real-world use. He’s talking about the Long Tail problem: the real world throws up too many unexpected situations. Simulations allow testing in ways that don’t risk human lives (think self-driving cars). Lofgren also talked about scale, but in terms of personalization — his example is detecting credit card fraud in real-time based on Big Data that detects abusing IP addresses and then drills down to the individual cards being used. Ng talked about the difficulty in making dependable commercial AI products — basically off-the-shelf solutions.

“We will often need to make hard decisions based on competing priorities, including decisions to not build or deploy a system for certain purposes.”
—Natasha Crampton, Microsoft

Second in the series is titled Amex’s Fraud Detection AI Was Ready to Go Live. Then Covid Hit. This article starts with the idea that large AI models in the field will still need adjustments as unforeseen problems crop up. This echos the concerns about scale raised by Anguelov and Lofgren in the first article in the series.

The challenge thrown by COVID-19 was that all existing models had been developed and adjusted in a non-pandemic world. Then the world changed.

Amex’s fraud-detecting systems are a blend of old-school rule-based systems and newer machine learning techniques. A team of about 30 decision scientists monitors the system round-the-clock and updates it when necessary, at least once a year. The pandemic came at a bad time for Amex, just as they were rolling out a new model.

“Since each generation of a gradient-boosting ML model is typically developed on data from earlier that same year, many of the model’s assumptions no longer made sense” in 2020.
—”Amex’s Fraud Detection AI Was Ready to Go Live. Then Covid Hit”

This is a really interesting article — although I’d read others about issues caused for AI models by pandemic changes, most of those had to do with either healthcare or travel.

Because of increased online traffic in 2020 — more people online, every day, as the pandemic drove work-from-home and stay-at-home schooling — demands on Amazon Web Services (providing servers and processing power to millions of commercial clients such as Amex) grew enormously. This “dwindling cloud capacity” meant testing new solutions for Amex’s model took much longer than usual. The team had to run new simulations that took our new way of life into account, and those simulations required lots of processor juice.

In the end, Amex’s rollout was successful — but it came months later than originally planned. This was a really neat case study and could be discussed in a lot of different contexts.

I’m going to look at the other articles in the series in tomorrow’s post.

Attention, in machine learning and NLP

May 30, 2021May 30, 2021 Mindy McAdams

Let’s begin at the beginning, with Attention Is All You Need (Vaswani et al., 2017). This is a conference paper with eight authors, six of whom then worked at Google. They contended that neither recurrent neural networks nor convolutional neural networks are necessary for machine translation of languages, and hence the Transformer, “a new simple network architecture,” was born. (Note: It relies on feed-forward neural networks.)

Transformers are the basis for machine translation and other tasks relying on language models. GPT-3 has recently become infamous; others include BERT (from Google) and ELMo.

Before the work by Vaswani and his co-equal co-authors, progress in NLP was limited (although it had advanced a lot since 2012) because of the ways in which RNN models depend on the sequence and position of words in a text. Transformers eliminate those limitations. With recurrent neural networks, there are impediments to parallel processing. Other researchers had previously cracked that nut using ConvNets, but then other limitations were inherent (exponential increase in the number of computational operations). Transformers also eliminate those limitations.

So the Transformer was a first in NLP, a breakthrough. For machine translation, the paper claimed “a new state of the art” (p. 10).

I had learned that an encoder and a decoder connected by an attention module is a standard architecture for machine language translation, e.g. Google Translate. This was true before 2017, so what is the difference effected by the Transformer? It eliminates RNNs and ConvNets from the architecture, yes (“our model contains no recurrence and no convolution”) — but what else?

Attention used in a new way

“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key”(Vaswani et al., 2017, p. 3). I’m okay with that, although I doubt I would be able to explain it to my non–computer science students. (I do explain weights and features when I introduce neural nets to them, and I explain word vectors when we start NLP. The trouble is they don’t know how to write a program, and they certainly don’t understand what a function is.)

There are different attention functions that could be used. One is additive attention; another is dot-product attention, which is multiplicative rather than additive. Dot-product is “much faster and more space-efficient in practice.” Vaswani et al. used a scaled dot-product attention function (p. 4). They also used multi-head attention, meaning the model uses eight parallel attention layers, or heads. The explanation was a bit beyond me, but the gist is that the model can look at multiple things at the same time, like juggling more balls simultaneously.

Multi-head attention — plus the freedom of no-sequence, no-position — enables the Transformer to look at all the context for a word, and do it for multiple words at the same time.

With my rudimentary understanding of recurrent neural nets, I have a fuzzy idea of how this use of attention functions produces better results, mainly by being able to take in and compare more of the text, a little closer to the way human brains hold an entire conversation even though it’s not a literal “recording” of the exact conversation. The way we comprehend meaning when we read has to do with millions of associations built up over a lifetime, as well as many associations within that present text. We are not processing separate little slices of a sentence — our brains handle a text more holistically.

A Transformer does use word embeddings to convert the tokens (both inout and output) to vectors (Vaswani et al., 2017, p. 5). It uses softmax but no LSTMs (because, again, “no recurrence”).

Please help me, YouTube

I found a video (13:04) that helped me in my struggle to understand the Transformer architecture:

It was still a tough climb for me, but this video was particularly helpful with how multi-head attention improves the process. (Obviously the speed improvement is huge.)

Another helpful video (5:33) does a nice job summing up the sequence-based limitations of RNNs: “In general it’s easier for [RNNs] to capture relationships between points that are close to each other than it is to capture relationships between points that are very far from each other — say, several thousand points in the sequence.” In the paper, this is called “path length between long-range dependencies in the network” (Vaswani et al., 2017, p. 6) and identified as one of three motivations for developing the self-attention layers in Transformer.

In fact this second video is much better than the one above, but I liked that one when I watched it first, and maybe (haha!!) the order in which I watched them had an effect. The diagrams for self-attention in this shorter video are very good!

Back to Vaswani et al.

Speaking of self-attention — it was interesting that the authors thought it “could yield more interpretable models.” As in any hidden layer in any neural network, features are determined and weights set by the system itself, not by the human programmers. This is the “learning” in machine learning. The authors noted that the “individual attention heads clearly learn to perform different tasks,” and that many of them “appear to exhibit behavior related to the syntactic and semantic structure of the sentences” (p. 7; my italics).

Cool.

The results section of the paper describes performance using BLEU scores on two different NLP tasks (WMT 2014 English-to-German translation; WMT 2014 English-to-French translation) — reported as best-ever at that time — as well as record-breaking lower training costs, which means time to train the model factored by processor power used (number of GPUs, estimate of the number of floating-point operations).

The successor to the code on which this seminal paper was based is Trax, available on GitHub.

At the end of the paper (pages 13–15) there are math-free visualizations that illustrate what the attention mechanism does. These are well worth a look.