Rules and ethics for use of AI by governments

The governments of British Columbia and Yukon, in Canada, jointly issued a report (June 2021) about ethical use of AI in the public sector. It’s interesting to me because it covers issues of privacy and fairness, and in particular the rights of people to question decisions derived from AI systems. The report notes that the public increasingly expects government services to be as fast and as personalized as the services provided by online platforms such as Amazon, and that this expectation is driving (or will drive) increased adoption of AI systems to help deliver government services to members of the public.

The report’s concluding recommendations (pages 47–48) cover eight points (edited):

  1. Establish guiding principles for AI use: “Each public authority should make a public commitment to guiding principles for the use of AI that incorporate transparency, accountability, legality, procedural fairness and protection of privacy.”
  2. Inform the public: “If an ADS [automated decision system] is used to make a decision about an individual, public authorities must notify and describe how that system operates to the individual in a way that is understandable.”
  3. Provide human accountability: “Identify individuals within the public authority who are responsible for engineering, maintaining, and overseeing the design, operation, testing and updating of any ADS.”
  4. Ensure that auditing and transparency are possible: “All ADS should include robust and open auditing functionality with enhanced transparency measures for closed-source, proprietary datasets used to develop and update any ADS.”
  5. Protect privacy of individuals: “Wherever possible, public authorities should use synthetic or de-identified data in any ADS.” See synthetic data definition, below.
  6. Build capacity and increase education (for understanding of AI): This point covers “public education initiatives to improve general knowledge of the impact of AI and other emerging technologies on the public, on organizations that serve the public,” etc.; “subject-matter knowledge and expertise on AI across government ministries”; “knowledge sharing and expertise between government and AI developers and vendors”; development of “open-source, high-quality data sets for training and testing ADS”; “ongoing training of ADS administrators” within government agencies.
  7. Amend privacy legislation to include: “an Artificial Intelligence Fairness and Privacy Impact Assessment for all existing and future AI programs”; “the right to notification that ADS is used, an explanation of the reasons and criteria used, and the ability to object to the use of ADS”; “explicit inclusion of service providers to the same obligations as public authorities”; “stronger enforcement powers in both the public and private sector …”; “special rules or restrictions for the processing of highly sensitive information by ADS”; “shorter legislative review periods of 4 years.”
  8. Review legislation to make sure “oversight bodies are able to review AIFPIAs [see item 7 above] and conduct investigations regarding the use of ADS alone or in collaboration with other oversight bodies.”

Synthetic data is defined (on page 51) as: “A type of anonymized data used as a filter for information that would otherwise compromise the confidentiality of certain aspects of data. Personal information is removed by a process of synthesis, ensuring the data retains its statistical significance. To create synthetic data, techniques from both the fields of cryptography and statistics are used to render data safe against current re-identification attacks.”
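
As a toy illustration of the synthesis idea (not the report’s method, and without the cryptographic protections the definition mentions): fit a simple statistical model to hypothetical records and then sample new, artificial records that preserve the overall statistics but correspond to no real individual.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" records: age and annual benefit amount (toy data).
real = np.column_stack([
    rng.normal(45, 12, size=1000),      # age
    rng.normal(8000, 2500, size=1000),  # benefit amount
])

# Fit a simple statistical model: the empirical mean and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample brand-new records that retain the statistical shape of the data
# without containing any real person's information.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
```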

The report uses the term automated decision systems (ADS) in view of the Government of Canada’s Directive on Automated Decision-Making, which defines them as: “Any technology that either assists or replaces the judgement of human decision-makers.”


Book notes: Hello World, by Hannah Fry

I finished reading this book back in April, and I’d like to revisit it before I read a couple of new books I just got. This was published in 2018, but that’s no detriment. The author, Hannah Fry, is a “mathematician, science presenter and all-round badass,” according to her website. She’s also a professor at University College London. Her bio at UCL says: “She was trained as a mathematician with a first degree in mathematics and theoretical physics, followed by a PhD in fluid dynamics.”

The complete title, Hello World: Being Human in the Age of Algorithms, doesn’t make it sound like a book about artificial intelligence. But Fry writes about control, and “the boundary between controller and controlled,” from the very first pages, and that is the link between “just” talking about algorithms and talking about AI. Software is made of algorithms, and AI is made of software, so there we go.

In just over 200 pages, across seven chapters simply titled Power, Data, Justice, Medicine, Cars, Crime, and Art, Fry maps out the primary areas of concern for the question “Are we in control?” and provides examples in each area.

Power. I felt disappointed when I saw this chapter starts with Deep Blue beating world chess champion Garry Kasparov in 1997, but my spirits soon lifted as I saw how she framed the example: the way we perceive a computer system affects how we interact with it (shades of Sherry Turkle and Reeves & Nass). She discusses machine learning and image recognition here, briefly. She talks about people trusting GPS map directions and search engines. She explains a 2012 ACLU lawsuit involving Medicaid assistance, bad code, and unwarranted trust in code. Intuition tells us when something seems “off,” and that’s a critical difference between us and the machines.

Algorithms “are what makes computer science an actual science.”

—Hannah Fry, p. 8

Data. Sensibly, this chapter begins with Facebook and the devil’s bargain most of us have made in giving away our personal information. Fry talks about the first customer loyalty cards at supermarkets. The pregnant teenager/Target story is told. In explaining how data brokers operate, Fry describes how companies buy access to you via your interests and your past behaviors (not only online). She summarizes a 2017 DEFCON presentation that showed how supposedly anonymous browsing data is easily converted into real names, and the dastardly Cambridge Analytica exploit. I especially liked how she explains how small the effects of newsfeed manipulation are likely to be (based on research) and then adds — a small margin might be enough to win an election. This chapter wraps up with China’s citizen rating system (Black Mirror in reality) and the toothlessness of GDPR.

Justice. First up is inequality in sentences for crimes, using two U.K. examples. Fry then surveys studies where multiple judges ruled on the same hypothetical cases and inconsistencies abounded. Then the issues with sentencing guidelines (why judges need to be able to exercise discretion). So we arrive at calculating the probability that a person will “re-offend”: the risk assessment. Fry includes a nice, simple decision-tree graphic here. She neatly explains the idea of combining multiple decision trees into an ensemble, used to average the results of all the trees (the random forest algorithm is one example). More examples from research; the COMPAS product and the 2016 ProPublica investigation. This leads to a really nice discussion of bias (pp. 65–71 in the U.S. paperback edition).
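
To make the ensemble idea concrete, here is a minimal scikit-learn sketch (using made-up features, not COMPAS, whose model is proprietary): a random forest of 100 decision trees whose votes are averaged into a risk score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up stand-in for risk-assessment features (age, prior counts, etc.).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of 100 decision trees; each tree sees a random sample of the
# data, and the forest averages their votes.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("accuracy:", forest.score(X_test, y_test))
# Probability of the positive class ("re-offend") for one hypothetical person:
print("risk score:", forest.predict_proba(X_test[:1])[0, 1])
```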

Medicine. Although image recognition was mentioned briefly earlier, here Fry goes more deeply into the topic, starting with the idea of pattern recognition — and what pattern, exactly, is being recognized? Humans don’t achieve perfect results when classifying and detecting anomalies in biopsy slides, so this is one of the promising frontiers for machine learning. Fry describes neural networks here. She gets into specifics about a system trained to detect breast cancer. But image recognition is not necessarily the killer app for medical diagnosis. Fry describes a study of 678 nuns (which I’d never heard about before) in which essays the nuns had written before taking their vows turned out to predict which of them would develop dementia later in life. The idea is that analyzing more data about women (not only their mammograms) could yield a better predictor of malignancy.
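
Fry’s description of neural networks as classifiers can be made concrete with a small sketch: a simple feed-forward network trained on tabular features from scikit-learn’s built-in breast-cancer demo dataset. This is an illustration of the concept only, not any system Fry describes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tabular features computed from biopsy images (a public demo dataset),
# standing in for "more data about the patient than the image alone."
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small feed-forward neural network: layers of weighted connections
# adjusted during training to separate malignant from benign cases.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```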

“Even when our detailed medical histories are stored in a single place (which they often aren’t), the data itself can take so many forms that it’s virtually impossible to connect … in a way that’s useful to an algorithm.”

—Hannah Fry, p. 103

The Medicine chapter also mentions IBM Watson; challenges with labeling data; diabetic retinopathy; the lack of coordination among hospitals, doctors’ offices, etc., that leads to missed clues; and the privacy of medical records. Fry zeroes in on DNA data in particular, noting that all those “find your ancestors” companies now have a goldmine of data to work with. Fry ends with a caution about profit: whatever medical systems might be developed in the future, there will always be people who stand to gain and others who will lose.

Cars. I’m a little burnt out on the topic of self-driving cars, having already read a lot about them. I liked that Fry starts with DARPA and the U.S. military’s longstanding interest in autonomous vehicles. I can’t agree with her that “the future of transportation is driverless” (p. 115). After discussing LiDAR, the flaws of GPS, and conflicting signals from different systems in one car, Fry takes a moment to explain Bayes’ theorem, saying it “offers a systematic way to update your belief in a hypothesis on the basis of evidence,” and giving a nice real-world example of probabilistic inference. And of course, the trolley problem. She brings up something I don’t recall seeing before: humans are going to prank autonomous vehicles. That opens a whole ‘nother box of trouble. Her anecdote under the heading “The company baby” leads to a warning: always flying on autopilot can have unintended consequences when the time comes to fly manually.
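
Fry’s description of Bayes’ theorem translates directly into a few lines of arithmetic. Here is a minimal sketch with made-up numbers, in the spirit of sensor fusion in a car: update the belief that the object ahead is a pedestrian after a noisy sensor reports “pedestrian.”

```python
# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)

prior = 0.10                 # prior belief the object ahead is a pedestrian
p_detect_given_ped = 0.90    # sensor says "pedestrian" when one is there
p_detect_given_not = 0.05    # false alarm rate (made-up numbers)

# Total probability of the sensor reporting "pedestrian"
p_detect = p_detect_given_ped * prior + p_detect_given_not * (1 - prior)

posterior = p_detect_given_ped * prior / p_detect
print(f"updated belief: {posterior:.2f}")   # roughly 0.67
```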

Crime. This chapter begins with a compelling anecdote, followed by a neat historical case from France in the 1820s, and then turns to predictive policing and all its woes. I hadn’t read about the balance between the buffer zone and distance decay in tracking serial criminals, so that was interesting — it’s called the geoprofiling algorithm. I also didn’t know about Jack Maple, a New York City police officer, and his “Charts of the Future” depicting stations of the city’s subway system, which evolved into a data tool named CompStat. I enjoyed learning what burglaries and earthquakes have in common. And then — PredPol. There have been thousands of articles about this since its debut in 2011, as Fry points out. Her summary of the issues related to how police use predictive policing data is quite good, compact and clear. PredPol is one specific product, and not the only one. It is, Fry says, “a proprietary algorithm, so the code isn’t available to the public and no one knows exactly how it works” (p. 157).
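
The buffer-zone-plus-distance-decay idea can be sketched in a few lines. This is a simplified illustration, not Rossmo’s actual geographic profiling formula: every candidate anchor point gets a score that is suppressed very close to a crime (the buffer zone) and decays with distance beyond it.

```python
import numpy as np

# Made-up crime locations on a simple grid (x, y in kilometres).
crimes = np.array([[2.0, 3.0], [2.5, 3.5], [3.0, 2.5], [2.2, 2.8]])

BUFFER = 0.5   # offenders tend not to strike right next to home
DECAY = 1.5    # how quickly likelihood falls off with distance

def score(point, crimes):
    """Simplified geoprofile score: suppressed inside the buffer zone,
    decaying with distance outside it."""
    d = np.linalg.norm(crimes - point, axis=1)
    d = np.maximum(d, 1e-6)                       # avoid division by zero
    outside = d >= BUFFER
    s = np.where(outside, 1.0 / d**DECAY,         # distance decay
                 d / BUFFER**(DECAY + 1))         # rises toward the buffer edge
    return s.sum()

# Score a small grid of candidate "home base" locations.
best = max(((x, y) for x in np.arange(0, 5, 0.25)
                    for y in np.arange(0, 5, 0.25)),
           key=lambda p: score(np.array(p), crimes))
print("highest-scoring anchor point:", best)
```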

“The [PredPol] algorithm can’t actually tell the future. … It can only predict the risk of future events, not the events themselves — and that’s a subtle but important difference.”

—Hannah Fry, p. 153

Face recognition is covered in the Crime chapter, which makes perfect sense. Fry offers a case in which a white man was arrested based on an incorrect identification from CCTV footage of a bank robbery. The consequences of being the person arrested by police can be injury or death, as we all know — not to mention the legal expenses as you try to clear your name after the erroneous arrest. Even though accuracy rates are rising, the chances that you will be matched to a face that isn’t yours remain worrying.

“How do you decide on that trade-off between privacy and protection, fairness and safety?”

—Hannah Fry, p. 172

Art. Here we have “a famous experiment” I’d never heard of — Music Lab, in which thousands of music fans logged into a music player app, listened to songs, rated them, and chose what to download (back when we downloaded music). The results showed that for all but the very best and very worst songs, other people’s ratings had a huge influence on what was downloaded in different segments of the app. A song that became a massive hit in one “world” was dead and buried in another. This leads us to recommendation engines such as those used by Netflix and Amazon. Predicting how well movies would do at the box office turned out to be badly unreliable. The trouble is the lack of an objective measure of quality — it’s not “This is cancer/This is not cancer.” Beauty in the eye of the beholder and all that. A recommendation engine is different because it’s not using a quality score — it’s matching similarity. You liked these 10 movies; I like eight of those; chances are I might like the other two.
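
The “you liked these 10 movies; I like eight of those” framing is just similarity matching. A minimal sketch with made-up ratings (not Netflix’s or Amazon’s actual systems), using cosine similarity:

```python
import numpy as np

# Rows = users, columns = movies; 1 = liked, 0 = not rated/liked (made up).
ratings = np.array([
    [1, 1, 1, 1, 0, 0],   # you
    [1, 1, 1, 0, 0, 1],   # me
    [0, 0, 0, 1, 1, 1],   # someone with different taste
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

you, me, other = ratings
print("similarity(you, me):   ", round(cosine(you, me), 2))
print("similarity(you, other):", round(cosine(you, other), 2))

# Recommend to "me" the movies "you" liked that "me" hasn't seen yet,
# because our tastes overlap heavily.
recommend = np.where((you == 1) & (me == 0))[0]
print("recommend movie indices:", recommend)
```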

Fry goes on to discuss programs that create original (or seemingly original) works of art. A system may produce a new musical or visual composition, but it doesn’t come from any emotional basis. It doesn’t indicate a desire to communicate with others, to touch them in any way.

In her Conclusion, Fry returns to the questions about bias, fairness, mistaken identity, privacy — and the idea of the control we give up when we trust the algorithms. People aren’t perfect, and neither are algorithms. Taking the human consequences of machine errors into account at every stage is a step toward accountability. Building in the capability to backtrack and explain decisions, predictions, outputs, is a step toward transparency.

For details about categories of algorithms based on tasks they perform (prioritization, classification, association, filtering; rule-based vs. machine learning), see the Power chapter (pp. 8–13 in the U.S. paperback edition).


The trouble with large language models

Yesterday I summarized the first two articles in a series about algorithms and AI by Hayden Field, a technology journalist at Morning Brew. Today I’ll finish out the series.

The third article, This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why, explores the basis of the volatile situation around the firing of Timnit Gebru (in December 2020) and later Margaret Mitchell (in early 2021) from Google’s Ethical AI unit. Both women are highly respected and experienced AI researchers. Mitchell founded the team in 2017.

Central to the situation is a criticism of large language models and a March 2021 paper (On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?) co-authored by Gebru, Mitchell, and two researchers at the University of Washington. The biggest current example is GPT-3, previously covered in several posts here.

“Models this big require an unthinkable amount of data; the entirety of English-language Wikipedia makes up just 0.6% of GPT-3’s training data.”

—”This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why”

The Morning Brew article sums up the very recent and very big improvements in large language models that have come about thanks to new algorithms and faster computer hardware (GPUs running in parallel). It highlights BERT, “the model that now underpins Google Search,” which came out of the research that resulted in the first Transformer. A good at-the-time article about GPT-3’s release was published in July 2020 in MIT’s Technology Review: “OpenAI first described GPT-3 in a research paper published in May [2020].”
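
To make the language-model idea concrete: BERT is trained to fill in masked-out words, and the Hugging Face transformers library exposes that task directly. A minimal sketch (the example sentence is mine; the model weights download on first run):

```python
from transformers import pipeline

# BERT-style masked language modeling: predict the hidden word from context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("The reporter filed the [MASK] before the deadline.")[:3]:
    print(round(prediction["score"], 3), prediction["token_str"])
```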

One point being — Google fired Timnit Gebru very soon after news and discussion of large language models (GPT-3 especially, but remember Google’s investment in BERT too) ramped up — way up. Her criticism of a previously obscure AI technology (not obscure among NLP researchers, but in the wider world) might have been seen as increasingly inconvenient for Google. Morning Brew summarizes the criticism (not attributed to Gebru): “Because large language models often scrape data from most of the internet, racism, sexism, homophobia, and other toxic content inevitably filter in.”

“Once the barrier to create AI tools and generate text is lower, people could just use it to create misinformation at scale, and having that data coupled with certain other platforms can just be a very disastrous situation.”

—Sandhini Agarwal, AI policy researcher, OpenAI

The Morning Brew article goes well beyond Google’s dismissal of Gebru and Mitchell, bringing in a lot of clear, easy-to-understand explanation of what large language models require (for example, significant energy resources), what they’re being used for, and even the English-centric nature of such models — lacking a gigantic corpus of digitized text in a given human language, you can’t create a large model in that language.

The turmoil in Google’s Ethical AI unit is covered in more detail in this May 2021 article, also by Hayden Field.

It’s easy to find articles that discuss “scary things GPT-3 can do and does” and especially the bias issues; it’s much harder to find information about some of the other aspects covered here. It’s also not just about GPT-3. I appreciated insights from an interview with Emily M. Bender, first author on the “Stochastic Parrots” article. I also liked the explicit statement that many useful NLP tasks can be done well without a large language model. In smaller datasets, finding and accounting for toxic content can be more manageable.

“Do we need this at all? What’s the actual value proposition of the technology? … Who is paying the environmental price for us doing this, and is this fair?”

—Emily M. Bender, professor and director, Professional MS in Computational Linguistics, University of Washington

Finally, in a recap of Morning Brew’s “Demystifying Algorithms” event, editor Dan McCarthy summarized two AI researchers’ answers to one of my favorite questions: What can an algorithm actually know?

An AI system’s ability to generalize — to transfer learning from one domain to another — is still a wide-open frontier, according to Mark Riedl, a computer science professor at Georgia Tech. This is something I remind my students of over and over — what’s called “general intelligence” is still a long way off for artificial intelligence. Riedl works on aspects of storytelling to test whether an AI system is able to “make something new” out of what it has ingested.

Saška Mojsilović, head of Trusted AI Foundations at IBM Research, made a similar point — and also emphasized that “narrow AI” (which is all the AI we’ve ever had, up to now and for the foreseeable future) is not nothing.

She suggested: “We may want to take a pause from obsessing over artificial general intelligence and maybe think about how we create AI solutions for these kinds of problems” — for example, narrow domains such as drug discovery (e.g. new antibiotics) and creation of new molecules. These are extraordinary accomplishments within the capabilities of today’s AI.

The recap includes a half-hour video conversation with those two experts.

Thanks to the video, I learned about the Lovelace 2.0 Test, which Riedl developed in 2014. It’s an alternative to the Turing Test.

Mojsilović talked about the perceptions that arise when we use the word intelligence when talking about machines. “The reality is that many things that we call AI today are the same old models that we used to call data science maybe five or six years ago,” she said (at 21:55). She also talked about the need for collaboration between AI researchers and experts in entirely separate fields: “Because we can’t create solutions for the problems that we don’t understand” (at 29:24).


Multiple facets of ethics in AI

The Center for Responsible AI at New York University has published a free online course titled “AI Ethics: Global Perspectives.”

The course consists of a series of videos produced by many different people in countries around the world. The instructors include computer science and engineering professors as well as researchers in various fields, including government, health care, and the humanities. These are the lectures I intend to watch:

Lectures still to come:

  • Renee Cummings, a U.S. criminologist and consultant, will discuss “Bias in Data and AI: Myth, Mistrust, and Myopia.”
  • Susan Scott-Parker will discuss “AI Powered Disability Discrimination: How Do You Lip Read a Robot Recruiter?”


What is the good in GPT-3?

When given a prompt, an app built on the GPT-3 language model can generate an entire essay. Why would we need such an essay? Maybe the more important question is: What harm can such an essay bring about?

I couldn’t get that question out of my mind after I came across a tweet by Abeba Birhane, an award-winning cognitive science researcher based in Dublin.

You can read the essay on the Philosopher AI website or, should that go away, you can see a full image of the page that I captured.

Here is a sample of the generated text: “… it is unclear whether ethiopia’s problems can really be attributed to racial diversity or simply the fact that most of its population is black and thus would have faced the same issues in any country (since africa has had more than enough time to prove itself incapable of self-government).”

Obviously there exist racist human beings who would express a similar racist idea. The machine, however, has written this by default. It was not told to write a racist essay — it was told to write an essay about Ethiopia.
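
For context on what “an app built on GPT-3” does under the hood, here is a minimal sketch of a completion request as OpenAI’s Python library worked around the time of this post. It assumes paid API access; the engine name, prompt, and parameters are illustrative only.

```python
import openai

openai.api_key = "sk-..."  # requires paid access to the GPT-3 API

response = openai.Completion.create(
    engine="davinci",          # the largest GPT-3 model at the time
    prompt="Write a short philosophical essay about Ethiopia.",
    max_tokens=250,
    temperature=0.7,
)
print(response.choices[0].text)
```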

The free online version of Philosopher AI no longer exists to generate texts for you — but you can buy access to it via an app for either iOS or Android. That means anyone with $3 or $4 can spin up an essay to submit for a class, an application for a school or a job, a blog or forum post, an MTurk prompt.

A review of Philosopher AI posted at the iOS app store

The app has built-in blocks on certain terms, such as trans and women — apparently because the app cannot be trusted to write anything inoffensive in response to those prompts.

Why is a GPT-3 app so predisposed to write misogynist and racist and otherwise hateful texts? It goes back to the corpus on which it was trained. (See a related post here.) Philosopher AI offers this disclaimer: “Please remember that the AI will generate different outputs each time; and that it lacks any specific opinions or knowledge — it merely mimics opinions, proven by how it can produce conflicting outputs on different attempts.”

“GPT-3 was trained on the Common Crawl dataset, a broad scrape of the 60 million domains on the internet along with a large subset of the sites to which they link. This means that GPT-3 ingested many of the internet’s more reputable outlets — think the BBC or The New York Times — along with the less reputable ones — think Reddit. Yet, Common Crawl makes up just 60% of GPT-3’s training data; OpenAI researchers also fed in other curated sources such as Wikipedia and the full text of historically relevant books.” (Source: TechCrunch.)

There’s no question that GPT-3’s natural language generation prowess is amazing, stunning. But it’s like a wild beast that can at any moment turn and rip the throat out of its trainer. It has all the worst of humanity already embedded within it.

A previous related post: GPT-3 and automated text generation.


What’s the use of machine learning?

I’m interested in applications of machine learning in journalism. This is natural, as my field is journalism. In the field of computer science, however, accolades and honors tend to favor research on new algorithms or procedures, or new network architectures. Applications are practical uses of algorithms, networks, etc., to solve real-world problems — and developing them often doesn’t garner the acclaim that researchers need to advance their careers.

Hannah Kerner, a professor and machine learning researcher at the University of Maryland, wrote about this in the MIT Technology Review. Her essay is aptly titled “Too many AI researchers think real-world problems are not relevant.”

“The first image of a black hole was produced using machine learning. The most accurate predictions of protein structures, an important step for drug discovery, are made using machine learning.”

—Hannah Kerner

Noting that applications of machine learning are making real contributions to science in fields outside computer science, Kerner (who works on machine learning solutions for NASA’s food security and agriculture program) asks how much is lost because of the priorities set by the journals and conferences in the machine learning field.

She also ties this focus on ML research for the sake of advancing ML to the seepage of bias out from widely used datasets into the mainstream — the most famous cases being in face recognition, with systems (machine learning models) built on flawed datasets that disproportionately skew toward white and male faces.

“When studies on real-world applications of machine learning are excluded from the mainstream, it’s difficult for researchers to see the impact of their biased models, making it far less likely that they will work to solve these problems.”

—Hannah Kerner

Machine learning is rarely plug-and-play. In creating an application that will be used to perform useful work — to make new discoveries, perhaps, or to make medical diagnoses more accurate — the machine learning researchers will do substantial new work, even when they use existing models. Just think, for a moment, about the data needed to produce an image of a black hole. Then think about the data needed to make predictions of protein structures. You’re not going to handle those in exactly the same way.

I imagine the work is quite demanding when a number of non–ML experts (say, the biologists who work on protein structures) get together with a bunch of ML experts. But either group working separately from the other is unlikely to come up with a robust new ML application. Kerner linked to this 2018 news report about a flawed cancer-detection system — leaked documents said that “instead of feeding real patient data into the software,” the system was trained on data about hypothetical patients. (OMG, I thought — you can’t train a system on fake data and then use it on real people!)

Judging from what Kerner has written, machine learning researchers might be caught in a loop, where they work on pristine and long-used datasets (instead of dirty, chaotic real-world data) to perfect speed and efficiency of algorithms that perhaps become less adaptable in the process.

It’s not that applications aren’t getting made — they are. The difficulty lies in the priorities for research, which might dissuade early-career ML researchers in particular from work on solving interesting and even vital real-world problems — and wrestling with the problems posed by messy real-world data.

I was reminded of something I’ve often heard from data journalists: If you’re taught by a statistics professor, you’ll be given pre-cleaned datasets to work with. (The reason being: She just wants you to learn statistics.) If you’re taught by a journalist, you’ll be given real dirty data, and the first step will be learning how to clean it properly — because that’s what you have to do with real data and a real problem.
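
A tiny pandas sketch of what that first cleaning step looks like in practice, using a deliberately dirty, made-up dataset:

```python
import pandas as pd

# Deliberately dirty, made-up data: inconsistent case, stray whitespace,
# missing values, and numbers stored as text.
raw = pd.DataFrame({
    "county": [" Alachua", "alachua ", "Marion", None, "MARION"],
    "cases":  ["12", "15", None, "8", "twenty"],
})

clean = raw.copy()
clean["county"] = clean["county"].str.strip().str.title()
clean["cases"] = pd.to_numeric(clean["cases"], errors="coerce")  # "twenty" -> NaN
clean = clean.dropna()   # decide explicitly what to do with missing rows

print(clean)
```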

So the next time you read about some breakthrough in machine learning, consider whether it is part of a practical application, or instead, more of a laboratory experiment performed in isolation, using a tried-and-true dataset instead of wild data.


How would you respond to the trolley problem?

MIT has a cool and easy-to-play game (okay, not really a game, but like a game) in which you get to choose what a self-driving car would do when facing an imminent crash situation.

Above: Results from one round of playing the Moral Machine

At the end of one round, you get to see how your moral choices measure up to those of other people who have played. Note that all the drawings of people in the game have distinct meanings. People inside the car are also represented. Try it yourself here.

It is often said that the split-second decision about who lives and who dies is one of the most difficult aspects of training an autonomous vehicle.

Imagine this scenario:

“The car is programmed to sacrifice the driver and the occupants to preserve the lives of bystanders. Would you get into that car with your child?”

—Meredith Broussard, The Atlantic, 2018

In a 2018 article, Self-Driving Cars Still Don’t Know How to See, data journalist and professor Meredith Broussard tackled this question head-on. We find that the way the question is asked elicits different answers. If you say the driver might die, or be injured, if a child in the street is saved, people tend to respond: Save the child! But if someone says, “You are the driver,” the response tends to be: Save me.

You can see the conundrum. When programming the responses into the self-driving car, there’s not a lot of room for fine-grained moral reasoning. The car is going to decide in terms of (a) Is a crash imminent? (b) What options exist? (c) Does any option endanger the car’s occupants? (d) Does any option endanger other humans?

In previous posts, I’ve written a little about the weights and probability calculations used in AI algorithms. For the machine, this all comes down to math. If (a) is True, then what options are possible? Each option has a weight. The largest weight wins. The prediction of the “best outcome” is based on probabilities.
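
A minimal sketch of that logic with made-up numbers (not any manufacturer’s actual code): enumerate the options, compute a probability-weighted score for each, and pick the best. Here the score is expected harm, so the lowest value wins, which is the same weighting idea stated the other way around.

```python
# Made-up options and numbers for an imminent-crash scenario.
# Coming up with the numbers is the hard part; the math is simple.
options = {
    "brake_hard":  {"p_success": 0.70, "harm_if_fail": 0.9},
    "swerve_left": {"p_success": 0.50, "harm_if_fail": 0.6},
    "stay_course": {"p_success": 0.10, "harm_if_fail": 1.0},
}

def expected_harm(opt):
    # Probability-weighted harm: lower is better.
    return (1 - opt["p_success"]) * opt["harm_if_fail"]

best = min(options, key=lambda name: expected_harm(options[name]))
for name, opt in options.items():
    print(f"{name:12s} expected harm = {expected_harm(opt):.2f}")
print("chosen action:", best)
```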


How might we regulate AI to prevent discrimination?

Discussions about regulation of AI, and algorithms in general, often revolve around privacy and misuse of personal data. Protections against bias and unfair treatment are also part of this conversation.

In a recent article in Harvard Business Review, lawyer Andrew Burt (who might prefer to be called a “legal engineer”) wrote about using existing legal standards to guide efforts at ensuring fairness in AI-based systems. In the United States, these include the Equal Credit Opportunity Act, the Civil Rights Act, and the Fair Housing Act.

Photo by Tingey Injury Law Firm on Unsplash

Burt emphasizes the danger of unintentional discrimination, which can arise from basing the “knowledge” in the system on past data. You might think it would make sense to train an AI to do things the way your business has done things in the past — but if that means denying loans disproportionately to people of color, then you’re baking discrimination right into the system.

Burt linked to a post on the Google AI Blog that in turn links to a GitHub repo for a set of code components called ML-fairness-gym. The resource lets developers build a simulation to explore potential long-term impacts of a machine learning decision system — such as one that would decide who gets a loan and who doesn’t.

In several cases, long-term analysis via simulations showed adverse unintended consequences that arose from decisions made by ML. These are detailed in a paper by Google researchers. We can see that determining the true outcomes of use of AI systems is not just a matter of feeding in the data and getting a reliable model to churn out yes/no decisions for a firm.
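
The flavor of such a simulation can be sketched in a few lines. This is a bare-bones illustration of the idea, not the ML-fairness-gym API: apply one approval threshold to two hypothetical groups with different score distributions and watch the gap compound over repeated lending rounds.

```python
import numpy as np

rng = np.random.default_rng(0)
THRESHOLD = 600   # single credit-score cutoff applied to everyone
rounds = 5

# Two hypothetical groups whose starting score distributions differ.
scores = {"group_a": rng.normal(630, 50, 1000),
          "group_b": rng.normal(600, 50, 1000)}

for r in range(1, rounds + 1):
    for group, s in scores.items():
        approved = s >= THRESHOLD
        # Simplified feedback loop: approved applicants build credit (+10),
        # denied applicants slowly lose ground (-5).
        s += np.where(approved, 10, -5)
    rates = {g: round(float((s >= THRESHOLD).mean()), 2) for g, s in scores.items()}
    print(f"round {r}: approval rates {rates}")
```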

It makes me wonder about all the cheerleading and hype around “business solutions” offered by large firms such as Deloitte. Have those systems been tested for their long-term effects? Is there any guarantee of fairness toward the people whose lives will be affected by the AI system’s decisions?

And what is “fair,” anyway? Burt points out that statistical methods used to detect a disparate impact depend on human decisions about “what ‘fairness’ should mean in the context of each specific use case” — and also how to measure fairness.
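
One long-established statistical check, drawn from U.S. employment law, is the “four-fifths rule”: compare selection rates across groups and flag a ratio below 0.8. A minimal sketch with made-up loan decisions (the data and the context are illustrative):

```python
# Made-up loan decisions (1 = approved) for two demographic groups.
group_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 80% approved
group_b = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 40% approved

rate_a = sum(group_a) / len(group_a)
rate_b = sum(group_b) / len(group_b)

# Disparate impact ratio: the lower group's rate over the higher group's.
ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
print(f"approval rates: {rate_a:.0%} vs {rate_b:.0%}, ratio = {ratio:.2f}")
if ratio < 0.8:
    print("Fails the four-fifths rule: potential disparate impact.")
```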

The same applies to the law — not only in how it is written but also in how it is interpreted. Humans write the laws, and humans sit in judgment. However, legal standards are long established and can be used to place requirements on companies that produce, deploy, and use AI systems, Burt suggests.

  • Companies must “carefully monitor and document all their attempts to reduce algorithmic unfairness.”
  • They must also “generate clear, good faith justifications for using the models” that are at the heart of the AI systems they develop, use, or sell.

If these suggested standards were applied in a legal context, it could be shown whether a company had employed due diligence and acted responsibly. If the standards were written into law, companies that deploy unfair and discriminatory AI systems could be held liable and face penalties.


Comment moderation as a machine learning case study

Continuing my summary of the lessons in Introduction to Machine Learning from the Google News Initiative, today I’m looking at Lesson 5 of 8, “Training your Machine Learning model.” Previous lessons were covered here and here.

Now we get into the real “how it works” details — but still without looking at any code or computer languages.

The “lesson” (actually just a text) covers a common case for news organizations: comment moderation. If you permit people to comment on articles on your site, machine learning can be used to identify offensive comments and flag them so that human editors can review them.

With supervised learning (one of three approaches included in machine learning; see previous post here), you need labeled data. In this case, that means complete comments — real ones — that have already been labeled by humans as offensive or not. You need an equally large number of both kinds of comments. Creating this dataset of comments is discussed more fully in the lesson.

You will also need to choose a machine learning algorithm. Comments are text, obviously, so you’ll select among the existing algorithms that process language (rather than those that handle images and video). There are many from which to choose. As the lesson comes from Google, it suggests you use a Google algorithm.

In all AI courses and training modules I’ve looked at, this step is boiled down to “Here, we’ll use this one,” without providing a comparison of the options available. This is something I would expect an experienced ML practitioner to be able to explain — why are they using X algorithm instead of Y algorithm for this particular job? Certainly there are reasons why one text-analysis algorithm might be better for analyzing comments on news articles than another one.

What is the algorithm doing? It is creating and refining a model. The more accurate the final model is, the better it will be at predicting whether a comment is offensive. Note that the model doesn’t actually know anything. It is a computer’s representation of a “world” of comments in which some — with particular features or attributes perceived in the training data — are rated as offensive, and others — which lack a sufficient quantity of those features or attributes — are rated as not likely to be offensive.

The lesson goes on to discuss false positives and false negatives, which are possibly unavoidable — but the fewer, the better. We especially want to eliminate false negatives, which are offensive comments not flagged by the system.
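
Without naming Google’s particular algorithm, here is a minimal scikit-learn sketch of the pipeline the lesson describes, using a handful of made-up labeled comments (a real system would need many thousands) and ending with the confusion matrix that exposes false positives and false negatives:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up labeled comments: 1 = offensive, 0 = acceptable.
comments = [
    "great reporting, thank you", "this explains a lot, nice work",
    "you people are idiots", "what garbage, the writer should be fired",
    "interesting take on the budget", "go back where you came from",
    "I disagree but appreciate the detail", "shut up, nobody wants you here",
]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, stratify=labels, random_state=0)

# Turn text into word-frequency features, then fit a simple classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
# Rows = actual class, columns = predicted class.
# The bottom-left cell counts false negatives: offensive comments missed.
print(confusion_matrix(y_test, predictions, labels=[0, 1]))
```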

“The most common reason for bias creeping in is when your training data isn’t truly representative of the population that your model is making predictions on.”

—Lesson 6, Bias in Machine Learning

Lesson 6 in the course covers bias in machine learning. A quick way to understand how ML systems come to be biased is to consider the comment-moderation example above. What if the labeled data (real comments) included a lot of comments offensive to women — but all of the labels were created by a team of men, with no women on the team? Surely the men would miss some offensive comments that women team members would have caught. The training data are flawed because a significant number of comments are labeled incorrectly.

There’s a pretty good video attached to this lesson. It’s only 2.5 minutes, and it illustrates interaction bias, latent bias, and selection bias.

Lesson 6 also includes a list of questions you should ask to help you recognize potential bias in your dataset.

It was interesting to me that the lesson omits a discussion of how the accuracy of labels is really just as important as having representative data for training and testing in supervised learning. This issue is covered in ImageNet and labels for data, an earlier post here.


Racial and gender bias in AI

Different AI systems do different things when they attempt to identify humans. Everyone has heard about face recognition (a.k.a. facial recognition), which you might expect would return a name and other personal data about a person whose face is “seen” by a camera.

No, not always.

A system that analyzes human faces might simply try to return information about the person that you or I would tag in our minds when we see a stranger. The person’s gender, for example. That’s relatively easy to do most of the time for most humans — but it turns out to be tricky for machines.

Machines often get it wrong when trying to identify the gender of a trans person. But machines also misidentify the gender of people of color. In particular, they have a big problem recognizing Black women as women.

A short and good article about this ran in Time magazine in 2019, and the accompanying video is well worth watching. It shows various face recognition software systems at work.

Another serious problem concerns differentiating among people of Asian descent. When apartment buildings and other housing developments have installed face recognition as a security system — to open for residents and stay locked for others — Asian residents can find themselves locked out of their own homes. The doors can also open for Asian people who don’t live there.

You can find a lot of articles about this widespread and very serious problem with AI technology, including the deservedly famous mug shots test by the American Civil Liberties Union.

“While it is usually incorrect to make statements across algorithms, we found empirical evidence for the existence of demographic differentials in the majority of the face recognition algorithms we studied.”

—Patrick Grother, NIST computer scientist

So how does this happen? How do companies with almost infinite resources deploy products that are so seriously — and even dangerously — flawed?

Yesterday I wrote a little about training data for object-detection AI. To identify any image, or any part of an image, an AI system is usually trained on an immense set of images. If you want to identify human faces, you feed the system hundreds of thousands, or even millions, of pictures of human faces. If you’re using supervised learning to train the system, the images are labeled: Man, woman. Black, white. Old, young. Convicted criminal. Sex offender. Psychopath.

Who is in the images? How are those images labeled?

This is part of how the whole thing goes sideways. There’s more to it, though. Before a system is marketed or released to the public, its developers are going to test it. They’re going to test the hell out of it. Compare this with an AI developed to play a particular game, such as Go or chess: after the system has been trained, you test it by having it play and seeing whether it can win — consistently. So when developers create a face recognition system, test it extensively, and then declare it ready for the public and for commercial use, ask yourself how they missed these glaring flaws.

Ask yourself how they missed the fact that the system can’t differentiate between various Asian faces.

Ask yourself how they missed the fact that the system identifies Black women as men.
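
A small sketch of the disaggregated evaluation that surfaces exactly these failures (the predictions are made up, not any vendor’s results): compute accuracy per demographic group instead of one overall number.

```python
import pandas as pd

# Made-up test results from a hypothetical gender classifier.
results = pd.DataFrame({
    "group":   ["white_male"] * 50 + ["white_female"] * 50
             + ["black_male"] * 50 + ["black_female"] * 50,
    "correct": [1] * 50            + [1] * 46 + [0] * 4
             + [1] * 44 + [0] * 6  + [1] * 33 + [0] * 17,
})

# A single overall accuracy number hides the problem...
print("overall accuracy:", results["correct"].mean())

# ...while the per-group breakdown makes it obvious.
print(results.groupby("group")["correct"].mean())
```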

Fortunately, in just the past year these flaws have received so much attention that a number of large firms (Amazon, IBM, Microsoft) have pulled back on commercial deployments of face recognition technologies. Whether they will be able to build more trustworthy systems remains to be seen.

More about bias in face recognition systems:

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.
