How will AI affect journalism in 2024?

News organizations will use AI technologies to increase efficiency, but not to churn out generic, easily reproducible content, according to Journalism, Media, and Technology Trends and Predictions 2024, a new report published by the Reuters Institute at Oxford. You can read a summary or download the PDF here.

The 41-page report is based in part on a survey of digital news leaders in 56 countries and territories, 314 of whom responded in the final months of 2023.

Sixteen percent of respondents said their organization “already [has] a designated AI leader in the newsroom” and 24 percent said they are “working on it.”

“Forward-thinking news organisations will be looking to build unique content and experiences that can’t be easily replicated by AI. These might include curating live news, deep analysis, human experiences that build connection, as well as longer audio and video formats that might be more defensible than text.”

—Journalism, Media, and Technology Trends and Predictions 2024, page 39

Concerns about deepfakes, and about false information generated and spread by AI-enabled bots, are especially strong around national election campaigns, but despite promises of vigilance from the big platforms (Google, Meta, TikTok), no one knows how bad it will be or whether the effects will be serious. The EU remains the only region with legal requirements for platform oversight and accountability (the Digital Services Act). Labeling AI-generated content and deploying fact-checking routines are two defenses that might not be adequate for the task — for example, news audiences might simply ignore labels.

Full Fact is a U.K.-based fact-checking organization that is using various “AI techniques.” The Newsroom is an AI startup developing tools for journalists and news audiences.

Current and future uses of AI in newsrooms, ranked by importance by respondents to the survey:

  • “Back-end automation tasks (56%) such as transcription and copyediting … a top priority”;
  • “Recommender systems (37%)”;
  • “Creation of content (28%) with human oversight … e.g. summaries, headlines”;
  • “Commercial uses (27%)”;
  • “Coding (25%), where some publishers say they have seen very large productivity gains”;
  • “Newsgathering (22%) where AI may be used to support investigations or in fact-checking and verification.”

Some news organizations are using image-generation tools such as Midjourney “to create graphic illustrations around subjects like technology and cooking.”

Back-end automation and coding are considered relatively low-risk applications of AI, but content creation and newsgathering are seen as higher risk, potentially threatening the reputation of the news organization.


What journalists get wrong about AI

Sayash Kapoor and Arvind Narayanan are writing a book about AI, titled AI Snake Oil. They’ve been writing a Substack newsletter about it, and on Sept. 30 they published a post titled Eighteen pitfalls to beware of in AI journalism. Narayanan is a computer science professor at Princeton, and Kapoor is a former software engineer at Facebook and a current Ph.D. student at Princeton.

“There is seldom enough space in a news article to explain how performance numbers like accuracy are calculated for a given application or what they represent. Including numbers like ‘90% accuracy’ in the body of the article without specifying how these numbers are calculated can misinform readers …”

—Kapoor and Narayanan

They made a checklist, in PDF format, to accompany the post. The list is based on their analysis of more than 50 articles from five major publications: The New York Times, CNN, the Financial Times, TechCrunch, and VentureBeat. In the Substack post, they linked to three annotated examples — one each from The New York Times, CNN, and the Financial Times. The annotated articles are quite interesting and could form the basis for great discussions in a journalism class. (Note, in the checklist, the authors over-rely on one article from The New York Times for examples.)

Their goals: The public should be able to detect hype about AI when it appears in the media, and their list of pitfalls could “help journalists avoid them.”

“News articles often cite academic studies to substantiate their claims. Unfortunately, there is often a gap between the claims made based on an academic study and what the study reports.”

—Kapoor and Narayanan

Kapoor and Narayanan have been paying attention to the conversations around journalism and AI. One example is their link to How to report effectively on artificial intelligence, a post published in 2021 by the JournalismAI group at the London School of Economics and Political Science.

I was pleased to read this post because it neatly categorizes and defines many things that have been bothering me in news coverage of AI breakthroughs, products, and even ethical concerns.

  • There’s far too much conflation of AI abilities and human abilities. Words like learning, thinking, guessing, and identifying all serve to obscure computational processes that are only mildly similar to what happens in human brains.
  • “Claims about AI tools that are speculative, sensational, or incorrect”: I am continually questioning claims I see reported uncritically in the news media, with seemingly no effort made to check and verify claims made by vendors and others with vested interests. This is particularly bad with claims about future potential — every step forward nowadays is implied to be leading to machines with human-level intelligence.
  • “Limitations not addressed”: Again, this is slipshod reporting, just taking what the company says about its products (or researchers about their research) and not getting assessments from disinterested parties or critics. Every reporter covering AI should have a fat file of critical sources to consult on every story — people who can comment on ethics, labor practices, transparency, and AI safety.

Another neat thing about Kapoor and Narayanan’s checklist: Journalism and mass communication researchers could adapt it for use as a coding instrument for analysis of news coverage of AI.


Exploring subfields of AI relevant to journalism

Many academic papers about artificial intelligence are focused on a narrow domain or one specific application. In trying to get a grip on the uses of AI in the field of journalism, we often find that one paper bears no similarity to the next, which makes it hard to talk about AI in journalism comprehensively or in a general sense. We also find that large sections of some papers in this area are more speculative than practical, discussing what could be rather than what exists today.

In this post I will summarize two papers that are focused on uses of AI in journalism that do actually exist. These two papers also do a good job of putting into context the disparate applications relevant to journalism work and journalism products.

In the first paper, Artificial Intelligence in News Media: Current Perceptions and Future Outlook (2022; open access), the authors examined 102 case studies from a dataset compiled at JournalismAI, an international initiative based at the London School of Economics. They classified the projects according to seven “major areas” or subfields of AI:

  1. Machine learning
  2. Natural language processing (NLP)
  3. Speech recognition
  4. Expert systems
  5. Planning, scheduling, and optimization
  6. Robotics
  7. Computer vision

I could quibble with the categories, especially as systems in categories 2, 3, 5, 6 and 7 often rely on machine learning. The authors did acknowledge that planning, scheduling, and optimization “is commonly applied in conjunction with machine learning.” They also admit that some of the projects incorporated more than one subfield of AI.

According to the authors, three subfields were missing altogether from the journalism projects in their dataset: expert systems, speech recognition, and robotics.

Above: Screenshot of the JournalismAI dataset (partial), showing 12 rows with topic tags

Use of machine learning was common in projects related to increasing users’ engagement with news apps or websites, and in efforts to retain subscribers. These projects included recommendation engines and flexible paywalls “that bend to the individual reader or predict subscription cancellation.”

Uses of computer vision were quite varied. Several projects used it with satellite imagery to detect changes over time. The New York Times used computer vision algorithms for the 2020 Summer Olympics to analyze and compare movements of athletes in events such as gymnastics. Reuters used image recognition to enhance in-house searches of the company’s vast video archive (note, speech-to-text transcripts for video were also part of this project). More than one news organization is using computer vision to detect fake images.

Interestingly, automated stories were categorized as planning, scheduling, and optimization rather than as NLP. It’s true that the day-to-day automation of various reports on financial statements, sporting events, real estate sales, etc., across a range of news organizations is handled with story templates — but the language in each story is adjusted algorithmically, and those algorithms have come at least in part from NLP.
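
To make the template-plus-algorithmic-variation idea concrete, here is a minimal sketch of how template-driven story generation works. Everything in it (the data fields, the wording rules) is invented for illustration; production newsroom systems are far more elaborate.

```python
# Minimal sketch of template-driven story generation. The data fields
# and wording rules are invented; real newsroom systems are far more
# elaborate.

def earnings_story(company: str, revenue_m: float, change_pct: float) -> str:
    """Generate a one-sentence earnings brief from structured data."""
    # The verb is chosen algorithmically from the data; this rule-based
    # language variation is the part that descends from NLP research.
    if change_pct >= 10:
        verb = "surged"
    elif change_pct > 0:
        verb = "rose"
    else:
        verb = "fell"
    return (
        f"{company} reported revenue of ${revenue_m:,.0f} million, "
        f"which {verb} {abs(change_pct):.1f}% from the same quarter last year."
    )

print(earnings_story("Acme Corp", 1234, 12.5))
# Acme Corp reported revenue of $1,234 million, which surged 12.5% ...
```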

The authors noted that within their limited sample, few projects involved social bots. “Most of the bots that we researched were news bots that write stories,” they said. It is true that “social bots such as Twitter bots do not necessarily use AI” — but in that case, the bot is going to use a rule-based system or de facto expert system, a category of AI the authors said was missing from the dataset.

Most of the projects in the dataset relied on external funding, and mainly from one source: Google’s Digital News Innovation Fund grants.

One thing I like about this research is that it does not conflate artificial intelligence and data journalism — which in my view is a serious flaw in much of the literature about AI in journalism. You might notice that in the foregoing summary, the only instances of AI contributing information to stories involved use of satellite imagery.

The authors of the article discussed above are Mathias-Felipe de-Lima-Santos of the University of Navarra, Spain, and Wilson Ceron of the Federal University of São Paulo, Brazil.

What about using AI as part of data journalism?

In an article published in 2019, Making Artificial Intelligence Work for Investigative Journalism, Jonathan Stray (now a visiting scholar at the UC Berkeley Center for Human-Compatible AI) authoritatively debunked the myth that data journalists are routinely using AI (or soon will be), and he explained why. Two very simple reasons bear mention at the outset:

  • Most journalism investigations are unique. Developing an AI solution or tool to aid in one investigation rarely justifies the time, expense, and expertise required, because the tool likely would not be usable in any other investigation.
  • Journalists’ salaries are far lower than the salaries of AI developers and data scientists. A news organization won’t hire AI experts to develop systems to aid in journalism investigations.

Data journalists do use a number of digital tools for cleaning, analyzing, and visualizing data, but it must be said that almost all of these tools are not part of what is called artificial intelligence. Spreadsheets, for example, are essential in data journalism but a far cry from AI. Stray points to other tools — for extracting information from digitized documents, or finding and eliminating duplicate records in datasets (e.g. with Dedupe.io). The line gets fuzzy when the journalist needs to train the tool so that it learns the particulars of the given dataset — by definition, that is machine learning. This training of an already-built tool, however, is immensely simpler than the thousands or even millions of training epochs overseen by computer scientists who develop new AI systems.
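
To see where that line sits, here is a bare-bones duplicate-detection sketch using only Python’s standard library. There is no learning in it at all; a tool like Dedupe.io adds the learning layer by asking the journalist to label example pairs and fitting a decision boundary from them. The records and the 0.6 threshold below are invented.

```python
# Duplicate detection with no machine learning: pure string similarity
# from Python's standard library. A tool like Dedupe.io adds a learning
# layer on top of comparisons like this one.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    "Smith, John, 123 Main St, Springfield",
    "John Smith, 123 Main Street, Springfield",
    "Jane Doe, 42 Oak Ave, Shelbyville",
]

def similarity(a: str, b: str) -> float:
    # Character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# 0.6 is an arbitrary cutoff; a learned model would replace this rule.
for a, b in combinations(records, 2):
    if similarity(a, b) > 0.6:
        print(f"Possible duplicate:\n  {a}\n  {b}")
```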

Stray clarifies his focus as “the application of AI theory and methods to problems that are unique to investigative reporting, or at least unsolved elsewhere.” He identifies these categories for successful uses of AI in journalism so far:

  • Document classification
  • Language analysis
  • Breaking news detection
  • Lead generation
  • Data cleaning

Stray’s journalism examples are cases covered previously. He acknowledges that the “same small set of examples is repeatedly discussed at data journalism conferences” and this “suggests that there are a relatively small number of cases in total” (page 1080).

Supervised document classification is a method for sorting a large number of documents into groups. For investigative journalists, this separates documents likely to be useful from others that are far less likely to be useful; human examination of the “likely” group is still needed.
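
What this looks like in practice: a minimal sketch with scikit-learn (my choice of library for illustration; the handful of “documents” and labels are invented, and a real investigation would start with hundreds of hand-labeled examples). Human review of everything the model marks as useful remains the last step.

```python
# Minimal sketch of supervised document classification with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "complaint alleges improper billing by the clinic",
    "routine renewal of business license",
    "patient reports severe complications after procedure",
    "annual financial statement filed on time",
]
train_labels = ["useful", "not_useful", "useful", "not_useful"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Sort a new batch; humans still review everything marked "useful".
new_docs = ["clinic faces new complaint over billing practices"]
print(model.predict(new_docs))  # e.g. ['useful']
```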

By language analysis, Stray means use of natural language processing (NLP) techniques. These include unsupervised methods of sorting documents (or forum comments, social media posts, emails) into groups based on similarity (topic modeling, clustering), or determining sentiment (positive/negative, for/against, toxic/nontoxic), or other criteria. Language models, for example, can identify “named entities” such as people or “nationalities or religious or political groups” (NORP) or companies.
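
For the named-entity piece, spaCy’s English models use exactly this NORP label. A small sketch, assuming the en_core_web_sm model has been installed:

```python
# Named-entity recognition with spaCy. Assumes:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Senator Jane Doe met Canadian officials at Acme Corp headquarters.")

for ent in doc.ents:
    # Labels include PERSON, ORG, GPE, and NORP ("nationalities or
    # religious or political groups").
    print(ent.text, ent.label_)
# e.g. Jane Doe PERSON | Canadian NORP | Acme Corp ORG
```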

Breaking news detection: The standard example is the Reuters Tracer system, which monitors Twitter and alerts journalists to news events. The advantage is getting a head start of as much as 18 minutes over other news organizations that will cover the same event. I am not sure whether any other organization has ever developed a comparable system.

Lead generation is not exactly story discovery but more like “Here’s something you might want to investigate further.” It might pan out; it might not. Stray’s examples here are a bit weak, in my opinion, but the one for using face recognition to detect members of the U.S. Congress in photos uploaded by the public does set the imagination running.

Data cleaning is always necessary, usually tedious, and often takes more time than any other part of the reporting process. It makes me laugh when I hear data-science educators talk about giving their students nice, clean datasets, because real data in the real world is always dirty, and you cannot analyze it properly until it has been cleaned. Data journalists talk about this incessantly, and about reliable techniques not only for cleaning data but also for documenting every step of the process. Stray does not provide examples of using AI for data cleaning, but he devotes a portion of his article to this and data “wrangling” as areas he deems most suitable for AI solutions in the future.
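
For contrast, here is what a slice of routine, non-AI data cleaning looks like with pandas. The messy values are invented but typical of real public records.

```python
# Routine (non-AI) data cleaning with pandas; the messy values are
# invented but typical of public records.
import pandas as pd

df = pd.DataFrame({
    "agency": [" Dept. of Health", "dept of health", "Parks & Rec "],
    "amount": ["$1,200", "950", "$2,000.50"],
})

# Normalize whitespace, case, and punctuation so identical agencies
# group together.
df["agency"] = (
    df["agency"].str.strip().str.lower().str.replace(".", "", regex=False)
)

# Strip currency formatting so the column can be summed.
df["amount"] = (
    df["amount"].str.replace("$", "", regex=False)
                .str.replace(",", "", regex=False)
                .astype(float)
)
print(df)
```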

When documents are extremely diverse in format and/or structure (e.g. because they come from different entities and/or were created for different purposes), it can be very difficult to extract data from them in any useful way (for example: names of people, street addresses, criminal charges) unless humans do it by hand. Stray calls it “a challenging research problem” (page 1090). Another challenge is linking disparate documents to one another, for which the ultimate case to date is the Panama Papers. Network analysis can be used (after named entities are extracted), but linkages will still need to be checked by humans.

Stray also (quite interestingly) wrote about what would be needed if AI systems were to determine newsworthiness — the elusive quality that all journalists swear they can recognize (much like Supreme Court Justice Potter Stewart’s famous claim about obscenity).

Conclusions

From my reading so far, I think there are two major applications of AI in the journalism field actually operating at present: production of automated news stories (within limited frameworks), and purpose-built systems for manipulating the content choices offered to users (recommendations and personalization). Automated stories or “robot journalism” have been around for at least seven or eight years now and have been written about extensively.

I’ve read (elsewhere) about efforts to catalog and mine gigantic archives of both video and photographs, and even to produce fully automated videos with machine-generated voiceover narration, but I think those are corporate strategies to extract value from existing resources rather than something intended to produce new journalism in the public interest. I also think those efforts might be taking place mainly outside the journalism area by now.

One thing that’s clear: The typical needs of an investigative journalism project (the highest-cost and possibly most important kind of journalism) are not easily solved by AI, even today. In spite of great advances in NLP, giant collections of documents must still be acquired piecemeal by humans, and while NLP can help with some parts of extracting valuable information from documents, in the end these stories require a great deal of human labor and time.

Another area not addressed in either of the two articles discussed here is verification and fact-checking. The ClaimReview Project is one approach to this, but it is powered by human fact-checkers, not AI. See also the conference paper The Quest to Automate Fact-Checking, presented at the 2015 Computation + Journalism Symposium.


Research scholarship about AI and journalism

I’ve been reading a lot about artificial intelligence and journalism lately. Yesterday I read two studies that examine the scholarly literature in this area. Both were published in 2021.

The first, Artificial intelligence and journalism: Systematic review of scientific production in Web of Science and Scopus (2008-2019), examined 209 articles published from January 2008 to December 2019. The researchers used these search terms: robot journalism, automated journalism, algorithm journalism, computational journalism, augmented journalism, artificial journalism, and high tech journalism. They also searched for simply journalism and artificial intelligence.

From the 209 articles, they identified these additional themes: audience, authorship, big data, chatbots, credibility, data journalism, ethics, events detection, fact-checking, online comments, personalization, production, social media, technologies, and theory.

The number of articles published per year has increased sharply since 2015 (as you might expect). Sixty-one of the items were published in 2019, the final year in this study. The researchers also counted countries, institutions, citations, and authors, and looked at collaborations, noting especially that collaboration among authors from different countries has been rare. One-third of the articles are from the U.S., while Germany, Ireland, Spain, and the U.K. combined account for more than one-third. The journal Digital Journalism had published the most articles (36).

Chart above by Calvo Rubio & Ufarte Ruiz (2021) shows the number of publications per year, 2008–2019.

Keywords were supplied for 80 percent of the publications. Analysis identified more than 1,000 distinct keywords. These were the most common, in order starting with most-used:

  1. Computational journalism
  2. Automated journalism
  3. Robot journalism
  4. Journalism
  5. Artificial intelligence
  6. Data journalism
  7. Algorithms
  8. Automation
  9. Algorithmic journalism
  10. Social media
  11. Big data

Other commonly seen concepts included: bots, fact checking, innovation, and natural language generation (NLG). Verification and personalized content also appeared in several articles.

The five most-cited articles (with more than 100 citations each) are from 2010 through 2015. The authors’ names will not surprise you if you have been following this field of study: C. W. Anderson, Mark Coddington, Nicholas Diakopoulos (three articles; two with co-authors).

The authors of the study described above are Luis Mauricio Calvo Rubio and María José Ufarte Ruiz, both of Universidad de Castilla-La Mancha.

Another study of research on AI and journalism

The second study, The application of artificial intelligence to journalism: An analysis of academic production, did not use a specific start date, and ended with articles published in January 2021. The search string used:

"robot journalism" OR "computational journalism" OR "automated journalism" OR ("artificial intelligence" AND "journalism") OR ("artificial intelligence" AND "media")

After eliminating irrelevant articles, 358 were included for review, significantly more than the 209 items in the earlier study. In covering the entire year of 2020, which was not included in the earlier study, these researchers found a drop in the number of publications. This might be attributed to the global pandemic — although many articles for publication in 2020 would have been submitted in 2019, the processes of peer review and editorial oversight could well have been slowed by the burdens of that first pandemic year. For 2019, 74 articles were found. For 2020, the number was 43.

Like the other study, this one found a significant increase in relevant publications after 2015, but not the same consistently upward trajectory. Less than 13 percent of the items were published before 2015.

As in the other study, here too more than two-thirds of the articles came from Europe and North America. Only articles published in English were included, so this might not accurately represent all the research that exists in this topic area.

Multidisciplinary work “almost always comes from experts working in the same country. Eighty-six percent of the texts reviewed are written by authors whose universities are in the same country, and very often these authors belong to the same university” (page 5).

Six researchers accounted for 15 percent of the articles in the sample (in order by number of publications): Nicholas Diakopoulos, Neil Thurman, Seth C. Lewis, Ester Appelgren, Eddy Borges-Rey, and Meredith Broussard. This was interesting to me, as I am not familiar with work by Appelgren or Thurman, while I have read all the others. (Both Appelgren and Thurman have published a lot about data journalism.)

Note, only those six authors have published four or more articles on this topic (within the 358 texts reviewed).

The researchers noted their surprise that so many of the items were “works of an essayistic nature, without either a well-defined methodology or precise research techniques.” Many articles “reflect generalist, introductory, or exploratory approaches.” In more recent publications, they noted “more specific research, with more consistent objectives, methodologies, or developments — and therefore closer to the orthodox research articles usually published in academic journals” (page 6). Qualitative methods predominate.

Based on their analysis of the 358 items, the researchers identified three principal areas for “application of artificial intelligence in journalism”: data journalism, robotic (or automated) news writing, and news verification (including “fake news”). It’s important to note, I think, that applied AI in journalism is not going to include uses of AI by the social media platforms (or search engines), which affect how news is distributed and shared.

Chart above by Parratt-Fernández, Mayoral-Sánchez, & Mera-Fernández (2021) shows areas of use of AI and the number of articles that included each area as a primary, secondary, or tertiary topic.

Those three principal areas also exclude what is often called personalization, or news recommendation engines, which are applications of AI currently used by many news organizations. Distinct from the ordering and selection of news content by platforms (e.g. Facebook), this technology determines what individual users see in the apps or websites of the news organizations themselves, e.g. Recommended for You: How Newspapers Normalise Algorithmic News Recommendation to Fit Their Gatekeeping Role (2021).

Other prominent topic areas included “the impact of new AI technologies on the writing of journalistic texts” (I’m not sure how that differs from robotic news writing; maybe chatbots? SEO and clickbait?), and “the use of tools that allow information to be extracted and processed — e.g. from social networks — enabling journalists to discover a news event as quickly as possible” (page 7). The latter topic is also called “social media listening” (but not in this research paper). For example, when numerous mentions of an event such as an explosion, or a protest, or police action, start popping up in relation to one geographic location, an AI-trained model can recognize that it’s an unusual occurrence and send an alert to the newsroom.
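
A toy version of that alerting logic, with invented counts and an arbitrary threshold (a real system would use a trained model rather than a fixed rule):

```python
# Hypothetical "social media listening" alert: compare the current
# hour's mention count for one location against its recent history and
# flag unusual spikes. All numbers here are invented.
from statistics import mean, stdev

hourly_mentions = [4, 6, 5, 3, 7, 5, 4, 6]  # past 8 hours, one location
current = 42                                 # mentions this hour

mu, sigma = mean(hourly_mentions), stdev(hourly_mentions)
z = (current - mu) / sigma

if z > 3:  # arbitrary threshold; a trained model would replace this rule
    print(f"ALERT: unusual activity (z = {z:.1f}) - notify the newsroom")
```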

The amount of academic research on data journalism was high from 2015 to 2017, but it has decreased since then and “experienced a considerable decline in 2020,” the authors noted. It’s kind of funny how data journalism often gets lumped in with artificial intelligence; much of data journalism has absolutely nothing to do with AI.

Ethical issues related to artificial intelligence and journalism have been neglected, according to this study’s findings. “The potential for development in this area is still enormous,” the authors said (page 8).

These researchers anticipate a need for new research on the professional routines and roles of journalists, assuming these will be affected by an increasing integration of AI systems into newswork. These changes will have an impact on journalist training requirements and university curricula as well.

Without falling into hyperbole, the authors speculated that AI represents “the next phase of technological revolution” in an industry that has been successively transformed by computerized page design and printing, internet news distribution, the rise of social media platforms, and viral disinformation campaigns and fake news (page 9).

The authors of the study described above are Sonia Parratt-Fernández, Javier Mayoral-Sánchez, and Montse Mera-Fernández, all of Universidad Complutense de Madrid.


Journalists reporting about AI

In the latest JournalismAI newsletter, a list of recommendations called “Reporting on AI Effectively” shares wisdom from several journalists who are reporting about a range of artificial intelligence and machine learning topics. The advice is grouped under these headings:

  • Build a solid foundation
  • Beat the hype
  • Complicate the narrative
  • Be compassionate, but embrace critical thinking

Karen Hao, senior AI editor at MIT Technology Review — whose articles I read all the time! — points out that to really educate yourself about AI, you’re going to need to read some of the research papers in the field. She also recommends YouTube as a resource for learning about AI — and I have to agree. I’ve never used YouTube so much to learn about a topic before I began studying AI.

The post also offers good advice about questions a reporter should ask about AI research and new developments in the field.


Journalists use machine learning to examine medical device records

Some investigations in the public interest require journalists to search through large quantities of official documents. Often the set of documents is very diverse — that is, the format, structure, and even language of the documents might vary greatly.

One of the more impressive investigations I know of is the ongoing Implant Files project, conducted originally by 250 journalists in 36 countries. The purpose: To examine how medical devices (specifically, those implanted into human bodies) are “tested, approved, marketed, and monitored” (source). I’ve heard this project discussed at conferences, and I’m full of admiration for the editors and reporters involved, led by the International Consortium of Investigative Journalists (ICIJ).

At the heart of the investigation, with its first results published in 2018, was “an analysis of more than 8 million device-related health records, including death and injury reports and recalls.”

“The entire process involved text mining, clustering, feature selection, association rules and classification algorithms to identify events not always described consistently in different parts of the data.”

—How ICIJ Used Machine Learning to Help Find Medical Device Issues

These implanted devices — hip replacements, defibrillators, breast implants, intraocular lenses, and more — are used all around the world. When something goes wrong and a product recall is issued, however, the news might not spread to all the locations where the devices continue to be used in new surgeries for new patients. Moreover, people who already have a faulty implant might not be notified. This is why a global investigation was sorely needed.

Above: An ICIJ video summarizes how patients who receive implants are left unprotected

In 2018, ICIJ shared “a publicly searchable database of more than 70,000 recalls and safety warnings in 11 countries.” The project has continued since then, and the database now contains “more than 120,000 recalls, safety alerts and field safety notices” for medical devices. Throughout 2019, thousands more records were added.

A December 2018 post details the team’s data methodology for the Implant Files. First, journalists had to get the records — and often, their legitimate requests for public records were denied. Of the 8 million device-related records they managed to obtain, 5.4 million came from the U.S. Food and Drug Administration.

The records “describe cases where a device is suspected to have caused or contributed to a serious injury or death or has experienced a malfunction that would likely lead to harm if it were to recur.”

The value in these records was in the connections — connections among cases, and connections among devices. The ICIJ analysis concluded that “devices that broke, misfired, corroded, ruptured or otherwise malfunctioned after implantation or use were linked to more than 1.7 million injuries and nearly 83,000 deaths” in just one decade.

To identify the records that involved a patient’s death, it was necessary for humans to determine various terms and phrasing used instead of the word “death” in the documents. Eventually they developed “a set of more than 3,400 key phrases” that were used to train the machine learning system. After using that model to extract the relevant records, it was necessary to run them through another algorithm configured to determine whether the implant device had contributed to the death.
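
ICIJ has not published that pipeline here, but the general pattern described (key phrases bootstrap the labels; a classifier then generalizes beyond exact matches) can be sketched briefly. Everything below, from the phrases to the reports, is invented for illustration and is not ICIJ’s actual code.

```python
# Sketch of the pattern described above (not ICIJ's actual pipeline):
# seed phrases produce initial labels, then a text classifier learns to
# catch phrasings the phrase list misses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

death_phrases = ["patient expired", "pronounced dead", "fatal outcome"]

reports = [
    "patient expired two days after device implantation",
    "device malfunctioned; patient pronounced dead on arrival",
    "minor irritation at implant site, resolved without treatment",
    "battery depletion noted at routine follow-up",
]

# Step 1: bootstrap labels from the key-phrase list.
labels = [
    "death" if any(p in r for p in death_phrases) else "other"
    for r in reports
]

# Step 2: train a classifier to generalize beyond exact phrase matches.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reports, labels)

# "death" does not appear in the seed phrases, but "device" and
# "patient" occur only in death-labeled reports here.
print(model.predict(["device malfunction led to patient death"]))
```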


Comment moderation as a machine learning case study

Continuing my summary of the lessons in Introduction to Machine Learning from the Google News Initiative, today I’m looking at Lesson 5 of 8, “Training your Machine Learning model.” Previous lessons were covered here and here.

Now we get into the real “how it works” details — but still without looking at any code or computer languages.

The “lesson” (actually just text, not a video) covers a common case for news organizations: comment moderation. If you permit people to comment on articles on your site, machine learning can be used to identify offensive comments and flag them so that human editors can review them.

With supervised learning (one of three approaches included in machine learning; see previous post here), you need labeled data. In this case, that means complete comments — real ones — that have already been labeled by humans as offensive or not. You need a large and roughly equal number of both kinds of comments. Creating this dataset of comments is discussed more fully in the lesson.

You will also need to choose a machine learning algorithm. Comments are text, obviously, so you’ll select among the existing algorithms that process language (rather than those that handle images and video). There are many from which to choose. As the lesson comes from Google, it suggests you use a Google algorithm.

In all AI courses and training modules I’ve looked at, this step is boiled down to “Here, we’ll use this one,” without providing a comparison of the options available. This is something I would expect an experienced ML practitioner to be able to explain — why are they using X algorithm instead of Y algorithm for this particular job? Certainly there are reasons why one text-analysis algorithm might be better for analyzing comments on news articles than another one.

What is the algorithm doing? It is creating and refining a model. The more accurate the final model is, the better it will be at predicting whether a comment is offensive. Note that the model doesn’t actually know anything. It is a computer’s representation of a “world” of comments in which some — with particular features or attributes perceived in the training data — are rated as offensive, and others — which lack a sufficient quantity of those features or attributes — are rated as not likely to be offensive.

The lesson goes on to discuss false positives and false negatives, which are possibly unavoidable — but the fewer, the better. We especially want to eliminate false negatives, which are offensive comments not flagged by the system.
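
Counting those errors is standard practice. A small sketch with scikit-learn, using invented labels, shows where false negatives sit in a confusion matrix:

```python
# Counting false negatives (offensive comments the model missed) with
# scikit-learn; the true/predicted labels here are invented.
from sklearn.metrics import confusion_matrix

y_true = ["offensive", "ok", "offensive", "ok", "offensive", "ok"]
y_pred = ["offensive", "ok", "ok", "ok", "offensive", "offensive"]

# Rows = true class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred, labels=["offensive", "ok"])
false_negatives = cm[0, 1]  # truly offensive, predicted "ok"
false_positives = cm[1, 0]  # truly ok, flagged as offensive
print(cm)
print("false negatives:", false_negatives, "| false positives:", false_positives)
```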

“The most common reason for bias creeping in is when your training data isn’t truly representative of the population that your model is making predictions on.”

—Lesson 6, Bias in Machine Learning

Lesson 6 in the course covers bias in machine learning. A quick way to understand how ML systems come to be biased is to consider the comment-moderation example above. What if the labeled data (real comments) included a lot of comments offensive to women — but all of the labels were created by a team of men, with no women on the team? Surely the men would miss some offensive comments that women team members would have caught. The training data are flawed because a significant number of comments are labeled incorrectly.

There’s a pretty good video attached to this lesson. It’s only 2.5 minutes, and it illustrates interaction bias, latent bias, and selection bias.

Lesson 6 also includes a list of questions you should ask to help you recognize potential bias in your dataset.

It was interesting to me that the lesson omits a discussion of how the accuracy of labels is really just as important as having representative data for training and testing in supervised learning. This issue is covered in ImageNet and labels for data, an earlier post here.


Examples of machine learning in journalism

Following on from yesterday’s post, today I looked at more lessons in Introduction to Machine Learning from the Google News Initiative. (Friday AI Fun posts will return next week.)

The separation of machine learning into three different approaches — supervised learning, unsupervised learning, and reinforcement learning — is standard (Lesson 3). In keeping with the course’s focus on journalism applications of ML, the example given for supervised learning is The Atlanta Journal-Constitution’s deservedly famous investigative story about sex abuse of patients by doctors. Supervised learning was used to sort more than 100,000 disciplinary reports on doctors.

The example of unsupervised learning is one I hadn’t seen before. It’s an investigation of short-term rentals (such as Airbnb rentals) in Austin, Texas. The investigator used locality-sensitive hashing (LSH) to group property records in a set of about 1 million documents, looking for instances of tax evasion.
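
I hadn’t seen LSH applied that way before, so here is a minimal sketch of the grouping idea using the datasketch library. This is my illustration, not the Austin investigator’s code (the lesson doesn’t say which implementation was used), and the records are invented.

```python
# Grouping near-identical records with MinHash LSH, using the
# datasketch library (pip install datasketch). Illustration only; the
# records and the 0.5 threshold are invented.
from datasketch import MinHash, MinHashLSH

records = {
    "rec1": "305 oak st unit 2 austin tx short term rental",
    "rec2": "305 oak st unit 2 austin texas short term rental",
    "rec3": "9000 ranch rd austin tx single family home",
}

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.5, num_perm=128)
hashes = {key: minhash(text) for key, text in records.items()}
for key, m in hashes.items():
    lsh.insert(key, m)

# Records whose estimated Jaccard similarity exceeds the threshold land
# in the same buckets; querying returns the likely matches.
print(lsh.query(hashes["rec1"]))  # expect rec1 and rec2, not rec3
```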

The main example given for reinforcement learning is AlphaGo (previously covered in this blog), but an example from The New York Times — How The New York Times Is Experimenting with Recommendation Algorithms — is also offered. Reinforcement learning is typically applied when a clear “reward” can be identified, which is why it’s useful in training an AI system to play a game (winning the game is a clear reward). It can also be used to train a physical robot to perform specified actions, such as pouring a liquid into a container without spilling any.

Also in Lesson 3, we find a very brief description of deep learning (it doesn’t mention layers and weights), and just a mention of neural networks.

“What you should retain from this lesson is fairly simple: Different problems require different solutions and different ML approaches to be tackled successfully.”

—Lesson 3, Different approaches to Machine Learning

Lesson 4, “How you can use Machine Learning,” might be the most useful in this set of eight lessons. Its content comes (with permission) from work done by Quartz AI Studio — specifically from the post How you’re feeling when machine learning might help, by the super-talented Jeremy B. Merrill.

The examples in this lesson are really good, so maybe you should just read it directly. You’ll learn about a variety of unusual stories that could only be told when journalists used machine learning to augment their reporting.

“Machine learning is not magic. You might even say that it can’t do anything you couldn’t do — if you just had a thousand tireless interns working for you.”

—Lesson 4, How you can use Machine Learning

(The Quartz AI Studio was created with a $250,000 grant from the Knight Foundation in 2018. For a year the group experimented, helped several news organizations produce great work, and ran a number of trainings for journalists. Then it was quietly disbanded in early 2020.)

Note (added April 4, 2022): The two links above to Quartz AI Studio content have been updated. The original domain, qz-dot-ai, was given up when, at renewal time, the price of all dot-ai domains had skyrocketed. Unfortunately, all the images have been lost, according to a personal communication from Merrill.


Uses of AI in journalism

Part of my interest in AI centers on the way it is presented in online, print and broadcast media. Another focal point for me is how journalism organizations are using AI to do journalism work.

At the London School of Economics, a project named JournalismAI mirrors my interests. In November 2019 they published a report on a survey of 71 news organizations in 32 countries. They describe the report as “an introduction to and discussion of journalism and AI.”

Above: From the JournalismAI report

Many people in journalism are aware of the use of automation in producing stories on financial reports, sports, and real estate. Other applications of AI (mostly machine learning) are less well known — and they are numerous.

Above: From page 32 of the JournalismAI report

Another resource available from JournalismAI is a collection of case studies — in the form of a Google sheet with links to write-ups about specific projects at news organizations. This list is being updated as new cases arise.

Above: From the JournalismAI case studies

It’s fascinating to open the links in the case studies and discover the innovative projects under way at so many news organizations. Journalism educators (like me) need to keep an eye on these developments to help us prepare journalism students for the future of our field.

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.
