I’m interested in applications of machine learning in journalism. This is natural, as my field is journalism. In the field of computer science, however, accolades and honors tend to favor research on new algorithms or procedures, or new network architectures. Applications are practical uses of algorithms, networks, etc., to solve real-world problems — and developing them often doesn’t garner the acclaim that researchers need to advance their careers.
Hannah Kerner, a professor and machine learning researcher at the University of Maryland, wrote about this in the MIT Technology Review. Her essay is aptly titled “Too many AI researchers think real-world problems are not relevant.”
“The first image of a black hole was produced using machine learning. The most accurate predictions of protein structures, an important step for drug discovery, are made using machine learning.”—Hannah Kerner
Noting that applications of machine learning are making real contributions to science in fields outside computer science, Kerner (who works on machine learning solutions for NASA’s food security and agriculture program) asks how much is lost because of the priorities set by the journals and conferences in the machine learning field.
She also ties this focus on ML research for the sake of advancing ML to the seepage of bias out from widely used datasets into the mainstream — the most famous cases being in face recognition, with systems (machine learning models) built on flawed datasets that disproportionately skew toward white and male faces.
“When studies on real-world applications of machine learning are excluded from the mainstream, it’s difficult for researchers to see the impact of their biased models, making it far less likely that they will work to solve these problems.”—Hannah Kerner
Machine learning is rarely plug-and-play. In creating an application that will be used to perform useful work — to make new discoveries, perhaps, or to make medical diagnoses more accurate — the machine learning researchers will do substantial new work, even when they use existing models. Just think, for a moment, about the data needed to produce an image of a black hole. Then think about the data needed to make predictions of protein structures. You’re not going to handle those in exactly the same way.
I imagine the work is quite demanding when a number of non–ML experts (say, the biologists who work on protein structures) get together with a bunch of ML experts. But either group working separately from the other is unlikely to come up with a robust new ML application. Kerner linked to this 2018 news report about a flawed cancer-detection system — leaked documents said that “instead of feeding real patient data into the software,” the system was trained on data about hypothetical patients. (OMG, I thought — you can’t train a system on fake data and then use it on real people!)
Judging from what Kerner has written, machine learning researchers might be caught in a loop, where they work on pristine and long-used datasets (instead of dirty, chaotic real-world data) to perfect speed and efficiency of algorithms that perhaps become less adaptable in the process.
It’s not that applications aren’t getting made — they are. The difficulty lies in the priorities for research, which might dissuade early-career ML researchers in particular from work on solving interesting and even vital real-world problems — and wrestling with the problems posed by messy real-world data.
I was reminded of something I’ve often heard from data journalists: If you’re taught by a statistics professor, you’ll be given pre-cleaned datasets to work with. (The reason being: She just wants you to learn statistics.) If you’re taught by a journalist, you’ll be given real dirty data, and the first step will be learning how to clean it properly — because that’s what you have to do with real data and a real problem.
So the next time you read about some breakthrough in machine learning, consider whether it is part of a practical application, or instead, more of a laboratory experiment performed in isolation, using a tried-and-true dataset instead of wild data.
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.