A good intro to machine learning models?

Today’s reading: How to get started with machine learning and AI.

Three factors that go into creating a new machine learning model:

Asking the right question: 25 percent
Data exploration and cleaning, feature engineering, feature selection: 50 percent
Training and evaluating the model: 25 percent

That’s according to Ellen Ambrose, director of AI at Protenus, a healthcare startup based in Baltimore, Maryland, and founded in 2014. She has a Ph.D. in neuroscience from Johns Hopkins University.

The article says: “Once a team has identified the right questions and has determined that the available data can answer those questions, the model needs to be configured.” I think this needs some reexamination. A more realistic version: once the team thinks they have the right question(s), and they think it’s likely that the available data can answer the question(s). Or even this: they think it’s likely that the available data can answer the question(s) adequately and with a high degree of accuracy.

Just as interface design has to be part of product development from the very beginning, a critical approach to the consequences of AI models must be in effect at every stage of the process. (The article doesn’t say this. This is me.) Your question might be harmful in ways you have not yet realized or acknowledged. Your data might contain imbalances or inadequacies that will not be apparent until you run wild data through your model.
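One small, concrete way to start looking for imbalances is to count how often each label appears before any training happens. The sketch below is only an illustration: the function names, the 10 percent threshold, and the sample labels are all invented here, and a balanced label count says nothing about subtler gaps in the data.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.10):
    """List labels whose share falls below min_share (an arbitrary cutoff)."""
    shares = class_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Hypothetical labels, made up for this example.
labels = ["negative"] * 95 + ["positive"] * 5
print(class_shares(labels))
print(underrepresented(labels))
```

A model trained on labels like these could score 95 percent accuracy by always answering "negative," which is exactly the kind of problem that stays hidden until someone looks.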

The article lists three types of machine learning systems from which you might choose:

Neural network
Support vector machine (SVM)
Gradient boosted forest*

* Wikipedia says: “Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest” (source). This is new to me; I’ve learned about random forests but not gradient boosted forests.
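To make that definition a little more concrete, here is a toy sketch of gradient boosting for a regression task with squared error, where each weak learner is a one-split decision tree (a "stump") on a single numeric feature. This is not from the article or from Wikipedia. All the function names, the learning rate, and the data are assumptions made for illustration, and real libraries do far more than this.

```python
def fit_stump(xs, residuals):
    """Find the one threshold on x that best splits the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Made-up data: low values on the left, high values on the right.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 0.9, 1.1, 1.0, 4.8, 5.1, 5.0, 4.9]
model = fit_boosted(xs, ys)
print(predict_boosted(model, 2), predict_boosted(model, 7))
```

The contrast with a random forest, as I understand it, is that a random forest trains its trees independently and averages them, while boosting trains each small tree on the errors left by the ones before it.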

“Some algorithms are just better suited for certain tasks,” the article says. I wonder whether this might be glossed over in some quickie data science boot camps and courses. If you’re using a tool to build a machine learning model, and the tool lets you choose from various options, do you know enough to choose the one best suited to your task?

Like many descriptions of training an ML model, this article briefly glides over the process of adjusting hyperparameters. It always bothers me when a few principles of statistics and probability are dropped into an article about machine learning as if they were not part of an entire field that existed before ML. The text quickly moves on to: Now that your model is trained, it’s ready to go!
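For a sense of what "adjusting hyperparameters" involves, here is a minimal sketch of a grid search: try several values of one setting, score each on data the model did not train on, and keep the best. It uses a tiny hand-written k-nearest-neighbors classifier so that nothing is hidden. The article does not show this; the data, the function names, and the choice of k-nearest-neighbors are all assumptions for illustration.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Classify x by majority vote among the k nearest training points (1-D)."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of held-out validation points classified correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in validation)
    return correct / len(validation)


def grid_search(train, validation, candidate_ks):
    """Score every candidate k on the validation set; return (best_k, scores)."""
    scores = {k: accuracy(train, validation, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Made-up data. The point at 2.5 is deliberately mislabeled noise.
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.5, "b"), (3.0, "a"),
         (5.0, "b"), (5.5, "b"), (6.0, "b"), (6.5, "b")]
validation = [(2.4, "a"), (2.6, "a"), (5.2, "b"), (6.2, "b")]

best_k, scores = grid_search(train, validation, candidate_ks=[1, 3, 9])
print(best_k, scores)
```

Even in this toy, the statistical questions show up right away: a k that is too small chases the noisy point, a k that is too large just predicts the majority label, and the "best" value depends entirely on which points happened to land in the validation set.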

Photo of raw chocolate-chip cookie dough in a large white bowl — *What’s in the cookie dough? Photo by genniebee512 on Pixabay*

A bit later in the article, it’s noted that we can invoke a trained ML model “with a single Python call” in a Jupyter Notebook. This is in fact how many students supposedly “learn” machine learning. But what are they learning, really, when they simply plug a given dataset into a pre-existing model? I’d say it’s like using the cookie dough sold in the refrigerated section of the supermarket. Sure, you get fresh hot cookies from your oven, but what do you know about making cookies? (What’s in that pre-made dough?)

The idea is that someone else has built, tested, and trained this ML model (many someones, in fact, with tons of resources you don’t have), and now you can skip all that and just use the model to do what you need to do. Sure, that seems great! “The developer could then apply a specific business logic to generate value from an idea without needing to worry about the details of how the model was built and trained,” the article says. Wonderful!

But … you know, that cookie dough could contain an ingredient you’re allergic to. You’re going to want to read the label carefully. Does that ML model you’re using have an ingredients label? Chances are, the answer is no. What’s inside the black box? We can always go back to Crawford and Paglen for an example of how badly this can go. Or look at the many examples Hannah Fry examined.

Unusually for an article of this kind, here we find an acknowledgment that real-world data is constantly changing (and increasing). The ML system, once deployed, will need tending and oversight. MLOps was a new term to me. Inspired by DevOps (the collaboration among developers and IT professionals in all stages of the software development lifecycle), the term refers to collaboration among data scientists, ML engineers, developers, and IT professionals to manage the lifecycle of a machine learning system (algorithms and hardware). The question it has to keep asking: how well is the system working now that it’s out in the world, and its output is relied on for decision-making that affects real people’s lives?
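As one very rough illustration of what that tending might look like, the sketch below compares recent values of a single input feature against the values seen during training and flags a large shift in the average. This is an assumption about one possible check, not something the article describes. The function names, the threshold of 2.0, and the numbers are invented, and real monitoring would look at far more than one mean.

```python
from statistics import mean, stdev


def mean_shift(training_values, live_values):
    """How far the live mean sits from the training mean,
    measured in training standard deviations.

    Assumes training_values has at least two points and is not constant.
    """
    spread = stdev(training_values)
    return (mean(live_values) - mean(training_values)) / spread


def needs_review(training_values, live_values, threshold=2.0):
    """Flag the feature for human review if the shift exceeds the threshold."""
    return abs(mean_shift(training_values, live_values)) > threshold


# Made-up numbers for one feature.
training = [10, 12, 11, 13, 9, 11, 10, 12]
last_week = [11, 12, 10, 11]
this_week = [18, 19, 17, 20]

print(needs_review(training, last_week))
print(needs_review(training, this_week))
```

A flag like this only says that the data has moved. Deciding whether the model's output can still be trusted remains a job for people.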

At the end, the article promises that Ars Technica “will be running an entire series on creating, evaluating, and running AI models.” I’m going to be on the lookout for that, but I’m a little skeptical after reading this one. There’s nothing wrong in it, per se, but in my opinion it misrepresents the use of ML models as something simple and safe, ignoring all the ways that your lack of knowledge about the details can lead to flawed results and unreliable outcomes.

AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.