Intro to Machine Learning course

A couple of days ago, I wrote about Kaggle’s free introductory Python course. Then I started the next free course in the series: Intro to Machine Learning. The course consists of seven modules; the final module, like the last module in the Python course, shows you how to enter a Kaggle competition using the skills from the course.

The first module, “How Models Work,” begins with a simple decision tree, which is nice because (I think) everyone can grasp how that works, and how you add complexity to the tree to get more accurate answers. The dataset is housing data from Melbourne, Australia; it includes the type of housing unit, the number of bedrooms, and most important, the selling price (and other data too). The data have already been cleaned.

In the second module, we load the Python Pandas library and the Melbourne CSV file. We call one basic statistics function that is built into Pandas — describe() — and get a quick explanation of the output: count, mean, std (standard deviation), min, max, and the three quartiles: 25%, 50% (median), 75%.
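To make the describe() output concrete, here is a minimal sketch. The tiny dataframe is a hypothetical stand-in for the Melbourne CSV (the column names Rooms and Price and all the values are assumptions for illustration); in the course you would load the real file with pd.read_csv() instead.

```python
import pandas as pd

# Hypothetical stand-in for the Melbourne housing CSV (values are made up).
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500000.0, 750000.0, 1200000.0, 900000.0],
})

# describe() returns one row per statistic for each numeric column:
# count, mean, std, min, 25%, 50% (median), 75%, max.
summary = df.describe()
print(summary)
```

Each column of the summary can be read top to bottom, so a quick look at count will also reveal columns with missing values.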

When you do the exercise for the module, you can copy and paste the code from the lesson into the learner’s notebook.

The third module, “Your First Machine Learning Model,” introduces the Pandas columns attribute for the dataframe and shows us how to make a subset of column headings — thus excluding any data we don’t need to analyze. We use the dropna() method to eliminate rows that have missing data (this is not explained). Then we set the prediction target (y) — here it will be the Price column from the housing data. This should make sense to the learner, given the earlier illustration of the small decision tree.

y = df.Price

We use the previously created list of selected column headings (named features) to create X, the features of each house that will go into the decision tree model (such as the number of rooms, and the size of the lot).

X = df[features]
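Put together, the steps in this module might look like the sketch below. The dataframe is a made-up stand-in for the housing data, and the feature names Rooms and Landsize are assumptions for illustration; only the Price target comes from the lesson as described here.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; one row has a missing value.
df = pd.DataFrame({
    "Rooms": [2, 3, None, 4],
    "Landsize": [150.0, 400.0, 320.0, 650.0],
    "Price": [500000.0, 750000.0, 810000.0, 1200000.0],
})

print(df.columns)          # the column headings of the dataframe

# dropna(axis=0) drops every row that contains at least one missing value.
df = df.dropna(axis=0)

features = ["Rooms", "Landsize"]   # assumed feature names
y = df.Price                       # prediction target
X = df[features]                   # only the columns the model will use
```

Dropping rows is the simplest way to handle missing values, at the cost of throwing away the rest of the data in those rows.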

Then we build a model using Python’s scikit-learn library. Up to now, this will all be familiar to anyone who’s had an intro-to-Pandas course, particularly if the focus was data science or data journalism. I do like the list of steps given (building and using a model):

  1. Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  2. Fit: Capture patterns from provided data. This is the heart of modeling.
  3. Predict: Just what it sounds like.
  4. Evaluate: Determine how accurate the model’s predictions are. (List quoted from Kaggle course.)
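The four steps map onto scikit-learn roughly as in this sketch. The data and the feature names Rooms and Landsize are hypothetical, and mean_absolute_error is just one possible way to do the evaluation step.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical housing rows (made-up values for illustration).
df = pd.DataFrame({
    "Rooms": [2, 3, 3, 4],
    "Landsize": [150.0, 400.0, 320.0, 650.0],
    "Price": [500000.0, 750000.0, 810000.0, 1200000.0],
})
X = df[["Rooms", "Landsize"]]
y = df.Price

model = DecisionTreeRegressor(random_state=1)   # 1. Define
model.fit(X, y)                                 # 2. Fit
predictions = model.predict(X)                  # 3. Predict
mae = mean_absolute_error(y, predictions)       # 4. Evaluate
print(mae)
```

Note that this sketch scores the model on the same rows it was fitted to, so the resulting error says little about how it would do on houses it has not seen.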

Since fit() and predict() are ordinary method calls on a scikit-learn model, it begins to look like machine learning is just a walk in the park! And since we are fitting and predicting on the same data, the predictions are perfect! Never fear, that bubble will burst in module 4, “Model Validation,” in which the standard practice of splitting your data into a training set and a test set is explained.

First, though, we learn about predictive accuracy. Out of all the various metrics for summarizing model quality, we will use one called Mean Absolute Error (MAE). This is explained nicely using the housing prices, which is what we are attempting to predict: If the house sold for $150,000 and we predicted it would sell for $100,000, then the error is $150,000 minus $100,000, or $50,000. The function for MAE takes the absolute value of each error, so that overestimates and underestimates do not cancel each other out, and returns the mean.
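As a worked example, MAE can be computed by hand in a few lines of plain Python. The first house uses the numbers from the explanation above; the second house is a hypothetical addition so that there is something to average.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

actual = [150_000, 200_000]      # selling prices
predicted = [100_000, 230_000]   # model predictions (second pair is made up)

# Absolute errors are 50,000 and 30,000, so the mean is 40,000.
mae = mean_absolute_error(actual, predicted)
print(mae)  # 40000.0
```

Without abs(), the +50,000 and -30,000 errors would partly cancel, and the model would look better than it is.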

This is where the lesson says, “Uh-oh! We need to split our data!” We use scikit-learn’s train_test_split() function, and all is well.
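A minimal sketch of the split, using made-up data (the feature names Rooms and Landsize, the values, and the variable names are assumptions, not taken from the course):

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical housing rows (made-up values for illustration).
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5, 2, 4, 3],
    "Landsize": [150.0, 400.0, 650.0, 320.0, 800.0, 120.0, 540.0, 380.0],
    "Price": [500000.0, 750000.0, 1200000.0, 810000.0,
              1500000.0, 450000.0, 1100000.0, 780000.0],
})
X = df[["Rooms", "Landsize"]]
y = df.Price

# Hold back part of the rows (25% by default) that the model never sees in fit().
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Score only on the held-back rows.
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(val_mae)
```

Setting random_state makes the split repeatable, which helps when comparing models later.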

MAE shows us our model is pretty much crap, though. In the fifth module, “Underfitting and Overfitting,” we get a good explanation of the title topic and learn how to limit the number of leaf nodes at the end of our decision tree — DecisionTreeRegressor(max_leaf_nodes).
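One way to explore underfitting and overfitting is to try several values of max_leaf_nodes and compare the validation MAE for each. The sketch below uses synthetic data (a single made-up lot-size feature with noisy prices), and the helper name get_mae and the candidate values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: hypothetical lot sizes and noisy prices.
rng = np.random.default_rng(0)
X = rng.uniform(50, 500, size=(200, 1))
y = 2000 * X[:, 0] + rng.normal(0, 50000, size=200)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes):
    """Fit a tree capped at max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Very few leaves tends to underfit; very many tends to overfit.
scores = {n: get_mae(n) for n in [2, 5, 50, 150]}
best = min(scores, key=scores.get)

for n, score in scores.items():
    print(f"max_leaf_nodes={n}: validation MAE={score:,.0f}")
print("best:", best)
```

The value with the lowest validation MAE is a reasonable choice for the final model, though which value wins depends entirely on the data.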

After all that, our model’s predictions are still crap — because a decision tree model is “not very sophisticated by modern machine learning standards,” the module text drolly explains. That leads us to the sixth module, “Random Forests,” which is nice for two reasons: (1) The explanation of a random forest model should make sense to most learners who have worked through the previous modules; and (2) We get to see that using a different model from scikit-learn is as simple as changing

my_model = DecisionTreeRegressor(random_state=1)

to

my_model = RandomForestRegressor(random_state=1)
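Because both model classes share the same fit() and predict() interface, the surrounding code does not change at all. Here is a small sketch on synthetic data (the data and variable names are made up) that runs both models through the same loop:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: two hypothetical features and noisy prices.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))
y = 300000 * X[:, 0] + 100000 * X[:, 1] + rng.normal(0, 20000, size=200)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

results = {}
for my_model in (DecisionTreeRegressor(random_state=1),
                 RandomForestRegressor(random_state=1)):
    my_model.fit(train_X, train_y)
    predictions = my_model.predict(val_X)
    # Record the validation MAE under the model's class name.
    results[type(my_model).__name__] = mean_absolute_error(val_y, predictions)

print(results)
```

Only the line that constructs the model differs between the two runs.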

Overall I found this a helpful course, and I think a lot of beginners could benefit from taking it — depending on their prior level of understanding. I would assume at least a familiarity with datasets as CSV files and a bit more than beginner-level Python knowledge.


Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.


Examples of machine learning in journalism

Following on from yesterday’s post, today I looked at more lessons in Introduction to Machine Learning from the Google News Initiative. (Friday AI Fun posts will return next week.)

Lesson 3 uses the standard separation of machine learning into three different approaches: supervised learning, unsupervised learning, and reinforcement learning. In keeping with the course’s focus on journalism applications of ML, the example given for supervised learning is The Atlanta Journal-Constitution’s deservedly famous investigative story about sex abuse of patients by doctors. Supervised learning was used to sort more than 100,000 disciplinary reports on doctors.

The example of unsupervised learning is one I hadn’t seen before. It’s an investigation of short-term rentals (such as Airbnb rentals) in Austin, Texas. The investigator used locality-sensitive hashing (LSH) to group property records in a set of about 1 million documents, looking for instances of tax evasion.

The main example given for reinforcement learning is AlphaGo (previously covered in this blog), but an example from The New York Times, “How The New York Times Is Experimenting with Recommendation Algorithms,” is also offered. Reinforcement learning is typically applied when a clear “reward” can be identified, which is why it’s useful in training an AI system to play a game (winning the game is a clear reward). It can also be used to train a physical robot to perform specified actions, such as pouring a liquid into a container without spilling any.

Also in Lesson 3, we find a very brief description of deep learning (it doesn’t mention layers and weights) and just a mention of neural networks.

“What you should retain from this lesson is fairly simple: Different problems require different solutions and different ML approaches to be tackled successfully.”

—Lesson 3, Different approaches to Machine Learning

Lesson 4, “How you can use Machine Learning,” might be the most useful in this set of eight lessons. Its content comes (with permission) from work done by Quartz AI Studio — specifically from the post How you’re feeling when machine learning might help, by the super-talented Jeremy B. Merrill.

The examples in this lesson are really good, so maybe you should just read it directly. You’ll learn about a variety of unusual stories that could only be told when journalists used machine learning to augment their reporting.

“Machine learning is not magic. You might even say that it can’t do anything you couldn’t do — if you just had a thousand tireless interns working for you.”

—Lesson 4, How you can use Machine Learning

(The Quartz AI Studio was created with a $250,000 grant from the Knight Foundation in 2018. For a year the group experimented, helped several news organizations produce great work, and ran a number of trainings for journalists. Then it was quietly disbanded in early 2020.)

Note (added April 4, 2022): The two links above to Quartz AI Studio content have been updated. The original domain, qz-dot-ai, was given up when, at renewal time, the price of all dot-ai domains had skyrocketed. Unfortunately, all the images have been lost, according to a personal communication from Merrill.


Google’s machine learning ‘course’ for journalists

I couldn’t resist dipping into this free course from the Google News Initiative, and what I found surprised me: eight short lessons that are available as PDFs.

The good news: The lessons are journalism-focused, and they provide a painless introduction to the subject. The bad news: This is not really a course or a class at all — although there is one quiz at the end. And you can get a certificate, for what it’s worth.

There’s a lot here that many journalists might not be aware of, and that’s a plus. You get a brief, clear description of Reuters’ News Tracer and Lynx Insight tools, both used in-house to help journalists discover new stories using social media or other data (Lesson 1). A report I recall hearing about — how automated real-estate stories brought significant new subscription revenue to a Swedish news publisher — is included in a quick summary of “robot reporting” (also Lesson 1).

Lesson 2 helpfully explains what machine learning is without getting into technical operations of the systems that do the “learning.” They don’t get into what training a model entails, but they make clear that once the model exists, it is used to make predictions. The predictions are not like what some tarot-card reader tells you but rather probability-based results that the model is able to produce, based on its prior training.

Noting that machine learning is a subset of the wider field called artificial intelligence is, of course, accurate. What is inaccurate is the definition “specific applications that use data to train a model to perform a given task independently and learn from experience.” They left out Q-learning, a type of reinforcement learning (a subset of machine learning), which does not use a model. It’s okay that they left it out, but they shouldn’t imply that all machine learning requires a trained model.

The explosion of machine learning and AI in the past 10 years is explained nicely and concisely in Lesson 2. The lesson also touches on misconceptions and confusion surrounding AI:

“The lack of an officially agreed definition, the legacy of science-fiction, and a general low level of literacy on AI-related topics are all contributing factors.”

—Lesson 2, Is Machine Learning the same thing as AI?

I’ll be looking at Lessons 3 and 4 tomorrow.
