A couple of days ago, I wrote about Kaggle’s free introductory Python course. Then I started the next free course in the series: Intro to Machine Learning. The course consists of seven modules; the final module, like the last module in the Python course, shows you how to enter a Kaggle competition using the skills from the course.
The first module, “How Models Work,” begins with a simple decision tree, which is nice because (I think) everyone can grasp how that works, and how you add complexity to the tree to get more accurate answers. The dataset is housing data from Melbourne, Australia; it includes the type of housing unit, the number of bedrooms, and most important, the selling price (and other data too). The data have already been cleaned.
In the second module, we load the Python Pandas library and the Melbourne CSV file. We call one basic statistics function that is built into Pandas, describe(), and get a quick explanation of the output: count, mean, std (standard deviation), min, max, and the three quartiles: 25%, 50% (median), 75%.
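Here is a minimal sketch of that step. The course reads the Melbourne CSV with pd.read_csv(); to keep the snippet runnable without the file, it builds a tiny stand-in table instead. The values are invented, and the column names (Rooms, Landsize, Price) are my assumption about the headings in the Melbourne file.

```python
import pandas as pd

# Stand-in for the Melbourne housing data; all values are invented.
# With the real file you would call pd.read_csv() on its path instead.
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [150.0, 320.0, 610.0, 280.0],
    "Price": [850000.0, 1200000.0, 1650000.0, 990000.0],
})

# One row per statistic, one column per numeric column in df.
summary = df.describe()
print(summary)
```

Each row of the result is one statistic, so summary.loc["50%", "Price"] would pull out the median price.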
When you do the exercise for the module, you can copy and paste the code from the lesson into the learner’s notebook.
The third module, “Your First Machine Learning Model,” introduces the Pandas columns attribute for the dataframe and shows us how to make a subset of column headings, thus excluding any data we don’t need to analyze. We use the dropna() method to eliminate rows that have missing data (this is not explained). Then we set the prediction target (y); here it will be the Price column from the housing data. This should make sense to the learner, given the earlier illustration of the small decision tree.
y = df.Price
We use the previously created list of selected column headings (named features) to create X, the features of each house that will go into the decision tree model (such as the number of rooms, and the size of the lot).
X = df[features]
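Put together, the steps in this module might look like the sketch below. The rows are invented stand-in data with one missing value, and the column names and the two-item features list are assumptions for illustration, not the course's exact selection.

```python
import pandas as pd

# Invented stand-in rows; None marks a missing value.
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [150.0, None, 610.0, 280.0],
    "Price": [850000.0, 1200000.0, 1650000.0, 990000.0],
})

print(df.columns)        # the column headings of the dataframe

# axis=0 drops every row that has at least one missing value.
df = df.dropna(axis=0)

features = ["Rooms", "Landsize"]  # hypothetical subset of column headings
y = df.Price                      # prediction target
X = df[features]                  # inputs to the model
print(len(df))
```

The second row has no Landsize value, so dropna() removes it and three rows remain in both X and y.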
Then we build a model using Python’s scikit-learn library. Up to now, this will all be familiar to anyone who’s had an intro-to-Pandas course, particularly if the focus was data science or data journalism. I do like the list of steps given (building and using a model):
- Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- Fit: Capture patterns from provided data. This is the heart of modeling.
- Predict: Just what it sounds like.
- Evaluate: Determine how accurate the model’s predictions are. (List quoted from Kaggle course.)
Since fit() and predict() are commands in scikit-learn, it begins to look like machine learning is just a walk in the park! And since we are fitting and predicting on the same data, the predictions are perfect! Never fear, that bubble will burst in module 4, “Model Validation,” in which the standard practice of splitting your data into a training set and a test set is explained.
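The four steps can be sketched in a few lines. This is an illustration on invented data, not the course's notebook: the feature values and prices are made up, and the evaluation uses scikit-learn's mean_absolute_error on the same rows the model was fitted on.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Invented houses; every row has a distinct combination of features.
X = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5],
    "Landsize": [150.0, 320.0, 610.0, 280.0, 700.0],
})
y = pd.Series([850000.0, 1200000.0, 1650000.0, 990000.0, 2100000.0])

model = DecisionTreeRegressor(random_state=1)  # Define
model.fit(X, y)                                # Fit
predictions = model.predict(X)                 # Predict (on the same rows)

# Evaluate: an unrestricted tree can memorize rows it has already seen.
in_sample_mae = mean_absolute_error(y, predictions)
print(in_sample_mae)
```

Because the tree is allowed to grow until each house sits in its own leaf, the error on these rows is zero, which says nothing about how it would do on houses it has not seen.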
First, though, we learn about predictive accuracy. Out of all the various metrics for summarizing model quality, we will use one called Mean Absolute Error (MAE). This is explained nicely using the housing prices, which is what we are attempting to predict: If the house sold for $150,000 and we predicted it would sell for $100,000, then the error is $150,000 minus $100,000, or $50,000. The function for MAE takes the absolute value of each error, so that overestimates and underestimates don’t cancel out, and returns the mean of those values.
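The arithmetic is easy to check by hand. In this sketch the first house is the one from the example above; the other two are invented so there is something to average.

```python
# Sale prices and predictions; only the first pair comes from the example.
actual = [150_000, 200_000, 320_000]
predicted = [100_000, 230_000, 320_000]

# Absolute error per house: the size of the miss, ignoring direction.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the mean of those absolute errors.
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)
print(mae)
```

The second prediction is too high by $30,000 and the first is too low by $50,000; taking absolute values keeps those two misses from offsetting each other.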
This is where the lesson says, “Uh-oh! We need to split our data!” We use scikit-learn’s train_test_split() function, and all is well.
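A minimal sketch of the split, assuming synthetic data in place of the Melbourne file: the price formula and noise level are invented, and the names train_X, val_X, train_y, and val_y are just labels I chose for the two halves.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic houses: price depends on rooms and lot size, plus random noise.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Landsize": rng.uniform(100, 900, size=n),
})
y = 200_000 * X["Rooms"] + 500 * X["Landsize"] + rng.normal(0, 50_000, size=n)

# By default a quarter of the rows are held out; random_state fixes the shuffle.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)  # fit on the training rows only

train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(train_mae, val_mae)
```

The error on the held-out rows is the honest number; the error on the training rows stays near zero because the tree has memorized them.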
MAE shows us our model is pretty much crap, though. In the fifth module, “Underfitting and Overfitting,” we get a good explanation of the title topic and learn how to limit the number of leaf nodes at the end of our decision tree.
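In scikit-learn, the leaf limit is the max_leaf_nodes argument of DecisionTreeRegressor. The sketch below tries a few limits and keeps the one with the lowest validation MAE. The data is synthetic, and the helper name get_mae and the candidate sizes are choices made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (invented price formula plus noise).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Landsize": rng.uniform(100, 900, size=n),
})
y = 200_000 * X["Rooms"] + 500 * X["Landsize"] + rng.normal(0, 50_000, size=n)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes leaves; return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Very few leaves tends to underfit; very many tends to overfit.
candidates = [2, 5, 25, 100]
scores = {k: get_mae(k, train_X, val_X, train_y, val_y) for k in candidates}
best_size = min(scores, key=scores.get)
print(scores)
print(best_size)
```

Which size wins depends entirely on the data, so the useful habit is the comparison loop, not any particular number.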
After all that, our model’s predictions are still crap, because a decision tree model is “not very sophisticated by modern machine learning standards,” the module text drolly explains. That leads us to the sixth module, “Random Forests,” which is nice for two reasons: (1) The explanation of a random forest model should make sense to most learners who have worked through the previous modules; and (2) We get to see that using a different model from scikit-learn is as simple as changing
my_model = DecisionTreeRegressor(random_state=1)
to
my_model = RandomForestRegressor(random_state=1)
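To see how little else has to change, here is a sketch that runs both models through the same fit, predict, and evaluate code. The data is synthetic (invented formula and noise), so the scores it prints are not the course's results, and nothing here guarantees which model comes out ahead on other data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (invented price formula plus noise).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=n),
    "Landsize": rng.uniform(100, 900, size=n),
})
y = 200_000 * X["Rooms"] + 500 * X["Landsize"] + rng.normal(0, 50_000, size=n)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Only the constructor differs; the rest of the workflow is shared.
models = {
    "decision tree": DecisionTreeRegressor(random_state=1),
    "random forest": RandomForestRegressor(random_state=1),
}

results = {}
for name, my_model in models.items():
    my_model.fit(train_X, train_y)
    results[name] = mean_absolute_error(val_y, my_model.predict(val_X))

print(results)
```

Both classes expose the same fit() and predict() methods, which is why the loop body never needs to know which model it is holding.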
Overall I found this a helpful course, and I think a lot of beginners could benefit from taking it — depending on their prior level of understanding. I would assume at least a familiarity with datasets as CSV files and a bit more than beginner-level Python knowledge.
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.