(2) - From Raw Data to Real Predictions: Mastering the Machine Learning Workflow

A deep dive into the types of learning, the 10-step process, and how to handle the messy reality of missing data.

This week, I dove into Module 2 of my Machine Learning Professional Certificate. While Module 1 set the stage from a mathematics point of view, Module 2 got into the mechanics: the vocabulary of data, the different “flavors” of learning, and the nitty-gritty of cleaning up a dataset using Python.

Here are my key takeaways.

1. The Core Mission of Machine Learning

At its simplest, machine learning is about understanding the relationship between a set of input variables X (often called features or predictors) and an output variable Y (the target). We try to learn a function (f) that maps X to Y, which is difficult because data is limited and inherently noisy.

We generally do this for one of two reasons:

Forecasting (Prediction): We want high accuracy on unseen data (e.g., detecting cancer or fraud).
Inference (Explanation): We want to understand how inputs affect outputs (e.g., how marketing spend impacts sales).
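
To make the two goals concrete, here is a minimal sketch using NumPy. The “true” relationship (y = 3x + 5), the noise level, and all variable names are made up for illustration; in a real project f is unknown and we only see the noisy data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical "true" relationship (unknown in practice): y = 3x + 5 plus noise
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, size=200)

# Estimate f with a least-squares straight-line fit
slope, intercept = np.polyfit(x, y, deg=1)

# Inference: the slope estimates how much y changes per unit of x
print(f"estimated slope: {slope:.2f}, intercept: {intercept:.2f}")

# Forecasting: apply the learned function to an input we have not seen
x_new = 12.0
y_pred = slope * x_new + intercept
print(f"prediction at x={x_new}: {y_pred:.2f}")
```

The fitted slope serves the inference goal, while y_pred serves the forecasting goal. Because of the noise, the estimates land near 3 and 5 but not exactly on them.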

2. The “Big Three” Distinctions

Prediction vs. Classification

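This depends on your output variable. If the output is a continuous number (like a house price), it’s a Prediction problem, more commonly called Regression. If the output is a category (like “Spam” vs. “Not Spam”), it’s a Classification problem. Note that “prediction” here is narrower than the forecasting goal in section 1: both problem types predict something, they just differ in what kind of value comes out.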

Parametric vs. Non-Parametric

Parametric: You assume the function f has a specific shape (like a straight line), so learning reduces to estimating a handful of parameters. It’s simpler, faster, and needs less data, but it performs poorly if your assumption is wrong.
Non-Parametric: You make no explicit assumption about the function’s shape. It’s flexible and can learn complex patterns, but requires much more data.
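
Here is a rough sketch of that trade-off. I picked a sine curve as the hypothetical true function, a straight-line fit as the parametric model, and a hand-rolled k-nearest-neighbours average as the non-parametric one; none of these choices come from the course material.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical curved relationship: y = sin(x) plus a little noise
x_train = rng.uniform(0, 2 * np.pi, size=300)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=300)
x_test = np.linspace(0.5, 5.5, 50)
y_test = np.sin(x_test)  # noise-free targets for evaluation

# Parametric: assume f is a straight line and estimate its two parameters
slope, intercept = np.polyfit(x_train, y_train, deg=1)
line_pred = slope * x_test + intercept

# Non-parametric: average the targets of the k closest training points
def knn_predict(x_query, k=10):
    dist = np.abs(x_train[None, :] - x_query[:, None])  # shape (n_query, n_train)
    nearest = np.argsort(dist, axis=1)[:, :k]           # indices of the k closest
    return y_train[nearest].mean(axis=1)

knn_pred = knn_predict(x_test)

line_mse = np.mean((line_pred - y_test) ** 2)
knn_mse = np.mean((knn_pred - y_test) ** 2)
print(f"straight line MSE: {line_mse:.3f}")
print(f"k-NN MSE:          {knn_mse:.3f}")
```

The straight line has the wrong shape for a sine curve, so its error stays high no matter how much data it sees. The k-NN model follows the curve, but only because 300 training points give it enough neighbours everywhere.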

Supervised vs. Unsupervised

Supervised: You have input variables (X) and a known output variable (Y) to train on.
Unsupervised: You only have inputs. The goal is to find natural structures or clusters in the data, like grouping customers by purchasing behavior without pre-existing labels.
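
As an illustration of the unsupervised case, here is a bare-bones k-means sketch in NumPy. The spend figures, the choice of two clusters, and the starting centers are my own assumptions; a real project would more likely use a library implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical customers: annual spend from two groups, with no labels attached
spend = np.concatenate([
    rng.normal(200, 30, size=50),    # lower spenders
    rng.normal(1000, 100, size=50),  # higher spenders
])

# Minimal 1-D k-means with k=2, starting the centers at the extremes
centers = np.array([spend.min(), spend.max()])
for _ in range(20):
    # Assign each customer to the nearest center
    labels = np.argmin(np.abs(spend[:, None] - centers[None, :]), axis=1)
    # Move each center to the mean of its assigned customers
    centers = np.array([spend[labels == j].mean() for j in range(2)])

print("cluster centers:", centers.round(0))
print("cluster sizes:", np.bincount(labels))
```

The algorithm never sees which group a customer came from; it recovers the two segments purely from how the inputs are distributed.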

3. The 10-Step Machine Learning Workflow

Real-world ML isn’t just about writing code; it’s a process. The module outlined a 10-step guide, though in reality, you often loop back to previous steps.

Define Purpose: Is this a one-off project or an ongoing tool?
Obtain Data: Gather internal or external datasets.
Explore & Clean: The crucial step of handling missing data, errors and outliers.
Feature Engineering: Removing irrelevant variables or creating new ones.
Define Task: Is it regression, classification, or clustering?
Partition Data: Split into Train, Validation, and Test sets.
Choose Technique: Select algorithms (e.g., Decision Trees, Neural Networks).
Execute: Run the algorithms.
Interpret: Evaluate the results.
Deploy: Implement the model for real-world use.
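
The Partition Data step is easy to sketch in pandas. The toy DataFrame and the 60/20/20 ratio below are assumptions for illustration; the right split depends on how much data you have.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset of 100 rows
df = pd.DataFrame({"feature": np.arange(100), "target": np.arange(100) % 2})

# Shuffle once with a fixed seed so the split is reproducible
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Cut the shuffled rows into 60% train, 20% validation, 20% test
n_train = round(0.6 * len(shuffled))
n_val = round(0.2 * len(shuffled))

train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]

print(len(train), len(val), len(test))
```

The model is fit on train, tuned and compared on val, and only scored on test at the very end, so the final number reflects data the model never influenced.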

4. Handling the “Messy” Reality: Missing Data in Python

One of the most practical parts of this module was learning how to handle missing data using the pandas library.

Detecting It:

We use isnull() to create a Boolean mask (True wherever a value is missing) or isnull().sum() to count missing entries per column. Heatmaps in Seaborn are also great for visualizing where data is missing.
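
A small sketch of the detection step, using a made-up DataFrame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a few gaps
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Oslo", "Lima", None, "Pune"],
    "income": [50000, 62000, np.nan, np.nan],
})

mask = df.isnull()               # Boolean mask: True where a value is missing
missing_per_column = mask.sum()  # number of missing entries in each column
print(missing_per_column)

# Optional visual check (needs seaborn and matplotlib installed):
# import seaborn as sns
# sns.heatmap(mask, cbar=False)
```

Here age and city each have one gap and income has two. The heatmap call is left commented out so the snippet runs without the plotting libraries.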

Removing It:

We can use dropna().

axis=0 removes rows (observations).
axis=1 removes columns (features), which is useful if a column is mostly empty.
Caution: removing data can introduce bias and throw away valuable information, which hurts most when the dataset is small.
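
Here is a sketch of both directions on a hypothetical DataFrame. The thresh argument in the last line is an extra dropna() option that I am adding for illustration; it sets the minimum number of non-missing values needed to keep a row or column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes" is mostly empty, "age" has a single gap
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, 58000, 45000],
    "notes": [np.nan, np.nan, np.nan, "vip"],
})

rows_dropped = df.dropna(axis=0)  # keep only rows with no missing values
cols_dropped = df.dropna(axis=1)  # keep only columns with no missing values

# Drop columns with fewer than 2 non-missing values, then drop incomplete rows
cleaned = df.dropna(axis=1, thresh=2).dropna(axis=0)

print(rows_dropped.shape, cols_dropped.shape, cleaned.shape)
```

Dropping rows blindly leaves just one of the four observations, because the sparse notes column poisons almost every row. Removing that column first keeps three rows, which shows how much the order of operations matters.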

Imputing (Filling) It:

Instead of deleting, we can fill the gaps:

Mean: Good for numerical data, but sensitive to outliers.
Median: Better if the data has outliers.
Mode: Used for filling categorical data with the most frequent value.
Predictive Imputation: Using a machine learning model (like Linear Regression) to predict the missing values based on other variables.
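
The four strategies above can be sketched on a toy DataFrame. All columns and values are invented, and for the predictive case I use a simple NumPy line fit as a stand-in for a proper regression model.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column of interest
df = pd.DataFrame({
    "income": [40.0, 42.0, 45.0, np.nan, 500.0],  # 500 is an outlier
    "city": ["Oslo", "Lima", "Oslo", None, "Oslo"],
    "years_exp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "salary": [30.0, 40.0, np.nan, 60.0, 70.0],
})

mean_filled = df["income"].fillna(df["income"].mean())      # pulled up by the outlier
median_filled = df["income"].fillna(df["income"].median())  # robust to the outlier
mode_filled = df["city"].fillna(df["city"].mode()[0])       # most frequent category

# Predictive imputation: fit salary ~ years_exp on complete rows, then fill the gap
known = df["salary"].notna()
slope, intercept = np.polyfit(
    df.loc[known, "years_exp"].to_numpy(),
    df.loc[known, "salary"].to_numpy(),
    deg=1,
)
predicted = slope * df["years_exp"] + intercept
salary_filled = df["salary"].fillna(predicted)

print(mean_filled[3], median_filled[3], mode_filled[3], round(salary_filled[2], 2))
```

The single outlier drags the mean fill far above the typical income, while the median fill stays close to the other values. The predictive fill uses the relationship with years_exp rather than a single summary number.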

5. ML in the Real World

Finally, we looked at how these concepts apply to actual businesses:

Yelp: Uses Neural Networks to classify images (identifying “food” vs. “menu” photos) and to optimize ad matching.
Danske Bank: Reduced false positives in fraud detection by 60% using Deep Learning.
Carbon Robotics: Uses Convolutional Neural Networks (CNNs) to detect weeds and kill them with lasers, saving millions of gallons of herbicide.