Google ML Crash Course #1 Notes: ML Models

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This first module covers the fundamentals of building regression and classification models.

Linear regression

Introduction

The linear regression model uses an equation
$$
y' = b + w_1x_1 + w_2x_2 + \ldots
$$
to represent the relationship between features and the label.

Here y' is the predicted label, x are the features, b is the bias, and w are the feature weights. The features x and the actual labels y come from the training data; b and w are learned during training by minimizing the difference between predicted and actual values.
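As a concrete illustration (my own, not from the course; the parameter and feature values are made up), the prediction step is just a weighted sum plus the bias:

```python
import numpy as np

# Hypothetical learned parameters for a two-feature model.
b = 1.5                      # bias
w = np.array([0.7, -0.2])    # weights w_1, w_2

# One example with two feature values x_1, x_2.
x = np.array([3.0, 10.0])

# Linear regression prediction: y' = b + w_1*x_1 + w_2*x_2
y_pred = b + np.dot(w, x)
print(y_pred)  # 1.5 + 0.7*3.0 + (-0.2)*10.0 = 1.6
```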

Source: Linear regression | Machine Learning | Google for Developers

Loss

Loss is a numerical value indicating the difference between a model's predictions and the actual values.

The goal of model training is to minimize loss, bringing it as close to zero as possible.

| Loss type | Definition | Equation |
| --- | --- | --- |
| L1 loss | The sum of the absolute values of the differences between the predicted and actual values. | $\sum \lvert\text{actual value}-\text{predicted value}\rvert$ |
| Mean absolute error (MAE) | The average of L1 losses across a set of N examples. | $\frac{1}{N}\sum \lvert\text{actual value}-\text{predicted value}\rvert$ |
| L2 loss | The sum of the squared differences between the predicted and actual values. | $\sum (\text{actual value}-\text{predicted value})^2$ |
| Mean squared error (MSE) | The average of L2 losses across a set of N examples. | $\frac{1}{N}\sum (\text{actual value}-\text{predicted value})^2$ |
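To make the table concrete, here is a small NumPy sketch (my own illustration, with made-up labels and predictions) that computes all four quantities on the same data:

```python
import numpy as np

actual = np.array([10.0, 12.0, 15.0, 40.0])      # actual label values (40.0 is an outlier)
predicted = np.array([11.0, 11.5, 14.0, 18.0])   # model predictions

errors = actual - predicted

l1_loss = np.sum(np.abs(errors))   # sum of absolute differences
mae = np.mean(np.abs(errors))      # L1 loss averaged over N examples
l2_loss = np.sum(errors ** 2)      # sum of squared differences
mse = np.mean(errors ** 2)         # L2 loss averaged over N examples

print(l1_loss, mae, l2_loss, mse)
# Squaring makes the outlier's error dominate MSE far more than it dominates MAE.
```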

The most common methods for calculating loss are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which differ in their sensitivity to outliers.

A model trained with MSE is pulled closer to the outliers but sits further away from most of the other data points.
model-mse.png

A model trained with MAE is farther from the outliers but closer to most of the other data points.
model-mae.png

Source: Linear regression: Loss | Machine Learning | Google for Developers

Gradient descent

Gradient descent is an iterative optimisation algorithm used to find the best weights and bias for a linear regression model by minimising the loss function.

  1. Calculate the loss with the current weight and bias.
  2. Determine the direction in which to move the weights and bias to reduce the loss.
  3. Move the weight and bias values a small amount in the direction that reduces loss.
  4. Return to step one and repeat the process until the model can't reduce the loss any further.
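The loop below is a minimal NumPy sketch of these four steps for a one-feature linear model trained with MSE; the data, learning rate, and iteration count are made-up values for illustration.

```python
import numpy as np

# Toy one-feature dataset (made up for illustration); roughly y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w, b = 0.0, 0.0          # start from arbitrary weight and bias
learning_rate = 0.01

for step in range(1000):
    y_pred = w * x + b                        # 1. predictions and loss with current weight and bias
    loss = np.mean((y - y_pred) ** 2)         #    MSE loss
    grad_w = -2 * np.mean(x * (y - y_pred))   # 2. gradients give the direction that reduces loss
    grad_b = -2 * np.mean(y - y_pred)
    w -= learning_rate * grad_w               # 3. move a small amount in that direction
    b -= learning_rate * grad_b               # 4. repeat until the loss stops improving

print(w, b, loss)   # w approaches ~2, b approaches ~0
```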

A model is considered to have converged when further iterations do not significantly reduce the loss, indicating it has found the weights and bias that produce the lowest possible loss.

Loss curves visually represent the model's progress during training, showing how the loss decreases over iterations and helping to identify convergence.

Linear models have convex loss functions, ensuring that gradient descent will always find the global minimum, resulting in the best possible model for the given data.

Source: Linear regression: Gradient descent | Google for Developers

Hyperparameters

Hyperparameters, such as learning rate, batch size, and epochs, are external configurations that influence the training process of a machine learning model.

The learning rate determines the step size during gradient descent, impacting the speed and stability of convergence.

Batch size dictates the number of training examples processed before updating model parameters, influencing training speed and noise.

Model trained with SGD:
noisy-gradient.png

Model trained with mini-batch SGD:
mini-batch-sgd.png

Epochs represent the number of times the entire training dataset is used during training, affecting model performance and training time.
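The crash course's programming exercises use Keras; as a rough sketch of where each hyperparameter appears in a training call (assuming TensorFlow 2.x is available; the data shapes and hyperparameter values here are made up):

```python
import numpy as np
import tensorflow as tf

# Made-up training data: 1,000 examples with 3 features each.
features = np.random.rand(1000, 3).astype("float32")
labels = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(units=1)])  # simple linear model

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # learning rate: step size of each update
    loss="mse",
)

history = model.fit(
    features,
    labels,
    batch_size=32,   # examples processed before each parameter update
    epochs=10,       # full passes over the training set
    verbose=0,
)

# history.history["loss"] holds the loss per epoch, i.e. the loss curve.
```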

Source: Linear regression: Hyperparameters | Machine Learning | Google for Developers

Logistic regression

Introduction

Logistic regression is a model used to predict the probability of an outcome, unlike linear regression which predicts continuous numerical values.

Logistic regression models output probabilities, which can be used directly or converted to binary categories.

Source: Logistic Regression | Machine Learning | Google for Developers

Calculating a probability with the sigmoid function

A logistic regression model uses a linear equation and the sigmoid function to calculate the probability of an event.

The sigmoid function ensures the output of logistic regression is always between 0 and 1, representing a probability.
$$
f(x) = \frac{1}{1 + e^{-x}}
$$
sigmoid_function_with_axes.png

Linear component of a logistic regression model:
$$
z = b + w_1 x_1 + w_2 x_2 + \ldots + w_N x_N
$$
To obtain the logistic regression prediction, the z value is then passed to the sigmoid function, yielding a value (a probability) between 0 and 1:
$$
y' = \frac{1}{1+e^{-z}}
$$
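As a small illustration (my own, with invented weights and feature values), the two-step calculation looks like this in NumPy:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and one example's feature values.
b = -2.0
w = np.array([0.5, 1.2])
x = np.array([3.0, 1.0])

z = b + np.dot(w, x)   # linear component (the log-odds)
y_prob = sigmoid(z)    # probability between 0 and 1
print(z, y_prob)       # z = 0.7, y_prob ≈ 0.668
```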

z is referred to as the log-odds because if you solve the sigmoid function for z you get:
$$
z = \log\left(\frac{y}{1-y}\right)
$$
This is the log of the ratio of the probabilities of the two possible outcomes: y and 1 - y.

When the output of the linear equation is passed through the sigmoid function, the straight line is bent into an s-shape.
linear_to_logistic.png

Source: Logistic regression: Calculating a probability with the sigmoid function | Machine Learning | Google for Developers

Loss and regularisation

Logistic regression models are trained similarly to linear regression models but use Log Loss instead of squared loss and require regularisation.

Logistic regression uses Log Loss rather than squared loss because the sigmoid's rate of change isn't constant: as predictions approach 0 or 1, the loss function needs far greater precision than squared loss can provide.

The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction. Log Loss is calculated as follows:
$$
\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')
$$
where $(x,y)$ are the labelled examples in dataset $D$, $y$ is the actual label (0 or 1), and $y'$ is the model's predicted probability.
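A minimal NumPy sketch of this sum (my own illustration, with made-up labels and predictions; a small epsilon guards against taking log of zero):

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Sum of per-example log losses for binary labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return np.sum(-y_true * np.log(y_prob) - (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1, 0])            # actual labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # predicted probabilities
print(log_loss(y_true, y_prob))
```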

Regularisation, such as L2 regularisation or early stopping, is crucial in logistic regression to prevent overfitting (due to the model's asymptotic nature) and improve generalisation.

Source: Logistic regression: Loss and regularization | Machine Learning | Google for Developers

Classification

Introduction

Logistic regression models can be converted into binary classification models for predicting categories instead of probabilities.

Source: Classification | Machine Learning | Google for Developers

Thresholds and the confusion matrix

To convert the raw output from a logistic regression model into binary classification (positive and negative class), you need a classification threshold.

Confusion matrix

| | Actual positive | Actual negative |
| --- | --- | --- |
| Predicted positive | True positive (TP) | False positive (FP) |
| Predicted negative | False negative (FN) | True negative (TN) |

Total of each row = all predicted positives (TP + FP) and all predicted negatives (FN + TN)
Total of each column = all real positives (TP + FN) and all real negatives (FP + TN)

When we increase the classification threshold, TP and FP tend to decrease, while TN and FN tend to increase.
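A small NumPy sketch (my own, with invented labels and probabilities) of applying a threshold and counting the four confusion-matrix cells:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # actual classes
y_prob = np.array([0.9, 0.4, 0.6, 0.3, 0.8, 0.1, 0.7, 0.2])  # predicted probabilities

threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)   # positive class if probability >= threshold

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
print(tp, fp, fn, tn)
# Raising the threshold turns some predicted positives into predicted negatives,
# so TP and FP can only go down while FN and TN can only go up.
```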

Source: Thresholds and the confusion matrix | Machine Learning | Google for Developers

Accuracy, Recall, Precision, and related metrics are all calculated at a single classification threshold value.

Accuracy is the proportion of all classifications that were correct.
$$
\text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP+TN}{TP+TN+FP+FN}
$$

Recall, or true positive rate, is the proportion of all actual positives that were classified correctly as positives. Also known as probability of detection.
$$
\text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN}
$$

False positive rate is the proportion of all actual negatives that were classified incorrectly as positives. Also known as probability of a false alarm.
$$
\text{FPR} = \frac{\text{incorrectly classified actual negatives}}{\text{all actual negatives}}=\frac{FP}{FP+TN}
$$

Precision is the proportion of all the model's positive classifications that are actually positive.
$$
\text{Precision} = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}}=\frac{TP}{TP+FP}
$$

Precision and Recall often show an inverse relationship.

F1 score is the harmonic mean of Precision and Recall.
$$
\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$
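Continuing the confusion-matrix sketch above (still my own illustration, with made-up counts), all of these metrics follow directly from the four cells:

```python
# Counts from a confusion matrix (made-up values).
tp, fp, fn, tn = 30, 10, 20, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all classifications that are correct
recall = tp / (tp + fn)                      # true positive rate / probability of detection
fpr = fp / (fp + tn)                         # false positive rate / probability of false alarm
precision = tp / (tp + fp)                   # share of predicted positives that are truly positive
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, fpr, precision, f1)
# accuracy = 0.7, recall = 0.6, fpr = 0.2, precision = 0.75, f1 ≈ 0.667
```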

Source: Classification: Accuracy, recall, precision, and related metrics | Machine Learning | Google for Developers

ROC and AUC

ROC and AUC evaluate a model's quality across all possible thresholds.

The ROC curve, or receiver operating characteristic curve, plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. A perfect model's curve passes through (0, 1), while a random guesser's forms a diagonal line from (0, 0) to (1, 1).

AUC, or area under the curve, represents the probability that the model will rank a randomly chosen positive example higher than a negative example. A perfect model has AUC = 1.0, while a random model has AUC = 0.5.

ROC and AUC of a hypothetical perfect model (AUC = 1.0) and for completely random guesses (AUC = 0.5):
auc_1-0.png
auc_0-5.png
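As a sketch of how these quantities are typically computed (assuming scikit-learn is available; the labels and scores are made up), the curve's points and its area come from the true labels and the model's probability scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])                          # actual classes
y_prob = np.array([0.95, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1, 0.7, 0.35])  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_prob)                # area under that curve

print(auc)   # 1.0 for a perfect ranking, ~0.5 for random scores
```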

ROC and AUC are effective when class distributions are balanced. For imbalanced data, precision-recall curves (PRCs) can be more informative.
prauc.png

A higher AUC generally indicates a better-performing model.

ROC and AUC of two hypothetical models; the second curve (AUC = 0.93) represents the better of the two models:
auc_0-65.png
auc_0-93.png

Threshold choice depends on the cost of false positives versus false negatives. The most relevant thresholds are those closest to (0,1) on the ROC curve. For costly false positives, a conservative threshold (like A in the chart below) is better. For costly false negatives, a more sensitive threshold (like C) is preferable. If costs are roughly equivalent, a threshold in the middle (like B) may be best.
auc_abc.png

Source: Classification: ROC and AUC | Machine Learning | Google for Developers

Prediction bias

Prediction bias measures the difference between the average of a model's predictions and the average of the true labels in the data. For example, if 5% of emails in the dataset are spam, a model without prediction bias should also predict about 5% as spam. A large mismatch between these averages indicates potential problems.

Prediction bias can be caused by:

  - Biases or noise in the data set
  - Too-strong regularisation, which oversimplifies the model
  - Bugs in the model training pipeline
  - A feature set that doesn't give the model enough information

Source: Classification: Prediction bias | Machine Learning | Google for Developers

Multi-class classification

Multi-class classification extends binary classification to cases with more than two classes.

If each example belongs to only one class, the problem can be broken down into a series of binary classifications. For instance, with three classes (A, B, C), you could first separate C from A+B, then distinguish A from B within the A+B group.
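A rough sketch of that cascade (my own illustration using scikit-learn's LogisticRegression; the data is synthetic and the class names A, B, C are just the example's labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 2-feature data for three classes: 0 = A, 1 = B, 2 = C.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ([0, 0], [2, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)

# Stage 1: is the example class C, or one of A/B?
is_c = (y == 2).astype(int)
clf_c_vs_ab = LogisticRegression().fit(X, is_c)

# Stage 2: among A/B examples only, is it A or B?
ab_mask = y != 2
clf_a_vs_b = LogisticRegression().fit(X[ab_mask], y[ab_mask])

def predict(x):
    x = x.reshape(1, -1)
    if clf_c_vs_ab.predict(x)[0] == 1:                        # first binary decision: C vs. not-C
        return "C"
    return "A" if clf_a_vs_b.predict(x)[0] == 0 else "B"      # second binary decision: A vs. B

print(predict(np.array([0.1, 2.9])))   # expected to land in class C's region
```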

Source: Classification: Multi-class classification | Machine Learning | Google for Developers