Google ML Crash Course #1 Notes: ML Models
This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This first module covers the fundamentals of building regression and classification models.
Linear regression
Introduction
The linear regression model uses an equation
$$
y' = b + w_1x_1 + w_2x_2 + \ldots
$$
to represent the relationship between features and the label.
- y' is the predicted label—the output
- b is the bias of the model (the y-intercept in algebraic terms), sometimes referred to as w_0
- w_1 is the weight of the feature (the slope in algebraic terms)
- x_1 is a feature—the input
y and features x are given. b and w are calculated from training by minimizing the difference between predicted and actual values.
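As a minimal sketch of the equation above, the prediction for one example is the bias plus the weighted sum of its features. The function name `predict` and the numbers used are hypothetical and chosen only for illustration.

```python
def predict(features, weights, bias):
    """Return y' = b + w_1*x_1 + w_2*x_2 + ... for a single example."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical values, not taken from the course.
y_pred = predict(features=[2.0, 3.0], weights=[0.5, -1.0], bias=4.0)
print(y_pred)  # 4.0 + 0.5*2.0 + (-1.0)*3.0 = 2.0
```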
Source: Linear regression | Machine Learning | Google for Developers
Loss
Loss is a numerical value indicating the difference between a model's predictions and the actual values.
The goal of model training is to minimize loss, bringing it as close to zero as possible.
| Loss type | Definition | Equation |
|---|---|---|
| L1 loss | The sum of the absolute values of the difference between the predicted values and the actual values. | $$\sum |\text{actual value}-\text{predicted value}|$$ |
| Mean absolute error (MAE) | The average of L1 losses across a set of N examples. | $$\frac{1}{N}\sum |\text{actual value}-\text{predicted value}|$$ |
| L2 loss | The sum of the squared difference between the predicted values and the actual values. | $$\sum (\text{actual value}-\text{predicted value})^2$$ |
| Mean squared error (MSE) | The average of L2 losses across a set of N examples. | $$\frac{1}{N}\sum (\text{actual value}-\text{predicted value})^2$$ |
The most common methods for calculating loss are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which differ in their sensitivity to outliers.
A model trained with MSE sits closer to the outliers but farther from most of the other data points.

A model trained with MAE is farther from the outliers but closer to most of the other data points.
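The four losses in the table can be sketched in a few lines of plain Python. The data is made up: three predictions are exact and one misses an outlier by 16, which shows how squaring makes that single example dominate L2 and MSE.

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def mae(actual, predicted):
    """L1 loss averaged over the N examples."""
    return l1_loss(actual, predicted) / len(actual)


def l2_loss(actual, predicted):
    """Sum of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def mse(actual, predicted):
    """L2 loss averaged over the N examples."""
    return l2_loss(actual, predicted) / len(actual)


# Hypothetical data: the last actual value is an outlier.
actual = [1.0, 2.0, 3.0, 20.0]
predicted = [1.0, 2.0, 3.0, 4.0]

print(l1_loss(actual, predicted), mae(actual, predicted))  # 16.0 4.0
print(l2_loss(actual, predicted), mse(actual, predicted))  # 256.0 64.0
```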

Source: Linear regression: Loss | Machine Learning | Google for Developers
Gradient descent
Gradient descent is an iterative optimisation algorithm used to find the best weights and bias for a linear regression model by minimising the loss function.
- Calculate the loss with the current weight and bias.
- Determine the direction in which to move the weights and bias to reduce loss.
- Move the weight and bias values a small amount in the direction that reduces loss.
- Return to step one and repeat the process until the model can't reduce the loss any further.
A model is considered to have converged when further iterations do not significantly reduce the loss, indicating it has found the weights and bias that produce the lowest possible loss.
Loss curves visually represent the model's progress during training, showing how the loss decreases over iterations and helping to identify convergence.
Linear models have convex loss functions, so the loss surface has a single global minimum. When gradient descent converges on such a model (given a suitably small learning rate), it has found the weights and bias that produce the lowest loss for the given data.
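The loop described above can be sketched for a one-feature linear model trained with MSE. This is an illustrative sketch, not the course's implementation: the function name, the toy data, the learning rate, and the fixed step count (used here instead of a convergence check) are all assumptions.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y' = w*x + b by full-batch gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []  # one entry per iteration, i.e. the loss curve
    for _ in range(steps):
        # 1. Calculate the loss with the current weight and bias.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        # 2. The gradient of MSE gives the direction that increases loss.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # 3. Move a small amount in the opposite direction.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses


# Toy data generated from y = 2x + 1, so the best fit is w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, losses = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

Plotting `losses` against the iteration number would give the loss curve mentioned above.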
Source: Linear regression: Gradient descent | Google for Developers
Hyperparameters
Hyperparameters, such as learning rate, batch size, and epochs, are external configurations that influence the training process of a machine learning model.
The learning rate determines the step size during gradient descent, impacting the speed and stability of convergence.
- If the learning rate is too low, the model can take a long time to converge.
- However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimise the loss.
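The effect of the learning rate can be seen on a deliberately simple loss, `loss(w) = w**2`, whose minimum is at `w = 0`. The loss function, the starting point, and the three rates are assumptions picked for illustration only.

```python
def descend(learning_rate, steps=50, start=1.0):
    """Run gradient descent on loss(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w


too_low = descend(0.01)    # shrinks by 2% per step, still far from 0
reasonable = descend(0.3)  # ends very close to 0
too_high = descend(1.0)    # each step flips w between 1.0 and -1.0

print(too_low, reasonable, too_high)
```

With the rate of 1.0 the weight jumps over the minimum on every step and never settles, which is the bouncing behaviour described above.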
Batch size dictates the number of training examples processed before updating model parameters, influencing training speed and noise.
- When a dataset contains hundreds of thousands or even millions of examples, using the full batch isn't practical.
- Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent.
- Stochastic gradient descent uses only a single random example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy.
- Mini-batch stochastic gradient descent is a compromise between full-batch and SGD. For N number of data points, the batch size can be any number greater than 1 and less than N. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.
Model trained with SGD:

Model trained with mini-batch SGD:

Epochs represent the number of times the entire training dataset is used during training, affecting model performance and training time.
- For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch.
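The relationship between batch size, iterations, and epochs can be sketched as a pair of nested loops. Shuffling once per epoch and then slicing the data into batches is one common way to pick batches at random; it is an assumption here, and the helper name `minibatches` is hypothetical.

```python
import random


def minibatches(examples, batch_size, rng):
    """Yield shuffled mini-batches that together cover the dataset once."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


examples = list(range(1000))  # stand-in for 1,000 training examples
rng = random.Random(0)        # seeded so the run is repeatable

epochs = 3
iterations = 0
for _ in range(epochs):
    for batch in minibatches(examples, batch_size=100, rng=rng):
        iterations += 1       # one weight-and-bias update would happen here

iterations_per_epoch = iterations // epochs
print(iterations_per_epoch)   # 10
```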
Source: Linear regression: Hyperparameters | Machine Learning | Google for Developers
Logistic regression
Introduction
Logistic regression is a model used to predict the probability of an outcome, unlike linear regression, which predicts continuous numerical values.
Logistic regression models output probabilities, which can be used directly or converted to binary categories.
Source: Logistic Regression | Machine Learning | Google for Developers
Calculating a probability with the sigmoid function
A logistic regression model uses a linear equation and the sigmoid function to calculate the probability of an event.
The sigmoid function ensures the output of logistic regression is always between 0 and 1, representing a probability.
$$
f(x) = \frac{1}{1 + e^{-x}}
$$

Linear component of a logistic regression model:
$$
z = b + w_1 x_1 + w_2 x_2 + \ldots + w_N x_N
$$
To obtain the logistic regression prediction, the z value is then passed to the sigmoid function, yielding a value (a probability) between 0 and 1:
$$
y' = \frac{1}{1+e^{-z}}
$$
- y' is the output of the logistic regression model.
- z is the linear output (as calculated in the preceding equation).
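The two equations above can be combined into a short sketch: compute the linear output z, then pass it through the sigmoid. The function names and example values are hypothetical. The sigmoid is split into two branches only to avoid floating-point overflow for large negative z; both branches are algebraically the same function.

```python
import math


def sigmoid(z):
    """Return 1 / (1 + e^(-z)) without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # equivalent form, safe when z is very negative
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Linear component z followed by the sigmoid, giving y'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


print(sigmoid(0.0))  # 0.5
print(predict_probability([2.0, 3.0], [0.5, -1.0], 2.0))  # z = 0.0, so 0.5
```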
z is referred to as the log-odds because if you solve the sigmoid function for z you get:
$$
z = \log\left(\frac{y'}{1-y'}\right)
$$
This is the log of the ratio of the probabilities of the two possible outcomes: y' and 1 - y'.
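For reference, the intermediate steps of solving the sigmoid for z can be written out as follows, using y' for the model's output as in the sigmoid equation earlier in this section:

$$
\begin{aligned}
y' &= \frac{1}{1+e^{-z}} \\
1 + e^{-z} &= \frac{1}{y'} \\
e^{-z} &= \frac{1-y'}{y'} \\
z &= \log\left(\frac{y'}{1-y'}\right)
\end{aligned}
$$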
When the linear equation becomes input to the sigmoid function, it bends the straight line into an s-shape.

Loss and regularisation
Logistic regression models are trained similarly to linear regression models but use Log Loss instead of squared loss and require regularisation.
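Log Loss is used in logistic regression because the sigmoid's rate of change isn't constant: the output changes quickly when z is near 0 and very slowly as the output approaches 0 or 1. Errors near those extremes therefore need to be measured with more precision than the squared loss used in linear regression would provide.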
The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction. Log Loss is calculated as follows:
$$
\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')
$$
- D is the dataset containing many labelled examples, which are (x, y) pairs.
- y is the label in a labelled example. Since this is logistic regression, every value of y must either be 0 or 1.
- y' is your model's prediction (somewhere between 0 and 1), given the set of features in x.
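A minimal sketch of the Log Loss sum, assuming the predictions have already been computed. Clipping predictions away from exactly 0 and 1 with a small `eps` is an addition of this sketch (it avoids taking the log of zero) and is not part of the equation above.

```python
import math


def log_loss(labels, predictions, eps=1e-15):
    """Sum of -y*log(y') - (1 - y)*log(1 - y') over all examples."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1 - eps)  # assumption: clip to avoid log(0)
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total


# Hypothetical predictions for two positive examples.
confident_right = log_loss([1], [0.9])
confident_wrong = log_loss([1], [0.1])
print(confident_right, confident_wrong)
```

A confident wrong prediction is penalised far more heavily than a confident right one, which is the behaviour the logarithm is there to provide.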
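Regularisation, such as L2 regularisation or early stopping, is crucial in logistic regression to prevent overfitting and improve generalisation. Because the sigmoid only approaches 0 and 1 asymptotically, an unregularised model can keep driving loss towards zero by growing its weights, particularly when it has many features.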
Source: Logistic regression: Loss and regularization | Machine Learning | Google for Developers
Classification
Introduction
Logistic regression models can be converted into binary classification models for predicting categories instead of probabilities.
Source: Classification | Machine Learning | Google for Developers
Thresholds and the confusion matrix
To convert the raw output from a logistic regression model into binary classification (positive and negative class), you need a classification threshold.
Confusion matrix
| | Actual positive | Actual negative |
|---|---|---|
| Predicted positive | True positive (TP) | False positive (FP) |
| Predicted negative | False negative (FN) | True negative (TN) |
Total of each row = all predicted positives (TP + FP) and all predicted negatives (FN + TN)
Total of each column = all real positives (TP + FN) and all real negatives (FP + TN)
- When positive examples and negative examples are generally well differentiated, with most positive examples having higher scores than negative examples, the dataset is separated.
- When the total of actual positives is not close to the total of actual negatives, the dataset is imbalanced.
- When many positive examples have lower scores than negative examples, and many negative examples have higher scores than positive examples, the dataset is unseparated.
When we increase the classification threshold, both TP and FP decrease, and both TN and FN increase.
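The effect of the threshold on the confusion matrix can be checked with a small sketch. The labels and scores are made up, and treating a score equal to the threshold as positive (`>=`) is an assumption of this sketch.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN and TN for one classification threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold  # assumption: ties are positive
        if predicted_positive and y == 1:
            counts["TP"] += 1
        elif predicted_positive and y == 0:
            counts["FP"] += 1
        elif y == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.05]

low = confusion_counts(labels, scores, threshold=0.5)
high = confusion_counts(labels, scores, threshold=0.8)
print(low)   # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
print(high)  # {'TP': 2, 'FP': 0, 'FN': 2, 'TN': 4}
```

Raising the threshold from 0.5 to 0.8 lowers TP and FP and raises FN and TN, matching the statement above.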
Source: Thresholds and the confusion matrix | Machine Learning | Google for Developers
Accuracy, recall, precision, and related metrics
Accuracy, Recall, Precision, and related metrics are all calculated at a single classification threshold value.
Accuracy is the proportion of all classifications that were correct.
$$
\text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP+TN}{TP+TN+FP+FN}
$$
- Use as a rough indicator of model training progress/convergence for balanced datasets. Typically the default.
- For model performance, use only in combination with other metrics.
- Avoid for imbalanced datasets. Consider using another metric.
Recall, or true positive rate, is the proportion of all actual positives that were classified correctly as positives. Also known as probability of detection.
$$
\text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN}
$$
- Use when false negatives are more expensive than false positives.
- Better than Accuracy in imbalanced datasets.
- Improves when false negatives decrease.
False positive rate is the proportion of all actual negatives that were classified incorrectly as positives. Also known as probability of a false alarm.
$$
\text{FPR} = \frac{\text{incorrectly classified actual negatives}}{\text{all actual negatives}}=\frac{FP}{FP+TN}
$$
- Use when false positives are more expensive than false negatives.
- Less meaningful and useful in a dataset where the number of actual negatives is very, very low.
Precision is the proportion of all the model's positive classifications that are actually positive.
$$
\text{Precision} = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}}=\frac{TP}{TP+FP}
$$
- Use when it's very important for positive predictions to be accurate.
- Less meaningful and useful in a dataset where the number of actual positives is very, very low.
- Improves as false positives decrease.
Precision and Recall often show an inverse relationship.
F1 score is the harmonic mean of Precision and Recall.
$$
\text{F1} = 2 * \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}} = \frac{2TP}{2TP + FP + FN}
$$
- Preferable for class-imbalanced datasets.
- When Precision and Recall are close in value, F1 will be close to their value.
- When Precision and Recall are far apart, F1 will be similar to whichever metric is worse.
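The formulas in this section can be gathered into one sketch that takes the four confusion-matrix counts. The counts below are hypothetical and describe an imbalanced dataset (16 actual positives, 84 actual negatives). The sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for an imbalanced dataset.
metrics = classification_metrics(tp=8, fp=2, fn=8, tn=82)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.9 even though the model finds only half of the actual positives (recall of 0.5), which illustrates why accuracy alone is a poor guide for imbalanced data.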
ROC and AUC
ROC and AUC evaluate a model's quality across all possible thresholds.
The ROC (receiver operating characteristic) curve plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. A perfect model would pass through (0,1), while a random guesser forms a diagonal line from (0,0) to (1,1).
AUC, or area under the curve, represents the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative example. A perfect model has AUC = 1.0, while a random model has AUC = 0.5.
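The ranking interpretation of AUC can be computed directly by comparing every positive example with every negative example. This brute-force sketch is meant to illustrate the definition rather than to be efficient. Counting a tied pair as half a win is a common convention and an assumption here, and the labels and scores are made up.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.05]
print(auc_by_ranking(labels, scores))  # 15 of 16 pairs ranked correctly: 0.9375
```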
ROC and AUC of a hypothetical perfect model (AUC = 1.0) and for completely random guesses (AUC = 0.5):


ROC and AUC are effective when class distributions are balanced. For imbalanced data, precision-recall curves (PRCs) can be more informative.

A higher AUC generally indicates a better-performing model.
ROC and AUC of two hypothetical models, one of which has AUC = 0.65; the curve with the greater AUC represents the better of the two models:


Threshold choice depends on the cost of false positives versus false negatives. The most relevant thresholds are those closest to (0,1) on the ROC curve. For costly false positives, a conservative threshold (like A in the chart below) is better. For costly false negatives, a more sensitive threshold (like C) is preferable. If costs are roughly equivalent, a threshold in the middle (like B) may be best.

Source: Classification: ROC and AUC | Machine Learning | Google for Developers
Prediction bias
Prediction bias measures the difference between the average of a model's predictions and the average of the true labels in the data. For example, if 5% of emails in the dataset are spam, a model without prediction bias should also predict about 5% as spam. A large mismatch between these averages indicates potential problems.
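As a rough sketch, prediction bias can be computed as the mean prediction minus the mean label. The sign convention (prediction minus label) and the sample data are assumptions of this sketch.

```python
def prediction_bias(predictions, labels):
    """Mean of the predictions minus mean of the true labels."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


# Hypothetical data: 1 of 20 emails (5%) is actually spam.
labels = [1] + [0] * 19
predictions = [0.9] + [0.1] * 19  # the model's average prediction is 14%

bias = prediction_bias(predictions, labels)
print(round(bias, 2))  # 0.09
```

Here the model predicts spam about 14% of the time on average against a true rate of 5%, so the bias of roughly 0.09 would be worth investigating.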
Prediction bias can be caused by:
- Biased and noisy data (e.g., skewed sampling)
- Overly strong regularisation that oversimplifies the model
- Bugs in the model training pipeline
- Insufficient features provided to the model
Source: Classification: Prediction bias | Machine Learning | Google for Developers
Multi-class classification
Multi-class classification extends binary classification to cases with more than two classes.
If each example belongs to only one class, the problem can be broken down into a series of binary classifications. For instance, with three classes (A, B, C), you could first separate C from A+B, then distinguish A from B within the A+B group.
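The three-class example above can be sketched as two binary classifiers applied in sequence. Everything here is hypothetical: the function name, the two stand-in classifiers, and the numeric cut-offs exist only to make the control flow concrete.

```python
def classify_abc(example, is_c, is_a):
    """Chain two binary classifiers to choose between classes A, B and C.

    is_c separates C from the combined A+B group.
    is_a separates A from B within the A+B group.
    """
    if is_c(example):
        return "C"
    return "A" if is_a(example) else "B"


# Hypothetical stand-ins for trained binary classifiers on a single number.
def is_c(x):
    return x >= 10


def is_a(x):
    return x < 5


results = [classify_abc(x, is_c, is_a) for x in (2, 7, 12)]
print(results)  # ['A', 'B', 'C']
```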
Source: Classification: Multi-class classification | Machine Learning | Google for Developers