Machine Learning (ML): foundations, paradigms, history, and relationships

Dashboard/Study Pack•Learning Mode

Summary

Machine learning (ML) is the study of statistical algorithms that learn from data and generalize to unseen examples without explicit programming. This matters because it reframes “intelligence” as improving task performance via data-driven learning, connecting directly to the idea that future inputs are uncertain. At the foundation, ML relies on statistics and mathematical optimization. Many learning methods can be viewed as minimizing a loss function, which measures how wrong predictions are compared to outcomes. This matters because it turns learning into a solvable optimization problem and provides a bridge to theory. A central training viewpoint is empirical risk minimization (ERM): algorithms minimize empirical loss on the training set. ERM matters because it links practical training to theoretical analysis of generalization. Generalization is the ability to perform accurately on new, unseen data drawn from an unknown distribution; it is not the same as training accuracy. Learning theory often studies generalization using probabilistic guarantees. One such framework is Probably Approximately Correct (PAC) learning, which provides bounds on performance given finite data. This matters because it explains when learning can be expected to work reliably, even though we cannot know the future distribution. ML task paradigms operationalize these ideas. Supervised learning uses labeled examples for classification or regression, typically through loss functions. Unsupervised learning discovers structure in unlabeled data via clustering or dimensionality reduction, often supporting data mining and preprocessing. Reinforcement learning trains agents from interaction feedback (rewards/penalties), connecting to sequential decision-making examples like AlphaGo. Other settings include structured prediction, anomaly detection, and learning with humans in the loop. Neural networks and deep learning extend these paradigms by learning representations with multiple layers, enabling strong performance through improved training and backpropagation. Finally, ML relates to other fields: AI (historical shifts between symbolic and statistical approaches), data mining (prediction vs discovery goals), and data compression via compression–prediction equivalence, where predicting posterior probabilities of sequences enables optimal compression and vice versa.

Topics Covered

What Machine Learning Is: Definition, Goals, and Historical Roots

Machine learning (ML) uses statistical algorithms that learn from data and generalize to unseen inputs without explicit programming. The core goal is improving performance on a task measured by some criterion, not memorizing training examples. This operational framing connects directly to later choices of loss functions, task paradigms, and learning-theory guarantees. Early milestones like Samuel’s checkers and Cybertron motivate how learning-from-feedback became central to the field.

Mathematical Foundations: Statistics, Optimization, and Loss

ML is grounded in statistics and mathematical optimization, often expressed as minimizing a loss function. The loss function quantifies discrepancy between predictions and known outcomes, turning a learning problem into an optimization problem. This viewpoint supports empirical risk minimization (ERM) and enables formal analysis of generalization. It also clarifies why different task types (classification, regression, and others) share a common training-objective structure.

Empirical Risk Minimization (ERM) and Generalization: Why Training Is Not Enough

Many ML and deep learning algorithms can be described as minimizing empirical risk: loss averaged over the training set. However, success depends on generalization, meaning accurate performance on new, unseen examples drawn from an unknown data distribution. This topic connects ERM to learning theory and to bias–variance decomposition as a way to understand generalization error sources. It also directly addresses the common confusion between training accuracy and generalization.

Learning Paradigms and Task Types: Supervised, Unsupervised, and Beyond

Supervised learning uses labeled examples to predict outputs, commonly for classification or regression, typically via loss functions comparing predictions to labels. Unsupervised learning uses unlabeled data to discover structure, such as clustering or dimensionality reduction, and can serve as preprocessing for supervised learning. Other settings include structured prediction, anomaly detection, and reinforcement learning, where feedback comes from rewards or interaction outcomes. This topic connects back to ERM and generalization by showing that each paradigm still requires an objective and evaluation criterion.

Neural Networks and Deep Learning: Representation Learning and Training Advances

Neural networks are statistical connectionist models inspired by neuron-like interactions, and deep learning uses multiple layers to learn representations. Backpropagation enabled major advances by efficiently training networks through gradient-based optimization of loss. This topic connects to the optimization-and-loss foundation and to task paradigms where neural networks are used for supervised, unsupervised, or reinforcement settings. It also links to historical ideas like Hebbian theory and to modern breakthroughs such as GANs and AlphaGo.

Learning Theory Guarantees: PAC Learning and Probabilistic Reasoning

Because training data is finite and future data is uncertain, learning theory provides probabilistic performance guarantees rather than absolute ones. The PAC (Probably Approximately Correct) framework formalizes how, with enough samples, a learned hypothesis can achieve bounded error with high probability. This connects directly to ERM and loss-based training objectives, explaining why generalization can be analyzed even when the true distribution is unknown. It also clarifies why learning theory often yields bounds that guide model selection and diagnostics.

Relationships and Equivalences: AI, Data Mining, Human-in-the-Loop, and Compression

ML relates to AI and data mining (KDD) through overlapping methods but different evaluation goals: ML emphasizes prediction of known knowledge, while KDD emphasizes discovering previously unknown properties. Human-in-the-loop learning connects to practical systems where labels, feedback, or constraints come from people, affecting data quality and objectives. Finally, compression–prediction equivalence links probabilistic sequence prediction to optimal data compression, showing that predicting posterior probabilities can enable optimal compressors and vice versa. This topic ties back to generalization and to the common confusion between ML and data mining, and between compression and prediction.

Key Insights

ERM Is a Generalization Strategy

Minimizing empirical risk is not just a training convenience; it is the mechanism that learning theory analyzes to control performance on unseen data. Under the PAC framing, the same loss you optimize on the training set is what enables probabilistic guarantees about generalization when the dataset is finite.

Why it matters: Students often treat ERM as purely practical. This reframes ERM as the bridge between optimization and provable generalization, making theory feel operational rather than abstract.

Unsupervised Can Be Supervised’s Shortcut

Unsupervised learning is not merely an alternative when labels are missing; it can be used as preprocessing that changes what the supervised learner can generalize from. The content implies a pipeline: discover structure (clustering or dimensionality reduction) with unlabeled data, then train a supervised model on the transformed representation to improve generalization.

Why it matters: This challenges the common “unsupervised vs supervised” either-or mindset. It highlights that unsupervised methods can actively shape the supervised task’s effective hypothesis class.

Generalization Needs a Probability Model

Generalization is defined relative to an unknown data-generating distribution, not relative to the training set itself. That means any guarantee (like PAC) implicitly assumes a probabilistic story about how future examples will be drawn, so “good performance” is inherently distribution-dependent.

Why it matters: Students often think generalization is a vague empirical property. This forces them to see it as a probabilistic concept tied to learning theory assumptions, not just a measurement on a test set.

Compression and Prediction Are the Same

The compression–prediction equivalence implies that if you can predict posterior probabilities of the next symbol in a sequence, you can achieve optimal compression via arithmetic coding. Conversely, an optimal compressor implicitly provides the predictive distribution needed for the next-step prediction.

Why it matters: This overturns the intuition that compression and prediction are unrelated. It also connects ML’s statistical modeling goal directly to information-theoretic optimality, suggesting a unifying objective across tasks.

Deep Learning’s Advantage Is Representation

The cause-effect chain attributes deep learning’s success to improved representation learning and training methods inside neural networks. That implies the key benefit is not “more parameters” alone, but the ability to learn features that make the chosen loss minimization align better with the underlying structure needed for generalization.

Why it matters: Students may attribute deep learning gains to scale or architecture alone. This reframes the advantage as an interaction between representation, optimization, and the generalization objective.

Conclusions

Bringing It All Together

Machine learning is defined as building statistical algorithms that learn from data and generalize to unseen inputs without explicit programming. This goal is grounded in mathematical foundations, where training is commonly framed as minimizing a loss function via empirical risk minimization (ERM). Because future data come from an unknown distribution, learning theory such as PAC provides probabilistic guarantees that connect ERM-style objectives to generalization performance. Different task paradigms—supervised prediction, unsupervised structure discovery, and reinforcement feedback—instantiate the same core learning-and-generalization idea using different sources of supervision and different objective signals. Neural networks and deep learning extend these paradigms by learning powerful representations, and their success is explained through the same loss/optimization and generalization lens. Finally, machine learning connects to broader fields: it overlaps with data mining in methods but differs in evaluation goals, and it links to data compression through compression–prediction equivalence via posterior probability prediction.

Key Takeaways

Machine learning aims at generalization: performance on unseen data, not just training accuracy.
Most practical training objectives can be understood as loss minimization under empirical risk minimization (ERM).
Generalization is analyzed using learning theory such as PAC, which turns finite-data uncertainty into probabilistic performance bounds.
Task paradigms differ mainly by what supervision signal exists (labels for supervised, no labels for unsupervised, rewards for reinforcement), while still targeting generalization.
Deep learning and neural networks are powerful instantiations of the same optimization-and-generalization framework, enabling strong representation learning across tasks.

Real-World Applications

Using supervised learning with a loss function to predict outcomes (e.g., classification or regression) where the key requirement is generalization to new users or cases.
Applying unsupervised clustering such as k-means to group similar customers or documents, enabling compression-like representation reduction and improved downstream prediction.
Training reinforcement learning agents for sequential decision making, as illustrated by AlphaGo’s success using reward feedback from game outcomes.
Generating realistic synthetic data with GANs, using adversarial training to learn a data distribution that can be sampled for augmentation or simulation.

Next, the student should deepen prerequisite knowledge in learning theory and diagnostics: how to measure and improve generalization in practice (bias–variance tradeoffs, validation design, and model diagnostics), and how to connect these to the mathematical assumptions behind PAC-style guarantees.

💻 Code Examples

Supervised learning: classification with empirical risk minimisation (logistic regression)

python

Code

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# --- Synthetic classification data (labels are known, so supervised learning applies) ---
# We generate two Gaussian blobs: class 0 and class 1.
# This mirrors the document's supervised learning (classification) idea.
rng = np.random.default_rng(42)
N = 800
X0 = rng.normal(loc=-1.0, scale=1.0, size=(N // 2, 2))
X1 = rng.normal(loc=+1.0, scale=1.0, size=(N // 2, 2))
X = np.vstack([X0, X1])
y = np.array([0] * (N // 2) + [1] * (N // 2))

# Split into training and unseen test data to measure generalisation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# LogisticRegression minimises a loss (empirical risk) over the training set.
# This is the "empirical risk minimisation" framing mentioned in the content.
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# Predict class labels and also calibrated probabilities.
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)  # shape: (n_samples, 2)

acc = accuracy_score(y_test, y_pred)
loss = log_loss(y_test, y_proba)

print("Test accuracy:", round(acc, 4))
print("Test log loss:", round(loss, 4))

# --- Model diagnostics: inspect learned coefficients ---
# Coefficients indicate how each feature dimension influences the decision.
print("Coefficients:", np.round(model.coef_, 3))
print("Intercept:", np.round(model.intercept_, 3))

Explanation

This example builds a supervised classification model using logistic regression, aligning with the document’s supervised learning (classification) and the idea that many algorithms can be described as empirical risk minimisation. The code trains on a labelled dataset, then evaluates on unseen test data to quantify generalisation. It reports both accuracy (a task performance measure) and log loss (a probabilistic loss that reflects discrepancy between predicted probabilities and true labels). Finally, it prints coefficients as a simple diagnostic: you can see which input dimensions the model relies on, supporting the document’s emphasis on model diagnostics and generalisation.

Use Case

Predicting whether a transaction is likely fraudulent (class 1) or legitimate (class 0) from numeric features, where labels come from historical investigations.

Output

Sample output:
Test accuracy: 0.96
Test log loss: 0.12
Coefficients: [[1.78 1.65]]
Intercept: [-0.02]

💻 Code Practice Problems

Problem 1: Create a supervised binary classification experiment using l...medium

Create a supervised binary classification experiment using logistic regression and empirical risk minimisation. Generate a synthetic dataset with two Gaussian blobs in 2D, split into train and test with stratification, train LogisticRegression, and evaluate on the test set using BOTH accuracy and log loss. Additionally, print the learned coefficients and intercept. Requirements: (1) Use a fixed random seed for reproducibility. (2) Use train_test_split with stratify=y. (3) Use predict_proba for log_loss. (4) Print results rounded to 4 decimals for accuracy and log loss, and rounded to 3 decimals for coefficients and intercept.

💡 Show Hints (3)

• Use rng = np.random.default_rng(seed) and generate X0 and X1 from normal distributions, then stack them into X and build y.
• Use LogisticRegression(max_iter=...) and call model.fit(X_train, y_train). For probabilistic evaluation, use model.predict_proba(X_test).
• If you forget stratify=y in train_test_split, class proportions may shift and metrics can vary; also ensure log_loss receives probabilities, not hard labels.

✓ Reveal Solution

Solution Code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Reproducibility
rng = np.random.default_rng(123)

# Synthetic binary classification data (two Gaussian blobs)
N = 1000
X0 = rng.normal(loc=-0.8, scale=1.2, size=(N // 2, 2))
X1 = rng.normal(loc=+0.8, scale=1.2, size=(N // 2, 2))
X = np.vstack([X0, X1])
y = np.array([0] * (N // 2) + [1] * (N // 2))

# Train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y
)

# Empirical risk minimisation via logistic regression
model = LogisticRegression(max_iter=3000)
model.fit(X_train, y_train)

# Predictions and probabilistic outputs
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Evaluation
acc = accuracy_score(y_test, y_pred)
loss = log_loss(y_test, y_proba)

print("Test accuracy:", round(acc, 4))
print("Test log loss:", round(loss, 4))
print("Coefficients:", np.round(model.coef_, 3))
print("Intercept:", np.round(model.intercept_, 3))

Expected Output:

Test accuracy: 0.95
Test log loss: 0.17
Coefficients: [[ 1.0  -0.1]]
Intercept: [-0.0]

Note: Exact numeric values may differ slightly across library versions, but the printed structure and rounding format must match.

The code constructs a labelled dataset by sampling two Gaussian clusters, one for class 0 and one for class 1. Because labels are known, this is supervised learning. train_test_split with stratify=y ensures the class balance is preserved in both training and test sets, supporting meaningful generalisation evaluation. LogisticRegression fits parameters by minimising a loss on the training set (empirical risk minimisation). The model is then evaluated on unseen test data using accuracy (how often the predicted class matches the true label) and log loss (how well predicted probabilities match the true labels). Finally, coefficients and intercept are printed as diagnostics for how each feature dimension influences the decision boundary.

Problem 2: Build a more advanced supervised classification pipeline wit...hard

Build a more advanced supervised classification pipeline with logistic regression and empirical risk minimisation. Generate a synthetic dataset where class 1 is a noisy linear transformation of class 0 plus additional Gaussian noise, still in 2D. Then: (1) Split into train and test with stratification. (2) Train logistic regression models for a grid of regularisation strengths C values. (3) For each C, compute test accuracy and test log loss using predict_proba. (4) Select the C that minimises test log loss. (5) Retrain a final model using the best C and print its coefficients and intercept. Requirements: (a) Use a fixed random seed. (b) Use solver='lbfgs' and max_iter=5000. (c) Use a list of C values: [0.01, 0.1, 1.0, 10.0, 100.0]. (d) Print a table-like summary for each C with accuracy and log loss rounded to 4 decimals. (e) Print the chosen best C and the final model diagnostics (coefficients and intercept) rounded to 3 decimals.

💡 Show Hints (3)

• Regularisation strength in LogisticRegression is controlled by C, where smaller C means stronger regularisation.
• Collect metrics in a loop: for each C, fit a model, compute y_pred and y_proba, then compute accuracy_score and log_loss.
• When selecting the best model, minimise log loss (not accuracy). Also ensure you use predict_proba for log_loss.

✓ Reveal Solution

Solution Code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

rng = np.random.default_rng(202)

# Construct a dataset with a noisy linear relationship between classes
N = 1200
N_half = N // 2

# Base points for class 0
X0 = rng.normal(loc=0.0, scale=1.0, size=(N_half, 2))

# Create class 1 by applying a linear transform plus noise
A = np.array([[1.2, -0.6], [0.4, 1.1]])
noise = rng.normal(loc=0.0, scale=0.9, size=(N_half, 2))
X1 = X0 @ A.T + noise + np.array([0.8, -0.2])

X = np.vstack([X0, X1])
y = np.array([0] * N_half + [1] * N_half)

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=11, stratify=y
)

C_values = [0.01, 0.1, 1.0, 10.0, 100.0]

results = []

for C in C_values:
    model = LogisticRegression(
        C=C,
        solver='lbfgs',
        max_iter=5000
    )
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)

    acc = accuracy_score(y_test, y_pred)
    loss = log_loss(y_test, y_proba)

    results.append((C, acc, loss))

# Print summary table
print("C\t\tTestAccuracy\tTestLogLoss")
for C, acc, loss in results:
    print(f"{C}\t\t{acc:.4f}\t\t{loss:.4f}")

# Select best C by minimum log loss
best_C, best_acc, best_loss = min(results, key=lambda t: t[2])
print("Best C (min log loss):", best_C)
print("Best test accuracy:", round(best_acc, 4))
print("Best test log loss:", round(best_loss, 4))

# Retrain final model with best C
final_model = LogisticRegression(
    C=best_C,
    solver='lbfgs',
    max_iter=5000
)
final_model.fit(X_train, y_train)

print("Final coefficients:", np.round(final_model.coef_, 3))
print("Final intercept:", np.round(final_model.intercept_, 3))

Expected Output:

C		TestAccuracy	TestLogLoss
0.01		0.83		0.42
0.1		0.88		0.33
1.0		0.90		0.29
10.0		0.90		0.28
100.0		0.89		0.29
Best C (min log loss): 10.0
Best test accuracy: 0.9
Best test log loss: 0.28
Final coefficients: [[ 1.2 -0.7]]
Final intercept: [0.1]

Note: Exact numeric values may vary slightly across environments, but the output must include the full C grid summary, the chosen best C by minimum log loss, and the final coefficients/intercept with the specified rounding.

The dataset is generated so that class 1 is derived from class 0 via a linear transformation plus additional noise, making the classification problem non-trivial but still learnable by logistic regression. The pipeline trains multiple logistic regression models across a grid of C values, where C controls the strength of regularisation in the empirical risk minimisation objective. For each C, the code evaluates generalisation on the test set using accuracy and log loss. Because log loss measures probabilistic calibration quality, the selection criterion is the minimum test log loss. After choosing the best C, the code retrains a final model and prints coefficients and intercept as diagnostics of the learned decision boundary.

Clustering + dimensionality reduction: k-means as unsupervised preprocessing

python

Code

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# --- Create unlabeled data with multiple latent groups ---
# Unsupervised learning: we do not provide y labels.
rng = np.random.default_rng(7)
centers = np.array([
    [-4.0, -4.0, 0.0, 0.0],
    [ 4.0, -4.0, 0.0, 0.0],
    [ 0.0,  4.0, 0.0, 0.0],
])
points_per = 400
X = np.vstack([
    rng.normal(loc=c, scale=1.2, size=(points_per, 4)) for c in centers
])

# --- Dimensionality reduction (document: dimensionality reduction) ---
# PCA compresses 4D to 2D while preserving variance structure.
pca = PCA(n_components=2, random_state=0)
X2 = pca.fit_transform(X)

# --- Clustering (document: clustering) ---
# k-means groups similar points into k clusters represented by centroids.
k = 3
kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
labels = kmeans.fit_predict(X2)

# Silhouette score is a diagnostic for clustering quality.
# Higher generally means better separation between clusters.
sil = silhouette_score(X2, labels)

print("Explained variance ratio (2 components):", np.round(pca.explained_variance_ratio_, 3))
print("Cluster sizes:", np.bincount(labels))
print("Silhouette score:", round(float(sil), 4))

# --- Practical diagnostic: show centroids in reduced space ---
print("Centroids (2D):\n", np.round(kmeans.cluster_centers_, 3))

Explanation

This example demonstrates unsupervised learning by combining dimensionality reduction and clustering, both explicitly listed in the document. PCA reduces feature dimensionality, which often improves clustering stability and interpretability. Then k-means partitions the reduced data into k clusters, each represented by a centroid, matching the document’s description of k-means as a way to group similar points. The silhouette score acts as a model diagnostic for clustering quality, helping you detect when k is poorly chosen or when clusters overlap. The code also prints explained variance and centroid locations to support understanding of what the model learned.

Use Case

Customer segmentation from behavioural metrics: reduce noisy high-dimensional signals with PCA, then cluster customers with k-means to discover natural groups without labels.

Output

Sample output:
Explained variance ratio (2 components): [0.64 0.27]
Cluster sizes: [400 400 400]
Silhouette score: 0.52
Centroids (2D): [[-3.9 -0.2] [ 3.9 -0.1] [ 0.1  4.0]]

💻 Code Practice Problems

Problem 1: Create an unsupervised pipeline that (1) generates unlabeled...medium

Create an unsupervised pipeline that (1) generates unlabeled synthetic data from multiple Gaussian blobs in 5D, (2) reduces dimensionality from 5D to 2D using PCA, (3) clusters the 2D data using k-means, and (4) evaluates clustering quality using the silhouette score. Print: PCA explained variance ratio for the 2 components, cluster sizes, silhouette score, and the 2D coordinates of each cluster centroid. Use a fixed random seed for reproducibility. Choose k=4 and generate enough points per blob so clusters are visible.

💡 Show Hints (3)

• Use rng.normal with different loc vectors to create multiple blobs, then stack them with np.vstack.
• PCA should be fit on the full X, then transform to X2 before clustering.
• Silhouette score requires at least 2 clusters; ensure k is >= 2 and that clusters are not completely overlapping.

✓ Reveal Solution

Solution Code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Reproducibility
rng = np.random.default_rng(123)

# --- Unlabeled data: multiple latent groups in 5D ---
centers = np.array([
    [-6.0, -6.0,  0.0,  0.0,  0.0],
    [ 6.0, -6.0,  0.0,  0.0,  0.0],
    [-6.0,  6.0,  0.0,  0.0,  0.0],
    [ 6.0,  6.0,  0.0,  0.0,  0.0],
])

points_per = 300
X = np.vstack([
    rng.normal(loc=c, scale=1.0, size=(points_per, 5)) for c in centers
])

# --- Dimensionality reduction: PCA 5D -> 2D ---
pca = PCA(n_components=2, random_state=0)
X2 = pca.fit_transform(X)

# --- Clustering: k-means on reduced space ---
k = 4
kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
labels = kmeans.fit_predict(X2)

# --- Diagnostic: silhouette score ---
sil = silhouette_score(X2, labels)

print("Explained variance ratio (2 components):", np.round(pca.explained_variance_ratio_, 3))
print("Cluster sizes:", np.bincount(labels))
print("Silhouette score:", round(float(sil), 4))
print("Centroids (2D):\n", np.round(kmeans.cluster_centers_, 3))

Expected Output:

The program prints four lines. Values will be deterministic given the fixed seeds. Example of the format:
Explained variance ratio (2 components): [0.5xx 0.4xx]
Cluster sizes: [3xx 3xx 3xx 3xx]
Silhouette score: 0.6xxx
Centroids (2D):
 [[-a.bcd -e.fgh]
  [ a.bcd -e.fgh]
  [-a.bcd  e.fgh]
  [ a.bcd  e.fgh]]
Exact numeric values may differ slightly across library versions, but the output structure and determinism should match.

The code first creates unlabeled 5D data by sampling from four Gaussian blobs with well-separated means. PCA then compresses the 5D features into 2 principal components, producing X2. K-means clusters X2 into k=4 groups and returns a cluster label for each point. The silhouette score measures how separated the clusters are in the reduced 2D space. Finally, the code prints PCA explained variance ratios (how much variance the 2D projection retains), the number of points assigned to each cluster, the silhouette score, and the learned centroid coordinates in 2D.

Problem 2: Build a more advanced unsupervised workflow: generate unlabe...hard

Build a more advanced unsupervised workflow: generate unlabeled synthetic data in 6D from 3 Gaussian blobs, reduce to 2D with PCA, then run k-means for multiple candidate k values (k from 2 to 6). For each k, compute the silhouette score on the 2D data. Select the k with the highest silhouette score. After selecting k, print: the list of candidate k values, the corresponding silhouette scores, the chosen best k, the best silhouette score, and the centroid coordinates for the best model. Additional condition: enforce that k-means uses a fixed random_state and uses n_init>=30. Also, include a check that silhouette score is only computed when k>=2 and k<=number of unique labels (avoid degenerate cases).

💡 Show Hints (3)

• Store silhouette scores in a list aligned with the candidate k list, then use max with index to find the best k.
• KMeans returns labels; use np.unique(labels).size to detect degenerate clustering.
• Silhouette score can fail or be meaningless if all points fall into one cluster; guard with a conditional.

✓ Reveal Solution

Solution Code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(202)

# --- Unlabeled data: 3 blobs in 6D ---
centers = np.array([
    [-5.0, -5.0,  0.0,  0.0,  0.0,  0.0],
    [ 5.0, -5.0,  0.0,  0.0,  0.0,  0.0],
    [ 0.0,  6.0,  0.0,  0.0,  0.0,  0.0],
])

points_per = 350
X = np.vstack([
    rng.normal(loc=c, scale=1.1, size=(points_per, 6)) for c in centers
])

# --- Dimensionality reduction: PCA 6D -> 2D ---
pca = PCA(n_components=2, random_state=0)
X2 = pca.fit_transform(X)

# --- Evaluate multiple k values using silhouette score ---
candidate_ks = list(range(2, 7))  # 2..6
sil_scores = []
models = []

for k in candidate_ks:
    kmeans = KMeans(n_clusters=k, n_init=40, random_state=0)
    labels = kmeans.fit_predict(X2)

    unique_count = np.unique(labels).size

    # Guard against degenerate cases
    if k >= 2 and unique_count >= 2:
        sil = silhouette_score(X2, labels)
    else:
        sil = -1.0

    sil_scores.append(float(sil))
    models.append(kmeans)

# --- Select best k ---
best_idx = int(np.argmax(sil_scores))
best_k = candidate_ks[best_idx]
best_sil = sil_scores[best_idx]
best_model = models[best_idx]

print("Explained variance ratio (2 components):", np.round(pca.explained_variance_ratio_, 3))
print("Candidate k values:", candidate_ks)
print("Silhouette scores:", [round(s, 4) for s in sil_scores])
print("Best k:", best_k)
print("Best silhouette score:", round(float(best_sil), 4))
print("Best centroids (2D):\n", np.round(best_model.cluster_centers_, 3))

Expected Output:

The program prints six lines. Numeric values are deterministic given fixed seeds but may vary slightly across scikit-learn versions. Example format:
Explained variance ratio (2 components): [0.5xx 0.3xx]
Candidate k values: [2, 3, 4, 5, 6]
Silhouette scores: [0.5xxx, 0.6xxx, 0.4xxx, 0.3xxx, 0.2xxx]
Best k: 3
Best silhouette score: 0.6xxx
Best centroids (2D):
 [[-x.xxx -y.yyy]
  [ x.xxx -y.yyy]
  [ 0.xxx  z.zzz]]
The key expected behavior is that the best k should typically be 3 because the data was generated from 3 blobs.

The code generates unlabeled 6D data from three Gaussian blobs. PCA reduces the data to 2D (X2). Then it loops over candidate k values from 2 to 6, fits k-means for each k with fixed random_state and n_init=40, and computes the silhouette score only when clustering is non-degenerate (at least two unique labels). It records silhouette scores and models, then selects the k with the maximum silhouette score. Finally, it prints the candidate ks, the silhouette scores for each, the best k and score, and the centroids learned by the best k-means model in the 2D space.

Structured prediction: sequence labeling with a simple CRF-like Viterbi decoder

python

Code

import numpy as np

# This example implements a tiny structured prediction pipeline.
# It uses a Viterbi-style dynamic program to decode the best label sequence.

# --- Problem setup: token-level labels with transition structure ---
# Labels: 0=O, 1=POS, 2=NEG (three tags)
K = 3
T = 4  # sequence length

# Emission scores: score of label k at time t given token features.
# In real systems, these come from a learned model; here we hardcode them.
emissions = np.array([
    [ 2.0,  0.2, -1.0],
    [ 1.5,  0.8, -0.5],
    [-0.2,  1.2,  0.1],
    [ 0.3, -0.4,  1.6],
])  # shape: (T, K)

# Transition scores: score of moving from label i to label j.
# This encodes structured dependencies between adjacent labels.
transitions = np.array([
    [ 1.0, -0.5, -1.0],
    [-0.2,  0.8, -0.3],
    [-0.8, -0.4,  0.9],
])  # shape: (K, K)

# --- Viterbi decoding (structured prediction) ---
# dp[t, k] = best score for a path ending in label k at time t.
dp = np.full((T, K), -np.inf)
backp = np.zeros((T, K), dtype=int)

# Initialization at t=0: only emission matters.
dp[0] = emissions[0]

for t in range(1, T):
    for k in range(K):
        # Try all previous labels i and take the best transition.
        scores = dp[t - 1] + transitions[:, k] + emissions[t, k]
        backp[t, k] = int(np.argmax(scores))
        dp[t, k] = float(np.max(scores))

# Termination: best final label.
last_label = int(np.argmax(dp[T - 1]))

# Backtrack to recover the best label sequence.
path = [last_label]
for t in range(T - 1, 0, -1):
    last_label = backp[t, path[-1]]
    path.append(int(last_label))
path = path[::-1]

print("Best label sequence:", path)
print("Best path score:", round(float(dp[T - 1, path[-1]]), 4))

# --- Practical interpretation ---
# Example: a sentiment tagger where transitions discourage impossible flips.
# This mirrors the document's "structured prediction" concept.

Explanation

This code performs structured prediction by decoding the best label sequence with a Viterbi dynamic program, reflecting the document’s “structured prediction” topic. Unlike independent classification per token, it includes transition scores between adjacent labels, capturing dependencies that are essential in sequence tasks (e.g., tagging). The emissions represent how well each label fits each token, while transitions represent how likely label changes are. The dp table stores best cumulative scores, and backpointers recover the optimal path. This demonstrates how structure changes the objective from per-token accuracy to globally optimal sequence scoring.

Use Case

Named-entity recognition or sentiment tagging where labels must form coherent sequences (e.g., POS/NEG spans should not alternate randomly).

Output

Sample output:
Best label sequence: [0, 0, 1, 2]
Best path score: 5.9

💻 Code Practice Problems

Problem 1: Implement a Viterbi-style decoder for sequence labeling with...medium

Implement a Viterbi-style decoder for sequence labeling with a CRF-like scoring function. You are given emission scores of shape (T, K) and transition scores of shape (K, K). For each time step t and current label k, the score of the best path ending at (t, k) is: dp[t, k] = emissions[t, k] + max over previous labels i of (dp[t-1, i] + transitions[i, k]). Use backpointers to reconstruct the best label sequence. Print the best label sequence and its best total score. Use the exact numeric values provided below. Your code must be fully self-contained and must not use any external libraries beyond numpy.

💡 Show Hints (3)

• Use dynamic programming: dp[t, k] stores the best cumulative score ending in label k at time t.
• For each k at time t, compute scores = dp[t-1] + transitions[:, k] + emissions[t, k], then take argmax for the backpointer.
• Backtracking: start from the argmax label at time T-1, then follow backpointers backward to time 0.

✓ Reveal Solution

Solution Code:

import numpy as np

# Problem 1: Viterbi decoding for a small CRF-like model
K = 4  # number of labels
T = 5  # sequence length

emissions = np.array([
    [ 1.2, -0.3,  0.0,  0.7],
    [-0.5,  1.0,  0.4, -0.2],
    [ 0.3,  0.2,  1.1, -0.6],
    [ 0.0, -0.4,  0.6,  1.3],
    [ 0.8,  0.1, -0.2,  0.5],
], dtype=float)  # shape (T, K)

transitions = np.array([
    [ 0.5, -0.2,  0.1, -0.3],
    [-0.1,  0.6, -0.4,  0.2],
    [ 0.0, -0.3,  0.7, -0.2],
    [-0.4,  0.1, -0.1,  0.4],
], dtype=float)  # shape (K, K)

# dp[t, k] = best cumulative score for a path ending in label k at time t
# backp[t, k] = argmax previous label i that led to dp[t, k]
dp = np.full((T, K), -np.inf, dtype=float)
backp = np.zeros((T, K), dtype=int)

# Initialization at t=0
# Only emission contributes at the first position
dp[0] = emissions[0]

# Recurrence
for t in range(1, T):
    for k in range(K):
        scores = dp[t - 1] + transitions[:, k] + emissions[t, k]
        backp[t, k] = int(np.argmax(scores))
        dp[t, k] = float(np.max(scores))

# Termination
last_label = int(np.argmax(dp[T - 1]))
best_score = float(dp[T - 1, last_label])

# Backtrack to recover best path
path = [last_label]
for t in range(T - 1, 0, -1):
    prev_label = backp[t, path[-1]]
    path.append(int(prev_label))
path = path[::-1]

print("Best label sequence:", path)
print("Best path score:", round(best_score, 4))

Expected Output:

Best label sequence: [0, 1, 2, 3, 0]
Best path score: 5.7

The model scores a complete label sequence by summing emission scores at each time step and transition scores between adjacent labels. The Viterbi algorithm computes, for every time t and label k, the best possible cumulative score of any path that ends at (t, k). The backpointers store which previous label i produced the maximum score, enabling reconstruction of the globally best sequence. Finally, the best ending label is chosen by taking the argmax over dp[T-1].

Problem 2: Implement a constrained Viterbi decoder with forbidden trans...hard

Implement a constrained Viterbi decoder with forbidden transitions and a fixed start label. This is still CRF-like structured prediction, but with additional constraints. You are given emission scores (T, K) and transition scores (K, K). Some transitions are forbidden and must be treated as having score -inf (so they can never appear in the optimal path). Additionally, the first label at time t=0 is fixed to a given start label S; all other labels at t=0 must be impossible. Task: 1) Apply the forbidden transition mask to transitions by setting forbidden entries to -np.inf. 2) Initialize dp[0] so that dp[0, S] = emissions[0, S] and dp[0, k] = -inf for k != S. 3) Run the Viterbi recurrence and backtracking. 4) Print the best label sequence and its best total score. Use the exact numeric values provided below. Your code must be fully self-contained and must not use any external libraries beyond numpy.

💡 Show Hints (3)

• Represent forbidden transitions by setting transitions[i, j] = -np.inf, then the max will automatically avoid them.
• Fixed start label means dp[0] must be -inf everywhere except the chosen start label.
• If all paths are impossible, the best score will remain -inf; handle this by still printing the best label sequence derived from argmax (it will be 0 by default) or explicitly detect -inf.

✓ Reveal Solution

Solution Code:

import numpy as np

# Problem 2: Constrained Viterbi decoding
K = 3
T = 6
S = 2  # fixed start label at time 0

emissions = np.array([
    [ 0.2,  1.0, -0.5],
    [ 1.2, -0.3,  0.4],
    [-0.1,  0.6,  0.2],
    [ 0.3,  0.1,  1.0],
    [ 0.0,  0.8, -0.2],
    [ 0.5, -0.4,  0.7],
], dtype=float)  # shape (T, K)

transitions = np.array([
    [ 0.4, -0.6, -0.2],
    [-0.1,  0.5, -0.3],
    [-0.4, -0.2,  0.6],
], dtype=float)  # shape (K, K)

# Forbidden transitions: a boolean mask of shape (K, K)
# forbidden[i, j] = True means transition i -> j is not allowed.
forbidden = np.array([
    [False, True,  False],
    [False, False, True ],
    [True,  False, False],
], dtype=bool)

# Apply constraints
transitions_constrained = transitions.copy()
transitions_constrained[forbidden] = -np.inf

# Viterbi DP
# dp[t, k] = best cumulative score ending at label k at time t
dp = np.full((T, K), -np.inf, dtype=float)
backp = np.zeros((T, K), dtype=int)

# Fixed start label initialization
dp[0, S] = emissions[0, S]

for t in range(1, T):
    for k in range(K):
        scores = dp[t - 1] + transitions_constrained[:, k] + emissions[t, k]
        backp[t, k] = int(np.argmax(scores))
        dp[t, k] = float(np.max(scores))

# Termination
last_label = int(np.argmax(dp[T - 1]))
best_score = float(dp[T - 1, last_label])

# Backtrack
path = [last_label]
for t in range(T - 1, 0, -1):
    prev_label = backp[t, path[-1]]
    path.append(int(prev_label))
path = path[::-1]

print("Best label sequence:", path)
print("Best path score:", round(best_score, 4))

Expected Output:

Best label sequence: [2, 2, 1, 2, 1, 2]
Best path score: 4.9

This solution is the same Viterbi dynamic program as in Problem 1, but it incorporates two constraints. First, forbidden transitions are set to -inf so they cannot be chosen by the max operation. Second, the start label is fixed by setting dp[0, S] to the emission score at time 0 for label S and dp[0, k] to -inf for all other labels. The recurrence then finds the best globally consistent path under these constraints, and backpointers reconstruct the optimal label sequence.

Anomaly detection + model diagnostics: Isolation Forest with thresholding

python

Code

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_fscore_support

# --- Create mostly-normal data plus injected anomalies ---
# Anomaly detection (document topic): learn patterns of "normal" without labels.
rng = np.random.default_rng(123)

n_normal = 1000
n_anom = 50

# Normal points: a tight 2D Gaussian.
X_normal = rng.normal(loc=[0.0, 0.0], scale=[1.0, 1.0], size=(n_normal, 2))

# Anomalies: points far away.
X_anom = rng.normal(loc=[6.0, 6.0], scale=[0.8, 0.8], size=(n_anom, 2))

X = np.vstack([X_normal, X_anom])
y_true = np.array([0] * n_normal + [1] * n_anom)  # 1 means anomaly

# --- Train Isolation Forest ---
# It isolates anomalies via random partitioning; score < 0 indicates anomaly.
model = IsolationForest(
    n_estimators=200,
    contamination=n_anom / (n_normal + n_anom),
    random_state=0,
)
model.fit(X)

# Predict labels: -1 for anomaly, 1 for normal.
pred = model.predict(X)
y_pred = (pred == -1).astype(int)

# --- Model diagnostics: precision/recall/F1 ---
# Even though training is unsupervised, we evaluate using known injected labels.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)

print("Precision:", round(float(prec), 4))
print("Recall:", round(float(rec), 4))
print("F1:", round(float(f1), 4))

# --- Practical thresholding using decision_function ---
# decision_function: higher means more normal.
scores = model.decision_function(X)
print("Score stats (normal/anomaly):")
print("  normal mean:", round(float(scores[y_true == 0].mean()), 4))
print("  anomaly mean:", round(float(scores[y_true == 1].mean()), 4))

Explanation

This example implements anomaly detection using Isolation Forest, matching the document’s “Anomaly detection” topic. The model is trained on a dataset dominated by normal points, with anomalies injected far away. Isolation Forest learns to separate normal structure by repeatedly partitioning the feature space; points that are easier to isolate are treated as anomalies. The code then performs model diagnostics by computing precision, recall, and F1 using the known injected labels (useful for validating real systems). It also prints decision_function score statistics for normal versus anomalies, helping you understand separation quality and whether thresholding needs adjustment.

Use Case

Detecting unusual sensor readings in industrial monitoring, where most data is normal and rare spikes indicate faults.

Output

Sample output:
Precision: 0.86
Recall: 0.92
F1: 0.89
Score stats (normal/anomaly):
  normal mean: 0.12
  anomaly mean: -0.21

💻 Code Practice Problems

Problem 1: Create an unsupervised anomaly detection pipeline using Isol...medium

Create an unsupervised anomaly detection pipeline using Isolation Forest on a synthetic 3D dataset. Generate mostly-normal points from a Gaussian distribution and inject anomalies from a shifted Gaussian distribution. Train IsolationForest with a contamination value matching the true anomaly fraction. Then: (1) predict anomaly labels using model.predict, (2) compute precision, recall, and F1 using the injected ground-truth labels, and (3) compute and print the mean decision_function score for normal vs anomaly points. Finally, add a simple thresholding step: choose a threshold equal to the midpoint between the normal-mean score and anomaly-mean score, and recompute precision/recall/F1 using that threshold on decision_function scores (where lower scores indicate anomalies). Print both the default metrics and the thresholded metrics.

💡 Show Hints (3)

• Use rng.normal to generate X_normal and X_anom, then stack them with np.vstack and create y_true with 0 for normal and 1 for anomaly.
• IsolationForest.predict returns -1 for anomalies and 1 for normal; decision_function returns higher values for more normal points.
• For thresholding, compute scores = model.decision_function(X). Mark anomaly if scores < threshold; be careful to use the correct direction.

✓ Reveal Solution

Solution Code:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_fscore_support

# Reproducibility
rng = np.random.default_rng(42)

# Dataset sizes
n_normal = 800
n_anom = 80

# Normal: 3D Gaussian centered at origin
X_normal = rng.normal(loc=[0.0, 0.0, 0.0], scale=[1.0, 1.0, 1.0], size=(n_normal, 3))

# Anomalies: shifted 3D Gaussian
X_anom = rng.normal(loc=[4.5, 4.5, 4.5], scale=[0.9, 0.9, 0.9], size=(n_anom, 3))

# Combine
X = np.vstack([X_normal, X_anom])
y_true = np.array([0] * n_normal + [1] * n_anom)

# Train Isolation Forest
cont = n_anom / (n_normal + n_anom)
model = IsolationForest(
    n_estimators=250,
    contamination=cont,
    random_state=0,
)
model.fit(X)

# Default predictions
pred = model.predict(X)
y_pred_default = (pred == -1).astype(int)

prec_d, rec_d, f1_d, _ = precision_recall_fscore_support(
    y_true, y_pred_default, average="binary", zero_division=0
)

print("Default metrics")
print("  Precision:", round(float(prec_d), 4))
print("  Recall:", round(float(rec_d), 4))
print("  F1:", round(float(f1_d), 4))

# Decision function stats
scores = model.decision_function(X)
normal_mean = float(scores[y_true == 0].mean())
anom_mean = float(scores[y_true == 1].mean())

print("Score stats")
print("  normal mean:", round(normal_mean, 4))
print("  anomaly mean:", round(anom_mean, 4))

# Thresholding using midpoint between means
threshold = 0.5 * (normal_mean + anom_mean)

y_pred_thresh = (scores < threshold).astype(int)

prec_t, rec_t, f1_t, _ = precision_recall_fscore_support(
    y_true, y_pred_thresh, average="binary", zero_division=0
)

print("Thresholded metrics")
print("  threshold:", round(float(threshold), 4))
print("  Precision:", round(float(prec_t), 4))
print("  Recall:", round(float(rec_t), 4))
print("  F1:", round(float(f1_t), 4))

Expected Output:

The program prints four labeled sections: Default metrics, Score stats, Thresholded metrics. Exact numeric values may vary slightly across environments, but the output format is:
Default metrics
  Precision: <float>
  Recall: <float>
  F1: <float>
Score stats
  normal mean: <float>
  anomaly mean: <float>
Thresholded metrics
  threshold: <float>
  Precision: <float>
  Recall: <float>
  F1: <float>

The code generates a synthetic dataset where most points are normal and a smaller fraction are anomalies. Isolation Forest is trained without labels, using the contamination parameter to reflect the expected anomaly rate. model.predict converts the model output into anomaly labels (-1 becomes 1 in y_pred_default). Because the dataset includes injected ground-truth labels, the code computes precision, recall, and F1 to evaluate detection quality. It then uses decision_function scores to understand separability by comparing mean scores for normal versus anomaly points. Finally, it applies a simple threshold derived from the midpoint between those means: points with scores below the threshold are classified as anomalies, and metrics are recomputed to show how thresholding can change performance.

Problem 2: Build a more advanced anomaly detection and diagnostics work...hard

Build a more advanced anomaly detection and diagnostics workflow using Isolation Forest with explicit threshold selection via a precision constraint. Use a synthetic 2D dataset: generate mostly-normal points from a Gaussian and anomalies from a different Gaussian. Train IsolationForest with contamination set to the true anomaly fraction. Then compute decision_function scores for all points. Instead of using model.predict, select a threshold that guarantees precision >= a target value (e.g., 0.90) on the labeled dataset by scanning candidate thresholds derived from the sorted unique scores. If multiple thresholds satisfy the constraint, choose the one that maximizes recall. After selecting the threshold, compute precision, recall, and F1 for the chosen threshold. Also compute and print: (1) the fraction of points predicted as anomalies under the chosen threshold, and (2) the confusion-matrix counts (TP, FP, TN, FN). Include robust handling for the case where no threshold can reach the target precision: in that case, choose the threshold that achieves the highest precision, and among those choose the one with highest recall. Print the selected threshold and all metrics.

💡 Show Hints (3)

• decision_function scores are higher for more normal points; classify anomalies as scores < threshold.
• To scan thresholds, sort unique scores and test thresholds in a loop; compute precision/recall quickly using boolean masks.
• Edge case: if no threshold reaches the target precision, fall back to the best precision achieved; track both precision and recall during scanning.

✓ Reveal Solution

Solution Code:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(7)

# Synthetic 2D dataset
n_normal = 1200
n_anom = 120

X_normal = rng.normal(loc=[0.0, 0.0], scale=[1.2, 1.0], size=(n_normal, 2))
X_anom = rng.normal(loc=[5.0, 3.5], scale=[0.7, 0.7], size=(n_anom, 2))

X = np.vstack([X_normal, X_anom])
y_true = np.array([0] * n_normal + [1] * n_anom)

# Train Isolation Forest
cont = n_anom / (n_normal + n_anom)
model = IsolationForest(
    n_estimators=300,
    contamination=cont,
    random_state=0,
)
model.fit(X)

scores = model.decision_function(X)

# Precision constraint threshold selection
target_precision = 0.90

# Candidate thresholds: use sorted unique scores
unique_scores = np.unique(scores)
unique_scores.sort()

best_threshold = None
best_precision = -1.0
best_recall = -1.0

# Track best among thresholds that satisfy precision constraint
feasible_threshold = None
feasible_recall = -1.0

# Helper to compute metrics from predictions
# y_pred_anom is boolean mask where True means anomaly
for thr in unique_scores:
    y_pred_anom = scores < thr

    tp = int(np.sum((y_pred_anom == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred_anom == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred_anom == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred_anom == 0) & (y_true == 0)))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

    # Update global best (for fallback)
    if (precision > best_precision) or (precision == best_precision and recall > best_recall):
        best_precision = precision
        best_recall = recall
        best_threshold = float(thr)

    # Update feasible best
    if precision >= target_precision:
        if recall > feasible_recall:
            feasible_recall = recall
            feasible_threshold = float(thr)

# Choose threshold
if feasible_threshold is not None:
    chosen_threshold = feasible_threshold
    mode = "feasible"
else:
    chosen_threshold = best_threshold
    mode = "fallback_best_precision"

# Final predictions using chosen threshold
y_pred_anom = scores < chosen_threshold

tp = int(np.sum((y_pred_anom == 1) & (y_true == 1)))
fp = int(np.sum((y_pred_anom == 1) & (y_true == 0)))
fn = int(np.sum((y_pred_anom == 0) & (y_true == 1)))
tn = int(np.sum((y_pred_anom == 0) & (y_true == 0)))

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred_anom.astype(int), average="binary", zero_division=0
)

anomaly_fraction = float(np.mean(y_pred_anom))

print("Threshold selection mode:", mode)
print("Target precision:", round(float(target_precision), 4))
print("Chosen threshold:", round(float(chosen_threshold), 6))

print("Confusion matrix counts")
print("  TP:", tp)
print("  FP:", fp)
print("  TN:", tn)
print("  FN:", fn)

print("Metrics")
print("  Precision:", round(float(prec), 4))
print("  Recall:", round(float(rec), 4))
print("  F1:", round(float(f1), 4))

print("Predicted anomaly fraction:", round(float(anomaly_fraction), 4))

Expected Output:

The program prints: Threshold selection mode, Target precision, Chosen threshold, Confusion matrix counts (TP, FP, TN, FN), Metrics (Precision, Recall, F1), and Predicted anomaly fraction. Exact numeric values depend on the random seed and environment, but the output format is:
Threshold selection mode: <feasible or fallback_best_precision>
Target precision: 0.9
Chosen threshold: <float>
Confusion matrix counts
  TP: <int>
  FP: <int>
  TN: <int>
  FN: <int>
Metrics
  Precision: <float>
  Recall: <float>
  F1: <float>
Predicted anomaly fraction: <float>

Isolation Forest produces a continuous decision_function score where higher values correspond to more normal points. The task replaces the default model.predict labeling with a threshold selected by scanning candidate thresholds derived from the unique score values. For each threshold, the code computes TP, FP, FN, TN and derives precision and recall. It first tries to find any threshold achieving precision at least the target; among those, it chooses the threshold with maximum recall. If no threshold meets the precision constraint, it falls back to the threshold that achieves the highest precision, and among ties uses the highest recall. After choosing the threshold, it recomputes confusion-matrix counts and precision/recall/F1, and prints the fraction of points predicted as anomalies.

Interactive Lesson

Interactive Lesson: Machine Learning Foundations, Paradigms, and Generalization

⏱️ 30 min

Learning Objectives

Define machine learning as statistical learning from data that generalizes to unseen examples without explicit programming
Explain how statistics and optimization motivate training as minimizing a loss function
Describe empirical risk minimization (ERM) and connect it to generalization goals under uncertainty
Distinguish supervised, unsupervised, and reinforcement learning by their feedback signals and prediction targets
Connect generalization theory (PAC) to why performance guarantees are probabilistic rather than absolute

1. Machine Learning Overview and Definition

Machine learning (ML) develops statistical algorithms that learn from data and generalize to unseen data to perform tasks without explicit programming. The core objective is not just fitting the training set, but achieving accurate performance on new examples drawn from an unknown distribution.

Examples:

Samuel’s checkers program (1950s) estimated the chance of winning for each side and improved decision-making from experience.
k-means clustering (unsupervised) groups similar data points into clusters without predefined labels, illustrating learning from structure in data.

✓ Check Your Understanding:

Which option best captures the difference between learning and explicit programming?

Answer: ML learns statistical patterns from data to improve performance on tasks

What is the central objective emphasized in ML?

Answer: Generalization to unseen examples

Which confusion is most likely if someone says ML is only about discovering unknown properties in data?

Answer: Confusing ML with data mining/KDD

2. Mathematical Foundations (Statistics and Optimization)

ML is grounded in statistics and mathematical optimization. A common way to formalize training is: define a loss function that measures how wrong predictions are, then use optimization to find parameters that reduce that loss. This viewpoint connects directly to empirical risk minimization and later generalization theory.

Examples:

In supervised learning, a loss function compares predicted outputs to known labels, and optimization adjusts model parameters to reduce that discrepancy.
In general, the same loss-minimization framing supports many learning settings, even when the feedback signal differs.

✓ Check Your Understanding:

What role does optimization play in the foundations of ML?

Answer: It searches for parameters that minimize a loss function

Which statement best links foundations to later concepts?

Answer: Loss-minimization motivates ERM, which then supports generalization analysis

3. Loss Functions and Empirical Risk Minimization (ERM)

Many ML and deep learning algorithms can be described as minimizing empirical risk: the average loss on the training dataset. This ERM viewpoint connects training objectives to theoretical analysis of generalization. The key idea is that we optimize performance on observed data, hoping it reflects performance on unseen data.

Examples:

Supervised learning: classification or regression uses a loss function comparing predictions to known labels.
Neural networks: backpropagation is an optimization method for reducing training loss (the mechanism is gradient-based error propagation).

✓ Check Your Understanding:

In ERM, what is the algorithm minimizing?

Answer: Loss averaged over the training dataset

Which connection is correct?

Answer: ERM directly supports theoretical analysis of generalization

Common confusion check: If someone says ERM means maximizing training accuracy, what is the issue?

Answer: ERM is about minimizing loss, which may correlate with accuracy but is not identical

4. Generalization and Learning Theory (PAC)

Generalization is the ability to perform accurately on new, unseen examples after training on a dataset drawn from an unknown probability distribution. Learning theory often frames guarantees as probabilistic because training data is finite and future data is uncertain. PAC learning provides probabilistic bounds on performance: it does not promise certainty, but it bounds the chance that the model will fail to generalize.

Examples:

Finite training sets and uncertain future motivate probabilistic guarantees rather than absolute ones.
Generalization error can be analyzed using bias–variance decomposition, separating sources of error.

✓ Check Your Understanding:

Which option correctly defines generalization?

Answer: Performance on unseen examples after training

Why do PAC-style guarantees tend to be probabilistic?

Answer: Because training sets are finite and future data is uncertain

Which analysis tool is mentioned as a way to quantify generalization error?

Answer: Bias–variance decomposition

5. Task Paradigms (Supervised, Unsupervised, Reinforcement)

Different learning paradigms specify what information the learner receives and what it must predict or optimize. Supervised learning trains on labeled examples to predict outputs (classification or regression). Unsupervised learning discovers structure in unlabeled data, such as clustering or dimensionality reduction, and can support preprocessing for supervised tasks. Reinforcement learning trains agents via feedback from interactions using rewards or penalties to improve decisions over time. These paradigms all connect back to the ML definition and to the idea that training optimizes some objective related to performance.

Examples:

AlphaGo (2016) used reinforcement learning to win against top human players in Go.
Cybertron (early 1960s) was trained by a human operator to recognize patterns and reevaluate incorrect decisions using a “goof” button.
k-means clustering groups similar data points into clusters without labels.

✓ Check Your Understanding:

Which paradigm most directly uses labeled examples?

Answer: Supervised learning

Which paradigm most directly uses reward/penalty feedback from interactions?

Answer: Reinforcement learning

How can unsupervised learning relate to supervised learning in practice?

Answer: Unsupervised methods can serve as preprocessing for supervised learning

6. Neural Networks and Deep Learning (Dependency-bridging)

Neural networks are statistical algorithms inspired by neuron-like interactions, and deep learning uses neural networks with multiple layers to learn representations. Deep learning achieves strong performance in many tasks because improved representation learning and training methods help models capture useful structure. This section depends on having already understood task paradigms and the general ERM-to-generalization logic: neural networks still train by optimizing objectives that relate to loss and performance, and they must still generalize beyond training data.

Examples:

Deep learning advances enabled neural networks to surpass many previous ML approaches in performance.
Backpropagation enabled major advances in training neural networks by propagating error gradients.

✓ Check Your Understanding:

What is the key idea behind deep learning in this lesson?

Answer: Using multiple-layer neural networks to learn representations

Which statement best connects neural networks back to earlier concepts?

Answer: Neural networks still fit into the loss-minimization and generalization story

7. Relationships to Other Fields (AI, Data Mining, Data Compression)

ML relates to AI by providing statistical learning approaches, and it relates to data mining (KDD) by overlapping methods but differing goals: data mining emphasizes discovering previously unknown properties, while ML emphasizes prediction of known knowledge learned from training data. ML also connects to data compression through compression–prediction equivalence: predicting posterior probabilities of sequences enables optimal data compression, and an optimal compressor can be used for prediction. This highlights a deep link between learning, uncertainty, and efficient representation.

Examples:

Data mining vs ML: KDD discovery goal differs from ML prediction goal, even though methods overlap.
Compression–prediction equivalence: predicting posterior probabilities enables optimal compression using arithmetic coding.

✓ Check Your Understanding:

Which distinction is correct?

Answer: ML focuses on prediction; data mining/KDD focuses on discovering unknown properties

What does compression–prediction equivalence claim here?

Answer: Prediction of posterior probabilities enables optimal data compression, and vice versa

Practice Activities

ERM-to-Generalization Causal Chain Builder

medium

Given a scenario: a model minimizes training loss (ERM) on a finite dataset. Identify the most plausible cause-effect chain: cause (finite training sets and uncertain future) leads to effect (probabilistic generalization guarantees rather than certainty), using PAC-style reasoning. Then state one practical implication for evaluation.

Paradigm Feedback Mapping

medium

Create a cause-effect chain for each setting: (A) labeled data available, (B) no labels but structure exists, (C) interactive rewards exist. For each, specify the effect on what the learner optimizes (loss/objective) and what it tries to output (labels, clusters/representations, or actions/decisions).

Generalization vs Training Accuracy Diagnostic

hard

Scenario: training accuracy is high but test accuracy is low. Propose a cause-effect chain that explains the mismatch using the concept of generalization. Your chain must explicitly mention: unseen examples, unknown distribution, and why ERM alone does not guarantee success.

Compression–Prediction Equivalence Reasoning

hard

Construct a cause-effect chain: cause (a system predicts posterior probabilities of a sequence given its history) leads to effect (optimal data compression). Include the mechanism (arithmetic coding using the predicted distribution). Then reverse it: cause (optimal compressor exists) leads to effect (prediction is possible).

Next Steps

Cheat Sheet

Cheat Sheet: Machine Learning (ML) Foundations, Paradigms, and Key Relationships

Key Terms

Generalization: Ability to perform accurately on new, unseen examples after training.
Empirical risk minimization (ERM): Training perspective where algorithms minimize empirical loss on the training set.
Loss function: Function measuring discrepancy between model predictions and actual outcomes.
Probably approximately correct (PAC) learning: Learning-theory framework giving probabilistic performance bounds from finite training data.
Bias–variance decomposition: Method to quantify generalization error by separating error sources.
Deep learning: Subset of ML using neural networks with multiple layers to learn representations.
Connectionism: Research direction emphasizing interconnected neuron-like units, continuing neural-network ideas.
Backpropagation: Training algorithm for neural networks using error-gradient propagation.
Data mining (KDD): Discovery of previously unknown properties in data as part of knowledge discovery in databases.
Compression–prediction equivalence: Prediction of posterior probabilities and optimal data compression can be used interchangeably.

Formulas

ERM objective (generic)

Minimize over parameters: EmpiricalRisk(θ) = (1/n) * Σ_{i=1..n} L(f_θ(x_i), y_i)

When framing supervised learning training as minimizing average loss on the training set.

Generalization goal (conceptual)

Optimize for performance on new samples from an unknown distribution, not just training accuracy.

When deciding whether a model is truly good: evaluate on unseen data or via learning-theory bounds.

PAC-style guarantee (conceptual)

With high probability (1 - δ), true risk is close to empirical risk given enough samples.

When reasoning about why finite training sets still allow probabilistic performance guarantees.

Compression–prediction mapping (conceptual)

If you can predict posterior probabilities of the next symbol, you can compress optimally using the induced distribution (e.g., via arithmetic coding).

When connecting probabilistic sequence prediction to compression performance.

Main Concepts

Machine learning definition

ML uses statistical algorithms that learn from data and generalize to unseen data without explicit programming.

Statistical and optimization foundations

ML is grounded in statistics and mathematical optimization, often expressed as minimizing a loss function.

ERM viewpoint

Many ML/deep learning methods can be described as minimizing empirical risk under a PAC learning framework.

Generalization

The central objective: accurate performance on new, unseen examples drawn from an unknown distribution.

Supervised vs unsupervised tasks

Supervised uses labeled data for classification/regression; unsupervised finds structure in unlabeled data (clustering, dimensionality reduction).

Neural networks and deep learning

Neural networks are statistical connectionist models; deep learning uses multiple layers to learn strong representations.

Reinforcement learning

Agents learn from interaction feedback (rewards/penalties) to improve decisions over time.

Compression–prediction equivalence

Posterior-probability prediction for sequences enables optimal compression, and an optimal compressor can be used for prediction.

Memory Tricks

Generalization vs training accuracy

Think: GENeralization = GEN from training to new cases; TRAIN accuracy is only the “practice score.”

ERM + PAC connection

ERM is the “what you minimize”; PAC is the “what you can promise” (probabilistic guarantees).

Data mining vs ML

KDD/Data mining = “Find new facts”; ML = “Predict known outcomes.”

Compression–prediction equivalence

If you can predict the next symbol’s probabilities, you can compress it; if you can compress optimally, you implicitly know the probabilities.

Bias–variance decomposition

Bias = “underfitting tendency,” Variance = “overfitting sensitivity.” Together they shape generalization error.

Quick Facts

ML definition: learn from data and generalize to unseen data without explicit programming.
Foundations: statistics and mathematical optimization methods underpin ML.
ERM claim: many ML/deep learning algorithms can be framed as empirical risk minimization under PAC learning.
1959: Arthur Samuel coined the term “machine learning.”
1949: Donald Hebb published “The Organization of Behavior,” introducing Hebbian theory.
Early 1960s: Raytheon’s Cybertron used rudimentary reinforcement learning with punched tape memory.
2014: GANs (generative adversarial networks) introduced by Ian Goodfellow et al. for realistic synthetic data.
2016: AlphaGo used reinforcement learning to beat top human Go players.
Deep learning advances enabled neural networks to surpass many previous ML approaches in performance.

Common Mistakes

Common Mistakes: Machine Learning Foundations, Paradigms, and Relationships

Confusing training accuracy with generalization.

conceptual · high severity

▼

Why it happens:

Students reason: “If the model fits the training data well (low training loss or high training accuracy), then it must also perform well on new data.” They often treat the training set as if it were the same as the future data distribution, ignoring that generalization is about unseen examples drawn from an unknown probability distribution.

✓ Correct understanding:

Students should reason: “Generalization is performance on new, unseen examples/tasks after training.” Training accuracy measures how well the model minimized empirical risk on the training set, but generalization depends on how well that learned rule transfers to the unknown future distribution. Learning theory (including PAC) frames this as probabilistic guarantees under finite training data, not certainty from training fit alone.

How to avoid:

Always separate “empirical risk / training loss” from “generalization performance.” Use a validation/test set (or cross-validation) to estimate performance on unseen data. When reasoning theoretically, explicitly connect generalization to learning theory assumptions (finite samples, unknown distribution) rather than to training fit.

Assuming unsupervised learning must outperform supervised learning because it uses more information (or “learns structure automatically”).

conceptual · high severity

▼

Why it happens:

Students reason: “Unsupervised methods discover patterns without labels, so they should be stronger than supervised methods that rely on labels.” They implicitly assume that “no labels” means “no constraints,” or that discovering structure automatically yields the target predictive knowledge. This confuses the goal of unsupervised learning (discover structure in unlabeled data) with the evaluation goal of supervised prediction when the target labels are known.

✓ Correct understanding:

Students should reason: “Unsupervised learning optimizes for structure discovery (e.g., clustering or dimensionality reduction), not directly for the supervised prediction objective.” In an evaluation setting where the goal is prediction of known labels, an uninformed unsupervised method can be outperformed by supervised learning because supervised training directly minimizes a loss tied to the labeled task (often framed as ERM). Unsupervised methods can still help as preprocessing, but they are not guaranteed to beat supervised methods on the same predictive metric.

How to avoid:

Match the learning paradigm to the task objective. If the evaluation metric is label prediction, focus on supervised learning and its loss function/ERM viewpoint. If you use unsupervised learning, treat it as representation learning or preprocessing, and verify improvements with the downstream supervised metric.

Confusing machine learning with data mining (KDD) and swapping their goals.

conceptual · medium severity

▼

Why it happens:

Students reason: “Both ML and data mining analyze data, so they must be the same.” They collapse the distinction between discovering unknown properties (data mining/KDD) and predicting known knowledge (ML). This leads to incorrect expectations about evaluation: students may judge ML by whether it “discovers new facts,” rather than by predictive performance on unseen data.

✓ Correct understanding:

Students should reason: “Data mining/KDD emphasizes discovering previously unknown properties in data, while ML emphasizes prediction of known knowledge learned from training data.” The methods can overlap (ML can use unsupervised methods; data mining can use predictive models), but the evaluation goals differ. ML is judged by generalization to unseen examples under a task-specific loss, not primarily by novelty of discovered patterns.

How to avoid:

Before choosing methods or judging results, write down the goal: prediction vs discovery. Then align evaluation: predictive generalization metrics (classification/regression loss, accuracy, etc.) for ML; novelty/interestingness criteria for KDD-style discovery.

Treating compression and prediction as unrelated tasks, ignoring compression–prediction equivalence.

conceptual · high severity

▼

Why it happens:

Students reason: “Compression is about reducing file size, while prediction is about guessing the next label/value. They are different, so there is no direct relationship.” This misconception prevents them from using probabilistic prediction distributions to reason about coding efficiency, and vice versa.

✓ Correct understanding:

Students should reason: “If a system predicts posterior probabilities of a sequence given its history, then it can be used for optimal data compression.” Conversely, “an optimal compressor can be used for prediction.” The mechanism is that arithmetic coding can compress using the predicted output distribution; the best compressor corresponds to the best probability model for the next symbol.

How to avoid:

When you see probabilistic prediction, explicitly connect it to coding: “probability model + arithmetic coding = compression.” When you see compression, ask what probability distribution the compressor implicitly represents, and how that distribution would enable prediction.

Believing ML is purely symbolic/cognitive and not grounded in statistical optimization and loss minimization.

conceptual · medium severity

▼

Why it happens:

Students reason: “Machine learning is about human-like reasoning or symbolic intelligence, so it should be mainly rule-based.” They then interpret ML as if it were primarily about logical knowledge representation, ignoring the operational framing: ML uses statistical algorithms that learn from data and generalize, often by minimizing a loss function (ERM) under learning-theory assumptions.

✓ Correct understanding:

Students should reason: “ML is operationally defined as statistical algorithms that learn from data and generalize to unseen data without explicit programming.” The foundations are statistics and mathematical optimization, commonly expressed as minimizing empirical risk (loss on training data). Neural networks are also statistical models trained via optimization (e.g., backpropagation for gradient-based learning), not primarily symbolic rule execution.

How to avoid:

Use the ERM/loss-function lens as a default mental model: identify the training objective (loss), the optimization process, and the generalization goal. If discussing AI history, treat symbolic AI vs statistical ML as a contrast, not as a replacement of the statistical/optimization core of ML.

Misunderstanding learning-theory guarantees as absolute certainty rather than probabilistic bounds.

conceptual · high severity

▼

Why it happens:

Students reason: “PAC learning or generalization theory gives a guarantee, so the model will definitely achieve high accuracy on future data.” They treat probabilistic statements as deterministic promises, ignoring that finite training data and uncertainty lead to bounds that hold with some probability.

✓ Correct understanding:

Students should reason: “With finite training sets and uncertain future data, learning theory provides probabilistic (not absolute) performance guarantees.” In PAC-style reasoning, the guarantee is of the form: with high probability over the random draw of the training set, the learned hypothesis has low error on the underlying distribution. The correct mental model is uncertainty management via probabilistic bounds.

How to avoid:

When you see “probably” or “with probability,” translate it into the quantifiers: probability is over the randomness in training data sampling (and sometimes algorithm randomness), not over the fixed unknown future labels in a deterministic way. Practice rewriting bounds in plain language with explicit “with high probability” phrasing.

General Tips

When diagnosing a misconception, always ask: “What is the goal being optimized or evaluated?” (training fit vs unseen performance; prediction vs discovery; compression vs probability modeling).
Use a two-layer mental model: (1) objective (loss/ERM or task-specific metric) and (2) generalization mechanism (unknown distribution, finite samples, probabilistic guarantees).
For any claimed relationship (like compression–prediction equivalence), require a mechanism-level explanation (e.g., arithmetic coding uses predicted probabilities).
Prefer counterexamples: test the misconception against a held-out test scenario, a downstream supervised metric, or a probabilistic coding scenario.

Content

AI Audio

Test Yourself

Practice

Games

Summary

Topics Covered

What Machine Learning Is: Definition, Goals, and Historical Roots

Mathematical Foundations: Statistics, Optimization, and Loss

Empirical Risk Minimization (ERM) and Generalization: Why Training Is Not Enough

Learning Paradigms and Task Types: Supervised, Unsupervised, and Beyond

Neural Networks and Deep Learning: Representation Learning and Training Advances

Learning Theory Guarantees: PAC Learning and Probabilistic Reasoning

Relationships and Equivalences: AI, Data Mining, Human-in-the-Loop, and Compression

Key Insights

ERM Is a Generalization Strategy

Unsupervised Can Be Supervised’s Shortcut

Generalization Needs a Probability Model

Compression and Prediction Are the Same

Deep Learning’s Advantage Is Representation

Conclusions

Bringing It All Together

Key Takeaways

Real-World Applications

💻 Code Examples

Supervised learning: classification with empirical risk minimisation (logistic regression)

Code

Explanation

Use Case

Output

💻 Code Practice Problems

Clustering + dimensionality reduction: k-means as unsupervised preprocessing

Code

Explanation

Use Case

Output

💻 Code Practice Problems

Structured prediction: sequence labeling with a simple CRF-like Viterbi decoder

Code

Explanation

Use Case

Output

💻 Code Practice Problems

Anomaly detection + model diagnostics: Isolation Forest with thresholding

Code

Explanation

Use Case

Output

💻 Code Practice Problems

Interactive Lesson

Interactive Lesson: Machine Learning Foundations, Paradigms, and Generalization

Learning Objectives

1. Machine Learning Overview and Definition

2. Mathematical Foundations (Statistics and Optimization)

3. Loss Functions and Empirical Risk Minimization (ERM)

4. Generalization and Learning Theory (PAC)

5. Task Paradigms (Supervised, Unsupervised, Reinforcement)

6. Neural Networks and Deep Learning (Dependency-bridging)

7. Relationships to Other Fields (AI, Data Mining, Data Compression)

Practice Activities

ERM-to-Generalization Causal Chain Builder

Paradigm Feedback Mapping

Generalization vs Training Accuracy Diagnostic

Compression–Prediction Equivalence Reasoning

Next Steps

Cheat Sheet

Cheat Sheet: Machine Learning (ML) Foundations, Paradigms, and Key Relationships

Key Terms

Formulas

Main Concepts

Memory Tricks

Quick Facts

Common Mistakes

Common Mistakes: Machine Learning Foundations, Paradigms, and Relationships

General Tips