Machine learning is the field of study that gives computers the ability to learn from data without being explicitly programmed. It forms the foundation upon which modern AI systems are built.

Learning Paradigms

Supervised Learning

Learn from labelled examples. Given inputs X and outputs Y, learn a mapping f such that f(X) ≈ Y.

Tasks:

  • Classification — Predict discrete categories (spam/not spam)
  • Regression — Predict continuous values (house prices)

Algorithms:

  • Linear/logistic regression
  • Decision trees, random forests
  • Support vector machines (SVM)
  • Neural networks
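
A minimal sketch of the supervised workflow, assuming scikit-learn and a synthetic dataset: fit a logistic regression classifier on labelled pairs, then check accuracy on held-out examples.

  # Supervised classification sketch (scikit-learn assumed installed).
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  model = LogisticRegression(max_iter=1000)
  model.fit(X_train, y_train)                          # learn the mapping from labelled pairs
  print(accuracy_score(y_test, model.predict(X_test))) # evaluate on unseen examples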

Unsupervised Learning

Find patterns in unlabelled data.

Tasks:

  • Clustering — Group similar data points (customer segments)
  • Dimensionality reduction — Compress data preserving structure
  • Anomaly detection — Find outliers

Algorithms:

  • K-means clustering
  • Principal Component Analysis (PCA)
  • Autoencoders
  • DBSCAN
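
A minimal sketch of two of these tasks with scikit-learn (assumed available): k-means groups unlabelled points into clusters, and PCA compresses them to two dimensions.

  # Unsupervised learning sketch: clustering and dimensionality reduction.
  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.decomposition import PCA

  X = np.random.rand(500, 10)                               # unlabelled data: 500 points, 10 features

  labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # group similar points into 3 clusters
  X_2d = PCA(n_components=2).fit_transform(X)               # compress to 2 dimensions, preserving variance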

Self-supervised Learning

Create labels from the data itself. The dominant pre-training paradigm for modern LLMs.

Examples:

  • Next token prediction (GPT)
  • Masked language modelling (BERT)
  • Contrastive learning (CLIP)
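
A minimal sketch of how next-token prediction creates labels from the data itself; the token IDs are hypothetical.

  # Self-supervised targets: each position's "label" is simply the next token.
  tokens = [12, 7, 91, 3, 44, 8]           # a tokenised text sequence (hypothetical IDs)

  inputs  = tokens[:-1]                    # [12, 7, 91, 3, 44]
  targets = tokens[1:]                     # [ 7, 91,  3, 44, 8]
  pairs = list(zip(inputs, targets))       # training pairs created without human annotation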

Reinforcement Learning (RL)

Learn by interacting with an environment, receiving rewards.

Components:

  • Agent, environment, state, action, reward
  • Policy: state → action mapping
  • Value function: expected future reward
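
A minimal sketch of these components as tabular Q-learning; the state and action space sizes are hypothetical. The Q table plays the role of the value function, and the epsilon-greedy rule is the policy.

  # Tabular Q-learning sketch (the surrounding environment loop is omitted).
  import numpy as np

  n_states, n_actions = 16, 4
  Q = np.zeros((n_states, n_actions))      # value estimates: expected future reward per (state, action)
  alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

  def policy(state):
      # epsilon-greedy policy: state -> action mapping
      if np.random.rand() < epsilon:
          return np.random.randint(n_actions)
      return int(np.argmax(Q[state]))

  def update(state, action, reward, next_state):
      # move Q(s, a) towards the reward plus the discounted value of the best next action
      target = reward + gamma * np.max(Q[next_state])
      Q[state, action] += alpha * (target - Q[state, action])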

Applications:

  • Game playing (AlphaGo, Atari)
  • Robotics
  • RLHF for LLM alignment

Neural Networks

Fundamentals

Neuron: Weighted sum of inputs + bias, passed through activation function.

output = activation(Σ(weights × inputs) + bias)
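
The same formula as a minimal NumPy sketch, using ReLU as the activation:

  # A single neuron: weighted sum of inputs plus bias, passed through ReLU.
  import numpy as np

  def neuron(inputs, weights, bias):
      z = np.dot(weights, inputs) + bias   # weighted sum of inputs plus bias
      return max(0.0, z)                   # activation (ReLU)

  print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.1, 0.4, -0.2]), 0.05))
  # -> 0.0 (the weighted sum is negative, so ReLU clips it)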

Layers: Neurons organised into layers.

  • Input layer — Receives data
  • Hidden layers — Learn representations
  • Output layer — Produces predictions

Activation functions:

  • ReLU: max(0, x) — Most common
  • Sigmoid: 1/(1+e^-x) — Outputs values in (0, 1)
  • Tanh: (e^x - e^-x)/(e^x + e^-x) — Outputs values in (-1, 1)
  • Softmax — Probability distribution over classes
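
The same four functions as a minimal NumPy sketch:

  # Common activation functions written directly in NumPy.
  import numpy as np

  def relu(x):    return np.maximum(0, x)
  def sigmoid(x): return 1 / (1 + np.exp(-x))
  def tanh(x):    return np.tanh(x)
  def softmax(x):
      e = np.exp(x - np.max(x))            # subtract the max for numerical stability
      return e / e.sum()                   # normalise into a probability distribution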

Training

Forward pass: Compute predictions from inputs.

Loss function: Measure how wrong predictions are.

  • Cross-entropy (classification)
  • Mean squared error (regression)

Backpropagation: Compute gradients of loss with respect to weights.

Gradient descent: Update weights to reduce loss.

weights = weights - learning_rate × gradient
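
A minimal sketch tying the four steps together: linear regression with one weight and one bias, trained by gradient descent on a synthetic dataset (NumPy assumed).

  # Forward pass, loss, gradients, update: gradient descent on a toy problem.
  import numpy as np

  X = np.random.rand(100, 1)
  y = 3.0 * X[:, 0] + 2.0                        # true relationship the model should recover
  w, b, lr = 0.0, 0.0, 0.1

  for step in range(500):
      pred = w * X[:, 0] + b                     # forward pass
      loss = np.mean((pred - y) ** 2)            # mean squared error loss
      grad_w = np.mean(2 * (pred - y) * X[:, 0]) # gradient of loss w.r.t. the weight
      grad_b = np.mean(2 * (pred - y))           # gradient of loss w.r.t. the bias
      w -= lr * grad_w                           # gradient descent update
      b -= lr * grad_b

  print(w, b)                                    # approaches (3.0, 2.0)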

Optimisers:

  • SGD (Stochastic Gradient Descent)
  • Adam — Adaptive learning rates, momentum
  • AdamW — Adam with weight decay
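
A minimal sketch of one training step in PyTorch (assumed available), where the optimiser applies the update rule above with adaptive learning rates and weight decay:

  # One training step with the AdamW optimiser.
  import torch
  import torch.nn as nn

  model = nn.Linear(20, 1)
  optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
  loss_fn = nn.MSELoss()

  x, y = torch.randn(32, 20), torch.randn(32, 1)
  loss = loss_fn(model(x), y)    # forward pass + loss
  loss.backward()                # backpropagation: compute gradients
  optimiser.step()               # update the weights
  optimiser.zero_grad()          # clear gradients for the next step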

Regularisation

Prevent overfitting (memorising training data).

  • Dropout — Randomly zero neurons during training
  • Weight decay — Penalise large weights
  • Early stopping — Stop when validation loss increases
  • Data augmentation — Artificially expand training set
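
A minimal PyTorch sketch of the first two techniques, dropout inside the model and weight decay via the optimiser; the layer sizes are illustrative.

  # Regularisation sketch: dropout + weight decay.
  import torch
  import torch.nn as nn

  model = nn.Sequential(
      nn.Linear(784, 256),
      nn.ReLU(),
      nn.Dropout(p=0.5),               # randomly zero half the activations during training
      nn.Linear(256, 10),
  )
  optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # penalise large weights

  model.train()   # dropout active during training
  model.eval()    # dropout disabled at inference time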

Architectures

Feedforward (MLP): Fully connected layers.

Convolutional (CNN): Spatial pattern detection for images.

  • Convolution layers detect local patterns
  • Pooling layers downsample
  • Approximate translation invariance via weight sharing and pooling
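
A minimal PyTorch sketch of this pattern for 28×28 greyscale images; the channel counts and sizes are illustrative.

  # Tiny CNN: convolution layers detect local patterns, pooling layers downsample.
  import torch.nn as nn

  cnn = nn.Sequential(
      nn.Conv2d(1, 16, kernel_size=3, padding=1),  # detect local patterns in 3x3 neighbourhoods
      nn.ReLU(),
      nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
      nn.Conv2d(16, 32, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.MaxPool2d(2),                             # 14x14 -> 7x7
      nn.Flatten(),
      nn.Linear(32 * 7 * 7, 10),                   # classify into 10 categories
  )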

Recurrent (RNN/LSTM/GRU): Sequential data.

  • Process sequences step by step
  • Hidden state carries information forward
  • Struggle with long-range dependencies
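
A minimal sketch of an LSTM stepping through a sequence (PyTorch assumed; sizes are illustrative).

  # The hidden state is carried forward across all 50 time steps.
  import torch
  import torch.nn as nn

  lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
  x = torch.randn(8, 50, 10)               # batch of 8 sequences, 50 steps, 10 features each
  outputs, (h, c) = lstm(x)                # outputs: one hidden vector per step
  print(outputs.shape)                     # torch.Size([8, 50, 32])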

Transformer: Attention-based, parallel processing.

  • Self-attention relates all positions
  • Dominant architecture for NLP
  • See LLMs for details

Key Concepts

Generalisation

The goal: learn patterns that apply to unseen data.

Train/validation/test split:

  • Train: Learn parameters
  • Validation: Tune hyperparameters
  • Test: Final evaluation
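
A minimal sketch of a 70/15/15 split with scikit-learn (assumed available), using a placeholder dataset:

  # Carve the data into train / validation / test in two stages.
  import numpy as np
  from sklearn.model_selection import train_test_split

  X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)   # placeholder dataset

  X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
  # 70% train (learn parameters), 15% validation (tune hyperparameters), 15% test (final evaluation)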

Overfitting: Model memorises training data, fails on new data.
Underfitting: Model too simple, fails everywhere.

Bias-Variance Trade-off

  • Bias: Error from overly simple assumptions
  • Variance: Error from sensitivity to training data

High bias = underfitting. High variance = overfitting.

Feature Engineering

Traditional ML relies heavily on crafting input features.

  • Domain knowledge encoded as features
  • Deep learning learns features automatically

Scaling Laws

Modern insight: performance improves predictably (roughly as a power law) with:

  • More data
  • Larger models
  • More compute

This drives the push for ever-larger models.

The Modern AI Stack

Before Deep Learning (Traditional ML)

  1. Collect data
  2. Engineer features
  3. Train simple model (SVM, random forest)
  4. Evaluate, iterate

Deep Learning Era

  1. Collect massive data
  2. Train large neural network
  3. Features learned automatically
  4. Scale up for better results

Foundation Model Era

  1. Pre-train on internet-scale data (self-supervised)
  2. Fine-tune or prompt for specific tasks
  3. LLMs and Multimodal AI serve as general-purpose tools

Tools & Frameworks

ML Libraries

Deep Learning

  • PyTorch — Research standard, flexible
  • TensorFlow — Production, Keras API
  • JAX — Functional, composable

Experiment Tracking

Deployment

See Model Serving

Learning Resources

Courses

Books

  • Deep Learning (Goodfellow et al.) — Comprehensive textbook
  • Hands-On Machine Learning (Géron) — Practical with scikit-learn
  • The Hundred-Page Machine Learning Book (Burkov) — Concise overview

Interactive

Glossary

  • Epoch — One pass through the entire training data
  • Batch — Subset of data used for one update step
  • Learning rate — Step size for weight updates
  • Hyperparameter — Configuration set before training
  • Inference — Using a trained model on new data
  • Gradient — Direction of steepest increase in loss
  • Latent space — Learned internal representation