Machine learning is the field of study that gives computers the ability to learn from data without being explicitly programmed. It forms the foundation upon which modern AI systems are built.
Learning Paradigms
Supervised Learning
Learn from labelled examples. Given inputs X and outputs Y, learn a mapping f: X → Y.
Tasks:
- Classification — Predict discrete categories (spam/not spam)
- Regression — Predict continuous values (house prices)
Algorithms:
- Linear/logistic regression
- Decision trees, random forests
- Support vector machines (SVM)
- Neural networks
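To make the supervised setup concrete, here is a minimal scikit-learn sketch; the synthetic dataset and default hyperparameters are illustrative assumptions, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic labelled examples: inputs X, discrete targets y
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a random forest to approximate the mapping f: X -> Y
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classification: predict discrete categories for unseen inputs
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```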
Unsupervised Learning
Find patterns in unlabelled data.
Tasks:
- Clustering — Group similar data points (customer segments)
- Dimensionality reduction — Compress data preserving structure
- Anomaly detection — Find outliers
Algorithms:
- K-means clustering
- Principal Component Analysis (PCA)
- Autoencoders
- DBSCAN
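A minimal sketch of two of these algorithms with scikit-learn, assuming random stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))               # unlabelled data

# Clustering: group similar points into 3 clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: compress to 2 components, preserving most variance
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10], X_2d.shape)               # cluster ids, (300, 2)
```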
Self-supervised Learning
Create training labels from the data itself. The dominant pre-training paradigm for modern LLMs.
Examples:
- Next token prediction (GPT)
- Masked language modelling (BERT)
- Contrastive learning (CLIP)
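A toy sketch of how next-token prediction manufactures labels from raw text with no human annotation (whitespace splitting stands in for a real subword tokeniser):

```python
# Build (input, target) pairs for next-token prediction from unlabelled text.
text = "the labels come from the data itself"
tokens = text.split()          # stand-in for a real tokeniser

# The target at each position is simply the next token in the sequence.
inputs, targets = tokens[:-1], tokens[1:]

for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```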
Reinforcement Learning (RL)
Learn by interacting with an environment, receiving rewards.
Components:
- Agent, environment, state, action, reward
- Policy: state → action mapping
- Value function: expected cumulative future reward
Applications:
- Game playing (AlphaGo, Atari)
- Robotics
- RLHF for LLM alignment
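A self-contained sketch of the agent-environment loop, using tabular Q-learning on a hypothetical 1-D corridor; the environment, reward scheme, and hyperparameters are all assumptions for illustration:

```python
import random

# Hypothetical corridor: states 0..5, reward +1 for reaching the goal state 5.
N_STATES, GOAL = 6, 5
ACTIONS = [-1, +1]                                 # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}   # value estimates
alpha, gamma, eps = 0.1, 0.9, 0.1                  # learning rate, discount, exploration

def policy(state):
    """Epsilon-greedy policy: state -> action."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):
    state = 0
    for _ in range(100):                           # cap episode length
        action = policy(state)
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0
        # Q-learning update: nudge the value estimate towards
        # reward + discounted best future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if state == GOAL:
            break

# Greedy policy learned from rewards alone: move right (+1) before the goal.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```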
Neural Networks
Fundamentals
Neuron: Weighted sum of inputs + bias, passed through activation function.
output = activation(Σ(weights × inputs) + bias)
Layers: Neurons organised into layers.
- Input layer — Receives data
- Hidden layers — Learn representations
- Output layer — Produces predictions
Activation functions:
- ReLU: max(0, x) — Most common
- Sigmoid: 1/(1+e^-x) — Squashes outputs into (0, 1)
- Tanh: (e^x - e^-x)/(e^x + e^-x) — Squashes outputs into (-1, 1)
- Softmax — Probability distribution over classes
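A NumPy sketch of the neuron formula above together with these activations; the weights, bias, and inputs are made-up numbers:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()               # probability distribution over classes

# A single neuron: weighted sum of inputs + bias, passed through an activation
inputs  = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4,  0.1, -0.7])
bias    = 0.2
z = weights @ inputs + bias          # Σ(weights × inputs) + bias
print(relu(z), sigmoid(z), np.tanh(z))

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1
```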
Training
Forward pass: Compute predictions from inputs.
Loss function: Measure how wrong predictions are.
- Cross-entropy (classification)
- Mean squared error (regression)
Backpropagation: Compute gradients of loss with respect to weights.
Gradient descent: Update weights to reduce loss.
weights = weights - learning_rate × gradient
Optimisers:
- SGD (Stochastic Gradient Descent)
- Adam — Adaptive learning rates, momentum
- AdamW — Adam with weight decay
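A minimal PyTorch training loop wiring these steps together; the tiny model, random data, and hyperparameters are illustrative assumptions:

```python
import torch
from torch import nn

# Random stand-in data: 256 examples, 20 features, 3 classes
X = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()          # cross-entropy for classification
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

for epoch in range(10):
    logits = model(X)                    # forward pass: compute predictions
    loss = loss_fn(logits, y)            # measure how wrong they are
    optimizer.zero_grad()
    loss.backward()                      # backpropagation: gradients w.r.t. weights
    optimizer.step()                     # update weights to reduce the loss
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```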
Regularisation
Prevent overfitting (memorising training data).
- Dropout — Randomly zero neurons during training
- Weight decay — Penalise large weights
- Early stopping — Stop when validation loss increases
- Data augmentation — Artificially expand training set
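A sketch of three of these techniques in one PyTorch loop; the model, synthetic data, and patience value are assumptions for illustration:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout: randomly zero neurons in training
    nn.Linear(64, 3),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=0.01)   # weight decay: penalise large weights
loss_fn = nn.CrossEntropyLoss()

X_train, y_train = torch.randn(256, 20), torch.randint(0, 3, (256,))
X_val,   y_val   = torch.randn(64, 20),  torch.randint(0, 3, (64,))

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()                           # dropout active
    loss = loss_fn(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()                            # dropout disabled for evaluation
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    # Early stopping: halt once validation loss stops improving
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```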
Architectures
Feedforward (MLP): Fully connected layers.
Convolutional (CNN): Spatial pattern detection for images.
- Convolution layers detect local patterns
- Pooling layers downsample
- Approximate translation invariance (via weight sharing and pooling)
Recurrent (RNN/LSTM/GRU): Sequential data.
- Process sequences step by step
- Hidden state carries information forward
- Struggle with long-range dependencies
Transformer: Attention-based, parallel processing.
- Self-attention relates all positions
- Dominant architecture for NLP
- See LLMs for details
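For comparison, minimal PyTorch definitions of the first two architectures; layer sizes assume 28×28 single-channel images and are otherwise arbitrary:

```python
import torch
from torch import nn

# Feedforward (MLP): stacked fully connected layers
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Convolutional (CNN): local pattern detection, then pooling to downsample
cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),            # halves spatial resolution
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),
)

x = torch.randn(1, 1, 28, 28)               # one fake 28x28 image
print(mlp(x).shape, cnn(x).shape)           # both: torch.Size([1, 10])
```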
Key Concepts
Generalisation
The goal: learn patterns that apply to unseen data.
Train/validation/test split:
- Train: Learn parameters
- Validation: Tune hyperparameters
- Test: Final evaluation
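A quick scikit-learn sketch of a three-way split; the 80/10/10 ratio and random data are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)

# Carve off the test set first, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=1 / 9, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 800 / 100 / 100
```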
Overfitting: Model memorises training data, fails on new data.
Underfitting: Model too simple, fails even on the training data.
Bias-Variance Trade-off
- Bias: Error from overly simple assumptions
- Variance: Error from sensitivity to training data
High bias = underfitting. High variance = overfitting.
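One way to see the trade-off, sketched with scikit-learn: fitting polynomials of increasing degree to noisy data. The degrees, noise level, and sample sizes are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)
X_test = rng.uniform(0, 1, size=(100, 1))
y_test = np.sin(2 * np.pi * X_test).ravel()

for degree in (1, 4, 15):      # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Degree 1 typically shows high bias (poor on both sets); degree 15 typically shows high variance (near-perfect on training, worse on test).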
Feature Engineering
Traditional ML relies heavily on crafting input features.
- Domain knowledge encoded as features
- Deep learning learns features automatically
Scaling Laws
Modern insight: performance improves predictably with:
- More data
- Larger models
- More compute
This drives the push for ever-larger models.
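One widely cited form of these laws, from the Chinchilla paper (Hoffmann et al., 2022), models loss as a function of parameter count N and training tokens D:

L(N, D) = E + A/N^α + B/D^β

where E, A, B, α, β are constants fitted empirically; loss falls smoothly and predictably as N and D grow, which is what makes scaling a reliable strategy.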
The Modern AI Stack
Before Deep Learning (Traditional ML)
- Collect data
- Engineer features
- Train simple model (SVM, random forest)
- Evaluate, iterate
Deep Learning Era
- Collect massive data
- Train large neural network
- Features learned automatically
- Scale up for better results
Foundation Model Era
- Pre-train on internet-scale data (self-supervised)
- Fine-tune or prompt for specific tasks
- LLMs, Multimodal AI as general-purpose tools
Tools & Frameworks
ML Libraries
- scikit-learn — Traditional ML algorithms
- XGBoost — Gradient boosting
Deep Learning
- PyTorch — Research standard, flexible
- TensorFlow — Production, Keras API
- JAX — Functional, composable
Experiment Tracking
Deployment
See Model Serving
Learning Resources
Courses
- fast.ai — Practical deep learning
- CS231n — CNNs for vision
- CS224n — NLP with deep learning
- Andrew Ng’s ML Course
Books
- Deep Learning (Goodfellow et al.) — Comprehensive textbook
- Hands-On Machine Learning (Géron) — Practical with scikit-learn
- The Hundred-Page Machine Learning Book (Burkov) — Concise overview
Interactive
- Distill.pub — Visual explanations
- Playground.tensorflow.org — Neural network visualisation
- 3Blue1Brown Neural Networks
Glossary
| Term | Definition |
|---|---|
| Epoch | One pass through entire training data |
| Batch | Subset of data for one update step |
| Learning rate | Step size for weight updates |
| Hyperparameter | Configuration set before training |
| Inference | Using trained model on new data |
| Gradient | Direction of steepest increase in loss |
| Latent space | Learned internal representation |