Machine learning is the field of study that gives computers the ability to learn from data without being explicitly programmed. It forms the foundation upon which modern AI systems are built.
Learning Paradigms
Supervised Learning
Learn from labelled examples. Given inputs X and outputs Y, learn a mapping f: X → Y.
Tasks:
- Classification — Predict discrete categories (spam/not spam)
- Regression — Predict continuous values (house prices)
Algorithms:
- Linear/logistic regression
- Decision trees, random forests
- Support vector machines (SVM)
- Neural networks
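To make the supervised setup concrete, here is a minimal scikit-learn sketch; the synthetic dataset and default hyperparameters are illustrative assumptions, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic labelled examples: inputs X, discrete targets y
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a random forest to approximate the mapping f: X -> Y
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classification: predict discrete categories for unseen inputs
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```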
Unsupervised Learning
Find patterns in unlabelled data.
Tasks:
- Clustering — Group similar data points (customer segments)
- Dimensionality reduction — Compress data preserving structure
- Anomaly detection — Find outliers
Algorithms:
- K-means clustering
- Principal Component Analysis (PCA)
- Autoencoders
- DBSCAN
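A minimal sketch of two of these algorithms with scikit-learn, assuming random stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))               # unlabelled data

# Clustering: group similar points into 3 clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: compress to 2 components, preserving most variance
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10], X_2d.shape)               # cluster ids, (300, 2)
```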
Self-supervised Learning
Create training labels from the data itself. The dominant pre-training paradigm for modern LLMs.
Examples:
- Next token prediction (GPT)
- Masked language modelling (BERT)
- Contrastive learning (CLIP)
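A toy sketch of how next-token prediction manufactures labels from raw text with no human annotation (whitespace splitting stands in for a real subword tokeniser):

```python
# Build (input, target) pairs for next-token prediction from unlabelled text.
text = "the labels come from the data itself"
tokens = text.split()          # stand-in for a real tokeniser

# The target at each position is simply the next token in the sequence.
inputs, targets = tokens[:-1], tokens[1:]

for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```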
Reinforcement Learning (RL)
Learn by interacting with an environment, receiving rewards.
Components:
- Agent, environment, state, action, reward
- Policy: state → action mapping
- Value function: expected cumulative future reward
Applications:
- Game playing (AlphaGo, Atari)
- Robotics
- RLHF for LLM alignment
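A self-contained sketch of the agent-environment loop, using tabular Q-learning on a hypothetical 1-D corridor; the environment, reward scheme, and hyperparameters are all assumptions for illustration:

```python
import random

# Hypothetical corridor: states 0..5, reward +1 for reaching the goal state 5.
N_STATES, GOAL = 6, 5
ACTIONS = [-1, +1]                                 # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}   # value estimates
alpha, gamma, eps = 0.1, 0.9, 0.1                  # learning rate, discount, exploration

def policy(state):
    """Epsilon-greedy policy: state -> action."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):
    state = 0
    for _ in range(100):                           # cap episode length
        action = policy(state)
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0
        # Q-learning update: nudge the value estimate towards
        # reward + discounted best future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if state == GOAL:
            break

# Greedy policy learned from rewards alone: move right (+1) before the goal.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```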
Neural Networks
Fundamentals
Neuron: Weighted sum of inputs + bias, passed through activation function.
output = activation(Σ(weights × inputs) + bias)
Layers: Neurons organised into layers.
- Input layer — Receives data
- Hidden layers — Learn representations
- Output layer — Produces predictions
Activation functions:
- ReLU: max(0, x) — Most common
- Sigmoid: 1/(1+e^-x) — Squashes outputs into (0, 1)
- Tanh: (e^x - e^-x)/(e^x + e^-x) — Squashes outputs into (-1, 1)
- Softmax — Probability distribution over classes
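A NumPy sketch of the neuron formula above together with these activations; the weights, bias, and inputs are made-up numbers:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()               # probability distribution over classes

# A single neuron: weighted sum of inputs + bias, passed through an activation
inputs  = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4,  0.1, -0.7])
bias    = 0.2
z = weights @ inputs + bias          # Σ(weights × inputs) + bias
print(relu(z), sigmoid(z), np.tanh(z))

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1
```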
Training
Forward pass: Compute predictions from inputs.
Loss function: Measure how wrong predictions are.
- Cross-entropy (classification)
- Mean squared error (regression)
Backpropagation: Compute gradients of loss with respect to weights.
Gradient descent: Update weights to reduce loss.
weights = weights - learning_rate × gradient
Optimisers:
- SGD (Stochastic Gradient Descent)
- Adam — Adaptive learning rates, momentum
- AdamW — Adam with weight decay
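A minimal PyTorch training loop wiring these steps together; the tiny model, random data, and hyperparameters are illustrative assumptions:

```python
import torch
from torch import nn

# Random stand-in data: 256 examples, 20 features, 3 classes
X = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()          # cross-entropy for classification
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

for epoch in range(10):
    logits = model(X)                    # forward pass: compute predictions
    loss = loss_fn(logits, y)            # measure how wrong they are
    optimizer.zero_grad()
    loss.backward()                      # backpropagation: gradients w.r.t. weights
    optimizer.step()                     # update weights to reduce the loss
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```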
Regularisation
Prevent overfitting (memorising training data).
- Dropout — Randomly zero neurons during training
- Weight decay — Penalise large weights
- Early stopping — Stop when validation loss increases
- Data augmentation — Artificially expand training set
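A sketch of three of these techniques in one PyTorch loop; the model, synthetic data, and patience value are assumptions for illustration:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout: randomly zero neurons in training
    nn.Linear(64, 3),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=0.01)   # weight decay: penalise large weights
loss_fn = nn.CrossEntropyLoss()

X_train, y_train = torch.randn(256, 20), torch.randint(0, 3, (256,))
X_val,   y_val   = torch.randn(64, 20),  torch.randint(0, 3, (64,))

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()                           # dropout active
    loss = loss_fn(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()                            # dropout disabled for evaluation
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    # Early stopping: halt once validation loss stops improving
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```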
Architectures
Feedforward (MLP): Fully connected layers.
Convolutional (CNN): Spatial pattern detection for images.
- Convolution layers detect local patterns
- Pooling layers downsample
- Approximate translation invariance (via weight sharing and pooling)
Recurrent (RNN/LSTM/GRU): Sequential data.
- Process sequences step by step
- Hidden state carries information forward
- Struggle with long-range dependencies
Transformer: Attention-based, parallel processing.
- Self-attention relates all positions
- Dominant architecture for NLP
- See LLMs for details
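For comparison, minimal PyTorch definitions of the first two architectures; layer sizes assume 28×28 single-channel images and are otherwise arbitrary:

```python
import torch
from torch import nn

# Feedforward (MLP): stacked fully connected layers
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Convolutional (CNN): local pattern detection, then pooling to downsample
cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),            # halves spatial resolution
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),
)

x = torch.randn(1, 1, 28, 28)               # one fake 28x28 image
print(mlp(x).shape, cnn(x).shape)           # both: torch.Size([1, 10])
```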
Key Concepts
Generalisation
The goal: learn patterns that apply to unseen data.
Train/validation/test split:
- Train: Learn parameters
- Validation: Tune hyperparameters
- Test: Final evaluation
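A quick scikit-learn sketch of a three-way split; the 80/10/10 ratio and random data are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)

# Carve off the test set first, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=1 / 9, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 800 / 100 / 100
```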
Overfitting: Model memorises training data, fails on new data.
Underfitting: Model too simple, fails even on the training data.
Bias-Variance Trade-off
- Bias: Error from overly simple assumptions
- Variance: Error from sensitivity to training data
High bias = underfitting. High variance = overfitting.
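One way to see the trade-off, sketched with scikit-learn: fitting polynomials of increasing degree to noisy data. The degrees, noise level, and sample sizes are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)
X_test = rng.uniform(0, 1, size=(100, 1))
y_test = np.sin(2 * np.pi * X_test).ravel()

for degree in (1, 4, 15):      # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Degree 1 typically shows high bias (poor on both sets); degree 15 typically shows high variance (near-perfect on training, worse on test).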
Feature Engineering
Traditional ML relies heavily on crafting input features.
- Domain knowledge encoded as features
- Deep learning learns features automatically
Scaling Laws
Modern insight: performance improves predictably with:
- More data
- Larger models
- More compute
This drives the push for ever-larger models.
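One widely cited form of these laws, from the Chinchilla paper (Hoffmann et al., 2022), models loss as a function of parameter count N and training tokens D:

L(N, D) = E + A/N^α + B/D^β

where E, A, B, α, β are constants fitted empirically; loss falls smoothly and predictably as N and D grow, which is what makes scaling a reliable strategy.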
The Modern AI Stack
Before Deep Learning (Traditional ML)
- Collect data
- Engineer features
- Train simple model (SVM, random forest)
- Evaluate, iterate
Deep Learning Era
- Collect massive data
- Train large neural network
- Features learned automatically
- Scale up for better results
Foundation Model Era
- Pre-train on internet-scale data (self-supervised)
- Fine-tune or prompt for specific tasks
- LLMs, Multimodal AI as general-purpose tools
Tools & Frameworks
ML Libraries
- scikit-learn — Traditional ML algorithms
- XGBoost — Gradient boosting
Deep Learning
- PyTorch — Research standard, flexible
- TensorFlow — Production, Keras API
- JAX — Functional, composable
Experiment Tracking
Deployment
See Model Serving
Learning Resources
Courses
- fast.ai — Practical deep learning
- CS231n — CNNs for vision
- CS224n — NLP with deep learning
- Andrew Ng’s ML Course
Books
- Deep Learning (Goodfellow et al.) — Comprehensive textbook
- Hands-On Machine Learning (Géron) — Practical with scikit-learn
- The Hundred-Page Machine Learning Book (Burkov) — Concise overview
Interactive
- Distill.pub — Visual explanations
- Playground.tensorflow.org — Neural network visualisation
- 3Blue1Brown Neural Networks
Glossary
| Term | Definition |
|---|---|
| Epoch | One pass through entire training data |
| Batch | Subset of data for one update step |
| Learning rate | Step size for weight updates |
| Hyperparameter | Configuration set before training |
| Inference | Using trained model on new data |
| Gradient | Direction of steepest increase in loss |
| Latent space | Learned internal representation |