Probability theory provides the mathematical foundation for quantifying uncertainty. Statistics uses data to make inferences about the world. Together, they form the backbone of data science, machine learning, and scientific research.
Probability Foundations
Sample Spaces and Events
A sample space Ω is the set of all possible outcomes of an experiment.
- Events are subsets of the sample space
- Complement: Aᶜ = {ω ∈ Ω : ω ∉ A}
- Union: A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}
- Intersection: A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}
Probability Axioms (Kolmogorov)
- Non-negativity: P(A) ≥ 0 for any event A
- Normalisation: P(Ω) = 1
- Additivity: P(A ∪ B) = P(A) + P(B) for disjoint events (A ∩ B = ∅)
From these axioms, we derive:
- P(Aᶜ) = 1 - P(A)
- P(∅) = 0
- P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
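To make these rules concrete, the sketch below (plain Python, with two arbitrary dice events as the example) enumerates the 36-outcome sample space for a pair of fair dice and verifies the complement and inclusion-exclusion identities exactly:

```python
from fractions import Fraction

# Sample space: all ordered pairs from two fair six-sided dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """P(event) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] + w[1] == 7   # event: the two dice sum to 7
B = lambda w: w[0] == 6          # event: the first die shows 6

# Complement rule: P(Aᶜ) = 1 - P(A)
assert prob(lambda w: not A(w)) == 1 - prob(A)

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert prob(lambda w: A(w) or B(w)) == prob(A) + prob(B) - prob(lambda w: A(w) and B(w))

print(prob(A), prob(B))  # 1/6 1/6
```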
Conditional Probability
The probability of A given that B has occurred:
P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0
Independence: Two events are independent if and only if:
P(A ∩ B) = P(A) · P(B), or equivalently P(A|B) = P(A)
Bayes’ Theorem
P(A|B) = P(B|A) · P(A) / P(B)
This allows us to update our beliefs about A after observing evidence B.
Law of Total Probability
For a partition {B₁, B₂, …, Bₙ} of Ω:
P(A) = Σᵢ P(A|Bᵢ) · P(Bᵢ)
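As a worked illustration of both results, the snippet below computes the probability of disease given a positive diagnostic test; the prevalence and accuracy figures are invented for the example:

```python
# Invented numbers: 1% prevalence, 95% sensitivity, 10% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95   # P(+ | D)
p_pos_given_healthy = 0.10   # P(+ | Dᶜ)

# Law of total probability: P(+) = P(+|D)P(D) + P(+|Dᶜ)P(Dᶜ)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ≈ 0.088
```

Despite the accurate test, the posterior is under 9% because the disease is rare: the prior dominates.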
Random Variables
A random variable is a function X: Ω → ℝ that assigns a numerical value to each outcome.
Discrete Random Variables
Takes countably many values.
- Probability mass function (PMF): p(x) = P(X = x)
- Cumulative distribution function (CDF): F(x) = P(X ≤ x) = Σ p(k) for k ≤ x
Properties of PMF:
- p(x) ≥ 0 for all x
- Σ p(x) = 1
Continuous Random Variables
Takes uncountably many values in an interval.
- Probability density function (PDF): f(x) where P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx
- CDF: F(x) = P(X ≤ x) = ∫₋∞ˣ f(t)dt
Properties of PDF:
- f(x) ≥ 0 for all x
- ∫₋∞^∞ f(x)dx = 1
- P(X = x) = 0 for any specific value x
Expected Value
The expected value (mean) is the weighted average of all possible values.
Discrete: E[X] = Σ x · P(X = x)
Continuous: E[X] = ∫₋∞^∞ x · f(x)dx
Properties:
- Linearity: E[aX + bY] = aE[X] + bE[Y]
- E[c] = c for any constant
- E[g(X)] = Σ g(x)P(X = x) or ∫ g(x)f(x)dx
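A minimal sketch computing E[X] and E[g(X)] directly from a PMF, using a fair six-sided die:

```python
# PMF of a fair die: each face has probability 1/6.
pmf = {x: 1/6 for x in range(1, 7)}

ex  = sum(x * p for x, p in pmf.items())     # E[X] = 3.5
ex2 = sum(x**2 * p for x, p in pmf.items())  # E[X²] = 91/6 ≈ 15.17, via E[g(X)]

print(ex, ex2)
# Note E[X²] ≠ (E[X])²: the difference 15.17 - 12.25 ≈ 2.92 is exactly Var(X).
```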
Variance and Standard Deviation
Variance measures the spread of a distribution around its mean:
Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
Standard deviation: σ = √Var(X)
Properties:
- Var(aX + b) = a²Var(X)
- Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)
- For independent X, Y: Var(X + Y) = Var(X) + Var(Y)
Covariance and Correlation
Covariance measures how two variables vary together:
Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
Correlation is the normalised covariance:
ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y), which always lies in [-1, 1]
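The simulation below (NumPy, with an arbitrarily chosen pair of correlated variables) estimates covariance and correlation and checks the Var(X + Y) identity above numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # y positively correlated with x

cov = np.cov(x, y, ddof=1)[0, 1]
rho = np.corrcoef(x, y)[0, 1]
print(f"Cov ≈ {cov:.3f}, ρ ≈ {rho:.3f}")  # ≈ 0.5, ≈ 0.447

# Identity check: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
lhs = np.var(x + y, ddof=1)
rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * cov
print(np.isclose(lhs, rhs))  # True
```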
Common Distributions
Discrete Distributions
| Distribution | PMF | E[X] | Var(X) | Use Case |
|---|---|---|---|---|
| Bernoulli(p) | pˣ(1-p)¹⁻ˣ | p | p(1-p) | Single trial |
| Binomial(n,p) | C(n,k)pᵏ(1-p)ⁿ⁻ᵏ | np | np(1-p) | Number of successes in n trials |
| Poisson(λ) | λᵏe⁻λ/k! | λ | λ | Count of rare events |
| Geometric(p) | p(1-p)ᵏ⁻¹ | 1/p | (1-p)/p² | Trials until first success |
| Negative Binomial(r,p) | C(k-1,r-1)pʳ(1-p)ᵏ⁻ʳ | r/p | r(1-p)/p² | Trials until r successes |
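If SciPy is available, its scipy.stats distributions can be used to spot-check the table; the parameter values below are arbitrary examples:

```python
from scipy import stats

# Binomial(n=10, p=0.3): PMF, mean np, variance np(1-p)
X = stats.binom(n=10, p=0.3)
print(X.pmf(3))           # C(10,3)·0.3³·0.7⁷ ≈ 0.267
print(X.mean(), X.var())  # 3.0, 2.1

# Poisson(λ=4): mean and variance coincide at λ
Y = stats.poisson(mu=4)
print(Y.mean(), Y.var())  # 4.0, 4.0
```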
Continuous Distributions
| Distribution | PDF | E[X] | Var(X) | Use Case |
|---|---|---|---|---|
| Uniform(a,b) | 1/(b-a) | (a+b)/2 | (b-a)²/12 | Equal likelihood |
| Exponential(λ) | λe⁻λˣ | 1/λ | 1/λ² | Wait times, memoryless |
| Normal(μ,σ²) | (1/(σ√(2π)))e^(-(x-μ)²/(2σ²)) | μ | σ² | Natural phenomena |
| Gamma(α,β) | βᵅxᵅ⁻¹e⁻βˣ/Γ(α) | α/β | α/β² | Sum of exponentials |
| Beta(α,β) | xᵅ⁻¹(1-x)ᵝ⁻¹/B(α,β) | α/(α+β) | αβ/((α+β)²(α+β+1)) | Probabilities |
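A similar spot-check for the continuous table; note that SciPy parameterises the exponential by scale = 1/λ rather than by the rate λ:

```python
from scipy import stats

# Exponential with rate λ = 2, i.e. scale = 1/2.
X = stats.expon(scale=1 / 2)
print(X.mean(), X.var())  # 1/λ = 0.5, 1/λ² = 0.25

# P(1 ≤ X ≤ 3) from the CDF: F(3) - F(1)
print(X.cdf(3) - X.cdf(1))

# Memorylessness: P(X > s+t | X > s) = P(X > t)
s, t = 1.0, 0.5
print(X.sf(s + t) / X.sf(s), X.sf(t))  # equal
```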
The Normal Distribution
The normal (Gaussian) distribution is ubiquitous due to the Central Limit Theorem.
Standard normal: Z ~ N(0,1)
Standardisation: If X ~ N(μ,σ²), then Z = (X - μ)/σ ~ N(0,1)
68-95-99.7 Rule:
- 68% of values fall within 1σ of μ
- 95% of values fall within 2σ of μ
- 99.7% of values fall within 3σ of μ
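The rule can be checked directly from the standard normal CDF:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean for N(0,1).
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k}σ: {p:.4f}")
# within 1σ: 0.6827, within 2σ: 0.9545, within 3σ: 0.9973
```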
Limit Theorems
Law of Large Numbers (LLN)
As sample size n → ∞, the sample mean X̄ₙ = (1/n)Σxᵢ converges to the population mean μ:
X̄ₙ → μ (in probability for the weak law; almost surely for the strong law)
Central Limit Theorem (CLT)
For independent, identically distributed random variables with mean μ and variance σ²:
√n(X̄ₙ - μ)/σ → N(0, 1) in distribution as n → ∞
The sum/average of many independent random variables is approximately normal, regardless of the underlying distribution.
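A short simulation illustrating the CLT: sample means of a heavily skewed Exponential(1) distribution (an arbitrary choice) cluster around μ with spread σ/√n, just as the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 replications of the mean of n = 50 Exponential(1) draws (μ = σ = 1).
n, reps = 50, 10_000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(means.mean())       # ≈ 1.0 (μ)
print(means.std(ddof=1))  # ≈ 1/√50 ≈ 0.141 (σ/√n)

# CLT prediction: about 95% of sample means fall within 2σ/√n of μ.
print(np.mean(np.abs(means - 1) < 2 / np.sqrt(n)))  # ≈ 0.95
```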
Statistical Inference
Point Estimation
Estimating a single value for a population parameter.
- Sample mean: X̄ = (1/n)Σxᵢ estimates μ
- Sample variance: s² = (1/(n-1))Σ(xᵢ - X̄)² estimates σ²
- Maximum Likelihood Estimation (MLE): Find θ that maximises P(data|θ)
- Method of Moments: Equate sample moments to population moments
Desirable properties: Unbiasedness, consistency, efficiency
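A numerical sketch of these estimators on i.i.d. normal data: the MLE of μ is the sample mean, while the MLE of σ² divides by n and is biased, which is why the n-1 sample variance above is preferred:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # true μ = 5, σ² = 4

mu_hat     = data.mean()       # MLE of μ (also the method-of-moments estimate)
var_mle    = data.var(ddof=0)  # MLE of σ²: divides by n, biased downward
var_unbias = data.var(ddof=1)  # sample variance s²: divides by n-1, unbiased

print(mu_hat, var_mle, var_unbias)  # ≈ 5.0, ≈ 4.0, ≈ 4.0
```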
Interval Estimation
A confidence interval provides a range of plausible values.
95% CI for μ (known σ): X̄ ± 1.96(σ/√n)
95% CI for μ (unknown σ): X̄ ± t₀.₀₂₅,ₙ₋₁(s/√n)
Interpretation: If we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter.
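A sketch of the unknown-σ interval using SciPy's t quantiles, on simulated data with invented parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(loc=10.0, scale=3.0, size=25)  # invented example data

n, xbar, s = len(data), data.mean(), data.std(ddof=1)

# 95% CI for μ: X̄ ± t(0.025, n-1) · s/√n
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI: ({xbar - half_width:.2f}, {xbar + half_width:.2f})")
```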
Hypothesis Testing
- State null hypothesis H₀ and alternative H₁
- Choose significance level α (typically 0.05)
- Calculate test statistic from data
- Find p-value or critical region
- Reject H₀ if p-value < α
Errors:
- Type I error (α): Rejecting H₀ when it’s true (false positive)
- Type II error (β): Failing to reject H₀ when it’s false (false negative)
- Power = 1 - β: Probability of correctly rejecting false H₀
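Below, the steps above are run end to end as a one-sample t-test on simulated data; the null value μ₀ = 10 is an invented example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=10.5, scale=2.0, size=30)  # true mean differs from H₀

# H₀: μ = 10 vs H₁: μ ≠ 10 at significance level α = 0.05
t_stat, p_value = stats.ttest_1samp(data, popmean=10.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H₀" if p_value < 0.05 else "Fail to reject H₀")
```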
Common Statistical Tests
| Test | Purpose | Assumptions |
|---|---|---|
| One-sample t-test | Compare mean to value | Normality |
| Two-sample t-test | Compare two means | Normality, equal variance |
| Paired t-test | Compare paired observations | Normality of differences |
| χ² test | Test categorical associations | Expected counts ≥ 5 |
| F-test | Compare variances | Normality |
| ANOVA | Compare multiple means | Normality, equal variance |
Bayesian Statistics
The Bayesian Approach
- Prior P(θ): Initial belief about parameter
- Likelihood P(data|θ): Probability of data given parameter
- Posterior P(θ|data): Updated belief after observing data
- Evidence P(data): Normalising constant
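A minimal sketch of Bayesian updating, assuming a Beta prior on a coin's heads probability θ; the Beta prior is conjugate to the binomial likelihood, so the posterior has closed form:

```python
from scipy import stats

# Prior: Beta(2, 2), a mild belief that the coin is roughly fair.
prior_a, prior_b = 2, 2

# Data: 7 heads in 10 flips.
heads, flips = 7, 10

# Conjugate update: posterior is Beta(α + heads, β + tails).
posterior = stats.beta(prior_a + heads, prior_b + (flips - heads))

print(posterior.mean())          # posterior mean = 9/14 ≈ 0.643
print(posterior.interval(0.95))  # central 95% credible interval
```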
Bayesian vs Frequentist
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency | Degree of belief |
| Parameters | Fixed unknown constants | Random variables |
| Prior information | Not formally used | Explicitly incorporated |
| Inference | Point estimates, CIs | Posterior distribution |
| Interpretation | Objective | Subjective/epistemic |
Regression
Simple Linear Regression
Models the relationship between two variables:
y = β₀ + β₁x + ε, where ε is a zero-mean random error
Least squares estimates:
- β̂₁ = Cov(x,y)/Var(x)
- β̂₀ = ȳ - β̂₁x̄
Coefficient of determination R² = 1 - SS_res/SS_tot measures the proportion of variance explained.
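The estimates and R² computed directly from these formulas on synthetic data (true coefficients chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)  # true β₀ = 2, β₁ = 1.5

# Least squares estimates from the formulas above
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

# R² = 1 - SS_res/SS_tot
resid = y - (beta0 + beta1 * x)
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
print(f"β̂₀ ≈ {beta0:.2f}, β̂₁ ≈ {beta1:.2f}, R² ≈ {r2:.3f}")
```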
Multiple Linear Regression
In matrix form: y = Xβ + ε
Least squares solution: β̂ = (XᵀX)⁻¹Xᵀy
Assumptions: Linearity, independence, homoscedasticity, normality of errors
Multicollinearity: When predictors are highly correlated, estimates become unstable.
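A sketch of the matrix solution on synthetic data; solving the normal equations as a linear system avoids forming the explicit inverse, and np.linalg.lstsq is the more robust choice when multicollinearity makes XᵀX nearly singular:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
X = np.column_stack([np.ones(n),            # intercept column
                     rng.normal(size=n),
                     rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: solve (XᵀX)β̂ = Xᵀy rather than inverting XᵀX.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # ≈ [1.0, 2.0, -0.5]

# lstsq handles rank-deficient (perfectly multicollinear) X gracefully.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```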
Learning Resources
Books
- A First Course in Probability by Sheldon Ross
- Statistical Inference by Casella and Berger
- All of Statistics by Larry Wasserman
- Probability Theory: The Logic of Science by E.T. Jaynes
Online Courses
- MIT 18.05 Introduction to Probability and Statistics
- Khan Academy Statistics and Probability
- StatQuest with Josh Starmer (YouTube)
Related Topics
- Discrete Mathematics — combinatorics, counting principles
- Calculus — integration, differentiation for continuous distributions
- Linear Algebra — matrix operations for regression
- Machine Learning — applications of statistical inference