A database is an organised collection of data stored and accessed electronically. Databases are fundamental to nearly all software applications, from simple websites to complex distributed systems.
Types of Databases
Relational (SQL)
Store data in tables with rows and columns, using structured query language (SQL) for queries. Data is organised with strict schemas and relationships between tables via foreign keys.
Examples: PostgreSQL, MySQL, MariaDB, SQLite, Microsoft SQL Server, Oracle
Document
Store data as semi-structured documents (typically JSON or BSON). Flexible schemas allow different fields per document within the same collection.
Examples: MongoDB, CouchDB, Amazon DocumentDB
Key-Value
Simple storage model mapping keys to values. Extremely fast for lookups but limited query capabilities. Often used for caching and session storage.
Examples: Redis, Amazon DynamoDB, Memcached, etcd
Wide-Column
Store data in tables with rows and dynamic columns. Each row can have different columns, suited for large-scale analytical workloads.
Examples: Apache Cassandra, Apache HBase, Google Bigtable, ScyllaDB
Graph
Optimised for storing and querying relationships between entities. Use nodes, edges, and properties to represent and traverse connected data.
Examples: Neo4j, Amazon Neptune, ArangoDB, JanusGraph
Time-Series
Optimised for time-stamped data with efficient storage and queries over time ranges. Used for metrics, monitoring, and IoT data.
Examples: InfluxDB, TimescaleDB, Prometheus, QuestDB
Vector
Store and search high-dimensional vectors for similarity queries. Essential for AI/ML applications, semantic search, and recommendation systems.
Examples: Pinecone, Weaviate, Milvus, Chroma, pgvector (PostgreSQL extension)
See: Vector Databases
Key Concepts
ACID Properties
Guarantees for reliable database transactions:
- Atomicity - Transactions are all-or-nothing
- Consistency - Data remains valid after transactions
- Isolation - Concurrent transactions don’t interfere
- Durability - Committed data survives failures
CAP Theorem
In distributed systems, you can only guarantee two of three properties:
- Consistency - All nodes see the same data simultaneously
- Availability - Every request receives a response
- Partition Tolerance - System operates despite network failures
Since network partitions are inevitable in distributed systems, the practical choice is between CP (consistency + partition tolerance) or AP (availability + partition tolerance) systems.
| Type | Examples |
|---|---|
| CP | MongoDB, Redis, HBase |
| AP | Cassandra, CouchDB, DynamoDB |
Indexing
Data structures that speed up read queries at the cost of write performance and storage. Common types:
- B-tree - General-purpose, good for range queries
- Hash - Fast exact-match lookups
- GIN/GiST - Full-text and complex data types
- Bitmap - Low-cardinality columns
Transactions
Logical units of work that group multiple operations. Support varies:
- Full ACID transactions (PostgreSQL, MySQL)
- Single-document transactions (MongoDB, DynamoDB)
- Eventual consistency (Cassandra)
Replication
Copying data across multiple nodes for fault tolerance and read scaling:
- Primary-replica - One write node, multiple read nodes
- Multi-primary - Multiple write nodes with conflict resolution
- Synchronous - Wait for replicas before confirming writes
- Asynchronous - Confirm writes immediately, replicate later
Sharding
Horizontal partitioning of data across multiple servers to scale beyond single-node limits. Strategies include:
- Hash-based - Distribute by hash of shard key
- Range-based - Distribute by value ranges
- Geographic - Distribute by location
When to Use What
| Use Case | Recommended Type | Examples |
|---|---|---|
| Complex queries, joins, transactions | Relational | PostgreSQL, MySQL |
| Flexible schema, rapid iteration | Document | MongoDB, CouchDB |
| Caching, sessions, real-time data | Key-Value | Redis, Memcached |
| High write throughput, time-series | Wide-Column / Time-Series | Cassandra, InfluxDB |
| Relationship-heavy data | Graph | Neo4j |
| AI/ML, semantic search | Vector | Pinecone, pgvector |
| Serverless, auto-scaling | Managed NoSQL | DynamoDB |
Performance Considerations
Query Optimisation
- Use
EXPLAIN/EXPLAIN ANALYZEto understand query plans - Index columns used in WHERE, JOIN, and ORDER BY clauses
- Avoid
SELECT *- fetch only needed columns - Use connection pooling (PgBouncer, ProxySQL)
Scaling Strategies
- Vertical - Increase CPU, RAM, storage (simpler but limited)
- Horizontal - Add more nodes via sharding/replication (complex but scalable)
- Read replicas - Offload read traffic from primary
- Caching layer - Redis/Memcached for frequently accessed data
Common Pitfalls
- N+1 query problem - Use eager loading or batch queries
- Missing indexes on foreign keys
- Over-indexing (slows writes)
- Not using prepared statements (SQL injection risk, no query plan caching)
Popular Databases Compared
| Database | Type | Best For | Licence |
|---|---|---|---|
| PostgreSQL | Relational | General purpose, complex queries, extensibility | PostgreSQL (permissive) |
| MySQL | Relational | Web applications, read-heavy workloads | GPL / Commercial |
| MongoDB | Document | Flexible schemas, rapid development | SSPL |
| Redis | Key-Value | Caching, sessions, real-time features | BSD-3 |
| DynamoDB | Key-Value/Document | Serverless, auto-scaling workloads | Proprietary (AWS) |
| Cassandra | Wide-Column | High availability, massive scale | Apache 2.0 |
| Neo4j | Graph | Social networks, recommendations, fraud detection | GPL / Commercial |
Related Pages
External Resources
- DB-Engines Ranking - Database popularity rankings
- Use The Index, Luke - SQL indexing and tuning
- PostgreSQL Documentation
- MongoDB Manual
- Redis Documentation