Cost Optimisation

Cost optimisation (often called FinOps or Cloud Financial Management) is the practice of managing cloud costs while maintaining the performance and reliability your business needs. It’s not about spending less — it’s about spending wisely.

FinOps Principles

FinOps is a cultural practice that brings together technology, finance, and business teams to manage cloud costs collaboratively.

Core Principles

Teams need to collaborate — Engineering, finance, and business must work together
Everyone takes ownership — Decentralise cost decisions to the teams who make architectural choices
A centralised team drives FinOps — A dedicated team provides best practices, tooling, and governance
Reports should be accessible and timely — Real-time visibility enables real-time decisions
Decisions are driven by business value — Optimise for value, not just cost
Take advantage of the variable cost model — Cloud’s pay-as-you-go model is a feature, not a bug

The FinOps Lifecycle

Inform → Optimise → Operate → (repeat)

Inform — Understand where money is going (visibility, allocation, benchmarking)
Optimise — Identify and implement savings opportunities
Operate — Continuously monitor and govern cloud spend

Cost Visibility

You can’t optimise what you can’t see. The first step is understanding your current spend.

Tagging Strategy

Tags are the foundation of cost allocation. Without consistent tagging, you cannot attribute costs to teams, products, or environments.

Essential tags:

Tag	Purpose	Example Values
`environment`	Distinguish prod/non-prod	`production`, `staging`, `development`
`team`	Cost allocation to teams	`platform`, `payments`, `frontend`
`product`	Cost allocation to products	`checkout`, `search`, `analytics`
`cost-centre`	Finance allocation	`CC-1234`, `engineering`
`owner`	Contact for the resource	`[email protected]`
`managed-by`	How it’s provisioned	`terraform`, `manual`, `cdk`

Tagging best practices:

Enforce tagging via IaC and policies
Use consistent naming conventions (kebab-case, lowercase)
Implement tag compliance checks in CI/CD
Create default tags at the account/project level
Review and clean up untagged resources regularly
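
The tag compliance check described above could be sketched as follows. This is a minimal illustration rather than any particular tool's API: the function name `check_tags` and the decision to treat every tag from the table as required are assumptions, and a real check would read tags from a Terraform plan or a cloud inventory export.

```python
import re

# Required tag keys, taken from the "Essential tags" table above.
REQUIRED_TAGS = ("environment", "team", "product", "cost-centre", "owner", "managed-by")

# Lowercase kebab-case: runs of a-z/0-9 joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def check_tags(tags):
    """Return a list of human-readable violations (empty if compliant)."""
    problems = []
    for key in REQUIRED_TAGS:
        # A missing key and an empty value are both treated as missing.
        if not tags.get(key):
            problems.append(f"missing required tag: {key}")
    for key in tags:
        if not KEBAB_CASE.match(key):
            problems.append(f"tag key is not lowercase kebab-case: {key}")
    return problems


if __name__ == "__main__":
    # Hypothetical resource with an incomplete, inconsistently named tag set.
    resource_tags = {"environment": "staging", "team": "payments", "Owner": "someone"}
    for problem in check_tags(resource_tags):
        print(problem)
```

In a CI/CD pipeline, a non-empty result would typically fail the build so that untagged resources never reach production.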

Cost Allocation

Shared costs — Distribute platform/infrastructure costs fairly (by usage, headcount, or fixed percentage)
Showback — Report costs to teams without formal chargebacks
Chargeback — Formally transfer costs to team budgets

Start with showback to build awareness; move to chargeback when teams have sufficient control over their costs.
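
As an illustration of distributing a shared cost by usage, the sketch below splits one platform bill in proportion to each team's consumption. The function name, the usage figures, and the choice of unit (for example CPU-hours) are assumptions made for the example.

```python
def allocate_shared_cost(shared_cost, usage_by_team):
    """Split shared_cost across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        team: shared_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


if __name__ == "__main__":
    # Hypothetical monthly platform bill and per-team usage.
    usage = {"platform": 50, "payments": 30, "frontend": 20}
    for team, cost in allocate_shared_cost(1000.0, usage).items():
        print(f"{team}: {cost:.2f}")
```

The same figures can feed a showback report first and a chargeback process later, since only the reporting step differs.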

Optimisation Strategies

Right-Sizing

Match resource capacity to actual utilisation. Oversized resources are the most common source of waste.

Process:

Collect utilisation metrics (CPU, memory, network, storage)
Identify resources consistently below 40% utilisation over a representative period (for example, two to four weeks, so that weekly peaks are captured)
Recommend smaller instance types
Test changes in non-production first
Implement and monitor

Tools: AWS Compute Optimizer, GCP Recommender, Azure Advisor
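
The identification step in the process above can be sketched in a few lines. This example assumes "consistently" means that at least 95% of samples fall below the threshold, and it looks at CPU only. Both are simplifications, and the resource names and sample values are made up.

```python
def is_underutilised(samples, threshold=40.0, min_fraction=0.95):
    """True if at least min_fraction of samples are below threshold percent."""
    if not samples:
        return False  # no data means no recommendation
    below = sum(1 for sample in samples if sample < threshold)
    return below / len(samples) >= min_fraction


def rightsizing_candidates(cpu_by_resource):
    """Return the sorted names of resources that look oversized."""
    return sorted(
        name
        for name, samples in cpu_by_resource.items()
        if is_underutilised(samples)
    )


if __name__ == "__main__":
    # Hypothetical CPU utilisation samples in percent.
    metrics = {
        "web-1": [10, 12, 15, 9],
        "db-1": [70, 85, 60, 90],
        "batch-1": [5, 5, 95, 5],  # mostly idle, but has a real peak
    }
    print(rightsizing_candidates(metrics))
```

Requiring a high fraction of low samples, rather than a low average, keeps bursty resources such as `batch-1` off the list.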

Reserved Instances / Savings Plans

Commit to a level of usage, usually for a one- or three-year term, in exchange for significant discounts (typically 30-70%).

Commitment Type	Discount	Flexibility	Best For
On-demand	0%	Maximum	Unpredictable workloads
Savings Plans (AWS)	30-50%	Good	Stable compute usage
Reserved Instances	40-70%	Limited	Predictable, steady-state
Spot/Preemptible	60-90%	No commitment, but capacity can be reclaimed at any time	Fault-tolerant, batch

Reservation strategy:

Cover your baseline with reservations (the minimum you always use)
Use savings plans for steady growth
Use on-demand for variable load
Use spot for fault-tolerant workloads
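
To make the baseline idea concrete, the sketch below compares an all-on-demand bill with one where the baseline is covered by a commitment. It assumes a simplified pricing model in which committed units are billed for every hour at a discounted rate whether used or not. The rates, discount, and usage series are illustrative, not real prices.

```python
def break_even_utilisation(discount):
    """Fraction of hours a committed unit must be in use to beat on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def baseline_units(hourly_usage):
    """The minimum you always use: the lowest usage in the period."""
    return min(hourly_usage)


def blended_cost(hourly_usage, committed, on_demand_rate, discount):
    """Cost with `committed` units reserved and any overflow on-demand."""
    committed_cost = committed * on_demand_rate * (1 - discount) * len(hourly_usage)
    overflow_units = sum(max(used - committed, 0) for used in hourly_usage)
    return committed_cost + overflow_units * on_demand_rate


if __name__ == "__main__":
    usage = [4, 6, 10, 4]  # hypothetical instance counts per hour
    base = baseline_units(usage)
    print("on-demand only:", blended_cost(usage, 0, 1.0, 0.5))
    print("baseline committed:", blended_cost(usage, base, 1.0, 0.5))
```

Under this model a 40% discount only pays off if the committed capacity is in use more than 60% of the time, which is why commitments belong on the baseline rather than on peak load.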

Spot/Preemptible Instances

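Spot and preemptible instances are spare cloud capacity sold at large discounts (up to roughly 90%), but the provider can reclaim them with little notice, for example a two-minute warning on AWS.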

Good candidates for spot:

Batch processing and data pipelines
CI/CD build agents
Stateless web servers behind load balancers
Development and test environments
Kubernetes node pools (with proper pod disruption budgets)

Not suitable:

Stateful workloads without replication
Long-running jobs that can’t checkpoint
Anything requiring guaranteed availability

Storage Optimisation

Storage costs accumulate silently. Regular review is essential.

Strategies:

Lifecycle policies — Automatically move data to cheaper tiers (e.g., S3 Standard → Glacier)
Delete unused snapshots — Old EBS/disk snapshots add up quickly
Compress and deduplicate — Reduce stored data volume
Right-size volumes — Provisioned storage is often oversized
Review backups — Do you need 90 days of daily backups?
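
A lifecycle policy like the S3 Standard to Glacier example above can be expressed as plain data. The sketch below builds one rule in the shape the S3 lifecycle API expects. The helper name, rule ID, prefix, and day counts are assumptions for illustration.

```python
def glacier_lifecycle_rule(rule_id, prefix, archive_after_days, expire_after_days):
    """Build one S3 lifecycle rule: archive to Glacier, then expire."""
    if not 0 < archive_after_days < expire_after_days:
        raise ValueError("need 0 < archive_after_days < expire_after_days")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # only objects under this key prefix
        "Status": "Enabled",
        "Transitions": [
            {"Days": archive_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


if __name__ == "__main__":
    # Hypothetical policy: archive logs after 30 days, delete after a year.
    rule = glacier_lifecycle_rule("archive-logs", "logs/", 30, 365)
    print({"Rules": [rule]})
```

With boto3, a dictionary like `{"Rules": [rule]}` would be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration`. That call replaces the bucket's existing lifecycle configuration, so it is worth testing on a non-production bucket first.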

Network Cost Optimisation

Data transfer costs are often overlooked and can be substantial.

Strategies:

Use private endpoints — Avoid NAT gateway and internet egress charges
Keep traffic in-region — Cross-region and cross-AZ traffic costs money
CDN for static content — Serve from edge locations, reduce origin traffic
Compress data — Less data transferred = lower costs
Review NAT gateway usage — These are expensive; consider alternatives

Database Optimisation

Reserved capacity — Commit to database instances like compute
Right-size instances — Databases are often over-provisioned
Storage auto-scaling — Only pay for what you use
Review provisioned IOPS — Often unnecessary
Consider serverless — Aurora Serverless, DynamoDB on-demand for variable workloads

Waste Elimination

Common sources of waste to audit regularly:

Waste Type	Description	Action
Idle resources	Running but unused (VMs, databases, load balancers)	Terminate or schedule
Orphaned resources	Detached volumes, unused IPs, old snapshots	Delete
Over-provisioned	Resources larger than needed	Right-size
Non-production running 24/7	Dev/test environments running overnight/weekends	Schedule shutdown
Unused reservations	Reservations that don’t match current usage	Sell or let expire

Environment Scheduling

Non-production environments often don’t need to run continuously.

Implementation:

Tag resources with a schedule (e.g., `schedule: office-hours`)
Use Lambda/Cloud Functions to start/stop resources on schedule
Provide self-service mechanisms for engineers to extend when needed
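Typical savings: 65-75% on non-production compute (a week has 168 hours, so an environment that runs 10 hours a day on weekdays is off about 70% of the time)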
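
The savings figure and the start/stop decision can both be sketched briefly. The example assumes an `office-hours` schedule means Monday to Friday, 08:00 to 18:00 in the resource's local time. That definition and the function names are assumptions, and the actual start/stop call to the cloud provider is left out.

```python
from datetime import datetime

HOURS_PER_WEEK = 7 * 24  # 168


def scheduled_savings(hours_per_day, days_per_week):
    """Fraction of compute hours saved compared with running 24/7."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


def should_be_running(now, start_hour=8, end_hour=18):
    """True on weekdays (Monday=0 to Friday=4) between start and end hour."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


if __name__ == "__main__":
    print(f"saving: {scheduled_savings(10, 5):.0%}")
    print("run now?", should_be_running(datetime.now()))
```

A scheduled function would call something like `should_be_running` for each tagged resource and start or stop it when the answer differs from its current state.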

Governance

Budgets and Alerts

Set budgets at account, team, and project levels
Alert at 50%, 80%, 100% of budget
Include forecast-based alerts (predicted spend)
Ensure alerts reach people who can act on them
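
The threshold and forecast alerts above amount to simple arithmetic. The sketch below assumes a naive linear forecast (spend so far, extrapolated to the end of the month). Provider forecasts are more sophisticated, and the names and figures here are illustrative.

```python
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50%, 80%, 100% of budget


def crossed_thresholds(spend, budget):
    """Return the alert thresholds that current spend has reached."""
    return [t for t in ALERT_THRESHOLDS if spend >= t * budget]


def forecast_month_end(spend_to_date, day_of_month, days_in_month):
    """Naive forecast: assume the average daily spend so far continues."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be at least 1")
    return spend_to_date / day_of_month * days_in_month


if __name__ == "__main__":
    budget = 1000.0
    spend = 600.0  # hypothetical spend on day 15 of a 30-day month
    print("thresholds crossed:", crossed_thresholds(spend, budget))
    forecast = forecast_month_end(spend, 15, 30)
    if forecast > budget:
        print(f"forecast {forecast:.0f} exceeds budget {budget:.0f}")
```

In this example actual spend has only crossed the 50% threshold, but the forecast already predicts an overrun, which is the case forecast-based alerts are meant to catch.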

Anomaly Detection

Cloud providers offer anomaly detection to catch unexpected spend spikes:

AWS Cost Anomaly Detection
Google Cloud Billing cost anomaly detection
Azure Cost Management anomaly alerts

Configure alerts to notify relevant teams immediately.
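
To show the idea behind anomaly detection, here is a deliberately simple stand-in that flags a day whose spend is far above the recent average. It is not how the provider services listed above work internally. The z-score approach, the threshold of 3, and the sample figures are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's spend if it is far above the recent daily average."""
    if len(history) < 2:
        return False  # not enough data to estimate normal variation
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return today > average
    return (today - average) / spread > z_threshold


if __name__ == "__main__":
    recent_daily_spend = [100, 102, 98, 101, 99]  # hypothetical values
    print(is_anomalous(recent_daily_spend, 101))
    print(is_anomalous(recent_daily_spend, 180))
```

A check like this only looks for spikes. Real services also account for seasonality and growth trends, so it is best treated as a fallback rather than a replacement.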

Cost Review Cadence

Frequency	Activity
Daily	Check for anomalies and spikes
Weekly	Review top spending services and trends
Monthly	Team-level cost reviews, reservation coverage
Quarterly	Strategic planning, commitment purchases

Unit Economics

Track cost efficiency, not just total cost.

Example unit metrics:

Cost per transaction
Cost per active user
Cost per GB processed
Cost per API request

As your business scales, total cost will usually increase, but cost per unit should stay flat or decrease (economies of scale). A rising cost per unit is a signal to investigate.
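
A small worked example makes the distinction between total cost and unit cost concrete. The monthly figures below are invented for illustration, and `cost_per_unit` is a hypothetical helper.

```python
def cost_per_unit(total_cost, units):
    """Unit cost, e.g. cost per transaction or per active user."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


if __name__ == "__main__":
    # Hypothetical months: (label, total cloud cost, transactions processed).
    months = [
        ("January", 10_000.0, 1_000_000),
        ("June", 15_000.0, 2_000_000),
    ]
    for label, total, transactions in months:
        print(f"{label}: total {total:.0f}, per transaction {cost_per_unit(total, transactions):.4f}")
```

Here total cost rises by 50% while cost per transaction falls by 25%, which is the healthy pattern. The reverse trend would deserve a closer look even if the total bill were flat.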

Organisational Considerations

FinOps Team

A central FinOps function provides:

Tooling and dashboards
Best practices and training
Reserved instance/savings plan management
Governance and policy enforcement
Executive reporting

Engineering Culture

Include cost as a non-functional requirement
Add cost impact to architecture decision records
Make cost data visible to engineers
Celebrate cost savings alongside feature delivery
Consider cost in code reviews for infrastructure changes

Cloud Provider Tools

AWS

Cost Explorer — Visualise and analyse costs
Cost and Usage Reports (CUR) — Detailed billing data
Compute Optimizer — Right-sizing recommendations
Trusted Advisor — Optimisation recommendations
Savings Plans — Flexible commitment discounts

Google Cloud

Billing Reports — Cost visualisation
Recommender — Right-sizing and idle resource recommendations
Committed Use Discounts — Reservations
Active Assist — Optimisation recommendations

Azure

Cost Management — Visualisation and budgets
Advisor — Optimisation recommendations
Reservations — Commitment discounts
Azure Hybrid Benefit — Use existing licenses

Third-Party Tools

Infracost — Cost estimates for Terraform in CI/CD
OpenCost — Kubernetes cost monitoring (CNCF project)
Kubecost — Kubernetes cost management
CloudHealth — Multi-cloud cost management
Spot.io — Automated spot instance management
CAST AI — Kubernetes cost optimisation
