<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Finops on EXPLAIN ANALYZE</title><link>https://explainanalyze.com/tags/finops/</link><description>Recent content in Finops on EXPLAIN ANALYZE</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 13 Nov 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://explainanalyze.com/tags/finops/index.xml" rel="self" type="application/rss+xml"/><item><title>Where Your Cloud Bill Actually Leaks: An Audit Nobody Runs</title><link>https://explainanalyze.com/p/where-your-cloud-bill-actually-leaks-an-audit-nobody-runs/</link><pubDate>Thu, 13 Nov 2025 00:00:00 +0000</pubDate><guid>https://explainanalyze.com/p/where-your-cloud-bill-actually-leaks-an-audit-nobody-runs/</guid><description>&lt;div class="tldr-box"&gt;
 &lt;strong&gt;TL;DR&lt;/strong&gt;
 &lt;div&gt;Cloud bills creep up because nobody owns bringing them down. The largest leaks are S3 versioning without a lifecycle policy, backup retention set when the database was a fraction of its current size, cross-AZ traffic on chatty services, lower environments running 24/7 at production sizes, and old workloads on instance generations the cloud now surcharges. An annual one-day audit by one engineer typically recovers a five-figure monthly sum, and the savings stop being mysterious.&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The S3 bill on a team&amp;rsquo;s data-lake bucket went from $1,400 a month to $9,800 over six months without anyone deploying anything new. The bucket had versioning enabled in 2022, no lifecycle policy, and a daily ETL job overwriting the same 40,000 objects every morning. Six months of overwrites left each object with roughly 180 versions in cold storage, and the storage charge was for all of them. Two hours on a one-page lifecycle policy reclaimed about $8,000 a month. The versioning setting had sat unexamined for three and a half years before a workload arrived to exploit it; nobody had looked at the line item that broke the charge down.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Buy a FinOps tool&amp;rdquo; is the reflexive answer, and it&amp;rsquo;s half right. Cost tools surface the bill but don&amp;rsquo;t fix it. They tell you the storage line is up 40%; they don&amp;rsquo;t tell you which 40,000 objects are versioned 180 times, which dev environment has been running 24/7 since the previous CTO, or whether your chatty cache and its callers share an AZ. The savings live in walking the items.&lt;/p&gt;
&lt;h2 id="storage-and-lifecycle-the-largest-leaks"&gt;Storage and lifecycle: the largest leaks
&lt;/h2&gt;&lt;p&gt;Object storage with versioning enabled and no lifecycle policy is the most common large leak in any AWS account. S3 versioning charges for every version of every object indefinitely; a bucket with daily writes to the same keys can carry hundreds of versions per object after a year. The audit takes one query against the &lt;code&gt;s3:ListObjectVersions&lt;/code&gt; API or one tab in S3 Storage Lens. For buckets holding derived data (build artifacts, ETL outputs, logs with authoritative copies elsewhere), disable versioning entirely; that&amp;rsquo;s cheaper than running a lifecycle policy against it. For buckets that genuinely need versioning, a lifecycle rule expiring non-current versions after 30, 60, or 90 days reclaims most of the cost. Incomplete multipart uploads are the related sweep: failed uploads sit in the bucket, billed as storage, until a separate lifecycle rule clears them.&lt;/p&gt;
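&lt;p&gt;Both halves fit in a short boto3 sketch: count noncurrent versions to size the leak, then apply the two lifecycle rules above. The bucket name and the 30-day window are placeholders, not recommendations.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # placeholder

# Audit: how many noncurrent versions is the bucket carrying?
current, noncurrent = 0, 0
for page in s3.get_paginator("list_object_versions").paginate(Bucket=BUCKET):
    for v in page.get("Versions", []):
        if v["IsLatest"]:
            current += 1
        else:
            noncurrent += 1
print(f"{current} current objects, {noncurrent} noncurrent versions")

# Fix: expire old versions and abandoned multipart uploads.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            },
            {
                "ID": "abort-stale-multipart",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
        ]
    },
)
&lt;/code&gt;&lt;/pre&gt;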
&lt;div class="warning-box"&gt;
 &lt;strong&gt;Check before you disable versioning&lt;/strong&gt;
 &lt;div&gt;Versioning is sometimes the only mechanism preventing data loss from an application bug, a misconfigured deletion policy, or a compliance retention requirement. A bucket that looks like &amp;ldquo;derived data&amp;rdquo; today might be the audit log a regulator asks about next year. Check the application&amp;rsquo;s recovery model and any compliance scope before turning it off.&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Backup retention runs second. Most managed databases ship with a default of 7 days, and most teams bumped it to 30 or 90 days &amp;ldquo;for safety&amp;rdquo; back when the database was a fraction of its current size, then never revisited the setting. Snapshot storage above the database&amp;rsquo;s allocated size is billed separately at object-storage rates. A database that grew from 200 GB to 4 TB while retention stayed at 90 days is carrying up to 360 TB of nominal snapshot data on the line item (incremental snapshots bring the billed figure down, but retention is still the multiplier), much of it for backups nobody has ever restored from. Cross-region snapshot replication, on by default in some compliance configurations, doubles that number. The conversation worth having is which databases need 90 days of point-in-time recovery and which only need 7. The answer is almost never &amp;ldquo;all of them&amp;rdquo;.&lt;/p&gt;
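&lt;p&gt;The retention audit is one paginated call, sketched below with boto3; it prints each instance with its allocated size and retention window so the 7-versus-90-days conversation starts from numbers.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

rds = boto3.client("rds")
for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        print(
            db["DBInstanceIdentifier"],
            f'{db["AllocatedStorage"]} GiB allocated,',
            f'{db["BackupRetentionPeriod"]} days retention',
        )
&lt;/code&gt;&lt;/pre&gt;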
&lt;p&gt;Temporary and scratch storage is the third item. Buckets named &lt;code&gt;tmp-&lt;/code&gt;, &lt;code&gt;scratch-&lt;/code&gt;, &lt;code&gt;data-export-*&lt;/code&gt;, and &lt;code&gt;migration-2023-*&lt;/code&gt; get created for one-off jobs and never deleted. EFS file systems mounted for migration work that finished two years ago. Test datasets uploaded for vendor pitches nobody pursued. Logs shipped to a debugging bucket during last summer&amp;rsquo;s incident. The discipline is a tag-based lifecycle policy: every temporary resource carries an &lt;code&gt;expires=YYYY-MM-DD&lt;/code&gt; tag at creation, and a scheduled job deletes anything past its expiry. Same principle for ephemeral compute and infra: TTLs at creation, not retroactive sweeps.&lt;/p&gt;
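&lt;p&gt;A minimal sketch of that sweep for buckets, assuming boto3 and the &lt;code&gt;expires=YYYY-MM-DD&lt;/code&gt; convention above; it flags rather than deletes, which is the sane first pass.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3
from datetime import date
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
today = date.today().isoformat()  # lexicographic compare works for ISO dates

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        tags = s3.get_bucket_tagging(Bucket=name)["TagSet"]
    except ClientError:
        print(f"UNTAGGED: {name}")  # no tags at all is also worth a look
        continue
    expires = {t["Key"]: t["Value"] for t in tags}.get("expires")
    if expires and today &gt; expires:
        print(f"EXPIRED {expires}: {name}")
&lt;/code&gt;&lt;/pre&gt;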
&lt;p&gt;Database tiering is the fourth storage item, and the cost shows up twice: in steady-state storage charges, and again every time someone touches the cluster. On a 30 TiB RDS cluster left alone as &amp;ldquo;the archive&amp;rdquo;, a routine &lt;code&gt;ALTER TABLE&lt;/code&gt; to change one column&amp;rsquo;s datatype kicked off a full table rewrite that ran for a month and cost about $5,000 in IO before completing. The cluster had no active alerts, no recent change requests, and no one watching the bill — the charge accrued the full month before anyone noticed. Hot OLTP storage is the most expensive byte the cloud sells, and tables carrying years of archival rows the application reads less than monthly pay that premium plus a surprise tax on every schema migration. Partitioning by date and moving old partitions to S3, to a slower instance class, or to a column-store warehouse is a one-week project on a known shape. The migration looks unglamorous on a roadmap and gets perpetually deferred until the storage line, or an unexpected five-figure &lt;code&gt;ALTER&lt;/code&gt;, crosses a threshold finance flags.&lt;/p&gt;
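&lt;p&gt;For the tiering move itself, a sketch of the detach step, assuming Postgres, psycopg2, and a table already partitioned by date; the table and partition names are hypothetical. The point is that &lt;code&gt;DETACH PARTITION&lt;/code&gt; is metadata-only, nothing like the month-long rewrite above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import psycopg2  # hypothetical names throughout

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # Metadata-only operation: no table rewrite, no surprise IO bill.
    cur.execute("ALTER TABLE events DETACH PARTITION events_2023")
    # Dump the detached partition to CSV bound for S3, then drop it.
    with open("/tmp/events_2023.csv", "w") as f:
        cur.copy_expert("COPY events_2023 TO STDOUT WITH CSV HEADER", f)
    cur.execute("DROP TABLE events_2023")
&lt;/code&gt;&lt;/pre&gt;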
&lt;p&gt;The unifying discipline across all four items is preventive: every storage resource needs an explicit retention policy and ownership tags at creation time, enforced in provisioning code rather than by human attention.&lt;/p&gt;
&lt;div class="note-box"&gt;
 &lt;strong&gt;The audit recovers; the wrapped module prevents&lt;/strong&gt;
 &lt;div&gt;A Terraform module that creates an S3 bucket should require &lt;code&gt;lifecycle_rule&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, and &lt;code&gt;expires&lt;/code&gt; (or an explicit &lt;code&gt;retention_class&lt;/code&gt;) as inputs and refuse to plan if they&amp;rsquo;re missing. The same wrapper-module pattern applies to RDS, dev environments, scratch buckets, and one-off compute. Tags applied retroactively only cover what someone remembered to update; tags enforced at the module cover everything provisioned from that point forward, including the infra a future engineer spins up without thinking about cost.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id="network-and-capacity"&gt;Network and capacity
&lt;/h2&gt;&lt;p&gt;Cross-AZ traffic is the silent compute leak. AWS charges roughly $0.01 per GB in each direction for data crossing AZ boundaries, about $0.02 per round-tripped gigabyte; on a chatty service that fans out to a cache and a database in different AZs, the transfer charges add up to more than the instance cost itself within a few months. The fix is placement. Pin the chatty pair to the same AZ when the consistency model allows it. Batch the calls when it doesn&amp;rsquo;t. Move the cache layer to a per-AZ deployment so each application instance hits its local replica. The audit is one query against VPC flow logs or a glance at the Cost Explorer &amp;ldquo;Data Transfer&amp;rdquo; breakdown filtered by AZ.&lt;/p&gt;
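&lt;p&gt;The Cost Explorer half of that audit, sketched with boto3; cross-AZ transfer bills under usage types ending in &lt;code&gt;DataTransfer-Regional-Bytes&lt;/code&gt;, behind a per-region prefix.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

ce = boto3.client("ce")  # Cost Explorer
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-10-01", "End": "2025-11-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)
for group in resp["ResultsByTime"][0]["Groups"]:
    usage_type = group["Keys"][0]
    if "DataTransfer-Regional-Bytes" in usage_type:  # cross-AZ line items
        print(usage_type, group["Metrics"]["UnblendedCost"]["Amount"])
&lt;/code&gt;&lt;/pre&gt;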
&lt;p&gt;Right-sizing is the next item. Instances provisioned for a load test in 2023 that ran at 12% CPU for two years are still on the bill at the size they were provisioned for. AWS Compute Optimizer and the equivalent recommenders in GCP and Azure are accurate enough to act on for the obvious cases without further investigation. The non-obvious cases (memory-bound workloads, spiky workloads, workloads with seasonal peaks, services with strict latency budgets) need a human pass with a week of metrics in front of them. Either way the data is already in the cloud; nothing has to be instrumented.&lt;/p&gt;
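&lt;p&gt;A sketch of pulling the obvious cases out of Compute Optimizer with boto3, so the right-sizing pass starts from a list rather than a dashboard; the loose string match on the finding is deliberate.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

co = boto3.client("compute-optimizer")
for rec in co.get_ec2_instance_recommendations()["instanceRecommendations"]:
    if "over" in rec["finding"].lower():  # e.g. Overprovisioned
        best = rec["recommendationOptions"][0]  # ranked, best first
        print(rec["currentInstanceType"], "to", best["instanceType"],
              rec["instanceArn"])
&lt;/code&gt;&lt;/pre&gt;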
&lt;p&gt;HA is the third. Multi-AZ on a Postgres primary roughly doubles the instance cost. On services where a five-minute outage is genuinely tolerable (internal tools, batch jobs, dev databases, services with a clear retry path on the caller) the second instance is paying for an SLA the business doesn&amp;rsquo;t actually need. The conversation worth having is which services have an RPO and RTO that justifies the standby. Most don&amp;rsquo;t. The original architecture review made the call on every service the same way (HA on, by default) and never revisited it as the service catalog grew.&lt;/p&gt;
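&lt;p&gt;The Multi-AZ census is one more loop over the same describe call; join it with an environment tag and the which-services-earn-the-standby conversation starts from a list.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

rds = boto3.client("rds")
for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        if db["MultiAZ"]:  # paying for the standby
            print(db["DBInstanceIdentifier"], db["DBInstanceClass"])
&lt;/code&gt;&lt;/pre&gt;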
&lt;h2 id="query-tuning-and-application-work-the-largest-single-lever"&gt;Query tuning and application work: the largest single lever
&lt;/h2&gt;&lt;p&gt;Most items above remove waste in the infra layer. Query tuning and application-side optimization make the existing infra do more work per dollar, and on most systems they&amp;rsquo;re the largest single cost lever in this article. A single N+1 query in a hot path can put 10x more load on the database than necessary, and that load materializes on the bill as a more expensive RDS instance class, a higher tier in every downstream cache, and more cross-AZ traffic. The infra audit cuts the bill by trimming what isn&amp;rsquo;t needed. Query tuning cuts it by reducing what&amp;rsquo;s actually being done.&lt;/p&gt;
&lt;p&gt;Pick any production codebase older than eighteen months. At least one of the patterns covered elsewhere on this blog is in it, and almost always more than one: &lt;a class="link" href="https://explainanalyze.com/p/orms-are-a-coupling-not-an-abstraction/" &gt;N+1 ORM iteration&lt;/a&gt; on a hot route, &lt;a class="link" href="https://explainanalyze.com/p/non-sargable-predicates-how-a-function-in-where-kills-your-index/" &gt;non-SARGable predicates&lt;/a&gt; that defeat any index, indexes built without &lt;a class="link" href="https://explainanalyze.com/p/uniqueness-and-selectivity-the-two-numbers-that-drive-query-plans/" &gt;understanding selectivity&lt;/a&gt;, &lt;code&gt;OFFSET&lt;/code&gt; pagination past page 50, retry loops without backoff that triple request rate during the exact conditions that caused the original timeout, aggregations recomputed every request that could be cached for thirty seconds, and &lt;a class="link" href="https://explainanalyze.com/p/database-deadlocks-part-2-diagnosis-retries-and-prevention/" &gt;long-held row locks&lt;/a&gt; blocking unrelated work. Each one shows up at the cost layer as more vCPU, more IO, more cache pressure, and more cross-AZ traffic than the workload actually requires.&lt;/p&gt;
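&lt;p&gt;To make the first of those concrete, a minimal sketch of the N+1 shape and its batched fix, using psycopg2 and hypothetical table names; one round trip replaces N of them.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import psycopg2  # table and column names are hypothetical

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
order_ids = [1, 2, 3]  # imagine a page of 50

# N+1: one query per order. N round trips, N plans, N index probes.
for oid in order_ids:
    cur.execute("SELECT * FROM order_items WHERE order_id = %s", (oid,))
    cur.fetchall()

# Batched: one query, one round trip. psycopg2 adapts the list to an array.
cur.execute("SELECT * FROM order_items WHERE order_id = ANY(%s)", (order_ids,))
rows = cur.fetchall()
&lt;/code&gt;&lt;/pre&gt;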
&lt;p&gt;Query tuning is more expensive work than the infra audit: reading the slow-query log, profiling hot paths, and refactoring application code that touches the database. The payback shape is better, though. An infra audit recovers a fixed amount once. A query optimization saves on every future request, scaling with traffic growth.&lt;/p&gt;
&lt;h2 id="sprawl-and-surcharges"&gt;Sprawl and surcharges
&lt;/h2&gt;&lt;p&gt;Lower environments default to running 24/7 at sizes someone picked when production was a quarter of its current size. The cheapest move is scheduled shutdown nights and weekends, where the workload isn&amp;rsquo;t worldwide and the engineers using it aren&amp;rsquo;t online at 3am. A 16-hour weekday shutdown plus full weekends recovers roughly three-quarters of the monthly hours (128 of the 168 in a week). AWS Instance Scheduler, GCP&amp;rsquo;s instance schedules, and a 30-line Lambda all do the job. Lower environments don&amp;rsquo;t need HA, don&amp;rsquo;t need the same retention, and don&amp;rsquo;t need the same instance class.&lt;/p&gt;
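&lt;p&gt;A sketch of that Lambda, assuming boto3 and an &lt;code&gt;env&lt;/code&gt; tag convention; the EventBridge cron rules for the evening stop and the morning start are the wiring left out here.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # Stop every running instance tagged as a lower environment.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:env", "Values": ["dev", "staging"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": len(ids)}
&lt;/code&gt;&lt;/pre&gt;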
&lt;p&gt;Per-engineer dev environments and PR-preview deployments are a related leak. Preview environments that spin up on every pull request and don&amp;rsquo;t tear down on close. Forgotten branches with attached infra. Personal sandboxes from engineers who left the company two years ago. Same TTL-at-creation discipline as the temp storage section above.&lt;/p&gt;
&lt;p&gt;The cloud charges a surcharge on deprecated product versions in two places. EC2 instance generations are the visible one. Older generations keep their original per-hour price while each new generation launches cheaper and faster, which amounts to a quiet surcharge on anything left behind; eventually the old types stop being offered in new regions and AZs. Workloads still running on m4, c4, t2, or r3 generations are paying that surcharge today. Migrating to a current generation is usually a stop, an instance-type change, and a start, plus a brief test window. The audit is &lt;code&gt;aws ec2 describe-instances --query 'Reservations[*].Instances[*].InstanceType'&lt;/code&gt; and a join against the published previous-generation list. Same pattern in GCP for aging n1 machine types, same in Azure for older v2 series.&lt;/p&gt;
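&lt;p&gt;The join, sketched with boto3; the deprecated set below is just the families named in this paragraph, not the published list.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

DEPRECATED = {"m4", "c4", "t2", "r3"}  # from this paragraph; not exhaustive

ec2 = boto3.client("ec2")
for page in ec2.get_paginator("describe_instances").paginate():
    for res in page["Reservations"]:
        for inst in res["Instances"]:
            family = inst["InstanceType"].split(".")[0]  # "m4.xlarge" gives "m4"
            if family in DEPRECATED:
                print(inst["InstanceId"], inst["InstanceType"])
&lt;/code&gt;&lt;/pre&gt;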
&lt;p&gt;The same surcharge runs on managed databases and the rest of the managed-storage catalog, less visibly. AWS RDS Extended Support charges per-vCPU-per-hour for Postgres, MySQL, and Aurora major versions past community end-of-life, stepping up each year until the version is forcibly upgraded. MySQL 5.7 and Postgres 11 both hit that surcharge in early 2024. ElastiCache, managed OpenSearch, and DocumentDB have equivalent end-of-support timelines. Azure SQL and Cloud SQL apply similar fees on out-of-support versions. EBS gp2 volumes carry a quieter version of the dynamic: gp3 is usually cheaper at the same IOPS budget even though gp2 isn&amp;rsquo;t formally deprecated. The audit is &lt;code&gt;aws rds describe-db-instances&lt;/code&gt; joined against the engine&amp;rsquo;s published support timeline. The major upgrade was going to happen eventually; the surcharge puts a deadline on it.&lt;/p&gt;
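&lt;p&gt;The engine-version half is the same describe call with two more fields, sketched below; the version floors are illustrative placeholders, and the real ones come from each engine&amp;rsquo;s published support timeline.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

# Illustrative floors: oldest major version per engine not yet surcharged.
SUPPORTED_FLOOR = {"postgres": 13, "mysql": 8}

rds = boto3.client("rds")
for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        engine, version = db["Engine"], db["EngineVersion"]
        major = int(version.split(".")[0])
        floor = SUPPORTED_FLOOR.get(engine)
        if floor and floor &gt; major:
            print(db["DBInstanceIdentifier"], engine, version)
&lt;/code&gt;&lt;/pre&gt;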
&lt;p&gt;Unused infra is the easiest sweep and the smallest line item per resource. EBS volumes left detached after the instance was terminated, billed monthly for storage that nothing reads. Elastic IPs not associated with any instance, billed hourly for the privilege of holding them. NAT gateways carrying near-zero traffic at the same hourly base rate as one carrying terabytes. Load balancers with zero healthy targets. RDS snapshots from databases deleted years ago. CloudWatch log groups with no retention policy that have been growing since 2019. The audit script is twenty lines per resource type. The savings are small per item and large in aggregate, and the cleanup is the safest of any item in this article. Nothing in production depends on a detached volume by definition.&lt;/p&gt;
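&lt;p&gt;Two of those twenty-line sweeps, sketched with boto3: detached EBS volumes and unassociated Elastic IPs.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

ec2 = boto3.client("ec2")

# Detached volumes: status "available" means nothing is attached.
for page in ec2.get_paginator("describe_volumes").paginate(
    Filters=[{"Name": "status", "Values": ["available"]}]
):
    for vol in page["Volumes"]:
        print("detached volume:", vol["VolumeId"], vol["Size"], "GiB")

# Elastic IPs with no association: billed for sitting idle.
for addr in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in addr:
        print("idle EIP:", addr["PublicIp"])
&lt;/code&gt;&lt;/pre&gt;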
&lt;p&gt;Shared infra is the last item and the hardest call. Centralized logging, metrics, CI runners, internal developer platforms, and shared lower-environment clusters all start as obvious wins because the per-team cost is low and the operational burden is borne by a platform team. Years later the per-team cost has crossed the threshold where running it locally to the team that owns the workload would be cheaper, but the original decision is rarely revisited. The conversation worth having is per-team cost vs. operational complexity, not absolute cost. Centralization wins on operations and loses on per-team economics at scale, and the right answer for a 50-engineer org is rarely the right answer for a 500-engineer one.&lt;/p&gt;
&lt;h2 id="when-this-discipline-isnt-worth-running"&gt;When this discipline isn&amp;rsquo;t worth running
&lt;/h2&gt;&lt;p&gt;Three conditions make the cost sweep overkill. Very small accounts where the total monthly bill is under a few thousand dollars don&amp;rsquo;t repay the engineering time it takes to walk the list. Workloads in a hard regulatory regime where retention, HA, and cross-region replication are externally mandated have less room to cut than the article suggests; the audit still surfaces the line items, but the action set is smaller. And teams in a steep growth phase, where an engineer&amp;rsquo;s time is worth more than the likely savings, should defer the sweep until growth stabilizes. The discipline pays back at sustained scale, in established workloads, with engineering time available to allocate.&lt;/p&gt;
&lt;h2 id="the-bigger-picture"&gt;The bigger picture
&lt;/h2&gt;&lt;p&gt;On most companies&amp;rsquo; P&amp;amp;L, the cloud bill is one of the few lines that grows by default. Every service deployed adds to it, every backup adds to it, nothing in the standard development loop subtracts from it. Engineering owns deployments and operations and doesn&amp;rsquo;t own the bill, so the people who can make it go down aren&amp;rsquo;t the people who feel it go up. The pattern across every item above is the same: someone made a reasonable decision in 2021 with the data they had, the underlying numbers changed by an order of magnitude, and nobody re-ran the decision.&lt;/p&gt;
&lt;p&gt;The exercise isn&amp;rsquo;t about finding mistakes. It&amp;rsquo;s about re-running old decisions against current numbers, in the order where the gap is biggest. Most teams discover that an annual one-day audit by one engineer recovers a five-figure sum in monthly spend. The audit takes longer the first time because nothing is documented. By year three it&amp;rsquo;s a quarterly quick-pass, and the line item nobody used to read is the one finance forwards as good news.&lt;/p&gt;</description></item></channel></rss>