Cloud bills have this special talent. They start small, everyone’s happy, then one day finance asks why the invoice looks like a phone number.
And the annoying part is it’s usually not one big mistake. It’s a hundred tiny ones. Stuff you forgot. Stuff nobody owns. Stuff that “we’ll clean up later”.
This is a how-to listicle. No theory. Just things you can actually do this week to bring AWS and Azure costs down.
Why most teams overspend on AWS & Azure (and why it keeps happening)
Cloud is marketed as pay as you go.
In real life, it’s more like pay for what you forget.
A few patterns show up almost everywhere:
- No clear ownership. Resources exist, but no human is accountable for the bill.
- No tagging. So you can’t tell what belongs to which team, app, environment, or experiment.
- Dev and test environments left running. Nights, weekends, holidays. Still billed.
- Overprovisioned compute. Instances sized for peak traffic that happens for 30 minutes a day.
- Idle storage. Snapshots, logs, backups, old object files, orphaned disks.
- Surprise data transfer. NAT gateways, cross AZ chatter, egress, replication.
The fix that actually sticks is FinOps. Not as a tool. As an operating habit.
FinOps is basically three things:
- Visibility: see spend clearly, by owner/team/app.
- Accountability: someone is responsible for every dollar.
- Continuous optimization: not a one time cleanup. A loop.
Below are 5 practical strategies you can start applying immediately in AWS and Azure. Even if your environment is messy. Especially if it’s messy.
Strategy 1: Kill “zombie resources” (the fastest way to reduce cloud bills)
Zombie resources are things that cost money while doing basically nothing.
They happen because someone spun something up for a test, migrated a workload, changed architecture, or deleted an app… and the leftovers stayed behind.
Common zombies:
- Unattached disks (EBS volumes, Azure managed disks)
- Idle load balancers
- Old snapshots
- Orphaned public IPs
- Stopped but still billed services (depends on the service)
- Abandoned dev environments
- Logs set to never expire
Action checklist (AWS)
Go hunting for these first:
- Unattached EBS volumes
- Look for volumes in “available” state (not attached). Delete if truly unused.
- Old EBS snapshots
- Especially manual snapshots and ones not tied to active AMIs or backup policies.
- Idle ALBs/ELBs
- Load balancers with no healthy targets or near zero requests.
- Unused Elastic IPs
- Unassociated Elastic IPs cost money. Release them.
- Old AMIs
- AMIs themselves aren’t expensive, but the snapshots behind them can add up.
- NAT Gateways you don’t need
- NAT Gateway charges hourly plus per GB. If it’s attached to dead subnets or legacy stacks, it’s a common leak.
- Idle RDS instances
- “Dev DB” running 24/7 is a classic. If you can, stop it, schedule it, or move to serverless or smaller tiers.
- CloudWatch log retention set to “never expire”
- This one hurts slowly, then all at once. Set retention per log group.
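The first item on the list, unattached EBS volumes, is a few lines of filtering once you have the metadata. This is a minimal sketch assuming you've already fetched volume descriptions (for example via boto3's `describe_volumes`); it operates on plain dicts of that shape rather than calling AWS.

```python
# Sketch: flag EBS volumes in "available" state (unattached) as zombie
# candidates. Assumes volume metadata was already fetched, e.g. from
# boto3's ec2.describe_volumes(); we work on plain dicts of that shape.

def find_unattached_volumes(volumes):
    """Return volume IDs not attached to any instance."""
    return [
        v["VolumeId"]
        for v in volumes
        if v.get("State") == "available"  # "in-use" means attached
    ]

volumes = [
    {"VolumeId": "vol-001", "State": "in-use"},
    {"VolumeId": "vol-002", "State": "available"},
    {"VolumeId": "vol-003", "State": "available"},
]
print(find_unattached_volumes(volumes))  # ['vol-002', 'vol-003']
```

Same filter idea works for orphaned Azure managed disks: look for disks with no `managedBy` VM.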
Action checklist (Azure)
Same idea, different names:
- Unattached managed disks
- Disks not attached to any VM. Delete after verification.
- Idle public IPs
- Especially “reserved” public IPs not attached to anything.
- Unused Load Balancers / Application Gateways
- App Gateway can be pricey. Make sure it’s actually serving traffic.
- Old snapshots
- Snapshot sprawl is real. Set a retention policy.
- Orphaned NICs
- Network interfaces left behind after VM deletion.
- Log Analytics retention too long
- If you kept everything for 365 days “just in case”, you’re paying for that decision.
Operational tip: do a weekly “reap”, but don’t break prod
Make it routine. Put 30 minutes on the calendar every week.
Also add a lightweight approval flow so you don’t delete something critical:
- Post candidates in Slack (or create a Jira ticket)
- Give owners 48 hours to object
- If nobody claims it, delete it
If you want this to be smooth, require ownership tags (we’ll get there). No owner tag means it’s eligible for reaping.
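The approval flow above boils down to one eligibility rule. Here's a sketch of it in Python; the field names (`notified_at`, `claimed`) are illustrative, not a real API.

```python
# Sketch of the reap eligibility rule: a resource is a reap candidate when
# it has no owner tag, or when the owner was notified and the 48-hour
# objection window passed with no claim. Field names are illustrative.
from datetime import datetime, timedelta

OBJECTION_WINDOW = timedelta(hours=48)

def eligible_for_reaping(resource, now):
    tags = resource.get("tags", {})
    if "owner" not in tags:
        return True  # no owner tag means eligible immediately
    notified = resource.get("notified_at")
    if notified is None:
        return False  # owner exists but hasn't been pinged yet
    return not resource.get("claimed") and now - notified >= OBJECTION_WINDOW

now = datetime(2026, 2, 10, 12, 0)
print(eligible_for_reaping({"tags": {}}, now))  # True
```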
Quick win: TTL tags for non prod
Add a TTL tag to every non prod resource. Example:
expire_after=2026-02-15
Even better:
- enforce it in Terraform modules
- have a weekly job that finds expired resources and shuts them down (or deletes after another approval step)
This alone can stop the endless “temporary” environments that live forever.
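The weekly job's core check is trivial, which is the point. A sketch, assuming the tag value uses the `YYYY-MM-DD` format shown above:

```python
# Sketch: decide whether a resource's expire_after TTL tag is in the past.
# Assumes the YYYY-MM-DD format from the example above.
from datetime import date

def is_expired(tags, today):
    ttl = tags.get("expire_after")
    if ttl is None:
        return False  # untagged resources go through the reap flow instead
    return date.fromisoformat(ttl) < today

today = date(2026, 3, 1)
print(is_expired({"expire_after": "2026-02-15"}, today))  # True
print(is_expired({"expire_after": "2026-04-01"}, today))  # False
```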
Strategy 2: Right size compute and databases (stop paying for peak all day)
Right sizing is simple in concept.
You match vCPU and RAM to what the workload actually uses, not what someone guessed during setup.
Most environments have:
- instances running at 2 to 10% CPU all day
- databases on premium tiers “just in case”
- non prod that’s always on
- node counts that never got revisited after traffic changed
What to look for
Start with your biggest spend items and check:
- Low average CPU
- Memory pressure (careful: low CPU with high memory usage is common)
- Spiky workloads that don’t need always on sizing
- Oversized node counts (Kubernetes clusters, VMSS, ASGs)
- Always on dev/test
Azure actions
Do this in a tight loop:
- Use Azure Advisor recommendations as your first pass.
- Resize VMs down one size, not three sizes. You want safe, repeatable wins.
- Review App Service Plans
- Many teams overpay here by keeping beefy plans for lightweight apps. Also check auto scale rules.
- Review Azure SQL Database / Managed Instance tiers
- If you’re on a high tier because performance used to be bad, validate if it’s still needed. Consider serverless where it fits.
Process tip: downsize in small steps
The safest way to right size:
- Pick one service (say your top VM group).
- Reduce one size down.
- Measure for 7 days.
- Repeat.
Don’t do a giant “resize everything Friday night” project. That’s how rollbacks happen and everyone gets scared of cost optimization forever.
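The "one size down, not three" rule can be encoded so nobody has to eyeball it. A sketch; the size ladder below is illustrative (Azure D-series style names), so swap in whatever family you actually run.

```python
# Sketch of "one size down, not three": step a VM exactly one notch down
# its size ladder. The ladder is illustrative, not an exhaustive SKU list.

SIZE_LADDER = ["D2s_v5", "D4s_v5", "D8s_v5", "D16s_v5", "D32s_v5"]

def one_size_down(current):
    i = SIZE_LADDER.index(current)
    if i == 0:
        return current  # already smallest; nothing to do
    return SIZE_LADDER[i - 1]

print(one_size_down("D8s_v5"))  # D4s_v5
print(one_size_down("D2s_v5"))  # D2s_v5
```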
Guardrail: budgets and alerts for accidental scale ups
Set budgets and alerts so you catch:
- sudden scale out events
- someone switching a VM from D series to something huge
- runaway managed services
In AWS, use AWS Budgets and alerts. In Azure, use Cost Management budgets and alerts. The tool matters less than having the alarm in the first place.
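Whichever tool you use, the alarm logic is the same: which thresholds has month-to-date spend crossed? A sketch with illustrative thresholds (the 50/80/100% tiers mirror what AWS Budgets and Azure Cost Management let you configure):

```python
# Sketch of a budget guardrail: return the fraction-of-budget thresholds
# that month-to-date spend has crossed. Thresholds are illustrative.

def breached_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]

print(breached_thresholds(850, 1000))   # [0.5, 0.8]
print(breached_thresholds(1200, 1000))  # [0.5, 0.8, 1.0]
```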
Strategy 3: Use commitment discounts properly (Reserved Instances, Savings Plans, Azure Reservations)
Commitments are the biggest lever for steady workloads.
But only after you do cleanup and right sizing.
If you commit first, you might lock in waste. You’ll get a discount, sure, but you’re still paying for something you didn’t need.
AWS: Reserved Instances vs Savings Plans
Quick way to think about it:
Savings Plans
More flexible. Great default for compute baselines. Compute Savings Plans are usually the safest choice because they apply across EC2 instance families, sizes, and regions, and also cover Fargate and Lambda.
Reserved Instances (RIs)
More specific. Can be great when you know exactly what will run. RDS Reserved Instances can be very effective for stable databases that won’t change often.
If you’re unsure, start with Compute Savings Plans for baseline compute.
Azure: Reservations
Azure Reservations can reduce cost for steady usage on:
- VMs
- SQL
- other eligible services depending on SKU
Same rule applies. Commit to what you know is steady.
How to choose commitment level (don’t go 100%)
Start with 30 to 60% of your steady state usage.
Why not 100%? Because architectures change. Teams migrate. Products get killed. Suddenly your “perfect” reservation coverage becomes unused coverage. Which is just a new kind of waste.
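One way to turn "30 to 60% of steady state" into a number: take a conservative baseline from observed hourly usage (a low percentile, so you almost always run at or above it) and commit to a fraction of that. A sketch; the 10th percentile and 45% coverage target are illustrative choices, not a rule.

```python
# Sketch: size a commitment from observed usage, not peak. Take the 10th
# percentile of hourly usage (a level you exceed ~90% of hours) and commit
# to a fraction of it. Percentile and coverage fraction are illustrative.

def commitment_size(hourly_usage, coverage=0.45):
    usage = sorted(hourly_usage)
    p10 = usage[int(len(usage) * 0.10)]
    return p10 * coverage

# e.g. hourly vCPU-hours over a sample window
sample = [40, 42, 45, 50, 55, 60, 80, 120, 44, 41]
print(commitment_size(sample))
```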
Operational tip: align commitments with app lifecycle
Before buying 1 year or 3 year commitments, ask:
- is this workload stable for the next 12 months?
- are we planning a migration (Kubernetes, serverless, platform change)?
- is this app seasonal?
Commitments should follow reality, not hope.
Tracking: monthly coverage and utilization review
Put a monthly meeting on the calendar (30 minutes is enough) to check:
- coverage (how much of your usage is discounted)
- utilization (are you actually using what you reserved)
If utilization is low, it’s a signal. Either the environment changed or the reservation strategy is wrong.
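The two numbers for that monthly meeting are simple ratios, shown here in the same unit (e.g. normalized instance-hours):

```python
# Sketch of the two monthly review metrics: coverage = share of total usage
# that ran under a discount; utilization = share of purchased reservation
# that was actually consumed.

def coverage(discounted_usage, total_usage):
    return discounted_usage / total_usage

def utilization(used_reserved, purchased_reserved):
    return used_reserved / purchased_reserved

print(coverage(600, 1000))    # 0.6
print(utilization(450, 600))  # 0.75
```

High coverage with low utilization means you over-committed; low coverage with high utilization means there's room to commit more.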
Strategy 4: Fix tagging + cost allocation (FinOps basics that unlock every other strategy)
If you can’t attribute spend, you can’t control it.
When nobody knows who owns a resource, it doesn’t get optimized. It just sits there, billing quietly.
Tagging is the boring part that makes everything else possible.
Minimum tag set (AWS + Azure)
Keep it small and enforce it:
- owner
- team
- environment (prod, staging, dev)
- application
- cost_center
- ttl or expire_after
That’s enough to answer, quickly:
- who owns this?
- why does it exist?
- should it still exist?
Azure structure: Management Groups and subscriptions
If you’re on Azure, structure matters a lot:
- Use Management Groups for high level org structure.
- Use separate subscriptions per environment or team when it makes sense.
- Enforce required tags with Azure Policy.
- Use Azure Cost Management allocation views to report by subscription, resource group, and tags.
Make it stick: “no tag, no deploy”
This is where most teams fail. They “encourage” tagging.
Don’t encourage it. Enforce it.
- In Terraform modules, make tags required variables.
- In CI, fail builds when required tags are missing.
- For manual console usage, lock down permissions or use policies to require tags on creation.
No tag, no deploy. Sounds harsh. But it’s less harsh than endless cloud waste.
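The CI gate itself is a small validation step. Here's a sketch that checks a resource's tags against the minimum tag set; wire something like this into whatever step parses your Terraform plan, and fail the build when the list comes back non-empty.

```python
# Sketch of a "no tag, no deploy" CI check: validate tags against the
# minimum tag set and report violations. REQUIRED_TAGS mirrors the minimum
# set above; the environment values match the allowed environments.

REQUIRED_TAGS = {"owner", "team", "environment", "application", "cost_center"}
VALID_ENVIRONMENTS = {"prod", "staging", "dev"}

def tag_violations(tags):
    problems = [f"missing tag: {t}" for t in sorted(REQUIRED_TAGS - tags.keys())]
    env = tags.get("environment")
    if env is not None and env not in VALID_ENVIRONMENTS:
        problems.append(f"invalid environment: {env}")
    return problems  # empty list means the resource may deploy

print(tag_violations({"owner": "alice", "environment": "qa"}))
```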
Reporting cadence
Two rhythms that work:
- Weekly: team level spend snapshot (what changed, top movers)
- Monthly: FinOps review (bigger decisions, commitments, architecture issues)
Keep it lightweight. But make it real.
Strategy 5: Control the silent killers: storage growth + data transfer + logging
Compute gets attention. Storage and transfer sneak up behind you.
The bill grows slowly, then suddenly it’s huge and nobody knows why.
Storage growth (snapshots, logs, object storage)
This is where “set and forget” becomes expensive.
AWS storage actions
- Add S3 lifecycle policies
- Move older data to IA or Glacier automatically.
- Delete old multipart uploads in S3
- Incomplete uploads can linger and cost money.
- Shorten CloudWatch Logs retention
- Don’t keep everything forever by default.
- Review EBS gp3 sizing
- gp3 decouples IOPS and throughput. Many volumes are overprovisioned because they were set up under gp2 habits.
- Clean up old EBS snapshots
- Especially ones created by humans. Backup tools usually have retention. Humans usually don’t.
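For the snapshot cleanup, the core check is an age filter. A sketch, assuming snapshot metadata was already fetched (e.g. via boto3's `describe_snapshots`); the 90-day cutoff is an illustrative retention choice.

```python
# Sketch: flag snapshots older than a retention cutoff as cleanup
# candidates. Dicts loosely mimic boto3 describe_snapshots output;
# max_age_days=90 is an illustrative policy, not a recommendation.
from datetime import datetime, timedelta

def stale_snapshots(snapshots, now, max_age_days=90):
    cutoff = now - timedelta(days=max_age_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]

now = datetime(2026, 3, 1)
snaps = [
    {"SnapshotId": "snap-old", "StartTime": datetime(2025, 6, 1)},
    {"SnapshotId": "snap-new", "StartTime": datetime(2026, 2, 20)},
]
print(stale_snapshots(snaps, now))  # ['snap-old']
```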
Data transfer: the “wait, why is this so high?” category
Common surprises:
- NAT Gateways (hourly + per GB)
- Cross AZ traffic (chatty microservices across zones)
- Inter region replication
- Egress to internet (downloads, APIs, web traffic, updates)
Mitigations:
- Keep workloads close. Same region, same VPC/VNet, and reduce cross zone chatter where possible.
- Use a CDN for public content.
- Review architecture for unnecessary cross AZ calls.
- Compress and aggregate logs. Don’t ship raw noisy logs everywhere.
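To see why NAT Gateways land in the "wait, why is this so high?" bucket, model the two charges. The rates below are illustrative placeholders, not current AWS prices; check your region's price list before trusting any number like this.

```python
# Sketch: rough monthly NAT Gateway cost = hourly charge + per-GB
# processing charge. Rates are ILLUSTRATIVE placeholders, not real
# AWS pricing; substitute your region's actual rates.

HOURLY_RATE = 0.045  # $/hour (illustrative)
PER_GB_RATE = 0.045  # $/GB processed (illustrative)

def monthly_nat_cost(gb_processed, hours=730):
    return hours * HOURLY_RATE + gb_processed * PER_GB_RATE

print(round(monthly_nat_cost(5000), 2))
```

Note the per-GB term dominates once traffic grows, which is why routing chatty internal traffic through a NAT Gateway gets expensive fast.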
A simple monthly “3 bucket audit”
Once a month, audit:
- Storage (object, snapshots, disks)
- Transfer (NAT, egress, cross AZ/region)
- Observability (logs, metrics, traces retention)
If you do only this, you’ll catch most of the slow leaks before they become a crisis.
AWS pricing in Malaysia: what to know before you estimate savings
If your company is in Malaysia, your AWS pricing is still mainly determined by the AWS region you deploy in.
Most Malaysian companies use Singapore for latency. Some use other regions for compliance, resiliency, or global users.
A few realities to keep in mind:
- Base compute prices are regional. Malaysia as a location doesn’t magically change EC2 rates.
- Your bill can still shift because of exchange rates, taxes, and billing setup.
- Data residency and compliance choices might push you into specific regions, which affects cost.
Practical guidance
- Choose region based on latency + compliance + cost, in that order for most production systems.
- Don’t assume “Malaysia” changes the underlying compute pricing. Model the actual region you will run in.
Cost estimation workflow
- Use the AWS Pricing Calculator for your target region (often Singapore).
- Model data transfer to users in Malaysia separately. This is where surprises live.
- Include your AWS Support plan if you use one.
Comparing AWS vs Azure
Compare equivalent regions and equivalent services.
Not headline instance prices.
And honestly, most of the time, the biggest savings come from:
- deleting zombies
- right sizing
- commitments done correctly
Not “region shopping”.
Let’s wrap up: a simple 30 day plan to cut AWS & Azure spend
Here’s a plan that won’t overwhelm your team.
Week 1
- Zombie resource sweep (AWS and Azure)
- Add TTL tags for non prod (expire_after=YYYY-MM-DD)
- Start a weekly reap routine with a simple approval flow
Week 2
- Right size top 10 spend services (compute + DB)
- Resize one step down, measure 7 days, repeat
Week 3
- Implement tagging policy (minimum tag set)
- Build team cost dashboards
- Start weekly spend snapshots
Week 4
- Buy commitments for verified baseline usage (start 30 to 60%)
- Add lifecycle and retention policies for storage and logs
- Set budgets and alerts so you catch regressions
Cost optimization is a habit. Schedule it like security patching.
If you do the basics consistently, the cloud bill stops being a mystery. And it stops growing just because someone forgot to clean up after themselves.

