Cloud bills have this special talent. They start small, everyone’s happy, then one day finance asks why the invoice looks like a phone number.
And the annoying part is it’s usually not one big mistake. It’s a hundred tiny ones. Stuff you forgot. Stuff nobody owns. Stuff that “we’ll clean up later”.
This is a how-to listicle. No theory. Just things you can actually do this week to bring AWS and Azure costs down.
Why most teams overspend on AWS & Azure (and why it keeps happening)
Cloud is marketed as pay as you go.
In real life, it’s more like pay for what you forget.
A few patterns show up almost everywhere:
- No clear ownership. Resources exist, but no human is accountable for the bill.
- No tagging. So you can’t tell what belongs to which team, app, environment, or experiment.
- Dev and test environments left running. Nights, weekends, holidays. Still billed.
- Overprovisioned compute. Instances sized for peak traffic that happens for 30 minutes a day.
- Idle storage. Snapshots, logs, backups, old object files, orphaned disks.
- Surprise data transfer. NAT gateways, cross AZ chatter, egress, replication.
The fix that actually sticks is FinOps. Not as a tool. As an operating habit.
FinOps is basically three things:
- Visibility: see spend clearly, by owner/team/app.
- Accountability: someone is responsible for every dollar.
- Continuous optimization: not a one time cleanup. A loop.
Below are 5 practical strategies you can start applying immediately in AWS and Azure. Even if your environment is messy. Especially if it’s messy.
Strategy 1: Kill “zombie resources” (the fastest way to reduce cloud bills)
Zombie resources are things that cost money while doing basically nothing.
They happen because someone spun something up for a test, migrated a workload, changed architecture, or deleted an app… and the leftovers stayed behind.
Common zombies:
- Unattached disks (EBS volumes, Azure managed disks)
- Idle load balancers
- Old snapshots
- Orphaned public IPs
- Stopped but still billed services (depends on the service)
- Abandoned dev environments
- Logs set to never expire
Action checklist (AWS)
Go hunting for these first:
- Unattached EBS volumes
- Look for volumes in “available” state (not attached). Delete if truly unused.
- Old EBS snapshots
- Especially manual snapshots and ones not tied to active AMIs or backup policies.
- Idle ALBs/ELBs
- Load balancers with no healthy targets or near zero requests.
- Unused Elastic IPs
- Unassociated Elastic IPs cost money. Release them.
- Old AMIs
- AMIs themselves aren’t expensive, but the snapshots behind them can add up.
- NAT Gateways you don’t need
- NAT Gateway charges hourly plus per GB. If it’s attached to dead subnets or legacy stacks, it’s a common leak.
- Idle RDS instances
- “Dev DB” running 24/7 is a classic. If you can, stop it, schedule it, or move to serverless or smaller tiers.
- CloudWatch log retention set to “never expire”
- This one hurts slowly, then all at once. Set retention per log group.
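The first item on the list, unattached EBS volumes, is a few lines of filtering once you have the metadata. This is a minimal sketch assuming you've already fetched volume descriptions (for example via boto3's `describe_volumes`); it operates on plain dicts of that shape rather than calling AWS.

```python
# Sketch: flag EBS volumes in "available" state (unattached) as zombie
# candidates. Assumes volume metadata was already fetched, e.g. from
# boto3's ec2.describe_volumes(); we work on plain dicts of that shape.

def find_unattached_volumes(volumes):
    """Return volume IDs not attached to any instance."""
    return [
        v["VolumeId"]
        for v in volumes
        if v.get("State") == "available"  # "in-use" means attached
    ]

volumes = [
    {"VolumeId": "vol-001", "State": "in-use"},
    {"VolumeId": "vol-002", "State": "available"},
    {"VolumeId": "vol-003", "State": "available"},
]
print(find_unattached_volumes(volumes))  # ['vol-002', 'vol-003']
```

Same filter idea works for orphaned Azure managed disks: look for disks with no `managedBy` VM.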
Action checklist (Azure)
Same idea, different names:
- Unattached managed disks
- Disks not attached to any VM. Delete after verification.
- Idle public IPs
- Especially “reserved” public IPs not attached to anything.
- Unused Load Balancers / Application Gateways
- App Gateway can be pricey. Make sure it’s actually serving traffic.
- Old snapshots
- Snapshot sprawl is real. Set a retention policy.
- Orphaned NICs
- Network interfaces left behind after VM deletion.
- Log Analytics retention too long
- If you kept everything for 365 days “just in case”, you’re paying for that decision.
Operational tip: do a weekly “reap”, but don’t break prod
Make it routine. Put 30 minutes on the calendar every week.
Also add a lightweight approval flow so you don’t delete something critical:
- Post candidates in Slack (or create a Jira ticket)
- Give owners 48 hours to object
- If nobody claims it, delete it
If you want this to be smooth, require ownership tags (we’ll get there). No owner tag means it’s eligible for reaping.
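The approval flow above boils down to one eligibility rule. Here's a sketch of it in Python; the field names (`notified_at`, `claimed`) are illustrative, not a real API.

```python
# Sketch of the reap eligibility rule: a resource is a reap candidate when
# it has no owner tag, or when the owner was notified and the 48-hour
# objection window passed with no claim. Field names are illustrative.
from datetime import datetime, timedelta

OBJECTION_WINDOW = timedelta(hours=48)

def eligible_for_reaping(resource, now):
    tags = resource.get("tags", {})
    if "owner" not in tags:
        return True  # no owner tag means eligible immediately
    notified = resource.get("notified_at")
    if notified is None:
        return False  # owner exists but hasn't been pinged yet
    return not resource.get("claimed") and now - notified >= OBJECTION_WINDOW

now = datetime(2026, 2, 10, 12, 0)
print(eligible_for_reaping({"tags": {}}, now))  # True
```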
Quick win: TTL tags for non prod
Add a TTL tag to every non prod resource. Example:
expire_after=2026-02-15
Even better:
- enforce it in Terraform modules
- have a weekly job that finds expired resources and shuts them down (or deletes after another approval step)
This alone can stop the endless “temporary” environments that live forever.
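The weekly job's core check is trivial, which is the point. A sketch, assuming the tag value uses the `YYYY-MM-DD` format shown above:

```python
# Sketch: decide whether a resource's expire_after TTL tag is in the past.
# Assumes the YYYY-MM-DD format from the example above.
from datetime import date

def is_expired(tags, today):
    ttl = tags.get("expire_after")
    if ttl is None:
        return False  # untagged resources go through the reap flow instead
    return date.fromisoformat(ttl) < today

today = date(2026, 3, 1)
print(is_expired({"expire_after": "2026-02-15"}, today))  # True
print(is_expired({"expire_after": "2026-04-01"}, today))  # False
```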
Strategy 2: Right size compute and databases (stop paying for peak all day)
Right sizing is simple in concept.
You match vCPU and RAM to what the workload actually uses, not what someone guessed during setup.
Most environments have:
- instances running at 2 to 10% CPU all day
- databases on premium tiers “just in case”
- non prod that’s always on
- node counts that never got revisited after traffic changed
What to look for
Start with your biggest spend items and check:
- Low average CPU
- Memory pressure (careful: low CPU with high memory usage is common)
- Spiky workloads that don’t need always on sizing
- Oversized node counts (Kubernetes clusters, VMSS, ASGs)
- Always on dev/test
Azure actions
Do this in a tight loop:
- Use Azure Advisor recommendations as your first pass.
- Resize VMs down one size, not three sizes. You want safe, repeatable wins.
- Review App Service Plans
- Many teams overpay here by keeping beefy plans for lightweight apps. Also check auto scale rules.
- Review Azure SQL Database / Managed Instance tiers
- If you’re on a high tier because performance used to be bad, validate if it’s still needed. Consider serverless where it fits.
Process tip: downsize in small steps
The safest way to right size:
- Pick one service (say your top VM group).
- Reduce one size down.
- Measure for 7 days.
- Repeat.
Don’t do a giant “resize everything Friday night” project. That’s how rollbacks happen and everyone gets scared of cost optimization forever.
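The "one size down, not three" rule can be encoded so nobody has to eyeball it. A sketch; the size ladder below is illustrative (Azure D-series style names), so swap in whatever family you actually run.

```python
# Sketch of "one size down, not three": step a VM exactly one notch down
# its size ladder. The ladder is illustrative, not an exhaustive SKU list.

SIZE_LADDER = ["D2s_v5", "D4s_v5", "D8s_v5", "D16s_v5", "D32s_v5"]

def one_size_down(current):
    i = SIZE_LADDER.index(current)
    if i == 0:
        return current  # already smallest; nothing to do
    return SIZE_LADDER[i - 1]

print(one_size_down("D8s_v5"))  # D4s_v5
print(one_size_down("D2s_v5"))  # D2s_v5
```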
Guardrail: budgets and alerts for accidental scale ups
Set budgets and alerts so you catch:
- sudden scale out events
- someone switching a VM from D series to something huge
- runaway managed services
In AWS, use AWS Budgets and alerts. In Azure, use Cost Management budgets and alerts. The tool matters less than having the alarm in the first place.
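Whichever tool you use, the alarm logic is the same: which thresholds has month-to-date spend crossed? A sketch with illustrative thresholds (the 50/80/100% tiers mirror what AWS Budgets and Azure Cost Management let you configure):

```python
# Sketch of a budget guardrail: return the fraction-of-budget thresholds
# that month-to-date spend has crossed. Thresholds are illustrative.

def breached_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]

print(breached_thresholds(850, 1000))   # [0.5, 0.8]
print(breached_thresholds(1200, 1000))  # [0.5, 0.8, 1.0]
```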
Strategy 3: Use commitment discounts properly (Reserved Instances, Savings Plans, Azure Reservations)
Commitments are the biggest lever for steady workloads.
But only after you do cleanup and right sizing.
If you commit first, you might lock in waste. You’ll get a discount, sure, but you’re still paying for something you didn’t need.
AWS: Reserved Instances vs Savings Plans
Quick way to think about it:
Savings Plans
More flexible. Great default for compute baselines. Compute Savings Plans are usually the safest choice because they apply across EC2 instance families, sizes, and regions, and also cover Fargate and Lambda.
Reserved Instances (RIs)
More specific. Can be great when you know exactly what will run. RDS Reserved Instances can be very effective for stable databases that won’t change often.
If you’re unsure, start with Compute Savings Plans for baseline compute.
Azure: Reservations
Azure Reservations can reduce cost for steady usage on:
- VMs
- SQL
- other eligible services depending on SKU
Same rule applies. Commit to what you know is steady.
How to choose commitment level (don’t go 100%)
Start with 30 to 60% of your steady state usage.
Why not 100%? Because architectures change. Teams migrate. Products get killed. Suddenly your “perfect” reservation coverage becomes unused coverage. Which is just a new kind of waste.
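One way to turn "30 to 60% of steady state" into a number: take a conservative baseline from observed hourly usage (a low percentile, so you almost always run at or above it) and commit to a fraction of that. A sketch; the 10th percentile and 45% coverage target are illustrative choices, not a rule.

```python
# Sketch: size a commitment from observed usage, not peak. Take the 10th
# percentile of hourly usage (a level you exceed ~90% of hours) and commit
# to a fraction of it. Percentile and coverage fraction are illustrative.

def commitment_size(hourly_usage, coverage=0.45):
    usage = sorted(hourly_usage)
    p10 = usage[int(len(usage) * 0.10)]
    return p10 * coverage

# e.g. hourly vCPU-hours over a sample window
sample = [40, 42, 45, 50, 55, 60, 80, 120, 44, 41]
print(commitment_size(sample))
```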
Operational tip: align commitments with app lifecycle
Before buying 1 year or 3 year commitments, ask:
- is this workload stable for the next 12 months?
- are we planning a migration (Kubernetes, serverless, platform change)?
- is this app seasonal?
Commitments should follow reality, not hope.
Tracking: monthly coverage and utilization review
Put a monthly meeting on the calendar (30 minutes is enough) to check:
- coverage (how much of your usage is discounted)
- utilization (are you actually using what you reserved)
If utilization is low, it’s a signal. Either the environment changed or the reservation strategy is wrong.
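The two numbers for that monthly meeting are simple ratios, shown here in the same unit (e.g. normalized instance-hours):

```python
# Sketch of the two monthly review metrics: coverage = share of total usage
# that ran under a discount; utilization = share of purchased reservation
# that was actually consumed.

def coverage(discounted_usage, total_usage):
    return discounted_usage / total_usage

def utilization(used_reserved, purchased_reserved):
    return used_reserved / purchased_reserved

print(coverage(600, 1000))    # 0.6
print(utilization(450, 600))  # 0.75
```

High coverage with low utilization means you over-committed; low coverage with high utilization means there's room to commit more.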
Strategy 4: Fix tagging + cost allocation (FinOps basics that unlock every other strategy)
If you can’t attribute spend, you can’t control it.
When nobody knows who owns a resource, it doesn’t get optimized. It just sits there, billing quietly.
Tagging is the boring part that makes everything else possible.
Minimum tag set (AWS + Azure)
Keep it small and enforce it:
- owner
- team
- environment (prod, staging, dev)
- application
- cost_center
- ttl or expire_after
That’s enough to answer, quickly:
- who owns this?
- why does it exist?
- should it still exist?
Azure structure: Management Groups and subscriptions
If you’re on Azure, structure matters a lot:
- Use Management Groups for high level org structure.
- Use separate subscriptions per environment or team when it makes sense.
- Enforce required tags with Azure Policy.
- Use Azure Cost Management allocation views to report by subscription, resource group, and tags.
Make it stick: “no tag, no deploy”
This is where most teams fail. They “encourage” tagging.
Don’t encourage it. Enforce it.
- In Terraform modules, make tags required variables.
- In CI, fail builds when required tags are missing.
- For manual console usage, lock down permissions or use policies to require tags on creation.
No tag, no deploy. Sounds harsh. But it’s less harsh than endless cloud waste.
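The CI gate itself is a small validation step. Here's a sketch that checks a resource's tags against the minimum tag set; wire something like this into whatever step parses your Terraform plan, and fail the build when the list comes back non-empty.

```python
# Sketch of a "no tag, no deploy" CI check: validate tags against the
# minimum tag set and report violations. REQUIRED_TAGS mirrors the minimum
# set above; the environment values match the allowed environments.

REQUIRED_TAGS = {"owner", "team", "environment", "application", "cost_center"}
VALID_ENVIRONMENTS = {"prod", "staging", "dev"}

def tag_violations(tags):
    problems = [f"missing tag: {t}" for t in sorted(REQUIRED_TAGS - tags.keys())]
    env = tags.get("environment")
    if env is not None and env not in VALID_ENVIRONMENTS:
        problems.append(f"invalid environment: {env}")
    return problems  # empty list means the resource may deploy

print(tag_violations({"owner": "alice", "environment": "qa"}))
```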
Reporting cadence
Two rhythms that work:
- Weekly: team level spend snapshot (what changed, top movers)
- Monthly: FinOps review (bigger decisions, commitments, architecture issues)
Keep it lightweight. But make it real.
Strategy 5: Control the silent killers: storage growth + data transfer + logging
Compute gets attention. Storage and transfer sneak up behind you.
The bill grows slowly, then suddenly it’s huge and nobody knows why.
Storage growth (snapshots, logs, object storage)
This is where “set and forget” becomes expensive.
AWS storage actions
- Add S3 lifecycle policies
- Move older data to IA or Glacier automatically.
- Delete old multipart uploads in S3
- Incomplete uploads can linger and cost money.
- Shorten CloudWatch Logs retention
- Don’t keep everything forever by default.
- Review EBS gp3 sizing
- gp3 decouples IOPS and throughput. Many volumes are overprovisioned because they were set up under gp2 habits.
- Clean up old EBS snapshots
- Especially ones created by humans. Backup tools usually have retention. Humans usually don’t.
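For the snapshot cleanup, the core check is an age filter. A sketch, assuming snapshot metadata was already fetched (e.g. via boto3's `describe_snapshots`); the 90-day cutoff is an illustrative retention choice.

```python
# Sketch: flag snapshots older than a retention cutoff as cleanup
# candidates. Dicts loosely mimic boto3 describe_snapshots output;
# max_age_days=90 is an illustrative policy, not a recommendation.
from datetime import datetime, timedelta

def stale_snapshots(snapshots, now, max_age_days=90):
    cutoff = now - timedelta(days=max_age_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]

now = datetime(2026, 3, 1)
snaps = [
    {"SnapshotId": "snap-old", "StartTime": datetime(2025, 6, 1)},
    {"SnapshotId": "snap-new", "StartTime": datetime(2026, 2, 20)},
]
print(stale_snapshots(snaps, now))  # ['snap-old']
```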
Data transfer: the “wait, why is this so high?” category
Common surprises:
- NAT Gateways (hourly + per GB)
- Cross AZ traffic (chatty microservices across zones)
- Inter region replication
- Egress to internet (downloads, APIs, web traffic, updates)
Mitigations:
- Keep workloads close. Same region, same VPC/VNet, and reduce cross zone chatter where possible.
- Use a CDN for public content.
- Review architecture for unnecessary cross AZ calls.
- Compress and aggregate logs. Don’t ship raw noisy logs everywhere.
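To see why NAT Gateways land in the "wait, why is this so high?" bucket, model the two charges. The rates below are illustrative placeholders, not current AWS prices; check your region's price list before trusting any number like this.

```python
# Sketch: rough monthly NAT Gateway cost = hourly charge + per-GB
# processing charge. Rates are ILLUSTRATIVE placeholders, not real
# AWS pricing; substitute your region's actual rates.

HOURLY_RATE = 0.045  # $/hour (illustrative)
PER_GB_RATE = 0.045  # $/GB processed (illustrative)

def monthly_nat_cost(gb_processed, hours=730):
    return hours * HOURLY_RATE + gb_processed * PER_GB_RATE

print(round(monthly_nat_cost(5000), 2))
```

Note the per-GB term dominates once traffic grows, which is why routing chatty internal traffic through a NAT Gateway gets expensive fast.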
A simple monthly “3 bucket audit”
Once a month, audit:
- Storage (object, snapshots, disks)
- Transfer (NAT, egress, cross AZ/region)
- Observability (logs, metrics, traces retention)
If you do only this, you’ll catch most of the slow leaks before they become a crisis.
AWS pricing in Malaysia: what to know before you estimate savings
If your company is in Malaysia, your AWS pricing is still mainly determined by the AWS region you deploy in.
Most Malaysian companies use Singapore for latency. Some use other regions for compliance, resiliency, or global users.
A few realities to keep in mind:
- Base compute prices are regional. Malaysia as a location doesn’t magically change EC2 rates.
- Your bill can still shift because of exchange rates, taxes, and billing setup.
- Data residency and compliance choices might push you into specific regions, which affects cost.
Practical guidance
- Choose region based on latency + compliance + cost, in that order for most production systems.
- Don’t assume “Malaysia” changes the underlying compute pricing. Model the actual region you will run in.
Cost estimation workflow
- Use the AWS Pricing Calculator for your target region (often Singapore).
- Model data transfer to users in Malaysia separately. This is where surprises live.
- Include your AWS Support plan if you use one.
Comparing AWS vs Azure
Compare equivalent regions and equivalent services.
Not headline instance prices.
And honestly, most of the time, the biggest savings come from:
- deleting zombies
- right sizing
- commitments done correctly
Not “region shopping”.
Let’s wrap up: a simple 30 day plan to cut AWS & Azure spend
Here’s a plan that won’t overwhelm your team.
Week 1
- Zombie resource sweep (AWS and Azure)
- Add TTL tags for non prod (expire_after=YYYY-MM-DD)
- Start a weekly reap routine with a simple approval flow
Week 2
- Right size top 10 spend services (compute + DB)
- Resize one step down, measure 7 days, repeat
Week 3
- Implement tagging policy (minimum tag set)
- Build team cost dashboards
- Start weekly spend snapshots
Week 4
- Buy commitments for verified baseline usage (start 30 to 60%)
- Add lifecycle and retention policies for storage and logs
- Set budgets and alerts so you catch regressions
Cost optimization is a habit. Schedule it like security patching.
If you do the basics consistently, the cloud bill stops being a mystery. And it stops growing just because someone forgot to clean up after themselves.

