The Cloud Cost Playbook: How We Cut AWS Bills by 28%
[Chart: AWS monthly spend, before vs. after, showing cost optimization results across a 6-week engagement]
Cloud bills for AI workloads are growing faster than any other infrastructure category. GPU instances, LLM API calls, vector database storage, always-on inference endpoints. Most companies we work with are overspending by 30-50% and do not realize it because nobody has tagged their resources properly.
Why AI Infrastructure Costs Surprise Everyone
Traditional web apps have predictable cost profiles. AI workloads do not. A training job spins up 8 GPU instances for 4 hours, then nothing for a week. Inference endpoints sit idle at 3am but need to handle 10x traffic at noon. Vector databases grow every time you index a new document. LLM API costs depend on token volume, which is nearly impossible to predict before launch.
The result: finance teams get surprised every month, and engineering leads have no idea which workloads are driving the bill. Without resource-level cost attribution, optimization is just guessing.
Phase 1: Get Visibility
Before you optimize anything, you need to see where the money goes. Tag every EC2 instance, every S3 bucket, and every SageMaker endpoint by team, project, and environment. Build Grafana dashboards showing real-time spend per workload. On most engagements, clients find $10-20K per month in forgotten resources and oversized instances from this step alone.
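As a sketch of what tag-based attribution buys you, here is a minimal roll-up of spend by tag. The resource IDs, tags, and costs are invented for illustration, not real billing data:

```python
# Roll up spend by tag value from a cost export. Untagged resources get
# their own bucket so they surface instead of disappearing into a blob.
from collections import defaultdict

def spend_by_tag(records, tag):
    """Sum monthly cost per value of one tag; untagged lines go to 'untagged'."""
    totals = defaultdict(float)
    for r in records:
        totals[r["tags"].get(tag, "untagged")] += r["cost"]
    return dict(totals)

records = [
    {"resource": "i-0abc", "cost": 912.0, "tags": {"team": "ml", "env": "prod"}},
    {"resource": "i-0def", "cost": 310.0, "tags": {"team": "web", "env": "prod"}},
    {"resource": "i-0ghi", "cost": 540.0, "tags": {}},  # a forgotten, untagged box
]

print(spend_by_tag(records, "team"))
```

The "untagged" bucket is usually where the forgotten resources hide.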
We had one client running a p3.8xlarge ($12/hour) as a dev sandbox for a single engineer. It had been running for 7 months. That is roughly $60,000 in compute for what could have been a notebook on a $0.50/hour instance.
Phase 2: Right-Size Everything
Pull utilization data from CloudWatch for every running instance. We typically find that 40-60% of instances are oversized by at least one tier. RDS is the worst offender: teams pick instance sizes for "what if we get 10x traffic" scenarios that never materialize, then forget to revisit the choice.
One client was running r6g.4xlarge for a database that peaked at 15% CPU utilization. We moved them to r6g.xlarge and saved $2,400 per month with zero performance impact. Multiply that across 20 databases and you are looking at real money.
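A minimal version of that right-sizing pass might look like this. The 40% peak-CPU threshold and the utilization numbers are illustrative assumptions, not real CloudWatch output:

```python
# Flag instances whose observed peak CPU suggests at least one tier of
# headroom. Threshold and data are assumptions for the sketch.

def downsize_candidates(metrics, peak_cpu_threshold=40.0):
    """Return instances whose peak CPU stayed under the threshold."""
    return [m["instance"] for m in metrics if m["peak_cpu"] < peak_cpu_threshold]

metrics = [
    {"instance": "db-orders r6g.4xlarge", "peak_cpu": 15.0},  # like the example above
    {"instance": "db-events r6g.2xlarge", "peak_cpu": 72.0},
    {"instance": "api-cache m6g.xlarge", "peak_cpu": 31.0},
]

print(downsize_candidates(metrics))
```

Anything the check flags still deserves a human look at memory, IOPS, and burst patterns before you actually resize.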
Phase 3: Use the Right Pricing
Reserved Instances and Savings Plans are the easiest win in cloud cost optimization. If you have workloads running 24/7, you should not be paying on-demand rates. A 1-year no-upfront reserved instance saves around 30%. For ML training jobs that can tolerate interruption, spot instances save 60-80%. We set up spot-based training pipelines with automatic checkpointing so jobs resume seamlessly after interruption.
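A quick back-of-envelope comparison of the three pricing models, using the rough discount figures above. The $3.06/hour on-demand rate is a placeholder, not a quote:

```python
# Monthly cost of one always-on instance under each pricing model.
# Discounts: ~30% for a 1-year no-upfront RI, 70% as the spot midpoint.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate, discount=0.0):
    return HOURS_PER_MONTH * hourly_rate * (1 - discount)

on_demand = monthly_cost(3.06)         # pay-as-you-go
reserved = monthly_cost(3.06, 0.30)    # 1-year no-upfront reserved
spot = monthly_cost(3.06, 0.70)        # 60-80% off; midpoint shown

print(f"on-demand ${on_demand:,.0f}, reserved ${reserved:,.0f}, spot ${spot:,.0f}")
```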
Phase 4: Architecture Changes
Tiered inference routes requests to different model sizes based on complexity. Simple classification goes to a fine-tuned small model. Complex reasoning goes to GPT-4o. On one project, this cut LLM API costs by 55%.
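A tiered router can start as a simple heuristic gate. The model names and the length cutoff below are placeholders; in production the tier decision usually comes from a small, cheap classifier rather than a length check:

```python
# Sketch of tiered inference routing: cheap model for simple requests,
# expensive model for the rest. Names and thresholds are assumptions.

def route(request):
    """Send simple classification or short prompts to the small model."""
    if request["task"] == "classification" or len(request["text"]) < 200:
        return "small-fine-tuned-model"
    return "gpt-4o"

simple = {"task": "classification", "text": "Is this email spam?"}
complex_q = {"task": "reasoning", "text": "Weigh the cost tradeoffs in depth. " * 10}

print(route(simple), "|", route(complex_q))
```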
Semantic caching stores LLM responses and returns cached results for similar queries. For a customer support bot we built, this eliminated 40% of API calls because users ask the same 50 questions in slightly different ways.
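A semantic cache fits in a few lines. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the 0.8 similarity threshold is an assumption:

```python
# Sketch of a semantic cache: reuse a stored answer when a new query's
# embedding is close enough to a cached one.
import math
from collections import Counter

def embed(text):
    # Toy embedding: word counts. A real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.entries = []          # list of (embedding, response)
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        for emb, resp in self.entries:
            if cosine(q, emb) >= self.threshold:
                return resp        # cache hit: no API call needed
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password?"))
```

The threshold is the knob that matters: too low and users get someone else's answer, too high and the hit rate collapses.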
Serverless inference on Lambda or SageMaker Serverless eliminates the cost of idle GPU time. For workloads with variable traffic, this cut compute costs in half compared to always-on endpoints.
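Rough math on why serverless wins for spiky traffic. The rates and request volume below are illustrative placeholders, not published AWS pricing:

```python
# Always-on endpoint vs. pay-per-use serverless, roughly.
HOURS_PER_MONTH = 730

def always_on_cost(hourly_rate):
    return hourly_rate * HOURS_PER_MONTH

def serverless_cost(requests_per_month, seconds_per_request, rate_per_second):
    return requests_per_month * seconds_per_request * rate_per_second

endpoint = always_on_cost(1.21)                          # endpoint running 24/7
on_demand_infer = serverless_cost(200_000, 1.5, 0.0012)  # pay only for busy seconds

print(f"always-on ${endpoint:,.0f}/mo vs serverless ${on_demand_infer:,.0f}/mo")
```

The crossover depends entirely on duty cycle: past a certain sustained request rate, the always-on endpoint wins again.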
Phase 5: Make It Stick
Set budget alerts in AWS Budgets. Run monthly cost reviews with engineering leads. Build Lambda-based policies that auto-stop idle resources and flag spend anomalies. Treat cost as a first-class metric next to latency and uptime.
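An auto-stop policy like the one described can be sketched as a simple predicate. The thresholds and fleet data are illustrative, not production values:

```python
# Stop non-production instances that have sat idle past a threshold.
# In practice this predicate would run inside a scheduled Lambda.

def should_stop(resource, max_idle_hours=24, cpu_floor=2.0):
    """Flag dev/staging boxes idle for a day with near-zero CPU."""
    return (
        resource["env"] != "prod"
        and resource["idle_hours"] >= max_idle_hours
        and resource["avg_cpu"] < cpu_floor
    )

fleet = [
    {"id": "i-dev-sandbox", "env": "dev", "idle_hours": 96, "avg_cpu": 0.4},
    {"id": "i-api-prod", "env": "prod", "idle_hours": 96, "avg_cpu": 0.4},
    {"id": "i-stage-busy", "env": "staging", "idle_hours": 2, "avg_cpu": 55.0},
]

to_stop = [r["id"] for r in fleet if should_stop(r)]
print(to_stop)
```

Excluding prod from auto-stop is the non-negotiable part; everything else is tunable.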
The Numbers
On the engagement that inspired this post, we took a mid-size SaaS company from $45K per month to $32K per month on AWS. The biggest wins: right-sizing RDS instances ($4,800/year), moving ML training to spot instances ($18,000/year), switching always-on SageMaker endpoints to serverless ($12,000/year), and killing forgotten dev resources ($8,400/year). Total annual savings: about $156,000. The engagement paid for itself in the first month.
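The arithmetic behind the headline number:

```python
# Sanity check on the figures above: $45K down to $32K per month.
before, after = 45_000, 32_000
monthly_savings = before - after
annual_savings = monthly_savings * 12

print(f"${monthly_savings:,}/month, ${annual_savings:,}/year")
```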
Where to Start
Start with visibility. If you cannot tell which team or workload is responsible for every dollar on your cloud bill, that is step one. Tag everything, build a dashboard, and the savings opportunities will jump out at you.