Cloud Cost Optimization Platform
Unified visibility across AWS, GCP, and Azure with intelligent forecasting and automated optimization recommendations
The platform connects to AWS, GCP, and Azure, collecting cost and resource data every 15 minutes. Data flows through normalization, analysis, and forecasting engines to power unified dashboards, budget alerts, and savings recommendations.
The Challenge
A fast-growing 200-person B2B SaaS company was managing $2.3M in annual cloud spend across AWS, GCP, and Azure — but had zero unified visibility. Finance manually exported CSV files from three billing consoles, spending 15+ hours per week reconciling different currencies, service names, and region codes. By the time they produced reports, the data was 48 hours old and budget overruns had already happened.
The real cost wasn't wasted time — it was missed opportunities. A forgotten development environment ran for three months unnoticed — $24,600 wasted before detection. Reserved instance coverage sat at 12% because nobody could track utilization across providers. EC2 instances running at 8% CPU. Unattached storage volumes. Cost anomalies went undetected for weeks.
When their CFO demanded better — real-time visibility by customer, feature, and team, with alerts before budget overruns — they evaluated enterprise FinOps platforms. CloudHealth quoted $50K/year. Spot.io required a 6-month implementation. Both were overkill for their 127-account multi-cloud environment.
They needed production-grade multi-cloud visibility without enterprise pricing or timeline.
The Solution
We built a cloud-native, multi-tenant cost optimization platform that unifies AWS, GCP, and Azure into a single pane of glass. Real-time data collection every 15 minutes, intelligent normalization, statistical forecasting, automated anomaly detection, and actionable savings recommendations — all running at $239/month total infrastructure cost.
The architecture follows three guiding principles: separate what changes from what doesn't (provider-specific logic isolated behind clean interfaces), optimize for the common path (pre-aggregated views for fast dashboards, async pipelines for slow provider APIs), and treat cost as a first-class constraint (every service choice evaluated against a $300/month infrastructure budget). The result is a system that delivers enterprise-grade FinOps capabilities with startup-grade economics.
ECS Fargate: serverless containers + Spot
Aurora Serverless v2: auto-scaling PostgreSQL
SQS: reliable async processing
EventBridge: scheduled triggers
API Gateway: REST API + rate limiting
Cognito: JWT authentication
CloudWatch: metrics & alerting
X-Ray: distributed tracing
Secrets Manager: credential security
Container Architecture
Multi-AZ VPC with ECS Fargate for compute (2-4 API tasks, 2-20 Spot worker tasks at 70% savings), Aurora Serverless v2 for data (scales 0.5-4 ACU automatically), and SQS for reliable async processing. API services respond in <300ms P95 while workers collect and analyze cloud data every 15 minutes.
This architecture powers six core capabilities that work together to give finance and engineering teams complete control over their multi-cloud spend.
Multi-Cloud Normalization: unified USD, service taxonomy, region standardization
Statistical Forecasting: 30/60/90-day predictions with confidence intervals
Anomaly Detection: z-score, threshold breach, week-over-week comparison
Budget Tracking: real-time alerts at 50%, 80%, 100% thresholds
Automated Recommendations: unused resources, rightsizing, storage optimization
Complete Audit Trail: SOC 2-ready compliance logging
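The z-score check at the heart of the anomaly detector can be sketched in a few lines (the window size, threshold, and function name here are illustrative choices, not the production values):

```python
import statistics

def is_cost_anomaly(history: list[float], today: float,
                    z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z_threshold standard
    deviations from the trailing window's mean. In production this
    runs alongside the threshold-breach and week-over-week checks."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:            # flat history: any change is notable
        return today != mean
    return abs(today - mean) / stdev > z_threshold

baseline = [100, 102, 98, 101, 99, 100, 103]  # normal daily spend, USD
print(is_cost_anomaly(baseline, 104))  # False: within normal noise
print(is_cost_anomaly(baseline, 160))  # True: a 60% spike trips the alert
```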
No more CSV exports. No more billing console hopping. No more budget surprises.
How It Works
The platform handles two fundamentally different workloads: real-time dashboard queries that must respond in under 500ms, and background data collection from cloud provider APIs that can take up to 30 seconds per request. Rather than forcing both through the same path, we designed separate, optimized pipelines — a fast synchronous path for user-facing requests and an async queue-driven pipeline for data ingestion — connected through a shared data layer with pre-computed views.
The three diagrams below trace a request through each path and show how our multi-cloud abstraction layer normalizes data from providers with fundamentally different APIs, rate limits, and data models.
Dashboard Performance
Request path: CloudFront (CDN + DDoS protection) → API Gateway (JWT validation + rate limiting) → VPC Link → API Service (tenant context + RBAC) → Aurora (materialized views). Total: 285ms P95 — 43% better than 500ms target.
Why pre-aggregated materialized views?
Dashboard queries aggregate millions of cost records across 13 months. Raw queries take 2-3 seconds — violating our <500ms P95 target. By refreshing materialized views every 15 minutes after data collection, we serve complex aggregations in under 100ms. The tradeoff — 15-minute data freshness — is invisible for cost data that's already 12-24 hours delayed from cloud providers.
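The pay-once-per-refresh tradeoff can be illustrated with a toy in-memory stand-in for those views (the real system refreshes Postgres materialized views; `CostSummaryCache`, its field names, and the view name in the comment are hypothetical):

```python
from collections import defaultdict

# Production refreshes Postgres materialized views after each collection
# cycle, roughly: REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_cost_summary;
# This sketch shows the same idea: aggregation cost is paid once per
# refresh, not once per dashboard request.

class CostSummaryCache:
    def __init__(self):
        self._summary = {}

    def refresh(self, records):
        """Recompute (tenant, month, category) totals from raw records."""
        totals = defaultdict(float)
        for r in records:
            totals[(r["tenant"], r["month"], r["category"])] += r["usd"]
        self._summary = dict(totals)

    def monthly_total(self, tenant, month):
        """Fast dashboard read: never scans the raw records."""
        return sum(v for (t, m, _), v in self._summary.items()
                   if t == tenant and m == month)

records = [
    {"tenant": "t1", "month": "2024-05", "category": "compute", "usd": 120.0},
    {"tenant": "t1", "month": "2024-05", "category": "storage", "usd": 30.0},
    {"tenant": "t2", "month": "2024-05", "category": "compute", "usd": 75.0},
]
cache = CostSummaryCache()
cache.refresh(records)
print(cache.monthly_total("t1", "2024-05"))  # 150.0
```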
While dashboards serve pre-computed data in milliseconds, the real complexity lives in how that data gets collected, normalized, and analyzed behind the scenes. The collection pipeline runs every 15 minutes, processing data from 127 cloud accounts across three providers without ever blocking a user request.
Data Collection Pipeline
EventBridge (15-min trigger) → SQS (queued jobs) → Fargate Spot Workers (70% savings) → Secrets Manager (5-layer credential security) → Provider APIs (rate-limited: AWS 5/sec, GCP 1/sec, Azure 0.1/sec) → Normalization (USD, 6 service categories, 6 region zones) → Aurora + S3 archives → Analytics workers (budgets, anomalies, recommendations).
Why async collection with SQS?
Cloud billing APIs are slow (2-30 seconds), rate-limited (AWS 5/sec, GCP 1/sec, Azure 0.1/sec — a 50x difference), and occasionally unreliable. By decoupling collection from user-facing APIs through SQS, dashboards respond in <300ms regardless of external API performance. If AWS Cost Explorer times out, SQS visibility timeout expires and another worker retries automatically. Nothing is ever lost.
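The retry guarantee rests on SQS visibility-timeout semantics, which the toy model below simulates with a manual clock (no AWS SDK; `MiniQueue` and the message body are hypothetical, and real SQS is at-least-once, so workers must be idempotent):

```python
import heapq

class MiniQueue:
    """Toy model of SQS visibility-timeout semantics. A received
    message is hidden, not removed; only an explicit delete removes
    it, so a crashed worker can never lose a job."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._messages = []   # heap of (visible_at, body)
        self.now = 0          # manual clock, advanced by the caller

    def send(self, body):
        heapq.heappush(self._messages, (self.now, body))

    def receive(self):
        """Return the next visible message and hide it until the
        visibility timeout expires (automatic redelivery after that)."""
        if self._messages and self._messages[0][0] <= self.now:
            _, body = heapq.heappop(self._messages)
            heapq.heappush(self._messages,
                           (self.now + self.visibility_timeout, body))
            return body
        return None

    def delete(self, body):
        self._messages = [(t, b) for t, b in self._messages if b != body]
        heapq.heapify(self._messages)

q = MiniQueue(visibility_timeout=30)
q.send("collect:aws:acct-42")

msg = q.receive()           # worker 1 takes the job...
# ...and crashes before calling q.delete(msg): the job is only hidden.
assert q.receive() is None  # invisible during the visibility window
q.now += 31                 # timeout expires
retry = q.receive()         # another worker picks it up automatically
q.delete(retry)             # success: now it is gone for good
```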
The hardest engineering problem wasn't building the pipeline — it was making three fundamentally different cloud APIs look identical to the rest of the system. AWS, GCP, and Azure use different authentication models, return data in different formats, and enforce rate limits that vary by 50x. The abstraction layer below is what makes "multi-cloud" a reality rather than a marketing claim.
Multi-Cloud Abstraction Layer
CloudCollector interface defines the contract: CollectCosts(), CollectResources(), ValidateCredentials(), Provider(). Three adapters implement provider-specific authentication (AWS IAM roles, GCP service accounts, Azure service principals), rate limits, and circuit breakers. Normalization layer converts 120+ services to 6 categories, 50+ regions to 6 zones, all currencies to USD.
Why the adapter pattern?
AWS, GCP, and Azure have fundamentally different authentication methods, rate limits (50x difference), and data models. The adapter pattern isolates complexity: each adapter handles its own rate limiting, retry logic, and authentication quirks while exposing a clean interface. Adding Oracle Cloud or Alibaba later means implementing one new adapter — zero changes to existing code.
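The shape of that contract can be sketched as follows (a Python rendering of the CollectCosts/ValidateCredentials/Provider contract described above; the stubbed `AWSCollector` return values and registry are illustrative, not the production code):

```python
from abc import ABC, abstractmethod

class CloudCollector(ABC):
    """The contract every provider adapter implements. Method names
    are the described interface adapted to Python style."""

    @abstractmethod
    def collect_costs(self, account_id: str) -> list[dict]: ...

    @abstractmethod
    def validate_credentials(self) -> bool: ...

    @abstractmethod
    def provider(self) -> str: ...

class AWSCollector(CloudCollector):
    # The real adapter wraps Cost Explorer behind an IAM role, a
    # 5 req/sec rate limiter, and a circuit breaker; stubbed here.
    def collect_costs(self, account_id):
        return [{"provider": "aws", "account": account_id,
                 "service": "AmazonEC2", "amount": 12.5, "currency": "USD"}]

    def validate_credentials(self):
        return True

    def provider(self):
        return "aws"

# Callers dispatch through a registry and never see provider quirks.
# Adding Oracle Cloud later = one new class + one registry entry.
REGISTRY: dict[str, CloudCollector] = {"aws": AWSCollector()}

costs = REGISTRY["aws"].collect_costs("acct-42")
print(costs[0]["service"])  # AmazonEC2
```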
Why normalize at ingestion time?
Query-time normalization adds 200-500ms latency per request. Ingestion-time normalization runs once per data point and stores both original and normalized values. Dashboard queries hit pre-normalized data and return in <100ms. Storage overhead is only 20% — storage is cheap; user time is not.
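Ingestion-time normalization amounts to a single pass per record, sketched below (the mapping tables and FX rates are tiny hypothetical stand-ins; production covers 120+ services, 50+ regions, and live exchange rates):

```python
# Illustrative mapping tables, not the production taxonomy.
SERVICE_CATEGORY = {"AmazonEC2": "compute", "Compute Engine": "compute",
                    "AmazonS3": "storage"}
REGION_ZONE = {"us-east-1": "north-america", "europe-west1": "europe"}
USD_RATE = {"USD": 1.0, "EUR": 1.08}  # assumed rates for the example

def normalize(record: dict) -> dict:
    """Run once per data point at ingestion. Original fields are kept
    alongside the normalized ones, which is where the ~20% storage
    overhead comes from."""
    return {
        **record,  # preserve the provider's original values
        "category": SERVICE_CATEGORY.get(record["service"], "other"),
        "zone": REGION_ZONE.get(record["region"], "other"),
        "amount_usd": round(record["amount"] * USD_RATE[record["currency"]], 2),
    }

row = normalize({"service": "Compute Engine", "region": "europe-west1",
                 "amount": 100.0, "currency": "EUR"})
print(row["category"], row["zone"], row["amount_usd"])  # compute europe 108.0
```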
Built for Production
Building a working prototype is straightforward. Building a system that handles sensitive financial data across 127 cloud accounts — where a security breach exposes cost structures, a data loss corrupts compliance records, or a silent failure means missed budget alerts — requires a fundamentally different engineering approach.
We designed three production layers from day one, not as afterthoughts bolted onto an MVP: a data architecture with strict tenant isolation and encryption at every boundary, resilience patterns that gracefully handle the inevitable failures of external cloud APIs, and a high-availability infrastructure with full observability so the team of 3-5 engineers can operate the platform confidently at scale.
Data Architecture & Security
Aurora Serverless v2 with monthly-partitioned tables (13 months hot, then S3 Parquet → Glacier for 7-year compliance). Materialized views refresh every 15 minutes for <100ms queries. Three-layer tenant isolation: API Gateway JWT → Application middleware → Repository WHERE tenant_id=$1. Five-layer credential defense: namespace isolation → ownership validation → IAM restriction → KMS encryption → CloudTrail audit.
Why Aurora Serverless v2?
TimescaleDB requires always-on EC2 ($150+/month) plus operational overhead — overkill for 1M records/month. DynamoDB struggles with complex aggregations our dashboards require. Aurora Serverless v2 gives us PostgreSQL's query flexibility with true serverless economics: scales from 0.5 ACU when idle to 4 ACU during spikes. At MVP load, ACU compute averages ~$50/month; total cost including storage and I/O is ~$73/month.
Why application-level tenant isolation?
Database-per-tenant at 100 tenants = 100 Aurora instances × $65/month = $6,500/month. Application-level filtering costs ~$65/month total — 99% savings. We enforce isolation at compile time: repository methods require tenant context, all queries include WHERE tenant_id = $1, and 100+ automated tests verify no cross-tenant leakage. PostgreSQL Row-Level Security is planned as an additional defense-in-depth layer.
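The repository shape looks roughly like this (sketched with SQLite for a self-contained example; production is Postgres with `$1` placeholders, and `CostRepository` and the table layout are hypothetical):

```python
import sqlite3

class CostRepository:
    """Every read method requires a tenant_id, so there is no code
    path for an unscoped query. Production uses Postgres ($1-style
    placeholders instead of ?), but the shape is identical."""

    def __init__(self, conn):
        self.conn = conn

    def monthly_costs(self, tenant_id: str, month: str) -> list[tuple]:
        return self.conn.execute(
            "SELECT service, amount_usd FROM costs "
            "WHERE tenant_id = ? AND month = ?",   # tenant filter, always
            (tenant_id, month),
        ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE costs "
             "(tenant_id TEXT, month TEXT, service TEXT, amount_usd REAL)")
conn.executemany("INSERT INTO costs VALUES (?, ?, ?, ?)", [
    ("t1", "2024-05", "compute", 120.0),
    ("t2", "2024-05", "compute", 75.0),
])
repo = CostRepository(conn)
rows = repo.monthly_costs("t1", "2024-05")
print(rows)  # only tenant t1's row: [('compute', 120.0)]
```

Cross-tenant leakage tests then reduce to asserting that each tenant's queries return only that tenant's rows.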
Security protects data at rest. Resilience protects the system in motion — when cloud provider APIs throttle requests, return errors, or go down entirely. The platform interacts with external APIs thousands of times per day, and each call is an opportunity for failure. The patterns below ensure that no single provider outage, API timeout, or rate limit breach can compromise data integrity or degrade the user experience.
Resilience Patterns
Three-phase protection. Entry: Rate limiting (per provider + per tenant), idempotency keys. Processing: Per-provider circuit breakers (CLOSED → OPEN after 5 failures → HALF_OPEN after 30s), async SQS handoff. Completion: Exponential backoff (1s → 2s → 4s, max 30s, ±20% jitter), Dead Letter Queue after 3 attempts.
Why per-provider circuit breakers?
If Azure's API is degraded, a global circuit breaker would halt AWS and GCP collection. Per-provider isolation prevents cascade: Azure goes down → Azure circuit opens → AWS and GCP continue normally. Each provider has independent health.
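A minimal sketch of that breaker policy, with the clock injected for testability (the class and state names are an illustration of the 5-failure / 30-second rule above, not the production implementation):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures ->
    HALF_OPEN after `cooldown` seconds, matching the policy above."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures, self.opened_at, self.state = 0, None, "CLOSED"

    def allow(self) -> bool:
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"   # let one probe request through
        return self.state != "OPEN"

    def record_success(self):
        self.failures, self.state = 0, "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state, self.opened_at = "OPEN", self.clock()

# One breaker per provider: Azure tripping never blocks AWS or GCP.
now = [0.0]  # fake clock for the demo
breakers = {p: CircuitBreaker(clock=lambda: now[0])
            for p in ("aws", "gcp", "azure")}
for _ in range(5):
    breakers["azure"].record_failure()
print(breakers["azure"].allow())  # False: Azure circuit is OPEN
print(breakers["aws"].allow())    # True: AWS is unaffected
now[0] += 30.0
print(breakers["azure"].allow())  # True: HALF_OPEN probe permitted
```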
Resilience patterns handle individual failures. The infrastructure layer ensures the entire platform stays available — even when an AWS Availability Zone goes down — and gives the operations team clear visibility into system health without requiring 24/7 manual monitoring.
High Availability & Observability
Multi-AZ deployment: Aurora writer (AZ-A) + reader (AZ-B) with <60s auto-failover. ECS tasks distributed across both AZs. VPC endpoints eliminate NAT transfer costs (Secrets Manager, SQS, S3, ECR, Logs). X-Ray tracing (5% sampling) + CloudWatch dashboards (latency, errors, queue depth). P1 alerts → PagerDuty within 15 minutes.
Why ECS Fargate Spot for workers?
Lambda's 15-minute timeout is too tight for collection jobs that chain dozens of slow API calls (single requests can take 90+ seconds). EKS costs $73/month for the control plane alone, 31% of our entire $239/month infrastructure spend. ECS Fargate provides serverless containers with unlimited runtime, and Fargate Spot cuts worker compute costs by 70%. Workers are stateless: if AWS reclaims a Spot task, SQS redelivers the message automatically, so no data is lost.
Results
After six months in production managing $2.3M in annual cloud spend across 127 accounts, the platform has delivered measurable impact across every dimension the client's CFO originally defined as success criteria. The numbers below are from live production metrics, not projections.
Cloud Cost Optimizer: 6-Month Production Metrics
"Before this, our finance team was spending two days every month just trying to figure out what we were actually paying across AWS, GCP, and Azure. Now it's all in one dashboard — and we've already saved 18x what the platform costs us."
— CFO, 200-person B2B SaaS company