Case Study · Cloud FinOps

Cloud Cost Optimization Platform

Unified visibility across AWS, GCP, and Azure with intelligent forecasting
and automated optimization recommendations

Client: 200-person B2B SaaS company · $2.3M annual cloud spend · Multi-cloud (AWS, GCP, Azure)
- 12 Weeks: concept to production, full platform delivery
- $160K/Year: savings identified within the first 30 days
- 285ms: API latency (P95), 43% better than target
- 99.95%: uptime over the first 6 months
System Context

The platform connects to AWS, GCP, and Azure, collecting cost and resource data every 15 minutes. Data flows through normalization, analysis, and forecasting engines to power unified dashboards, budget alerts, and savings recommendations.

The Challenge

A fast-growing 200-person B2B SaaS company was managing $2.3M in annual cloud spend across AWS, GCP, and Azure — but had zero unified visibility. Finance manually exported CSV files from three billing consoles, spending 15+ hours per week reconciling different currencies, service names, and region codes. By the time they produced reports, the data was 48 hours old and budget overruns had already happened.

The real cost wasn't wasted time — it was missed opportunities. A forgotten development environment ran for three months unnoticed — $24,600 wasted before detection. Reserved instance coverage sat at 12% because nobody could track utilization across providers. EC2 instances running at 8% CPU. Unattached storage volumes. Cost anomalies went undetected for weeks.

When their CFO demanded better — real-time visibility by customer, feature, and team, with alerts before budget overruns — they evaluated enterprise FinOps platforms. CloudHealth quoted $50K/year. Spot.io required a 6-month implementation. Both were overkill for their 127-account multi-cloud environment.

They needed production-grade multi-cloud visibility without enterprise pricing or timeline.

The Solution

We built a cloud-native, multi-tenant cost optimization platform that unifies AWS, GCP, and Azure into a single pane of glass. Real-time data collection every 15 minutes, intelligent normalization, statistical forecasting, automated anomaly detection, and actionable savings recommendations — all running at $239/month total infrastructure cost.

The architecture follows three guiding principles: separate what changes from what doesn't (provider-specific logic isolated behind clean interfaces), optimize for the common path (pre-aggregated views for fast dashboards, async pipelines for slow provider APIs), and treat cost as a first-class constraint (every service choice evaluated against a $300/month infrastructure budget). The result is a system that delivers enterprise-grade FinOps capabilities with startup-grade economics.

Built on Amazon Web Services

- ECS Fargate: serverless containers + Spot
- Aurora Serverless v2: auto-scaling PostgreSQL
- SQS: reliable async processing
- EventBridge: scheduled triggers
- API Gateway: REST API + rate limiting
- Cognito: JWT authentication
- CloudWatch: metrics & alerting
- X-Ray: distributed tracing
- Secrets Manager: credential security

Multi-Cloud Integration

- AWS: aws-sdk-go-v2
- GCP: cloud.google.com/go
- Azure: azure-sdk-for-go

Container Architecture

Architecture Overview

Multi-AZ VPC with ECS Fargate for compute (2-4 API tasks, 2-20 Spot worker tasks at 70% savings), Aurora Serverless v2 for data (scales 0.5-4 ACU automatically), and SQS for reliable async processing. API services respond in <300ms P95 while workers collect and analyze cloud data every 15 minutes.

This architecture powers six core capabilities that work together to give finance and engineering teams complete control over their multi-cloud spend.

- Multi-Cloud Normalization: unified USD, service taxonomy, region standardization
- Statistical Forecasting: 30/60/90-day predictions with confidence intervals
- Anomaly Detection: z-score, threshold breach, week-over-week comparison
- Budget Tracking: real-time alerts at 50%, 80%, 100% thresholds
- Automated Recommendations: unused resources, rightsizing, storage optimization
- Complete Audit Trail: SOC 2 ready compliance logging
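Of these capabilities, the z-score anomaly check is the most algorithmic. A minimal Go sketch of the idea, where the function names, trailing window, and 3-sigma threshold are illustrative assumptions rather than the platform's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations x lies from the mean of history.
func zScore(history []float64, x float64) float64 {
	n := float64(len(history))
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / n
	var sq float64
	for _, v := range history {
		sq += (v - mean) * (v - mean)
	}
	std := math.Sqrt(sq / n)
	if std == 0 {
		return 0
	}
	return (x - mean) / std
}

// isAnomaly flags a daily cost more than 3 standard deviations above the
// trailing window (the threshold here is illustrative).
func isAnomaly(history []float64, x float64) bool {
	return zScore(history, x) > 3
}

func main() {
	history := []float64{100, 102, 98, 101, 99, 100, 103}
	fmt.Println(isAnomaly(history, 101)) // typical day: false
	fmt.Println(isAnomaly(history, 160)) // sudden spike: true
}
```

In practice this runs per tenant and per service category, which is why it combines with the threshold-breach and week-over-week checks listed above.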

What Makes This Approach Distinctive

- Enterprise results, startup cost: $239/month vs. the $50K/year quote, a 94% cost reduction
- Production-grade from day one: Multi-AZ, 5-layer security, complete observability; not an MVP compromise
- Designed for scale: the architecture supports 10 to 100+ tenants without redesign
- Small team operation: 3-5 engineers run it reliably (vs. 10+ typical)

No more CSV exports. No more billing console hopping. No more budget surprises.

How It Works

The platform handles two fundamentally different workloads: real-time dashboard queries that must respond in under 500ms, and background data collection from cloud provider APIs that can take up to 30 seconds per request. Rather than forcing both through the same path, we designed separate, optimized pipelines — a fast synchronous path for user-facing requests and an async queue-driven pipeline for data ingestion — connected through a shared data layer with pre-computed views.

The three diagrams below trace a request through each path and show how our multi-cloud abstraction layer normalizes data from providers with fundamentally different APIs, rate limits, and data models.

Dashboard Performance

Request Flow & Latency Breakdown

Request path: CloudFront (CDN + DDoS protection) → API Gateway (JWT validation + rate limiting) → VPC Link → API Service (tenant context + RBAC) → Aurora (materialized views). Total: 285ms P95 — 43% better than 500ms target.

Key Design Decision

Why pre-aggregated materialized views?

Dashboard queries aggregate millions of cost records across 13 months. Raw queries take 2-3 seconds — violating our <500ms P95 target. By refreshing materialized views every 15 minutes after data collection, we serve complex aggregations in under 100ms. The tradeoff — 15-minute data freshness — is invisible for cost data that's already 12-24 hours delayed from cloud providers.
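The refresh step can be sketched as follows, assuming PostgreSQL materialized views; the view name and schema here are illustrative, not the platform's actual definitions:

```go
package main

import "fmt"

// Pre-aggregated view definition (illustrative schema): dashboard queries
// read from this view instead of scanning raw cost_records.
const monthlyByServiceView = `
CREATE MATERIALIZED VIEW monthly_cost_by_service AS
SELECT tenant_id, billing_month, service_category, SUM(usd_amount) AS total_usd
  FROM cost_records
 GROUP BY tenant_id, billing_month, service_category`

// refreshStatement is issued by a worker after each 15-minute collection
// cycle; CONCURRENTLY keeps the view readable during the refresh (it
// requires a unique index on the view).
func refreshStatement(view string) string {
	return "REFRESH MATERIALIZED VIEW CONCURRENTLY " + view
}

func main() {
	fmt.Println(refreshStatement("monthly_cost_by_service"))
}
```

Because the refresh is concurrent, dashboard reads never block on it; they simply see the previous snapshot until the new one lands.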

While dashboards serve pre-computed data in milliseconds, the real complexity lives in how that data gets collected, normalized, and analyzed behind the scenes. The collection pipeline runs every 15 minutes, processing data from 127 cloud accounts across three providers without ever blocking a user request.

Data Collection Pipeline

Cost Collection & Analysis Flow

EventBridge (15-min trigger) → SQS (queued jobs) → Fargate Spot Workers (70% savings) → Secrets Manager (5-layer credential security) → Provider APIs (rate-limited: AWS 5/sec, GCP 1/sec, Azure 0.1/sec) → Normalization (USD, 6 service categories, 6 region zones) → Aurora + S3 archives → Analytics workers (budgets, anomalies, recommendations).

Key Design Decision

Why async collection with SQS?

Cloud billing APIs are slow (2-30 seconds), rate-limited (AWS 5/sec, GCP 1/sec, Azure 0.1/sec — a 50x difference), and occasionally unreliable. By decoupling collection from user-facing APIs through SQS, dashboards respond in <300ms regardless of external API performance. If AWS Cost Explorer times out, SQS visibility timeout expires and another worker retries automatically. Nothing is ever lost.

The hardest engineering problem wasn't building the pipeline — it was making three fundamentally different cloud APIs look identical to the rest of the system. AWS, GCP, and Azure use different authentication models, return data in different formats, and enforce rate limits that vary by 50x. The abstraction layer below is what makes "multi-cloud" a reality rather than a marketing claim.

Multi-Cloud Abstraction Layer

Adapter Pattern

CloudCollector interface defines the contract: CollectCosts(), CollectResources(), ValidateCredentials(), Provider(). Three adapters implement provider-specific authentication (AWS IAM roles, GCP service accounts, Azure service principals), rate limits, and circuit breakers. Normalization layer converts 120+ services to 6 categories, 50+ regions to 6 zones, all currencies to USD.

Key Design Decision

Why the adapter pattern?

AWS, GCP, and Azure have fundamentally different authentication methods, rate limits (50x difference), and data models. The adapter pattern isolates complexity: each adapter handles its own rate limiting, retry logic, and authentication quirks while exposing a clean interface. Adding Oracle Cloud or Alibaba later means implementing one new adapter — zero changes to existing code.

Key Design Decision

Why normalize at ingestion time?

Query-time normalization adds 200-500ms latency per request. Ingestion-time normalization runs once per data point and stores both original and normalized values. Dashboard queries hit pre-normalized data and return in <100ms. Storage overhead is only 20% — storage is cheap; user time is not.
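A minimal Go sketch of ingestion-time normalization, with a deliberately tiny subset of the service and region mappings; all mappings, keys, and names below are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// serviceCategory maps 120+ provider service names onto the platform's six
// categories (the keys shown are a small illustrative subset).
func serviceCategory(provider, service string) string {
	switch provider + "/" + strings.ToLower(service) {
	case "aws/ec2", "gcp/compute engine", "azure/virtual machines":
		return "compute"
	case "aws/s3", "gcp/cloud storage", "azure/blob storage":
		return "storage"
	case "aws/rds", "gcp/cloud sql", "azure/sql database":
		return "database"
	default:
		return "other"
	}
}

// regionZone collapses 50+ provider regions into six geographic zones.
func regionZone(region string) string {
	switch {
	case strings.HasPrefix(region, "us-"), strings.HasPrefix(region, "northamerica"):
		return "north-america"
	case strings.HasPrefix(region, "eu-"), strings.HasPrefix(region, "europe"):
		return "europe"
	case strings.HasPrefix(region, "ap-"), strings.HasPrefix(region, "asia"):
		return "asia-pacific"
	default:
		return "other"
	}
}

// toUSD converts a cost using a rate table captured at ingestion time; both
// the original and the normalized amount are stored downstream.
func toUSD(amount float64, currency string, rates map[string]float64) float64 {
	if currency == "USD" {
		return amount
	}
	return amount * rates[currency]
}

func main() {
	rates := map[string]float64{"GBP": 1.25}
	fmt.Println(serviceCategory("gcp", "Compute Engine")) // compute
	fmt.Println(regionZone("eu-west-1"))                  // europe
	fmt.Println(toUSD(100, "GBP", rates))                 // 125
}
```

Running these once per data point at ingestion is what lets dashboard queries skip normalization entirely and hit pre-normalized rows.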

Built for Production

Building a working prototype is straightforward. Building a system that handles sensitive financial data across 127 cloud accounts — where a security breach exposes cost structures, a data loss corrupts compliance records, or a silent failure means missed budget alerts — requires a fundamentally different engineering approach.

We designed three production layers from day one, not as afterthoughts bolted onto an MVP: a data architecture with strict tenant isolation and encryption at every boundary, resilience patterns that gracefully handle the inevitable failures of external cloud APIs, and a high-availability infrastructure with full observability so the team of 3-5 engineers can operate the platform confidently at scale.

Data Architecture & Security

Storage Architecture & Security Controls

Aurora Serverless v2 with monthly-partitioned tables (13 months hot, then S3 Parquet → Glacier for 7-year compliance). Materialized views refresh every 15 minutes for <100ms queries. Three-layer tenant isolation: API Gateway JWT → Application middleware → Repository WHERE tenant_id=$1. Five-layer credential defense: namespace isolation → ownership validation → IAM restriction → KMS encryption → CloudTrail audit.

Key Design Decision

Why Aurora Serverless v2?

TimescaleDB requires always-on EC2 ($150+/month) plus operational overhead — overkill for 1M records/month. DynamoDB struggles with complex aggregations our dashboards require. Aurora Serverless v2 gives us PostgreSQL's query flexibility with true serverless economics: scales from 0.5 ACU when idle to 4 ACU during spikes. At MVP load, ACU compute averages ~$50/month; total cost including storage and I/O is ~$73/month.

Key Design Decision

Why application-level tenant isolation?

Database-per-tenant at 100 tenants = 100 Aurora instances × $65/month = $6,500/month. Application-level filtering costs ~$65/month total — 99% savings. We enforce isolation at compile time: repository methods require tenant context, all queries include WHERE tenant_id = $1, and 100+ automated tests verify no cross-tenant leakage. PostgreSQL Row-Level Security is planned as an additional defense-in-depth layer.
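The compile-time enforcement can be sketched like this; the types, method names, and schema are illustrative, not the platform's actual code:

```go
package main

import "fmt"

// TenantID is a distinct type, so a plain string variable cannot be passed
// as a tenant by accident. Together with repository methods that always bind
// tenant_id as $1, this is the enforcement described above.
type TenantID string

type CostRepo struct{}

// MonthlyCostQuery has no tenant-free variant: every query it produces
// filters by tenant_id before anything else.
func (CostRepo) MonthlyCostQuery(t TenantID, month string) (string, []interface{}) {
	query := `SELECT service_category, SUM(usd_amount)
  FROM cost_records
 WHERE tenant_id = $1 AND billing_month = $2
 GROUP BY service_category`
	return query, []interface{}{t, month}
}

func main() {
	q, args := CostRepo{}.MonthlyCostQuery(TenantID("tenant-a"), "2024-05")
	fmt.Println(len(args), args[0])
	_ = q
}
```

The automated isolation tests mentioned above then only need to verify one invariant: no repository method exists whose argument list lacks a TenantID.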

Security protects data at rest. Resilience protects the system in motion — when cloud provider APIs throttle requests, return errors, or go down entirely. The platform interacts with external APIs thousands of times per day, and each call is an opportunity for failure. The patterns below ensure that no single provider outage, API timeout, or rate limit breach can compromise data integrity or degrade the user experience.

Resilience Patterns

Resilience & Failure Handling

Three-phase protection. Entry: Rate limiting (per provider + per tenant), idempotency keys. Processing: Per-provider circuit breakers (CLOSED → OPEN after 5 failures → HALF_OPEN after 30s), async SQS handoff. Completion: Exponential backoff (1s → 2s → 4s, max 30s, ±20% jitter), Dead Letter Queue after 3 attempts.

Key Design Decision

Why per-provider circuit breakers?

If Azure's API is degraded, a global circuit breaker would halt AWS and GCP collection. Per-provider isolation prevents cascade: Azure goes down → Azure circuit opens → AWS and GCP continue normally. Each provider has independent health.

Resilience patterns handle individual failures. The infrastructure layer ensures the entire platform stays available — even when an AWS Availability Zone goes down — and gives the operations team clear visibility into system health without requiring 24/7 manual monitoring.

High Availability & Observability

Production Infrastructure & Monitoring

Multi-AZ deployment: Aurora writer (AZ-A) + reader (AZ-B) with <60s auto-failover. ECS tasks distributed across both AZs. VPC endpoints eliminate NAT transfer costs (Secrets Manager, SQS, S3, ECR, Logs). X-Ray tracing (5% sampling) + CloudWatch dashboards (latency, errors, queue depth). P1 alerts → PagerDuty within 15 minutes.

Key Design Decision

Why ECS Fargate Spot for workers?

Lambda has a 15-minute timeout — insufficient for slow cloud APIs (some take 90+ seconds). EKS costs $73/month for control plane alone (31% of our budget). ECS Fargate provides serverless containers with unlimited runtime. Fargate Spot saves 70% on compute. Workers are stateless — if AWS reclaims a Spot instance, SQS redelivers the message automatically. No data loss possible.

Production Features Delivered
285ms P95 API latency (beat 500ms target by 43%)
15-minute data freshness across all providers
Complete multi-tenant isolation with automated testing
5-layer credential security (SOC 2 ready)
Zero data loss (SQS + DLQ + S3 archives)
Multi-AZ deployment with automatic failover

Results

After six months in production managing $2.3M in annual cloud spend across 127 accounts, the platform has delivered measurable impact across every dimension the client's CFO originally defined as success criteria. The numbers below are from live production metrics, not projections.

Cloud Cost Optimizer

6-Month Production Metrics

- Annual Savings Identified: $160,000 in the first 30 days
- Manual Hours / Week: from 15 to 0 (-100%), now fully automated
- Time to Cost Visibility: from 48 hours to 15 minutes (-99.5%), refreshed every 15 minutes
- API Latency (P95): 285ms against a <500ms target (43% better)
- Platform Uptime: 99.95% against a 99.5% target (exceeded)

Top savings identified:

- $8,200/mo: idle dev environment
- $2,400/mo: oversized EC2
- $1,600/mo: unattached EBS
- $1,200/mo: old snapshots
"Before this, our finance team was spending two days every month just trying to figure out what we were actually paying across AWS, GCP, and Azure. Now it's all in one dashboard — and we've already saved 18x what the platform costs us."
— CFO, 200-Person B2B SaaS Company