Introduction: The Invisible Budget Leak
The average organization wastes 32% of its cloud spend. Not because teams do not care, but because cloud billing is complex, sprawling, and opaque in ways that make it genuinely difficult to understand without dedicated attention.
A survey by Flexera found that cloud waste — resources that are provisioned but not fully utilized — costs organizations $26 billion annually. The same survey found that most organizations underestimate their cloud spend by 35–40% when asked to predict their monthly bill.
For MSPs managing client cloud environments, this is both a problem and an opportunity. Every dollar of cloud waste your clients are paying is either:
- Money you should be saving them (strengthening your value proposition)
- Money that could fund additional managed services (revenue opportunity)
This guide covers the complete FinOps methodology — from waste identification through governance automation — with specific tactics for AWS, Azure, and GCP. Whether you manage cloud directly as a reseller/MSP or advise clients on their own cloud accounts, this is your playbook.
Understanding FinOps: The Framework
FinOps (Financial Operations) is a practice that brings financial accountability to the variable spending model of cloud infrastructure. It is the intersection of finance, technology, and business.
The FinOps Foundation defines three phases of cloud financial management maturity:
Crawl: Initial visibility. Understanding what you are spending and on what. Basic tagging, cost allocation, and budgets. Most organizations start here.
Walk: Optimization. Active right-sizing, Reserved Instance purchasing, automated resource cleanup. Cost efficiency becomes a regular operational activity.
Run: Continuous optimization. Real-time anomaly detection, ML-driven right-sizing recommendations, automated enforcement of cost governance policies. Cloud cost is a first-class engineering consideration.
For MSPs, helping clients move from Crawl to Walk is where the highest-value work exists.
Phase 1: Visibility — Understanding What You Are Spending
You cannot optimize what you cannot see. The foundation of cloud cost optimization is comprehensive tagging and cost allocation.
Resource Tagging Strategy
Tags are metadata applied to cloud resources that enable cost attribution, reporting, and governance. Without consistent tagging, you cannot answer "who is paying for what?"
Minimum required tags for cost governance:
| Tag Key | Description | Example Values |
|---|---|---|
| Environment | Deployment environment | production, staging, development |
| Project | Project or product | crm-platform, ecommerce-api |
| Team | Owning team or department | engineering, marketing, ops |
| CostCenter | Finance allocation code | CC-1001, CC-2045 |
| Owner | Responsible individual/email | john.smith@company.com |
| Managed-By | Who provisions/maintains | terraform, manual, msp-name |
Enforcing tags via policy:
AWS: Use AWS Config Rules + Service Control Policies to deny resource creation without required tags.
Azure: Use Azure Policy with Deny effect to block resource creation without required tags:
```json
{
  "policyRule": {
    "if": {
      "field": "tags['Environment']",
      "exists": "false"
    },
    "then": {
      "effect": "deny"
    }
  }
}
```
GCP: Organization Policy constraints enforce label requirements.
Cost Allocation and Reporting
AWS Cost Explorer: Built-in cost visualization with filtering by service, linked account, tags, region. Essential starting point.
Azure Cost Management + Billing: Native Azure cost reporting with budget alerts and anomaly detection.
GCP Cloud Billing: Cost reports, budgets, and committed use discount analysis.
Third-party tools for multi-cloud MSPs:
- Spot.io (CloudCheckr): Multi-cloud cost management platform with optimization recommendations
- Apptio Cloudability: Enterprise-grade FinOps platform
- Kubecost: Kubernetes-specific cost allocation (critical for container workloads)
- Vantage: Developer-friendly cost management with API integrations
For MSPs managing multiple client accounts, a multi-cloud cost management platform that can aggregate spending across all clients is essential for providing consolidated reporting.
Phase 2: Identifying Waste — The Seven Categories of Cloud Waste
Category 1: Idle and Underutilized Compute
The most common waste: EC2 instances, Azure VMs, or GCP Compute Engine instances running at very low utilization.
Detection: Cloud-native tools report utilization. Target: VMs running at < 10% average CPU and < 20% average memory for more than 7 days.
AWS: AWS Cost Explorer → Rightsizing Recommendations. EC2 instances with < 5% CPU utilization for 14 days are flagged.
Azure: Azure Advisor Recommendations → Cost. Provides specific VM rightsizing recommendations.
GCP: Recommender API provides right-sizing recommendations based on 30-day CPU and memory metrics.
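On AWS, the detection step above can be scripted directly against CloudWatch. The sketch below is a minimal illustration, not a complete detector: the instance ID is a placeholder, the 10% threshold mirrors the target stated above, and boto3 is imported inside the fetch function so the threshold logic stays testable without AWS credentials.

```python
# Sketch: flag idle EC2 instances from CloudWatch CPU averages.
# The 10% threshold matches the "< 10% average CPU" target above;
# memory metrics require the CloudWatch agent and are omitted here.
from datetime import datetime, timedelta, timezone

IDLE_CPU_THRESHOLD = 10.0  # percent

def is_idle(datapoints, threshold=IDLE_CPU_THRESHOLD):
    """True if the mean of the CPU 'Average' datapoints is below threshold."""
    if not datapoints:
        return False  # no data: do not flag
    mean = sum(dp['Average'] for dp in datapoints) / len(datapoints)
    return mean < threshold

def fetch_cpu_datapoints(instance_id, days=7):
    """Pull hourly average CPU for the last `days` days."""
    import boto3  # imported lazily so is_idle() is testable offline
    cloudwatch = boto3.client('cloudwatch')
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=3600,
        Statistics=['Average'],
    )
    return resp['Datapoints']
```

Running `is_idle(fetch_cpu_datapoints('i-0123456789abcdef0'))` (hypothetical instance ID) produces a candidate list that should still be reviewed by a human before any stop or terminate action.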
Action options:
- Stop the instance (if it is not needed, stopping eliminates compute cost while preserving the disk)
- Terminate the instance (if the data is backed up or not needed)
- Downsize to a smaller instance type
Automation: Use AWS Instance Scheduler, Azure Automation, or custom Lambda/Azure Functions to automatically stop development/staging instances outside business hours.
```python
# AWS Lambda function to stop non-production instances after hours
import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2', region_name='us-east-1')
    # Find running non-production instances (by tag)
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['development', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = [
        i['InstanceId']
        for r in instances['Reservations']
        for i in r['Instances']
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped {len(instance_ids)} non-production instances")
    return {'stopped': len(instance_ids)}
```
Category 2: Orphaned Resources
Resources that were created as dependencies of other resources but not cleaned up when the parent was deleted:
- Unattached EBS volumes (AWS): Volumes not mounted to any EC2 instance
- Unused Elastic IPs (AWS): Allocated but not associated with running instances ($0.005/hour when unattached)
- Old snapshots: EBS snapshots, Azure Disk Snapshots, GCP Disk Snapshots older than your retention policy
- Unused load balancers: Application or Network Load Balancers with no registered targets
- Old AMIs/Machine Images: AMIs no longer used for provisioning
```bash
# Find unattached EBS volumes (AWS CLI)
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query "Volumes[*].{ID:VolumeId,SizeGiB:Size,Created:CreateTime}" \
  --output table

# Find unassociated Elastic IPs
aws ec2 describe-addresses \
  --query "Addresses[?!InstanceId && !NetworkInterfaceId].AllocationId" \
  --output table
```
Cleanup automation: Many organizations run monthly "cloud garbage collection" scripts that identify and delete orphaned resources based on defined rules (age, status, tag absence).
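One such garbage-collection rule can be sketched in a few lines of boto3. This is an illustrative sketch, not a production script: the 14-day minimum age is an arbitrary safety rule, deletions default to dry-run, and boto3 is imported lazily so the staleness check is testable without AWS credentials.

```python
# Sketch of a monthly "garbage collection" pass for unattached EBS volumes.
# The 14-day minimum age is an illustrative safety margin, not a recommendation.
from datetime import datetime, timedelta, timezone

def is_stale(volume, now, min_age_days=14):
    """A volume is stale if it is unattached and older than min_age_days."""
    age = now - volume['CreateTime']
    return volume['State'] == 'available' and age >= timedelta(days=min_age_days)

def delete_stale_volumes(dry_run=True):
    import boto3  # imported lazily so is_stale() is testable offline
    ec2 = boto3.client('ec2')
    now = datetime.now(timezone.utc)
    resp = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )
    stale = [v['VolumeId'] for v in resp['Volumes'] if is_stale(v, now)]
    for volume_id in stale:
        # DryRun=True raises DryRunOperation instead of deleting; flip to
        # False only after reviewing the candidate list.
        ec2.delete_volume(VolumeId=volume_id, DryRun=dry_run)
    return stale
```

A sensible extension is to snapshot each volume before deletion, preserving a recovery path for the retention window.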
Category 3: Storage Waste
Storage is often the most overlooked cost category because each individual volume, snapshot, or bucket is cheap on its own, but those small charges accumulate over years:
S3/Azure Blob/GCS lifecycle policies: Move infrequently accessed data to cheaper storage tiers automatically.
```json
{
  "Rules": [{
    "ID": "archive-old-data",
    "Status": "Enabled",
    "Filter": {},
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER_IR" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": {
      "Days": 2555
    }
  }]
}
```
Snapshot retention: Define and enforce snapshot retention policies. Most organizations retain snapshots far longer than necessary.
AWS Data Lifecycle Manager and Azure Backup policies can automate snapshot cleanup.
S3 Intelligent-Tiering: For unpredictable access patterns, S3 Intelligent-Tiering automatically moves objects between access tiers, eliminating the need to manually analyze and transition data.
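The lifecycle policy shown earlier in this section can also be applied programmatically. The sketch below expresses the same rule set as the boto3 call that installs it; the bucket name is a placeholder, and boto3 is imported lazily so the rule dictionary itself stays testable offline.

```python
# Sketch: apply an S3 lifecycle policy with boto3. The transition days
# mirror the JSON policy in this section; adjust to your retention needs.
LIFECYCLE_RULES = {
    'Rules': [{
        'ID': 'archive-old-data',
        'Status': 'Enabled',
        'Filter': {},
        'Transitions': [
            {'Days': 30, 'StorageClass': 'STANDARD_IA'},
            {'Days': 90, 'StorageClass': 'GLACIER_IR'},
            {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'},
        ],
        'Expiration': {'Days': 2555},  # roughly 7 years
    }]
}

def apply_lifecycle(bucket_name, rules=LIFECYCLE_RULES):
    import boto3  # imported lazily so the rule dict is testable offline
    s3 = boto3.client('s3')
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=rules,
    )
```

Managing the same rules in Terraform or CloudFormation is usually preferable for production, since it keeps the policy in version control; the boto3 form is handy for one-off remediation across many buckets.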
Category 4: Overprovisioned Reserved Instances and Savings Plans
The inverse of idle resources: paying for Reserved Instances or Savings Plans that are not being fully utilized.
Causes:
- Team shrank or workload decreased but reservations were not adjusted
- Application migrated to different instance families
- Workload moved to a different region
Detection: AWS Reservation Utilization report (Cost Explorer → Reservations → Utilization); Azure Reserved VM Instance utilization report.
Remediation: AWS Reserved Instances can be sold on the RI Marketplace. Convertible RIs can be exchanged for different instance types or families. Azure has a similar exchange policy.
Category 5: Data Transfer Costs
Data transfer is one of the most complex and misunderstood cloud cost categories:
- Ingress: Free on all major cloud providers
- Egress (to internet): $0.085–$0.09/GB on AWS; similar on Azure and GCP
- Cross-region transfer: Charged in both source and destination regions
- Cross-AZ transfer: $0.01/GB — often overlooked in architecture design
- NAT Gateway: $0.045/GB for data processed through NAT Gateway
Optimization strategies:
- Use AWS CloudFront/Azure CDN/GCP Cloud CDN to cache content and reduce egress
- Architect applications to minimize cross-AZ and cross-region data movement
- Use VPC Endpoints for S3 and DynamoDB (free) instead of routing through NAT Gateway
- Compress data before transfer where possible
- Consider same-region data gravity — keep compute and data in the same region and AZ
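The NAT Gateway line item above is worth quantifying, because it has both an hourly and a per-GB component. The sketch below is back-of-envelope arithmetic using the rates quoted in this section (illustrative us-east-1 figures, subject to change):

```python
# Back-of-envelope NAT Gateway cost check, using the per-GB and per-hour
# rates quoted in this section (illustrative figures, not current pricing).
NAT_PER_GB = 0.045    # data processing, $/GB
NAT_PER_HOUR = 0.045  # hourly charge, $/hour

def nat_monthly_cost(gb_processed, hours=730):
    """Monthly NAT Gateway cost: hourly charge plus per-GB processing."""
    return hours * NAT_PER_HOUR + gb_processed * NAT_PER_GB

def s3_via_endpoint_savings(gb_to_s3):
    """Gateway VPC endpoints for S3 carry no per-GB processing charge,
    so rerouting S3 traffic saves the full NAT per-GB component."""
    return gb_to_s3 * NAT_PER_GB
```

For example, a workload pushing 2 TB/month to S3 through a NAT Gateway is paying roughly $90/month in processing fees that a free gateway endpoint would eliminate entirely.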
Category 6: Database Overspending
RDS and managed database services are typically expensive and often over-provisioned:
Right-size RDS: Review CPU and memory utilization. Many production RDS instances run at < 30% average utilization. Downsizing from db.r5.2xlarge to db.r5.xlarge saves 50% of instance cost.
RDS Reserved Instances: RDS workloads are typically very stable (always-on, predictable load). 1-year or 3-year RDS Reserved Instances offer 40–70% savings over on-demand.
Aurora Serverless: For variable workloads (development databases, batch processing), Aurora Serverless v2 scales to zero when idle, eliminating cost for periods of inactivity.
Delete development/test databases when not in use: Create a nightly Lambda function that snapshots and terminates development databases, then restores from snapshot on Monday morning. Combine this with AWS Backup for automated snapshot retention management.
Category 7: Misconfigured Auto-Scaling
Auto-scaling is designed to save money by scaling down when load decreases. Misconfigured scaling policies that never scale down — or that scale up to maximum at the first load spike — negate this benefit.
Review Auto Scaling Group policies:
- Scale-down trigger: Is there a scale-down policy? Is it aggressive enough?
- Minimum instance count: Is minimum > 1 justified? Can off-hours minimum be reduced?
- Target tracking: Use target tracking scaling policies rather than step scaling for most workloads — AWS manages the scaling math automatically
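Attaching a target tracking policy takes one API call. The sketch below is illustrative: the Auto Scaling group name is a placeholder, the 50% CPU target is an example value, and boto3 is imported lazily so the configuration builder is testable without AWS credentials.

```python
# Sketch: attach a target-tracking scaling policy with boto3. Target
# tracking scales both out and in automatically around the target value.
def target_tracking_config(target_cpu_percent=50.0):
    return {
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization',
        },
        'TargetValue': target_cpu_percent,
    }

def attach_policy(asg_name, target_cpu_percent=50.0):
    import boto3  # imported lazily so the config builder is testable offline
    autoscaling = boto3.client('autoscaling')
    autoscaling.put_scaling_policy(
        AutoScalingGroupName=asg_name,
        PolicyName='cpu-target-tracking',
        PolicyType='TargetTrackingScaling',
        TargetTrackingConfiguration=target_tracking_config(target_cpu_percent),
    )
```

Because target tracking creates the scale-in behavior for you, it closes the most common misconfiguration described above: groups that scale up but never scale back down.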
Phase 3: Optimization — The Right Tools for Each Workload
Reserved Instances and Savings Plans
For stable, predictable workloads, Reserved Instances (AWS) or Azure Reserved Instances provide the largest savings:
AWS Reserved Instances:
- 1-year standard: 40% savings vs. on-demand
- 3-year standard: 60% savings vs. on-demand
- Convertible RI (allows instance family changes): 45% savings for 3-year
AWS Compute Savings Plans: More flexible than RIs — apply to any EC2 instance regardless of family, size, or region. 1-year Compute Savings Plan: ~38% savings.
AWS EC2 Instance Savings Plans: Apply to a specific instance family in a specific region. Slightly better savings than Compute Savings Plans (~42%).
Decision framework for RI vs. Savings Plan:
- If your workload is stable and you are certain of instance family: EC2 Instance Savings Plan (best savings)
- If your workload may change instance families: Compute Savings Plan (most flexibility)
- For databases: RDS Reserved Instances (the only option — Savings Plans do not apply to RDS)
RI purchasing recommendations: AWS Cost Explorer provides RI purchase recommendations based on your usage patterns. Follow these recommendations for instances running at > 80% utilization for > 720 hours/month.
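The commitment decision above ultimately reduces to arithmetic, and it is worth running the numbers before purchasing. The sketch below uses illustrative hourly rates (placeholders, not current AWS prices) to compare on-demand against a committed rate and to find the break-even utilization:

```python
# Rough annual comparison of on-demand vs. a committed rate. The rates
# passed in are illustrative placeholders, not current AWS prices.
HOURS_PER_YEAR = 8760

def annual_savings(on_demand_rate, committed_rate, hours=HOURS_PER_YEAR):
    """Savings from committing, assuming the instance runs every hour."""
    return (on_demand_rate - committed_rate) * hours

def break_even_utilization(on_demand_rate, committed_rate):
    """Fraction of hours the workload must run for the commitment to win.
    Below this utilization, on-demand works out cheaper than committing."""
    return committed_rate / on_demand_rate
```

For example, committing at $0.06/hour against a $0.10/hour on-demand rate saves about $350/year per instance if it runs continuously, but only breaks even at 60% utilization, which is why the > 80% utilization guidance above matters.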
Spot and Preemptible Instances
AWS Spot Instances, Azure Spot VMs, and GCP Spot VMs (formerly Preemptible VMs) offer up to 90% savings over on-demand pricing, but they can be interrupted on short notice: a two-minute warning on AWS, and as little as 30 seconds on Azure and GCP.
Suitable workloads for Spot:
- Batch processing and data pipelines
- CI/CD build systems
- Big data analytics
- Stateless web servers (behind a load balancer)
- Machine learning training jobs
Not suitable for Spot:
- Production databases
- Stateful applications without graceful shutdown
- Long-running jobs without checkpointing
MSP opportunity: Help clients identify batch and analytics workloads that can safely use Spot, then implement Spot into the architecture. A 70% cost reduction on a $5,000/month EC2 bill saves $3,500/month — compelling value demonstration.
Container and Kubernetes Cost Optimization
Container workloads have unique optimization patterns:
Right-sizing container resource requests: Kubernetes resource requests determine scheduling — if you request 4 vCPUs per pod and the pod uses 1, you are paying for 4 but getting 1. Use VerticalPodAutoscaler to right-size requests based on actual usage.
Cluster autoscaling: Use the Cluster Autoscaler (or Karpenter for AWS) to add and remove nodes based on pod scheduling needs. Karpenter in particular can intelligently select Spot vs. on-demand and right-size node instance types.
Namespace-level cost allocation: Use Kubecost to allocate cluster costs to namespaces, deployments, and teams. This enables showback/chargeback for multi-team clusters.
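At its core, namespace-level allocation is proportional division of node cost. The sketch below is a deliberate simplification of what tools like Kubecost do; real allocators also weigh memory, GPUs, persistent volumes, and idle capacity, so treat this as an explanation of the idea rather than a substitute for the tooling:

```python
# Simplified namespace cost allocation: split a node's cost across
# namespaces in proportion to the CPU they request. Real tools
# (e.g. Kubecost/OpenCost) also weigh memory, GPUs, and idle capacity.
def allocate_node_cost(node_cost, cpu_requests_by_namespace):
    """Return each namespace's share of node_cost, proportional to CPU requests."""
    total = sum(cpu_requests_by_namespace.values())
    if total == 0:
        return {ns: 0.0 for ns in cpu_requests_by_namespace}
    return {
        ns: node_cost * cpu / total
        for ns, cpu in cpu_requests_by_namespace.items()
    }
```

For a $100/day node where team-a requests 1 vCPU and team-b requests 3, the showback report attributes $25 and $75 respectively, which is exactly the kind of number that changes engineering behavior.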
Cloud Cost as an MSP Service
For MSPs, cloud cost optimization is a high-value service opportunity.
The Cloud Cost Assessment
A cloud cost assessment is a project-based engagement ($3,000–$15,000) that delivers:
- Cloud cost inventory: Complete breakdown of current cloud spending by service, resource, and team
- Waste identification report: Prioritized list of optimization opportunities with estimated savings
- Architecture recommendations: Structural changes that would reduce cost long-term
- RI/Savings Plan recommendations: Specific purchase recommendations based on usage data
- Governance recommendations: Tagging policies, budget alerts, and guardrails
The ROI of this assessment is typically 5–10× the assessment fee in the first year of savings.
Managed Cloud Cost Governance
Beyond the initial assessment, ongoing cloud cost governance is a natural managed service add-on:
- Monthly cost reports with variance analysis
- Budget alert management
- RI/Savings Plan utilization monitoring and adjustment recommendations
- Automated waste cleanup (orphaned resources, idle instances)
- Quarterly FinOps reviews
Pricing: $500–$2,500/month depending on cloud spend under management. For clients with $50,000+/month cloud spend, this service pays for itself multiple times over.
Cloud Reseller Programs
Major cloud providers have MSP/reseller programs that allow MSPs to bill clients for cloud consumption and earn margin:
AWS Partner Network (APN): AWS Consulting Partners can resell AWS credits and earn referral fees. Advanced and Premier tiers include co-selling support and AWS credits for the MSP's own use.
Microsoft Partner Network (MPN): Azure CSP (Cloud Solution Provider) program allows MSPs to bill clients for Azure and earn 10–15% margin.
GCP Partner Advantage: Similar program for GCP with reselling rights and partner-exclusive pricing.
Cloud reselling adds a revenue stream while giving MSPs complete control over the client cloud relationship.
FinOps Tool Comparison
| Tool | Best For | Pricing |
|---|---|---|
| AWS Cost Explorer | AWS-only environments | Free (included with AWS) |
| Azure Cost Management | Azure-only environments | Free (included with Azure) |
| GCP Cloud Billing | GCP-only environments | Free (included with GCP) |
| Spot.io | Multi-cloud, spot optimization | % of savings |
| Apptio Cloudability | Enterprise FinOps, showback | Enterprise pricing |
| Vantage | Developer-friendly, multi-cloud | % of managed spend |
| Kubecost | Kubernetes cost allocation | Free (open source), enterprise tier |
| InfraCost | Infrastructure-as-code cost estimation | Free, enterprise tier |
Data Mammoth and VPS-Server.host provide infrastructure management services that include cloud cost governance as part of their managed hosting offerings — demonstrating that cost optimization is increasingly table stakes for infrastructure service providers.
Frequently Asked Questions
How much of cloud spend can realistically be reduced? Industry data from Flexera and CloudHealth suggests organizations can typically reduce cloud spend by 20–35% through right-sizing, Reserved Instance optimization, and waste elimination — without changing application architecture. More aggressive optimization (Spot instances, architectural changes) can achieve 40–60% reductions.
Should I buy Reserved Instances or Savings Plans? For most AWS workloads, Compute Savings Plans are the recommended default — they provide similar savings to RIs with more flexibility. Use EC2 Instance Savings Plans when you have high confidence in a specific instance family. Reserve RDS instances separately (Savings Plans do not apply to RDS).
How do I handle multi-cloud cost management? Multi-cloud cost management requires a third-party tool (Vantage, Apptio, CloudHealth) since cloud-native tools only cover their own cloud. Standardize on a common tagging schema across all clouds before implementing multi-cloud cost management.
What is showback vs. chargeback? Showback: showing teams or departments how much cloud they are consuming, without actually billing them. Used for cost awareness and behavioral change. Chargeback: actually billing internal teams for their cloud consumption. Used in organizations where business units own their own budgets.
How often should we review Reserved Instance utilization? Monthly. RI utilization should be > 80%. Below 80%, investigate whether the workload has changed and whether exchanges or modifications are warranted.
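The monthly check described above can be automated against the Cost Explorer API. This is a hedged sketch: the date range is whatever month you pass in, the 80% floor matches the guidance in the answer, and boto3 is imported lazily so the threshold check stays testable without AWS credentials.

```python
# Sketch of the monthly RI utilization check via the Cost Explorer API.
# The 80% floor matches the guidance above.
UTILIZATION_FLOOR = 80.0  # percent

def needs_review(utilization_percent, floor=UTILIZATION_FLOOR):
    """Flag a reservation period whose utilization fell below the floor."""
    return float(utilization_percent) < floor

def monthly_utilization(start, end):
    """start/end as 'YYYY-MM-DD'; returns the overall utilization percentage."""
    import boto3  # imported lazily so needs_review() is testable offline
    ce = boto3.client('ce')  # Cost Explorer
    resp = ce.get_reservation_utilization(
        TimePeriod={'Start': start, 'End': end},
        Granularity='MONTHLY',
    )
    return float(resp['Total']['UtilizationPercentage'])
```

Wiring `needs_review(monthly_utilization(...))` into a scheduled job that posts to the team channel turns a manual review into a standing alert.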
Conclusion
Cloud cost optimization is not a one-time project — it is an ongoing operational discipline. Organizations that treat cloud cost as a first-class operational metric, review it monthly, and continuously tune their resource configurations will consistently achieve 20–35% lower cloud spend than those that manage cloud reactively.
For MSPs, cloud cost optimization is a service that delivers immediate, quantifiable value. A client spending $30,000/month on cloud infrastructure will see $6,000–$10,000 in monthly savings from a well-executed optimization program — more than paying for comprehensive managed services.
The tools are available, the framework (FinOps) is proven, and the opportunity is significant. Start with visibility — get consistent tagging in place. Then identify waste. Then optimize. Then govern.
For related reading: cloud infrastructure monitoring, WHMCS billing automation for hosting providers, and building a profitable MSP. Start your NinjaIT trial for infrastructure monitoring that supports cloud cost governance.
FinOps Maturity Model: Where Are You?
The FinOps Foundation defines three stages of cloud financial management maturity. Knowing where you are helps prioritize your next steps.
Stage 1: Crawl (Visibility)
Characteristics:
- Cloud spending is tracked at the account level
- No tagging strategy
- No owner assigned to cloud cost management
- Costs are reviewed monthly (or less) when the invoice arrives
- No workload-level visibility
Goals for this stage:
- Implement tagging standards
- Enable Cost Explorer / Cost Management dashboards
- Identify top 10 cost drivers
- Establish a monthly cloud cost review meeting
What you will find: Organizations in the Crawl stage typically have 30–40% waste in their environment because no one has looked at utilization systematically.
Stage 2: Walk (Optimization)
Characteristics:
- Consistent tagging (70–80% coverage)
- Monthly cost reviews with business stakeholders
- Right-sizing done for obvious oversized resources
- Some Reserved Instance coverage for steady-state workloads
- Showback reports sent to engineering teams
Goals for this stage:
- Achieve 90%+ tag coverage
- Right-size all resources with < 20% utilization
- Purchase or convert 70%+ of steady-state compute to Reserved Instances/Savings Plans
- Establish anomaly detection and budget alerts
- Begin chargeback or showback reporting
What you will find: Walk-stage optimization typically yields 20–30% cost reduction from baseline.
Stage 3: Run (Governance)
Characteristics:
- Near-complete tagging
- Automated cost governance (budget alerts automatically block or restrict over-budget workloads)
- Engineering teams own their cloud budgets
- Continuous optimization is embedded in the development process (cloud cost reviewed in sprint reviews)
- FinOps team or dedicated cloud cost manager
- Advanced commitment management (Private Pricing Agreements for AWS, Enterprise Agreements for Azure)
Goals for this stage:
- Optimization is continuous, not periodic
- New workload deployments include cost estimates
- Cloud cost efficiency is a KPI in engineering team metrics
What you will find: Run-stage organizations consistently spend 25–40% less per unit of compute than industry benchmarks.
Cloud Cost Optimization by Platform
Each major cloud platform has its own cost management tools and optimization mechanisms. Here is a practical guide to each.
AWS Cost Optimization
Tools:
- AWS Cost Explorer: Historical spending analysis, service breakdown, reservation recommendations
- AWS Compute Optimizer: Machine learning-based right-sizing recommendations for EC2, Lambda, EBS volumes
- AWS Trusted Advisor: Automated checks including cost optimization recommendations
- AWS Savings Plans Purchase Analyzer: Models the impact of different Savings Plan commitments
Key AWS-specific optimizations:
1. EC2 Instance Right-Sizing:
- Run AWS Compute Optimizer for 14+ days of metrics
- Focus on instances with < 20% CPU utilization at peak
- Common finding: m5.xlarge instances with workloads that run fine on m5.large
- Typical savings: 40% per right-sized instance
2. RDS Instance Optimization:
- Aurora Serverless v2 for variable workloads (pay per ACU-hour, not per instance)
- Read replicas can often be eliminated with caching (ElastiCache)
- Storage type optimization: gp3 is cheaper than gp2 for most workloads
3. S3 Cost Reduction:
- Intelligent Tiering: Automatically moves objects to cheaper storage based on access patterns
- Lifecycle rules: Move to S3-IA (Infrequent Access) after 30 days, Glacier after 90 days
- S3 request optimization: Reduce unnecessary GET/PUT/LIST operations
4. Data Transfer Costs:
- AWS charges for data transfer OUT of AWS (in-region and to internet)
- Using CloudFront as CDN for static content reduces origin data transfer costs
- Keep data processing within the same region as storage to eliminate cross-region transfer fees
5. Lambda and Serverless:
- Power tuning: Use the open-source AWS Lambda Power Tuning tool to find the optimal memory configuration. Because Lambda allocates CPU in proportion to memory, a higher memory setting can execute faster and cost less overall despite the higher per-GB-second price
- Dead letter queues: Prevent infinite retry loops that generate unexpected costs
- Provision concurrency: Only for Lambda functions with strict latency requirements
Azure Cost Optimization
Tools:
- Azure Cost Management + Billing: Cost analysis, budgets, alerts, and reservation management
- Azure Advisor: Recommendations including rightsizing, idle resources, and Reserved Instance coverage
- Azure Pricing Calculator: Model architecture costs before deployment
- Azure Monitor Metrics: Utilization data for rightsizing analysis
Key Azure-specific optimizations:
1. Azure Hybrid Benefit:
- Apply existing Windows Server and SQL Server licenses to Azure VMs
- Typical savings: 40% on Windows VMs, 55% on SQL VMs
- Often the single largest Azure cost reduction lever for organizations with existing Microsoft Enterprise Agreements
2. Reserved VM Instances:
- 1-year reserved instances: ~36% savings vs. pay-as-you-go
- 3-year reserved instances: ~52% savings
- Convertible reservations allow exchanging instance families as workloads evolve
3. Azure Spot VMs:
- Up to 90% discount vs. regular pricing
- Can be evicted with 30 seconds' notice when Azure needs capacity
- Use for: batch processing, dev/test, fault-tolerant workloads
- Not appropriate for: production databases, critical applications
4. Azure Storage Tiering:
- Hot tier for frequently accessed data
- Cool tier (50% cheaper) for data accessed < monthly
- Archive tier (90% cheaper) for data accessed rarely, with retrieval delay acceptable
5. App Service Plan Optimization:
- Many organizations over-provision App Service Plans
- Consolidate multiple apps onto shared App Service Plans
- Scale down during off-hours using Auto-scale rules
GCP Cost Optimization
Tools:
- Cloud Billing Reports: Detailed cost breakdown by project, service, SKU
- Recommender API: Automated rightsizing, idle resource, and commitment recommendations
- Cloud Monitoring: Utilization metrics for rightsizing analysis
- Billing Export to BigQuery: Enables custom cost analysis at any granularity
Key GCP-specific optimizations:
1. Committed Use Discounts (CUDs):
- 1-year commitment: 37% discount on GCE
- 3-year commitment: 55% discount
- Unlike AWS RIs, GCP CUDs are flexible across machine types in the same region
2. Sustained Use Discounts:
- Automatic discounts (no commitment required) for VMs that run > 25% of the month
- Discounts scale up to 30% for instances running 100% of the month
- This automatic discount means some optimization happens without intervention
3. Preemptible VMs (Spot VMs):
- Up to 80% cheaper than standard instances
- Can be preempted with 30-second notice
- Use for: batch jobs, CI/CD runners, ML training
4. Committed Use Discounts for Cloud SQL and Cloud Spanner:
- Similar to compute CUDs, available for database services
- 3-year commitments provide 50%+ discount
Building Cloud Cost Management as an MSP Service
MSPs that build cloud cost management as a service generate significant recurring revenue while delivering obvious ROI to clients.
The Cloud FinOps Engagement Model
Phase 1: Cloud Cost Assessment ($2,500–$8,000 one-time)
Deliverables:
- Current cloud spend analysis with waste identification
- Resource utilization analysis with right-sizing recommendations
- Tagging gap analysis
- Reservation/commitment coverage analysis
- Estimated savings potential with implementation effort
This assessment pays for itself immediately — clients typically see the potential savings in the report and want to move forward with implementation.
Phase 2: Optimization Implementation ($5,000–$20,000 one-time)
Activities:
- Implement tagging standards
- Right-size identified resources
- Purchase Savings Plans/Reserved Instances
- Configure idle resource shutdowns
- Set up budget alerts and governance policies
Phase 3: Ongoing Cloud Cost Management ($500–$3,000/month)
Monthly managed services including:
- Monthly cost review and reporting
- Ongoing right-sizing as workloads evolve
- Reservation utilization monitoring and optimization
- New workload cost review before deployment
- Quarterly strategic cloud cost review with the client
Pricing Your Cloud FinOps Service
A common pricing model: percentage of cloud savings delivered.
Example: Client spends $30,000/month on AWS. Your assessment identifies $9,000/month in savings (30%). You charge:
- 20% of first-year savings = $21,600 for implementation + first year of managed service
- Or $1,500/month ongoing for continuous optimization
This model aligns your incentives with the client's outcomes — you are motivated to maximize savings, and the client pays from savings, not from a separate budget line.
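The fee math in the example above is simple enough to encode, which is useful when building proposals for multiple clients. The figures below come directly from the example in the text (30% savings on $30,000/month, 20% share):

```python
# The savings-share pricing example above, as arithmetic. Percentages
# are kept as whole numbers so the division stays exact.
def savings_share_fee(monthly_savings, share_percent=20, months=12):
    """Fee as a share of first-year savings."""
    return monthly_savings * months * share_percent / 100

fee = savings_share_fee(9_000)  # 21600.0, matching the $21,600 in the text
```

Swapping in each prospect's estimated monthly savings gives a consistent, defensible quote for the savings-share model.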
VPS Server Host and other infrastructure providers use FinOps principles to right-size their own hosting infrastructure, passing efficiency gains to customers through competitive pricing. Data Mammoth provides data infrastructure optimization services including cloud cost analysis for data-intensive workloads.
FinOps Tools Comparison
Native Cloud Tools (Free, Start Here)
AWS:
- Cost Explorer + Compute Optimizer + Trusted Advisor
- Comprehensive for single-account AWS environments
- Limitation: No multi-cloud, limited alerting sophistication
Azure:
- Cost Management + Advisor + Azure Monitor
- Best in class for Microsoft-centric environments
- Strong Reserved Instance management and anomaly detection
GCP:
- Billing Reports + Recommender API + BigQuery export
- Best customization via BigQuery
- Recommender API provides automated, actionable recommendations
Third-Party FinOps Platforms
Apptio Cloudability (enterprise): Multi-cloud, showback/chargeback, business mapping of cloud costs. Best for large enterprises.
CloudHealth by VMware: Multi-cloud governance and cost management. Strong policy engine for automated governance.
Vantage: Developer-friendly FinOps platform with excellent unit economics tracking. Popular with engineering teams.
Spot.io by NetApp: Focuses on automated infrastructure optimization — Spot instances, Reserved Instance optimization, Kubernetes cost optimization. Particularly strong for container workloads.
CloudZero: Unit cost economics focus. Best for SaaS companies wanting to track cloud cost per customer, per feature, or per team.
For MSPs: Cloudability and CloudHealth are the most common MSP choices because they support multi-tenant management. Vantage is a good single-tenant option.
Kubernetes and Container Cost Optimization
Container workloads introduce unique cost management challenges. A Kubernetes cluster can be running thousands of pods across many nodes, with cost attribution far more complex than VM-based architectures.
Container Cost Attribution
Without instrumentation, a Kubernetes cluster is a black box of cost. Tools for container cost attribution:
Kubecost: Open-source Kubernetes cost monitoring. Allocates cluster costs to namespaces, deployments, labels. Free for single cluster, paid for multi-cluster.
OpenCost: CNCF-standard for Kubernetes cost attribution. Vendor-neutral, open-source.
Cloud-native options: AWS Cost Allocation Tags for EKS, Azure Cost Analysis for AKS, GCP Cloud Billing for GKE — all improving but less granular than dedicated tools.
Right-Sizing Container Workloads
Containers are frequently over-requested (the CPU and memory limits set in pod specs are higher than actual usage):
```yaml
# Typical over-provisioned pod spec
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
```

If actual usage is consistently 128Mi of memory and 50m of CPU, the pod is wasting roughly 75% of its requested resources. Right-sized:

```yaml
resources:
  requests:
    memory: "160Mi"  # small buffer above observed usage
    cpu: "70m"       # small buffer above observed usage
  limits:
    memory: "256Mi"
    cpu: "200m"
```
Vertical Pod Autoscaler (VPA): Kubernetes built-in tool that monitors resource usage and recommends or automatically adjusts resource requests. An excellent first step for right-sizing container workloads without manual analysis.
Karpenter (AWS): Node provisioning automation for EKS that selects the most cost-efficient node type for each workload's actual resource requirements, rather than using a fixed node type for the cluster.
Cluster Right-Sizing
Clusters are often oversized at the node level. For each node pool:
- Review average node utilization (CPU and memory) over the past 30 days
- If consistently < 60% utilized, evaluate reducing node count or instance size
- Consider spot/preemptible nodes for fault-tolerant workloads in the cluster
Frequently Asked Questions (Extended)
What is a good cloud cost per unit metric?
Cloud cost per unit (unit economics) measures cloud spend against a business metric: cost per customer, cost per API call, cost per transaction. This requires consistent tagging and either attribution tooling or manual calculation. A "good" number depends entirely on your industry and business model — the important thing is tracking it over time and ensuring it trends down as you grow (reflecting economies of scale in cloud).
How do I handle legacy applications that cannot be resized?
Some legacy applications have licensing restrictions that tie them to specific hardware counts, or configuration constraints that prevent memory reduction. For these: first, confirm the constraint is real (many "cannot be resized" constraints are actually untested assumptions). If confirmed, document the exception with business justification. Consider these workloads as candidates for containerization or refactoring in future roadmap discussions.
Is it safe to turn off development environments overnight?
Generally yes, with appropriate tooling. Key considerations: ensure all code is committed to version control before shutdown (not just saved locally), ensure databases have snapshot backups before shutdown, and test the start-up process — some environments have state that does not survive restart correctly. An automated daily restart (not just shutdown) of dev environments also catches start-up failures early.
What is the cloud cost impact of security controls?
Security controls (GuardDuty, Macie, CloudTrail with S3, WAF, DDoS protection) have cost implications. Budget for these separately from compute and storage. A common mistake: security controls are enabled during an audit or compliance project and then left running without being incorporated into the cloud budget baseline, causing unexpected cost overruns.
How do I prioritize which cost optimizations to implement first?
Prioritize by: (savings potential × implementation ease) / risk. Reserved Instances for predictable workloads: high savings, easy to implement, low risk — first priority. Right-sizing large instances: high savings, moderate effort, low risk — second priority. Idle resource cleanup: moderate savings, easy, no risk — third priority. Spot instance migration: high savings, moderate effort, high risk (potential disruption) — plan carefully and test thoroughly before executing.
Hosting & Cloud Infrastructure Architect
Tom has 12 years of experience in the hosting industry, from shared hosting support to architecting multi-region cloud platforms. He specializes in WHMCS automation, VPS management, cPanel/Plesk administration, and the intersection of hosting and MSP tooling. He has contributed to several open-source hosting automation projects and manages infrastructure spanning 4 data centers.