The Importance Of Cost Scalability And What To Watch Out For
I have spent my entire career on the operational side of IT, whether as an End User Support Engineer, Systems Administrator, IT manager/director, or in DevOps/Infrastructure and Site Reliability roles. During my time in these roles, I have learned, and started to talk about, what I believe are the 3 SaaS Operations Pillars for Success - 1) Scalability, 2) Reliability, & 3) Security. While each of these pillars deserves its own due, I would like to focus on Scalability, and specifically on how to become aware of and properly manage Cost Scalability in a cloud-based world: what to consider, what to look for, and how best to manage it.
Public Cloud - All The Cool Kids Are Doing It!
At the beginning of my work journey, I had the opportunity to build out infrastructure in data centers (or at least in a colocation facility), leveraging capital expenditure budgets to purchase hardware and software as necessary. This is well-trodden ground that many in the industry have discussed and written about, but the capital expense model forced technical and business leaders to think about how well they were using the resources they had already purchased, and it required those same teams to plan and communicate the need for more during the annual budget cycle. This process "control valve" had value, but most argued it slowed down innovation.
Then along came Public Cloud. While the advent of the public cloud wasn't explicitly meant to solve the capital expense process I describe above, it did make experimentation and innovation easier by allowing quick access to the necessary infrastructure. This is great for speed to market and agility (quick iterations), but the "waste" it leaves in its wake is quite scary. I have seen it firsthand, even to the point where the CFO comes walking into your office (or up to your open cube desk) yelling "What the hell are we doing spending this amount in the cloud!!!!!?!?!!!"
How To Make the CFO Love What You Do (or like you just a bit more)
In the next few sections, I'm going to describe the areas in Public Cloud that people should consider in order to prevent a cost-tastrophe for their business. The focus is on AWS, but the same concepts/tooling can be leveraged in Google Cloud or Microsoft Azure.
Flip the Compute Cost Narrative
As many of you know, there are 3 models for compute costs: On-Demand, Reserved Instances (or Reserved VMs), and Spot Instances (aka Preemptible in GCP or Low Priority in Azure). Most organizations start with On-Demand, work their way to purchasing Reserved Instances/VMs for consistent workloads, and then leverage Spot Instances as necessary. Instead, I would implore you to work the process in the opposite direction ... Start with Spot Instances where possible (especially if you are already running containers), purchase Reserved Instances / VMs where Spot doesn't work, and then, if necessary, use On-Demand. While leveraging Spot Instances in AWS can be difficult to manage (see my article here that talks about it), this should be the default for organizations. Also, as containers and Kubernetes continue to take over the infrastructure layer, implementing this has never been easier.
You are probably asking - Why? Well, the simple answer is that Spot pricing can be 80-90% below On-Demand, and 50-60% below Reserved Instances (RIs). Who doesn't want to save that kind of money?! Plus, the sooner you do this, the quicker you can show a real cost advantage while pushing to put more workloads into the cloud. Now, if Spot doesn't work for you (for whatever reason), then you need to move to RIs, but this does require a commitment. If possible, I suggest checking out the Reserved Instance Marketplace in AWS to get cheap RIs at discounts beyond the norm. (Unfortunately, these can be short-lived, i.e., only a couple of months.) Again, you can save earlier than anticipated and put that money back into the business.
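To put rough numbers on the Spot-first ordering, here is a minimal Python sketch that compares fleet mixes. Everything in it is an assumption for illustration: the 1.00 USD baseline hourly price, the 85% Spot discount, the 67% RI discount (picked so that Spot lands about 55% below RI, in line with the ranges above), and the function name `monthly_cost`. Real prices vary by instance type, region, and term.

```python
# Assumed values for illustration only, not published prices.
ON_DEMAND_HOURLY = 1.00   # hypothetical baseline price per instance-hour (USD)
SPOT_DISCOUNT = 0.85      # assumed: Spot ~85% below On-Demand
RI_DISCOUNT = 0.67        # assumed: Reserved ~67% below On-Demand
HOURS_PER_MONTH = 730


def monthly_cost(instances, spot_share, reserved_share):
    """Blended monthly cost for a fleet split across the three pricing models.

    Whatever is not on Spot or Reserved is billed at the On-Demand rate.
    """
    on_demand_share = 1.0 - spot_share - reserved_share
    if min(spot_share, reserved_share) < 0 or on_demand_share < -1e-9:
        raise ValueError("shares must be non-negative and sum to at most 1")
    on_demand_share = max(on_demand_share, 0.0)
    hourly = ON_DEMAND_HOURLY * (
        spot_share * (1 - SPOT_DISCOUNT)
        + reserved_share * (1 - RI_DISCOUNT)
        + on_demand_share
    )
    return instances * hourly * HOURS_PER_MONTH


if __name__ == "__main__":
    all_on_demand = monthly_cost(100, spot_share=0.0, reserved_share=0.0)
    spot_first = monthly_cost(100, spot_share=0.7, reserved_share=0.2)
    print(f"All On-Demand: ${all_on_demand:,.0f}/month")
    print(f"Spot-first:    ${spot_first:,.0f}/month")
```

Under these assumed discounts, 100 instances cost about 73,000 USD per month all On-Demand versus roughly 19,800 USD with a 70/20/10 Spot/RI/On-Demand split. Plug in your own rates before showing it to anyone in finance.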
Now that we have mastered compute, let's move to the network ...
Oh, for the love of Transfer Costs
Public cloud is great until you need to move data around (which never happens, am I right??! - JK). The simple act of sending data around can cause significant transfer cost problems and ultimately cause one’s bill to spike. Transfer cost areas to keep in mind and stay vigilant about:
● EC2 (Inter-AZ)
● S3
● NAT Gateway
● Load balancers (ALB, ELB, NLB)
I can tell you this can become quite a shock, and even when you think you have solved it, you probably have only moved the costs. Technical teams need to spend time thinking about how they access and move data in the Public Cloud to avoid or minimize these costs. There is no silver bullet here, but you could ask for a private pricing agreement with your provider of choice if you are willing to commit.
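The "you probably have only moved the costs" point is easier to see with numbers. Below is a minimal Python sketch, assuming placeholder per-GB rates and hypothetical names (`RATES_PER_GB`, `transfer_costs`); the rates are not published prices and should be replaced with the ones on your own bill.

```python
# Placeholder per-GB rates in USD. These are assumptions for illustration,
# not any provider's published pricing.
RATES_PER_GB = {
    "inter_az": 0.02,
    "nat_gateway": 0.045,
    "internet_egress": 0.09,
}


def transfer_costs(gb_by_path, rates=RATES_PER_GB):
    """Return (cost per path, total cost) for one month of traffic in GB."""
    unknown = set(gb_by_path) - set(rates)
    if unknown:
        raise KeyError(f"no rate configured for: {sorted(unknown)}")
    costs = {path: gb * rates[path] for path, gb in gb_by_path.items()}
    return costs, sum(costs.values())


if __name__ == "__main__":
    # Before: 10 TB a month flows through a NAT Gateway, 5 TB crosses AZs.
    before = {"nat_gateway": 10_000, "inter_az": 5_000}
    # After: the NAT traffic is re-routed, but it now crosses AZs instead.
    after = {"nat_gateway": 0, "inter_az": 15_000}
    for label, traffic in (("before", before), ("after", after)):
        costs, total = transfer_costs(traffic)
        print(label, costs, f"total=${total:,.2f}")
```

With these placeholder rates the NAT Gateway line drops to zero, yet the inter-AZ line triples. The total goes down, but the cost has moved rather than disappeared.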
Managed Services in the Cloud - Don't let me down now
Managed service solutions in the cloud are fantastic, and I advocate using them where applicable to your business. Solutions like Elasticsearch, ElastiCache, RDS, etc. are great. But there are hidden costs - watch out. In most cases, the biggest cost drivers with these services are: inefficient use of resources due to incorrect or poor service design; ease of deployment ("Oh, I could use this ElastiCache cluster / RDS instance to manage the metadata about my service" x 100 services ... Yikes!); or improper data usage and storage ("Let's keep the data for ... 10 years since you never know." .... Whoops!).
So, what to do? Think about your use of these services, and if you see them proliferating, stop and ask the question - Do we really need them, and how can we do this better?
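A quick back-of-the-envelope calculation shows why the "keep it for 10 years" and "x 100 services" habits hurt. This Python sketch assumes a constant daily ingest per service and a flat storage price; the 0.10 USD per GB-month price, the 5 GB/day ingest, and the function names are all hypothetical.

```python
def steady_state_gb(daily_ingest_gb, retention_days):
    """Data held per service once the oldest records start expiring."""
    return daily_ingest_gb * retention_days


def monthly_storage_cost(daily_ingest_gb, retention_days,
                         price_per_gb_month, services=1):
    """Monthly storage bill at steady state across a number of services."""
    per_service_gb = steady_state_gb(daily_ingest_gb, retention_days)
    return per_service_gb * price_per_gb_month * services


if __name__ == "__main__":
    PRICE = 0.10  # hypothetical USD per GB-month, not a published price
    for label, days in (("90 days", 90), ("10 years", 3650)):
        one = monthly_storage_cost(5, days, PRICE)
        fleet = monthly_storage_cost(5, days, PRICE, services=100)
        print(f"{label:>8}: ${one:,.2f}/month per service, "
              f"${fleet:,.2f}/month for 100 services")
```

Under these assumptions a single service goes from about 45 USD to about 1,825 USD per month just by changing the retention period, and the fleet-wide figure is 100 times that.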
One last area to focus on is that of wasted resources or, better said, anything that is no longer in use: get rid of it. I cannot tell you how many times I have found Load Balancers with no hosts attached, Elastic IPs not attached to hosts, DNS/Route53 entries pointed at non-existent resources, CloudFront entries no longer in use, unattached Storage Volumes, or Snapshots that are older than the account has been around (haha, jk ... wanted to see if you were paying attention), etc. While cleaning up is good hygiene, the cloud providers give you very little tooling to help with this ... So, you're mostly on your own! And why would they want to give you the tools? [One side note - not only is it important from a cost perspective, in some cases these wasted resources could be security flaws just waiting to be taken advantage of.]
The suggestion to fix - work with your teams to write Lambda jobs that scour for unused resources, report on them, and then, if they are no longer needed ... DELETE! Or, you could use tools from companies like CloudHealth (https://www.cloudhealthtech.com/) that allow you to scour your resources, create reports for your users, and then create a governance trail to clean up the waste. We have used the tooling from CloudHealth successfully to save thousands of dollars per month.
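As a starting point for such a Lambda job, here is a minimal report-only sketch in Python using boto3. It covers just two of the waste types mentioned above (Elastic IPs with no association and EBS volumes in the `available` state) and deletes nothing. The helper names are hypothetical, and it assumes the Lambda role is allowed to call `ec2:DescribeAddresses` and `ec2:DescribeVolumes`; treat it as a sketch to adapt, not a finished tool.

```python
def unattached_elastic_ips(addresses_response):
    """Pick out Elastic IPs that have no AssociationId, i.e. nothing attached.

    Expects the dict returned by the EC2 describe_addresses call.
    """
    return [
        addr.get("AllocationId") or addr.get("PublicIp")
        for addr in addresses_response.get("Addresses", [])
        if "AssociationId" not in addr
    ]


def unattached_volumes(volumes_response):
    """Pick out EBS volumes whose State is 'available' (not attached).

    Expects one page of the dict returned by the EC2 describe_volumes call.
    """
    return [
        vol["VolumeId"]
        for vol in volumes_response.get("Volumes", [])
        if vol.get("State") == "available"
    ]


def lambda_handler(event, context):
    """Report-only entry point: lists cleanup candidates, deletes nothing."""
    import boto3  # imported here so the helpers above can be tested offline

    ec2 = boto3.client("ec2")
    eips = unattached_elastic_ips(ec2.describe_addresses())

    volumes = []
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        volumes.extend(unattached_volumes(page))

    report = {"unattached_elastic_ips": eips, "unattached_volumes": volumes}
    print(report)  # Lambda forwards stdout to CloudWatch Logs
    return report
```

Keeping the filtering in plain functions means you can check the logic against saved API responses before giving anything permission to touch real resources. The DELETE step is best added only after people have reviewed the reports for a while.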
Show me the Money
So, to recap, if you want to save money in the cloud and make your CFO your best buddy, here are the things you should do to improve your lot:
1. For Cloud Compute - Leverage Spot/Preemptible/Low Priority VMs first and foremost, followed by Reserved Instances, followed by On-Demand
2. Watch your Transfer Costs closely
3. Managed Services are the bomb - as long as you have good retention policies and architecture reviews to ensure right-sized services
4. Remove the Waste as quickly as you can (or monitor and then remove)
Thanks for letting me share and happy hunting!!