Note: This post is inspired by Julia Evans’ recent post about capacity planning. 😌
First, let’s establish some ground rules. Yes…this post is geared for those of us who use MySQL with a single writer at a time and 2 or more read replicas. A lot of what I will talk about here applies differently, or not at all, to multi-writer clustered datastores, although those also come with their own set of compromises and caveats. So…your mileage will definitely vary.
However, this blog post WILL apply regardless of whether you use self-hosted physical hosts or are in AWS, where you can use magic like reserved instances and near-instant provisioning. Being in “the cloud” will not preclude you from knowing your infrastructure’s abilities and limits, and it won’t absolve you of that responsibility towards your team and customers. 🙂
I have already covered broad strokes of this in one of my earlier posts, where I mostly focused on the benefits of functional or horizontal sharding. Yes, that is an absolute prerequisite, since what you use to access the database layer WILL determine how much flexibility you have to scale.
If you are a brand new company, you may not need functional sharding on day 1 but, if you hope (and I suspect you do) to grow, do not box yourself in with an open source ORM that won’t allow you to split your data on a functional basis down the line. Bonus points if you also don’t start your company with all your relational data in a single schema.
Ability to split reads and writes
This is something you will need to be able to do, but not necessarily enforce as a set-in-stone rule. There will be use cases where a write needs to be read very soon after, and where tolerance for things like lag/eventual consistency is low. Those are ok to have, but in the same applications you will also have reads that can tolerate a longer window of eventual consistency. When such reads come in high volume, do you really want that volume going to your single writer if it doesn’t really have to? Do yourself a favor, and make sure early in your growth days that you can control the use of a read or write IP in your code.
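As a minimal sketch of what that control can look like in application code (the endpoint names and the `lag_tolerant` flag here are hypothetical; a real implementation would hand back pooled connections rather than hostnames):

```python
from itertools import cycle

# Hypothetical endpoints for illustration only.
WRITER = "mysql-writer.internal"
READERS = ["mysql-replica-1.internal", "mysql-replica-2.internal"]

class ConnectionRouter:
    """Toy read/write router: strings stand in for real DB connections."""

    def __init__(self, writer, readers):
        self.writer = writer
        self._readers = cycle(readers)  # naive round-robin across replicas

    def endpoint(self, *, write=False, lag_tolerant=True):
        # Writes, and reads that must see their own writes immediately,
        # go to the single writer; everything else can use a replica.
        if write or not lag_tolerant:
            return self.writer
        return next(self._readers)

router = ConnectionRouter(WRITER, READERS)
print(router.endpoint(write=True))         # the writer
print(router.endpoint(lag_tolerant=True))  # a replica
print(router.endpoint(lag_tolerant=False)) # read-after-write: the writer
```

The point is not this exact API, but that the choice of endpoint is an explicit, per-query decision the application can make from day one.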
Now, onto the thought process of actual capacity planning… A database cluster is not keeping up; what do I do?
Determine the system bottleneck
- Are you bottlenecked on writes or reads?
- Is the issue exhibiting as high CPU?
- Is it exhibiting as IO capacity?
- Is it growing lag on the replicas without a clear read query culprit?
- Is it locks?
- How do I even know which it is?
Each of these can be a post by itself. The point I am trying to make is that you have to be familiar with your system and DB specific metrics to be able to find out what piece is the bottleneck.
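As a toy illustration of the kind of triage those questions imply (the metric names and thresholds below are invented for illustration; real values come from your own monitoring and will differ by workload):

```python
# Toy bottleneck triage over a snapshot of metrics. Every metric name and
# threshold here is an invented placeholder, not a recommendation.

def classify_bottleneck(m):
    findings = []
    if m.get("cpu_util", 0) > 0.85:
        findings.append("cpu-bound")
    if m.get("io_util", 0) > 0.90:
        findings.append("io-bound")
    if m.get("replica_lag_s", 0) > 30 and not m.get("slow_reads_present", False):
        findings.append("replication-bound (no clear read query culprit)")
    if m.get("lock_wait_ms_p99", 0) > 500:
        findings.append("lock contention")
    return findings or ["no obvious bottleneck; dig into query-level metrics"]

print(classify_bottleneck({"cpu_util": 0.95, "lock_wait_ms_p99": 800}))
```

No real system reduces to a handful of if-statements, but being able to answer each of those questions from your dashboards is the prerequisite for everything that follows.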
You need a baseline
Always make sure you have basic system metrics available to visualize for at least a few weeks back. Many tools provide this (Cacti, Munin, Graphite, etc.). Once you know what system metric you are mostly bound to, you need to establish baseline and peak values. Otherwise, determining whether your current issue is a new application-sourced bug vs. real growth is going to be a lot more error prone than you’d like.
However, basic server metrics can only go so far; at some point you will find you also need context-based metrics. Query performance and app-side perceived performance will tell you what the application sees as the response time to queries. There are many tools for this context-heavy tracking, some open source like Anemometer, and some commercial like VividCortex (we use these at SendGrid; see us talk about it here). Even just tracking these timings from the app’s perspective and emitting them as statsd metrics is a good first step. But early on, you must get used to the fact that what your app perceives is what your customers perceive. And you must find a way to know first.
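That app-side first step can be as small as a timing wrapper around your query calls. A sketch, where `emit` is a stand-in for a real statsd client call such as `statsd.StatsClient().timing(...)` and the metric name is hypothetical:

```python
import time
from contextlib import contextmanager

def emit(metric, ms):
    # Stand-in for a real statsd client's timing() call; here we just print
    # the statsd wire format so the example is self-contained.
    print(f"{metric}:{ms}|ms")

@contextmanager
def timed(metric):
    # Measure wall-clock time around a block and emit it as a timer metric.
    start = time.monotonic()
    try:
        yield
    finally:
        emit(metric, int((time.monotonic() - start) * 1000))

# Wrap a query so the app-perceived latency becomes a metric:
with timed("db.users.select_by_email"):
    time.sleep(0.01)  # stand-in for the actual database call
```

This measures latency exactly where your customers feel it: in the application, network round trip included.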
Learn your business’ traffic patterns
Are you a business that is susceptible to extreme peaks on specific weekdays (e.g. marketing)? Do you have regular launches that triple or quadruple your traffic, like gaming? These sorts of questions will drive how much reserved headroom you should keep, or whether you need to invest in elastic growth.
Determine the ratio of raw traffic numbers in relation to capacity in use
This is simply the answer to, “If we made no code optimizations, how many emails/sales/online users/whatever can we serve with the database instances we have right now?”
Ideally, this is a specific value that makes planning a year’s growth a simple equation. But life is never ideal, and this value will vary depending on the season or on completely external happy events like signing a new major customer. In early startups this number is a faster-moving target, but it should stabilize as the company transitions from its early days into a more established business with more predictable growth patterns.
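To make the math concrete with entirely made-up numbers: suppose load testing showed one read replica comfortably serves about two million emails per day of read traffic, and you want to keep some fraction of capacity in reserve:

```python
import math

# Made-up capacity figure for illustration: what load testing showed one
# read replica can comfortably serve per day.
EMAILS_PER_REPLICA_PER_DAY = 2_000_000

def replicas_needed(daily_emails, headroom=0.3):
    # Keep `headroom` fraction of each replica's capacity in reserve
    # for peaks and for losing a replica.
    usable = EMAILS_PER_REPLICA_PER_DAY * (1 - headroom)
    return math.ceil(daily_emails / usable)

print(replicas_needed(5_000_000))       # -> 4, at current traffic
print(replicas_needed(10_000_000))      # -> 8, if traffic doubles this year
```

Once you have that ratio, a year’s growth projection turns into a one-line calculation instead of a guess.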
Do I really need to buy more machines?
You need to find a way to determine whether this is truly a capacity problem (I need to split the writes to support more concurrent write load, or add more read replicas) vs. a code-based performance bottleneck (this new query from a recent deploy can really have its results cached in something cheaper and not beat on the database as much).
How do you do that? You need to get familiar with your queries. The baby step for that is a combination of innotop, slow log, and the Percona Toolkit’s pt-query-digest. You can automate this by shipping the DB logs to a central location and automating the digest portion.
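A sketch of what the automated digest step might look like, assuming the Percona Toolkit is installed on the host doing the digesting (the log path here is hypothetical):

```python
import shutil
import subprocess

# Hypothetical central location that slow logs from all DB hosts are shipped to.
SLOW_LOG = "/var/log/mysql/central/slow.log"

def digest_command(slow_log, top_n=10):
    # pt-query-digest aggregates the slow log and ranks queries by total
    # time consumed; --limit caps the report to the top offenders.
    return ["pt-query-digest", "--limit", str(top_n), slow_log]

def run_digest(slow_log=SLOW_LOG):
    if shutil.which("pt-query-digest") is None:
        raise RuntimeError("Percona Toolkit not installed on this host")
    result = subprocess.run(digest_command(slow_log),
                            capture_output=True, text=True, check=True)
    return result.stdout

print(digest_command(SLOW_LOG))
```

Run on a schedule against the shipped logs, this gives you a standing report of your worst queries instead of a manual investigation every time.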
But that is also not the entire picture: slow logs are performance intensive if you lower their threshold too much, and they only ever show you a sample. If you cannot live with that selective sampling, you will need to capture the entire conversation between the application and the datastore. In open source land you can go as basic as tcpdump, or you can use hosted products like Datadog, New Relic, or VividCortex.
Make a call
Capacity planning can be 90% science and 10% art, but that 10% shouldn’t stop us from striving for as complete a picture as we can get. As engineers, we can sometimes fixate on the missing 10% and not realize that if we did the work, that 90% could get us far: a better idea of our stack’s health, a more efficient use of our time spent optimizing performance, and carefully planned capacity increases, which eventually results in a much better return on investment for our products.