Note: This is the first of three engineering blog posts from Brad Culberson, one of our highest-ranking engineers here at SendGrid. If you’re interested in reading more posts like this, check out our technical blog roll.
Most developers think caching is like bacon: you can add it to anything and it gets better. But I frequently run across production code that is overly complex and actually performs worse because of caching. This article will show you what to look for and how to evaluate whether your cache is detrimental to your application.
The best cases for caching are when completing a task takes a long time and that exact task is common. In this example, assume you are getting data from Solr with an average request time of 2 seconds. The set of queries is limited enough that once a query result is added to the cache, it is available for 80 out of every 100 requests: a cache hit ratio of 80% and a cache miss ratio of 20%. Every read and write to the cache takes 2ms.
The average request time:
0.20 * (2000ms + 2ms) + 0.80 * 2ms = 402ms
The time before cache: 2000ms
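The arithmetic above generalizes to any backend/cache pair. Here is a minimal sketch of the expected-latency formula; the numbers are the Solr example from the text, used purely for illustration:

```python
# Expected request latency with a cache in front of a slow backend.
# A miss pays the backend cost plus a cache write; a hit pays only
# a cache read. Numbers below are the Solr example from the text.

def average_latency_ms(hit_ratio: float, backend_ms: float, cache_ms: float) -> float:
    """Weighted average of the miss path and the hit path."""
    miss_ratio = 1.0 - hit_ratio
    return miss_ratio * (backend_ms + cache_ms) + hit_ratio * cache_ms

print(round(average_latency_ms(0.80, 2000, 2), 1))  # → 402.0
```

Plugging in your own measured hit ratio and latencies tells you quickly whether a proposed cache is worth the complexity.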
There are also a lot of cases where caching is a bad idea. One I recently analyzed was a cache in front of data pulled from Riak. The average read time from Riak in that case was 3ms. The average read and write time to the Redis cache that was in place was 2ms. The cache hit ratio was 60%, which sounds good, right?
The average request time:
0.40 * (3ms + 2ms) + 0.60 * 2ms = 3.2ms
The time before cache: 3ms
In this case, we added 0.2ms of latency by adding the complexity of a Redis cache. These timings also exclude, for simplicity, any work needed to invalidate the cache; including it would penalize the cache solution even more. After the analysis, we removed the cache and simplified the code substantially.
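There is a quick way to see why this cache could never win. With cache cost c and backend cost b, the average latency works out to c + (1 − h)·b, which beats b only when the hit ratio h exceeds c/b. A short sketch, using the Riak/Redis numbers from the text:

```python
# Break-even hit ratio: a cache only lowers average latency when the
# hit ratio exceeds cache_ms / backend_ms.
# average with cache = cache_ms + (1 - h) * backend_ms,
# which is below backend_ms exactly when h > cache_ms / backend_ms.

def break_even_hit_ratio(backend_ms: float, cache_ms: float) -> float:
    return cache_ms / backend_ms

print(break_even_hit_ratio(3, 2))     # ≈ 0.667: need >66.7% hits to break even
print(break_even_hit_ratio(2000, 2))  # ≈ 0.001: almost any hit ratio helps
```

The 60% hit ratio in the Riak case falls short of the roughly 66.7% break-even point, which is exactly why the cache added latency instead of removing it.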
Another bad case I run into frequently is caching relational database results by key. Relational databases do an excellent job keeping hot items in memory and are amazing at querying by key. In my experience, the math rarely works out for caching in another system, because the speed difference between the cache and the database is negligible. You won’t get a 100% hit rate, and you will also penalize updates and deletes with cache update and eviction time. If you would like to review the minimal savings (microseconds) a by-key cache on MySQL provides, I published a codebase that benchmarks the “savings.”
Don’t Let Your App Become Addicted to the Bacon
After you’ve done the calculations and decided your cache is efficient, your job isn’t complete. You must be prepared for a worst-case scenario: your cache may require reloading. This will happen in production for many reasons: eviction policies, server loss, or data center scaling and expansion. In this situation, your app is likely to have a thundering herd problem reloading the cache. Any time you rely on a cache for performance, this is a scenario you must test for and be prepared for. It’s likely your app will be unhealthy or non-operational for some time while the cache is priming.
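One common way to blunt the herd is single-flight loading: on a miss, only one caller recomputes a key while the rest wait for its result. Below is a minimal in-process sketch of the idea; `SingleFlightCache` and its loader are illustrative names, not SendGrid’s actual code, and it assumes cached values are never None:

```python
# A minimal sketch of single-flight cache loading: on a miss, only one
# thread recomputes the value per key while the others wait, so a cold
# cache doesn't translate into a thundering herd against the backend.

import threading

class SingleFlightCache:
    def __init__(self, loader):
        self._loader = loader          # slow backend fetch, e.g. a Solr query
        self._values = {}              # assumes cached values are never None
        self._locks = {}               # one loader lock per key
        self._guard = threading.Lock() # protects the lock table

    def get(self, key):
        value = self._values.get(key)
        if value is not None:
            return value  # cache hit, no locking needed
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:  # only one loader per key at a time
            value = self._values.get(key)  # re-check after waiting
            if value is None:
                value = self._loader(key)
                self._values[key] = value
        return value
```

In a multi-process or multi-host deployment, the same idea needs a shared lock (or request coalescing at a proxy) rather than an in-process mutex, but the goal is identical: a cold cache should cost one backend query per key, not one per waiting request.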
Be cautious when you add caches to code, and make sure you instrument heavily before and after. If you don’t know the cache hit ratio and the latency of hits and misses, you can’t make an informed decision about adding or keeping the cache. You may find that all the work to add the cache was detrimental to latency.
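As a sketch of what that instrumentation can look like, here is a toy in-process cache that records hit/miss counts and per-path latencies; `InstrumentedCache` and `backend_fetch` are hypothetical names for illustration, not a real library:

```python
# A minimal sketch of cache instrumentation: count hits and misses and
# record the latency of each path, so the hit ratio and the real cost
# of a miss are known before deciding to add or keep the cache.

import time
from collections import defaultdict

class InstrumentedCache:
    def __init__(self, backend_fetch):
        self._backend_fetch = backend_fetch
        self._store = {}
        self.stats = defaultdict(list)  # "hit"/"miss" -> latencies in seconds

    def get(self, key):
        start = time.perf_counter()
        if key in self._store:
            value = self._store[key]
            self.stats["hit"].append(time.perf_counter() - start)
        else:
            value = self._backend_fetch(key)
            self._store[key] = value
            self.stats["miss"].append(time.perf_counter() - start)
        return value

    def hit_ratio(self):
        hits, misses = len(self.stats["hit"]), len(self.stats["miss"])
        return hits / (hits + misses) if hits + misses else 0.0
```

In production you would ship these numbers to your metrics system instead of a dict, but even this much is enough to plug real values into the latency formula above and decide whether the cache earns its keep.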
Liberate yourself by instrumenting your code, assuming nothing, and testing everything. Remove the caches that aren’t working. Don’t pre-optimize by caching all-the-things. It’s very possible you are slowing down your application and making it more complicated than necessary. I really love caching, but adding it can make your application worse.