Before SendGrid, I deployed all my databases by hand. I’d have a documentation page (a Google doc, an internal wiki page…whatever), and it would be a long bulleted list of “Install this, then install this.” If you have ever maintained “How to” documents like that, the picture to the right will eventually ring true.
This was obviously not a good approach, especially when small details start changing but the “documentation” lags behind. You end up with a situation that breeds tribal knowledge, which means the Ops person paged at 3 AM, who is not the DBA, has even less ability to know what should be running on a database and what it should look like under normal operations.
Multiply by…a lot
Then came my largest deployment to date at SendGrid. We needed a data store for the click and open tracking for our short URLs, and we decided to use MySQL for it. This was going to be a ton of rows, with a high demand for fast writes and a lot of reads to support, so a single instance was not going to cut it.
Because MySQL 5.5 was the standard GA version at the time, we were still limited to a MySQL that didn’t make very efficient use of all the cores that newer server configurations could offer. So, to squeeze the most performance out of our not-so-cheap hardware, we also decided to house 5 MySQL instances per box. The way to do that is to add a virtual IP per instance on the box and use a distinct data directory and config file per instance, while still making sure that all 5 instances are “equal,” so that none starves the others of system resources.
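To make that layout concrete, here is a rough sketch in plain Python. It is not cookbook code and not the actual SendGrid setup. The instance names, file paths, the example IP range, and the idea of splitting the InnoDB buffer pool evenly are all assumptions made for illustration. It only shows the shape of the problem: one virtual IP, one data directory, and one config file per instance, with every instance given the same share of resources.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List


@dataclass(frozen=True)
class MySQLInstance:
    """One of several mysqld processes sharing a box (hypothetical layout)."""
    name: str
    bind_address: str      # the virtual IP dedicated to this instance
    datadir: str           # distinct data directory per instance
    config_file: str       # distinct config file per instance
    socket: str
    buffer_pool_mb: int    # equal share, so no instance starves the others


def plan_instances(first_vip: str, count: int, total_buffer_pool_mb: int) -> List[MySQLInstance]:
    """Assign consecutive virtual IPs and an equal memory share to each instance."""
    base = IPv4Address(first_vip)
    share = total_buffer_pool_mb // count
    instances = []
    for i in range(1, count + 1):
        name = f"mysql{i}"  # assumed naming scheme
        instances.append(MySQLInstance(
            name=name,
            bind_address=str(base + (i - 1)),
            datadir=f"/var/lib/{name}",            # assumed path
            config_file=f"/etc/mysql/{name}.cnf",  # assumed path
            socket=f"/var/run/mysqld/{name}.sock", # assumed path
            buffer_pool_mb=share,
        ))
    return instances


def render_config(inst: MySQLInstance) -> str:
    """Render a minimal [mysqld] section for one instance."""
    lines = [
        "[mysqld]",
        f"bind-address = {inst.bind_address}",
        "port = 3306",  # same port everywhere; the virtual IP tells instances apart
        f"datadir = {inst.datadir}",
        f"socket = {inst.socket}",
        f"innodb_buffer_pool_size = {inst.buffer_pool_mb}M",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 192.0.2.0/24 is a documentation-only range; 40 GB of buffer pool is made up.
    for instance in plan_instances("192.0.2.10", count=5, total_buffer_pool_mb=40960):
        print(f"# {instance.config_file}")
        print(render_config(instance))
```

Done by hand, each of those five config files is one more place for a typo or a forgotten setting, and that gets repeated on every box in the cluster.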
As you can see, it became very clear to me that I could not successfully deploy this new cluster (the biggest I had done yet) using the same old method. I needed a way to automate the building of these clusters, and I needed it to also be an easy way to maintain the state of these clusters (the configuration of MySQL and of the system underneath) in code.
So why Chef? Simply put, it was what SendGrid had already been using for configuration management and for what is now often called “Infrastructure as code.” I wanted this datastore to start the effort of making what I do stop seeming like black magic…because it really isn’t. I work with a team of great operations engineers, and when a fairly small team is trying to keep up with traffic that doubles or more every year, consistency of tools is extremely important.
What I learned
Learning Chef as a DBA was an interesting experience. I will preface this section by saying that 2 years later, I am rewriting not just the cookbook for this data cluster but all of my Chef code at SendGrid. There are many things I learned the hard way in that first major iteration, and I can imagine others in DBA or similar roles at other companies hitting the same pitfalls.
Write your own cookbook
I am not going to go into code samples. There are a few community cookbooks for installing MySQL/Percona Server, and I consider them all a great place to find examples. Yes, you can absolutely grab them and deploy MySQL with them, and I imagine for many budding teams this may be a very fine path to take. But know the debt you take on when grabbing someone else’s code to deploy your infrastructure. I chose from the very beginning to write my own cookbook because by the time I started, SendGrid was already handling huge throughput, and that comes with a number of tweaks.
When things are similar but not the same
When I started on this cookbook writing adventure, I thought my different database clusters were similar enough to use one cookbook with just some attribute differences. And maybe when I started 2+ years ago, that was true. Very quickly though, as we sharded more tables into their own clusters and added a few brand new projects using MySQL, that stopped being true. I found myself maintaining the monorail of database cookbooks. Making its testing strategy truly comprehensive meant 3 Test Kitchen suites per database kind. Build times grew exponentially.
This is why, in this rewrite, I am leaning heavily on what is basically a wrapper cookbook style. Yes, most of my MySQL deployments follow more or less the same pattern, but things usually diverge after the server install. And there are few things as frustrating as watching a multi-hour Jenkins build because I changed a config file for one specific database type.
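The core idea of the wrapper style can be sketched outside of Chef. What follows is plain Python, not cookbook code, and every attribute name and value in it is hypothetical. A shared base holds the defaults that all clusters agree on, and each cluster supplies only the settings where it diverges. Changing one cluster's overrides then touches only that cluster.

```python
from copy import deepcopy
from typing import Any, Dict

# Defaults shared by every cluster (hypothetical names and values).
BASE_ATTRIBUTES: Dict[str, Any] = {
    "mysql": {
        "version": "5.5",
        "config": {
            "max_connections": 500,
            "innodb_flush_log_at_trx_commit": 1,
        },
    }
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A "wrapper" for one cluster type states only what differs from the base.
CLICK_TRACKING_OVERRIDES: Dict[str, Any] = {
    "mysql": {"config": {"max_connections": 2000}}
}

click_tracking = deep_merge(BASE_ATTRIBUTES, CLICK_TRACKING_OVERRIDES)

if __name__ == "__main__":
    print(click_tracking)
```

In Chef terms, the base roughly corresponds to a shared cookbook and each set of overrides to a small wrapper cookbook that depends on it. That split lets a change for one database type be built and tested on its own.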
Embrace your organization’s cookbook hierarchy
First and foremost (besides automation, making my life easier, etc.), I decided to learn Chef and write cookbooks for our databases because database land should not be an island. This is why, in my rewrites, I made sure the operations engineering team reviewed my code. Because they are the most immersed in Chef, their peer review is extremely useful. They also know which parts of system management we decided to turn into internal lightweight resources, which makes my code even simpler and saves me from reinventing a wheel that is mostly the same as theirs, but not quite. This has made the rewritten cookbooks much easier to follow and maintain.
This rewrite is not done. I only have a few clusters left, and cookbooks for them are already in progress. Working on this project has taught me quite a lot about being an operations engineer.
For more on what it’s like to work as a DBA at a scaling company, read my other post, “Scaling MySQL at SendGrid.”