Note: This engineering post was written by our Database Administrator, Silvia Botros. Check out some of her other DBA posts here.
A year ago, SendGrid was working hard towards SOC2 certification. Everyone was involved. Nearly every delivery team board had stories with a SOC2 tag, as we were all looking to be certified by the end of the third quarter. As you can imagine, as the person in charge of databases, I definitely had some work to do for that part of the stack.
On my task list for this business-wide endeavor was making sure that our backups were encrypted. My area of familiarity is DBA tools, and I knew that Percona’s xtrabackup already supports encryption, so predictably that was my first attempt at this task.
A few important things were in my sights in testing this approach:
- Obviously, the backup needed to be encrypted
- The overhead of creating the backup needed to be known and acceptable
- The overhead of decrypting the backup at recovery time needed to be known and acceptable
That meant that first, I needed to be able to track how long my backups take.
Tracking backup time
SendGrid uses Graphite for its infrastructure metrics, and while the vast majority are sent via Sensu, it is easy enough to send metrics to Graphite directly from a line of bash, which is very convenient since the backup scripts are in bash. Note that sending metrics to Graphite directly is not super scalable, but since these backups run at most once an hour, that was fine for my needs.
That part turned out to be relatively easy.
Each metric goes out as a single line: the path of the metric (make sure it is unique), the metric value, then the current time in epoch format. I decided to use netcat for simplicity, and I give it a timeout of 1 second because, otherwise, it will never exit. The `graphite url` is our DNS endpoint for Graphite in the data center.
Now that I had a baseline to compare to, I was ready to start testing encryption methods.
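A minimal sketch of this kind of reporting is shown here. It is not the original script: the host name, port, and metric path are placeholders chosen for illustration, and it assumes Graphite is listening on its plaintext protocol port.

```shell
#!/usr/bin/env bash
# Hypothetical values: replace with your own Graphite endpoint.
GRAPHITE_HOST="${GRAPHITE_HOST:-graphite.example.internal}"
GRAPHITE_PORT="${GRAPHITE_PORT:-2003}"   # Carbon's default plaintext port

# Build one line of the plaintext protocol: "<path> <value> <epoch seconds>".
graphite_line() {
  local path="$1" value="$2" ts="${3:-$(date +%s)}"
  printf '%s %s %s\n' "$path" "$value" "$ts"
}

# Send a single metric. -w 1 makes netcat give up after one second
# instead of holding the connection open.
send_metric() {
  graphite_line "$1" "$2" | nc -w 1 "$GRAPHITE_HOST" "$GRAPHITE_PORT"
}

# Usage: time the backup and report the duration in seconds.
# start=$(date +%s)
# run_backup    # placeholder for the real backup command
# send_metric "backups.$(hostname -s).duration_seconds" "$(( $(date +%s) - start ))"
```

Keeping the line formatting in its own function makes it easy to check what would be sent without touching the network.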
Following the detailed documentation from Percona on how to do this, I started out by making a key. If you read that documentation page carefully, you may realize something.
This key is passed to the backup tool directly, and it is the same key that can decrypt the snapshot. That is called symmetric encryption. Because the same key works in both directions, any host that can create a backup also holds what is needed to decrypt one, which makes it less secure than asymmetric encryption. I decided to continue testing to see if simplicity still made this a viable approach.
Tests with very small DBs, a few hundred MB, were successful. The tool worked as expected and documented, but that was more of a functional test, and the real question was “what is the size of the penalty of encryption on our larger DBs?” The older legacy instances at SendGrid had grown to sizes ranging from 1-2 TB up to a single 18 TB beast. What I was going to use for the small instances had to also be operationally acceptable on the larger ones.
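As an illustration of what making such a key can look like, here is a small sketch. The function names and key file path are hypothetical; it assumes a 32-character key, which is the length AES256 needs.

```shell
#!/usr/bin/env bash
# Generate a 32-character key: 24 random bytes, base64-encoded.
# (`openssl rand -base64 24` produces a key of the same shape.)
generate_key() {
  head -c 24 /dev/urandom | base64
}

# Write the key to a file readable only by its owner.
# printf '%s' avoids a trailing newline, which would change the key length.
write_key_file() {
  local key_file="$1"
  ( umask 077; printf '%s' "$(generate_key)" > "$key_file" )
}

# Usage (hypothetical path):
# write_key_file /etc/mysql/backup.key
```

Whoever can read this file can both create and decrypt backups, which is exactly the symmetric trade-off described above.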
This is where testing and benchmarks got interesting
My first test subject of considerable size was a database that is 1 TB on disk. Very quickly I encountered an unexpected issue. With minimal encryption settings (1 thread, default chunk sizes), I saw the backups fail with this error:
At the time, these databases used 512 MB as the transaction log file size, and this is a fairly busy cluster, so those files were rotating almost every minute. Normally, this would be noticeable in the DB performance, but it was mostly masked by the wonder of solid state drives. It seems that without any parallel encryption threads (read: using just one), we spent so much time encrypting `.ibd` files that the InnoDB redo log rolled over from under us, which broke the backup.
So, let’s try this again with a number of encryption threads. As a first attempt, I tried with 50 threads. The trick here is to find the sweet spot of fast encryption without competing over CPU. I also increased the size of the `ib_logfiles` to 1 GB each.
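A sketch of a backup invocation with parallel encryption threads might look like the following. The wrapper functions, paths, and default thread count are assumptions for illustration; the flags are xtrabackup's documented encryption options.

```shell
#!/usr/bin/env bash
# Sketch only: paths and the thread count are assumptions to tune per host.

# Take an encrypted backup using N parallel encryption threads.
run_encrypted_backup() {
  local key_file="$1" target_dir="$2" threads="${3:-4}"
  xtrabackup --backup \
    --encrypt=AES256 \
    --encrypt-key-file="$key_file" \
    --encrypt-threads="$threads" \
    --target-dir="$target_dir"
}

# Decrypt a backup in place before preparing it; the same key file is used.
decrypt_backup() {
  local key_file="$1" target_dir="$2"
  xtrabackup --decrypt=AES256 \
    --encrypt-key-file="$key_file" \
    --target-dir="$target_dir"
}

# Usage (hypothetical paths):
# run_encrypted_backup /etc/mysql/backup.key "/backups/$(date +%F)" 50
```

The redo log size is a MySQL server setting rather than a backup flag: it is controlled by `innodb_log_file_size` in the server configuration.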
This was a more successful test that I was happy to let brew overnight. For the first few nights, things seemed good. The time to make a backup did not grow too much, but the box load average during the backup process definitely showed the added steps.
However, when I moved on to testing restores, I found that the restore time for the same backups, after adding encryption, had increased from 60 to 280 minutes, meaning a severe penalty to our promised recovery time in case of disaster.
We needed to bring that back to a more reasonable timeframe.
This is where teamwork and simpler solutions to problems shone. One of our InfoSec team members decided to see if this solution could be simplified. So he did some more testing and came back with something simpler and more secure. I had not yet learned about gpg2, so this became a learning exercise for me as well.
The good thing about gpg2 is that it supports asymmetric encryption. We create a key pair with a private and a public part. The public part is used to encrypt any stream or file you decide to feed gpg2, and the private part can be used to decrypt.
The change to our backup scripts to add encryption boiled down to this. Some arguments are removed to make this easier to read:
On the other end, when restoring a backup, we simply have to make sure a matching secret key is in the host’s key ring and then use this command:
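As a rough sketch of the shape of such a pipeline (not the exact SendGrid script), the backup can be streamed straight into gpg2. The function name, recipient, and paths here are hypothetical, and it assumes the recipient's public key is already in the backup host's key ring.

```shell
#!/usr/bin/env bash
# Stream a backup through gpg2 so only the public key is needed on this host.
# Assumes the calling script enables `set -o pipefail`, so that a failure in
# xtrabackup is not masked by a successful gpg2 exit status.
backup_encrypted_stream() {
  local recipient="$1" out_file="$2" work_dir="$3"
  xtrabackup --backup --stream=xbstream --target-dir="$work_dir" \
    | gpg2 --batch --encrypt --recipient "$recipient" --output "$out_file"
}

# Usage (hypothetical recipient and paths):
# backup_encrypted_stream backups@example.com /backups/db1.xbstream.gpg /tmp
```

With `--batch`, gpg2 will not prompt, so the recipient key needs to be trusted in the key ring ahead of time.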
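A restore along these lines could be sketched as follows. The function names and paths are hypothetical, and the key check relies on gpg2 returning a non-zero status when no matching secret key is found.

```shell
#!/usr/bin/env bash
# Return success if the key ring holds a secret key matching the given ID.
have_secret_key() {
  gpg2 --batch --list-secret-keys "$1" >/dev/null 2>&1
}

# Decrypt the backup stream and unpack it into the restore directory.
restore_encrypted_stream() {
  local in_file="$1" restore_dir="$2"
  mkdir -p "$restore_dir"
  gpg2 --decrypt "$in_file" | xbstream -x -C "$restore_dir"
}

# Usage (hypothetical key ID and paths):
# have_secret_key backups@example.com || { echo "no secret key" >&2; exit 1; }
# restore_encrypted_stream /backups/db1.xbstream.gpg /var/lib/mysql-restore
```

The unpacked files still need the usual xtrabackup prepare step before the server can use them.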
Colin, our awesome InfoSec team member, continued to test backups and restores using gpg2 until he confirmed that it had multiple advantages over xtrabackup’s built-in encryption, including:
- It was asymmetric
- Its secret management for decryption is relatively easily rotated (more on this below)
- It is tool agnostic, which means any other kind of backup that isn’t using xtrabackup could use the same method
- It used multiple cores on decryption, which gave us better restore times
Always room for improvement
One place where I still see room for improvement is how we handle the secret that can decrypt these files. At the moment, we keep it in our enterprise password management solution, but getting it from there and then using it to test backups is a manual process. Next in our plan is to implement Vault by HashiCorp and use it to pull the secret key for decryption seamlessly on the designated hosts, then remove it from the local key ring afterward, so that it is easily available for automated tests and still protected.
Ultimately, we got all of our database backups to comply with our SOC2 needs in time without sacrificing backup performance or disaster recovery SLA. I had lots of fun working on this assignment with our InfoSec team and came out of it learning new tools. Plus, it is always nice when the simplest solution ends up being the most suitable for one’s task.