Note: This is the second of three engineering blog posts from Brad Culberson–one of our highest ranking engineers here at SendGrid. If you’re interested in reading more posts like this, check out his first post here or view our technical blog roll. 

SendGrid is using Riak and Solr for the Marketing Campaigns contacts dataset. We built a disaster recovery plan that protects that dataset from corruption/unintentional deletion/data center loss. This is a challenging task due to the size and growth of that data. We’re currently storing approximately 2TB of key value data and 18TB of Solr index data.

We evaluated several solutions that could have solved the problem:

  1. Riak ring replication
  2. File system backups
  3. Data extraction
  4. Kafka log backups

Riak ring replication

In order to vet this solution, we built a disaster recovery ring and began a full sync. After tuning, we were able to get a full sync to complete within 48-72 hours. This solution is quite simple as we already pay Basho for replication licensing and have the replication tools on the servers. A negative effect of this solution is that any time we may have a data center loss or ring loss we could have lost up to 72 hours of data depending on where in the process the full sync was. Basho also has a real time sync. That does couple the two rings together on every write–something we did not want. The other negative result of this solution is that depending on what causes the “disaster,” many scenarios could also corrupt the replication cluster.

To find out more about Riak replication, read their docs.

File system backups

Backing up the file system is a common approach. This would have worked, but our biggest concern was the disk IO necessary to read and back up 20TB of data and then the compute and resources needed to compress, encrypt, and transfer that data outside the data center. The product goal was to have at least a full weekly and daily incrementals. This solution would have made the fulls possible but we would have needed to do them daily to keep the loss limited. Due to the cost and waste from redundancy included in the backups, we ruled out this option before any implementation.

Data extraction

This solution would build an endpoint in our application server that would allow extraction and restore user data. We could then run a job(s) in Chronos which iterates all users and performs that data extraction, compresses, encrypts, and shuttles the data to another data center.

This was very attractive for a few reasons:

  1. We aren’t backing up 3x the dataset size based on the n-val of the Riak ring, and are not backing up the huge Solr indexes.
  2. We can prioritize restores/backup/recovery per user.
  3. No “unknown” technology.

This ended up being the best choice for us for the full backups we wanted weekly. We added the endpoint and the Mesos job and scheduled user_id % 7 daily so that each day 1/7 of our users are backed up.

After analyzing the total recovery time of our ring, we have started to build and share the data to smaller Riak rings using the same technology such that the total recovery of the dataset meets product requirements and only affects a % of our users.

This did not solve the incremental backup we desired, but we had an ace up our sleeves for that (Kafka).

Kafka log backups

All writes that go through our contacts database go through Kafka. We do that so that we can protect the Riak and Solr cluster from being overloaded and still respond synchronously to requests when we have spikes in updates of the contact data.

Since we were using Kafka for all updates, we had the capability to add a consumer to the Kafka topics to transform and shuttle the insert, updates, and deletes off site creating a copy of our transactions off site. If we need to restore the ring, we can then pull the Kafka data based on the user and dates. We rehydrate the Kafka topics with those messages so that our workers will bring the contact dataset up to date. This is a great solution for us because it gives us something which can be used to restore to any point in time and it puts no load on the production Riak/Solr cluster for backup. It also uses current code paths so that it is extremely unlikely to break under future development.

Look inside the box

In order to back up Riak we were able to rely on our existing development capabilities instead of 3rd party tools which were extremely wasteful. This proved to be very successful as there are no great solutions out of the box to back up Riak/Solr.

Since all our updates go through an async bus, solving for incremental backups was fairly simple and easy to maintain.  If you’re running all your updates through a pub/sub system like Kafka, consider building your own incremental restores. You already have the developers and the capabilities to subscribe to the bus and the code to process the messages, so I’d propose you reuse that knowledge and build your own flexible backup and restore system.

Brad Culberson
Brad is a Principal Engineer II at SendGrid. He's expert in wrangling complicated distributed systems and massive datasets, but has the most fun designing and building simple solutions that are massively scalable. Follow my interests and opinions on twitter @bculberson and my code on github @bculberson.