Note: This is the fifth engineering blog post from Brad Culberson, one of our most senior engineers here at SendGrid. If you're interested in reading more posts like this, check out his other posts or view our technical blog roll.
SendGrid has a huge variety of customers, ranging from very small companies to some of the world's largest tech companies. From the beginning, we knew it would be critical that our email marketing solution, Marketing Campaigns, be able to accommodate extremely large volumes of email, and that this would be a big challenge.
One of the earliest problems we tackled was the demand and load generated by huge insertions or updates of recipient lists. A customer asking to insert 20 million recipients could easily overload the production system. To solve this challenge, we use Riak, a key-value store based on a hash ring topology, and we index that data with SolrCloud, which allows for sharding and distribution of the data.
These technologies allow us to support a large number of updates per second. However, a single operation updating 20 million recipients would still impact reads and reliability during processing. To solve this, we added another technology, Kafka, as a data pipeline, so that we could rate limit writes per second and keep the underlying datastores responsive to reads for all customers. And by partitioning our users in Kafka, we limit the impact any one customer can have on others to about 1% of the customer base.
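To make the partitioning idea concrete, here is a minimal sketch in Go of the kind of key-hash routing a Kafka producer performs. The partition count, customer IDs, and hash choice are illustrative assumptions rather than our production configuration; the point is simply that keying by customer pins each customer to one partition, so with 100 partitions a noisy customer can occupy at most 1% of the pipeline.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numPartitions is hypothetical; with 100 partitions, any single
// customer's writes land on exactly one partition, i.e. at most
// 1% of the pipeline's capacity.
const numPartitions = 100

// partitionFor mimics a key-hash partitioner: every message for a
// given customer routes to the same partition, so a burst from one
// customer queues behind itself instead of behind everyone else.
func partitionFor(customerID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(customerID))
	return h.Sum32() % numPartitions
}

func main() {
	for _, id := range []string{"customer-42", "customer-7", "customer-42"} {
		fmt.Printf("%s -> partition %d\n", id, partitionFor(id))
	}
}
```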
The data pipeline is also very useful when customers send a large campaign. We track successful sends, opens, and clicks for each email sent, which has a big multiplier effect: every delivery produces at least one tracking write, and opens and clicks add millions more. For example, a campaign of 10 million recipients could easily generate 13 million or more requests into our infrastructure in a very short period during and after a send. This is very similar in scale to an update of 20 million recipients, but it happens for every single send.
At this scale, it's not possible to offer both real-time updates and high availability. To keep the production systems highly available, we rate limit the writes in the data pipeline. What does this mean for our customers? During very heavy load, they will see delays to recipient updates, but all updates sit in resilient queues and we catch up as quickly as we can.
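As a rough illustration of what rate limiting the pipeline's writes looks like, here is a small Go sketch using a token-bucket limiter from golang.org/x/time/rate. The write budget, queue shape, and apply callback are assumptions for the example; the real limits are tuned per datastore.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// writeBudget is a hypothetical write budget; the real limits are tuned
// so reads stay fast for every customer while the queues drain.
var writeBudget = rate.NewLimiter(rate.Limit(5000), 500) // ~5k writes/sec, burst 500

// drainQueue applies queued recipient updates no faster than the write
// budget allows. Updates are never dropped; under heavy load they simply
// wait longer in the queue.
func drainQueue(ctx context.Context, updates <-chan []byte, apply func([]byte) error) error {
	for update := range updates {
		if err := writeBudget.Wait(ctx); err != nil {
			return err // context cancelled while waiting for a token
		}
		if err := apply(update); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	updates := make(chan []byte, 2)
	updates <- []byte("upsert a@example.com")
	updates <- []byte("upsert b@example.com")
	close(updates)

	_ = drainQueue(context.Background(), updates, func(u []byte) error {
		fmt.Printf("applied: %s\n", u)
		return nil
	})
}
```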
We also give recipient updates higher priority than send, open, and click data, so when you make recipient changes, they are applied as quickly as possible. Ultimately, we accept these slight delays in order to prioritize the availability of our systems over the immediate visibility of recipient changes.
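The prioritization can be sketched as a consumer that always drains the recipient-update queue before the engagement queue. The channel-based queues and string payloads below are stand-ins for the example, not our actual pipeline types.

```go
package main

import "fmt"

// nextUpdate drains the high-priority queue (recipient updates) before
// touching the low-priority queue (send/open/click events).
func nextUpdate(recipientQ, engagementQ <-chan string) string {
	// Always prefer a pending recipient update (non-blocking check).
	select {
	case msg := <-recipientQ:
		return msg
	default:
	}
	// Otherwise take whichever queue produces something next.
	select {
	case msg := <-recipientQ:
		return msg
	case msg := <-engagementQ:
		return msg
	}
}

func main() {
	recipientQ := make(chan string, 1)
	engagementQ := make(chan string, 1)
	engagementQ <- "open event"
	recipientQ <- "recipient update"

	// The recipient update wins even though the open event arrived first.
	fmt.Println("processed first:", nextUpdate(recipientQ, engagementQ))
}
```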
How long could these delays be? This changes continuously as we add customers and make improvements; we're always adding clusters of servers and optimizing the current system. Last week, for example, the high-priority queue, which carries recipient updates, kept most delays well under 1 minute, with a maximum delay under 15 minutes. The second-highest queue, which contains engagement updates, kept most delays well under 1 hour, with a maximum delay under 4 hours.
We introduce these delays to provide a reliable system and to guarantee we'll never lose any of your recipient information.
We value a fully functional and highly reliable system over the speed of updates. If you are working on an integration and need to know how far behind our queues are, you can use the status endpoint to check for delays. That endpoint will report a delayed status when there are any substantial delays on the high-priority queue. We suggest customers reduce recipient updates while they are experiencing delays to be good netizens.
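If you want to automate that check, something like the following Go sketch could work. The endpoint path, response shape, and field names here are assumptions for illustration; consult the API documentation for the actual status endpoint and payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// statusURL is a placeholder; see the API docs for the real endpoint.
const statusURL = "https://api.sendgrid.com/v3/contactdb/status"

// queueStatus assumes a simple JSON payload; the actual response
// shape may differ.
type queueStatus struct {
	Status string `json:"status"` // e.g. "ok" or "delayed"
}

// shouldPause reports whether an integration should back off on
// recipient updates because the high-priority queue is delayed.
func shouldPause(apiKey string) (bool, error) {
	req, err := http.NewRequest(http.MethodGet, statusURL, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var s queueStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return false, err
	}
	return s.Status == "delayed", nil
}

func main() {
	delayed, err := shouldPause("SG.your-api-key")
	if err != nil {
		fmt.Println("status check failed:", err)
		return
	}
	if delayed {
		fmt.Println("queues are delayed; pausing recipient updates")
	}
}
```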
In the future, we'll continue to work on increasing processing rates, thus decreasing delays. We may also add functionality that lets us isolate individual users' queues.