Last year I talked about SendGrid’s decision to use Go as our primary development language. For the most part this has affected only new services. Recently, though, we completed a rewrite of one of our highest-load components in Go, and I thought I would share the story and some of the lessons learned.
As some background: almost a year ago, one of our engineers who was helping to make the case for Go rewrote the service that handles final customer IP selection and transmission of data to remote SMTP servers, both as a proof of concept and as an example of how awesome Go is. Some time later, at an internal hackathon, he made even more progress. While this was definitely interesting, it was always hard to make the case for replacing an existing system that works with something new just because it was cool.
In the late summer of last year, though, that changed. A downstream architectural change, made to deal with the performance impact of encrypting over 90% of our traffic with TLS, started putting much more load on those systems, on top of the significant increase in volume that happens at that time of year. This started causing delays for our customers’ emails, which was a very bad thing. We did some band-aid work to keep those delays down, but it was quite obvious that the service needed to be replaced with something that could actually handle our load.
To put this service in perspective, it is the last piece to touch traffic as it goes out to the final inbox. Because it is the piece that binds the outbound IP address used for each customer, there is only so much you can do by throwing more hardware at it. In fact, the more we split things up, the more delivery latency a customer will experience when they reach ISP rate limits.
We split off a few engineers to take the code that had been written previously and finish it off so we could replace the existing service with it. While we had initially hoped this would be an easy task (the code is already written, it should just work, right?), it became a several-month effort. Making this service perform better required a lot of critical sections in the code, and getting the locking right tripped things up quite a bit.
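To give a feel for the kind of critical section involved, here is a minimal sketch, not our production code. It assumes a hypothetical `IPLimiter` type that caps concurrent connections per outbound IP; the type, its methods, and the limit of 100 are all made up for illustration. The point is only that every read and write of the shared map has to happen while the mutex is held.

```go
package main

import (
	"fmt"
	"sync"
)

// IPLimiter is a hypothetical example type: it caps the number of
// concurrent connections allowed per outbound IP address.
type IPLimiter struct {
	mu     sync.Mutex     // guards active
	max    int            // per-IP connection cap (illustrative)
	active map[string]int // current connection count per IP
}

// NewIPLimiter returns a limiter that allows max connections per IP.
func NewIPLimiter(max int) *IPLimiter {
	return &IPLimiter{max: max, active: make(map[string]int)}
}

// Acquire reserves a connection slot for ip and reports whether one
// was available. The check and the increment form one critical section.
func (l *IPLimiter) Acquire(ip string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[ip] >= l.max {
		return false
	}
	l.active[ip]++
	return true
}

// Release frees a slot previously reserved with Acquire.
func (l *IPLimiter) Release(ip string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[ip] > 0 {
		l.active[ip]--
	}
}

func main() {
	lim := NewIPLimiter(100)
	results := make(chan bool, 1000)

	// 1000 goroutines race for 100 slots on the same IP.
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- lim.Acquire("192.0.2.1")
		}()
	}
	wg.Wait()
	close(results)

	granted := 0
	for ok := range results {
		if ok {
			granted++
		}
	}
	fmt.Println("granted:", granted) // prints "granted: 100"
}
```

Without the lock, the check and the increment could interleave across goroutines and hand out more slots than the cap allows. Running a sketch like this under `go run -race` is a cheap way to catch that class of mistake.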
We also had two competing needs: moving slowly so we didn’t break anything for customers, and getting this service deployed because the old one was already breaking for customers. And we knew that time was not our friend with the holidays coming (we processed just shy of 700M messages one day in December). Despite all the hurdles, our engineers pushed through and got things done and deployed, and we can definitely feel the difference.
As an interesting aside, when we were benchmarking how much traffic the new boxes could take, we had to load test against fewer CPUs than are available, because our test machine was having trouble saturating the service and we didn’t want to throw 400 boxes at just testing. We found that a single CPU could handle 50k concurrent connections with a throughput of over 350Mb/s. That’s about where the old boxes maxed out on bandwidth, and well above the several thousand connections they could handle.
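For anyone who wants to try a similar experiment, one way to restrict a Go program to fewer CPUs is `runtime.GOMAXPROCS`. This is a sketch of that general technique, not a description of exactly how we ran our tests (pinning the process at the operating system level is another option). The `limitCPUs` helper is a made-up name for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// limitCPUs is a hypothetical helper: it caps how many OS threads may
// execute Go code simultaneously and returns the setting now in effect.
func limitCPUs(n int) int {
	runtime.GOMAXPROCS(n)        // sets the new limit (returns the old one)
	return runtime.GOMAXPROCS(0) // an argument below 1 only queries the value
}

func main() {
	fmt.Println("CPUs available:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS now:", limitCPUs(1))
}
```

The same limit can be applied without a code change by setting the `GOMAXPROCS` environment variable before starting the program.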
I know most Go stories are along the lines of “I converted this big program from X to Go and replaced Y servers with 1 box”; we unfortunately can’t say that. While 1 box can push around 10Gb/s of traffic, we routinely hit over 20Gb/s, so 1 box isn’t enough. But we did solve a major bottleneck in our system, reduce send-time latency, and build the foundation for some architectural improvements we’ve been wanting to make for a while now. I’ll still call that a win 🙂