Note: This is the fourth engineering blog post from Brad Culberson, one of our highest-ranking engineers here at SendGrid. If you’re interested in reading more posts like this, check out his other posts or view our technical blog roll.

At Toyota, the andon cord was created to increase product quality and decrease total costs. The cord was placed on the assembly line so that issues could be resolved quickly and directly. Any employee on the line who saw a problem could pull it. A light would come on after the cord was pulled, the line would stop, and the entire team would rally around the problem to resolve it before restarting the line.

Although stopping the entire assembly line for one problem might seem counter-intuitive, the cost to fix the problem later would outweigh any lost productivity from the line stoppage.

Pulling the andon cord at SendGrid

This concept has recently been embraced by SendGrid. What does it mean for us? Anytime a teammate discovers a potential issue that may affect a customer or team efficiency, we need to raise awareness of it immediately.

And we do it for the same reason as Toyota. The process of software development is quite similar to the vehicle assembly line because the longer an issue goes without being addressed, the more costly the fix will be. Depending on when it is caught, that issue may cost only development time, or it may also affect customers.

Last year, the Marketing Campaigns developers were seeing more and more problems and strain on our systems due to data and customer growth. Developers kept working on fixes as new problems arose, but a fundamental change needed to happen. The developers pulled the andon cord. In practice, that meant developers talking to management, architecture leadership, and product to say we needed to take action and move in a different direction.

We decided together that the current path of fixing the next bottleneck/problem was endless and put us in a position in which we had no idea when we’d hit the next problem that could affect customers. A solution was proposed to fundamentally shift to a system which scales horizontally into pods.

This would allow us to keep each part of Marketing Campaigns at a well-understood size and level of performance. No longer would we have to keep adding tickets to fix unanticipated issues caused by data or customer growth. In the worst case, those issues would have negatively impacted customers while we tried to fix the scaling problems.

What is most exciting to me is that within weeks of pulling the andon cord, the organization at SendGrid fully aligned to solve the problem. Management and product teams were fully supportive of the complete change in direction. And our operations team pulled off a miracle by getting us hardware and provisioning without any advance planning. New pods were provisioned, and new users were being assigned to a new pod.
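To make the pod idea a little more concrete, here is a minimal Python sketch of one way new users could be routed to capacity-capped pods. It is an illustration only, not the actual Marketing Campaigns implementation: the `Pod` and `PodRouter` names, the fixed per-pod user cap, and the "fill the newest pod, then provision another" rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Pod:
    """Hypothetical pod: an independent slice of the system with a fixed user cap."""
    name: str
    capacity: int
    users: Set[str] = field(default_factory=set)

    def has_room(self) -> bool:
        return len(self.users) < self.capacity


class PodRouter:
    """Assigns each user to exactly one pod and adds pods as the user base grows.

    Assumption for illustration: existing users never move, and new users
    always land on the newest pod until it reaches its cap.
    """

    def __init__(self, pod_capacity: int) -> None:
        if pod_capacity < 1:
            raise ValueError("pod_capacity must be at least 1")
        self.pod_capacity = pod_capacity
        self.pods: List[Pod] = []
        self.assignments: Dict[str, Pod] = {}

    def _provision_pod(self) -> Pod:
        # Stand-in for real hardware provisioning, which is out of scope here.
        pod = Pod(name=f"pod-{len(self.pods) + 1}", capacity=self.pod_capacity)
        self.pods.append(pod)
        return pod

    def assign(self, user_id: str) -> Pod:
        # Known users stay on the pod they were first assigned to.
        if user_id in self.assignments:
            return self.assignments[user_id]
        # New users go to the newest pod, or to a fresh one if it is full.
        if self.pods and self.pods[-1].has_room():
            pod = self.pods[-1]
        else:
            pod = self._provision_pod()
        pod.users.add(user_id)
        self.assignments[user_id] = pod
        return pod
```

Because every pod is capped, growth shows up as additional pods rather than as one system getting steadily larger, which is what keeps each pod at a known size and performance level.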

I think it’s rare that a company the size of SendGrid can still pivot immediately to solve a real problem. It was exciting to be on a team where all members stood arm in arm and solved the problem long before customers were substantially impacted.

Brad Culberson
Brad is a Principal Engineer II at SendGrid. He's an expert in wrangling complicated distributed systems and massive datasets, but has the most fun designing and building simple solutions that are massively scalable. Follow his interests and opinions on Twitter @bculberson and his code on GitHub @bculberson.