Learning to Scale an Email-based App: Building for Today with an Eye on Tomorrow
Daniel Randa, October 24, 2012

The following post is a guest post by Mike Sun, a Senior Engineer at the team management tool iDoneThis. Learn more about Mike at the bottom of the post.

iDoneThis is a simple management application that emails your team at the end of every day to ask, “What’d you get done today?” Just reply with a few lines of what you got done. The following morning everyone on your team gets a digest with what the team accomplished the previous day to keep everyone in the loop and kickstart another awesome day.

Since our launch in January of 2011, we’ve gone from sending out a few hundred emails per day to delivering 1,000,000+ emails and processing over 200,000 incoming emails per month. Our daily email traffic now breaks down to roughly 10,000 incoming emails and 40,000 outgoing emails, concentrated over the 6pm hour in the US.

Make it Work, Now.

Before we launched, we built iDoneThis over a weekend in the most rudimentary way possible. I kid you not, we sent the first few batches of daily emails using the BCC field of a Gmail inbox. Primitive, yes, but the upshot is that we’ve had users on the site from the third day of its existence.

The evolution of our system to its present email handling capabilities reflects a saga common among startup applications: the constant struggle, with scarce resources, to balance development for immediate needs against maintaining agility and scalability for future growth.

From the beginning, we eschewed running our own mail server due to the administration overhead and, more importantly, the difficult task of keeping sent emails clear of spam filters. Avoiding blacklists, maintaining a high sender reputation, and monitoring delivery rates called for expertise that would have taken too much developer time. Using Gmail and the BCC field leveraged the whole of Google’s email expertise and met our immediate needs.
But as our traffic grew, piggybacking off Gmail became a nonviable solution: Gmail’s servers throttled the rate of our email delivery and processing.

People are Coming, Scale!

Our immediate needs demanded greater email processing capacity, but we still kept to a philosophy of trying to outsource as much of this functionality as we could via proper interfaces and protocols. SMTP and IMAP were well-known and well-supported protocols for sending and receiving email, so we explored third-party, high-volume email services supporting them. Popular email marketing solutions like Mailchimp didn’t allow for the amount of variation we wanted in our emails. We did a ton of research and found that SendGrid offered exactly what we needed.

Every user in iDoneThis is able to configure the specific hour at which their reminder emails and digest emails are sent to them. To process all of these email deliveries, we wrote a script that we called “sendmail” that ran as a cron job at hourly intervals to render the user-specific emails and send them through SendGrid’s SMTP interface. Sendmail used to crash once a week, but through iterative improvements, that now rarely happens. Sendmail began as a simple for loop but evolved into a database-backed finite state machine that carefully tracked and managed the status of each user email. Though we went through a period of growing pains in evolving sendmail from its error-prone first incarnation, understanding all the ways email delivery could fail gave us the knowledge we needed to properly design and build a robust, state machine-based sendmail. Iteration was an unavoidable necessity in this situation.

To process incoming email, we started with a 200-line script we called “getmail” to retrieve and parse incoming email over IMAP. The script was primitive and unreliable, however, leaving the database in a bad state anytime an exception occurred. Getmail had to be babysat, as it was run by hand.
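To make the state-machine idea behind sendmail a little more concrete, here is a minimal sketch of a database-backed delivery tracker. It is an illustration only, not iDoneThis's actual code: the `email_job` table, the state names, the transition table, and every function name below are assumptions, and SQLite stands in for whatever database the real system uses. The point it tries to show is that each email's status lives in the database and can only move along allowed transitions, so a failed render or send is recorded rather than crashing the whole run.

```python
import sqlite3

# Hypothetical states and allowed transitions (not taken from the real sendmail).
TRANSITIONS = {
    "pending": {"rendered", "failed"},
    "rendered": {"sent", "failed"},
    "failed": {"pending"},  # a failed job may be queued again for retry
    "sent": set(),          # terminal state
}


def init_db(conn):
    """Create the (hypothetical) job table that stores one row per user email."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS email_job ("
        "id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL, "
        "state TEXT NOT NULL DEFAULT 'pending', "
        "attempts INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def enqueue(conn, user_id):
    """Insert a new job in the 'pending' state and return its id."""
    cur = conn.execute("INSERT INTO email_job (user_id) VALUES (?)", (user_id,))
    conn.commit()
    return cur.lastrowid


def get_state(conn, job_id):
    row = conn.execute(
        "SELECT state FROM email_job WHERE id = ?", (job_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no such job: {job_id}")
    return row[0]


def transition(conn, job_id, new_state):
    """Move a job to new_state, refusing transitions the table does not allow."""
    current = get_state(conn, job_id)
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new_state}")
    conn.execute(
        "UPDATE email_job SET state = ? WHERE id = ?", (new_state, job_id)
    )
    conn.commit()


def mark_failed(conn, job_id):
    """Count the attempt and record the failure."""
    conn.execute(
        "UPDATE email_job SET attempts = attempts + 1 WHERE id = ?", (job_id,)
    )
    transition(conn, job_id, "failed")  # commits both updates


def process(conn, job_id, render, send):
    """Drive one job through render and send; record failures instead of raising."""
    try:
        body = render(job_id)
    except Exception:
        mark_failed(conn, job_id)
        return
    transition(conn, job_id, "rendered")
    try:
        send(body)
    except Exception:
        mark_failed(conn, job_id)
        return
    transition(conn, job_id, "sent")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    job = enqueue(conn, user_id=42)
    # render and send are stand-ins for template rendering and an SMTP call.
    process(conn, job, render=lambda job_id: "What'd you get done today?",
            send=lambda body: None)
    print(get_state(conn, job))
```

Because the state is persisted per email, a run that dies partway through can be resumed by selecting the jobs that are not yet in the `sent` state, which a plain for loop cannot offer.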
Errors commonly occurred as getmail attempted to retrieve, process, and store email contents in our database. The lesson learned was that getmail was trying to do too much and that email retrieval and processing had to be separated into two different components. This was a lack of foresight on our part. Eventually getmail simply retrieved email, and a separate parser module was built to handle the difficult tasks of dealing correctly with encodings and parts and applying heuristics for extracting relevant content from emails.

Good Thing There Was Some Foresight

Refactoring our incoming email processing engine required significant developer time, but the investment later proved worthwhile. Unlike hourly email delivery, incoming email from our users arrives at any time. As our inbound email traffic grew, using getmail to retrieve email via IMAP became increasingly inefficient: it had to constantly poll the email server, was still prone to failures, and led to slow processing times due to its single-threaded architecture.

SendGrid offered an alternative processing model through their incoming parse API: it would handle the reception of emails for us and, for each email received, make an HTTP POST request to our web application. Posts to a web API would happen in real time, concurrently, and with retransmission on errors. Switching our inbound email engine to use this SendGrid feature was greatly facilitated by our earlier decision to refactor email retrieval and parsing into separate modules. The SendGrid parse API simply replaced our getmail script, and web requests posted from SendGrid were sent through the parsing module.

Our continued growth has once again pushed our current email delivery system to its limits. A refactor and redesign has become necessary again.
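As an aside on the retrieval/parsing split described above, the sketch below shows what a standalone parsing function might look like using only Python's standard `email` package. It is an assumption-laden illustration rather than the iDoneThis parser: the function name `extract_reply_text` and the two heuristics (drop lines quoted with ">" and stop at an "On ... wrote:" line) are hypothetical. What it tries to show is that the parser takes raw message bytes and knows nothing about where they came from, so an IMAP poller and an HTTP handler could both feed it.

```python
from email import message_from_bytes, policy


def extract_reply_text(raw_bytes):
    """Parse a raw email message and return the text the user typed.

    Decoding of transfer encodings and charsets is left to the standard
    library; the reply-extraction heuristics are illustrative only.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)

    # Pick the text/plain body, whether the message is multipart or not.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    text = part.get_content()

    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            continue  # quoted text from the original reminder email
        if stripped.startswith("On ") and stripped.endswith("wrote:"):
            break  # everything after the attribution line is the quoted thread
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


if __name__ == "__main__":
    sample = (
        b"From: user@example.com\r\n"
        b"To: team@example.com\r\n"
        b"Subject: Re: What did you get done today?\r\n"
        b"MIME-Version: 1.0\r\n"
        b'Content-Type: text/plain; charset="utf-8"\r\n'
        b"\r\n"
        b"Fixed the login bug\r\n"
        b"> earlier quoted line\r\n"
    )
    print(extract_reply_text(sample))
```

Keeping the function free of any IMAP or HTTP code is the design choice that matters here: swapping the source of the raw message does not touch the parsing logic.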
But we’ve been heartened to see, once again, that an earlier design decision to incorporate a task queue/middleware engine (Celery+RabbitMQ) during the development of another feature will now greatly benefit the email-delivery refactor. To parallelize delivery, the email creation and SendGrid transmission functionality will be packaged into batches of atomic Celery tasks that will be distributed through RabbitMQ and executed by Celery workers. The task queue architecture allows the system to scale gracefully through the addition of concurrent workers, the pool of which can maximize its use of SMTP connections to SendGrid. Although some of the functionality built into sendmail will no longer be needed, it will still track email processing through its state machine model, a testament to the payoff of earlier iterations.

The Saga Continues

Our experience with email delivery and processing has taught us that it’s okay to focus on building for the present, provided that we remain mindful of the future. In particular, we’ve learned to outsource as much as possible, to use clearly defined abstractions and interfaces, and, for some things, to start simply and build knowledge and understanding through iterations.

There are still many other architectural and implementation improvements we could make to our system as a whole, let alone to our email delivery and processing components. But, as a small startup, we hold fast to key principles that provide us some foresight even as we heartily focus on the present.

About the Author: Mike Sun is a Senior Engineer at iDoneThis, an easy way to share and celebrate what your team gets done at work, every day, used by amazing companies like Zappos and Shopify. Read more from Mike on the iDoneThis blog and on Twitter at @mikesun.