Note: This is the third of three engineering blog posts from Brad Culberson–one of our highest ranking engineers here at SendGrid. If you’re interested in reading more posts like this, check out his other posts or view our technical blog roll. 

“If we did all the things we are capable of, we would literally astound ourselves.”

-Thomas A. Edison

When you get deep into the details sometimes you miss the easiest solution–one you already know and are capable of because you have already done it once.

Seeing problems from data growth

Last year, we were battling some major scaling issues in Riak search which is based on Solr. In our original design, we partitioned users amongst many Solr cores and the cores grew to be over 400GB. When Solr gets an “update” it will insert that data and mark the previously stored data for cleanup. Depending on the configuration, a merge operation will run which cleans up the index file(s).

Since our dataset is frequently getting modified, we hit a critical point where the merge operations were taking longer and longer. The nodes had started using more ram than we had available, and merges had become less and less effective. Because of this, our indexes had as much deleted data as actual data and the memory needs were frequently crashing the Solr nodes.

Finding the limitation and trying to solve for it

We reached out to Lucidworks because we knew they were masters at configuring Solr. The Lucidworks team worked with us to analyze memory dumps of our Solr nodes just before they ran out of memory. They pinpointed fairly quickly that the merge operations (and our version of Solr) were the root cause as huge parts or all of an index would be loaded into memory to perform a single merge.

The newer versions of Solr had different merge routines but that would require a major version upgrade and we would have to completely decouple from the Riak/Solr integration we were using. We already had a project in progress whose purpose was to get us migrated to SolrCloud, but it was a good deal of work and we needed something to get us stable quickly.

If we had enough ram to load the largest index into we thought we would likely restabilize. We first added swap (128GB) because that was the quickest thing we could do before getting additional ram ordered and installed.  Merges were still not keeping up, but the nodes quit crashing–which was a huge win and gave us some breathing room.

We don’t have to solve for it if we avoid it

Knowing our SolrCloud project still needed a decent amount of work/testing and our dataset was still growing at a fast pace, we needed another solution. We were running a fairly large Riak/Solr cluster mid-year and were having no issues with it.  If only we could get back to a cluster of that size. That’s the solution.

Let’s get back to where we know the dataset is a size that the system works. How is that possible with data growth and additional users? If we ran 2-3 clusters and split the data among them we know the existing tech solution would be sufficient. As data grows, we could add another cluster…and etc. forever. Stick with what we know works, but just do more of it to add additional scale.

Continue to add “pods” which are clusters of a size we know work well and hold a huge set of data. Continually move data and shard users appropriately to balance those clusters and plan out new hardware to prepare for ANY growth we might hit.

Lessons learned

We are still looking forward to SolrCloud as the solution solves many problems we have hit for features, functionality, and stability of our clusters, but we can make sure we take our time to fully vet the change.

We are very competent running a large cluster of Solr/Riak, but we know our capabilities and our limits. We can now size the clusters such that we won’t hit it again. The last few quarters we went from being worried about the next scalability issue to knowing exactly what to expect. If any of your systems are hitting limits, consider splitting them into smaller pieces so you don’t have to solve the problems introduced at that limit.

Brad Culberson
Brad is a Principal Engineer II at SendGrid. He's expert in wrangling complicated distributed systems and massive datasets, but has the most fun designing and building simple solutions that are massively scalable. Follow my interests and opinions on twitter @bculberson and my code on github @bculberson.