As SendGrid recently became a publicly traded company, we needed to have full audit logs of all changes to a subset of our databases. This subset included some small instances that didn’t see a large volume of traffic, but also some of our older clusters that are critical to our customer portal uptime and signup flow.

This also impacted some datastores that had high uptime demands, so an online rollout (without downtime to writes) was desirable. With an external deadline at hand and knowing the scope of the project, we set out to make a plan.

In the planning phase, we identified a few open questions that we needed answers to:

  1. What options do we have as far as software that can do this?
  2. What performance impact is expected and how much can we test that in advance?
  3. Can we do this without downtime for write availability?

Choices, choices

At SendGrid, we follow a design blueprint process for any large or cross-team project. This project involved a number of teams as stakeholders including:

  • Internal Audit as the team in charge of defining which datastores are in scope for this compliance control and for collecting evidence come audit time
  • InfoSec as the team that was to analyze and handle incident response derived from these database audit logs
  • DB Ops as the team writing the code that deploys, manages, and tunes this new feature to the databases in scope

I set out to write this blueprint and make sure it was reviewed by all stakeholders and approved by our architecture team. Even though I had done some research on audit logging in MySQL a couple of years back and had some idea of what the landscape of options looked like, I did not want to approach the project with preconceptions, so I still wanted to investigate more than one option and give everyone a solid case for why we would choose one solution over another.

This blueprint was unusual in that I was writing it while simultaneously investigating what the solution to the business need was going to be, so I knew that a lot of the details would be filled in as we did the research.

Our options were:

  • The Percona audit plugin
  • The McAfee audit plugin

While these are not the only options on the market, we felt they were the closest to our scale and devops practices to warrant being included in the actual bake-off. Other commercial options did not pass that bar to be included in the benchmark.

The latter was a newer solution, only recently open sourced by McAfee, and was intriguing enough to look into because it claimed to support table-level filtering, which the Percona plugin did not. Since we knew one of our in-scope clusters actually needed to audit about 24 tables out of a total of 300, this seemed a valuable enough feature to make the McAfee plugin a contender.

On the other hand, the Percona audit plugin was the most widely used open source solution for this feature. It comes built in, but disabled, in Percona Server, which we already use. However, it does not provide table-level filtering of events, which meant we would need to do that filtering outside the database layer.

Bake-off

With the help of our partner team at Pythian, we started the bake-off comparison. First, we compared how to install and tune each option. The team quickly determined that we had a tradeoff on our hands. While the McAfee plugin supported table filters natively, it did not support using rsyslog as a method of streaming its events. This meant that if we were to use it, we would have to write the audit files locally to disk and then manage shipping them off to the monitoring stack.

This was deemed undesirable, as it would most likely increase the performance penalty of using the audit plugin. Writing the audit events locally and then reading them again with a separate process takes IOPS capacity away from our actual production traffic and increases the risk to the database instances, since it means we have to manage the size of these files on disk lest they bloat and cause database downtime.

On the other side of the compromise was the Percona plugin. It supports sending events to syslog natively but does not offer any table filters. We knew this meant the busier clusters in scope would send a large number of events, most of which would not actually be from tables we want to audit. This presented a risk to the syslog-receive/logstash layer of InfoSec’s monitoring stack. Since DB Ops does not have access to that stack, it meant the success of this project was a shared-ownership endeavor.

Ultimately, we decided to use the solution with fewer moving parts and to plan our deployment so we would know as early as possible whether logstash needed to be scaled out to handle filtering thousands of events a second. So the decision was made to move forward with Percona’s audit plugin.

Deployment planning

Installation scope

We decided to keep this deployment simple by installing and enabling the plugin on all nodes in a given cluster, which eliminated the need to orchestrate turning the audit facility on or off whenever the writer node in a cluster changes. We had no concerns about the replication stream causing duplicate events as a change gets applied, because the plugin, by design, does not record replication stream changes.

No downtime

We wanted this to be a seamless deployment without downtime in the affected clusters. That would greatly reduce the amount of planning and orchestration needed with the delivery teams using these clusters and greatly reduce customer impact. But we also wanted the plugin to send its events to rsyslog, on a specific facility, with a custom config that would forward the events to the InfoSec team’s syslog aggregation server. Some of these configurations are documented by Percona as not dynamic, which presented the possibility that every instance in scope for this project would incur some downtime as we restarted the mysql instance with the required audit plugin config.

We started testing different orders of operation for deploying the plugin with a dedicated test instance in our staging environment, and we were able to show that if we had our configuration management lay down all the necessary configuration first and then ran the command to load the plugin, the plugin would start up with the desired configuration.
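For illustration, the order of operations looked roughly like this. The file path, facility, and values below are placeholders for the sake of the sketch, not our production settings, and the variable names are the ones Percona documents for its audit plugin.

  # laid down by configuration management first (path illustrative)
  # /etc/my.cnf.d/audit.cnf
  [mysqld]
  audit_log_policy          = QUERIES
  audit_log_format          = JSON
  audit_log_handler         = SYSLOG
  audit_log_syslog_ident    = percona-audit
  audit_log_syslog_facility = LOG_LOCAL5

  -- then, from a mysql session, load the plugin without restarting the instance;
  -- it picks up the settings above as it initializes
  INSTALL PLUGIN audit_log SONAME 'audit_log.so';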

That finding proved instrumental both to simplifying the plan for completing this project and to shortening the time to releasing it in production, buying us time to fine-tune event filtering and to work with the security team on analysis and detection.

Use configuration management

We use chef cookbooks to manage our databases, so naturally we planned to use chef to deploy, tune, and monitor the audit plugin. But since this needed to be enabled on only a subset of our clusters, we needed a way to control where it was enabled so as not to encumber our log storage with data not pertinent to our business goal here.

For managing MySQL databases, we use a wrapper cookbook model to manage configuration. A core ‘base’ cookbook defines the bulk of how the database instance should look, then a per-cluster cookbook wraps that to modify attributes or add configuration pertinent to the specific cluster. That design made it easy to put the bulk of the code that creates the needed configuration files and loads the plugin into one new recipe that we can toggle on and off based on a chef attribute. Given the scope of the change we were making, we also decided this warranted releasing the code as a new minor version of the cookbook.

We also made sure to have chef remove any related sensu checks and turn off audit streaming when the attribute was set to false. This ensures that chef can do the right thing if a cluster is ever deemed no longer in scope, or if auditing needs to be stopped for any intermittent reason, without us having to manually change a node to reflect the attribute change.
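As a rough sketch of that shape (the attribute, recipe, and file names here are hypothetical, not our actual cookbook code), the toggle looks something like this:

  # attributes/default.rb -- per-cluster wrapper cookbooks flip this on for in-scope clusters
  default['mysql']['audit']['enabled'] = false

  # recipes/audit.rb
  if node['mysql']['audit']['enabled']
    template '/etc/my.cnf.d/audit.cnf' do
      source 'audit.cnf.erb'
      notifies :run, 'execute[load-audit-plugin]', :immediately
    end

    execute 'load-audit-plugin' do
      command %q{mysql -e "INSTALL PLUGIN audit_log SONAME 'audit_log.so'"}
      not_if  %q{mysql -N -e "SELECT 1 FROM information_schema.PLUGINS WHERE PLUGIN_NAME='audit_log'" | grep -q 1}
    end
  else
    # tear down when a cluster is no longer in scope: the audit my.cnf snippet,
    # the rsyslog forwarding config, and the related sensu checks
    file '/etc/my.cnf.d/audit.cnf' do
      action :delete
    end
  end

The real recipe would also manage the rsyslog forwarding config and the sensu checks described in the next section; this sketch only shows the enable/disable skeleton.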

Monitoring

You can’t declare success on something you are not monitoring. But monitoring is also more than just slapping on some sensu checks without putting thought into what failure cases we are monitoring for and what action we expect in response to those checks failing someday. So I set out to plan the monitoring for this pipeline with two things in mind:

  • We all need to agree on an explicit ownership scope in this system, especially since it straddles the responsibilities of two teams with separate on-call rotations
  • Any new checks added for this need to come with a runbook, linked to in the check, explaining what the check is failing on and how to fix it

Given those two rules, I proceeded to add very specific checks. At this point, I considered also adding an end-to-end ‘synthetic’ check, but I refrained from doing that because an end-to-end check here would fail to tell us exactly which part of the system had failed, which means we would have a hard time even paging the right team for it. And I am not a proponent of paging people at night ‘just in case’.

We decided to monitor for the following:

  • Check the live mysql configuration of the audit plugin to make sure that
    • The plugin was in an active state
    • The audit_log_policy was set to QUERIES

This check confirms that a DB in scope didn’t have its configuration changed on the fly, since these settings are dynamic and could be changed outside the on-disk my.cnf file (a minimal version of this check is sketched after this list).

  • Check the port on which we send logs to the monitoring stack to make sure data is flowing. Basically, this makes sure the other end of the syslog stream is working. This check is handled as what sensu calls an aggregate check so that it does not egregiously page the InfoSec team
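A minimal version of the first check boils down to two queries against the live instance (the sensu wrapper around them is omitted here):

  -- the plugin should be loaded and in ACTIVE state
  SELECT PLUGIN_STATUS FROM information_schema.PLUGINS WHERE PLUGIN_NAME = 'audit_log';

  -- and the running value of the policy, not just the on-disk one, should still be QUERIES
  SHOW GLOBAL VARIABLES LIKE 'audit_log_policy';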

Hurdles along the way

Audit plugin configuration

One of the initial iterations of this project intended to leverage the audit_log_exclude_commands feature to limit the emitted events to only data or schema manipulation. We quickly learned that the list of commands this configuration operates on is much longer than one would expect.
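To get a sense of how long, one rough way to count the distinct command types the server knows about is to look at its statement instruments; the plugin’s command names track this list closely, though I would not treat it as an exact one-to-one mapping:

  SELECT COUNT(*) AS command_types
    FROM performance_schema.setup_instruments
   WHERE NAME LIKE 'statement/sql/%';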

rsyslog configuration

Here is something I did not know before this project. Rsyslog configuration is nearly its own language. For example:

  • Use a second @ in front of the remote destination to send logs over TCP instead of UDP. We wanted to use TCP to get a little more assurance that logs were actually being delivered.
  • rsyslog has a dedicated page on how to configure it for reliable forwarding, which proved really useful for a newbie to the tool such as myself.
  • rsyslog will, by default, also tee its data to /var/log/messages, which was undesirable in my case because this is a LOT of events. If you need to make the facility you are using not do that, you have to add local5.* ~ to the end of your config (see the sketch after this list)
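Putting those pieces together, the forwarding part of the rsyslog config ends up looking something like the following. The host name, port, and file name are purely illustrative, and this assumes the audit plugin is logging to the local5 facility as in the example above.

  # /etc/rsyslog.d/30-mysql-audit.conf
  # the double @@ forwards over TCP rather than UDP
  local5.* @@logs.infosec.example.net:514

  # then discard the facility locally so it is not also teed into /var/log/messages
  local5.* ~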

Fire Drill

I will talk about the performance implications for the databases in scope later, but since rsyslog was a crucial part of this design, we also needed to fire drill how rsyslog would behave when its remote destination was unavailable or not responding. The best way to do that was to cause disruption to that communication, using iptables rules in production, on one of the databases in scope that we knew had a high transaction throughput and therefore a large volume of audit events per second. Here is how that fire drill played out.

  • Confirm that audit events are flowing over the designated TCP port
  • Use an iptables rule to drop all traffic on that port: /sbin/iptables -A OUTPUT -p tcp --dport {PORT-NUMBER-HERE} -j DROP
  • Watch disk write activity and files in the configured WorkDirectory in rsyslog’s configuration. The file names will be based on the ActionQueueFileName of the facility receiving these events

As expected, files started spooling in this directory. We saw a spike in disk IOPS activity. Once the files for the defined queue name totaled in size to the value of ActionQueueMaxDiskSpace, rsyslog stopped creating these files, disk IOPS normalized, and it was clear that we were now dropping events on the floor at the rsyslog layer. What was more impressive to watch was that after we removed the iptables rule, rsyslog resent all the events it had spooled to disk, so we had no loss of events for our analysis store as long as we did not exceed the spool size. We learned a few things based on that experiment:

  • rsyslog behaved as documented. It is always better to prove that with first-hand experiments
  • We will very likely need to define a different queue disk space per cluster depending on the volume of events each generates. Since queuing on disk adds risk to IOPS capacity and disk capacity, it is something we will need to revisit and re-examine periodically (the relevant queue directives are sketched below)
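For reference, the disk-assisted queue piece of that forwarding action looks roughly like this. The directory, file name prefix, and sizes are illustrative, and ActionQueueMaxDiskSpace is the per-cluster knob referred to above.

  # where spool files land when the remote destination is unreachable
  $WorkDirectory /var/spool/rsyslog
  # an in-memory queue that is allowed to spill over to disk
  $ActionQueueType LinkedList
  # prefix for the spooled files we watched during the fire drill
  $ActionQueueFileName mysql_audit_queue
  # once the spool reaches this size, new events get dropped on the floor
  $ActionQueueMaxDiskSpace 1g
  # keep the queue across rsyslog restarts and retry the remote indefinitely
  $ActionQueueSaveOnShutdown on
  $ActionResumeRetryCount -1
  # the queue directives above apply to this (the next) action
  local5.* @@logs.infosec.example.net:514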

In the next post, I will talk about our observations after deploying this project to production and about the things that made this as non-disruptive to production as possible.



Silvia Botros is SendGrid's MySQL DBA. Her mission is to deploy and maintain various MySQL datastores that support the mail pipeline and other products that SendGrid offers and to drive MySQL designs from inception to production. Follow Silvia's opinions on MySQL, Web operations, and various topics on Twitter @dbsmasher.