In the previous post for this series, I talked about how we went about collecting requirements for this project, how we made a decision on what tool to use and how we planned the deployment to avoid service disruptions. As promised, this follow up post focuses on post deploy observations and what factors made this a successful story for us.
Prepared statements and command class filtering
Our initial design for this project intended to leverage a feature in the percona plugin that lets you filter the emitted events to either use a list of included commands or excluded commands as a way to limit the scope to what we actually need to audit.
The feature relies on the plugin marking every detected sql command with a class based on a list inherent to mysql and from there deciding in the output layer if the command class detected should be suppressed or emitted.
So we wrote the initial blueprint to leverage the audit_log_exclude_commands feature to exclude any sql commands that did not cause writes or changes to row data. We went for an exclusion list because we felt that an inclusion list carried the risk of missing any in scope command classes and could become a reason to have gaps in our audit logs.
However, on the first cluster we tested, we saw that all commands being audited were being classified as error even though it was clear that these statements ran successfully and in case of changing data were successfully doing that as well. We also saw that all queries were being emitted to the centralized syslog server regardless of that exclusion list which seems correlated with that error classification.
This meant that we were storing far more events in elasticsearch than we actually needed which will most definitely impact our ability to analyze and detect errant behavior in a timely manner. With collaboration with the Security team, we decided to inform Percona of the issue via their bug tracker but to also move the command based filtering down to the central syslog layer.
Like everything in engineering, the tradeoff here was to send more data through the DB networking layer and make the filtering in the centralized syslog layer more complex and in return that made our elastic search scaling needs more manageable and the configuration on the database side simpler.
One of the most crucial decisions in this project that saved us a lot of heartache is deciding to use the rsyslog facility for trapping and shipping these logs off to a centralized syslog server.
But nothing is without pros and cons.
This decision meant that the uptime of a separate system and the reliability of the network stack in between became crucial to providing the SLA we needed to give our audit team the confidence in this audit solution.
For those not willing to make that compromise and needing to keep the responsibility of the audit logs persistence within the DB stack, I recommend using the FILE handler in the audit plugin then using local logstash to take the logs and forward them to their analysis system of choice.
That latter choice means a lot more disk IOPs consumed away from the database process and taken up by the audit plugin and logstash and it means you have to carefully balance the disk space on your instances between your database table files, operational logs, and the audit plugin logs. Weighing your options is solely reliant on which is the easier to operate/acceptable by the business but you do have choices nonetheless.
In our case, the rsyslog choice proved to be the least impactful to our busier databases, peaking at about 5400 transactions/sec, during normal operations we saw no impact on performance. The audit plugin was chugging along, sending its events without impact to the database performance. All because we had elected an architecture that avoided consuming disk-based operations on the database side.
Past decisions paying off
One of the first concerns we had when we embarked on this project was its impact on performance. One of the database clusters in scope has seen very busy times and its applications are sensitive to lag so we needed to make sure we kept track of performance metrics like transactions per second, Disk and CPU utilization in order to compare numbers after deploying the plugin and see what the penalty for that is.
It was a happy surprise to see that we did not incur much of a penalty in normal operations. As mentioned before, because we decided to use the SYSLOG handler, it meant we normally were not needing to use any local disk capacity for tracking these audit events.
But we also did not want to plan based on ‘normal operation’ times only.
We also needed to know that when rsyslog needs to fallback to local spool files, that these events do not compound into service impacting outages for the more critical DB clusters. By testing rsyslog spooling behavior under close watch in production, we realized that the past few years’ of work to functionally shard our databases has gained us the unplanned benefit of lots of disk IOPs capacity to absorb events like rsyslog spooling to disk then needed to resend all these events.
We definitely noted the increase in disk utilization during these events but it never got the db instances to a critical state or affected customer-facing activity.
Like everything in software engineering, there are always tradeoffs. In this project, we considered a log delivery method that was more reliable and had less potential for lost logs. That would involve writing the logs to disk, then using something like logstash to ingest and send them off to analysis destinations for the security team to use.
But that would have meant significant disk IOPs consumption on the database side which we knew could potentially impact the quality of service of customer-facing calls to these databases.
We decided that our business needs would be sufficiently met by using resilient logging in rsyslog, using a reasonably sized spool, and using TCP to send the events to our log monitoring stack. Like everything in life, nothing is forever. And we are aware that this decision may need to be revisited in the future.
This project, while encompassing half a dozen clusters and a large number of instances, was completed in a single quarter while the DB ops team still kept the lights on in a still fast-growing business. That could not have happened without close collaboration between the DBE and the security team and not without strong product management that kept the scope finite and known to make sure we all kept our eyes on the end goal of making our business more secure.