Here at SendGrid we've spent the last couple of years porting a lot of our service and host monitoring to Sensu. Its solid API support meant we could write all sorts of tooling around it. We also liked the idea of standalone, client-side checks that push their status to the Sensu alerting queue asynchronously. If you are new to Sensu or haven't read up on it, this is a good place to start.
Typical usage example
The typical use I have for such standalone checks is health checks. A simple example looks like this in Sensu's client config:
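A minimal sketch of such a standalone health check, where the check name, plugin path, and thresholds are all placeholders:

```json
{
  "checks": {
    "check_disk_usage": {
      "command": "/etc/sensu/plugins/check-disk-usage.rb -w 80 -c 90",
      "standalone": true,
      "interval": 60,
      "handlers": ["default"]
    }
  }
}
```

Because the check is marked `standalone`, the Sensu client schedules it locally every `interval` seconds and pushes the result to the server, rather than waiting for a server-side subscription request.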
But if you look closer, all you really do is tell Sensu to run a command. So this can be…any command. This will be useful in just a moment.
What I am solving
Traditionally I have used the crond service for running local management tasks on databases, like rotating partitions and triggering backups. Along with that, we had a report that would check the logs of these jobs on each backup replica and make sure they ended in success messages. This is fragile for a number of reasons:
- Crond doesn't have any built-in monitoring. It has no concept of a job going "stale."
- Those reports (I will call them "watchers") are one more moving part adding complexity to the question "when was the last successful backup of my DB?"
- This setup is prone to race conditions. You must time the task and its watcher in cron exactly, or the watcher can preemptively trigger an alert, signaling failure when the backup simply isn't done yet. Any drift in the duration of the task will eventually make this happen (like a backup taking longer as a database grows).
- What if the watcher script didn't run? It is also in crond, either on the same host or on another host. Either you find yourself in a rabbit hole of who watches the watchers, or a human has to notice that a report didn't come out…and humans aren't good at remembering things.
- Changing the designation of a server means you must change it in a number of places or the watcher will watch the wrong host.
- We are striving to keep our stack boring. While newer technologies like Chronos and Rundeck provide more advanced scheduling, they also need a service discovery layer to do this right. That was a bigger undertaking and too much scope creep for what I was solving.
So I decided to make Sensu work to my advantage, with the help of Chef roles.
Partition rotation in Sensu
I started off by moving any credentials my partition rotation script needs into Sensu's redacted configuration. This is good practice for anything you put into Sensu that uses secrets. The credentials are added in /etc/sensu/client.json and then referenced in check configurations using :::secret_thingie::: notation.
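For illustration, custom attributes in /etc/sensu/client.json could look something like this (the host details, key names, and values here are all made up):

```json
{
  "client": {
    "name": "db-backup-01",
    "address": "10.0.0.5",
    "subscriptions": ["mysql"],
    "db_user": "partition_bot",
    "db_password": "supersecret"
  }
}
```

A check command can then reference these attributes as :::db_user::: and :::db_password:::, and the Sensu client substitutes the values when the check executes.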
Protip: Sensu doesn't know what to redact in client.json. You must also define the names of the keys you want redacted, like this:
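Something along these lines, assuming a hypothetical custom key named db_password alongside the stock "password" entry:

```json
{
  "client": {
    "name": "db-backup-01",
    "redact": [
      "password",
      "db_password"
    ]
  }
}
```

Any key listed in the redact array has its value replaced with "REDACTED" in Sensu's API output and logs. Note that defining redact replaces Sensu's default list rather than extending it.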
This is also not a deep-merged list. As you can see, I had to explicitly include "password" once I needed to add another key to that list.
Then I needed to define the new Sensu check that rotates the partitions. I am using a Chef resource as the code example.
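A sketch of what that resource could look like, assuming the community Sensu cookbook's sensu_check resource; the resource name, script path, pdb-parted arguments, and connection string are illustrative placeholders, not the exact production command:

```ruby
# Illustrative only: the check name, script path, and pdb-parted
# arguments below are placeholders, not real production values.
sensu_check 'rotate_db_partitions' do
  command '/usr/local/bin/pdb-parted --add --interval d ' \
          '--user :::db_user::: --password :::db_password::: ' \
          'h=localhost,D=mydb,t=mytable'
  standalone true
  interval 86_400                    # run once a day
  handlers ['default']
  additional(occurrences: 3)         # tolerate a few failures before alerting
end
```

The :::db_user::: and :::db_password::: tokens are filled in from client.json at check time, so the actual credentials never need to live in the check definition itself.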
Let’s inspect what just happened here…
command: pdb-parted is a very useful Perl script by the folks from PalominoDB (now at Pythian) for rotating partitions in a MySQL DB. This is the same line I used to maintain in a crontab file.
interval: how often this Sensu check runs. In this example the partitions are daily, so running the script daily was sufficient. The important part here is that your script is idempotent, and pdb-parted is. If it finds that the needed partitions already exist, it just outputs a message to that effect and exits cleanly.
occurrences: the number of allowed failures before alerting. It is nice to have a buffer, especially since I already configure the command to make partitions days or weeks in advance.
For those who use tools other than Chef and want to see what the final check configuration looks like, here it is, important bits redacted:
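A rendered check of this shape might look like the following; the command line and names are placeholders, with the credentials left as redaction tokens:

```json
{
  "checks": {
    "rotate_db_partitions": {
      "command": "/usr/local/bin/pdb-parted --add --interval d --user :::db_user::: --password :::db_password::: h=localhost,D=mydb,t=mytable",
      "standalone": true,
      "interval": 86400,
      "occurrences": 3,
      "handlers": ["default"]
    }
  }
}
```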
Partition rotation now has built-in monitoring. If the script exits with a non-zero code, the Sensu handlers take care of getting that knowledge to the predefined group of people using the method we want. We already have multiple on-call team rotations, escalation paths, and all the other stuff that comes along with monitoring at scale, so we got to leverage all of that rather than reinventing the wheel.
Planned future improvements
No solution is perfect. Neither is this one.
- We still run the risk of more than one machine in a cluster holding the primary role. Until service discovery is in place, this is something we can build a check for by leveraging Chef search.
- Now that tasks like backups run in a first-class tool in our stack, I can more easily add stats and graphs for how long backups take and how big they are, making capacity planning and MTTR tracking easier.