Note: This engineering post was written by our DBA, Silvia Botros, and originally appeared on the Sysadvent Blog in December 2016.
Companies have had and needed Database Administrators for years. Data is one of a business’s most important assets. That means many businesses, once they grow to the point where they must rapidly scale, need someone to make sure that asset is well managed, performant for the product’s needs, and recoverable in case of disaster.
In a traditional sense, the job of the DBA means she is the only person with access to the servers that host the data, the go-to person to create new database clusters for new features, the person to design new schemas, and the only person to contact when anything database-related breaks in a production environment.
Because DBAs traditionally fill such a unique role, their time is at a premium, and it becomes harder to think big picture when day-to-day tasks overwhelm. It is typical to resort to brittle tools like bash for all sorts of operational tasks in DBA land. Need a new DB set up from a clean OS install? Need to take, validate, or restore backups? Rotate partitions or stale data? When your most commonly used tool is bash scripting, everything looks like a nail. I am sure many readers are preparing tweets to tell me how powerful bash is, but please hold your commentary until after evaluating my reasoning.
Does all this sound like your job description as a DBA? Does the job description go into detail about upgrading servers, creating and testing backups, and monitoring? Most typical DBA job postings will make sure to say that you have to configure and set up ‘multiple’ database servers (because the expectation is that DBAs handcraft them) and automate database management tasks with (hand-crafted) scripts.
Is that really a scalable approach for what is often a team of one in a growing, fast-paced organization?
I am here to argue that your job is not to perform and manage backups, create and manage databases, or optimize queries. You will do all these things in the span of your job, but the primary goal is to make your business’s data accessible and scalable. This is not just for the business to run the current product but also to build new features and provide value to customers.
You may want to ask, why would I do any of this? There is an argument for continuing to execute the DBA role traditionally: job security, right? But consider that many tech organizations nowadays do one or more of the following:
- They are formed of many smaller teams
- They deliver features by creating many microservices in place of one or a few larger services
- They adopt agile methodologies to speed the delivery of features
- They combine operations and engineering under one leadership
- They embed operations engineers with developers as early as possible in the design process
A DBA silo within operations means the operations team is less empowered to help debug production issues in its own stack, is sometimes unable to respond to and fix issues without assistance, and is frankly less credible when demanding closer and earlier collaboration with the engineering teams if it isn’t practicing what it preaches inside Tech Ops.
So what can be done to bust that silo and make it easier for other folks to debug, help scale the database layer, and empower engineers to design services that can scale? Most up-and-coming shops have at most one in-house DBA. Can that one DBA be ‘present’ in all design meetings, approve every schema change, and be on call for a sprawling, ever-growing database footprint?
DBAs can no longer be gatekeepers or magicians. A DBA can and should be a source of knowledge and expertise for engineers in an organization. She should help the delivery teams not just deliver features but deliver products that scale, and empower them not to fear the database. But how can a DBA achieve that while doing the daily job of managing the data layer? There are a number of ways you, the DBA, can set yourself up for excellence.
Use configuration management

This is a very important one. DBAs tend to prefer old-school tools like bash for database setup. I alluded to this earlier, and I have nothing against bash itself. I use it a lot, actually. But it is not the right tool for cluster setup, especially if the rest of ops is NOT using bash to manage the rest of the architecture. It’s true that operations engineers know bash too, but if they are managing the rest of the infrastructure with a tool like Chef or Puppet while the databases are managed mostly by hand-crafted scripts written by the DBA, you are putting up an obstacle that keeps them from helping when an urgent change is needed.
Moreover, it becomes harder to help engineering teams self-serve and own the creation of the new clusters they need for new feature `foo`. You become the ‘blocker’ for completing work. Getting familiar with the configuration management at your company is also a two-way benefit: as you learn how the infrastructure is managed, you get to know the team’s standards, become more familiar with the stack, and are able to collaborate on changes that ultimately affect the product’s ability to scale.
A DBA who is tuned into the engineering organization’s product and infrastructure as a whole is invaluable.
Write runbooks

This is technically a subset of the documentation you have to write, but in my experience it has proven so useful that I feel it has to be called out separately. When I say runbooks, I specifically mean a document written for an audience that is NOT a DBA. There are a lot of production DB issues we encounter as DBAs that are simple for us to debug and resolve. We tend to underestimate that muscle memory, and we fall into the pattern of ‘just send me the page’ and we ‘take care of things.’
If your operations team is like mine, where you are the only DBA, it probably means someone else on the team is the first line of defense when a DB-related event pages. Some simple documentation on how to do initial debugging and data collection can go a long way toward making the rest of the operations team comfortable with the database layer and more familiar with how we monitor and debug it. Even if that event still results in paging the DBA, slowly but surely the runbook becomes a place for everyone to add acquired knowledge.
Additionally, I add a link to the related runbook section (use anchors!) to the page descriptions that go to the pager. This is incredibly helpful for someone paged about a database host at 3 AM, giving them a place to start. These things may seem small, but in my experience they have gone a long way in breaking down the mental barriers that keep my operations team from working on the database layer when necessary.
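As a sketch, a runbook entry aimed at a non-DBA first responder might look like the following. The cluster name, thresholds, and exact commands here are illustrative, not from a real runbook:

```markdown
## Replica lag on the users cluster

First responder steps (no DBA access needed):

1. Check current lag on each replica:
   `mysql -h REPLICA_HOST -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master`
2. If lag is under 5 minutes and shrinking, note it in the incident
   channel; no further action needed.
3. If lag is growing, look for long-running queries with
   `SHOW PROCESSLIST` and note anything running longer than 60 seconds.
4. Still stuck? Page the DBA and paste what you found so far.
```

The heading’s anchor (e.g. `runbook.md#replica-lag-on-the-users-cluster`) is what goes into the page description, so the person paged lands directly on the right section.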
As a personal preference, I write these as markdown docs inside my Chef cookbook repositories. This falls seamlessly into a pull-request, review, and merge pattern, and the runbooks become an integral part of the database cookbooks. As engineering teams start creating their own, the runbooks become a familiar template as new database clusters spring up all over the place.
Make the database layer visible

We like our terminal screens. We love them. The most popular tools in MySQL land are still terminal tools that live directly on the DB hosts and require prior knowledge of those hosts and of how to use the tools themselves. I am talking about things like innotop and the MySQL shell. These are fine and still helpful, but they were created for DBAs. If you do not want to be the gatekeeper for questions like “is there replication lag right now?”, you need better tools that make any cluster’s health, now and historically, available and easy to digest for all team members. I have a few examples in this arena:
We use read replicas to spread read load away from the primary, which means that once lag hits a certain threshold, it becomes a customer support event. It is important to make it easy for anyone in the company to know at any given time whether any cluster is experiencing lag, which servers are in that cluster, and whether any of the hosts has gone down. Orchestrator is a great tool in this regard because it puts visualizing clusters and their health a browser window away.
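The raw signal behind “is there replication lag right now?” is `Seconds_Behind_Master` from `SHOW SLAVE STATUS`. A minimal sketch of turning that number into a status anyone can read might look like this; the function name and thresholds are made up for illustration:

```python
def lag_status(seconds_behind, warn=60, critical=300):
    """Classify replica lag the way a dashboard or check script might.

    seconds_behind mirrors Seconds_Behind_Master: an integer, or None
    when replication is stopped or broken.
    """
    if seconds_behind is None:
        # MySQL reports NULL here when the replication threads are not running
        return "critical: replication not running"
    if seconds_behind >= critical:
        return "critical: lag %ds" % seconds_behind
    if seconds_behind >= warn:
        return "warning: lag %ds" % seconds_behind
    return "ok"
```

Wiring something like this into a check or a status page answers the lag question without anyone needing shell access to a DB host.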
Metrics for the DB layer need to live in the same place as metrics for the rest of the infrastructure. It is important for the team to be able to juxtapose these metrics side by side, and to have an easy way to see historical metrics for any DB cluster. You may have a personal preference for Cacti or Munin, or artisanal templates that you have written over the years, but if the metrics you use to investigate issues are not in the same place as the rest of the infrastructure metrics, you set up a barrier for other busy engineers, and they will be less inclined to use your tooling over what is in use elsewhere. Graphite is in wide use for ingesting metrics in modern infrastructure teams, and Grafana is a widely used dashboarding front end for metrics and analytics.
We use VividCortex to track queries on our critical clusters, and while this article isn’t meant to be an advertisement for a paid service, I will say this: inspecting the effect of deploys and code changes on running queries and query performance should not require special access to logs or manually crunching them. If VividCortex isn’t a possibility (although, seriously, they are awesome!), there are other products and open source tools that can capture even just the slow query log and put it in an easy-to-read web page for non-DBAs to inspect and see the effect of their code. The important point here is that if you provide the means to see the data, engineers will use that data and do their best to keep things efficient. But it is part of your job to make that access available and not a special DBA trick.
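To make the idea concrete, the MySQL slow query log records a `# Query_time:` header line before each captured statement, which is the data tools like pt-query-digest aggregate. A minimal sketch of extracting the over-threshold statements (ignoring multi-line statements and the other header fields) could be:

```python
import re

# Matches slow-log header lines such as:
#   # Query_time: 2.500000  Lock_time: 0.000000 Rows_sent: 1 ...
QUERY_TIME_RE = re.compile(r"# Query_time: (?P<qt>[\d.]+)\s+Lock_time:")


def slow_queries(log_lines, threshold=1.0):
    """Yield (query_time, statement) for entries over `threshold` seconds."""
    query_time = None
    for line in log_lines:
        match = QUERY_TIME_RE.match(line)
        if match:
            query_time = float(match.group("qt"))
        elif query_time is not None and not line.startswith("#"):
            # First non-comment line after the header: the statement itself
            if query_time >= threshold:
                yield (query_time, line.strip())
            query_time = None
```

Feed the result into a report or a simple web page, and engineers can see what their last deploy did to query times without touching the DB host.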
Fight the pager fatigue
A lot of organizations do not include database-layer scaling as an early imperative in their stack design, and they shouldn’t. In the early days of a company, you shouldn’t worry about how you will throttle API calls if no one is using the API yet. But it becomes appropriate to consider a few years later, when the product has gained traction: the API call that used to hit a table of a few thousand rows for a handful of customers now hits a multi-million-row table, and a couple of customers have built cron jobs that flood that API every morning at 6 AM in your timezone.
It takes a lot of work to change the application layer of any product to protect the infrastructure, and in the interim, allowing spurious database activity to cause pager fatigue is a big danger to both you and the rest of the operations organization. Get familiar with tools like pt-kill, which can be used in a pinch to keep a database host from suffering major downtime due to unplanned volume. Make the use of that tool known, and communicate the action and its effect to the stakeholder engineering team. It is unhealthy to try to absorb the pain of something you cannot directly change, and doing so ultimately won’t help the engineering teams learn how to deal with growing pains.
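The core decision pt-kill makes with its `--busy-time` option is simple: kill client queries that have been running longer than a threshold. A sketch of that matching logic in Python, using field names that mirror `SHOW PROCESSLIST` columns (the function name is made up):

```python
def kill_candidates(processlist, busy_time=120):
    """Return thread ids of queries running longer than busy_time seconds.

    processlist is a list of dicts mirroring SHOW PROCESSLIST rows;
    only active queries (Command == "Query") are considered, so idle
    connections and replication threads are left alone.
    """
    return [
        row["Id"]
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > busy_time
    ]
```

Seeing the rule spelled out makes it easier to explain to the owning team exactly which of their queries will be killed and why, before you ever run the real tool with `--kill`.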
There are a lot of ways a DBA’s work is unique to her role in comparison to the rest of the operations team, but that doesn’t mean it has to be a magical priesthood no one can approach. These steps go a long way toward making your work transparent, but most important is approaching your work not as the gatekeeper of a golden garden of database hosts, but as a subject matter expert who can provide advice, help grow the engineers you work with, and provide more value to the business than backups and query tuning (though those are fun too!).