Note: This post comes from SendGrid’s Engineering Team. For more technical engineering posts like this, check out our technical blogroll.
Database schema management ranges from a wild west anybody can “do it live” in production process to a waterfall-esque multi-step, multi-team review process where only one anointed individual can touch the production database.
As data contained in relational databases becomes more valuable to a company and the availability of the database becomes more important to the business, barriers to potential breaking schema changes pop up.
Early on, Database Administrators (DBAs) became the gatekeepers of the database in order to protect the database from bad things happening. But having an old-school DBA sitting between developers and their application’s database can cause a significant slowdown in an application’s development lifecycle, create silos of development and operations, and generate friction between teams.
In today’s micro service dev-ops oriented world, developers need to be able to manage database schema changes on their own because it is their data and they are ultimately responsible for the application’s performance and uptime. DBAs and Operations teams need to provide appropriate tools and advice to help development teams become the owners of their database.
How we manage schema
Our current schema management process uses a single Git repository to store the initial schema for all of our database clusters and contains all the follow-on changes to that schema as individual table alters/creates and drops are applied:
- A developer makes a schema change locally and generates an alter/create/drop statement and adds it as a pull-request to an integration branch.
- A set of Jira tickets for the Data Operations team are created to review and apply the schema changes to our testing/staging and production environments.
- A member of the Data Operations team reviews the requested change and applies the change to the testing/staging environment and merges the PR to the integration branch.
- The requesting developer tests the change in our testing/staging environments and approves the change to be pushed to production.
- Finally, Data Operations merges the integration branch to master and applies the schema change to the production environment.
Given the value of the data stored in our databases and the desire to have those databases up and running well all of the time, we settled on this byzantine sequence of events to protect ourselves from ourselves.
Protecting the database is one thing, but this process introduces several hurdles to making schema changes in a reliable and efficient manner:
- Reviewing and making schema changes happens on a twice-weekly cadence and can easily get derailed as multiple teams work on different databases all in the same Git repository and everyone is reliant on someone in the Data Operations team to review and make changes to various environments.
- Having one repository for all relational database schemas can lead to challenging release processes. A change to one schema that is ready for production can’t go to production if there are other schema changes that are not ready to get pushed to production but sitting in staging waiting on additional testing.
- The Data Operations team, which is a small team, becomes a bottleneck trying to manage which change can and cannot go to production and when. Scheduling conflicts and personnel availability can really slow down the release of new features or fixes to current applications.
- We are manually applying these changes to production systems using comments in pull requests and Jira tickets; sometimes copy paste can go horribly wrong.
Enter Skeema (and a few helpers)
In order to remove these process hurdles, make schema changes less prone to human error, allow developers to manage their own application’s schema, and potentially increase development velocity, the Data Operations team has put a great deal of effort into automating and streamlining the management of database schema.
We automated the application of schema changes from local development to production using our existing tools, Git, Buildkite CI, and pt-online-schema-change, with the addition of one more, Skeema.
The idea is to break up our monolithic DB-schema repository into individual schema repositories, one per database cluster, and allow developers to make their own schema changes in an environment that is familiar to them. We also want to have sane guardrails in place to help developers seek out additional assistance making large, complex or potentially destructive schema changes.
Skeema is a CLI tool that manages MySQL schema in a declarative fashion using SQL.
It can generate the data definition language (DDL) for each table in a database and export the DDL to a local file system for integration with a tracking repository via Git. Skeema can compare the SQL files in a Git repository to a live MySQL database and output those differences as DDL statements.
It can also be configured to use Percona’s pt-online-schema-change tool and format the necessary pt-online-schema-change command to match the running MySQL database’s schema to the schema defined in the Git repository.
Skeema is also able to manage schema in several environments, like local, testing, and production with different configurations in each. And finally, it can easily be adapted to a pull request based workflow.
Creating individual MySQL database schema repositories will break down our current monolithic db-schema Git repository and allow developers on separate teams to manage their application’s MySQL schema in their own repository instead of a shared repository (db-schema).
Having a separate repository for each database schema will allow greater autonomy to the application development team. This removes the need to coordinate all schema changes to a rigid schedule and allows changes to be moved to production as the application team desires.
A vital component of automating this process is Buildkite’s CI pipeline. We created a pipeline that:
- Checks for SQL syntax errors
- Creates a test MySQL server using the current master branch of the database’s schema and tests the application of the changes in the pull request (PR)
- Check differences and apply the PR changes to our testing MySQL environment
- Check differences and apply the PR changes to our staging environment and output some table statistics from the production environment
The production output statistics are table size on disk and estimated row counts. These stats can assist in determining if the schema change might cause some level of service interruption and might require special handling. Once the PR is merged to master, the buildkite pipeline checks differences between the master branch and what is running in production.
If the differences are the expected changes from the PR, the developer can unblock this final step and Skeema applies changes to the production MySQL database cluster. Each one of these steps is a blocking step that requires approval by the engineering team responsible for the requested change before moving to the next step.
As far as guardrails are concerned, we have configured Skeema to not allow destructive schema changes in production as a default.
Destructive changes are allowed in our testing and staging environments.
We also configured Skeema to use pt-online-schema-change to make schema changes. This is the same schema change tool that the DataOps team is familiar with and has been in use at SendGrid for many years. We have developed a set of reasonable options for pt-online-schema-change to roll back its changes if replication falls behind or active threads in the database become excessive.
Having Skeema configured in this way removes the potential errors of having manual steps for the application and hand coding of pt-online-schema-change commands by DataOps team members.
With the addition of programmatic guardrails, individual teams can be responsible for the management of their MySQL database schemas and the application of those changes to pre-production and production environments with relative safety. If guardrails are hit, the schema change will fail and is rolled back. Reasons for the schema change failure are output to the build logs for additional review.
Allowing developers to shepherd their changes from local test on a laptop to production greatly enhances developer autonomy and ownership of the database that is supporting their application. The automation and integration of Skeema into our MySQL database management process easily covers about ninety percent of our general schema change management tasks.
Most schema changes are for adding columns, changing enum fields, changing defaults, and adding indexes. The remaining ten percent of schema changes deal with special cases of large tables, very active databases, or partitioned tables. As of this post, Skeema does not yet deal with making schema changes to partitioned tables, but I hear it’s an often requested addition and the developer of Skeema is actively asking for help implementing that feature.
Combining Git, pt-online-schema-change, Skeema, and a Buildkite CI pipeline brings a reliable, repeatable, programmatic process to MySQL database schema changes. It empowers developers to safely manage their databases’ schema and control how fast features and fixes are rolled out to production.
Including appropriate guardrails in the configuration files for Skeema and pt-online-schema change, provides a measure of confidence for developers implementing schema changes and gives valuable feedback on possible ways to proceed with the schema change if those guardrails are hit.
The Data Operations team remains available to assist those remaining ten percent of cases that this process can not be applied to and will be working on additional tooling to enhance this process in the future.