Lessons learned from pain, refactors, and more pain

Note: This post, which originally appeared on bogacz.io, comes from SendGrid’s Engineering Team. For more technical engineering posts like this, check out our technical blogroll.

SendGrid has been using Terraform for the better part of a year to provision and manage everything from VPCs to Athena, Redshift clusters, EC2 jump boxes, ECS clusters, and much more.

We have used Terraform to deploy completely new services into AWS and we’ve used it to help us migrate on-prem services with live production workloads and data to AWS.

While the learning curve was tough at first, it was a worthwhile investment that has ultimately helped us move faster as an engineering organization. Here is a rough guide to Terraform that we’ve put together as a result of our various experiences.

Index

  1. Brief Overview of Terraform
  2. Writing Terraform
  3. Testing Terraform
  4. Troubleshooting Terraform

Brief Overview of Terraform

What is Terraform

Terraform is a tool to enable Software/DevOps engineers to reconcile infrastructure definition with software best practices. These best practices aim at helping to solve many problems that non-infrastructure code has dealt with for decades: versioning, auditing (this one is important: we need to be able to audit all of our infrastructure programmatically in order to meet our compliance goals), and code reuse. In this sense, Terraform is similar to several other IaC (Infrastructure-as-Code) tools out there, like Puppet, Ansible, Chef, etc.

At its core, Terraform is effectively a dependency solver [1]. It uses its own DSL built on HashiCorp’s HCL to abstract from provider-specific resources (e.g. AWS vs. GCP), and allows the evaluation of the dependency graph to be decoupled from these specifics.

Building this dependency graph is what happens in the plan stage by reading the current state using a provider’s Read calls and diffing with the desired state. This phase is mostly, but not entirely, decoupled from the provider’s Create/Update/Delete API calls, so as long as a given resource wasn’t egregiously misconfigured (that is, all required input is present, no non-existent outputs are referenced, etc.), the plan will succeed.

It’s only in the apply and destroy steps that Terraform actually runs Create, Update, and Delete commands against the resource’s provider.

The relative platform-agnosticism of the plan phase is a common Terraform stumbling block. Since the plan phase doesn’t run any of the resource creation/update calls against the provider (AWS in our case), there’s no way that it can guarantee that we actually have a valid configuration from the provider’s perspective. It’s important to always make sure that Terraform can be applied error free before merging into master.
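As a small illustration of a configuration that plans cleanly but cannot be applied, consider the sketch below. The resource identifier and bucket name are hypothetical, chosen only because S3 bucket names are globally unique and a short, generic name is almost certainly taken by another account.

```hcl
# Hypothetical example: "logs" is an assumed bucket name picked because it is
# almost certainly already owned by some other AWS account.
resource "aws_s3_bucket" "logs" {
  # `terraform plan` only checks that the arguments are well-formed, so it
  # happily reports one resource to add. The name collision only surfaces
  # when `terraform apply` issues the CreateBucket call to AWS.
  bucket = "logs"
}
```

The plan for this resource succeeds, while the apply is rejected by S3 because the bucket name is already in use.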

When troubleshooting, a good way to distinguish a Terraform issue from an AWS configuration issue is to determine whether it occurred at plan time or apply time, respectively.

Providers

Terraform providers map Terraform’s DSL to the respective provider’s API calls in order to perform the necessary CRUD operations. While Terraform supports a large number of providers, here at SendGrid we typically just use the AWS provider and the DNSMadeEasy provider, which lets us create custom subdomains under the sendgrid.net domain that it manages.

Resources

Terraform resources are the smallest building block for all of the Terraform work you’ll have to do. Resources are provider-specific (e.g. GCP has different resources than AWS). Resources are fully-managed by Terraform, from creation to deletion. Depending on the resource type, different arguments (e.g. name, count, etc.) are considered unique, and can cause Terraform to re-create, or update the underlying resource. For example, when creating an RDS cluster, the identifier, if set, forces the creation of a new resource. This occurs any time that the identifier is computed, even if it computes to the same value that the resource previously had (e.g. my-cluster-${var.region}).

The way to get around this set of problems is to leverage the resource lifecycle block. Specifically, we can tell Terraform to ignore changes to particular attributes (via ignore_changes) when determining what actions need to be taken. This way we can still set those attributes programmatically while avoiding the re-creation that the recomputation would otherwise trigger.

NB: This can be dangerous if you’re ignoring an attribute that you believe you will want to actually change.

From our RDS example, the lifecycle block would look something like this:
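A minimal sketch of such a lifecycle block, written in Terraform 0.12+ syntax. The resource identifier, engine, and variable names are assumptions for illustration; on Terraform 0.11 the entry in ignore_changes would be a quoted string instead.

```hcl
variable "region" {
  type = string
}

variable "master_password" {
  type = string
}

resource "aws_rds_cluster" "main" {
  # The identifier is computed from a variable, as in the example above.
  cluster_identifier = "my-cluster-${var.region}"

  # Assumed values, only here to keep the sketch complete.
  engine          = "aurora-mysql"
  master_username = "admin"
  master_password = var.master_password

  lifecycle {
    # Do not treat a recomputed identifier as a reason to replace the cluster.
    # Terraform 0.11 form: ignore_changes = ["cluster_identifier"]
    ignore_changes = [cluster_identifier]
  }
}
```

The trade-off is the one noted above: once cluster_identifier is ignored, an intentional rename will also be ignored until the entry is removed.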

Data Sources and Remote State

Data sources are akin to Terraform resources, with the important distinction that they are read-only objects. A common use-case for data sources is when you’d like to get some information at runtime, such as the current region, current account id (aws_caller_identity), or getting the value of some AWS aliased resource, such as the default account KMS key.
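As a rough sketch of these runtime lookups (Terraform 0.12+ syntax): the identifiers and output names below are our own choices, the KMS alias assumes the AWS-managed S3 key already exists in the account, and attribute names can differ between AWS provider versions.

```hcl
# Region that the configured AWS provider points at.
data "aws_region" "current" {}

# Account and identity that Terraform is running as.
data "aws_caller_identity" "current" {}

# AWS-managed default key for S3, looked up through its alias.
data "aws_kms_alias" "s3_default" {
  name = "alias/aws/s3"
}

output "region" {
  value = data.aws_region.current.name
}

output "account_id" {
  value = data.aws_caller_identity.current.account_id
}

output "s3_default_key_arn" {
  value = data.aws_kms_alias.s3_default.target_key_arn
}
```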

Data sources are often also used to fetch information about resources that were created as part of some separate Terraform, either in another project, or as a more persistent type of resource, e.g. VPCs.

While this approach works, it’s preferable to leverage remote state data sources to have the resource import be less brittle. The difference stems from the fact that data sources must be looked up by fields supported by the data source, whereas remote state actually reads a remote state file. So for example, if I have a VPC named application-vpc-staging-us-east-1, I could define the data source either by using that name, or by filtering by some tags (although it would have to be the only resource to match those tags).

If the tagging structure changes, or the name is changed for some reason, then Terraform will fail at plan time as it tries to resolve data source values.
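A sketch of the tag-based lookup described above, in Terraform 0.12+ syntax. It assumes the VPC carries its name in the conventional Name tag; the identifier and output name are hypothetical.

```hcl
# Looks the VPC up by tag. Exactly one VPC must match, otherwise the
# lookup fails while Terraform is planning.
data "aws_vpc" "application" {
  tags = {
    Name = "application-vpc-staging-us-east-1"
  }
}

output "application_vpc_id" {
  value = data.aws_vpc.application.id
}
```

Renaming the VPC or changing its tags breaks every configuration that contains this block, which is the brittleness in question.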

Using remote state allows us to identify resources by object (resource or module), type (e.g. aws_vpc, aws_iam_role, etc.), and object identifier (for more on this see the naming section).
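A minimal sketch of a remote state lookup, assuming the VPC's project keeps its state in S3 and declares an output named vpc_id. The bucket, key, and identifiers are hypothetical. The syntax is Terraform 0.12+; on 0.11 the value would be read as data.terraform_remote_state.network.vpc_id, without the outputs segment.

```hcl
# Reads the state file written by the (separate) networking project.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"   # assumed bucket name
    key    = "network/terraform.tfstate" # assumed state key
    region = "us-east-1"
  }
}

resource "aws_security_group" "template_api" {
  name = "template_api"

  # Only values that the other project explicitly exports as outputs
  # are visible here.
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```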

Modules

A Terraform module is any collection of Terraform resources within the same folder. So if I’m creating an IAM user in iam/main.tf, I can leverage that Terraform in other Terraform by calling it as a module. Modules are the building block for reusable Terraform code. They can be sourced either from a relative path (common within an application repo where we have some submodules to help organize our code), from a module repository (e.g. things on the Terraform Registry), or from GitHub.

Below is an example of the folder structure when using relative path imports.
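A hypothetical layout and the matching module call might look like the sketch below. The iam folder follows the IAM user example above; the other names (service_user, user_name) are assumptions.

```hcl
# Assumed folder structure:
#
#   .
#   |-- main.tf            <- root configuration (this file)
#   `-- iam/
#       |-- main.tf
#       |-- variables.tf
#       `-- outputs.tf

module "service_user" {
  # A relative path is resolved from the directory containing this file.
  source = "./iam"

  # Assumes iam/variables.tf declares a variable named user_name.
  user_name = "template_api_service"
}
```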

There’s only one way to be able to pass data in and out of modules: via variable and output declarations within the module itself. This is why modules typically have a main.tf file, where the actual resources created in that module live, a variables.tf file for declaring what aspects of the module are configurable, and outputs.tf where the module can pass data back to the caller (e.g. the ARN of a created resource, etc.).
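To make the three files concrete, here is a small sketch of an IAM user module in Terraform 0.12+ syntax. The variable and output names are hypothetical, and the three files are shown in one block for brevity.

```hcl
# iam/variables.tf: what the caller is allowed to configure.
variable "user_name" {
  description = "Name of the IAM user to create"
  type        = string
}

# iam/main.tf: the resources that the module actually manages.
resource "aws_iam_user" "this" {
  name = var.user_name
}

# iam/outputs.tf: what the module hands back to the caller.
output "user_arn" {
  value = aws_iam_user.this.arn
}
```

A caller that instantiates this module as, say, module "service_user" would then read the ARN as module.service_user.user_arn.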

Writing Terraform

There are many aspects of writing Terraform which can be very frustrating. Some of them are related to the current limitations of HCL (2.0 can’t come soon enough), like the lack of lazy evaluation in ternary assignments. Many others, however, stem from the disconnect between an infrastructure DSL and normal engineering best practices. This section is intended to explore several of these, but it shouldn’t be considered exhaustive.

Naming

One of the two hardest problems in Computer Science, naming can easily get out of control when Terraforming infrastructure. There are two different types of names that occur in Terraform. The first are data/resource/module identifiers, e.g.
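As a sketch, the three kinds of identifier could look like this. The resource and data source types and the module path are assumptions; only the identifiers themselves matter here.

```hcl
# Resource identifier: user1
resource "aws_iam_user" "user1" {
  name = "user1"
}

# Module identifier: api (the source path is hypothetical)
module "api" {
  source = "./modules/api"
}

# Data source identifier: current
data "aws_caller_identity" "current" {}
```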

In the example above, the object names are user1, api, and current respectively.

The other class of names is the name argument that some data sources and resources accept, e.g.
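For illustration, a hypothetical IAM role in Terraform 0.12+ syntax, where lambda_exec is the Terraform identifier and the name argument is what shows up in AWS:

```hcl
resource "aws_iam_role" "lambda_exec" {
  # The name argument is the name of the role as AWS sees it (assumed value).
  name = "template_api_lambda_exec"

  # Trust policy that lets Lambda assume the role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}
```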

While it seems a trivial point, consistent naming can be very helpful when trying to read Terraform files. Identifiers should always be underscored to be consistent with the naming convention of the resource/data type itself (e.g. aws_iam_role, aws_lambda_function, etc.).

Resource/data name arguments ideally should match the convention for the given resource, but it’s often simpler to stick to the underscore convention.

Namespacing

On the topic of naming, we also get into namespacing our resources. There are several reasons for that:

  • identifying resource ownership becomes easier (e.g. rds_user isn’t as clear as template_api_rds_user)
  • makes multiregionality easier, since some AWS resources are globally namespaced (such as S3), e.g. some-data-bucket-us-east-1 and some-data-bucket-us-east-2
  • allows for running multiple side-by-side environments in the same account, e.g. some-data-bucket-us-east-1-dev and some-data-bucket-us-east-1-stage
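One way to get this kind of namespacing is to build names from a couple of variables, as in the sketch below (Terraform 0.12+ syntax). The variable and identifier names are assumptions.

```hcl
variable "region" {
  type = string # e.g. "us-east-1"
}

variable "environment" {
  type = string # e.g. "dev" or "stage"
}

locals {
  # Suffix shared by every globally namespaced resource in this project.
  namespace_suffix = "${var.region}-${var.environment}"
}

resource "aws_s3_bucket" "some_data" {
  # Yields some-data-bucket-us-east-1-dev for region us-east-1, environment dev.
  bucket = "some-data-bucket-${local.namespace_suffix}"
}
```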

Secrets

One thing to be aware of is that any values in the Terraform state, whether DB passwords, Lambda environment variables, or ECS task definitions, will be stored in plaintext.

While the remote state Terraform backend (see next section) supports encryption at rest, this would still be considered a violation in some of our compliance environments.

Using AWS Secrets Manager or Parameter Store to create and store the values out of band, and having the environment variables simply refer to the secret name, allows us to store the secrets more securely and gives us an easier path towards secret rotation.
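A sketch of what referring to the secret by name can look like for a Lambda function (Terraform 0.12+ syntax). Every name and path here is hypothetical, and the secret itself is assumed to have been created outside of Terraform.

```hcl
variable "lambda_role_arn" {
  type = string
}

resource "aws_lambda_function" "api" {
  function_name = "template_api"
  role          = var.lambda_role_arn
  handler       = "app.handler"
  runtime       = "python3.12"
  filename      = "build/api.zip" # assumed build artifact

  environment {
    variables = {
      # Only the secret's name ends up in the Terraform state. The function
      # fetches the value itself at runtime, so its role needs permission
      # to read this secret.
      DB_PASSWORD_SECRET_NAME = "template-api/db-password"
    }
  }
}
```

Rotating the password then becomes a change to the stored secret, with no Terraform change required.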

Using and writing modules

Beyond satisfying the constraints of our compliance environments, an IaC tool can also provide a lot of benefits in terms of code reuse when deploying to new environments, designing new systems with similar components, or even jump-starting other teams’ projects. Furthermore, modular Terraform code, like modular higher-level code, is better organized and more readable.

That being said, just like normal code, it’s often easier to start off by having all of the resources that we need in one place, making sure things work, and then incrementally refactoring. This is the same way that we might start with more POC-like code, and slowly evolve our abstractions over time to fit the needs and shape of the system.

Local development

When first starting to Terraform an architecture that you are less familiar with, it’s often best to start by actually putting resources together in the UI. This helps us validate the architecture, and helps give us a better picture of what we’ll need to Terraform once we start coding. Once we start actually writing Terraform, there are some more considerations for how we can both iterate on our Terraform and collaborate.

Many CI/CD pipelines are only set up to deploy Terraform when code is merged to the master branch of the repository. While this is a strong gate for ensuring consistency between the master branch and what’s actively running in a given environment, it also means that developing Terraform locally can feel more painful than it is (i.e. feeling like one has to merge to master before validating their Terraform work).

Luckily, not only can we deploy Terraform locally, but we can even do so while sharing state with our teams by using a remote state configuration on our backend. This can be set up using the resources created by using this module, which also contains documentation about how to tell Terraform what to use for its state backend.
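For reference, a minimal sketch of an S3 backend configuration. The bucket, key, and lock table names are assumptions, and the bucket and table must already exist before terraform init is run.

```hcl
terraform {
  backend "s3" {
    bucket = "example-terraform-state"        # assumed state bucket
    key    = "template-api/terraform.tfstate" # assumed path within the bucket
    region = "us-east-1"

    # Assumed DynamoDB table used for state locking, so that two people
    # cannot apply against the same state at once.
    dynamodb_table = "example-terraform-locks"

    # Encrypts the state file at rest.
    encrypt = true
  }
}
```

Backend blocks cannot contain interpolations, so per-environment values are usually supplied at terraform init time with -backend-config.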

Multiregion

Many services at SendGrid need to be multi-region in order to meet our availability guarantees. Terraform providers are region specific, which means that any Terraform code that supports multiple regions must declare multiple providers. Oftentimes when building multi-region architectures, it makes sense to modularize the parts that are being repeated in each region. Any time that a module is going to have different providers passed in, they should be passed explicitly.

One such example of this is how a DynamoDB global table module defines aliased providers which are then ultimately configured by the upstream caller.

The module is fairly small, and has one sub-module:

The module’s main.tf defines the aliases to be defined by the caller:
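A sketch of what that can look like, with an assumed layout and hypothetical alias names (east and west). It uses empty "proxy" provider blocks, the style of the Terraform versions this post was written against; newer Terraform versions declare the same thing with configuration_aliases inside required_providers.

```hcl
# Assumed layout of the module:
#
#   dynamo-global-table/
#   |-- main.tf
#   |-- variables.tf
#   `-- dynamo/            <- single-region submodule
#       |-- main.tf
#       |-- variables.tf
#       `-- outputs.tf

# dynamo-global-table/main.tf
# Empty aliased provider blocks: they declare that the caller must pass in
# a configured provider for each alias.
provider "aws" {
  alias = "east"
}

provider "aws" {
  alias = "west"
}
```

The caller would then hand its own region-specific providers to the module through the module block's providers argument.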

The top-level uses each of the aliased providers as the default provider for the dynamo submodule:
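Roughly, and assuming the module has declared the hypothetical provider aliases east and west and has a submodule under ./dynamo, this could be sketched as follows (Terraform 0.12+ syntax):

```hcl
variable "table_name" {
  type = string
}

# Inside each instance of the submodule, the aliased provider becomes the
# default (unaliased) aws provider.
module "dynamo_east" {
  source     = "./dynamo"
  table_name = var.table_name

  providers = {
    aws = aws.east
  }
}

module "dynamo_west" {
  source     = "./dynamo"
  table_name = var.table_name

  providers = {
    aws = aws.west
  }
}
```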

The dynamo submodule has no notion of regionality and is only used to create a single region table, although it does export the region that it’s run against:
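A sketch of such a submodule in Terraform 0.12+ syntax. The key schema and billing mode are assumptions for illustration; the stream settings are there because global tables require streams on every regional table.

```hcl
variable "table_name" {
  type = string
}

# Whichever region the default provider passed to this module points at.
data "aws_region" "current" {}

resource "aws_dynamodb_table" "this" {
  name         = var.table_name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id" # assumed key schema

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "id"
    type = "S"
  }
}

# Exported so that the caller can tell which region this table lives in.
output "region" {
  value = data.aws_region.current.name
}

output "table_name" {
  value = aws_dynamodb_table.this.name
}
```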

Then the module’s top-level can use the region output to tie together each regional table into a global table.
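A sketch of that last step, assuming two instances of the submodule named dynamo_east and dynamo_west that each export region and table_name outputs, and a provider alias named east (all hypothetical). It uses the aws_dynamodb_global_table resource, which manages the original (2017.11.29) version of global tables, and a depends_on on modules, which needs Terraform 0.13 or later.

```hcl
resource "aws_dynamodb_global_table" "this" {
  # The global table is created through one of the regional providers.
  provider = aws.east

  # Both regional tables have to exist before they can be linked.
  depends_on = [module.dynamo_east, module.dynamo_west]

  # The global table shares its name with the regional tables.
  name = module.dynamo_east.table_name

  replica {
    region_name = module.dynamo_east.region
  }

  replica {
    region_name = module.dynamo_west.region
  }
}
```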

Testing Terraform

Given that Terraform is managing not only the code deployment, but the infrastructure of our applications as well, testing it necessarily happens at a higher level. Ultimately, this leaves us with a couple of good options for testing our Terraform: system integration tests and terratest (which, while it isn’t the most idiomatic Go, at least provides us with a good amount of helpful functionality).

  1. System integration tests
    • great for validating both our infrastructure and our business logic. They’re a great fit when testing application-specific Terraform, since we also get the system-level test value.
    • can be run as post-apply CI/CD hooks to run automatically whenever we deploy to an environment
  2. Terratest
    • a good fit when we’re writing Terraform modules intended for reuse, i.e. non-application specific.

Here’s an example of one of the tests that we run against our lambda-apigw module:

 

Troubleshooting Terraform

Troubleshooting Terraform is often a frustrating process for a few reasons:

  • The separate Plan/Apply phases having different failures (as outlined above)
  • The long cycle time between making a change, and applying for validation

Some helpful rules of thumb

  • If something fails during plan, it’s almost certainly a Terraform-related change
  • If something fails during apply, it’s more likely an issue with AWS configuration
  • The UI can give you a faster path towards resolution, and it’s often easier to backport fixes found through the UI than to incrementally run Terraform in search of a fix
  • Read the plan closely. Are things being deleted or changed that you don’t expect?

Some helpful terraform commands

  • import: terraform import allows you to bring preexisting resources under the management of new Terraform, so that you don’t have to recreate them. This can be particularly powerful for avoiding database migrations.
  • state: the state command has a few subcommands (such as rm) that allow you to reconcile state discrepancies (e.g. after manually deleting a resource, removing it from the Terraform state as well)
  • state mv: one of the subcommands of state that’s particularly helpful. It allows us to refactor Terraform code that has corresponding existing state, and to reconcile the new code with the current state, so that Terraform doesn’t think that the old resources must be removed and new ones created.

The value of count cannot be computed

This is a common issue, and a known deficiency in Terraform. It means that either your Terraform code, or modules that it depends on, is using the result of some computed value as the input to the count metaparameter in a resource definition.
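One common workaround is to pass the count in as its own, statically known value rather than deriving it from a list of computed values. A sketch in Terraform 0.12+ syntax, where a similar failure is reported as an invalid count argument; the resource and variable names are hypothetical.

```hcl
variable "subnet_ids" {
  type = list(string)
}

variable "subnet_count" {
  type = number
}

variable "route_table_id" {
  type = string
}

resource "aws_route_table_association" "private" {
  # Problematic when subnet_ids is itself computed from resources that do
  # not exist yet, because the length is then unknown at plan time:
  #   count = length(var.subnet_ids)

  # Workaround: the caller supplies a number that is known up front.
  count = var.subnet_count

  subnet_id      = var.subnet_ids[count.index]
  route_table_id = var.route_table_id
}
```

The cost is that the caller has to keep subnet_count in sync with the list by hand.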

Resource with <name> already exists

This usually happens when either:

  1. there are multiple environments or regional deployments in the same account, and the resources haven’t been sufficiently namespaced
  2. there were two deployments of the same application using two different backends, e.g. one local deploy with local state and a CI/CD deploy using an S3 backend

Eventual consistency

There are a number of errors that stem from the eventually-consistent nature of many AWS services. One example of these issues that we’ve run into is a module we have called apigw-custom-domain-regional.

This concisely named module has to create an ACM cert (ACM being a service that has no SLA), add a CNAME entry into DNS to validate it (and wait for the DNS to be updated), and, once the cert is validated, create an APIGW custom domain name using it (there can be some lag between when ACM considers the cert valid and when other systems can see it as available). It then creates a base path mapping between the APIGW stage and the custom domain, which, once again, may not be immediately reflected.

In many of these cases, simply re-planning and re-applying Terraform will often resolve the issue. While there’s no fool-proof mechanism by which to verify whether or not the issue is simply down to an issue of eventual consistency, verifying if values have been updated in the UI can often be a good indicator.


1: To be clear, by dependency solver I specifically mean a tool used to resolve directed acyclic graphs (also known as DAGs). Although it’s solving a conceptually similar issue, it’s not the same thing as a dependency management tool, which is concerned with resolving library/module dependencies.

In fact, Terraform is terribly inefficient when resolving module dependencies: it greedily fetches the source code for a module every time, irrespective of whether some other module already imported it. On the other hand, Terraform’s entire raison d’être is to apply graph theory to the resource management problem, and it’s for the most part quite good at that.

 



Steven Bogacz
A Uruguayan who somehow ended up in Colorado climbing, snowboarding, and coding to pass the time.