Deploying a New Region Faster Than You Can Bake a Potato
Authored by Justin Watkinson, Senior Staff DevOps Engineer
One trait I appreciate about the DevOps team at SailPoint is its attitude towards “the other CI” — Continuous Improvement. I believe highly effective DevOps teams make a habit of getting things out to production quickly, then iterating on that solution and raising its quality with each revision.
One morning, I was asked to deploy a SailPoint Predictive Identity stack in a new AWS region. We needed everything from the ground up, including a variety of VPCs and the transit gateway that binds them, compute resources from ECS and EKS, and data stores provided by RDS and ElastiCache, to name a few:
- Network — VPC, NAT Gateways, Transit Gateway, Subnets, Route Tables and More
- Compute — EC2, EKS, and ECS clusters, with supporting security groups and IAM
- Data — RDS, ElastiCache, and DynamoDB resources
As is often the case, the sooner we could get this into our customers’ hands, the better, so we needed to go fast.
Terraform and Terragrunt — Infrastructure as Code Companions
At SailPoint, we use the combination of Terraform and Terragrunt to empower our AWS infrastructure automation and adopt infrastructure as code practices.
We use Terraform and the HashiCorp Configuration Language to create modules that codify our infrastructure for consistency in configuration and adherence to best practices in monitoring, alerting, high availability, and more. Terragrunt then acts as the glue that composes those modules together, further codifying the relationships between infrastructure components built from smaller, repeatable units.
Today’s State — Composable… with Room for Improvement
Our existing GitHub repository is organized as a monorepo containing all of our Terraform modules and Terragrunt deployments together. The Terragrunt configurations reference the modules by relative path and build a global map of variables by layering common variables, then regional overrides, down the stack using Terragrunt’s extra_arguments feature:
terraform {
  extra_arguments "vars" {
    arguments = [
      "-var-file=${get_parent_terragrunt_dir()}/variables/common.tfvars",
      "-var-file=${get_terragrunt_dir()}/../region.tfvars",
      "-var-file=${get_terragrunt_dir()}/prod.tfvars",
    ]
    commands = get_terraform_commands_that_need_vars()
  }

  source = "../../../../modules//vpc"
}
To compose infrastructure units together, the Terraform modules themselves would use data sources and remote state to discover the resources already created by another module.
While both of these options are technically valid, they come with a few drawbacks. Module reusability suffers when there are too many data lookups for external resources that may not yet exist, often creating a situation where a person just has to “know” the correct order in which to apply things for the next plan to succeed. You may also find yourself creating toggles around these resource lookups using count or for_each, which tend to clutter your module and make it less reusable:
data "terraform_remote_state" "vpc" { count = var.vpc_lookup_enabled ? 1 : 0backend = "s3"config = { bucket = var.remote_state_bucket key = "${var.remote_state_path}/vpc/terraform.tfstate" region = "us-east-1" profile = var.aws_profile }}
Module maintainability is another factor in the decision to use count or remote_state in the same modules that create resources. Resource tags will change over time, including their case-sensitivity! Presumed naming conventions will begin to exceed resource name limits, forcing you to refactor, which creates a new problem — how to remain consistent in your module usage between implementations while keeping your interface, the module input variables, small and well-scoped.
Variable scope was another issue in the current monorepo setup, and it affected how modules could grow and evolve independently of one another. The global map of Terraform variables included an ever-growing list of identifiers, often given unique names so as not to interfere with similar modules. As that list grew, modules had to invent increasingly unique names to prevent accidental reuse or incorrect configuration throughout the entire repo. This tight coupling led to some very long Terraform plans, where changing a common variables file meant re-planning an entire region to ensure no configuration drift had occurred as a result of the change. In DevOps, we believe in smaller, incremental changes, which lead to faster delivery.
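To give a flavor of the problem, here is an illustrative (not actual) excerpt of what a shared common.tfvars starts to look like once every module needs its own prefix to stay out of the others’ way:

# Illustrative keys and values only; each module claims a prefix so its names
# don't collide with similar modules sharing the same global variable map.
vpc_cidr_block              = "10.10.0.0/16"
eks_cluster_instance_type   = "m5.large"
ecs_cluster_instance_type   = "c5.large"
rds_identity_instance_class = "db.r5.large"
elasticache_node_type       = "cache.r5.large"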
Finally, after upgrading to Terraform 0.12, we also became well-acquainted with the “value for undeclared variable” issue. Since our repository layered 5 or 6 variables files to create a global map of key-value pairs, we had many variables that simply weren’t used in every single Terraform module. This meant seeing potentially hundreds of warnings that passing values for undeclared variables was deprecated and destined to become an error in a future Terraform release.
Building our new Region with Next Generation Terragrunt
The biggest challenge for a new region was getting the dozens of modules to deploy in the correct order. I chose to take advantage of Terragrunt’s dependency injection feature to address this. Using dependency blocks, I can explicitly “tell” Terragrunt the correct order in which to apply our infrastructure, while also removing the need for data and remote_state blocks. When you combine dependencies with mock_outputs, you can create a Terraform plan that simulates the correct behavior even if the resource doesn’t yet exist. Use obviously fake mock values to make plan review even easier:
dependency "transit-gateway" { config_path = "../../transit-gateway"mock_outputs = { transit_gateway_id = "tgw-00000000" }mock_outputs_allowed_terraform_commands = ["validate", "plan"]}
Using dependency injection, we could run Terragrunt’s plan-all feature against a much larger segment of our new region with confidence. This gets things moving much more quickly than before, as fewer plan/apply cycles are needed to complete the environment.
Next, we also wanted to deal with variable scoping. Terragrunt supports an inputs block that lets you map dependency outputs to module inputs, as well as dynamically discover variable files in parent directories. What this really enabled was a more loosely coupled way to map Terragrunt variables to Terraform modules. We also had full access to Terraform expressions and functions for modifications such as placing a string into a list, without having to redefine the module variables or duplicate the variable (keeping it DRY along the way). This helped speed things up because each implementation now looked virtually identical, which meant stamping out new regions was a mere copy and paste away.
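As a rough sketch of what this looks like in practice (the module path, dependency, and variable names below are illustrative, not our actual configuration), a child terragrunt.hcl can wire a dependency’s outputs straight into the module’s inputs and reshape them with ordinary Terraform expressions:

include {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../../modules//eks"
}

dependency "vpc" {
  config_path = "../vpc"

  # Obviously fake values keep the plan readable before the VPC exists.
  mock_outputs = {
    vpc_id            = "vpc-00000000"
    private_subnet_id = "subnet-00000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id

  # Terraform expressions are available here, e.g. wrapping a single value in a
  # list without redefining or duplicating the module variable.
  subnet_ids = [dependency.vpc.outputs.private_subnet_id]
}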
For anything unique to an environment, such as a specific identifier or a different instance size, we used another file next to the terragrunt.hcl named terraform.tfvars, which gets automatically included in the variable set for the deployment. This way the terragrunt.hcl files are DRY and reusable in almost 100% of cases (less typing = more speed!).
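For example, such a terraform.tfvars might hold nothing but a handful of environment-specific overrides (the names and values here are hypothetical):

# Hypothetical environment-specific overrides; everything else comes from the
# shared terragrunt.hcl and its dependency outputs.
cluster_name  = "idn-dr-usw2"
instance_type = "r5.xlarge"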
What we didn’t end up changing much was the modules themselves. Where we had been using data or remote_state lookups, we simply used a count to disable them when not needed, preferring to pass in those values instead. Everything was made optional and configurable, allowing 100% backward compatibility with existing environments. This helped us focus on the task at hand — composition.
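A minimal sketch of that pattern (assuming a module with a hypothetical vpc_id input, not the actual SailPoint module) keeps the legacy remote_state lookup behind its count toggle while preferring a directly supplied value, for example one wired in from a Terragrunt dependency:

variable "vpc_id" {
  description = "VPC ID passed in directly, e.g. from a Terragrunt dependency."
  type        = string
  default     = ""
}

variable "vpc_lookup_enabled" {
  description = "Enable the legacy remote state lookup for existing environments."
  type        = bool
  default     = false
}

variable "remote_state_bucket" {
  type    = string
  default = ""
}

variable "remote_state_path" {
  type    = string
  default = ""
}

data "terraform_remote_state" "vpc" {
  count   = var.vpc_lookup_enabled ? 1 : 0
  backend = "s3"

  config = {
    bucket = var.remote_state_bucket
    key    = "${var.remote_state_path}/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Fall back to the lookup only when explicitly enabled; otherwise trust the input.
  vpc_id = var.vpc_lookup_enabled ? data.terraform_remote_state.vpc[0].outputs.vpc_id : var.vpc_id
}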
The Results
Since adopting this practice, we have built out three new SailPoint Predictive Identity regional environments, each completed faster than the one before it. This activity used to be measured in weeks to months, but now it’s a contest to see whether we can finish in under a day or two (including change management).
For comparison’s sake, when testing this against our dev account’s disaster recovery environment, I was able to produce all of the base infrastructure (VPC, ECS, EKS, ElastiCache, S3, ELBs, you name it) in a total of 21 minutes with a single terragrunt apply-all that created everything in the correct order. Faster than you can bake a potato, and I have an entire AWS setup — I’ll take it!
What it really comes down to is adopting your tool’s composition opinions and embracing them with open arms. If we wanted to bring that time down further, we could look at ways to parallelize the infrastructure creation that takes several minutes to complete, namely EKS, Elasticsearch, and RDS. Making smaller modules and ensuring the dependency graph allows those to run in parallel would be a great next step.
We also took a huge step toward a more loosely coupled module ecosystem, where variable names can evolve independently and promote reuse within the organization. We’re hoping this turns into more module reuse and the ability to automate testing of our modules more easily. As a bonus, we moved one step closer to avoiding the undeclared variable issue.
Next Steps
We still have some work ahead of us migrating our existing regional infrastructure onto the next-generation Terragrunt platform. Since the modules are the same in more than 80% of cases, it’s mainly an exercise in moving Terraform state files to their new home.
We’re also excited to begin the process of adopting Atlantis and the Atlantis config generator for Terragrunt to implement this into a battle-tested CI/CD system.