Migrating Infrastructure Between Terraform Remotes

Rahul Rao
Published in The Startup
12 min read · Jun 28, 2020


In this post I’ll be going over some personal learnings — the dos and don’ts — about moving new and existing infrastructure into Terraform, and migrating Resources between Terraform Remotes. I’ll be focusing on cloud infrastructure, primarily Amazon Web Services (AWS) as my cloud service provider. The general concepts remain the same regardless of which provider you are working with.

Quick Overview

I recently found myself needing to migrate a partially managed Terraform stack, along with a bunch of unmanaged AWS services, from an S3 Backend over to Terraform Cloud, ideally without any downtime. One of the challenges was the lack of pre-existing documentation. The other was that the previous owner of the project was no longer around.

Before I dive into the approach I took for my use case, however, here’s a brief overview of Infrastructure as Code (IaC). If you already know what IaC is, feel free to skip the next section.

So what’s Infrastructure as Code?

Infrastructure as Code, or IaC (as the cool kids taught me), is the practice of declaring your infrastructure (cloud or otherwise) using code. Basically, instead of using an actual UI and pressing buttons on some console somewhere, you declare your servers, networking components, and everything that goes into your infrastructure via code.

IaC and design patterns related to it could be its own separate book, so in the interest of brevity, here are a few reasons why I personally prefer IaC:

  • Documentation: you can read the code and understand what your “stack” is composed of
  • Change Management: you can look at changes (e.g. git commits) and understand who changed a specific piece of infrastructure, when, and exactly what the change entailed
  • Disaster Recovery: IaC, when set up correctly, can potentially give you a push-button redeploy if something unexpected happens — or a rollback to a steady state if a breaking change gets accidentally deployed

… and many more.

There are many players in the IaC space, with multiple use cases and areas that they excel in. At the end of the day, it boils down to what your specific use case is, what your team is familiar with, and equally important — consistency! Terraform is one of those players, and I personally love it!

Terraform Terminologies

There are a few concepts I’ll be using during the rest of this article that are Terraform specific:

State: The Terraform State is a snapshot of your cloud infrastructure.

State File: Terraform documents this State in what it calls a State File, which is basically a massive JSON file stored somewhere.

Backend: This is where Terraform saves the State File. A Remote (think of it as a git remote) is a location on the interwebs where the State File can be stored. The State File can also live on your local machine (a.k.a. a Local Backend).

Resource: A piece of cloud infrastructure; could be an EC2 instance, an IAM role, a Security Group, etc. — basically anything that you would “Create” on a cloud console by pressing buttons.

Managed Resource: A Resource that Terraform is aware of. This happens when there is an entry for the Resource in the Terraform State File.

Unmanaged Resource: A Resource that Terraform is not aware of. For example, you could have an EC2 instance that exists on your AWS account, but does not have an entry in Terraform’s State File.

Terraform Plan: The terraform plan command is a dry run. It shows you what Terraform would do if you actually applied your configuration.

Terraform Apply: The terraform apply command actually creates or updates the infrastructure you declared in code.

Normally the workflow looks as follows:
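
In CLI terms, that cycle is roughly the following (a minimal sketch, with nothing project-specific in it):

terraform init    # download providers and configure the Backend
terraform plan    # dry run: show what would change
terraform apply   # actually create/update the Resources

You edit your .tf files, run plan, review the output, run apply, and repeat.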

With those concepts out of the way, let’s dive into this specific problem.

Background Context

Like most stacks on AWS (or any cloud provider), our stack at QuickFrame was initially created via console button presses. After a while, our DevOps Lead started migrating some pieces of this infrastructure over to Terraform, by declaring all of the infrastructure for the different apps in a single git repository.

Terraform lets you use various platforms as a Backend. Since we are an AWS shop, our DevOps Lead had, rightfully so, decided to use an S3 Backend.

For most Terraform Backends, including S3, you run terraform plan and terraform apply on your local machine. Terraform knows that you’re using an S3 Backend and reads and writes the State File from and to the configured S3 bucket.
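
For reference, an S3 Backend is declared in the Terraform configuration itself; a minimal sketch (the bucket name, key, and region below are placeholders, not our actual setup):

terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket" # placeholder bucket that holds the State File
    key    = "infra/terraform.tfstate"   # placeholder object key for the State File
    region = "us-east-1"                 # placeholder region
  }
}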

So if you have more than one person working on the same repository, they’ll all be reading and writing the same State File, but running the Terraform commands on their local machines with their own AWS access keys.

Challenges Faced

As the number of resources and apps managed under this single repository grew, we began facing some issues:

  • There was no “badge” by which we could verify that the code in Github had actually been applied — the only way to check was to log into AWS and confirm that each resource existed
  • The above wasn’t a problem when our DevOps lead was the only human working on this repository, but it started to get trickier when multiple developers started working on the same repository for different AWS apps (we want all our engineers to be able to manage the stack they develop)
  • Matters got more complicated as we needed to manage more environments at different stages (qa, staging, prod, etc.) — primarily because we want all environments to be nearly identical, with minor differences (e.g. lower compute for lower-level environments)
  • It started to get practically impossible to look at the entire stack/infrastructure holistically
  • Continuous Deployment became tricky because we did not want a change for Application A to accidentally deploy an unapplied change for Application B
  • Applies were done “locally” using individual access keys, which led to permission-granting challenges

… and many more.

Solution

I was a month or so into my current job when I started digging into this issue. While doing some research, I came across Terraform Cloud, which was in open beta at the time. The more I looked into it, the more it made sense for us to switch over! At its core, Terraform Cloud provided us with:

  • A Remote Backend for our State File
  • A runtime to centrally run terraform plan and terraform apply (which meant a single user with a single access key)
  • “Badges” for the latest run, and the state of the latest run
  • Workspaces for different projects and environments
  • VCS (Github) integration
  • Teams and team member accounts

WIN!

Migration Overview

As a tl;dr, here is what we ended up doing overall:

  1. Finalize that each project would have a terraform directory declaring the Resources for that project (e.g. one of our APIs has a terraform directory in its workspace declaring all the infrastructure Resources needed for that API)
  2. Download the State File locally from the S3 Backend
  3. Make adjustments, migrate code, and prepare for the new project structure
  4. Set up the Terraform Cloud project and workspace
  5. Upload the State File to the TF Cloud Workspace and use that as the new Backend
  6. Set up VCS hooks for that project

Sounds quite easy, doesn’t it…? :)

To be candid, this is the eventual approach I landed on. I won’t bore you with the details of the 1000 missteps I took before getting there.

The terraform state CLI command was my friend throughout this process. Don’t be afraid to use it once you understand what it does.

I’ll go into how we achieved each of these steps next, as well as the commands we used during these steps.

Step 1: Finalizing New Project Structure

As outlined above, we decided to declare each application’s Resources in the same repo as the application code itself. We did this primarily because it worked perfectly with our flow and the “size” of our projects. Sidebar: in the future, should we have larger infrastructure projects, we might revisit this decision.

This meant that I was working with 2 repositories — the old terraform repository, and a new repository containing infrastructure code specific to the new application.

Step 2: Getting The State File Locally

Needless to say, I was working on a separate branch at this point, and decided to start with the qa infrastructure…

First, I had to grab the State File from the current S3 Backend.

To get the State File, I ran terraform state pull in the old repo that had all the infra. This command outputs the entire State File to stdout. To save it locally, I ran terraform state pull > terraform.tfstate in the directory (the terraform directory) that contained all my .tf files.

FYI: By default, Terraform expects a terraform.tfstate file to exist at the root of your Terraform workspace if you are using a Local Backend.

Now, I had a massive JSON file that represented my entire stack. I simply moved this over to the new repo.

mv ./terraform.tfstate ../newrepo/terraform.tfstate

Step 3: Making Local Adjustments

This is the step where most of my time was spent. From this point onwards, I am working only in the new repository.

This is what the change logically looked like:

  1. Extract only the relevant code
  2. Ensure variable names were up to date, fixing them where applicable
  3. Test that there were no unexpected changes

For reasons that predate my time, each Application Stack was declared as its own Terraform Module. Owing to that, all Application Stacks shared the same State File. This was suboptimal, as we wanted separate State Files for separate Application Stacks.

Personally, I feel that having one State File manage all your Application Stacks is like having 1 git repository for all your projects.

To see what the modules were, I ran terraform state list. As expected, all API 1 related resources were under module.api1.

As of this writing, I had not found a command that let me tell Terraform: “take everything under module.api1 and put it at the top level”.

Instead, I first extracted all the module contents into their own State File:

terraform state mv -state-out=new.tfstate "module.api1" "module.api1"

FYI: By default, Terraform uses terraform.tfstate as the local State File when a Remote Backend is not being used.

Next, to ensure everything was extracted correctly and I wasn’t missing anything, I ran:

terraform state list -state=new.tfstate

Now I had to pull all resources under module.api1 up to the top level. As I had mentioned, I had not found a command that let me do that. So I simply edited the State File directly. The Terraform State File has the following structure:

{
  "version": x,
  "terraform_version": "x.xx.xx",
  "serial": xxx,
  "lineage": "xxxx-xxx-xxxx-xxxx-xxxx",
  "outputs": {}, // not empty if there are outputs
  "resources": [
    {
      "module": "module.api1",
      // ... other data
    },
    {
      "module": "module.api1",
      // ... other data
    }
  ]
}

Nice — so the module key/value pair informs Terraform that this specific resource belongs to a module. I simply deleted all the module key/value pairs and ran terraform state list -state=new.tfstate again.

That had the desired effect and Terraform thought that those resources belonged to the top level! That ensured that only relevant resources were extracted. I also copy-pasted the TF code from the old repository into the new repository. I backed up the now-old terraform.tfstate file as backup.tfstate and renamed new.tfstate to terraform.tfstate.
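
Side note: if you’d rather not hand-edit JSON, the same module-stripping edit can be scripted. A sketch using jq (assuming you have jq installed; keep a backup of the file first):

jq 'del(.resources[].module)' new.tfstate > flattened.tfstate

This deletes the module key from every entry in the resources array, which is exactly the manual edit described above.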

Next, we were adopting some updated resource naming conventions in all future-facing Terraform code, so I had to rename a few resources. I changed their names in the .tf files, and then, to reflect that in the State File, ran terraform state mv “old-resource-name” “new_resource_name” for each resource that needed to be renamed. This “renaming” was also applicable in places where the project had been partially migrated, i.e. we had moved qa over to TF Cloud but staging was still being managed by the old repo.
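
For example, a hypothetical rename of a Security Group to an underscore-based name (the resource names here are made up):

terraform state mv "aws_security_group.api1-sg" "aws_security_group.api1_sg"

After this, the address in the State File matches the renamed resource block in the .tf files, so Terraform doesn’t plan to destroy and recreate the resource.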

Yes, it was a tedious process.

Yes, it took time.

I still needed to change tags, update some server configs, etc.; however, I held back so as not to introduce too many changes at once.

To test that things were “just as before”, I cycled between terraform plan and code changes a few times. This also helped me catch a few resources that I had accidentally missed. Basically, my goal was for Terraform to tell me “No changes. Infrastructure is up-to-date.”, which meant that everything was “just as before”.
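
Concretely, each iteration of that loop looked something like this (a sketch, run from the new repo’s terraform directory against the local State File):

terraform init   # initialize providers and the Local Backend
terraform plan   # compare the migrated code against the real infrastructure
# fix whatever the plan flags in the .tf files, then plan again until you see:
# "No changes. Infrastructure is up-to-date."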

Sweet — on to setting up the project on TF cloud!

Step 4: Set up Terraform Cloud Project

Terraform has the concept of Workspaces, and there’s an extensive writeup on them in the official docs. Just as is the case in life, there’s no one-size-fits-all solution. It all boils down to your team’s shared understanding, what the tool allows for, and then being consistent.

We use a Workspace per environment. This is because we want our qa, staging and prod environments to be as similar as possible, be managed by the same TF code (i.e. configuration files), while having the necessary differences. E.g. our qa environment uses smaller EC2 instances than our prod environments. No surprises there.

Each Workspace manages its own separate State File. This allows us to have 1 piece of Terraform code manage 3 separate environments. As you can understand, the State File for qa would be different from the State File for prod by virtue of having different values for instance sizes, vpc ids, security group ids, etc.

Switching workspaces is basically deciding to work on a different environment.

This is what it looks like logically:
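
Roughly, the same configuration fans out per environment like this (the variable name and instance sizes are made up for illustration):

# variables.tf (shared by every environment)
variable "instance_type" {
  type        = string
  description = "EC2 instance size for this environment"
}

# qa.tfvars
instance_type = "t3.small"

# prod.tfvars
instance_type = "m5.large"

The same .tf code is planned and applied once per Workspace, each run using its own variable values and writing to its own State File.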

I set up these workspaces with the appropriate names, access and secret keys, as well as had them use different .tfvars files. I’ll share how we set our TF Cloud project up in a separate post.

Step 5: Uploading the State File

This is pretty straightforward. I simply added the Backend configuration, populated the appropriate values for the variables, and ran terraform init again.
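
A sketch of what that Backend configuration looks like for Terraform Cloud (the organization name and workspace prefix are placeholders, not our real values):

terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "my-org"   # placeholder organization

    workspaces {
      prefix = "api1-"        # placeholder; maps to Workspaces like api1-qa, api1-staging, api1-prod
    }
  }
}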

Right away, Terraform detected the new Backend and asked whether I wanted it to upload the State File to the new Remote Backend. I said yes and voila, TF Cloud was showing my new State File!

Step 6: VCS Integration

One of the highlights of TF Cloud is that it provides a terraform runtime (basically a CI server to run your terraform commands).

My upcoming post on our TF Cloud project setup will go into more details around the precise steps in setting this up. However at this point, by hooking up the TF Cloud Workspace with the appropriate branch on Github, I was able to trigger a build on TF cloud via pushes to the appropriate Github branch.

I manually triggered a run from the TF Cloud console, and again got the words I was looking for: “No changes. Infrastructure is up-to-date.”

… and then you repeat that for all the stages… for all the projects… ;)

Wrap Up and Takeaways

This one-time lift allowed us to enable our team to manage their stacks themselves. We have a standardized process around how we make updates, and engineers (regardless of their seniority/experience) are now able to effect infrastructure changes in a safe and controlled manner whenever needed.

A mix of Github PRs and TF Cloud Plan reviews make this a collaborative and seamless process, without being bottlenecked by siloed knowledge in one person’s head.

To conclude:

  • Use terraform state pull to get the entire state file
  • If you are unsure, do not hesitate to pull your State File locally and make changes to the State File locally — you can always push it back to the remote via terraform init
  • Use terraform state list to get a list of resources
  • When in doubt, use terraform state show <resource> to get more details about a resource that is being Managed by Terraform
  • Use terraform state mv to extract modules, as well as rename resources — you can use the -state and -state-out params to modify which state file is used as an input and which state file is being used as an output, as long as you are doing all of these changes locally
  • Delete the module: <name> key/value pair to get resources out of a module and up to the top level — modifying the State File is frowned upon, but sometimes that’s your only choice
  • Use Workspaces in a consistent manner to switch tracks between the various environments, and have each Workspace manage the State File for one given environment, thereby managing multiple environments using a single Terraform project and some .tfvars files
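
For quick reference, here are the state commands from the list above in one place (file and resource names are placeholders):

terraform state pull > terraform.tfstate                                # download the remote State File
terraform state list -state=terraform.tfstate                           # list Managed Resources
terraform state show aws_instance.example                               # inspect a single Managed Resource
terraform state mv -state-out=new.tfstate "module.api1" "module.api1"   # extract a module into its own State File
terraform state mv "aws_security_group.old_name" "aws_security_group.new_name"   # rename a Resource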

Don’t be afraid, be curious and make new mistakes! :)
