Implementing a Terraform state backend on Cloudflare Workers

Sep 18, 2022

Preface

If you are just hearing about Terraform for the first time, the introduction to this post attempts to outline all the necessary information to get up to speed. For the already initiated, feel free to skim through the introduction before diving into the details of creating your own HTTP Terraform Backend on Cloudflare Workers.

Introduction to Terraform

Terraform is a popular open-source tool for defining infrastructure-as-code and is primarily developed by HashiCorp. Terraform works by declaring your resources in code written using the HashiCorp Configuration Language (HCL) and then running a couple commands to create, edit, or delete the mentioned resources.

The project has gained lots of popularity, especially in the “Cloud Native” world and many organizations now maintain their own Terraform Providers that can be used to interface with their platform.

Terraform works by using a combination of three things. First, the locally defined .tf files containing all your resource declarations. Second, the terraform.tfstate file (locally by default and the subject of this post) which is used to maintain a record of resources managed by terraform and their last known configured state. Finally, we have the remote (or upstream) definition of the resource. Anytime the terraform plan or terraform apply commands are invoked, all three of these things are checked to determine the course of action terraform will attempt to perform to ensure everything aligns with what is defined in the .tf files.

A “Hello, World” example

Creating a new directory with a file main.tf, which is the entrypoint for terraform, we can declare the existence of a local file named hello.md with the contents of Hello, World!.

# main.tf
terraform {
  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "2.2.3"
    }
  }
}

resource "local_file" "hello_md" {
  content  = "Hello, World!"
  filename = "hello.md"
}

Running the command terraform plan, Terraform will detect that a new file needs to be created and display a plan to create that file.

$ terraform plan

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the
following symbols:
  + create

Terraform will perform the following actions:

  # local_file.hello_md will be created
  + resource "local_file" "hello_md" {
      + content              = "Hello, World!"
      + directory_permission = "0777"
      + file_permission      = "0777"
      + filename             = "hello.md"
      + id                   = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Running terraform apply will then prompt you for confirmation before creating the file.

$ terraform apply
...

local_file.hello_md: Creating...
local_file.hello_md: Creation complete after 0s [id=0a0a9f2a6772942557ab5355d76af442f8f65e01]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

We can then confirm that the new file was created in the current directory, along with another file named terraform.tfstate, which is a JSON file representing the last known state for all the managed resources.

$ cat hello.md
Hello, World!

$ ls -l
total 24
-rwxr-xr-x  1 cole  staff   13 Sep 18 13:20 hello.md
-rw-r--r--  1 cole  staff  216 Sep 18 12:58 main.tf
-rw-r--r--  1 cole  staff  834 Sep 18 13:20 terraform.tfstate

Doing cat terraform.tfstate we can then see the contents of the tfstate file.

{
  "version": 4,
  "terraform_version": "1.2.9",
  "serial": 1,
  "lineage": "7d5540fe-2ad8-3172-ed63-5f6471c34756",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "local_file",
      "name": "hello_md",
      "provider": "provider[\"registry.terraform.io/hashicorp/local\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "content": "Hello, World!",
            "content_base64": null,
            "directory_permission": "0777",
            "file_permission": "0777",
            "filename": "hello.md",
            "id": "0a0a9f2a6772942557ab5355d76af442f8f65e01",
            "sensitive_content": null,
            "source": null
          },
          "sensitive_attributes": [],
          "private": "bnVsbA=="
        }
      ]
    }
  ]
}

Modifying or removing the resource outside of terraform will cause the next terraform <plan|apply> to produce the necessary changes to return to the declared resources state. Using terraform with the state stored locally works well for hobby projects or just trying it out but once multiple people begin editing the files, keeping the state consistent across all their computers becomes a challenge. To get around this, terraform has the concept of a remote state backend which maintains the state elsewhere for it to be shared amongst multiple users.

The remainder of post is about implementing your very own HTTP state provider using Cloudflare Workers , Durable Objects and R2 (Cloudflare’s S3-compatible object store).

Defining the API

Before we can write any code we first need to understand what the interface for the backend looks like and what it expects when issued any HTTP requests. The information page for the http backend has the following description for the API.

State will be fetched via GET, updated via POST, and purged with DELETE. The method used for updating is configurable.

Sounds pretty simple. What if we also want to support state locking?

This backend [http] optionally supports state locking. When locking support is enabled it will use LOCK and UNLOCK requests providing the lock info in the body. The endpoint should return a 423: Locked or 409: Conflict with the holding lock info when it’s already taken, 200: OK for success. Any other status will be considered an error. The ID of the holding lock info will be added as a query parameter to state updates requests.

While it sounds just as simple as the first description, there is actually a fair bit of ambiguity that needs to be figured out. Can one request the current state when it is locked or should a 423: Locked be returned if they are not the lock holder? What about the line “lock info will be added as a query parameter to state updates requests”. Is deleting the state included in the “state update requests”? Luckily the code is open-source, so we can dive in to see the expected behaviors for the client.

Can someone request the current state when it is locked?

Yes.

The .Get() state method does not include the lock info when making a request. The method also does not handle the aforementioned locking codes, 423: Locked and 409: Conflict. Using this information we get to decide the behavior for fetching the state when it is locked. Given this is a read-only operation this will be allowed, even when then state is locked. 1

Are `DELETE` requests considered a state update request?

Yes and no.

The “specification” states that all “lock info will be added as a query parameter to state updates requests” but one can see that the current lock info is not included in the .Delete() request. 2

While it is unclear whether .Delete() should be considered a state update request or not, deleting the state, even when locked, could have consequences when running apply. This is because of the lifecycle terraform follows when making state changes is something akin to:

Lock the state
Fetch the current remote state
Refresh the resource from upstream
Compute the plan
Apply plan
Update the remote state
Unlock the state

If step 1 succeeded but the state was deleted before step 2, the plan would detect all managed resources have been deleted and would propose a plan to create all the managed resources (even though they already exist!). For this reason, we will not support deleting the state while it is locked.

Summary of the API

All routes will follow the HTTP path patterns and use HTTP Basic Auth for authentication.

GET /states/:projectName: return the current state for :projectName
POST /states/:projectName: update the current state for :projectName
LOCK /states/:projectName/lock: lock the state for :projectName
UNLOCK /states/:projectName/lock: unlock the state for :projectName

Any requests that modify the state will add the query parameter ?ID=<UUID>.

The following diagram represents the flow the API will follow. For the sake of brevity, I did not draw LOCK/UNLOCK.

Diagram of API behavior

Implementation details

As I mentioned in the introduction, we will be building out the backend using Cloudflare Workers. The code will be written in TypeScript and the locking implemented using Durable Objects. The actual state file will be stored on Cloudflare R2.

Implementing authentication details is out of scope for this post, however we will assume that the HTTP Basic Authorization header is valid as the username will be used effectively “namespace” the state files to support multiple users along with multiple projects for each user.

All the code is available over on GitHub.

Router

Routing will be done using itty-router.

// src.index.ts
import { Router } from "itty-router";

const router = Router();

export default {
  async fetch(
    request: Request,
    env: Env,
    ctx: ExecutionContext,
  ): Promise<Response> {
    router.get(
      "/states/:projectName",
      withIdentity,
      withParams,
      getStateHandler,
    );
    router.post(
      "/states/:projectName",
      withIdentity,
      withParams,
      putStateHandler,
    );
    router.delete(
      "/states/:projectName",
      withIdentity,
      withParams,
      deleteStateHandler,
    );

    router.all(
      "/states/:projectName/lock",
      withIdentity,
      withParams,
      lockStateHandler,
    );

    router.all("*", () => new Response("Not found.\n", { status: 404 }));
    return router.handle(request, env);
  },
};

Notice that all HTTP verbs are forwarded for the path /states/:projectName/lock. This is because the lockStateHandler forwards the entire request directly to a Durable Object, which contains its own internal itty-router and logic for those routes.

Durable Locking

As part of the requirements to implement locking, we need some mechanism to maintain locks on the states. For this I chose to use Durable Objects (DO) and the nomenclature “Durable Lock” to specify the implementation. DO’s have a special guarantee that only one instance exists across the globe for a given ID. All requests to the DO are routed to the same instance of the object. They also provide access to persistent storage which will be used to store the current lock info. The Durable Locks will use the format :username/:projectName.tfstate. As a convenience the following function will template the string in the desired format.

const getObjectKey = (username: string, projectName: string) =>
  `${username}/${projectName}.tfstate`;

As mentioned above, the lockStateHandler that forwards the requests to the Durable Object will look something like:

export const lockStateHandler = async (
  request: RequestWithIdentity & RouteParams,
  env: Env,
) => {
  const { projectName } = request;
  const username = request.identity?.userInfo?.username || "";
  if (!projectName || projectName === "")
    return new Response("No project name specified.", { status: 400 });
  if (!username || username === "")
    return new Response("Unable to determine username", { status: 500 });
  const id = env.DURABLE_LOCK.idFromName(getObjectKey(username, projectName));
  const lock = env.DURABLE_LOCK.get(id);
  return lock.fetch(request);
};

When issued a valid LOCK request, given the state is not already locked, the Durable Object will store the provided lock info in the key _lock. The presence of any value in the _lock key is used to indicate a lock on the state has been acquired. All attempts to lock the state by another user, or to update the state with the incorrect lockID will be met with a 423: Locked.

// src/durableLock.ts
export class DurableLock {
  // parts omitted for brevity

  // LOCK /states/:projectName/lock
  private async lock(request: Request): Promise<Response> {
    return this.state.blockConcurrencyWhile(async () => {
      if (this.lockInfo) return Response.json(this.lockInfo, { status: 423 });
      const lockInfo = (await request.json()) as LockInfo;
      await this.state.storage.put("_lock", lockInfo);
      this.lockInfo = lockInfo;
      return new Response(); // 200: OK
    });
  }
}

The format of the lock payload is fairly straight forward.

// LockInfo
// https://github.com/hashicorp/terraform/blob/cb340207d8840f3d2bc5dab100a5813d1ea3122b/internal/states/statemgr/locker.go#L115
export interface LockInfo {
  ID: string;
  Operation: string;
  Info: string;
  Who: string;
  Version: string;
  Created: string;
  Path: string;
}

Now you may have been wondering: Why not store the terraform state in durable object storage? The reason for not doing so is that the limit for each value per key stored object is 128 KB. For large projects, 128 KB is simply not enough storage for an entire state file.

Storing state in R2

As with any object store, every blob is stored using a combination of a bucket and a key. The key for this project will use the same format as the Durable Locks, :username/:projectName.tfstate. For example, curl -u terraform:password https://tfstate.example.com/states/hello-world would map to the key terraform/hello-world.tfstate. Using this pattern allows us to remove the need for any additional methods for mapping between users, states and their underlying locks and R2 keys. It also allows us to specify the username as the prefix when listing objects to get a list of states for a given username.

All states are treated as opaque values and simply just passed between as-is to the underlying storage.

export const getStateHandler = async (
  request: RequestWithIdentity & RouteParams,
  env: Env,
) => {
  const { projectName } = request;
  const username = request.identity?.userInfo?.username || "";
  if (!projectName || projectName === "")
    return new Response("No project name specified.", { status: 400 });
  if (!username || username === "")
    return new Response("Unable to determine username", { status: 500 });
  const state: R2ObjectBody = await env.TFSTATE_BUCKET.get(
    getObjectKey(username, projectName),
  );
  if (!state) return new Response(null, { status: 204 });
  return new Response(await state?.arrayBuffer(), {
    headers: { "content-type": "application/json" },
  });
};

Deploying to Cloudflare

Using the Workers CLI, wrangler in addition to the configuration file wranger.toml we can deploy the worker to our Cloudflare Account. Once setup, and after creating the tfstate bucket, simply running wrangler publish src/index.ts will deploy your code across the globe.

Using the new backend

Now that we’ve built and deployed the backend using Cloudflare Workers, it’s time to give it a go. Revisiting the example from the introduction, we can define the backend section in the main.tf file.

terraform {
  backend "http" {
    address        = "https://tfstate.example.com/states/hello-world"
    lock_address   = "https://tfstate.example.com/states/hello-world/lock"
    unlock_address = "https://tfstate.example.com/states/hello-world/lock"
    username       = "terraform"
    password       = "password"
  }

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "2.2.3"
    }
  }
}

resource "local_file" "hello_md" {
  content  = "Hello, World!"
  filename = "hello.md"
}

Re-initialize the project using terraform init -reconfigure and then accept the prompt to push the current local state to the remote backend. The next run of terraform <plan|apply> will now use the remote state stored in R2.

FAQ

Why would I use this over the S3 backend with R2?

If you do not require locking, then the S3 backend may be more appropriate. The S3 backend only supports locking via DynamoDB.

How can I support basic auth with my own usernames and passwords?

There are many ways to implement your own basic authorization scheme. Instead of having to securely manage usernames, passwords and authorization policies I opted for Cloudflare Access and service tokens.

Using my own API Gateway, I shim the values from the authorization header into the necessary Cf-Access-Client-Idand Cf-Access-Client-Secret headers to be allowed through my defined access policies. The username and password are the Client ID and Client Secret, respectively.

Why did I build this?

I needed a remote state provider to manage my terraform state and I figured it would be a fun learning exercise to implement.

How much would this cost me to run?

Approx. $5/month as long as you stay within the very generous free tiers for Cloudflare Workers and R2.