Eleven missing Terraform features
I’ve done a fair amount of work with Terraform since its initial release in 2014. At the time it was a genuinely groundbreaking achievement. Since then I feel it has failed to live up to its promise, and it suffers from the inherent weaknesses of being a first mover in its space.
I think Terraform’s single biggest design flaw is the choice to unify too many concepts under a single roof. Terraform is, at the very least:
- a resource abstraction layer,
- a set of providers that implement those resource abstractions,
- an orchestration engine for applying changes to resources, and
- a configuration description language.
I would subtract the configuration description language, outsourcing those responsibilities to a fully-featured programming language. This is the central lesson of Pulumi, and it’s a good one. But that’s a topic for its own post…
Many of Terraform’s greatest misfeatures and missing features are not fundamental issues at all; they could be added to Terraform’s existing functionality rather straightforwardly. One area where I’ve seen recent investment is secrets management: people have realized that secrets in Terraform statefiles are no bueno, and have started to look for ways to avoid storing them[^1]. For the remainder of this post, I’ll describe the features that I think could be added to Terraform in order to make working with it a more joyful experience.
Why even bother?
I imagine some readers, especially ones who have spent time at big companies, are thinking “Why bother making Terraform good, especially given its fundamental flaws? We can build an in-house control plane that is just as good or better.” I have found this to be true for a company’s coremost deployment flows, and untrue elsewhere[^2]. If we’re talking about the deployment platform that builds stateless apps, manages their versioning, allocates them resources on a cluster somewhere, monitors their health, et cetera, then yes: Terraform is a very bad fit. Don’t use it for that! Use it for everything else.
The core advantages of Terraform are:
- its resource abstraction layer and the well-developed ecosystem of providers that implement that abstraction
- the fact that it manages infrastructure over its entire lifecycle, from birth to death, and can make updates and resolve drift in the meantime
- its status as a lingua franca for integrating with vendor-provided software
- **the fact that it deterministically[^3] produces a desired state, can describe that state in the form of a plan, and converges towards that state**
My non-Terraform experience comes mostly from working with platforms like Spinnaker (in its pre-managed delivery era) and Kubernetes operators. From this I have found that working in Terraform—and especially taking advantage of the ecosystem of existing Terraform providers, standing on the shoulders of giants—makes many tasks orders of magnitude cheaper compared to the alternatives. There’s substantial value in that: even at large software organizations, headcount should be considered scarce and maintenance costs should be minimized wherever possible.
But the single most valuable and underappreciated feature of Terraform is the one I have bolded above: it produces an understandable plan that describes the desired infrastructure state in significant detail. As an example, when Terraform produces “a database”, the plan will show what’s actually being created:
- an IAM role and policies attached to that role
- VPC security groups
- an identity definition in the internal CA service
- alert definitions in the monitoring service
- a capacity reservation with the capacity service
- a pipeline definition in the changefeeds service
- an actual Kubernetes ReplicaSet in a cluster somewhere[^4]
In contrast, most non-infrastructure-as-code control planes (e.g. Kubernetes controllers) lack the ability to statically describe their downstream effects[^5]. Submitting a resource like `DB1` to the k8s apiserver triggers controller logic that is Turing-complete and opaque, since it depends on both the controller’s execution logic and the apiserver’s admission logic. There is often no clear way to know which resources (`A1`, `B2`, `C3`, etc.) will be created, or why they might fail. Even in failure, Terraform’s plan gives operators an immediate, inspectable statement of intent, something the controller model simply does not offer. Anyone, even a non-expert in these systems, can look at the plan and recognize that the “would create `C3`” operation is the one blocking the `DB1` infrastructure from being fully converged. I find this to be infinitely easier than squinting at the logs of a Kubernetes controller loop, trying to figure out what it’s waiting for or what error it has encountered.
The List
In general this list starts with the most ambitious changes and proceeds towards the least ambitious ones. For each feature, I evaluate whether it can be built as automation to complement Terraform rather than requiring changes to Terraform itself. In many cases the answer is yes.
Unification of data and state
Terraform has a conception of data sources that sits entirely apart from the data-gathering process that populates the statefile. As an example, the code backing a `data.aws_s3_bucket` data source is similar to, but not the same as, the `terraform refresh` code path that reads the state of an `aws_s3_bucket` resource. What I’d like to see is the ability to snarf up most/all of the data in (for instance) an AWS account and store it in a cache of my cloud state, i.e. a cachefile. So I’d configure it to populate this cachefile with (1) complete information about all S3 buckets, (2) for bucket `config-xyz123`, the objects and their contents, and (3) for all other object types, only the state of objects that are present in the Terraform statefile.
The net result is that I would be able to cut the `terraform plan` process that runs on pull requests completely off from the Internet. At startup, the CI job would fetch the latest copy of the cachefile, and then the actual planning would be a completely local operation against the cachefile.
The benefits of this approach are:
- the pull-request CI job is now completely credentialless, since it makes no calls to the actual cloud that it is managing.
- flaky and/or ratelimited cloud APIs can no longer prevent Terraform from producing a plan. At worst, you have a cachefile with certain entries that are out of date.
- a file full of information about cloud state is valuable for other purposes, including all kinds of monitoring that are presently served by entirely separate tools.
- we can build Terraform test environments by synthesizing a cachefile that mocks a cloud environment.
- mocking no longer needs to be a separate consideration.
Is it DIYable?
Partially. You can create a Terraform runner that downloads data from a cloud provider (or other data source) and presents it to Terraform as files on disk. This would be a reasonably good replacement for most `data` resources. But doing the same thing to `terraform refresh` would involve forking providers and doing invasive surgery on them. That’s probably not within the reach of a DIYer.
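As a rough illustration of the files-on-disk idea, here is a minimal sketch (Python with boto3; the `cache/s3_buckets.json` path and schema are made up for illustration) of a pre-plan step that snapshots S3 bucket metadata. Terraform code could then read the snapshot with `jsondecode(file(...))` instead of calling a live data source.

```python
# Sketch of a pre-plan "cachefile" builder: snapshot S3 bucket metadata to local JSON.
# The cache path and schema are hypothetical; adapt them to your own repo layout.
import json
import os

import boto3


def snapshot_s3_buckets(path: str = "cache/s3_buckets.json") -> None:
    s3 = boto3.client("s3")
    buckets = {}
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # LocationConstraint is None for us-east-1.
        region = s3.get_bucket_location(Bucket=name).get("LocationConstraint") or "us-east-1"
        buckets[name] = {"region": region, "created": bucket["CreationDate"].isoformat()}
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(buckets, f, indent=2, sort_keys=True)


if __name__ == "__main__":
    snapshot_s3_buckets()
```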
Desired state and incremental compilation
I don’t know about everyone else, but my biggest pet peeve with infrastructure-as-code (IaC) tools like Terraform is that they convert a 60-second task of clicking in a cloud UI into a 60-minute task of waiting for CI to complete, running tests, etc. This may be acceptable if I’m trying to automate something that will be done thousands of times, but if I just have a one-off, why bother? It doesn’t help that Terraform itself can be quite slow: I have seen `terraform plan -refresh=false` take 10 CPU-minutes on statefiles that contain only single-digit thousands of resources. To get to the kind of productivity loop I want, things need to be at least two orders of magnitude faster. Once a cachefile comes into the picture, we can start to see a path towards enabling this kind of speed. A Terraform plan is a nearly-deterministic output of a few inputs:
- Terraform code (`.tf` and `.tf.json` files)
- providers
- the cachefile
- any supplementary data files
In most cases the author of a change can do their local work against an empty cachefile. Terraform could incrementally compile at the module level, and the result would be a compiled desired state, which lives independently of the plan. Puppet (a tool from which Terraform inherits numerous other ideas) has a completely separate step of catalog compilation that it does in advance of diffing against the current infrastructure state. Terraform should do the same.
Is it DIYable?
For incremental compilation: no.
For an independent tool: yes, but the value is minimal with the tools available today. When Terraform produces a plan, it will provide a change representation with the known values for planned objects. If you run a plan for a complex Terraform workspace against an empty statefile, most of the planned values will be unknown, and therefore missing from the plan JSON. To be truly useful, we’d need Terraform to also provide the HCL that will be used to compute attributes during the apply process.
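To make that limitation concrete, here is a sketch that reports which planned attributes are known versus unknown, assuming you’ve exported the plan with `terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json`.

```python
# Sketch: report known vs. unknown planned attributes from a JSON plan.
# plan.json is assumed to come from `terraform show -json plan.tfplan`.
import json

with open("plan.json") as f:
    plan = json.load(f)

for rc in plan.get("resource_changes", []):
    change = rc["change"]
    known = sorted((change.get("after") or {}).keys())
    unknown = sorted(k for k, v in (change.get("after_unknown") or {}).items() if v)
    print(f"{rc['address']}: known={known} unknown={unknown}")
```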
Easier import
Terraform has the concept of a statefile, which I believe was innovative among infrastructure provisioning tools at the time (2014). The statefile allows Terraform to coexist peacefully in the same cloud environment with other control planes, because Terraform is only ever allowed to modify or delete the resources that it remembers creating (i.e. the ones present in its statefile). This is a good default![^6]
Terraform has `import` (since v0.7, released August 2016), import blocks (since v1.5, released June 2023), and HCL generation from imported resources (also since v1.5). These tools allow a person to point Terraform at a specific resource and say “manage that thing there”. But the author must still figure out what resource IDs to import. Far more frequently I want the reverse: I want to write the HCL that generates my resources, and then I want Terraform to offer “oh, it looks like an S3 bucket with that name already exists. Here is its resource ID. Would you like to import it?”.
To achieve this, the Terraform provider interface should be extended to support a new callback: `check_exists`. This would be fed all of the information from a planned resource creation, and the provider would answer with either “that resource doesn’t exist, go ahead and create it” or “the cloud provider isn’t going to let you create that resource, because the same resource already exists. Here’s the ID of the existing resource.” Terraform would then prompt the user to convert a plan’s `create` operations into `import` operations.
Is it DIYable?
Yes. Write code that digests the plan, extracts all resource creations, and checks with the cloud provider whether the resource already exists. Each resource type will need its own logic, although there may be enough similarities within a given cloud provider that the overall LoC count to support numerous resource types is still reasonably low.
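Here is a sketch of that idea for a single resource type, assuming a JSON plan in `plan.json` and boto3 credentials with read access; it prints candidate `import` blocks for S3 buckets that Terraform plans to create but that already exist.

```python
# Sketch: turn "create" operations in a plan into suggested import blocks when the
# resource already exists. Only aws_s3_bucket is handled; each type needs its own check.
import json

import boto3
from botocore.exceptions import ClientError


def s3_bucket_exists(name: str) -> bool:
    try:
        boto3.client("s3").head_bucket(Bucket=name)
        return True
    except ClientError as e:
        # 404 means the bucket doesn't exist; 403 means it exists but isn't ours to touch.
        return e.response["Error"]["Code"] != "404"


with open("plan.json") as f:
    plan = json.load(f)

for rc in plan.get("resource_changes", []):
    if rc["type"] == "aws_s3_bucket" and rc["change"]["actions"] == ["create"]:
        name = (rc["change"]["after"] or {}).get("bucket")
        if name and s3_bucket_exists(name):
            print(f'import {{\n  to = {rc["address"]}\n  id = "{name}"\n}}')
```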
Universal management
Terraform—when combined with SCM tools like git—is a powerful tool for infrastructure change management, and it can be further combined with policy-as-code tools (e.g. OPA) to enforce invariants on infrastructure. But Terraform doesn’t have a way to guarantee that it is the exclusive means to manage a given type of resource. In practice this means that infrastructure and security teams must build completely separate tools (called CSPMs) to audit resource configuration and identify clickopsed or otherwise non-compliant resources.
It would be significantly better if there were a generic way to inform Terraform that it is the sole owner of a given domain. This would be similar to Puppet’s `resources { ... purge => true }` invocation. For Terraform, the invocation would say things like:
- exclusively manage all S3 objects with prefix `terraform/stack-1234/`
- exclusively manage all IAM roles tagged with `owner: terraform stack 1234`
- exclusively manage all subscriptions associated with SNS topics in this statefile
- exclusively manage all policies attached to IAM roles in this statefile
At present, Terraform provides no means of achieving examples 1-3. Example 4 has some interesting history. The AWS provider has two different ways to specify exclusive management of attached IAM policies: `managed_policy_arns` (deprecated) is given inline within the `aws_iam_role` resource, whereas `aws_iam_role_policy_attachments_exclusive` (the new hotness) is a completely separate resource that pairs with `aws_iam_role_policy_attachment`. But both of them are too clunky and require too much from the user.
Ideally the configuration would be just:
resource "aws_iam_role" "this" {
...
policy_attachments_exclusive = true
}
resource "aws_iam_role_policy_attachment" "this" {
for_each = var.managed_arns
role_name = aws_iam_role.this.name
policy_arn = each.key
}
In this example, `var.managed_arns` describes the policy attachments that we want to keep, and Terraform would generate a plan that contains destroy operations on any other `aws_iam_role_policy_attachment` resources that it sees in the wild. This would function akin to an “import-for-delete” operation, where the unmanaged resource gets momentarily added to the statefile and then Terraform immediately destroys it.
Is it DIYable?
Yes. Build Terraform post-apply automation that searches for resources that exist in the wild and subtracts out resources seen in the statefile. Any resources that remain are unmanaged, and the automation can decide how and when to simply delete them, or throw up a warning for human review.
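As a sketch of the shape such automation could take (the `owner` tag convention and stack name below are hypothetical), here is a post-apply check that flags IAM roles carrying the tag but missing from the statefile.

```python
# Sketch: flag IAM roles tagged owner=terraform-stack-1234 (a made-up convention) that
# are absent from the statefile. Whether to delete or just warn is left to the caller.
import json
import subprocess

import boto3


def managed_role_names() -> set[str]:
    state = json.loads(subprocess.check_output(["terraform", "show", "-json"]))
    resources = state.get("values", {}).get("root_module", {}).get("resources", [])
    # Root module only; a real tool would also recurse into child_modules.
    return {r["values"]["name"] for r in resources if r["type"] == "aws_iam_role"}


def tagged_role_names(key: str = "owner", value: str = "terraform-stack-1234") -> set[str]:
    iam = boto3.client("iam")
    names = set()
    for page in iam.get_paginator("list_roles").paginate():
        for role in page["Roles"]:
            tags = iam.list_role_tags(RoleName=role["RoleName"])["Tags"]
            if {"Key": key, "Value": value} in tags:
                names.add(role["RoleName"])
    return names


for name in sorted(tagged_role_names() - managed_role_names()):
    print(f"unmanaged IAM role: {name}")
```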
Refactoring within statefiles
Terraform’s `moved` blocks are a low-level primitive, but Terraform should be providing a significantly higher-level primitive that automates the production of `moved` blocks. Specifically, the Terraform provider API should add an “are these two resources actually the same?” hook. If a Terraform plan causes a resource to be deleted at location A and created at location B, then the provider should say “yep, I’m reasonably confident they’re the same”, and Terraform should produce a `moved` block automatically. So if I delete an S3 bucket `config-xyz123` in one module and create the same-named bucket in another, my Terraform plan should suggest a `moved` block for me to accept or reject.
Is it DIYable?
Yes. It’s a big list of attributes that, when equal, cause two resources to be equal. For instance, with S3 buckets it would be `name`, plus `account_id` (which is implicit in the provider). Create two hashtables: one stores deletes and another stores creates. Any resource with the same key in both hashtables is a candidate move.
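A sketch of that matching, again working from a JSON plan and handling only `aws_s3_bucket` (keyed on its bucket name); a real tool would need a key function per resource type.

```python
# Sketch: propose `moved` blocks by matching planned deletes to planned creates on
# identifying attributes. Only aws_s3_bucket is keyed here (on its bucket name).
import json

KEY_FNS = {"aws_s3_bucket": lambda values: values.get("bucket")}

with open("plan.json") as f:
    plan = json.load(f)

deletes: dict = {}
creates: dict = {}
for rc in plan.get("resource_changes", []):
    key_fn = KEY_FNS.get(rc["type"])
    if key_fn is None:
        continue
    actions = rc["change"]["actions"]
    if actions == ["delete"]:
        deletes[(rc["type"], key_fn(rc["change"]["before"] or {}))] = rc["address"]
    elif actions == ["create"]:
        creates[(rc["type"], key_fn(rc["change"]["after"] or {}))] = rc["address"]

for key in deletes.keys() & creates.keys():
    print(f"moved {{\n  from = {deletes[key]}\n  to   = {creates[key]}\n}}")
```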
Semantic `lifecycle` blocks
Terraform’s `lifecycle` blocks are syntactic features, rather than semantic ones. As an example, this is impossible in Terraform:
resource "time_sleep" "test" {
for_each = toset(["a", "b"])
lifecycle {
prevent_destroy = each.key == "a"
}
}
The example tries to use `each.key` to prevent destruction of only a single instance, but that’s disallowed in Terraform. Sadly, the only information you’re allowed to use in these places is syntactic: you cannot refer to a specific instance of a parametrized (`for_each`/`count`) resource, and you certainly cannot refer to the instance of the module it is sitting in. This is ridiculous. No feature in Terraform should be syntactic-only.
In addition to `prevent_destroy`, you cannot use semantic information in `removed` blocks or in `ignore_changes`.
Is it DIYable?
Yes for `removed` blocks; no for `ignore_changes`. Build automation which runs before generating a Terraform plan: it reads a file and invokes `terraform state rm` against the addresses listed in that file.
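A minimal sketch of that pre-plan hook; the file name and one-address-per-line format are assumptions.

```python
# Sketch: remove listed addresses from state before planning. Note that
# `terraform state rm` only forgets a resource; it does not destroy it.
import subprocess

with open("state-removals.txt") as f:
    addresses = [line.strip() for line in f if line.strip() and not line.startswith("#")]

for address in addresses:
    subprocess.run(["terraform", "state", "rm", address], check=True)
```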
Invert the default for `prevent_destroy`
I have seen organizations where Terraform wasn’t allowed to destroy anything: the admins had actually revoked the cloud permissions that would allow the Terraform service account to perform a delete operation. As a result, their control plane became a weird mishmash of Terraform and human clickops, and its performance was highly asymmetrical: it was good at creating infrastructure quickly, but abysmal at deletions and replacements. The oncall rotation responsible for `terraform apply` became button-pushers-as-a-service, which is not a good place for any engineering team to be.
It would be much better to start Terraform with a default of “no resource may be destroyed”, so that there’s a meaningful difference in impedance between “here’s a plan to add some new resources” and “this plan destroys everything.” But the impedance should be just enough to prevent accidental destruction of resources, not so great that it actually prevents people from getting their jobs done.
Is it DIYable?
Yes. Build automation which analyzes a plan and validates that all planned deletes are listed in a file which the PR author has committed to the source control repository. At runtime, if the Terraform plan contains a resource destruction that wasn’t explicitly allowlisted, the automation stops and waits for human involvement[^7].
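A sketch of that check, assuming the allowlist lives in `allowed-destroys.txt` and the plan has been exported to `plan.json`.

```python
# Sketch: fail CI unless every planned destroy appears in a committed allowlist.
import json
import sys

with open("allowed-destroys.txt") as f:
    allowed = {line.strip() for line in f if line.strip() and not line.startswith("#")}

with open("plan.json") as f:
    plan = json.load(f)

violations = [
    rc["address"]
    for rc in plan.get("resource_changes", [])
    if "delete" in rc["change"]["actions"] and rc["address"] not in allowed
]

if violations:
    print("planned destroys that are not allowlisted; stopping for human review:")
    print("\n".join(violations))
    sys.exit(1)
```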
Flexibility in moving resources between statefiles
As mentioned above, I think Terraform’s benchmark for “time to deployment” of small changes needs to be comparable to “just clickops it”. In other words, it needs to be fast enough that people don’t feel the urge to use non-IaC methods to manage their infrastructure. This is a high bar.
One obvious issue is that Terraform gets slow with big statefiles, and splitting big statefiles into smaller ones is not a natively-supported operation. Pulumi released `pulumi state move` in July of 2024. IMO Terraform should have had that feature by its fifth birthday, in 2019; the GitHub issue for it was filed in 2023.
I think the even more ambitious among us would say “why can’t I operate on subgraphs of a Terraform statefile in parallel?”. The OpenTofu folks are talking about that; I think it’s the right direction, since it’s what any control plane that wants scalability would do.
Is it DIYable?
Yes. In practice people have been splicing resources between statefiles for ages, which perhaps explains why the issue for it was only filed in 2023. Doing `remove` + `import` operations works well, and is not too hard with proper tools. But in some cases it’s necessary to splice the actual statefile data into a new file, and Terraform should provide native support for this.
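For one resource at a time, the remove-plus-import dance can look something like the following sketch, assuming the two workspaces live in `./old` and `./new`, the address stays the same, and the resource’s import ID is its `id` attribute (true for many, but not all, AWS resources).

```python
# Sketch: move one resource between statefiles via import-into-new then remove-from-old.
# Paths, the address, and the use of the `id` attribute are all assumptions.
import json
import subprocess

ADDRESS = "aws_s3_bucket.config"  # hypothetical resource address

state = json.loads(subprocess.check_output(["terraform", "show", "-json"], cwd="old"))
resources = state["values"]["root_module"]["resources"]
import_id = next(r["values"]["id"] for r in resources if r["address"] == ADDRESS)

# Import into the new workspace first, then forget the resource in the old one.
subprocess.run(["terraform", "import", ADDRESS, import_id], cwd="new", check=True)
subprocess.run(["terraform", "state", "rm", ADDRESS], cwd="old", check=True)
```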
Provider customization
To the average Terraform author there is a stark gap between “the things that I can touch” (Terraform code) vs “the things that are immutable facts about the world” (Terraform providers). If something can be done purely in Terraform code, then it can probably ship today. If something needs a provider patch, that’s a matter of “I don’t want to fork the provider, so let’s open a Github issue and see whether the maintainer would be willing to take my patch. If things go well it’ll make it into a release a few months from now.”
Puppet and Pulumi both sidestep this problem by having providers that ship with the underlying code. This makes it a natural and easy operation to write new provider logic, and you can ship a provider change in the same commit as the functionality that uses that change.
Is it DIYable?
Sorta-kinda, but ultimately not really. You can use `terraform_data`/`null_resource` with the `local-exec` provisioner to build bootleg providers: I have found this to be error-prone. You can fork a provider and compile it in-tree (e.g. in a git submodule) with the rest of your Terraform code. I haven’t tried this: I imagine it would work (so long as you cached the compiled binary between runs), but it would subject you to the pain of maintaining an entire forked provider. It would probably be better to be able to split off just the one or two resource types you want to modify into a separate provider, but that’ll involve its own miseries.
Lying about resource state
In some cases (especially incidents) we want to be able to make changes to Terraform-owned resources without worrying that Terraform is going to come along and immediately revert the changes we’ve made. So far as I can tell, the best solution is to make Terraform (falsely) believe that the resource hasn’t changed. We would “pin” selected resources by inhibiting their `terraform refresh`. The net effect is that Terraform is still functional for the rest of the resources in the statefile, even dependencies of the pinned resources.
Is it DIYable?
Yes, I believe so: you can simply doctor the Terraform statefile after a `terraform refresh`, and then do `terraform plan -refresh=false`. But the user-facing ergonomics would be better within Terraform itself, since Terraform could clearly print “resource state pinned” right in the middle of the plan output.
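A sketch of that doctoring, assuming local state (so `terraform refresh` leaves a `terraform.tfstate.backup` behind) and pinning by `(type, name)`; a real tool would also handle modules and instance keys.

```python
# Sketch: "pin" resources by copying their pre-refresh entries back into the refreshed
# statefile, then run `terraform plan -refresh=false`. Local-state layout is assumed.
import json

PINNED = {("aws_s3_bucket", "config")}  # hypothetical (type, name) pairs to pin


def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


old = load("terraform.tfstate.backup")
new = load("terraform.tfstate")
old_by_key = {(r["type"], r["name"]): r for r in old["resources"]}

for i, resource in enumerate(new["resources"]):
    key = (resource["type"], resource["name"])
    if key in PINNED and key in old_by_key:
        new["resources"][i] = old_by_key[key]

with open("terraform.tfstate", "w") as f:
    json.dump(new, f, indent=2)
```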
Native separation of plan and apply
Terraform and the Terraform ecosystem should assume that the identity performing a `plan` operation is strictly separate from the identity performing an `apply` operation. Like all other IaC tools, Terraform is a powerful enabler of “privilege escalation through code review”: the Terraform-based control plane enables someone to achieve something by merging a PR that they would not, and should not, be able to achieve by “directly” interacting with the infrastructure.
But this is subverted pretty much entirely if the `terraform plan` process that runs in a CI environment has access to the same high-privilege credential used for `terraform apply`. The Terraform ecosystem should expect that the typical deployment mode uses one identity for applies and a completely separate identity to generate plans (both for PRs and for plans that will eventually be applied).
Is it DIYable?
Yep! There are a few aspects of the provider ecosystem that assume the `plan` identity can mutate the cloud environment, but these can be worked around without too much difficulty.
In Conclusion
It’s a crying shame that our industry is stuck with Terraform: it seems to sit in a local minimum of being just good enough for many use-cases, but not actually a joy to work with. We certainly deserve a better control plane, one that is quicker, easier to understand, cheaper to maintain, and more reliable. I’d like to say “throw it all away and start over”, but that would be forsaking Terraform’s many advantages, especially its deep bench of providers. Maybe Pulumi will save us, but like I said, that’s a topic for another post.
[^1]: Specifically, they’ve added ephemeral resources (v1.10, November 2024) and write-only attributes (v1.11, February 2025).
[^2]: Yes, this is true even at big companies, although it might cease to be true at the very largest engineering organizations (e.g. Alphabet, Meta). Even at medium-large companies, devtools teams often can’t keep up with the needs of the coremost deployment flows.
[^3]: There are some exceptions to determinism in the core language, like `timestamp()`, which is clearly non-deterministic. And of course a Terraform provider can inject whatever kinds of non-determinism it wants. Despite all of this, the Terraform environment strongly encourages determinism and makes non-determinism something that creeps in quite rarely.
[^4]: In this example, the task of autoscaling the database’s compute layer is still handled by a tight control loop within the Kubernetes cluster. But all of the more static infrastructure definitions that make up this database abstraction are managed by, and inspectable in, Terraform.
[^5]: I say “most” because I think there are certain paradigms available within Kubernetes that achieve this ideal, or come very close to it. In researching this post, cdk8s and pulumi-kubernetes-operator both looked very interesting.
[^6]: The statefile has another crucial function: it makes garbage collection mandatory. With non-statefile tools (e.g. Puppet) it is possible to create resources in commit 1 and then stop managing them in commit 2: the result is that the resources from commit 1 remain deployed, and represent a form of silent drift between the presently-deployed fleet and the hosts that you’d get if you ran Puppet on hosts launched from a blank slate.
[^7]: Some people might be thinking “shouldn’t all Terraform applies stop and wait for human involvement?”. My answer is an emphatic “no”. Every time your control plane stops and prompts a human for intervention, that is an alert, and in 99% of cases that alert will only convey the message “your Terraform has planned exactly what you expected it to plan.” All alerts should be actionable, both to safeguard human productivity and to prevent alert fatigue. The right move is to automatically identify the 1% of exceptional cases, and alert on those.