Eleven missing Terraform features
I’ve done a fair amount of work with Terraform since its initial release in 2014. At the time it was a genuinely groundbreaking achievement. Since then I feel it has failed to live up to its promise, and it suffers from the inherent weaknesses of being a first mover in its space.
I think Terraform’s single biggest design flaw is the choice to unify too many concepts under a single roof. Terraform is, at the very least:
- a resource abstraction layer,
- a set of providers that implement those resource abstractions,
- an orchestration engine for applying changes to resources, and
- a configuration description language.
I would subtract the configuration description language, outsourcing those responsibilities to a fully-featured programming language. This is the central lesson of Pulumi, and it’s a good one. But that’s a topic for its own post…
Many of Terraform’s greatest misfeatures and missing features are not fundamental issues at all; they could be added to Terraform’s existing functionality rather straightforwardly. One area where I’ve seen recent investment is secrets management: people have realized that secrets in Terraform statefiles are no bueno, and have started to look for ways to avoid storing them[^1]. For the remainder of this post, I’ll describe the features that I think could be added to Terraform in order to make working with it a more joyful experience.
Why even bother?
I imagine some readers, especially ones who have spent time at big companies, are thinking “Why bother making Terraform good, especially given its fundamental flaws? We can build an in-house control plane that is just as good or better.” I have found this to be true for a company’s coremost deployment flows, and untrue elsewhere[^2]. If we’re talking about the deployment platform that builds stateless apps, manages their versioning, allocates them resources on a cluster somewhere, monitors their health, et cetera, then yes: Terraform is a very bad fit. Don’t use it for that! Use it for everything else.
The core advantages of Terraform are:
- its resource abstraction layer and the well-developed ecosystem of providers that implement that abstraction
- the fact that it manages infrastructure over its entire lifecycle, from birth to death, and can make updates and resolve drift in the meantime
- its status as a lingua franca for integrating with vendor-provided software
- **the fact that it deterministically[^3] produces a desired state, can describe that state in the form of a plan, and converges towards that state**
My non-Terraform experience comes mostly from working with platforms like Spinnaker (in its pre-managed delivery era) and Kubernetes operators. From this I have found that working in Terraform—and especially taking advantage of the ecosystem of existing Terraform providers, standing on the shoulders of giants—makes many tasks orders of magnitude cheaper compared to the alternatives. There’s substantial value in that: even at large software organizations, headcount should be considered scarce and maintenance costs should be minimized wherever possible.
But the single most valuable and underappreciated feature of Terraform is the one I have bolded above: it produces an understandable plan that describes the desired infrastructure state in significant detail. As an example, when Terraform produces “a database”, the plan will show what’s actually being created:
- an IAM role and policies attached to that role
- VPC security groups
- an identity definition in the internal CA service
- alert definitions in the monitoring service
- a capacity reservation with the capacity service
- a pipeline definition in the changefeeds service
- an actual Kubernetes ReplicaSet in a cluster somewhere[^4]
In contrast, most non-infrastructure-as-code control planes (e.g. Kubernetes controllers) lack the ability to statically describe their downstream effects[^5]. Submitting a resource like `DB1` to the k8s apiserver triggers controller logic that is Turing-complete and opaque, since it depends on both the controller’s execution logic and the apiserver’s admission logic. There is often no clear way to know which resources (`A1`, `B2`, `C3`, etc.) will be created, or why they might fail. Even in failure, Terraform’s plan gives operators an immediate, inspectable statement of intent, something the controller model simply does not offer. Anyone, even a non-expert in these systems, can look at the plan and recognize that the “would create `C3`” operation is the one blocking the `DB1` infrastructure from being fully converged. I find this to be infinitely easier than squinting at the logs of a Kubernetes controller loop, trying to figure out what it’s waiting for or what error it has encountered.
The List
In general this list starts with the most ambitious changes and proceeds towards the least ambitious ones. For each feature, I evaluate whether it can be built as automation to complement Terraform rather than requiring changes to Terraform itself. In many cases the answer is yes.
Unification of data and state
Terraform has a conception of data sources that sits entirely apart from the data-gathering process that populates the statefile. As an example, the code backing a `data.aws_s3_bucket` data source is similar to, but not the same as, the `terraform refresh` code path that reads the state of an `aws_s3_bucket` resource. What I’d like to see is the ability to snarf up most/all of the data in (for instance) an AWS account and store it in a cache of my cloud state, i.e. a cachefile. So I’d configure it to populate this cachefile with (1) complete information about all S3 buckets, (2) for bucket `config-xyz123`, the objects and their contents, and (3) for all other object types, only the state of objects that are present in the Terraform statefile.
The net result is that I would be able to cut the `terraform plan` process that runs on pull requests completely off from the Internet. At startup, the CI job would fetch the latest copy of the cachefile, and then the actual planning would be a completely local operation against the cachefile.
The benefits of this approach are:
- the pull-request CI job is now completely credentialless, since it makes no calls to the actual cloud that it is managing.
- flaky and/or ratelimited cloud APIs can no longer prevent Terraform from producing a plan. At worst, you have a cachefile with certain entries that are out of date.
- a file full of information about cloud state is valuable for other purposes, including all kinds of monitoring that are presently served by entirely separate tools.
- we can build Terraform test environments by synthesizing a cachefile that mocks a cloud environment.
- mocking no longer needs to be a separate consideration.
Is it DIYable?
Partially. You can create a Terraform runner that downloads data from a cloud provider (or other data source) and presents it to Terraform as files on disk. This would be a reasonably good replacement for most `data` resources. But doing the same thing to `terraform refresh` would involve forking providers and doing invasive surgery on them. That’s probably not within the reach of a DIYer.
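As a rough illustration of the files-on-disk idea, here is a minimal sketch (Python with boto3; the `cache/s3_buckets.json` path and schema are made up for illustration) of a pre-plan step that snapshots S3 bucket metadata. Terraform code could then read the snapshot with `jsondecode(file(...))` instead of calling a live data source.

```python
# Sketch of a pre-plan "cachefile" builder: snapshot S3 bucket metadata to local JSON.
# The cache path and schema are hypothetical; adapt them to your own repo layout.
import json
import os

import boto3


def snapshot_s3_buckets(path: str = "cache/s3_buckets.json") -> None:
    s3 = boto3.client("s3")
    buckets = {}
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # LocationConstraint is None for us-east-1.
        region = s3.get_bucket_location(Bucket=name).get("LocationConstraint") or "us-east-1"
        buckets[name] = {"region": region, "created": bucket["CreationDate"].isoformat()}
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(buckets, f, indent=2, sort_keys=True)


if __name__ == "__main__":
    snapshot_s3_buckets()
```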
Desired state and incremental compilation
I don’t know about everyone else, but my biggest pet peeve with infrastructure-as-code (IaC) tools like Terraform is that they convert a 60-second task of clicking in a cloud UI into a 60-minute task of waiting for CI to complete, running tests, etc. This may be acceptable if I’m trying to automate something that will be done thousands of times, but if I just have a one-off, why bother? It doesn’t help that Terraform itself can be quite slow: I have seen `terraform plan -refresh=false` take 10 CPU-minutes on statefiles that contain only single-digit thousands of resources. To get to the kind of productivity loop I want, things need to be at least two orders of magnitude faster. Once a cachefile comes into the picture, we can start to see a path towards enabling this kind of speed. A Terraform plan is a nearly-deterministic output of a few inputs:
- Terraform code (`.tf` and `.tf.json` files)
- providers
- the cachefile
- any supplementary data files
In most cases the author of a change can do their local work against an empty cachefile. Terraform could incrementally compile at the module level, and the result would be a compiled desired state, which lives independently of the plan. Puppet (a tool from which Terraform inherits numerous other ideas) has a completely separate step of catalog compilation that it does in advance of diffing against the current infrastructure state. Terraform should do the same.
Is it DIYable?
For incremental compilation: no.
For an independent tool: yes, but the value is minimal with the tools available today. When Terraform produces a plan, it will provide a change representation with the known values for planned objects. If you run a plan for a complex Terraform workspace against an empty statefile, most of the planned values will be unknown, and therefore missing from the plan JSON. To be truly useful, we’d need Terraform to also provide the HCL that will be used to compute attributes during the apply process.
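To make that limitation concrete, here is a sketch that reports which planned attributes are known versus unknown, assuming you’ve exported the plan with `terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json`.

```python
# Sketch: report known vs. unknown planned attributes from a JSON plan.
# plan.json is assumed to come from `terraform show -json plan.tfplan`.
import json

with open("plan.json") as f:
    plan = json.load(f)

for rc in plan.get("resource_changes", []):
    change = rc["change"]
    known = sorted((change.get("after") or {}).keys())
    unknown = sorted(k for k, v in (change.get("after_unknown") or {}).items() if v)
    print(f"{rc['address']}: known={known} unknown={unknown}")
```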
Easier import
Terraform has the concept of a statefile, which I believe was innovative among infrastructure provisioning tools at the time (2014). The statefile allows Terraform to coexist peacefully in the same cloud environment with other control planes, because Terraform is only ever allowed to modify or delete the resources that it remembers creating (i.e. the ones present in its statefile). This is a good default![^6]
Terraform has `import` (since v0.7, released August 2016), import blocks (since v1.5, released June 2023), and HCL generation from imported resources (also since v1.5). These tools allow a person to point Terraform at a specific resource and say “manage that thing there”. But the author must still figure out what resource IDs to import. Far more frequently I want the reverse: I want to write the HCL that generates my resources, and then I want Terraform to offer “oh, it looks like an S3 bucket with that name already exists. Here is its resource ID. Would you like to import it?”.
To achieve this, the Terraform provider interface should be extended to support a new callback: `check_exists`. This would be fed all of the information from a planned resource creation, and the provider would answer with either “that resource doesn’t exist, go ahead and create it” or “the cloud provider isn’t going to let you create that resource, because the same resource already exists. Here’s the ID of the existing resource.” Terraform would then prompt the user to convert a plan’s `create` operations into `import` operations.
Is it DIYable?
Yes. Write code that digests the plan, extracts all resource creations, and checks with the cloud provider whether the resource already exists. Each resource type will need its own logic, although there may be enough similarities within a given cloud provider that the overall LoC count to support numerous resource types is still reasonably low.
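Here is a sketch of that idea for a single resource type, assuming a JSON plan in `plan.json` and boto3 credentials with read access; it prints candidate `import` blocks for S3 buckets that Terraform plans to create but that already exist.

```python
# Sketch: turn "create" operations in a plan into suggested import blocks when the
# resource already exists. Only aws_s3_bucket is handled; each type needs its own check.
import json

import boto3
from botocore.exceptions import ClientError


def s3_bucket_exists(name: str) -> bool:
    try:
        boto3.client("s3").head_bucket(Bucket=name)
        return True
    except ClientError as e:
        # 404 means the bucket doesn't exist; 403 means it exists but isn't ours to touch.
        return e.response["Error"]["Code"] != "404"


with open("plan.json") as f:
    plan = json.load(f)

for rc in plan.get("resource_changes", []):
    if rc["type"] == "aws_s3_bucket" and rc["change"]["actions"] == ["create"]:
        name = (rc["change"]["after"] or {}).get("bucket")
        if name and s3_bucket_exists(name):
            print(f'import {{\n  to = {rc["address"]}\n  id = "{name}"\n}}')
```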
Universal management
Terraform—when combined with SCM tools like git—is a powerful tool for infrastructure change management, and it can be further combined with policy-as-code tools (e.g. OPA) to enforce invariants on infrastructure. But Terraform doesn’t have a way to guarantee that it is the exclusive means to manage a given type of resource. In practice this means that infrastructure and security teams must build completely separate tools (called CSPMs) to audit resource configuration and identify clickopsed or otherwise non-compliant resources.
It would be significantly better if there were a generic way to inform Terraform that it is the sole owner of a given domain. This would be similar to Puppet’s `resources { ... purge => true }` invocation. For Terraform, the invocation would say things like:
- exclusively manage all S3 objects with prefix `terraform/stack-1234/`
- exclusively manage all IAM roles tagged with `owner: terraform stack 1234`
- exclusively manage all subscriptions associated with SNS topics in this statefile
- exclusively manage all policies attached to IAM roles in this statefile
At present, Terraform provides no means of achieving examples 1-3. Example 4 has some interesting history. The AWS provider has two different ways to specify exclusive management of attached IAM policies: `managed_policy_arns` (deprecated) is given inline within the `aws_iam_role` resource, whereas `aws_iam_role_policy_attachments_exclusive` (the new hotness) is a completely separate resource that pairs with `aws_iam_role_policy_attachment`. But both of them are too clunky and require too much from the user.
Ideally the configuration would be just:
resource "aws_iam_role" "this" {
...
policy_attachments_exclusive = true
}
resource "aws_iam_role_policy_attachment" "this" {
for_each = var.managed_arns
role_name = aws_iam_role.this.name
policy_arn = each.key
}
In this example, `var.managed_arns` describes the policy attachments that we want to keep, and Terraform would generate a plan that contains destroy operations on any other `aws_iam_role_policy_attachment` resources that it sees in the wild. This would function akin to an “import-for-delete” operation, where the unmanaged resource gets momentarily added to the statefile and then Terraform immediately destroys it.
Is it DIYable?
Yes. Build Terraform post-apply automation that searches for resources that exist in the wild and subtracts out resources seen in the statefile. Any resources that remain are unmanaged, and the automation can decide how and when to simply delete them, or throw up a warning for human review.
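As a sketch of the shape such automation could take (the `owner` tag convention and stack name below are hypothetical), here is a post-apply check that flags IAM roles carrying the tag but missing from the statefile.

```python
# Sketch: flag IAM roles tagged owner=terraform-stack-1234 (a made-up convention) that
# are absent from the statefile. Whether to delete or just warn is left to the caller.
import json
import subprocess

import boto3


def managed_role_names() -> set[str]:
    state = json.loads(subprocess.check_output(["terraform", "show", "-json"]))
    resources = state.get("values", {}).get("root_module", {}).get("resources", [])
    # Root module only; a real tool would also recurse into child_modules.
    return {r["values"]["name"] for r in resources if r["type"] == "aws_iam_role"}


def tagged_role_names(key: str = "owner", value: str = "terraform-stack-1234") -> set[str]:
    iam = boto3.client("iam")
    names = set()
    for page in iam.get_paginator("list_roles").paginate():
        for role in page["Roles"]:
            tags = iam.list_role_tags(RoleName=role["RoleName"])["Tags"]
            if {"Key": key, "Value": value} in tags:
                names.add(role["RoleName"])
    return names


for name in sorted(tagged_role_names() - managed_role_names()):
    print(f"unmanaged IAM role: {name}")
```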
Refactoring within statefiles
Terraform’s `moved` blocks are a low-level primitive, but Terraform should be providing a significantly higher-level primitive that automates the production of `moved` blocks. Specifically, the Terraform provider API should add an “are these two resources actually the same?” hook. If a Terraform plan causes a resource to be deleted at location A and created at location B, then the provider should say “yep, I’m reasonably confident they’re the same”, and Terraform should produce a `moved` block automatically. So if I delete an S3 bucket `config-xyz123` in one module and create the same-named bucket in another, my Terraform plan should suggest a `moved` block for me to accept or reject.
Is it DIYable?
Yes. It’s a big list of attributes that, when equal, cause two resources to be equal. For instance, with S3 buckets it would be `name`, plus `account_id` (which is implicit in the provider). Create two hashtables: one stores deletes and another stores creates. Any resource with the same key in both hashtables is a candidate move.
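A sketch of that matching, again working from a JSON plan and handling only `aws_s3_bucket` (keyed on its bucket name); a real tool would need a key function per resource type.

```python
# Sketch: propose `moved` blocks by matching planned deletes to planned creates on
# identifying attributes. Only aws_s3_bucket is keyed here (on its bucket name).
import json

KEY_FNS = {"aws_s3_bucket": lambda values: values.get("bucket")}

with open("plan.json") as f:
    plan = json.load(f)

deletes: dict = {}
creates: dict = {}
for rc in plan.get("resource_changes", []):
    key_fn = KEY_FNS.get(rc["type"])
    if key_fn is None:
        continue
    actions = rc["change"]["actions"]
    if actions == ["delete"]:
        deletes[(rc["type"], key_fn(rc["change"]["before"] or {}))] = rc["address"]
    elif actions == ["create"]:
        creates[(rc["type"], key_fn(rc["change"]["after"] or {}))] = rc["address"]

for key in deletes.keys() & creates.keys():
    print(f"moved {{\n  from = {deletes[key]}\n  to   = {creates[key]}\n}}")
```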
Semantic `lifecycle` blocks
Terraform’s `lifecycle` blocks are syntactic features, rather than semantic ones. As an example, this is impossible in Terraform:
resource "time_sleep" "test" {
for_each = toset(["a", "b"])
lifecycle {
prevent_destroy = each.key == "a"
}
}
The example tries to use `each.key` to prevent destruction of only a single instance, but that’s disallowed in Terraform. Sadly, the only information you’re allowed to use in these places is syntactic: you cannot refer to a specific instance of a parametrized (`for_each`/`count`) resource, and you certainly cannot refer to the instance of the module it is sitting in. This is ridiculous. No feature in Terraform should be syntactic-only.
In addition to `prevent_destroy`, you cannot use semantic information in `removed` blocks or in `ignore_changes`.
Is it DIYable?
Yes for `removed` blocks; no for `ignore_changes`. Build automation which runs before generating a Terraform plan: it reads a file and invokes `terraform state rm` against the addresses listed in that file.
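A minimal sketch of that pre-plan hook; the file name and one-address-per-line format are assumptions.

```python
# Sketch: remove listed addresses from state before planning. Note that
# `terraform state rm` only forgets a resource; it does not destroy it.
import subprocess

with open("state-removals.txt") as f:
    addresses = [line.strip() for line in f if line.strip() and not line.startswith("#")]

for address in addresses:
    subprocess.run(["terraform", "state", "rm", address], check=True)
```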
Invert the default for `prevent_destroy`
I have seen organizations where Terraform wasn’t allowed to destroy anything: the admins had actually revoked the cloud permissions that would allow the Terraform service account to perform a delete operation. As a result, their control plane became a weird mishmash of Terraform and human clickops, and its performance was highly asymmetrical: it was good at creating infrastructure quickly, but abysmal at deletions and replacements. The oncall rotation responsible for `terraform apply` became button-pushers-as-a-service, which is not a good place for any engineering team to be.
It would be much better to start Terraform with a default of “no resource may be destroyed”, so that there’s a meaningful difference in impedance between “here’s a plan to add some new resources” and “this plan destroys everything.” But the impedance should be just enough to prevent accidental destruction of resources, not so great that it actually prevents people from getting their jobs done.
Is it DIYable?
Yes. Build automation which analyzes a plan and validates that all planned deletes are listed in a file which the PR author has committed to the source control repository. At runtime, if the Terraform plan contains a resource destruction that wasn’t explicitly allowlisted, the automation stops and waits for human involvement[^7].
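A sketch of that check, assuming the allowlist lives in `allowed-destroys.txt` and the plan has been exported to `plan.json`.

```python
# Sketch: fail CI unless every planned destroy appears in a committed allowlist.
import json
import sys

with open("allowed-destroys.txt") as f:
    allowed = {line.strip() for line in f if line.strip() and not line.startswith("#")}

with open("plan.json") as f:
    plan = json.load(f)

violations = [
    rc["address"]
    for rc in plan.get("resource_changes", [])
    if "delete" in rc["change"]["actions"] and rc["address"] not in allowed
]

if violations:
    print("planned destroys that are not allowlisted; stopping for human review:")
    print("\n".join(violations))
    sys.exit(1)
```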
Flexibility in moving resources between statefiles
As mentioned above, I think Terraform’s benchmark for “time to deployment” of small changes needs to be comparable to “just clickops it”. In other words, it needs to be fast enough that people don’t feel the urge to use non-IaC methods to manage their infrastructure. This is a high bar.
One obvious issue is that Terraform gets slow with big statefiles, and splitting big statefiles into smaller ones is not a natively-supported operation. Pulumi released `pulumi state move` in July of 2024. IMO Terraform should have had that feature by its fifth birthday, in 2019; the GitHub issue for it was filed in 2023.
I think the even more ambitious among us would say “why can’t I operate on subgraphs of a Terraform statefile in parallel?”. The OpenTofu folks are talking about that; I think it’s the right direction, since it’s what any control plane that wants scalability would do.
Is it DIYable?
Yes. In practice people have been splicing resources between statefiles for ages, which perhaps explains why the issue for it was only filed in 2023. Doing `remove` + `import` operations works well, and is not too hard with proper tools. But in some cases it’s necessary to splice the actual statefile data into a new file, and Terraform should provide native support for this.
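For one resource at a time, the remove-plus-import dance can look something like the following sketch, assuming the two workspaces live in `./old` and `./new`, the address stays the same, and the resource’s import ID is its `id` attribute (true for many, but not all, AWS resources).

```python
# Sketch: move one resource between statefiles via import-into-new then remove-from-old.
# Paths, the address, and the use of the `id` attribute are all assumptions.
import json
import subprocess

ADDRESS = "aws_s3_bucket.config"  # hypothetical resource address

state = json.loads(subprocess.check_output(["terraform", "show", "-json"], cwd="old"))
resources = state["values"]["root_module"]["resources"]
import_id = next(r["values"]["id"] for r in resources if r["address"] == ADDRESS)

# Import into the new workspace first, then forget the resource in the old one.
subprocess.run(["terraform", "import", ADDRESS, import_id], cwd="new", check=True)
subprocess.run(["terraform", "state", "rm", ADDRESS], cwd="old", check=True)
```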
Provider customization
To the average Terraform author there is a stark gap between “the things that I can touch” (Terraform code) vs “the things that are immutable facts about the world” (Terraform providers). If something can be done purely in Terraform code, then it can probably ship today. If something needs a provider patch, that’s a matter of “I don’t want to fork the provider, so let’s open a Github issue and see whether the maintainer would be willing to take my patch. If things go well it’ll make it into a release a few months from now.”
Puppet and Pulumi both sidestep this problem by having providers that ship with the underlying code. This makes it a natural and easy operation to write new provider logic, and you can ship a provider change in the same commit as the functionality that uses that change.
Is it DIYable?
Sorta-kinda, but ultimately not really. You can use `terraform_data`/`null_resource` with the `local-exec` provisioner to build bootleg providers: I have found this to be error-prone. You can fork a provider and compile it in-tree (e.g. in a git submodule) with the rest of your Terraform code. I haven’t tried this: I imagine it would work (so long as you cached the compiled binary between runs), but it would subject you to the pain of maintaining an entire forked provider. It would probably be better to be able to split off just the one or two resource types you want to modify into a separate provider, but that’ll involve its own miseries.
Lying about resource state
In some cases (especially incidents) we want to be able to make changes to Terraform-owned resources without worrying that Terraform is going to come along and immediately revert the changes we’ve made. So far as I can tell, the best solution is to make Terraform (falsely) believe that the resource hasn’t changed. We would “pin” selected resources by inhibiting their `terraform refresh`. The net effect is that Terraform is still functional for the rest of the resources in the statefile, even dependencies of the pinned resources.
Is it DIYable?
Yes, I believe so: you can simply doctor the Terraform statefile after a `terraform refresh`, and then do `terraform plan -refresh=false`. But the user-facing ergonomics would be better within Terraform itself, since Terraform could clearly print “resource state pinned” right in the middle of the plan output.
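A sketch of that doctoring, assuming local state (so `terraform refresh` leaves a `terraform.tfstate.backup` behind) and pinning by `(type, name)`; a real tool would also handle modules and instance keys.

```python
# Sketch: "pin" resources by copying their pre-refresh entries back into the refreshed
# statefile, then run `terraform plan -refresh=false`. Local-state layout is assumed.
import json

PINNED = {("aws_s3_bucket", "config")}  # hypothetical (type, name) pairs to pin


def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


old = load("terraform.tfstate.backup")
new = load("terraform.tfstate")
old_by_key = {(r["type"], r["name"]): r for r in old["resources"]}

for i, resource in enumerate(new["resources"]):
    key = (resource["type"], resource["name"])
    if key in PINNED and key in old_by_key:
        new["resources"][i] = old_by_key[key]

with open("terraform.tfstate", "w") as f:
    json.dump(new, f, indent=2)
```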
Native separation of plan and apply
Terraform and the Terraform ecosystem should assume that the identity performing a `plan` operation is strictly separate from the identity performing an `apply` operation. Like all other IaC tools, Terraform is a powerful enabler of “privilege escalation through code review”: the Terraform-based control plane enables someone to achieve something by merging a PR that they would not, and should not, be able to achieve by “directly” interacting with the infrastructure.
But this is subverted pretty much entirely if the `terraform plan` process that runs in a CI environment has access to the same high-privilege credential used for `terraform apply`. The Terraform ecosystem should expect that the typical deployment mode uses one identity for applies and a completely separate identity to generate plans (both for PRs and for plans that will eventually be applied).
Is it DIYable?
Yep! There are a few aspects of the provider ecosystem that assume the `plan` identity can mutate the cloud environment, but these can be worked around without too much difficulty.
In Conclusion
It’s a crying shame that our industry is stuck with Terraform: it seems to sit in a local minimum of being just good enough for many use-cases, but not actually a joy to work with. We certainly deserve a better control plane, one that is quicker, easier to understand, cheaper to maintain, and more reliable. I’d like to say “throw it all away and start over”, but that would be forsaking Terraform’s many advantages, especially its deep bench of providers. Maybe Pulumi will save us, but like I said, that’s a topic for another post.
[^1]: Specifically, they’ve added ephemeral resources (v1.10, November 2024) and write-only attributes (v1.11, February 2025).
[^2]: Yes, this is true even at big companies, although it might cease to be true at the very largest engineering organizations (e.g. Alphabet, Meta). Even at medium-large companies, devtools teams often can’t keep up with the needs of the coremost deployment flows.
[^3]: There are some exceptions to determinism in the core language, like `timestamp()`, which is clearly non-deterministic. And of course a Terraform provider can inject whatever kinds of non-determinism it wants. Despite all of this, the Terraform environment strongly encourages determinism and makes non-determinism something that creeps in quite rarely.
[^4]: In this example, the task of autoscaling the database’s compute layer is still handled by a tight control loop within the Kubernetes cluster. But all of the more static infrastructure definitions that make up this database abstraction are managed by, and inspectable in, Terraform.
[^5]: I say “most” because I think there are certain paradigms available within Kubernetes that achieve this ideal, or come very close to it. In researching this post, cdk8s and pulumi-kubernetes-operator both looked very interesting.
[^6]: The statefile has another crucial function: it makes garbage collection mandatory. With non-statefile tools (e.g. Puppet) it is possible to create resources in commit 1 and then stop managing them in commit 2: the result is that the resources from commit 1 remain deployed, and represent a form of silent drift between the presently-deployed fleet and the hosts that you’d get if you ran Puppet on hosts launched from a blank slate.
[^7]: Some people might be thinking “shouldn’t all Terraform applies stop and wait for human involvement?”. My answer is an emphatic “no”. Every time your control plane stops and prompts a human for intervention, that is an alert, and in 99% of cases that alert will only convey the message “your Terraform has planned exactly what you expected it to plan.” All alerts should be actionable, both to safeguard human productivity and to prevent alert fatigue. The right move is to automatically identify the 1% of exceptional cases, and alert on those.