Why bother with immutable infrastructure?
all of these arguments fall under the heading of “the immutable server principle”. My experience with this issue comes from my bad-old days with Puppet at Yelp, which I liked a lot at the time and continue to like (in a rosy-eyed way) today. I was also the person at Netflix assigned to fix how our databases handled software upgrades and config changes.
security
it’s better to have a single unit (an immutable base image) that can be identified by a SHA as 100% identical to all of its siblings, rather than numerous servers that each potentially have a separate configuration. There’s no chance of a single instance being different because it missed a Puppet/Salt/Ansible/SSM apply, or because the Puppet apply did something different due to the instance’s age, etc. In other words, you get to avoid building many forms of monitoring that would otherwise be needed. Immutability also means one fewer command-and-control plane that you need to reason about, and if you pair immutability with reproducible builds, you have a powerful tool for ensuring your servers behave exactly as your code intends.
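To make that concrete: with immutable images, fleet-wide drift detection reduces to comparing one identifier per machine. Here’s a minimal sketch, assuming AWS and boto3 and a hypothetical `role=web` tag; any provider that exposes the image an instance booted from works the same way.

```python
# A minimal sketch, assuming boto3/EC2 and instances tagged role=web
# (the tag name is hypothetical). It flags any instance whose image
# differs from the rest of the fleet.
import boto3
from collections import Counter

ec2 = boto3.client("ec2")
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "tag:role", "Values": ["web"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

images = Counter()
instances = {}
for page in pages:
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            images[inst["ImageId"]] += 1
            instances[inst["InstanceId"]] = inst["ImageId"]

if len(images) <= 1:
    print("fleet is uniform:", dict(images))
else:
    # More than one image in service: either a rollout is in flight
    # or something has drifted.
    expected, _ = images.most_common(1)[0]
    for instance_id, image_id in instances.items():
        if image_id != expected:
            print(f"{instance_id} is running {image_id}, expected {expected}")
```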
complexity
with immutable servers, you manage exactly one transition (OutOfService->InService). With mutable servers, you manage numerous transitions, and there is no guarantee (in fact, it is very rare) that your servers all stay identical. By way of illustration, some upgrades involve `apt`, others `pip`, others `npm`, and your configuration tool needs to understand how to interact with all of the above. To give an example, one time my team tried to perform a libc upgrade which led to version skew: some processes had loaded the old version and later attempted to load a file belonging to it, only to find that the file was gone (because libc had been upgraded underneath them). Other times you’ll upgrade something (like MySQL) which you most certainly cannot restart all at once, and so you have to build a pipeline to perform (and monitor) an orchestrated rollout across all your servers. A lot of my work at Yelp focused on ensuring that old servers and newly created servers were actually running the same configuration, and especially that no old server had “zombie” config.
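For a sense of what that pipeline amounts to, here is a hedged skeleton of the health-gated rolling upgrade you end up owning with mutable servers; every helper below is a hypothetical stub for your load balancer, package manager, and monitoring. With immutable servers the in-place upgrade step disappears: you boot fresh instances from the new image and only manage the OutOfService->InService transition.

```python
# A hedged sketch of a health-gated rolling upgrade across a mutable
# fleet. Every helper here is a hypothetical stub: in real life they
# would call your load balancer, package manager, and monitoring system.
import time

def take_out_of_service(host: str) -> None:
    print(f"draining {host}")          # e.g. deregister from the load balancer

def apply_upgrade(host: str) -> None:
    print(f"upgrading {host}")         # e.g. apt/pip/npm, mysql upgrade, ...

def put_in_service(host: str) -> None:
    print(f"re-registering {host}")

def is_healthy(host: str) -> bool:
    return True                        # stub: consult real health checks here

def rolling_upgrade(hosts: list[str], settle_seconds: float = 0.1) -> None:
    for host in hosts:
        take_out_of_service(host)
        apply_upgrade(host)
        put_in_service(host)
        time.sleep(settle_seconds)     # let traffic and metrics settle
        # Gate on health before the next host so a bad upgrade stops
        # after one server instead of taking out the whole fleet.
        if not is_healthy(host):
            raise RuntimeError(f"{host} unhealthy after upgrade; halting rollout")

if __name__ == "__main__":
    rolling_upgrade(["db1", "db2", "db3"])
```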
reliability
when you configure your servers before they go into service, you get the opportunity to test their fully-assembled functionality while they’re still out of service. Because it’s just a single SHA that changes, your monitoring can cleanly associate any shift in semantics or performance with the rollout of a given SHA. Your servers never change, so you can sleep (somewhat) soundly knowing that there are no circumstances in which a server suddenly stops functioning the way it was originally configured and tested. Of course this doesn’t solve all problems (e.g. time bombs, resource exhaustion), but in general you get a much stronger guarantee that you see all of the effects of your change at a single moment (the deploy), rather than days or weeks into the future. Deploys of immutable servers also don’t (in the general case) remove capacity from your clusters, so there’s no period of reduced performance.
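As a sketch of what testing out of service can look like (none of this is a specific tool’s API): run smoke checks against the freshly booted instance before it takes any traffic, and record a deploy marker keyed by the image SHA so dashboards can line up any change in behavior with that one rollout. The endpoints and helper names below are hypothetical.

```python
# A minimal sketch, not any particular tool's API: smoke-test a freshly
# booted instance before it takes traffic, then emit a deploy marker
# keyed by the image SHA so monitoring can correlate changes with it.
import json
import urllib.request

SMOKE_PATHS = ["/healthz", "/version"]   # hypothetical endpoints

def smoke_test(host: str, port: int = 8080, timeout: float = 5.0) -> bool:
    for path in SMOKE_PATHS:
        url = f"http://{host}:{port}{path}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            return False
    return True

def record_deploy_marker(image_sha: str, path: str = "deploys.log") -> None:
    # Stand-in for "tell your monitoring system a deploy happened";
    # the point is that one SHA identifies the entire change.
    with open(path, "a") as f:
        f.write(json.dumps({"event": "deploy", "image_sha": image_sha}) + "\n")

def promote(host: str, image_sha: str) -> None:
    if not smoke_test(host):
        raise RuntimeError(f"{host} failed smoke tests; leaving it out of service")
    record_deploy_marker(image_sha)
    # ...register with the load balancer here: the single
    # OutOfService -> InService transition...
```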
completeness/labor
some upgrades (e.g. kernel, many kernel modules) can’t be performed at all without a reboot (no, ksplice and kpatch don’t solve this problem). If your upgrade strategy revolves around in-place patching, your skills for dealing with the harder upgrades will atrophy. With mutable servers, you have to maintain two separate systems:
- manage (assemble, version, monitor, upgrade) the server that gets deployed originally
- manage the server once it’s running
It’s better to maintain one system well with (less than) half the resources than to maintain two systems poorly.
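As a small illustration of the “harder upgrades” bucket: from inside a running mutable server you can detect this class of drift, but you can’t fix it without a reboot (i.e. without replacing the server). A minimal sketch for Debian/Ubuntu-style hosts; paths and conventions differ on other distributions.

```python
# A minimal sketch for Debian/Ubuntu-style hosts; paths and conventions
# differ elsewhere. It only detects the drift that a reboot (i.e.
# replacing the server) would resolve -- it cannot fix it in place.
import os

def kernel_drift() -> list[str]:
    problems = []
    running = os.uname().release
    # After an in-place kernel upgrade removes the old module tree, the
    # running kernel can no longer load modules -- the same version-skew
    # failure mode as the libc story above.
    if not os.path.isdir(f"/lib/modules/{running}"):
        problems.append(f"modules for running kernel {running} are missing")
    # Debian/Ubuntu packages drop this flag file when an upgrade (kernel,
    # libc, ...) only takes effect after a reboot.
    if os.path.exists("/var/run/reboot-required"):
        problems.append("a package upgrade is pending a reboot")
    return problems

if __name__ == "__main__":
    drift = kernel_drift()
    print("\n".join(drift) if drift else "running kernel matches what is installed")
```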