A good database is a stateless database
For a long time while working with databases, I conceived of them as fundamentally different from other services because they were stateful. This is not true.
A typical database is a proxy that transforms inbound API calls into commands on some underlying storage engine. A Postgres DB, for instance, is a proxy that converts Postgres wire protocol on the frontend side into filesystem accesses (syscalls and page faults) on the backend side. The Linux VFS layer then relays these calls to filesystems like ext4 and xfs, which convert file I/Os on the frontend into 4 kiB block I/Os on the backend. Next, an SSD’s flash translation layer (FTL) translates 4 kiB block writes into log-structured 256 kiB erase block writes, providing the illusion of a random-access read-writable disk.
Going up one level further, the people who run a “user profile” service might think of their service as stateless, since it is backed by Postgres. But the people who call this user profile service think of the user profile service as stateful, since it serves as a durable repository for user profiles. OLTP DBAs should adopt the same mode of thinking: my database is stateless, and the layer beneath me stores my actual state. We can summarize the path of proxies as something like: user profile service → Postgres → Linux VFS layer → ext4 filesystem → SSD’s FTL → NAND flash chips.
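To make that chain of proxies concrete, here’s a toy sketch of the layering. Every class, method, and constant below is invented for illustration; no real system is anywhere near this simple:

```python
# Toy sketch of the proxy chain: each layer only translates writes into the
# lower-level operation beneath it. Every name here is invented for
# illustration; no real system is anywhere near this simple.

class NANDFlash:
    def program_erase_block(self, block_no, data):
        print(f"flash: programmed erase block {block_no} ({len(data)} bytes)")

class FTL:
    """Presents a random-access block device, but writes log-structured erase blocks."""
    ERASE_BLOCK = 256 * 1024

    def __init__(self, flash):
        self.flash, self.pending, self.next_block = flash, [], 0

    def write_block(self, lba, block_4k):
        # Append-only: a real FTL also remembers that `lba` now lives here.
        self.pending.append(block_4k)
        if len(self.pending) * 4096 >= self.ERASE_BLOCK:
            self.flash.program_erase_block(self.next_block, b"".join(self.pending))
            self.next_block, self.pending = self.next_block + 1, []

class Filesystem:
    """Turns file writes into 4 kiB block I/Os (alignment hand-waved away)."""
    def __init__(self, device):
        self.device = device

    def pwrite(self, offset, data):
        for i in range(0, len(data), 4096):
            self.device.write_block(offset // 4096 + i // 4096,
                                    data[i:i + 4096].ljust(4096, b"\0"))

class Database:
    """Turns wire-protocol statements into file accesses."""
    def __init__(self, fs):
        self.fs, self.offset = fs, 0

    def execute(self, statement):
        row = statement.encode()          # stand-in for real row encoding
        self.fs.pwrite(self.offset, row)
        self.offset += len(row)

db = Database(Filesystem(FTL(NANDFlash())))
for i in range(100):                      # enough tiny inserts to fill one erase block
    db.execute(f"INSERT INTO profiles VALUES ({i}, 'ada')")
```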
But it turns out that this picture is under-inclusive. Locally attached block devices aren’t durable, and individual servers aren’t reliable. Databases need to be both reliable and durable. Blobstores like S3 bridge this gap. In any data environment that aims to achieve high availability, the source of truth is the blobstore, not the disks attached to any one machine. This means that the modern cloud[1] database has two jobs:
- write buffer: to funnel data into the blobstore as efficiently and expeditiously as possible.
- read cache: to accelerate reads that would otherwise have to go all the way to the blobstore (both jobs are sketched below).
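Here’s a minimal sketch of a node doing both jobs. The blobstore client interface, the segment naming, and the fallback scan are all assumptions I’m making for illustration; a real system adds a WAL, a manifest, cache eviction, and concurrency control:

```python
class Node:
    """Toy sketch: a database node as write buffer + read cache over a blobstore."""

    def __init__(self, blobstore):
        self.blobstore = blobstore    # assumed interface: put(key, bytes) / get(key) / list(prefix)
        self.write_buffer = {}        # job 1: absorb writes, flush them in bulk
        self.read_cache = {}          # job 2: keep hot items off the blobstore path
        self.flushed = 0

    def write(self, key, value):
        self.write_buffer[key] = value
        self.read_cache[key] = value

    def flush(self):
        # One large, sorted, sequential object instead of many tiny writes.
        segment = "\n".join(f"{k}\t{v.hex()}" for k, v in sorted(self.write_buffer.items()))
        self.blobstore.put(f"segments/{self.flushed:08d}", segment.encode())
        self.flushed += 1
        self.write_buffer.clear()

    def read(self, key):
        if key in self.write_buffer:
            return self.write_buffer[key]
        if key in self.read_cache:
            return self.read_cache[key]
        # Slow path: the blobstore is the source of truth. A real system consults
        # an index or manifest; this toy just scans segments newest-first.
        for name in sorted(self.blobstore.list("segments/"), reverse=True):
            for line in self.blobstore.get(name).decode().splitlines():
                k, v = line.split("\t")
                if k == key:
                    self.read_cache[key] = bytes.fromhex(v)
                    return self.read_cache[key]
        return None
```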
Writing
On the write-side, each node’s primary purpose is to perform local processing of writes: buffering, sorting, and coalescing. Buffering is the most basic operation: absorb writes over a period of time, so that you can make one large sequential write to the blobstore rather than a bunch of tiny ones. While you’re buffering, you might as well sort the buffer by key: this ensures that future reads against the data can proceed in O(log n) time rather than O(n) time. Finally, some workloads that overwrite the same item over and over can coalesce all of those individual writes into a single write, reducing the total volume written. These functions are present both in databases that use memtables (e.g. Cassandra, RocksDB) and in databases that perform writes onto mutable B-trees (e.g. Postgres, InnoDB).
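To show how much those three steps buy you, here’s a toy sketch that absorbs a write-heavy stream with repeated keys, coalesces and sorts it in a memtable-style buffer, and compares the item-by-item write volume against the single flushed segment. The workload numbers are made up:

```python
import random

# Toy illustration of buffering + coalescing + sorting (made-up workload).
random.seed(0)
ITEM_SIZE = 256                                    # bytes per record
writes = [(f"user:{random.randrange(1_000):04d}", bytes(ITEM_SIZE))
          for _ in range(100_000)]                 # 100k writes over only 1k distinct keys

# Without buffering: every write is its own tiny blobstore PUT.
naive_bytes = len(writes) * ITEM_SIZE

# With a memtable: last-writer-wins coalescing, then one sorted flush.
memtable = {}
for key, value in writes:
    memtable[key] = value                          # overwrite => coalesce
segment = b"".join(value for _, value in sorted(memtable.items()))

print(f"writes absorbed:   {len(writes):>10,}")
print(f"naive bytes out:   {naive_bytes:>10,}")
print(f"flushed bytes out: {len(segment):>10,}")   # ~100x less, in one sequential, sorted object
```

Because the flushed segment is sorted, a reader can binary-search it instead of scanning it, which is where the O(log n) claim comes from.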
Another purpose of a database is to guarantee durability of every acknowledged write. Single-node databases do this spectacularly poorly, since their only available tool is a write-ahead log flushed to local disk. Any distributed database worth its salt will replicate its log to multiple nodes, rather than relying on local disk for durability.
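A sketch of what “replicate the log” means in practice: don’t acknowledge a write until a quorum of log replicas has persisted it. The Replica interface and everything else here is invented for illustration, and real systems also deal with ordering, leader election, retries, and re-replication:

```python
# Sketch: acknowledge a write only once a majority of log replicas has persisted it.

class ReplicationError(Exception):
    pass

def replicate_wal_record(record: bytes, replicas) -> None:
    """Block until a quorum of replicas reports the record durable, else fail."""
    quorum = len(replicas) // 2 + 1
    acks = 0
    for replica in replicas:                  # a real implementation fans these out in parallel
        try:
            replica.append_and_fsync(record)  # assumed: durable once this returns
            acks += 1
        except Exception:
            continue                          # a slow or dead replica doesn't block the quorum
    if acks < quorum:
        raise ReplicationError(f"only {acks}/{len(replicas)} replicas persisted the record")
    # Only now is it safe to acknowledge the write to the client.
```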
Finally, a good OLTP database facilitates users who want to read its data directly from a blobstore[2]. In the simplest instantiation, this just means that someone with direct access to the blobstore can spin up a local copy of the DB and query it for their exclusive purposes. More advanced methods allow OLAP frameworks (e.g. Spark) to read row-oriented data directly from the blobstore, facilitating direct queries or conversion into a columnar format.
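For instance, if a database flushed its segments as newline-delimited records under a well-known prefix (a layout I’m assuming purely for illustration), an outside reader could scan them straight out of S3 with nothing but a blobstore client:

```python
import boto3

# Sketch: read a database's flushed segments directly from the blobstore.
# The bucket name, prefix, and newline-delimited segment format are assumptions
# made for illustration; a real engine would publish its actual layout.
s3 = boto3.client("s3")
BUCKET, PREFIX = "my-database-bucket", "segments/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for line in body.decode().splitlines():
            key, _, value = line.partition("\t")
            ...  # feed rows into Spark, convert to a columnar format, etc.
```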
Reading
On the read-side, an OLTP database serves as a cache of the blobstore, accelerating reads to meet throughput and latency constraints that the blobstore has no hope of achieving.
Typical blobstores have a time-to-first-byte of 15 milliseconds[3], which is an eternity in an online query processing context that measures latency in microseconds. If you’re fetching items that are on the order of 15 MB or larger, then blocking on an (unloaded) blobstore might make sense. But if you’re fetching 256 byte user records, you probably want an OLTP database to accelerate that access.
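The arithmetic behind that intuition, using the 15 ms TTFB from above and a streaming throughput figure that is purely my assumption:

```python
# Back-of-the-envelope: what fraction of a blobstore fetch is just waiting
# for the first byte? TTFB from the post; throughput is an assumed figure.
TTFB_S = 0.015            # 15 ms time-to-first-byte
THROUGHPUT_BPS = 100e6    # assume ~100 MB/s of streaming throughput per request

for label, size_bytes in [("256 B user record", 256), ("15 MB object", 15e6)]:
    transfer_s = size_bytes / THROUGHPUT_BPS
    total_s = TTFB_S + transfer_s
    print(f"{label:>18}: total {total_s * 1e3:6.1f} ms, "
          f"{100 * TTFB_S / total_s:5.1f}% of it spent waiting for the first byte")

# 256 B record: ~15.0 ms total, ~100% TTFB -> the latency is pure overhead.
# 15 MB object: ~165 ms total, ~9% TTFB   -> the transfer itself dominates.
```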
Most databases also provide some degree of adaptive caching: the database monitors the temperature of individual data items and replicates hot items to storage with lower latency. You’ll often hear discussion of cache eviction (LRU, multilevel LRU, CLOCK-Pro) or admission (TinyLFU) in this space. As one example, it’s pretty common for databases to implement a three-layer cache: local disk for the entire dataset, the OS page cache for warm data pages in compressed form, and process memory for hot data in uncompressed form.
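For reference, the eviction side of this can be as small as the textbook LRU below; this is not what any particular database ships, just the minimal version of the idea:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU eviction: recently used items stay, the coldest item goes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                       # miss: caller falls back to disk/blobstore
        self.items.move_to_end(key)           # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)    # evict the least recently used item

cache = LRUCache(capacity=2)
cache.put("a", 1); cache.put("b", 2); cache.get("a"); cache.put("c", 3)
assert cache.get("b") is None                 # "b" was coldest, so it was evicted
```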
What about everything else?
Databases do plenty of things beyond what I have mentioned so far. They act as lock services to coordinate who gets to write at any given moment. They do compare-and-swap, and sometimes they even group writes into all-or-nothing transactions. They offer query planners, which abstract details of the underlying storage layout from the callers. They provide indexing of all kinds (e.g. b-tree, hash, bloom filter). But it turns out that all of those features are merely different ways of deciding: (1) which queries succeed in writing to the database, (2) who goes first, and (3) the most efficient means to read it back.
All of these tasks are useful, but they have no fundamental relationship with the data stored in the database. A write performed using compare-and-swap produces the same output (when successful) as any other write. Transactions that execute under concurrency produce the same data as ones executed serially (we hope). If someone wants to perform full scans instead of using the query planner’s services, they’ll get the same results (slowly).
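A sketch of that point for compare-and-swap: the coordination decides whether a write lands, but a successful CAS stores exactly what a plain write would have. The in-memory store is, obviously, a stand-in:

```python
# Sketch: compare-and-swap only decides *whether* a write lands;
# the stored value is the same as any other write's. Toy in-memory store.
store = {}

def plain_write(key, value):
    store[key] = value
    return True

def compare_and_swap(key, expected, value):
    if store.get(key) != expected:
        return False                 # lost the race: someone else wrote first
    store[key] = value               # won the race: identical effect to plain_write
    return True

plain_write("balance", 100)
assert compare_and_swap("balance", expected=100, value=90)      # succeeds
assert not compare_and_swap("balance", expected=100, value=80)  # stale expectation, rejected
assert store["balance"] == 90        # the successful CAS stored perfectly ordinary data
```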
Quantifying statefulness
I have watched JVM-based Cassandra nodes bootstrap themselves, warm their caches, and be ready to serve queries faster than their corresponding “stateless” JVM-based client applications. To me, this implies that the state—as measured in bytes—isn’t really the problem.
Instead, any service should be able to quantify its “time to first response” (TTFR), defined as: after receiving a new raw[4] computing resource, how long will it take to put that resource into useful service[5]? When viewed through this lens, every computing service in existence is at least somewhat stateful, and we just need to measure where it sits on the continuum. Even “completely stateless” application services need to download their code and configuration.
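One way I think about TTFR is as a sum of stages, where the difference between the “stateless” app and the database node is just which stage dominates. All of the numbers below are invented for illustration:

```python
# TTFR as a sum of stages. All numbers are invented for illustration;
# the point is that every service has a nonzero TTFR, not the specific values.
def ttfr_seconds(stages):
    return sum(stages.values())

stateless_app = {"provision VM/pod": 60, "pull image + config": 45, "warm JIT/conn pools": 30}
database_node = {"provision VM/pod": 60, "pull image + config": 45,
                 "restore data from blobstore": 900, "warm caches": 120}

print(f"'stateless' app TTFR: {ttfr_seconds(stateless_app):>5.0f} s")
print(f"database node TTFR:   {ttfr_seconds(database_node):>5.0f} s")
```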
A service’s TTFR will be a major contributor to how much cushion is required when planning its resource needs. If a service can reliably scale in 5 minutes, then we can give it enough resources to ride out the next 6 minutes of modelable exigencies, and autoscale to meet demand using that cushion as a…cushion. If a service needs 3 months to scale, we need to model a much larger set of exigencies. This usually means some form of human-in-the-loop “autoscaling”, with quarterly planning meetings, graphs to review, purchase contracts, etc.
For me as a DBA, I interpret this to mean that I should be tracking and minimizing the TTFR for any given database under my care. The most basic lesson is: every database should be able to restore from blobstore at line rate[6]. But it also means making suitable choices when selecting hardware and node topology, like ensuring that components (mostly NICs) are sized to meet a desired TTFR.
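The NIC-sizing point reduces to simple arithmetic: restore time at line rate is dataset size divided by the slowest link in the restore path. The dataset size and component speeds below are assumptions chosen only to show the shape of the calculation:

```python
# Restore-at-line-rate arithmetic. All sizes and speeds are assumed figures;
# plug in your own dataset size and hardware.
DATASET_BYTES = 2e12                    # 2 TB per node
NIC_BPS = 25e9 / 8                      # 25 Gbit/s NIC -> bytes/s
DISK_WRITE_BPS = 4e9                    # 4 GB/s NVMe write throughput
BLOBSTORE_READ_BPS = 5e9                # aggregate parallel GET throughput

bottleneck_bps = min(NIC_BPS, DISK_WRITE_BPS, BLOBSTORE_READ_BPS)
restore_seconds = DATASET_BYTES / bottleneck_bps
print(f"bottleneck: {bottleneck_bps / 1e9:.2f} GB/s, "
      f"restore: {restore_seconds / 60:.1f} minutes")
# ~10.7 minutes here, and the 25 Gbit NIC is the limiting component.
```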
1. IMO all databases, cloud or not, should be aiming to emulate the cloud paradigm, but that’ll have to be a topic for another post.
2. It’s still reasonable to grumble about this, since this means that the OLTP database needs to provide a stable data format for others to consume: a new public interface that isn’t the database’s native API.
3. I haven’t benchmarked S3 Express One Zone, so it certainly could change the math.
4. “Raw” is going to have many definitions. Certainly many PaaSes will have a warm pool of VMs, ready to be assigned work. Should we be measuring from the moment a Kubernetes pod gets assigned to a pre-warmed VM? Or should we assume that the warm pool is depleted? This question will have different answers based on the exercise. If we’re planning for normal diurnal variation, then we’ll include the warm pool. If we’re talking about responding to an outage of a whole failure domain, then probably not.
5. Some people might say that this is the same as the standard “recovery time objective” (RTO) metric. They’re similar, but different enough. RTO is the time to recover from an internal fault. TTFR is the time to adapt to a new external need.
6. Grumbling LOUDLY at AWS RDS here, where restoring from backup takes an indeterminate quantity of time.