Every database needs a transaction killer
The previous post discussed ways that clients could optimize their interactions with RDBMSes to guarantee correctness and performance of their transactional logic. This post will take the opposite perspective: how can database servers defend themselves against the negative effects of transactions? While the technology from that post might get deployed once or twice if we’re lucky, the technology from this one is more “everyone should deploy it everywhere, if they haven’t already”.
Story time
Within a week of starting my internship at an online review aggregator company, one of my tickets was to make a map-reduce based batch job whose results would get loaded into an OLTP MySQL database. I went and wrote the functionality, added some tests, got it code reviewed, and added it to the job scheduler. All of this was very exciting for a wee intern. Then it completed for the first time and brought down prod. And that’s how I met the DBA team…
Long story short: my batch job had written all of its data to the database in a single massive transaction, comprising 2GB or so. This was perfectly fine for the primary MySQL node, which happily accepted the data from my batch job over the course of an hour or so. It was not fine for the replicas. The role of a primary node is to produce a list of transactions in serial order (called a binlog); the role of a replica is to ingest those transactions…in serial order. The presence of a 2GB transaction in the binlog implied that each of the replicas must chew through that 2GB transaction before applying anything else. The primary didn’t have this problem because it had the opportunity to apply each INSERT as it arrived, and was taking other read/write traffic concurrently.
Did I mention that this MySQL database cluster was also handling a significant amount of interactive customer traffic? Yeah… Anyways, we had a strict SLA on how much a replica could lag, which required us to start disabling site features when the lag exceeded that SLA. If my memory serves, that outage lasted about 30 minutes, but I can’t be certain because I was just a baby intern who wasn’t familiar with database replication, let alone bringing down prod.
Fast forward 18 months or so, and I’m now oncall for the same company’s DBA team (I had been forgiven, thank goodness). The Python monolith had deployed at around 9:30am that morning, and ever since we had been seeing some intermittent database errors, like lock wait timeouts on our most popular tables. Multiple deploys had gone out since, but the errors kept coming. We couldn’t figure out why: none of the logic in that deploy looked suspicious. By about 4pm we were looking at rolling back the entire contents of the morning deploy, when a colleague asked me to check open database transactions. Sure enough, there was an open transaction that had been sitting around since that morning, holding locks and doing nothing of value. We killed it and the errors subsided. Thank goodness: I had zero interest in being the DBA who forced a bunch of devs to roll back their code over a red herring.
Let us count the ways
As far as I can tell, long transactions are bad for three reasons.
Old data
Databases need to keep copies of old data, matching the age of the oldest transaction currently running. When that transaction finishes, the storage engine can purge the leftover copies up to the next oldest transaction. Rinse and repeat, and you’ll have the daily life of an InnoDB purge thread or the Postgres autovacuum daemon. This is a read-write coupling: holding open a read-only transaction can inhibit purging/vacuuming, and ending a read-only transaction can create a sudden flurry of writes.
In InnoDB, this is the undo log: a linked list of row versions, starting from the current version’s ROLL_PTR and moving backwards in time. When a transaction wants to read an older version, it starts at the current version and moves backward in time in an O(n) scan. The only good news is that the InnoDB undo tablespace—since it is stored in separate files—is likely to have different cache characteristics from the rest of the dataset: it’s more likely to be kept hot in cache, especially if it is small relative to the rest of your data (which it should be).
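To see which transactions are pinning that history, MySQL exposes both the open-transaction list and the purge backlog. A quick sketch of the queries I’d reach for (assumes MySQL 5.7+; trx_rseg_history_len is the “history list length” counter):

```sql
-- Oldest open transactions: these anchor the undo history.
SELECT trx_id, trx_mysql_thread_id, trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_sec
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;

-- Purge backlog ("history list length"): undo records not yet purged.
SELECT NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';
```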
In Postgres, old copies of data (“dead tuples”) live in the same heap as the live tuples. The VACUUM process avoids freeing any tuple that might still be needed by an open transaction. Since Postgres deployments often use physical replication, the autovacuum daemon needs to accommodate not only the transactions running on the primary server, but also the replica servers. This means that replicas have to give feedback to the primary about what transactions they have open. This is configurable through hot_standby_feedback. On means: “I want the vacuum behavior of my primary to be determined by its replicas”. Off means: “I want my replicas to suddenly stop replicating (and start lagging) if they’re still serving tuples that my primary has purged”.
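The Postgres view of the same problem: each backend’s backend_xmin in pg_stat_activity is the oldest transaction it still needs to see, and the oldest of those is what VACUUM has to respect (with hot_standby_feedback on, replica horizons are visible in pg_stat_replication.backend_xmin as well). A minimal sketch:

```sql
-- Which backends are holding back the vacuum horizon, oldest first.
SELECT pid, usename, state,
       age(backend_xmin) AS xmin_age_in_xids,
       now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;
```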
Obviously all of these approaches require additional storage space that could be used for other purposes. Depending on how quickly your DB accepts writes, this might matter a lot, or just a little.
Failover
RDBMSes treat transaction state as transient and volatile. Neither of the two I’m familiar with has a facility to resume an uncommitted transaction against a different server (short of just checkpointing the entire VM). As such, the latency with which a database can fail over is intimately linked to the amount of time it takes for the slowest clients to finish up their open transactions.
Fast failover is good! Many web services can sustain a 1 second outage just by buffering inbound requests, thereby converting an outage into latency¹. By the time we’re talking about a 5 second outage, things are pretty iffy. Beyond that, definitely not.
But even if we aren’t in the “just buffer requests until the outage resolves” territory, shorter more-retryable transactions mean that there is less sunk cost accumulated in your client-server system when an outage/failover does occur.
Locks
At one point I thought I understood how databases use locks for MVCC. I was wrong, obviously. For reference, just yesterday I learned something entirely new and mind-blowing about Postgres locks.
The RDBMS manages locks to ensure that transactions see a serial view of causality², even though they are executing in parallel. Locks are a massive blessing, because they take the burden of reasoning about the concurrency semantics of our apps (mostly) out of the hands of the typical app developer or DBA, and put it in the hands of a much smaller cadre of database developers who presumably understand these concepts much better than we do. But they’re also a curse. Even when properly used, locks either: a) cause queries to block on one another, or b) cause operations to fail and need to be retried. Put another way: locks are micro-outages of portions of your dataset. We should aim to minimize these outages, which means minimizing the number of locks held at any one time.
Little’s law informs us that there are two ways to do this: 1) decrease the number of locks taken by queries or 2) decrease the time they are held. We should mostly focus on approach #2, because decreasing the number of locks taken directly interferes with the usefulness of the locks.
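When those micro-outages do happen, both databases will show you who is blocked and who is doing the blocking. A sketch of the kind of query I mean (pg_blocking_pids needs Postgres 9.6+; the MySQL version assumes the sys schema shipped with 5.7+):

```sql
-- Postgres: sessions that are currently blocked, and who blocks them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- MySQL: the same picture via the sys schema.
SELECT waiting_pid, blocking_pid, wait_age, waiting_query
FROM sys.innodb_lock_waits;
```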
How do I…
…load a lot of data
Bit by bit, in numerous small transactions. And then all at once: reveal the data to readers in a single atomic operation.
There are three approaches here, and all are fine:
1. have a version column, with values identifying which batch iteration originally produced the data. So 2024-12-28 might be a good value, assuming a daily batch job. The app consults a current_versions table to determine which version it should query at a given moment.
2. same as 1, but do this in a view within the DB, so that the readers don’t even need to think about it.
3. load the data into a separate table/partition, and use a DDL operation to atomically move it into the actual location read by the app (sketched below). This assumes the writer can be safely given DDL permissions.
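To make approach 3 concrete, the swap looks roughly like this. The table names (reviews, reviews_staging) are made up; the point is that readers only ever see the old data or the new data, never a half-loaded table:

```sql
-- MySQL: RENAME TABLE swaps multiple tables in a single atomic statement.
RENAME TABLE reviews TO reviews_old,
             reviews_staging TO reviews;

-- Postgres: DDL is transactional, so wrap the two renames in one transaction.
BEGIN;
ALTER TABLE reviews RENAME TO reviews_old;
ALTER TABLE reviews_staging RENAME TO reviews;
COMMIT;
```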
…get a read-only snapshot for reporting
This is equivalent to—and slightly easier than—the question of: how can I get a full SQL dump of my database at a consistent point in time? So let’s solve that problem, which implies solving the original problem as well.
My favored approach here is to separate transactional processing from analytic processing. “I need a snapshot for an indefinite period” is an analytic processing concern. In order to move data from the transactional domain to the analytic domain, the first step is to take a fast physical backup of the transactional DB. Copy that data to somewhere else, start up the DBMS there, and then start hammering it with reads. Congratulations, you now have a single-tenant DB for your use.
Alternatively, it’s perfectly fine to designate a replica for reporting purposes. Just don’t let it affect the primary, so in Postgres you should disable hot_standby_feedback. This is basically a static non-automated multi-tenant version of the above method.
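In configuration terms that’s a single setting on the reporting replica. It’s off by default, but being explicit documents the intent, and a reload is enough to apply it:

```sql
-- On the reporting replica: keep its long-running reads from influencing
-- the primary's purge/vacuum behavior.
ALTER SYSTEM SET hot_standby_feedback = off;
SELECT pg_reload_conf();
```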
What’s the solution?
Have the database self-defend against long-running transactions. My favored approach has always been a cronjob (or a little daemon, take your pick) that scans running transactions and auto-terminates any that have passed their prime. In the bad old MySQL days, this meant writing a parser for SHOW ENGINE INNODB STATUS, but thankfully now that data is available in structured format via information_schema.INNODB_TRX. I haven’t found a way to do this for MyRocks, which is a bummer.
If you don’t want to manage your own cronjob, Postgres 17 has now shipped transaction_timeout, which—contrary to the docs—you should absolutely enable server-wide where appropriate. Alternatively, put it in a proxy layer like pgbouncer, so that client connections come pre-initialized with a timeout.
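Enabling it server-wide is a one-liner, and it can also be scoped per role if some account (a migration user, say) legitimately needs longer. The role name below is made up:

```sql
-- Postgres 17+: terminate any transaction that runs longer than 5 seconds.
ALTER SYSTEM SET transaction_timeout = '5s';
SELECT pg_reload_conf();

-- Or scope the budget to a particular (hypothetical) role.
ALTER ROLE app_rw SET transaction_timeout = '5s';
```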
Included below is a sample transaction killer. It takes the “full FoundationDB” approach of killing transactions after 5 seconds. I’d totally do that if I were building an app from scratch, but legacy apps will probably need different settings.
#!/usr/bin/env python3.12
# Copyright 2024 Josh Snyder
# Licensed under https://www.apache.org/licenses/LICENSE-2.0.txt
import argparse
import math
import signal
import time
from textwrap import dedent as dd
def kill_mysql_transactions(threshold):
    from pymysql import connect
    from pymysql.cursors import DictCursor

    conn = connect()
    with conn:
        with conn.cursor(DictCursor) as cursor:
            cursor.execute(
                dd(
                    """\
                    SELECT
                        p.ID AS thread_id,
                        p.USER AS user_name,
                        p.HOST AS host_name,
                        p.DB AS db_name,
                        t.trx_id AS trx_id,
                        t.trx_started AS trx_started,
                        TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS elapsed_sec
                    FROM information_schema.PROCESSLIST p
                    JOIN information_schema.INNODB_TRX t
                        ON p.ID = t.trx_mysql_thread_id
                    WHERE TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) > %s
                    """
                ),
                (threshold,),
            )
            rows = cursor.fetchall()

            for row in rows:
                thread_id = row["thread_id"]
                cursor.execute(f"KILL {thread_id}")
                print(
                    f"[MySQL] Killed thread={thread_id} "
                    f"(user={row['user_name']}, host={row['host_name']}, "
                    f"db={row['db_name']}, trx_id={row['trx_id']}, "
                    f"started={row['trx_started']}, elapsed={row['elapsed_sec']}s)"
                )

def kill_postgres_transactions(threshold):
    from psycopg2 import connect
    from psycopg2.extras import DictCursor

    conn = connect()
    with conn:
        with conn.cursor(cursor_factory=DictCursor) as cursor:
            cursor.execute(
                dd(
                    f"""\
                    SELECT
                        pid,
                        usename,
                        datname,
                        xact_start,
                        EXTRACT(EPOCH FROM (NOW() - xact_start)) AS elapsed_sec
                    FROM pg_stat_activity
                    WHERE NOW() - xact_start > INTERVAL '{threshold} seconds'
                    """
                )
            )
            rows = cursor.fetchall()

            for row in rows:
                pid = row["pid"]
                cursor.execute("SELECT pg_terminate_backend(%s)", (pid,))
                print(
                    f"[Postgres] Terminated pid={pid} "
                    f"(user={row['usename']}, db={row['datname']}, "
                    f"started={row['xact_start']}, elapsed={int(row['elapsed_sec'])}s)"
                )

# Dictionary for kill functions based on db_type
KILL_FUNCTIONS = {
    "mysql": kill_mysql_transactions,
    "postgres": kill_postgres_transactions,
}
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--db-type",
        choices=["mysql", "postgres"],
        required=True,
        help="Choose the database backend.",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=5.0,
        help="Transaction runtime threshold in seconds.",
    )
    parser.add_argument(
        "--interval",
        type=float,
        default=None,
        help="If provided, repeat checking at this interval (seconds).",
    )
    args = parser.parse_args()

    run_once = KILL_FUNCTIONS[args.db_type]
    while True:
        run_once(args.threshold)
        if args.interval is None:
            break
        # If we get stuck, die
        signal.alarm(math.ceil(args.interval * 5))
        time.sleep(args.interval)


if __name__ == "__main__":
    main()
1. This is in fact the FoundationDB upgrade strategy. Just blip all of the servers in the cluster simultaneously, and hope nobody notices. It works for them (I have been told) due to their ridiculously powerful testing regimen. ↩
2. Which would be SERIALIZABLE isolation. DBs can also provide less restrictive isolation levels. ↩