Every database needs a transaction killer
The previous post discussed ways that clients could optimize their interactions with RDBMSes to guarantee correctness and performance of their transactional logic. This post will take the opposite perspective: how can database servers defend themselves against the negative effects of transactions? While the technology from that post might get deployed once or twice if we’re lucky, the technology from this one is more “everyone should deploy it everywhere, if they haven’t already”.
Story time
Within a week of starting my internship at an online review aggregator company, one of my tickets was to make a map-reduce based batch job whose results would get loaded into an OLTP MySQL database. I went and wrote the functionality, added some tests, got it code reviewed, and added it to the job scheduler. All of this was very exciting for a wee intern. Then it completed for the first time and brought down prod. And that’s how I met the DBA team…
Long story short: my batch job had written all of its data to the database in a single massive transaction, comprising 2GB or so. This was perfectly fine for the primary MySQL node, which happily accepted the data from my batch job over the course of an hour or so. It was not fine for the replicas. The role of a primary node is to produce a list of transactions in serial order (called a binlog); the role of a replica is to ingest those transactions…in serial order. The presence of a 2GB transaction in the binlog implied that each of the replicas must chew through that 2GB transaction before applying anything else. The primary didn’t have this problem because it had the opportunity to apply each INSERT as it arrived, and was taking other read/write traffic concurrently.
Did I mention that this MySQL database cluster was also handling a significant amount of interactive customer traffic? Yeah… Anyways, we had a strict SLA on how much a replica could lag, which required us to start disabling site features when the lag exceeded that SLA. If my memory serves, that outage lasted about 30 minutes, but I can’t be certain because I was just a baby intern who wasn’t familiar with database replication, let alone bringing down prod.
Fast forward 18 months or so, and I’m now oncall for the same company’s DBA team (I had been forgiven, thank goodness). The Python monolith had deployed at around 9:30am that morning, and ever since we had been seeing some intermittent database errors, like lock wait timeouts on our most popular tables. Multiple deploys had gone out since, but the errors kept coming. We couldn’t figure out why: none of the logic in that deploy looked suspicious. By about 4pm we were looking at rolling back the entire contents of the morning deploy, when a colleague asked me to check open database transactions. Sure enough, there was an open transaction that had been sitting around since that morning, holding locks and doing nothing of value. We killed it and the errors subsided. Thank goodness: I had zero interest in being the DBA who forced a bunch of devs to roll back their code over a red herring.
Let us count the ways
As far as I can tell, long transactions are bad for three reasons.
Old data
Databases need to keep copies of old data, matching the age of the oldest transaction currently running. When that transaction finishes, the storage engine can purge the leftover copies up to the next oldest transaction. Rinse and repeat, and you’ll have the daily life of an InnoDB purge thread or the Postgres autovacuum daemon. This is a read-write coupling: holding open a read-only transaction can inhibit purging/vacuuming, and ending a read-only transaction can create a sudden flurry of writes.
In InnoDB, this is the undo log: a linked list of row versions, starting from the current version’s ROLL_PTR and moving backwards in time. When a transaction wants to read an older version, it starts at the current version and moves backward in time in an O(n) scan. The only good news is that the InnoDB undo tablespace—since it is stored in separate files—is likely to have different cache characteristics from the rest of the dataset: it’s more likely to be kept hot in cache, especially if it is small relative to the rest of your data (which it should be).
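To see which transactions are pinning that history, MySQL exposes both the open-transaction list and the purge backlog. A quick sketch of the queries I’d reach for (assumes MySQL 5.7+; trx_rseg_history_len is the “history list length” counter):

```sql
-- Oldest open transactions: these anchor the undo history.
SELECT trx_id, trx_mysql_thread_id, trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_sec
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;

-- Purge backlog ("history list length"): undo records not yet purged.
SELECT NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';
```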
In Postgres, old copies of data (“dead tuples”) live in the same heap as the live tuples. The VACUUM process avoids freeing any tuple that might still be needed by an open transaction. Since Postgres deployments often use physical replication, the autovacuum daemon needs to accommodate not only the transactions running on the primary server, but also the replica servers. This means that replicas have to give feedback to the primary about what transactions they have open. This is configurable through hot_standby_feedback. On means: “I want the vacuum behavior of my primary to be determined by its replicas”. Off means: “I want my replicas to suddenly stop replicating (and start lagging) if they’re still serving tuples that my primary has purged”.
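The Postgres view of the same problem: each backend’s backend_xmin in pg_stat_activity is the oldest transaction it still needs to see, and the oldest of those is what VACUUM has to respect (with hot_standby_feedback on, replica horizons are visible in pg_stat_replication.backend_xmin as well). A minimal sketch:

```sql
-- Which backends are holding back the vacuum horizon, oldest first.
SELECT pid, usename, state,
       age(backend_xmin) AS xmin_age_in_xids,
       now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;
```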
Obviously all of these approaches require additional storage space that could be used for other purposes. Depending on how quickly your DB accepts writes, this might matter a lot, or just a little.
Failover
RDBMSes treat transaction state as transient and volatile. Neither of the two I’m familiar with has a facility to resume an uncommitted transaction against a different server (short of just checkpointing the entire VM). As such, the latency with which a database can fail over is intimately linked to the amount of time it takes for the slowest clients to finish up their open transactions.
Fast failover is good! Many web services can sustain a 1 second outage just by buffering inbound requests, thereby converting an outage into latency¹. By the time we’re talking about a 5 second outage, things are pretty iffy. Beyond that, definitely not.
But even if we aren’t in the “just buffer requests until the outage resolves” territory, shorter more-retryable transactions mean that there is less sunk cost accumulated in your client-server system when an outage/failover does occur.
Locks
At one point I thought I understood how databases use locks for MVCC. I was wrong, obviously. For reference, just yesterday I learned something entirely new and mind-blowing about Postgres locks.
The RDBMS manages locks to ensure that transactions see a serial view of causality², even though they are executing in parallel. Locks are a massive blessing, because they take the burden of reasoning about the concurrency semantics of our apps (mostly) out of the hands of the typical app developer or DBA, and put it in the hands of a much smaller cadre of database developers who presumably understand these concepts much better than we do. But they’re also a curse. Even when properly used, locks either: a) cause queries to block on one another, or b) cause operations to fail and need to be retried. Put another way: locks are micro-outages of portions of your dataset. We should aim to minimize these outages, which means minimizing the number of locks held at any one time.
Little’s law informs us that there are two ways to do this: 1) decrease the number of locks taken by queries or 2) decrease the time they are held. We should mostly focus on approach #2, because decreasing the number of locks taken directly interferes with the usefulness of the locks.
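When those micro-outages do happen, both databases will show you who is blocked and who is doing the blocking. A sketch of the kind of query I mean (pg_blocking_pids needs Postgres 9.6+; the MySQL version assumes the sys schema shipped with 5.7+):

```sql
-- Postgres: sessions that are currently blocked, and who blocks them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- MySQL: the same picture via the sys schema.
SELECT waiting_pid, blocking_pid, wait_age, waiting_query
FROM sys.innodb_lock_waits;
```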
How do I…
…load a lot of data
Bit by bit, in numerous small transactions. And then all at once: reveal the data to readers in a single atomic operation.
There are three approaches here, and all are fine:
1. have a version column, with values identifying which batch iteration originally produced the data. So 2024-12-28 might be a good value, assuming a daily batch job. The app consults a current_versions table to determine which version it should query at a given moment.
2. same as 1, but do this in a view within the DB, so that the readers don’t even need to think about it.
3. load the data into a separate table/partition, and use a DDL operation to atomically move it into the actual location read by the app (sketched below). This assumes the writer can be safely given DDL permissions.
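To make approach 3 concrete, the swap looks roughly like this. The table names (reviews, reviews_staging) are made up; the point is that readers only ever see the old data or the new data, never a half-loaded table:

```sql
-- MySQL: RENAME TABLE swaps multiple tables in a single atomic statement.
RENAME TABLE reviews TO reviews_old,
             reviews_staging TO reviews;

-- Postgres: DDL is transactional, so wrap the two renames in one transaction.
BEGIN;
ALTER TABLE reviews RENAME TO reviews_old;
ALTER TABLE reviews_staging RENAME TO reviews;
COMMIT;
```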
…get a read-only snapshot for reporting
This is equivalent to—and slightly easier than—the question of: how can I get a full SQL dump of my database at a consistent point in time? So let’s solve that problem, which implies solving the original problem as well.
My favored approach here is to separate transactional processing from analytic processing. “I need a snapshot for an indefinite period” is an analytic processing concern. In order to move data from the transactional domain to the analytic domain, the first step is to take a fast physical backup of the transactional DB. Copy that data to somewhere else, start up the DBMS there, and then start hammering it with reads. Congratulations, you now have a single-tenant DB for your use.
Alternatively, it’s perfectly fine to designate a replica for reporting purposes. Just don’t let it affect the primary, so in Postgres you should disable hot_standby_feedback. This is basically a static non-automated multi-tenant version of the above method.
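In configuration terms that’s a single setting on the reporting replica. It’s off by default, but being explicit documents the intent, and a reload is enough to apply it:

```sql
-- On the reporting replica: keep its long-running reads from influencing
-- the primary's purge/vacuum behavior.
ALTER SYSTEM SET hot_standby_feedback = off;
SELECT pg_reload_conf();
```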
What’s the solution?
Have the database self-defend against long-running transactions. My favored approach has always been a cronjob (or a little daemon, take your pick) that scans running transactions and auto-terminates any that have passed their prime. In the bad old MySQL days, this meant writing a parser for SHOW ENGINE INNODB STATUS, but thankfully now that data is available in structured format via information_schema.INNODB_TRX. I haven’t found a way to do this for MyRocks, which is a bummer.
If you don’t want to manage your own cronjob, Postgres 17 has now shipped transaction_timeout, which—contrary to the docs—you should absolutely enable server-wide where appropriate. Alternatively, put it in a proxy layer like pgbouncer, so that client connections come pre-initialized with a timeout.
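Enabling it server-wide is a one-liner, and it can also be scoped per role if some account (a migration user, say) legitimately needs longer. The role name below is made up:

```sql
-- Postgres 17+: terminate any transaction that runs longer than 5 seconds.
ALTER SYSTEM SET transaction_timeout = '5s';
SELECT pg_reload_conf();

-- Or scope the budget to a particular (hypothetical) role.
ALTER ROLE app_rw SET transaction_timeout = '5s';
```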
Included below is a sample transaction killer. It takes the “full FoundationDB” approach of killing transactions after 5 seconds. I’d totally do that if I were building an app from scratch, but legacy apps will probably need different settings.
#!/usr/bin/env python3.12
# Copyright 2024 Josh Snyder
# Licensed under https://www.apache.org/licenses/LICENSE-2.0.txt
import argparse
import math
import signal
import time
from textwrap import dedent as dd
def kill_mysql_transactions(threshold):
    from pymysql import connect
    from pymysql.cursors import DictCursor

    conn = connect()
    with conn:
        with conn.cursor(DictCursor) as cursor:
            cursor.execute(
                dd(
                    """\
                    SELECT
                        p.ID AS thread_id,
                        p.USER AS user_name,
                        p.HOST AS host_name,
                        p.DB AS db_name,
                        t.trx_id AS trx_id,
                        t.trx_started AS trx_started,
                        TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS elapsed_sec
                    FROM information_schema.PROCESSLIST p
                    JOIN information_schema.INNODB_TRX t
                        ON p.ID = t.trx_mysql_thread_id
                    WHERE TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) > %s
                    """
                ),
                (threshold,),
            )
            rows = cursor.fetchall()

            for row in rows:
                thread_id = row["thread_id"]
                cursor.execute(f"KILL {thread_id}")
                print(
                    f"[MySQL] Killed thread={thread_id} "
                    f"(user={row['user_name']}, host={row['host_name']}, "
                    f"db={row['db_name']}, trx_id={row['trx_id']}, "
                    f"started={row['trx_started']}, elapsed={row['elapsed_sec']}s)"
                )

def kill_postgres_transactions(threshold):
    from psycopg2 import connect
    from psycopg2.extras import DictCursor

    conn = connect()
    with conn:
        with conn.cursor(cursor_factory=DictCursor) as cursor:
            cursor.execute(
                dd(
                    f"""\
                    SELECT
                        pid,
                        usename,
                        datname,
                        xact_start,
                        EXTRACT(EPOCH FROM (NOW() - xact_start)) AS elapsed_sec
                    FROM pg_stat_activity
                    WHERE NOW() - xact_start > INTERVAL '{threshold} seconds'
                    """
                )
            )
            rows = cursor.fetchall()

            for row in rows:
                pid = row["pid"]
                cursor.execute("SELECT pg_terminate_backend(%s)", (pid,))
                print(
                    f"[Postgres] Terminated pid={pid} "
                    f"(user={row['usename']}, db={row['datname']}, "
                    f"started={row['xact_start']}, elapsed={int(row['elapsed_sec'])}s)"
                )

# Dictionary for kill functions based on db_type
KILL_FUNCTIONS = {
    "mysql": kill_mysql_transactions,
    "postgres": kill_postgres_transactions,
}
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--db-type",
        choices=["mysql", "postgres"],
        required=True,
        help="Choose the database backend.",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=5.0,
        help="Transaction runtime threshold in seconds.",
    )
    parser.add_argument(
        "--interval",
        type=float,
        default=None,
        help="If provided, repeat checking at this interval (seconds).",
    )
    args = parser.parse_args()

    run_once = KILL_FUNCTIONS[args.db_type]
    while True:
        run_once(args.threshold)
        if args.interval is None:
            break
        # If we get stuck, die
        signal.alarm(math.ceil(args.interval * 5))
        time.sleep(args.interval)


if __name__ == "__main__":
    main()
1. This is in fact the FoundationDB upgrade strategy. Just blip all of the servers in the cluster simultaneously, and hope nobody notices. It works for them (I have been told) due to their ridiculously powerful testing regimen. ↩
2. Which would be SERIALIZABLE isolation. DBs can also provide less restrictive isolation levels. ↩