Observations about S3

How to use S3

This guide is meant to provide a quick introduction to S3. Some of the lessons likely apply to S3-like datastores such as Google Cloud Storage, Backblaze B2, and Cloudflare R2, but I can't tell you which ones, because I haven't used the others.

Let's start with some rule-of-thumb performance numbers. I last checked these a few years ago, but they're probably still accurate:

- Time to first byte (ttfb): roughly 15ms per request
- Throughput of a single request stream: roughly 100MB/s

Observation 1: the universal speed limit

The 100MB/s quoted above is an absolute speed limit. No single operation in S3 will ever proceed faster than 100MB/s. But we can skirt this limit with concurrent operations.
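
For example, a large object can be pulled down faster than 100MB/s by splitting it into byte ranges and fetching the ranges concurrently. A minimal sketch with boto3 (the bucket name, key, range size, and thread count are all placeholder choices, not recommendations):

import concurrent.futures
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "my-large-object"   # hypothetical names

def fetch_range(start, end):
    # Each ranged GET is its own request, so each one gets its own ~100MB/s budget.
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
chunk = 64 * 1024 * 1024  # 64MB per range
ranges = [(i, min(i + chunk, size) - 1) for i in range(0, size, chunk)]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(lambda r: fetch_range(*r), ranges))

data = b"".join(results[start] for start, _ in ranges)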

Observation 2: latency dictates object size

Right off the bat, we can see that S3's 15ms of ttfb latency costs us about 1.5MB of potential throughput per request (15ms at 100MB/s). This gives us a design lesson: our objects should typically be bigger than a few megabytes. Downloading a 128MB object will "waste" only about 1% of its time on first-byte latency; by contrast, a 1MB object will spend 60% of its time waiting for its first byte.
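
The arithmetic behind those percentages, using the rule-of-thumb numbers from above, for anyone who wants to plug in their own object sizes:

# Fraction of a GET spent waiting for the first byte, assuming the
# rule-of-thumb numbers above (15ms ttfb, 100MB/s per stream).
TTFB_S = 0.015
STREAM_MBPS = 100

def ttfb_overhead(object_mb):
    transfer_s = object_mb / STREAM_MBPS
    return TTFB_S / (TTFB_S + transfer_s)

print(f"{ttfb_overhead(128):.0%}")  # ~1%
print(f"{ttfb_overhead(1):.0%}")    # 60%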

Observation 2.5: connection keep-alive is your friend

The latency numbers mentioned above apply only when the TCP connection is open and the TLS handshake has already occurred. This implies that the best approach is to have an open connection to S3 ready to go whenever you might want to make an API call.

Unfortunately it does not appear that you can keep connections open to S3 indefinitely: in my tests, the server hung up on me after 20 seconds of silence. For requests on a connection after the first, I got only six seconds to send the request before S3 hung up.
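
One workaround, sketched below, is to keep a pooled connection warm by issuing a cheap request every few seconds. This is my own assumption about what works, not documented S3 behavior: HeadBucket is used purely as a ping (it still costs an API call), and whether the warm connection is actually reused for the next real request is up to botocore's connection pool.

import threading
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical name

def keep_warm(interval_s=5):
    # Ping more often than the ~6 second post-request idle window observed above,
    # so the pooled TCP+TLS connection never goes stale.
    while True:
        s3.head_bucket(Bucket=BUCKET)
        time.sleep(interval_s)

threading.Thread(target=keep_warm, daemon=True).start()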

Observation 3: an object is the sum of its parts

Since 2010, S3 has supported multipart upload. The multipart upload API looks like this:

POST /<key>?uploads -> (uploadId)
PUT /<key>?uploadId=<uploadId>&partNumber=1
PUT /<key>?uploadId=<uploadId>&partNumber=2
PUT /<key>?uploadId=<uploadId>&partNumber=3
...
PUT /<key>?uploadId=<uploadId>&partNumber=n
POST /<key>?uploadId=<uploadId>

The initial POST is called an InitiateMultipartUpload and the subsequent PUTs are UploadPart calls. The final POST is called CompleteMultipartUpload and contains a manifest describing each of the uploaded parts. The entire object appears atomically when S3 processes the CompleteMultipartUpload API call.
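
Mapped onto boto3's low-level client, the same sequence looks roughly like this (bucket, key, filename, and part size are placeholders):

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "my-object"      # hypothetical names
PART_SIZE = 256 * 1024 * 1024               # parts other than the last must be >= 5MB

# InitiateMultipartUpload
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]

parts = []
with open("local-file", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(PART_SIZE)
        if not chunk:
            break
        # UploadPart (these calls can also be issued concurrently)
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                              PartNumber=part_number, Body=chunk)
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# CompleteMultipartUpload: the manifest of parts; the object appears atomically
s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})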

I had initially thought that S3 would concatenate uploaded object parts, either during each UploadPart call or maybe at the final CompleteMultipartUpload call. So far as I can tell, this assumption is wrong: the S3 storage layer never concatenates the parts that you upload. Instead, the S3 read layer provides the illusion of a single object out of many object parts. When serving a GET, the S3 server will sequentially request object parts from the storage layer, starting at Part 1 and ending at Part N.

S3’s GET API also accepts a partNumber parameter, which accesses object parts as originally uploaded. I once did a test of whether it was faster to access objects by partNumber rather than by byte range. I found no difference, even when the byte range sizes didn’t match the chunk size of the original upload.
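
Both access paths are available on the same GetObject call, so the comparison is easy to reproduce (names and offsets below are placeholders):

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "my-multipart-object"  # hypothetical names

# Fetch the object's second part exactly as it was uploaded...
by_part = s3.get_object(Bucket=BUCKET, Key=KEY, PartNumber=2)["Body"].read()

# ...or fetch an arbitrary byte range that ignores the original part boundaries.
by_range = s3.get_object(Bucket=BUCKET, Key=KEY,
                         Range="bytes=1000000-1999999")["Body"].read()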

Observation 3.5: part limit and size limit

You can upload at most 10,000 parts in a single object. Each object can be at most 5 TB. If you want to upload a 5TB object, this implies that your average part size must be at least 500MB (5TB / 10000).

Observation 4: big operations are good, except when they aren’t

In general you want to perform the largest feasible operation against S3 at all times. Each API call costs time and money, and so it is best to move as much data per call as you possibly can.

At least one library (s3transfer, included in boto3) has a ridiculously small default chunk size of 8MB. Since s3transfer has a default concurrency of 10 threads, it reaches full concurrency only for uploads of around 80MB or more (10 threads * 8MB). Even this is optimistic, because in most cases each request thread will need to perform a TLS handshake before sending data. For larger transfers, such small chunk sizes are wasteful: with a chunk size of 8MB, you waste 19% of your throughput (1.5/8) to ttfb latency.
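
If you are stuck with boto3, the chunk size and concurrency are at least configurable through TransferConfig. A sketch with a 256MB chunk size; the right value depends on your object sizes and how much RAM you can spare, and the file and bucket names are placeholders:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=256 * 1024 * 1024,  # don't bother with multipart below this
    multipart_chunksize=256 * 1024 * 1024,  # much larger than the 8MB default
    max_concurrency=10,
)

s3.upload_file("backup.tar.zst", "my-bucket", "backups/backup.tar.zst", Config=config)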

An exception applies when you have a tenuous network connection: if the connection is likely to cut out during your operation, then it is better to break the upload into small units. Each unit becomes durable the moment it is acknowledged by S3, so losing your connection might mean re-uploading only 5MB rather than 500MB. In the context of multipart uploads, the minimum part size is 5MB, but for single-part uploads (and the last part of a multipart upload) there is no minimum.

Another exception applies to streaming uploads in memory-constrained situations. When data is arriving via a stream (i.e. not originally on a disk), it is necessary to buffer a full PUT API call’s worth of data in RAM while waiting for S3 to accept the write. If I’m performing 12 concurrent 256MB uploads to S3 as part of a database backup, then I should set aside as much as 3072MB of RAM that I could otherwise use as a cache for my database. I can cut the RAM usage by uploading fewer parts.

Observation 4.5: concatenation

Let’s imagine I want to upload a logfile to S3 every minute, so that no more than 1 minute of log data is lost in the event of a crash. At the end of one day, I’ll have 1440 objects in S3, and downloading them will require 1440 GET API calls (about 22 seconds of ttfb latency). Alternatively, I can use the multipart upload API to upload 1440 individual parts, and after the upload is completed S3 will allow me to download them in a single GET (15 ms of TTFB latency).

One fun trick is that file formats like tar, zstd, lz4, gzip, and squashfs are all "concatenatable". This means that you can concatenate two valid files into a single S3 object, and the result will decode as if it had been uploaded as a single file.

$ echo -n "hello " | zstd > helloworld.zst
$ echo "world" | zstd >> helloworld.zst
$ zstdcat helloworld.zst
hello world
$

Observation 5: It’s best to stay busy!

The trick to getting good performance with S3 is to pick your desired concurrency, and then…don’t be idle. The shell snippet below illustrates how NOT to do it:

for file in *; do aws s3 cp $file s3://bucket/$file; done

The problem here is that there is a substantial warm-up and cool-down period at the beginning and end of each invocation of awscli. Let's say that each file is 10GB, and we're uploading eight concurrent parts of 250MB each, for a total of 40 parts. In an ideal scenario, we would dispatch five waves of eight parts each, in perfect synchrony, with no wasted time. In reality there is a natural dispersion in the performance of the uploads. As each file upload wraps up, we are likely to be stuck waiting for just one or two parts to finish, while the other six or seven threads sit idle. Our performance would be better if they started work on the next file, but they can't, because the total work has been divided on a file-by-file basis rather than a chunk-by-chunk basis.

The other problem with this approach is the resources spent on each successive invocation of awscli. By the time the first upload completes, we have spun up a Python interpreter to run awscli and performed 8 TLS handshakes. It is much better to perform that work once, rather than once per file.
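
A sketch of the chunk-by-chunk alternative: one long-lived process flattens every file into a single queue of (file, part number) work items, so no thread idles while any work remains. Names, sizes, and the lack of error handling are all illustrative simplifications.

import concurrent.futures
import glob
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                 # hypothetical name
PART_SIZE = 250 * 1024 * 1024

uploads = {}   # path -> (key, upload_id, [completed part dicts])
work = []      # one entry per chunk, across all files

for path in glob.glob("*"):
    key = os.path.basename(path)
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=key)["UploadId"]
    uploads[path] = (key, upload_id, [])
    n_parts = (os.path.getsize(path) + PART_SIZE - 1) // PART_SIZE
    work.extend((path, part) for part in range(1, n_parts + 1))

def upload_chunk(item):
    path, part_number = item
    key, upload_id, parts = uploads[path]
    with open(path, "rb") as f:
        f.seek((part_number - 1) * PART_SIZE)
        body = f.read(PART_SIZE)
    resp = s3.upload_part(Bucket=BUCKET, Key=key, UploadId=upload_id,
                          PartNumber=part_number, Body=body)
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

# Eight workers stay busy across file boundaries instead of idling at each file's tail.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload_chunk, work))

for key, upload_id, parts in uploads.values():
    parts.sort(key=lambda p: p["PartNumber"])
    s3.complete_multipart_upload(Bucket=BUCKET, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})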

Observation 6: S3 is not a CDN

(This section is based on my interaction with S3’s “Standard” storage class, and may not apply to the other storage classes. But it probably does.)

Most operations against S3 parallelize quite cleanly. One exception is that attempts to download the same object at the same time will hit a limit somewhere around 500MB/s, meaning that performance starts to drop off a cliff once more than ~5 clients are simultaneously downloading the same object.

So far as I can tell, the S3 backend retrieves all objects equally fast, with no variation based on how an object has been accessed in the past. Speculating a bit, it seems that S3 holds a fixed number of available copies of each object, with any variation being due to maintenance, not user behavior. An S3 bucket does not have an associated "cache tier" for recently accessed data.

If you want to store a piece of popular data, the "solution" to this problem is to upload the same object many times. If you want to support 100 concurrent clients, then you should upload each piece of data at least 20 times (100 / 5), and assign each client to download one of the 20 identical objects. (If the assignment process is random, you should leave extra headroom for the binomial distribution that governs how many clients land on each copy.)
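
A sketch of what that looks like, with hypothetical key names. Hashing the client id makes the assignment deterministic, though statistically it behaves like the random assignment mentioned above, so the headroom advice still applies:

import hashlib

N_REPLICAS = 20  # sized for ~100 concurrent clients at ~5 clients per copy

def replica_key(base_key: str, client_id: str) -> str:
    # Deterministically spread clients across the N identical copies.
    h = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    return f"{base_key}.replica-{h % N_REPLICAS}"

# The uploader writes the same bytes to base_key.replica-0 .. .replica-19;
# each client downloads replica_key("popular-object", its_own_id).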

Of course, this solution is quite dissatisfying, because it blends the problems of storing the data, where all of the decisions can be made up-front, with the problem of distributing the data, where the decisions vary with read traffic. It would be quite cool if there were some kind of CDN with an S3-compatible API and IAM for access control, but I’m not aware of any such offering.

Observation 7: metadata isn’t S3’s strong suit

Conceptually, S3 can be divided into two separate services. The first one (call it "Data") stores objects that are nameless and immutable, and has functionally infinite scalability.[1] The second service (call it "Metadata") associates names and version numbers with those immutable objects, and allows for objects to change by substitution.

Metadata is a fine service, and works well for most purposes, but compared to Data, it is a weakling. (Maybe a fairer way to say it is that Metadata is an inherently tougher problem to solve in a scalable way.) Most of the workloads I have for S3 do not place much stress on Metadata, and I prefer to keep it that way. For that reason, I avoid:

Observation 8: S3 has “copy”, but not “rename” or “move”

I know that the S3 Web Console claims to provide all of these things, but they are all essentially lies. The S3 copy operation works just fine, but it might be better called an “UploadFrom” operation. It’s a way of telling S3 to create a new object, but instead of providing its contents, you provide a link to another location in S3 where those contents can be found. Some worker within S3 retrieves that object and then stores it back in a different location, creating a brand-new object with no connection to the original, except that they both contain the same data.

The S3 copy API requires O(n) API calls, where n is the size of the object to be copied. It also doesn’t seem to be any faster than just uploading the data again, assuming your data is already in the local EC2 region. However, using the copy API still provides an advantage by saving resources (e.g. CPU, NIC bandwidth) that would otherwise be used by your instance in transferring the data.
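
In boto3, the single-call form is copy_object, which is limited to objects up to 5GB; for anything larger, the managed copy helper issues the multipart UploadPartCopy calls for you. A sketch with placeholder names:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
src = {"Bucket": "my-bucket", "Key": "old/key"}   # hypothetical names

# Single API call; the data never leaves S3, but the call is limited to 5GB.
s3.copy_object(CopySource=src, Bucket="my-bucket", Key="new/key")

# Managed copy: boto3 issues the multipart UploadPartCopy calls (O(n) of them)
# on your behalf, still without the bytes flowing through your instance.
s3.copy(src, "my-bucket", "new/big-key",
        Config=TransferConfig(multipart_chunksize=256 * 1024 * 1024))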

A “rename” or “move” operation in S3 is implemented by copying the object and then issuing an API call to delete the original object. This is vastly different from S3’s closest analogue (filesystems), where the rename operation is O(1), completes atomically, and preserves the other metadata on a file.

Observation 8.5: indirection does not exist

One rather useful feature provided by filesystems is the ability to make data show up in two (or three, or four…) places at once. Modern filesystems provide three methods for achieving this. You can give a file two names (a hard link). You can create a new file that tells your operating system “hey, look over here instead” (a symbolic link). More recently, some filesystems gained the ability to have two separate files, but to share the cost of storing identical sections (a reflink).

S3 provides none of these things. Which is a shame, because features of this kind can be quite useful. Let's say I want to store the results of a CI job that produces many artifacts, and only a portion of them change with each commit. It would be quite nice to be able to cheaply copy all of the unchanged objects from the Version1 prefix of my bucket to the Version2 prefix of the same bucket. And naturally, when an object is deleted, the underlying storage would be reference-counted: this way the storage is freed when there are no remaining objects holding that data.

Two other use-cases I have encountered: container image registries and LSM database (e.g. Cassandra) backups. Both of these involve a set of immutable objects (container layers and SSTables, respectively) that need to be stored, and a set of GC roots (containers and database snapshots, respectively) that reference them.

Lacking this feature within S3, I usually see it implemented haphazardly on a use-case by use-case basis. The implementation of atomic reference counting atop a non-transactional datastore like S3 is somewhat fiddly. I find that high-quality implementations incorporate a transactional database like DynamoDB. Adding DynamoDB brings a “you had one problem, and now you have two problems” vibe, but I still strongly believe that it’s the right approach. Based on the implementations that I’ve been involved with, I have found that the deletion logic is terrifying; triply so in the production context where the consequences are “we lost all of our backups”. In summary, this is exactly the sort of functionality that is best addressed by a well-resourced team that has experience deploying datastores and verifying storage invariants, not random developers who were assigned to manage their company’s database backups.
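
To make the shape of that bookkeeping concrete, here is a deliberately oversimplified sketch (table name, key schema, and attribute names are invented). It shows only the atomic counter updates and leaves out the genuinely hard part, namely making the S3 delete and the DynamoDB state agree under crashes, retries, and races:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("blob-refcounts")   # hypothetical table, partition key "blob_key"

def add_reference(blob_key: str):
    # Atomically bump the refcount when a new GC root starts pointing at the blob.
    table.update_item(
        Key={"blob_key": blob_key},
        UpdateExpression="ADD refcount :one",
        ExpressionAttributeValues={":one": 1},
    )

def drop_reference(blob_key: str) -> bool:
    # Atomically decrement; returns True when the last reference is gone and the
    # underlying S3 object is eligible for deletion (the scary part, not shown).
    resp = table.update_item(
        Key={"blob_key": blob_key},
        UpdateExpression="ADD refcount :minus_one",
        ExpressionAttributeValues={":minus_one": -1},
        ReturnValues="UPDATED_NEW",
    )
    return resp["Attributes"]["refcount"] <= 0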

Observation 9: Folders are an illusion

Another lie that the S3 Web Console likes to tell is “folders”. S3 doesn’t have a concept of “folders” or “directories”. To S3, everything is a named blob. Directories are a convenient way of displaying a hierarchy of named blobs, but they’re a concept invented by S3 clients, not S3 itself.

In a filesystem, you must create a directory before adding any files within it. In S3, by contrast, you can create an object with any name at any time, without first creating a directory to contain it. If I upload an object called “a/b/c/d/e” to S3, the Web Console (and many other S3-compatible tools) will tell me that I have created a “folder” called “a/”. If I then delete the object, all of those tools will forget that folder “a/” and its children ever existed (unless there’s some other object that begins with “a/”).

The client tools behave this way because of how the ListObjects (and ListObjectsV2) APIs are structured. When you call ListObjects, you provide a “prefix” that all returned keys must match. This condition is just a raw comparison over the UTF-8 encoded bytes in an object’s key. You can also optionally provide a “delimiter”, which is used to group together related entries. Conventionally this is the string “/”, but it can be any string. When it is “/”, the resulting semantics look somewhat like the way files and directories are presented in a unix filesystem.
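
A sketch of the call that client tools build their "folder" view from (bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")

resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="a/b/", Delimiter="/")

# Keys directly "inside" a/b/ ...
for obj in resp.get("Contents", []):
    print("object:", obj["Key"])

# ...and the grouped "subfolders", which are just common prefixes.
for cp in resp.get("CommonPrefixes", []):
    print("common prefix:", cp["Prefix"])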

The ListObjects API calls the groups of objects that it returns "Common Prefixes", but this name is less catchy and accessible than "folder". If I press "Create folder" in the Web Console, the UI informs me: "When you create a folder, S3 creates an object using the name that you specify followed by a slash (/). This object then appears as folder on the console." In other words, if I create a folder called "foo", it uploads a zero-byte blob called "foo/". In a unix filesystem, a file path that ends in a slash would be disallowed, but in S3 it is perfectly acceptable!

In a filesystem, you can rename a directory atomically in O(1) time. In S3 the operation is non-atomic and takes O(n) time proportional to the data stored “in” the folder. If you find yourself renaming entire prefixes in S3, then you have done something quite wrong!

Observation 10: S3 operations can stress your local system resources

The 100MB/s afforded by a single stream to S3 is a rather high transfer rate, all things considered. It is 80% of a gigabit NIC. It is two thirds of what can be expected of a NAS-grade HDD (assuming 150MB/s). In some cases (especially when working with consumer-grade equipment and Internet connections), one stream to S3 is all one needs.

In the remaining cases, though, CPU can become the primary bottleneck. When I worked with database backups, the goal was to be able to run incremental backups on the same host as a latency-critical database workload, but impose no additional latency on queries. To do this, we needed to be as efficient with CPU cycles as possible.

Most S3 uploads will hash the data twice, and encrypt it once. The first hash is a data-at-rest checksum that gets passed to S3 and checked by S3 before storing the data. Traditionally this was the Content-MD5 header (and the MD5 algorithm), but recently S3 shipped support for other checksum algorithms. The encryption and the second hash are the data-in-transit encapsulation performed by the TLS protocol, usually in the form of AES-GCM, which combines the AES encryption algorithm with the GMAC authentication code. None of this is used anywhere within S3; it gets stripped off in the early stages of request processing and is never seen again.

At 100MB/s, the best-case scenario is to spend about 2% of a CPU core on AES-GCM for TLS (assuming AES-GCM proceeds at 4GB/s) and 14% on MD5 (assuming MD5 proceeds at 700MB/s), plus whatever overhead is imposed by your network stack. Unfortunately, a higher-quality hashing algorithm like SHA-256 would cost about 25% of a CPU core (assuming SHA-256 proceeds at 400MB/s). If your data is already encrypted and is being hashed with SHA-256, a connection to S3 without TLS can, in theory, still be secure. But as seen above, going without TLS would probably net only a ~2% reduction in CPU usage. It would be quite nice if S3 supported object checksumming with faster+better algorithms like BLAKE3 and xxHash, but unfortunately it does not.
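
That budget is easy to recompute for your own hardware; treat the per-core throughput figures as assumptions to be measured, not constants:

# Rough per-core CPU cost of sustaining a 100MB/s upload stream,
# using the (assumed) single-core throughputs quoted above.
STREAM_MBPS = 100
costs = {
    "AES-GCM (TLS)": 4000,   # MB/s per core
    "MD5 checksum":  700,
    "SHA-256":       400,
}
for name, mbps in costs.items():
    print(f"{name}: {STREAM_MBPS / mbps:.0%} of a core")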

Oftentimes we’ll want to perform other CPU-hungry work while uploading to S3. We can easily devote an entire core (i.e. 100% CPU) to zstd compression. This CPU cost will often pay off, especially if we plan to store this object in S3 for a long period of time. In other cases, a faster compression algorithm like lz4 is sufficient, which will cost about 14% CPU (assuming lz4 proceeds at 700MB/s).

Observation 10.5: Rate limiters are suboptimal

Sometimes rate limiters are floated as a solution to this kind of resource contention. I've never been very impressed by rate limiters, mostly because they have too many parameters that need to be well tuned, and the consequence of mis-tuning one is that some latency-critical workload suffers a performance degradation. The three parameters I have in mind are permit size (how much data each permit allows to be sent), permit rate (how fast permits are added to the token bucket), and bucket size (how many permits can accumulate while the sender is idle).

I think the right approach to this problem is to be parameterless wherever possible. For this reason I prefer to affect how the system handles queued work. For instance, to prevent an S3 upload from causing CPU contention, its threads can be assigned to the SCHED_IDLE CPU scheduling class, thereby ensuring that any latency-critical tasks pre-empt the backup workload.
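
A sketch of the CPU half on Linux (with pid 0, the policy applies to the calling thread, so each upload worker demotes itself):

import os
import threading

def upload_worker():
    # Demote this thread to SCHED_IDLE so latency-critical tasks always preempt it.
    os.sched_setscheduler(0, os.SCHED_IDLE, os.sched_param(0))
    ...  # run the S3 upload loop here

threading.Thread(target=upload_worker, daemon=True).start()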

For network, we can give hints to the networking stack that our traffic should be sent as second-priority traffic, with all other traffic taking precedence. On many Linux machines, this can be achieved simply with setsockopt(fd, SOL_SOCKET, SO_PRIORITY, 1) on the outbound network socket. This allows S3 packets to flow unrestricted when the network is uncontended, but halts sending when the network is contended. So a DB query that wants to download a 1MB resultset can utilize all of a 10Gbit NIC for 800 microseconds, and once it is finished the S3 upload can resume sending.
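
The socket half looks like the sketch below (Linux-only, and a standalone illustration: wiring the option into the sockets that botocore actually opens takes more plumbing than shown here, and the endpoint is just an example):

import socket

# Open a connection to an S3 endpoint and mark its traffic as second priority,
# so it yields to everything else the host wants to send.
sock = socket.create_connection(("s3.us-east-1.amazonaws.com", 443))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, 1)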

Unfortunately this does leave the disk as a contentious resource. In some cases, it is practical to avoid the disk entirely: with LSM database backups, we can back up data as it is written to the filesystem or immediately afterward, while it is likely stored in page cache. But not all workloads will be conducive to this technique.

If it is ultimately necessary to reach the disk, the sad news is that, in most cases, Linux doesn't provide an idle scheduling class for disk requests that actually functions correctly. Luckily, disks today are concurrent, with most SSDs able to efficiently service 16 I/Os simultaneously. Some workloads may not notice another reader, especially if we funnel all of our disk reads through a single thread.

And if all else fails, we can keep track of time spent reading each file, and back off if we’ve spent more than 200μs in the past 1ms reading data. Which is…sigh…a rate limiter.
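
For completeness, a sketch of that reluctant rate limiter, using the 200μs-per-1ms budget from above (i.e. reads may consume at most ~20% of wall-clock time):

import time

BUDGET = 0.0002   # 200 microseconds of reading allowed...
WINDOW = 0.001    # ...per 1 millisecond of wall-clock time

spent = 0.0
window_start = time.monotonic()

def read_throttled(f, size):
    global spent, window_start
    now = time.monotonic()
    if now - window_start >= WINDOW:
        spent, window_start = 0.0, now                 # roll over to a new window
    elif spent >= BUDGET:
        time.sleep(window_start + WINDOW - now)        # budget exhausted: back off
        spent, window_start = 0.0, time.monotonic()
    t0 = time.monotonic()
    data = f.read(size)
    spent += time.monotonic() - t0
    return data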

(Non-)observation 11: caching IP addresses

In some places, I have read that if you want seriously high throughput from S3, you should perform multiple DNS resolutions of the S3 endpoint over time, thereby producing a larger pool of server IP addresses that can answer requests. I haven't ever found this to work.


[1] I once heard that S3 is as scalable as your budget. This is wrong: S3 will scale far beyond your budget.