How we reduced error rates and dropped latencies across merchants’ flows
Shopify merchants trust that when they build their stores on our platform, we’ve got their back. They can focus on their business, while we handle everything else. Any failures or degradations that happen put our promise of a sturdy, battle-tested platform at risk.
To keep that promise, we need to ensure that the platform stays up and stays reliable. Since 2016, Shopify has grown from 375,000 merchants to over 600,000. As of today, our platform makes an average of 450,000 S3 operations per second. However, that rapid growth also came with an increased S3 error rate and increased read and write latencies.
While we use S3 at Shopify, if your application uses any flavor of cloud storage, and its use of cloud storage strongly correlates with the growth of your user base—whether it’s storing user or event data—I’m hoping this post provides some insight into how to optimize your cloud storage!
At Shopify, we use cloud storage for storing merchant-uploaded data. That data includes pictures of merchants’ products, theme assets, and so on. Two of the first things a new merchant might do are:
- Upload elaborately photographed pictures of their bath bombs to their store.
- Install a paid theme off our theme store to add some flavour to the bath bomb game.
Errors along any of those flows damage the merchant’s trust in our platform, and add friction to the merchant’s entrepreneurial path. Let’s look at how we’d make their start as smooth as possible by digging a little deeper into what happens along those flows.
Uploading product images
The merchant uploads assets that are written to our S3 bucket under the namespace /s/2410/1317/products/, where the merchant’s Shop ID (24101317) is embedded in the namespace.
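As a rough illustration of that layout, the sketch below rebuilds the namespace from a Shop ID. The helper name `shop_namespace` is hypothetical, and the rule of splitting the ID into four-digit groups is an assumption inferred from the single example above rather than a documented scheme.

```ruby
# Hypothetical helper, not Shopify's actual code.
# Assumption: the Shop ID is split into groups of up to four digits,
# as the example 24101317 -> 2410/1317 suggests.
def shop_namespace(shop_id, asset_dir)
  groups = shop_id.to_s.scan(/\d{1,4}/)
  "/s/#{groups.join('/')}/#{asset_dir}/"
end

puts shop_namespace(24101317, "products") # prints /s/2410/1317/products/
```

The same helper covers other asset directories by passing a different `asset_dir`.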
Installing a new theme
The theme source files are copied from a shared location to the user’s S3 namespace: /s/2410/1317/files/.
In the background, S3 stores data in partitions based on the shortest common asset prefixes. In the simplest case for the example above, we can imagine starting out with all our assets under a single partition.
An important note here is that each partition has its own rate-limits as well as a limit on the number of assets allowed in a single partition.
As the throughput on a single partition increases gradually over time, the partition “splits” gracefully into multiple partitions. For example, the above partition may split into two partitions, pictured below.
When a partition splits gracefully, everything continues to work as expected. However, when partitions hit their throughput limits abruptly, all operations on that partition fail!
We learned this when we encountered clusters of a mysterious S3 exception: AWS::S3::Errors::SlowDown: Please reduce your request rate. That exception raised by S3 is triggered when resource over-consumption is detected—such as by high request rates.
As the platform grows rapidly, so does the volume of S3 operations it carries out. With that growth, we hit what we began to call SlowDown events more and more frequently.
So to put all that in context: spikes in S3 request rates resulted in a blocking repartitioning, wherein all operations to that partition would raise the SlowDown exception. What this meant was a flurry of action in one shop could result in temporarily failing S3 operations for a whole range of shops!
Once triggered, the failing writes would continue for anywhere from a few minutes to a few hours.
While S3 doesn’t provide metrics for things like the number of partitions within a bucket, or a list of partition keyspaces within a bucket, we did some more black-box investigating on the SlowDown exceptions and came up with two key observations that would guide our solution.
- Once a range of shops encountered SlowDown exceptions, those shops were unlikely to see it again.
- The shops that saw SlowDown exceptions had almost always been created 1-3 months before the SlowDown events.
The first observation likely held because that keyspace had been subpartitioned finely enough after sharply hitting the partition rate limit once. Then again, this was a shaky inference from our exercise in pitch-black-box investigating.
One solution suggested by Amazon S3 was to add randomness to the key name. For example:
examplebucket/photos/232a-2013-26-05-15-00-00/cust1234234/photo1.jpg
examplebucket/photos/7b54-2013-26-05-15-00-00/cust3857422/photo2.jpg
examplebucket/photos/921c-2013-26-05-15-00-00/cust1248473/photo2.jpg
examplebucket/photos/ba65-2013-26-05-15-00-00/cust8474937/photo2.jpg
examplebucket/photos/8761-2013-26-05-15-00-00/cust1248473/photo3.jpg
examplebucket/photos/2e4f-2013-26-05-15-00-01/cust1248473/photo4.jpg
examplebucket/photos/9810-2013-26-05-15-00-01/cust1248473/photo5.jpg
examplebucket/photos/7e34-2013-26-05-15-00-01/cust1248473/photo6.jpg
examplebucket/photos/c34a-2013-26-05-15-00-01/cust1248473/photo7.jpg
This injection of entropy meant that we could dramatically drop the odds of a single shop’s writes hitting a partition’s rate limit, provided all its writes were distributed over many partitions! The corollary here is that triggering a SlowDown event on any one partition would not be enough to significantly degrade a shop’s ability to perform writes to the S3 bucket.
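To make the effect of that entropy concrete, here is a toy model in the style of Amazon’s example keys. It is only a sketch: S3 does not expose how it partitions a bucket, so `toy_partition` simply assumes a partition corresponds to the first 11 characters of a key, and `randomized` is a hypothetical helper that uses MD5 as a convenient source of a four-character hex prefix.

```ruby
require "digest/md5"

# Toy assumption, not real S3 behaviour: a "partition" is identified by
# the first 11 characters of the key ("photos/" plus four more).
PREFIX_LENGTH = 11

def toy_partition(key)
  key[0, PREFIX_LENGTH]
end

# Hypothetical helper: inserts a four-character hex digest after the
# top-level directory, mirroring keys such as photos/232a-2013-...
def randomized(key)
  dir, rest = key.split("/", 2)
  "#{dir}/#{Digest::MD5.hexdigest(key)[0, 4]}-#{rest}"
end

keys = (1..500).map { |i| "photos/2013-26-05-15-00-00/cust#{i}/photo.jpg" }

plain  = keys.map { |k| toy_partition(k) }.uniq.size
spread = keys.map { |k| toy_partition(randomized(k)) }.uniq.size

puts "toy partitions hit without random prefix: #{plain}"
puts "toy partitions hit with random prefix:    #{spread}"
```

Under this model every un-prefixed key shares one leading prefix, while the prefixed keys scatter across many, which is the property the paragraph above relies on.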
Armed with the learnings from our investigation, we flipped a switch and began prepending hash-generated digests to the keys used in new shops’ asset operations. For example, a write to /s/2410/1317/files/theme.scss would become /s/3582/2410/1317/files/theme.scss, since CityHash.hash('2410/1317/files/theme.scss') = '3582...'.
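The key transformation can be sketched roughly as follows. This is an illustration under stated assumptions rather than the production code: `digested_key` is a hypothetical name, MD5 from Ruby’s standard library stands in for CityHash so the example has no gem dependency, and reducing the hash to four decimal digits with a modulo is a guess at how the digest is shortened. The digits it produces will therefore not match the 3582 in the example.

```ruby
require "digest/md5"

# Hypothetical helper, not Shopify's implementation.
# Assumptions: MD5 replaces CityHash, and the digest is the hash value
# modulo 10**digits, zero-padded to a fixed width.
def digested_key(key, digits: 4)
  path   = key.delete_prefix("/s/")
  number = Digest::MD5.hexdigest(path).to_i(16) % (10**digits)
  digest = format("%0#{digits}d", number)
  "/s/#{digest}/#{path}"
end

puts digested_key("/s/2410/1317/files/theme.scss")
```

Because the digest is derived from the key itself, the same asset always maps to the same digested key, so reads can recompute the location without a lookup table.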
After the first SlowDown event that followed the change, the number of SlowDown exceptions dropped away almost completely! And that was after enabling the digests only on new shops.
Not only did we stop seeing SlowDown events, but we also saw a pretty significant drop in latencies for the digested assets.
The median dropped by about 60%.
And the 95th percentile by about 25%.
With a relatively simple change we had made S3 operations across the platform faster, more reliable, and future-proofed. While we use S3 as our datastore, the lessons here are likely to cross over in some fashion with alternative cloud storage providers such as Google Cloud Storage as well, given that they suggest using similar asset naming schemes in their best practices document.