Leveraging Go Worker Pools to Scale Server-side Data Sharing
Maintaining a service that processes over a billion events per day presents many learning opportunities. In an ecommerce landscape, our service needs to be ready to handle global traffic, flash sales, holiday seasons, failovers, and the like. In this post, I’ll detail how I scaled our Server Pixels service to increase its event processing performance by 170%, from 7.75 thousand events per second per pod to 21 thousand events per second per pod.
But First, What is a Server Pixel?
Capturing a Shopify customer’s journey is critical to contributing insights into the marketing efforts of Shopify merchants. Similar to Web Pixels, Server Pixels gives Shopify merchants the ability to activate and understand their customers’ behavioral data by sending this structured data from storefronts to marketing partners. However, unlike the Web Pixel service, Server Pixels sends these events through the server rather than client-side. This server-side data sharing has proven to be more reliable, allowing for better control and observability of outgoing data to our partners’ servers. Merchants benefit from this because they’re able to drive more sales at a lower customer acquisition cost (CAC). With regional opt-out privacy regulations built into our service, only customers who have allowed tracking have their events processed and sent to partners. Key events in a customer’s journey on a storefront are captured, such as checkout completion, search submissions, and product views. Server Pixels is a service written in Go which validates, processes, augments, and, consequently, produces more than one billion customer events per day. However, with the management of such a large number of events, problems of scale start to emerge.
The Problem
Server Pixels leverages Kafka infrastructure to manage the consumption and production of events. Our scale problem began when an increase in customer events triggered an increase in consumption lag for our Kafka input topic. Our service was susceptible to falling behind on events if any downstream component slowed down. Our downstream components process (parse, validate, and augment) events and produce them in batches.
The problem with our original design was that an unlimited number of threads would get spawned when batch events needed to be processed or produced. So when our service received an increase in events, an unsustainable number of goroutines were generated and ran concurrently.
Goroutines can be thought of as lightweight threads: functions or methods that run concurrently with other functions and threads. In a service, spawning an unlimited number of goroutines to execute an ever-growing queue of tasks is never ideal. The machine executing these tasks will continue to expend its resources, like CPU and memory, until it reaches its limit. Furthermore, our team has a service level objective (SLO) of five minutes for event processing, so any delay in processing our data risked blowing past that deadline. In anticipation of three times the usual load for BFCM, we needed a way for our service to work smarter, not harder.
Our solution? Go worker pools.
The Solution
The worker pool pattern is a design in which a fixed number of workers are given a stream of tasks to process in a queue. The tasks stay in the queue until a worker is free to pick up the task and execute it. Worker pools are great for controlling the concurrent execution for a set of defined jobs. As a result of these workers controlling the amount of concurrent goroutines in action, less stress is put on our system’s resources. This design also worked perfectly for scaling up in anticipation of BFCM without relying entirely on vertical or horizontal scaling.
When tasked with this new design, I was surprised at the intuitive setup for worker pools. The premise was creating a Go channel that receives a stream of jobs. You can think of Go channels as pipes that connect concurrent goroutines together, allowing them to communicate with each other. You send values into channels from one goroutine and receive those values into another goroutine. The Go workers retrieve their jobs from a channel as they become available, given the worker isn’t busy processing another job. Concurrently, the results of these jobs are sent to another Go channel or to another part of the pipeline.
So let me take you through the logistics of the code!
The Code
I defined a worker interface that requires a `CompleteJobs` function, which takes a Go channel of type `Job`. The `Job` type takes the event batch that’s integral to completing the task as a parameter. Other types, like `NewProcessorJob`, can extend this struct to fit different use cases of the specific task.

New workers are created using the function `NewWorker`. It takes `workFunc` as a parameter, which processes the jobs. This `workFunc` can be tailored to any use case, so we can use the same `Worker` interface for different components doing different types of work. That’s the core of what makes this design powerful: the `Worker` interface is shared among components performing varying types of tasks based on the `Job` spec.

`CompleteJobs` calls `workFunc` on each `Job` as it receives it from the `jobs` channel.
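Here’s a minimal sketch of those pieces; the `EventBatch` type and the struct fields are simplified stand-ins for our internal types:

```go
package pixels

import "sync"

// Event and EventBatch stand in for the customer events a Job carries.
type Event struct{ Name string }
type EventBatch []Event

// Job carries the event batch that a worker needs to complete the task.
// Other types, like NewProcessorJob, can extend this to fit their use case.
type Job struct {
	Events EventBatch
}

// Worker completes jobs received on a channel until that channel is closed.
type Worker interface {
	CompleteJobs(jobs <-chan Job, wg *sync.WaitGroup)
}

// worker is a concrete implementation; workFunc does the actual work and
// can be tailored to any use case (processing, producing, and so on).
type worker struct {
	workFunc func(Job)
}

// NewWorker returns a Worker that runs workFunc on every Job it receives.
func NewWorker(workFunc func(Job)) Worker {
	return &worker{workFunc: workFunc}
}

// CompleteJobs calls workFunc on each Job as it arrives on the jobs channel,
// and signals the WaitGroup once the channel is closed and drained.
func (w *worker) CompleteJobs(jobs <-chan Job, wg *sync.WaitGroup) {
	defer wg.Done()
	for job := range jobs {
		w.workFunc(job)
	}
}
```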
Now let’s tie it all together.
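Below is a simplified sketch of that wiring, continuing the sketch above; `consume` and `process` stand in for our Kafka consumption and event-processing steps:

```go
// Continuing the sketch above; add "context" to the imports.

// process stands in for our parse/validate/augment step.
func process(events EventBatch) EventBatch { return events }

func runPipeline(ctx context.Context, consume func() EventBatch) {
	const numWorkers = 15

	jobs := make(chan Job)
	outputBatches := make(chan EventBatch)

	// Start a fixed pool of workers, all receiving from the same jobs channel.
	var producerWg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		producerWg.Add(1)
		worker := NewWorker(func(job Job) {
			outputBatches <- process(job.Events)
		})
		go worker.CompleteJobs(jobs, &producerWg)
	}

	// Drain results; in the real service these are produced back to Kafka.
	go func() {
		for range outputBatches {
		}
	}()

	// Consume event batches and emit each one as a Job on the jobs channel.
	for {
		select {
		case <-ctx.Done():
			// Shutdown: close the jobs channel so workers break out of their
			// loops, wait for in-flight jobs, then close the results channel.
			close(jobs)
			producerWg.Wait()
			close(outputBatches)
			return
		default:
			jobs <- Job{Events: consume()}
		}
	}
}
```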
The example above shows how I used workers to process our events in our pipeline. A `jobs` channel and a set number of workers (`numWorkers`) are initialized. The workers are then poised to receive from the `jobs` channel in the `CompleteJobs` function, each in its own goroutine: putting `go` before the `CompleteJobs` call is what makes it run in a goroutine. As event batches are consumed in the for loop, each batch is converted into a `Job` that’s emitted to the `jobs` channel, while `go worker.CompleteJobs(jobs, &producerWg)` runs concurrently and receives those jobs.
But wait, how do the workers know when to stop processing events?
When the system is ready to be scaled down, wait groups are used to ensure that any tasks in flight are completed before the system shuts down. A `WaitGroup` is a type of counter in Go that blocks the execution of a function until its internal counter becomes zero. As the workers were created above, the `WaitGroup` counter was incremented for every worker with `producerWg.Add(1)`. In the `CompleteJobs` function, `wg.Done()` executes when the `jobs` channel is closed and jobs stop being received; it decrements the `WaitGroup` counter for every worker.
When a context cancel signal is received (signified by `<-ctx.Done()` above), the remaining batches are sent to the `jobs` channel so the workers can finish their execution. The `jobs` channel is then closed safely, enabling the workers to break out of the loop in `CompleteJobs` and stop processing jobs. At this point, the `WaitGroup` counters are zero and the `outputBatches` channel, where the results of the jobs are sent, can be closed safely.
The Improvements
Once deployed, the time improvement using the new worker pool design was promising. I conducted load testing that showed as more workers were added, more events could be processed on one pod. As mentioned before, in our previous implementation our service could only handle around 7.75 thousand events per second per pod in production without adding to our consumption lag.
My team initially set the number of workers to 15 each in the processor and producer. This introduced a processing lift of 66% (12.9 thousand events per second per pod). By upping the workers to 50, we increased our event load by 149% from the old design resulting in 19.3 thousand events per second per pod. Currently, with performance improvements we can do 21 thousand events per second per pod. A 170% increase! This was a great win for the team and gave us the foundation to be adequately prepared for BFCM 2021, where we experienced a max of 46 thousand events per second!
Go worker pools are a lightweight solution to speed up computation and make concurrent tasks more performant. This worker pool design has been reused across other components of our service, such as validation, parsing, and augmentation. By using the same `Worker` interface for different components, we can scale out each part of our pipeline differently to meet its use case and expected load.
Cheers to more quiet BFCMs!
Kyra Stephen is a backend software developer on the Event Streaming - Customer Behavior team at Shopify. She is passionate about working on technology that impacts users and makes their lives easier. She also loves playing tennis, obsessing about music and being in the sun as much as possible.
Spin Infrastructure Adventures: Containers, Systemd, and CGroups
The Spin infrastructure team works hard at improving the stability of the system. In February 2022 we moved to Container Optimized OS (COS), the Google-maintained operating system for their Kubernetes Engine SaaS offering. A month later we turned on multi-cluster support to allow for increased scalability as more users came on board. Recently, we’ve dramatically increased the default resources allotted to instances. However, with all these changes we’re still experiencing some issues, and for one of those I wanted to dive a bit deeper and share what we found.
Spin’s Basic Building Blocks
First it’s important to know the basic building blocks of Spin and how these systems interact. The Spin infrastructure is built on top of Kubernetes, using many of the same components that Shopify’s production applications use. Spin instances themselves are implemented via a custom resource controller that we install on the system during creation. Among other things, the controller transforms the `Instance` custom resource into a pod that’s booted from a special `Isospin` container image, along with the configuration supplied during instance creation. Inside the container we use systemd as a process manager and workflow engine to initialize the environment, including installing dotfiles, pulling application source code, and running through bootstrap scripts. Systemd is vital because it enables a structured way to manage system initialization, and Spin uses this heavily.
There’s definitely more to what makes Spin than what I’ve described, but from a high level, and for the purposes of understanding the technical challenges ahead, it’s important to remember that:
- Spin is built on Kubernetes
- Instances are run in a container
- systemd is run INSIDE the container to manage the environment.
First Encounter
In February 2022, we had a serious problem with pod relocations that we eventually tracked to node instability. We had several nodes in our Kubernetes clusters that would randomly fail and require either a reboot or a full replacement. Google had decent automation that caught nodes in a bad state and replaced them automatically, but it was occurring often enough (five nodes per day, or about one percent of all nodes) that users began to notice. Through various discussions with Shopify’s engineering infrastructure support team and Google Cloud support, we eventually homed in on memory consumption as the primary issue. Specifically, nodes were running out of memory and pods were being out-of-memory (OOM) killed as a result. At first, this didn’t seem so suspicious: we gave users the ability to do whatever they want inside their containers and didn’t provide many resources to them (8 to 12 GB of RAM each), so it was a natural assumption that containers were, rightfully, just using too many resources. However, we found some extra information that made us think otherwise.
First, the containers being OOM killed were occasionally the only Spin instance on their node, and when we looked at their memory usage, it was often below the memory limit allotted to them.
Second, in parallel, another engineer investigating a Kafka performance issue identified a healthy running instance using far more resources than should have been possible.
The first issue would eventually be connected to a memory leak that the host node was experiencing, and through some trial and error we found that switching the host OS from Ubuntu to Container Optimized OS from Google solved it. The second issue remained a mystery. With the rollout of COS, though, we saw a 100-fold reduction in OOM kills, which was sufficient for our goals, and we began to direct our attention to other priorities.
Second Encounter
Fast forward a few months to May 2022. We were experiencing better stability, which was a source of relief for the Spin team. Our ATC rotations were significantly less frantic, and the infrastructure team had the chance to roll out important improvements, including multi-cluster support and a whole new snapshotting process. Overall things felt much better.
Slowly but surely over the course of a few weeks, we started to see increased reports of instance instability. We verified that the nodes weren’t leaking memory as before, so it wasn’t a regression. This is when several team members re-discovered the excess memory usage issue we’d seen before, but this time we decided to dive a little further.
We needed a clean environment to do the analysis, so we set up a new spin instance on its own node. During our test, we monitored the resource usage of the pod and of the node it was running on, using `kubectl top pod` and `kubectl top node`. Before we performed any tests, we took a baseline reading of both.
Next, we needed to simulate memory load inside of the container. We opted to use a tool called stress, allowing us to start a process that consumes a specified amount of memory that we could use to exercise the system.
We ran `kubectl exec -it spin-muhc -- bash` to land inside a shell in the container, and then `stress -m 1 --vm-bytes 10G --vm-hang 0` to start the test.
Checking the resource usage again, we saw exactly what we expected: the 10GB used by our stress test showed up in the pod’s metrics. When we checked the cgroup assigned to the stress process (PID 24899), it was correctly assigned to the Kubernetes pod’s hierarchy. This looked great as well. Next, we performed the same test, but in the instance environment accessed via `spin shell`, and checked the resource usage once more.

Now this was odd. The memory created by stress wasn’t showing up under the pod stats (still only 14Mi), but it was showing up for the node (33504Mi). Checking the usage from inside the container, we saw that it was indeed holding onto memory as expected. However, when we checked the cgroup this time, we saw something new: the process wasn’t under the kubepods hierarchy at all.

What the heck!? Why was the cgroup different? We double-checked that this was the correct hierarchy by using the systemd cgroup list tool from within the spin instance.
So to summarize what we had seen:
- When we run processes inside the container via `kubectl exec`, they’re correctly placed within the kubepods cgroup hierarchy. This is the hierarchy that contains the pod’s memory limits.
- When we run the same processes inside the container via `spin shell`, they’re placed within a cgroup hierarchy that doesn’t contain the limits. We verified this by reading the memory limit from the cgroup’s own files.

The limit we found there was close to the maximum value of a 64-bit integer (about 8.5 billion gigabytes of memory). Needless to say, our system has less than that, so this is effectively unlimited.
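As an aside, a process can run a similar check on itself by reading its cgroup membership and its effective memory limit. Here’s a minimal Go sketch, assuming the cgroup filesystem is mounted at /sys/fs/cgroup; it probes both the v1 and v2 layouts:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Which cgroup is this process in? cgroup v1 prints one line per
	// controller; cgroup v2 prints a single "0::/path" line.
	self, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		panic(err)
	}
	fmt.Print(string(self))

	// Print the effective memory limit. A value near 2^63 (or the literal
	// "max" on cgroup v2) means the limit is effectively unlimited.
	for _, path := range []string{
		"/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
		"/sys/fs/cgroup/memory.max",                   // cgroup v2
	} {
		if limit, err := os.ReadFile(path); err == nil {
			fmt.Printf("%s = %s\n", path, strings.TrimSpace(string(limit)))
		}
	}
}
```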
For practical purposes, this means any resource limitation we put on the pod that runs a Spin instance isn’t being honored. Spin instances can use more memory than they’re allotted, which is concerning for a few reasons, but most importantly because we depend on those limits to keep instances from interfering with one another.
Isolating It
In a complex environment like Spin, it’s hard to account for everything that might be affecting the system. Sometimes it’s best to distill problems down to the essential details to properly isolate the issue. We were able to reproduce the cgroup leak in a few different ways: first, directly on real Spin instances using `crictl` or `ctr` with custom arguments, and second, in a local Docker environment. Setting up an experiment like this also allowed for much quicker iteration when testing potential fixes.
From the experiments we discovered differences in how the runtimes (containerd, Docker, and Podman) execute systemd containers. Podman, for instance, has a `--systemd` flag that enables and disables an integration with the host systemd. containerd has a similar flag, `--runc-systemd-cgroup`, that starts runc with the systemd cgroup manager. For Docker, however, no such integration exists (you can modify the cgroup manager via daemon.json, but not via the CLI like Podman and containerd), and we saw the same cgroup leakage. Comparing the cgroups assigned to the container processes under Docker and Podman made the difference clear: Podman placed the systemd and stress processes in a cgroup unique to the container, while under Docker they leaked into an unconfined hierarchy. This allowed Podman to properly delegate the resource limitations to both systemd and any process that systemd spawns. This was the behavior we were looking for!
The Fix
We now had an example of a systemd container being properly isolated from the host with Podman. The trouble was that our Spin production environments use Kubernetes, which uses containerd, not Podman, as the container runtime. So how could we leverage what we learned from Podman toward a solution?
While investigating differences between Podman and Docker with respect to systemd, we came across the crux of the fix. By default, Docker and containerd use a cgroup driver called cgroupfs to manage the allocation of resources, while Podman uses the systemd driver (this is specific to our host operating system, COS from Google). The systemd driver delegates responsibility for cgroup management to the host systemd, which then properly manages the delegate systemd running in the container.

It’s recommended that nodes running systemd on the host use the systemd cgroup driver by default; however, COS from Google is still set to use cgroupfs. Checking the developer release notes, we see that version 101 of COS mentions switching the default cgroup driver to systemd, so the fix is coming!
What’s Next
Debugging this issue was an enlightening experience. If you had asked us before, “Is it possible for a container to use more resources than it’s assigned?”, we would have said no. But now that we understand more about how containers deliver the sandbox they provide, it’s become clear the answer should have been, “It depends.”
Ultimately, the escaped limits came from us bind-mounting /sys/fs/cgroup read-only into the container. A subtle side effect of this is that while that directory isn’t writable, all of its subdirectories are. But since this mount is required for systemd to even boot up, we don’t have the option to remove it. There’s a lot of ongoing work by the container community to get systemd to exist peacefully within containers, but for now we’ll have to make do.
Acknowledgements
Special thanks to Daniel Walsh from Red Hat for writing so much on the topic, and to Josh Heinrichs from the Spin team for investigating the issue and discovering the fix.
Additional Information
- Control Group APIs and Delegation
- How to run systemd in a container
- Container-Optimized OS Release Notes: DEV
Chris is an infrastructure engineer with a focus on developer platforms. He’s also a member of the ServiceMeshCon program committee and a @Linkerd ambassador.
Möbius: Shopify’s Unified Edge
While working on improvements to how we handle traffic on the Shopify platform, we noticed that each change became more and more challenging over time. When deploying a feature, if we wanted the whole platform to benefit from it, we couldn’t simply build it “one way fits all”: we already had more than six different ways for the traffic to reach us. To list some, we had a traffic path for:
- “general population” (GenPop) Shopify Core: the monolith serving shops that’s already using an edge provider
- GenPop Shopify applications and services
- GenPop Shopify applications and services that’s using an edge provider
- Personally identifiable information (PII) restricted Shopify Core: where we need to make sure that the traffic and related data are kept in a specific area of the world
- PII restricted Shopify applications and services
- publicly-accessible APIs that require Mutual Transport Layer Security (mTLS) authentication.
We had to choose which traffic paths could use those new features, or else build the same feature in more than six different ways. Moreover, many different traffic paths mean that observability gets blurred. When we receive requests and something doesn’t work properly, figuring out why and how to fix it requires more time. It also takes longer to onboard new team members on all of those possibilities and how to distinguish between them.
This isn’t the way to build a highway to new features and improvements. I’d like to tell you why and how we built the one edge to front our services and systems.
One Does Not Simply Know What “The Edge” Stands For
The most straightforward definition of the edge, or network edge, is the point at which an enterprise-owned network connects to a third-party network. With cloud computing, lines are slightly blurred as we use third parties to provide us with servers and networks (even more when using a provider to front your network, like we do at Shopify). But in both those cases, as long as they’re used and controlled by Shopify, they’re considered part of our network.
The edge of Shopify is where requests from outside our network are made to reach our network.
The Fellowship of the Edge
Unifying our edge became our next objective, and two projects were born to make this possible: Möbius, which, as the name taken from the “Möbius strip” suggests, was to be the one edge of Shopify; and Shopify Front End (SFE), the routing layer that receives traffic from Möbius and dispatches it to where it needs to go.
About a year before starting Möbius, we already had a small number of applications handled through our edge, but we saw limits to properly automating such an approach at scale, even as the gains to the platform justified the monetary costs of reaching them. We designed SFE and Möbius together, leading to a better separation of concerns between the edge and the routing layers.
The Shopify Front End
SFE is designed to provide a unified routing layer behind Möbius. Deployed in many different regions, routing clusters can receive any kind of web traffic from Möbius, whether for Shopify Core or Applications. Those clusters are mainly nginx deployments with custom Lua code to handle the routing according to a number of criteria, including but not limited to the IP address a client connected to and the domain that was used to reach Shopify. For the PII restricted requirements, parallel deployments of the same routing clusters code are deployed in the relevant regions.
To handle traffic for applications and services, SFE uses a centralized API that receives requests from Kubernetes controllers deployed in every cluster running such applications and services. This allows linking the domain names declared by an application to the clusters where the application is deployed. We also use this to provide active/active (two instances of a given service can receive requests at the same time) or active/passive (only a single instance of a given service receives requests) load balancing.
Providing load balancing at the routing layer instead of DNS allows for near instantaneous traffic changes instead of depending on the Time to Live as described in my previous post. It avoids those decisions being made on the client side and thus provides us with better command and control over the traffic.
Möbius
Möbius’s core concerns are simple: we grab the traffic from outside of Shopify and make sure it makes its way inside of Shopify in a stable, secure, and performant manner. Outside of Shopify is any client connecting to Shopify from outside a Shopify cluster. Inside of Shopify is, as far as Möbius is concerned, the routing cluster with the lowest latency to the receiving edge’s point-of-presence (PoP).
Möbius is responsible for TLS and TCP termination with clients, performing that termination as close as possible to the client. This brings faster requests and better DDoS protection, and it allows us to filter malicious requests before the traffic even reaches our clusters. This was already done for our GenPop Shopify Core traffic, but Möbius now standardizes it. On top of handling the certificates for the shops, we added an automated path to handle certificates for application domains.
SFE already needs to be aware of the domains that the applications respond to, so instead of building the same logic a second time to configure the edge, we piggybacked on the work the SFE controller was already doing. We added handlers in the centralized API to configure those domains at the edge through API requests to our vendor, indicating that we expect to receive traffic on them and that requests should be forwarded to SFE. Our API handler takes care of any DNS challenge required to validate that we own the domain so traffic can start flowing, and also obtains a valid certificate.
Prior to Möbius, if an application owner wanted to take advantage of the edge, they had to configure their domain manually at the edge (validating ownership, obtaining a certificate, setting up the routing). Möbius provides full automation of that setup, allowing application owners to simply configure their ingress and DNS and reap the benefits of the edge right away.
Finally, it’s never easy to have many systems migrate to use a new one. We aimed to make that change as easy as possible for application owners. With automation deploying all that was required, the last required step was a simple DNS change for applications domains, from targeting a direct-to-cluster record to targeting Möbius. We wanted to keep that change manual to make sure that application owners own the process and make sure that nothing gets broken.
To make sure all is fine for an application before (and after!) migration, we also added observability in the form of easy:

- access to the logs for a given application at the edge
- identification of which domains an application has configured at the edge
- understanding of the status of those domains.
This allows owners of applications and services to immediately identify if one of their domains isn’t configured or behaving as expected.
Our Precious Edge
On top of all the direct benefits that Möbius provides right away, it allows us to build the future of Shopify’s edge. Different teams are already working on improvements to the way we do caching at the edge, for instance, or on ways to use other edge features that we’re not already taking advantage of. We also have ongoing projects to handle cluster-to-cluster communications by avoiding the traffic from going through the edge and coming back to our clusters by taking advantage of SFE.
Using new edge features and standardizing internal communications is possible because we unified the edge. There are exceptions, however: we need to avoid cross-dependencies for the applications and services that Möbius or SFE themselves depend on to function. If we onboarded those to Möbius and SFE, any issue would put us in a crash loop: Möbius/SFE requires that application to work, but that application requires Möbius/SFE to work.
It’s now way easier to explain to new Shopifolk how traffic reaches us and what happens between a client and Shopify. There’s no need for as many conditionals in those explanations, nor as many whiteboards… but we might need more of those to explain all that we do as we grow the capabilities on our now-unified edge!
Raphaël Beamonte holds a Ph.D. in Computer Engineering in systems performance analysis and tracing, and sometimes gives lectures to future engineers, at Polytechnique Montréal, about Distributed Systems and Cloud Computing.
Using Terraform to Manage Infrastructure
Large applications are often a mix of code your team has written and third-party applications your team needs to manage. These third-party applications could be things like AWS or Docker. In my team’s case, it’s Twilio TaskRouter.
The configuration of these services may not change as often as your app code does, but when it does, the process is fraught with the potential for errors. This is because there’s no way to write tests for the changes or easily roll them back: things we depend on as developers when shipping our application code.
Using Terraform improves your infrastructure management by allowing users to implement engineering best practices in what would otherwise be a GUI with no accountability, tests, or revision history.
On the Conversations team, we recently implemented Terraform to manage a piece of our infrastructure to great success. Let’s take a deeper look at why we did it, and how.
My team builds Shopify’s contact center. When a merchant or partner interacts with an agent, they are likely going through a tool we’ve built. Our app suite contains applications we’ve built in-house and third-party tools. One of these tools is Twilio TaskRouter.
TaskRouter is a multi-channel skill-based task routing API. It handles creating tasks (voice, chat, etc.) and routing them to the most appropriate agent, based on a set of routing rules and agent skills that we configure.
As our business grows and becomes more complex, we often need to make changes to how merchants are routed to the appropriate agent.
Someone needs to go into our Twilio console and use the graphical user interface (GUI) to update the configuration. This process is fairly straightforward and works well for getting off the ground quickly. However, the complexity quickly becomes too much for one person to understand in its entirety.
In addition, the GUI doesn’t provide a clear history of changes or a way to roll them back.
As developers, we are used to viewing a commit history, reading PR descriptions and tests to understand why changes happened, and rolling back changes that are not working as expected. When working with Twilio TaskRouter, we had none of these.
Using Terraform to Configure Infrastructure
Terraform is an open source tool for configuring infrastructure as code.
It is a state machine for infrastructure that gives teams all the benefits of engineering best practices listed above to infrastructure that was previously only manageable via a GUI.
Terraform requires three things to work:
- A reliable API. When using Terraform, we stop using the GUI and rely on Terraform to make our changes for us via the API. Anything you can’t change with the API, you won’t be able to manage with Terraform.
- A Go client library. Terraform is written in Go and requires a client library for the API you’re targeting written in Go. The client library makes HTTP(S) calls to your target app.
- A Terraform provider. The core Terraform software uses a provider to interact with the target API. Providers are written in Go using the Terraform Plugin SDK.
With these three pieces, you can manage just about any application with Terraform!
A Terraform provider adds a set of resources Terraform can manage. Providers are not part of Terraform’s code. They are created separately to manage a specific application. Twilio did not have a provider when we started this project, so we made our own.
Since launching this project, Twilio has developed its own Terraform provider.
At its core, a provider enables Terraform to perform CRUD operations on a set of resources. Armed with a provider, Terraform can manage the state of the application.
Creating a Provider
Note: If you are interested in setting up Terraform for a service that already has a provider, you can skip to the next section.
A Terraform provider is structured as a small Go project. Ours contains our Go dependencies, a Makefile for running commands, an example file for local development, and a directory called twilio. This is where our provider lives.
A provider must contain a resource file for every type of resource you want to manage. Each resource file contains a set of CRUD instructions for Terraform to follow: you’re basically telling Terraform how to manage this resource.
Here is the function defining what an `activity` resource is in our provider.
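A simplified sketch of that function, written against the Terraform Plugin SDK (the exact schema in our provider has more detail, and the CRUD functions are defined in the same file):

```go
package twilio

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceTwilioActivity tells Terraform how to manage a Task Router activity.
func resourceTwilioActivity() *schema.Resource {
	return &schema.Resource{
		// CRUD instructions: each operation maps to a function in this file.
		CreateContext: resourceTwilioActivityCreate,
		ReadContext:   resourceTwilioActivityRead,
		UpdateContext: resourceTwilioActivityUpdate,
		DeleteContext: resourceTwilioActivityDelete,

		// Importer lets Terraform adopt infrastructure that already exists.
		Importer: &schema.ResourceImporter{
			StateContext: schema.ImportStatePassthroughContext,
		},

		// The parameters the API exposes for CRUD operations.
		Schema: map[string]*schema.Schema{
			"workspace_sid": {Type: schema.TypeString, Required: true, ForceNew: true},
			"friendly_name": {Type: schema.TypeString, Required: true},
			"available":     {Type: schema.TypeBool, Optional: true},
		},
	}
}
```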
Note: Go is a strongly typed language, so the syntax might look unusual if you’re not familiar with it. Luckily you do not need to be a Go expert to write your own provider!
This file defines what Terraform needs to do to create, read, update and destroy activities in Task Router. Each of these operations is defined by a function in the same file.
The file also defines an `Importer` function, a special type of function that allows Terraform to import existing infrastructure. This is very handy if you already have infrastructure running and want to start using Terraform to manage it. Finally, the function defines a schema: these are the parameters provided by the API for performing CRUD operations. In the case of Task Router activities, the parameters are `friendly_name`, `available`, and `workspace_sid`.
To round out the example, let’s look at the create function we wrote.
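In sketch form, it looks something like this (I’m using the official twilio-go client here for illustration; our provider used its own client library, so the helper names differ):

```go
package twilio

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
	twilio "github.com/twilio/twilio-go"
	taskrouter "github.com/twilio/twilio-go/rest/taskrouter/v1"
)

func resourceTwilioActivityCreate(ctx context.Context, d *schema.ResourceData, m interface{}) diag.Diagnostics {
	// The provider's configure step stashes an API client in the meta argument.
	client := m.(*twilio.RestClient)

	// Activities all exist under a single workspace, so find it first.
	workspaceSid := d.Get("workspace_sid").(string)

	// Format the parameters defined in the resource schema.
	params := &taskrouter.CreateActivityParams{}
	params.SetFriendlyName(d.Get("friendly_name").(string))
	params.SetAvailable(d.Get("available").(bool))

	activity, err := client.TaskrouterV1.CreateActivity(workspaceSid, params)
	if err != nil {
		return diag.FromErr(err)
	}

	// Record the sid so Terraform can track and later change the resource.
	d.SetId(*activity.Sid)
	return resourceTwilioActivityRead(ctx, d, m)
}
```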
Note: Most of this code is boilerplate Terraform provider code which you can find in their docs.
The function is passed a context, a schema resource, and an empty interface. We instantiate the Twilio API client and find our workspace (Task Router activities all exist under a single workspace). Then we format our parameters (defined in our schema in the `resourceTwilioActivity` function) and pass them into the create method provided by our API client library. Because this function creates a new resource, we set the id (`SetId`) to the `sid` of the result of our API call. In Twilio, a sid is a unique identifier for a resource. Now Terraform is aware of the newly created resource and its unique identifier, which means it can make changes to the resource.
Using Terraform
Once you have created your provider or are managing an app that already has a provider, you’re ready to start using Terraform.
Terraform uses a DSL for managing resources. The good news is that this DSL is more straightforward than the Go code that powers the provider.
The DSL is simple enough that with some instruction, non-developers should be able to make changes to your infrastructure safely–but more on that later.
Here is the code for defining a new Task Router activity.
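As a sketch, such a block might look like the following; the attribute names follow the schema described earlier, while the resource type and workspace resource names are illustrative:

```hcl
resource "twilio_activity" "offline" {
  friendly_name = "Offline"
  available     = false
  workspace_sid = twilio_workspace.main.sid

  depends_on = [twilio_workspace.main]
}
```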
Yup, that’s it!
We create a block declaring the resource type and what we want to call it. In that block, we pass the variables defined in the `Schema` block of our `resourceTwilioActivity`, and any resources it depends on. In this case, activities need to exist within a workspace, so we pass the workspace resource in the `depends_on` array. Terraform knows it needs this resource to exist, or to create it, before attempting to create the activity.
Now that you have defined your resource, you’re ready to start seeing the benefits of Terraform.
Terraform has a few commands, but plan and apply are the most common. Plan prints out a text-based representation of the changes you’re about to make.
Terraform makes visualizing the changes to your infrastructure very easy. At this planning step you may uncover unintended changes: if there was already an offline activity, the plan step would show you an update instead of a create. In that case, all you need to do is change your resource block’s name and run `terraform plan` again.
When you are satisfied with your changes, run `terraform apply` to make the changes to your infrastructure. Terraform now knows about the newly created resource and its generated id, allowing you to manage it exclusively through Terraform moving forward.
To get the full benefit of Terraform (PRs, reviews, and so on), we use an additional tool called Atlantis to manage our GitHub integration. This allows people to make pull requests with changes to resource files and have Atlantis add a comment to the PR with the output of `terraform plan`. Once the review process is done, we comment `atlantis apply -p terraform` to make the change. Then the PR is merged.
We have come a long way from managing our infrastructure with a GUI in a web app! We have a Terraform provider communicating via a Go API client to manage our infrastructure as code. With Atlantis plugged into our team’s GitHub, we now have many of the best practices we rely on when writing software–reviewable PRs that are easy to understand and roll back if necessary, with a clear history that can be scanned with a git blame.
How was Terraform Received by Other Teams?
The most rewarding part of this project was how it was received by other teams. Instead of business and support teams making requests and waiting for developers to change Twilio workflows, Terraform empowered them to do it themselves. In fact, some people’s first PRs were changes to our Terraform infrastructure!
Along with freeing up developer time and making the business teams more independent, Terraform provides visibility into infrastructure changes over time. It shows the impact of each change, and searching GitHub for previous changes makes it easy to understand the history of what our teams have done.
Building great tools will often require maintaining third-party infrastructure. In my team’s case, this means managing Twilio TaskRouter to route tasks to support agents properly.
As the needs of your team grow, the way you configure your infrastructure will likely change as well. Tracking these changes and being confident in making them is very important but can be difficult.
Terraform makes these changes more predictable and empowers developers and non-developers alike to use software engineering best practices when making these changes.
Jeremy Cobb is a developer at Shopify. He is passionate about solving problems with code and improving his serve on the tennis court.
How We Fixed the Dependency Confusion Vulnerability in Over 600 Ruby Applications
Shopify has grown significantly over the years, and our success makes us an attractive target for malicious actors. We take the safety of our merchants seriously, so we have a good reason to continuously improve the security at Shopify.
I’ll share how the Ruby Conventions team, which focuses on creating conventions to make Ruby services sustainable, used an iterative approach to solve complex problems at scale while responding to shifting circumstances. In particular, how we solved the dependency confusion vulnerability in over 600 Ruby applications, developed tooling that allows us to do large-scale migration with ease, and made the Ruby community a bit safer.
Understanding the Dependency Confusion Problem
Shopify runs a bug bounty program where we pay people to find vulnerabilities on our platform and learn what we have to improve on. One such report showed that we were vulnerable to a dependency confusion vulnerability that could give an attacker access to our local, continuous integration/continuous deployment (CI/CD), and production environments.
The vulnerability leverages the ambiguity of a package source to install malicious dependencies. If an external package is created with a higher version number under the same name as an internal Shopify package, the external dependency is resolved instead of the internal dependency.
In Ruby, developers use Bundler to manage their dependencies and make their environments reproducible. Bundler resolves dependencies so that you use the correct versions and sources for each gem. The Bundler team fixed the issue by introducing a new `Gemfile.lock` file format that’s created by a fresh install or an update. The new format assigns each gem to an explicit source:
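For illustration, a lockfile in the new format groups gems under the source they were resolved from, something like this (the gem names and the internal source URL here are made up):

```
GEM
  remote: https://rubygems.org/
  specs:
    rake (13.0.6)

GEM
  remote: https://gems.internal.example.com/
  specs:
    internal-tooling (1.2.3)
```

With each gem pinned to an explicit source, a public gem published under the same name as an internal one can no longer be resolved in its place.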
However, at the time, adopting the new format required an upgrade, meaning Bundler updated all dependencies in the lockfile; that would have required vetting each update and testing the application for regressions in behavior.
Identifying the Impact
We didn’t know how many applications were susceptible to the dependency confusion vulnerability, which made it hard to assess the impact of the problem. Our first step was to disambiguate the situation so we could understand the problem better.
Disambiguating unknowns doesn’t need to be fancy, and it’s better to have some insight than none. In our case, we defined a cron job in our CI system to get the Bundler version information from all repositories into our data lake. It turned out that around 600 Ruby applications were susceptible to the dependency confusion vulnerability.
Having that data also allowed us to create a metric of outstanding migrations and measure progress towards solving our problem. It’s also a great way of detaching the solution from the goal, which is less constraining.
Changing Assumptions Through Experimentation
As developers, our solutions have to take quite a few constraints into account. When developing software iteratively, we try to change some of those constraints and reevaluate our solution quickly. Making those changes as soon as possible surfaces unknowns early, increasing the likelihood of a successful project.
In our case, having over 600 repositories to migrate meant that manually migrating every application would be too time-consuming. Requiring teams to do it themselves would be tedious and error-prone, because the `Gemfile.lock` file couldn’t be automatically updated while keeping the current gem versions. In that case, developers would need to modify the lockfile to revert the version updates to prevent regressions from being introduced. If we were able to update a `Gemfile.lock` to the new format without updating dependencies, we could automate rolling this upgrade out to all Ruby applications at Shopify, relying on application owners only to deploy the changes.
We experimented with building a Bundler plugin (a gem that extends Bundler’s functionality) to automate the upgrade. It updated the `Gemfile.lock` file to the new format without updating dependencies. The plugin boiled down to:

- Initializing the specification for a given `Gemfile.lock` file, which contains information about the gems such as the name, the version, and the remote.
- Updating the `Gemfile.lock` file to the new lockfile format, which updates all gems in the process. We minimize updates by only permitting patch version updates.
- Replacing the versions in the updated `Gemfile.lock` file with the gem versions from the old `Gemfile.lock` file.
This approach wasn’t a perfect solution, but it worked well enough to run Bundler migrations. It allowed us to proceed to the next problem area of migrating large numbers of applications.
Running Migrations at Scale
One of the biggest challenges in running large-scale migrations is handling edge cases. Rather than exploring how migrations can go wrong beforehand, it’s more effective to migrate a handful of applications and discover the actual problems. The other benefit is that we can identify and migrate the subset of applications with issues that have known solutions while resolving the edge cases at the same time. This approach allows us to constantly deliver on our goals and put ourselves in a better spot each day.
Armed with a Bundler plugin that migrated the lockfile without dependency updates, we could start migrating applications. We began by running the plugin on a handful of applications that weren’t merchant-facing. This went smoothly, so we decided to run it on a larger batch of non-critical repositories. In those larger batches, however, we noticed issues arising from inconsistent build setups, Ruby versions, and other configurations.
Some of our tooling didn’t support the latest Bundler version, and we had to work with our deployment, CI, and local environments teams to update them. Our collaborations were particularly fruitful when we:
- investigated the issue first
- tried to solve it
- shared the context with the team.
Most people want to help and making it easy for them benefits everyone.
Some of our Docker images are built with Heroku’s Ruby buildpack that didn’t support the required Bundler version. This situation rendered a percentage of applications unable to migrate. To solve this issue, we worked with the Heroku Buildpack team to adopt the latest Bundler version. They released a new version with the bundler update, making it broadly available in the Ruby community.
Another critical element was raising awareness with project owners and setting a deadline to deprecate the old Bundler version. Being upfront with owners and communicating the impact of the change allowed teams to prioritize and work with us to update their projects.
The Bundler migration plugin was run locally, but scalability issues arose. It became too complicated to manage different Ruby versions, parallelize them, and address failures. Instead of wasting time on building a solution that would have solved all eventualities at the start, we used the migration plugin to its breaking point, investigated the problem areas, and implemented improvements.
As a response to our scaling issues, we built a command-line interface (CLI) tool on top of our CI system to set up the right environment for a repository, run commands on it, and open a pull request (PR) based on the changes made. Having an environment per repository worked great because we didn’t run into misconfiguration problems anymore. Using our CI system also allowed us to parallelize the execution, which in turn, sped the process up. Furthermore, migration failures were easier to recover and track.
Preventing Future Problems
Part of iteratively solving a problem means focusing on current problems rather than future concerns. However, it doesn’t mean ignoring future concerns altogether. It’s important to distinguish between critical concerns and ones that can be figured out later on.
One example was preventing a `Gemfile.lock` file from regressing to its previous format, which would make us vulnerable again. We were aware of the possibility of regressions, but we also knew that we could build tooling to solve this issue. Instead of investing time in tackling the problem upfront, we decided to wait and start working on it once we had migrated most applications. This approach also allowed us to gauge the magnitude of the problem rather than wasting resources working out hypotheticals.
We encountered a handful of regressions during our migration and were a bit concerned. We investigated each manually to see if there were bigger problems present. Since we didn't find anything suggesting deeper problems, we carried on and continued monitoring knowing that if we ran into more regressions, we had more information to change course and face the new reality.
We investigated the lockfile regression problem and shared what we learned with the Bundler team. They enhanced the tool to prevent these cases from occurring in the future. We didn’t need to implement special tooling to prevent regressions (which saved us a lot of work and time); we only had to make sure that all applications were using the correct Bundler version.
Most of our applications had been migrated with a Bundler version that didn’t yet prevent regressions, because we staggered the migration to make continuous progress. But since we had battle-tested our migration tooling and resolved most configuration issues, we were able to migrate all of our applications to the latest Bundler version in less than a day.
Rather than waiting for the perfect solution, making iterative changes improved our tooling to the point where changes that used to be hard became easy. This de-risked the deployment.
To prevent the installation of malicious gems, we made changes to our local environment tooling to ensure it always defaults to the recommended Bundler version. This ensures that an individual developer machine isn’t susceptible to running malicious code from the dependency confusion vulnerability. We also started failing CI whenever it encountered an out-of-date Bundler version, ensuring that any code change that could introduce the dependency confusion vulnerability wouldn’t be merged. Since most of our other automated processes require CI to execute, we rely on CI to catch vulnerable Bundler versions.
Sharing What We've Done with the Community
We love open source at Shopify, and we like giving back to the community. When contributing, it is quite valuable to share the purpose as well as the solution. It leads to insightful conversations that result in a better solution. Often, contributions aren’t solely PRs. Providing context on investigative work, bringing problems to someone's attention, or testing another contributor’s prototypes are just as valuable.
Our plugin worked pretty well for us, so we created a proposal in Bundler to fix the issue for the Ruby community. These changes would allow Bundler to update the `Gemfile.lock` file without upgrading gems in the process. Our proposal didn’t make it in, but it led to a conversation resulting in an alternative approach that shipped in Bundler 2.2.21. We helped test that approach on our applications to catch as many edge cases as possible and minimize the potential burden on the community.
We also ran into issues where developers using an insecure version of Bundler could accidentally revert to the old lockfile format. The problem was that the latest Bundler version (at the time) still resolved the old `Gemfile.lock` file on `bundle install`, which made it very easy to regress to the old format. We created a prototype to prevent that from happening, which sparked another conversation with the maintainers of Bundler and brought the issue to their attention. They released version 2.2.22 of Bundler, which prevents these regressions and makes everybody in the community more secure.
We set out to fix the dependency confusion vulnerability in every Ruby project at Shopify and succeeded. This wouldn’t have been possible if we hadn’t followed an iterative approach that allowed us to make steady progress while taking shifting circumstances into account. We developed tooling that allows us to do large-scale migration, which has come in handy for other uses. We also aggregated Bundler version data on our Ruby projects to track adoption and make future decision-making easier. Lastly, we have worked closely with the Bundler team to improve the base functionality while leveraging Shopify’s scale to find edge cases, fix bugs, improve Bundler, and make it better for everyone in the Ruby community.
Frederik is a production engineer at Shopify and part of the Ruby & Rails infrastructure team. He contributed to massively scaling Shopify’s CI/CD system and making Ruby services more secure across Shopify and the Ruby community.
That Old Certificate Expired and Started an Outage. This is What Happened Next
In distributed systems, there are plenty of occasions for things to go wrong, which is why resiliency and redundancy are important. But no matter the systems you put in place, and no matter whether you did or didn’t touch your deployments, issues might arise. That makes it critical to acknowledge the near misses: the situations where something could have gone wrong, and the situations where something did, but could have been worse. When was the last time it happened to you? For us at Shopify, it was on September 30th, 2021, when the expiration of Let’s Encrypt’s (old) root certificate almost led to a global outage of our platform.
In April 2021, Let’s Encrypt announced that its former root certificate was expiring. We have used Let’s Encrypt as our public certificate provider since we became a sponsor in 2016, so we made sure that Shopify’s edge infrastructure was up to date with the different requirements, so we wouldn’t stop serving traffic to all of (y)our beloved shops. As always, Let’s Encrypt did their due diligence with communications, and by providing a cross-signing of their new root certificate by the old one. This means that while clients didn’t yet trust the new root certificate directly, the new certificate was signed by the old one they did trust, so that trust transferred to the new one. Also, the period between the announcement and the expiration was long enough for any Let’s Encrypt-emitted certificate, which expires after three months, to be signed by the new cross-signed root certificate and be considered valid using either the old or the new root certificate. We didn’t expect anything bad to happen on September 30th, 2021, when the root certificate was set to expire at 10:00 a.m. Eastern Standard Time.
At 10:15 a.m. that same day, our monitors started complaining about certificate errors, not at Shopify’s edge (that is, between the public and Shopify) but between our services. As a member of the traffic team of Shopify, which handles a number of matters related to bringing traffic safely and reliably into Shopify’s platform (including the provisioning and handling of certificates), I joined the incident response to try and help figure out what was happening. Our first response was to lock the deployments of the Shopify monolith (using spy, our chatops) while some of us connected to containers to try and figure out what was happening in there. In the meantime, we started looking at the deployments that were happening when the issue started manifesting. It didn’t make any sense, as those changes had nothing to do with the way services interconnect, nor with certificates. This is when the Let’s Encrypt root certificate expiry started clicking in our minds. An incident showing certificate validity errors right after the expiry date couldn’t be a coincidence, but we couldn’t reproduce the error in our browsers or even using `curl`. Using `openssl`, we could, however, observe the certificate expiry for the old root certificate.
The error was related to the client being used for those connections. And we saw those errors appearing in multiple services across Shopify using different configurations and libraries. For a number of those services, the errors were bubbling up from the internally-built library allowing services to check people’s authentication to Shopify. While Faraday
is the library we generally use for HTTP connections, our internal library has dependencies on rack-oauth2
and openid_connect
. Looking at the dependency chains for both applications, we saw the following:
Both rack-oauth2
(directy) and openid_connect
(indirectly) depend on httpclient
, which, according to the GitHub repository of the library, “gives something like the functionality of libwww-perl (LWP) in Ruby.”
From other service errors, we identified that the google-api-client also was in error. Using the same process, we pinpointed the same library as a dependency:
And so we took a closer look at httpclient and...
Code snippet from httpclient/nahi
Uh-oh, that doesn’t look good. httpclient is heavily used, whether directly or through indirect exposure via dependency chains. Like web browsers, httpclient embeds a version of the root certificate store. The main difference is that in this case, the store in the library was six years old (!!), while reference root certificate stores are generally updated every few months. So even with Let’s Encrypt’s due diligence, a stale client store that didn’t trust the new root certificate directly, and no longer trusted the expired old one, was sufficient to cause internal issues.
Our emergency fix was simple. We forked the Git repository, created a branch that overrode cacert.pem with the most recent root certificate bundle, and started using that branch in our deployments to make things work. After confirming the fix was working as expected and deploying it in our canaries, the problem was solved for the monolith. Then automation helped create pull requests for all our affected repositories.
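For a Ruby service, pointing the dependency at such a fork is a one-line Bundler change. A hypothetical Gemfile entry (the fork URL and branch name are illustrative, not the actual ones used):

```ruby
# Gemfile: use the forked httpclient whose cacert.pem carries the current root bundle
gem "httpclient", git: "https://github.com/example-org/httpclient.git",
                  branch: "update-root-certificates"
```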
Overriding cacert.pem with a more recent bundle is a temporary fix, but following a solve-fast approach, it was the one we knew would work automatically for all our deployments without any other changes. To support this fix and make sure a similar issue doesn’t happen soon, we put systems in place to track changes to the root certificates and automatically update them in our fork when needed. A better long-term approach would be to use the system root certificate store, which we can adopt after reviewing the system stores across all of our runtime environments.
We wondered why it took about 15 minutes for the effects of the certificate expiry to appear. The answer was in the trigger: we started seeing the issue on the Shopify monolith when a deployment happened. HTTP supports persistent connections (also called HTTP keep-alive), which keep a connection open as long as it’s being used and only close it after a short idle period. Moreover, TLS validation, the check of the validity of certificates, is performed only while initializing the connection, and the trust is maintained for the duration of that connection. Given the traffic on Shopify, our connections to other systems were being kept alive, and the only reason they broke was Kubernetes pods being recreated to deploy the new version, leading to new HTTP connections and the failure of TLS validation; hence the 15-minute discrepancy.
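A small Ruby illustration of why long-lived connections masked the expiry (the host is a placeholder): certificate validation happens once, during the TLS handshake, and every request reusing that connection skips it.

```ruby
require "net/http"

http = Net::HTTP.new("internal-service.example.com", 443) # placeholder host
http.use_ssl = true

# The TLS handshake, including certificate validation, happens here, exactly once.
http.start

# Requests on this keep-alive connection reuse the already-validated session;
# no certificate re-validation occurs, even if a root expires in the meantime.
3.times { http.request(Net::HTTP::Get.new("/healthz")) }

# Only a fresh connection (for example, after a pod restart) triggers a new
# handshake, which is when validation against the expired root would fail.
http.finish
```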
Besides our Ruby applications with (indirect!) dependencies on httpclient, a few other of our systems were affected by the same problem. In particular, services powered by data were left hanging, as the application providing them with data was affected by the disruption. For instance, product recommendations weren’t shown during that time, marketing campaigns were temporarily throttled, and, more visibly to our merchants’ customers, order confirmations were delayed for a short period because risk analysis couldn’t be performed.
Of the Shopify monolith, however, only the canaries—that is, the server to which we roll changes first to test their effect in production before rolling them to the rest of the fleet—were affected by the issue. Our incident response initial action of locking deployments also stops any deployment process in its current state. This simple action allowed us to avoid cycling Kubernetes pods for the monolith and keep the current version running, protecting us from a global outage of Shopify, leading to September 30th, 2021 being that one time an outage could have been way worse.
Raphaël is a Staff Production Engineer and the tech lead of the Traffic team at Shopify, taking care of the interfaces between Shopify and the outside world and providing reliable and scalable systems for configuring the edge of ever-growing applications. He holds a Ph.D. in Computer Engineering in systems performance analysis and tracing, and sometimes lectures future engineers at Polytechnique Montréal about distributed systems and cloud computing.
Upgrading MySQL at Shopify
In early September 2021, we retired our last Shopify database virtual machine (VM) that was running Percona Server 5.7.21, marking the complete cutover to 5.7.32. In this post, I’ll share how the Database Platform team performed the most recent MySQL upgrade at Shopify. I’ll talk about some of the roadblocks we encountered during rollback testing, the internal tooling that we built out to aid upgrading and scaling our fleet in general, and our guidelines for approaching upgrades going forward, which we hope will be useful for the rest of the community.
Why Upgrade and Why Now?
We were particularly interested in upgrading due to the replication improvements that would preserve replication parallelism in a multi-tier replication hierarchy via transaction writesets. However, in a general sense, upgrading our version of MySQL was on our minds for a while and the reasons have become more important over time as we’ve grown:
- We’ve transferred more load to our replicas over time, and without replication improvements, high load could cause replication lag and a poor merchant and buyer experience.
- Due to our increasing global footprint, to maintain efficiency, our replication topology can be up to four “hops” deep, which increases the importance of our replication performance.
- Without replication improvements, in times of high load such as Black Friday/Cyber Monday (BFCM) and flash sales, there’s a greater likelihood of replication lag that in turn heightens the risk to merchants’ data availability in the event of a writer failure.
- It’s industry best practice to stay current with all software dependencies to receive security and stability patches.
- We expect to eventually upgrade to MySQL 8.0. Building the upgrade tooling required for this minor upgrade helps us prepare for that.
To the last point, one thing we definitely wanted to achieve as a part of this upgrade was—to put it in the words of my colleague Akshay—“Make MySQL upgrades at Shopify a checklist of tasks going forward, as opposed to a full-fledged project.” Ideally, by the end of the project, we’d have documentation with steps for performing an upgrade that anyone on the Database Platform team can follow, and that takes on the order of weeks, rather than months, to complete.
Database Infrastructure at Shopify
Core
Shopify's Core database infrastructure is horizontally sharded by shop, spread across hundreds of shards, each consisting of a writer and five or more replicas. These shards run on Google Compute Engine virtual machines (VMs) and run the Percona Server fork of MySQL. Our backup system makes use of Google Cloud’s persistent disk snapshots. While we run the upstream versions of Percona Server, we maintain an internal fork and build pipeline that allows us to patch it as necessary.
Mason
Without automation, there’s a non-trivial amount of toil involved in just the day-to-day operation of our VM fleet due to its sheer size. VMs can go down for many reasons, including failed GCP live migrations, zone outages, or just run-of-the-mill VM failures. Mason was developed to respond to VMs going down by spinning up replacements—a task far more suited to a robot than a human, especially in the middle of the night.
Mason was developed as a self-healing service for our VM-based databases that was borne out of a Shopify Hack Days project in late 2019.
Healing Isn’t All That’s Needed
Shopify’s query workload can differ vastly from shard to shard, which necessitates maintenance of vastly different configurations. Our minimal configuration is six instances: three instances in Google Cloud’s us-east1 region and three instances in us-central1. However, each shard’s configuration can differ in other ways:
- There may be additional replicas to accommodate higher read workloads or to provide replicas in other locations globally.
- The VMs for the replicas may have a different number of cores or memory to accommodate differing workloads.
With all of this in mind, you can probably imagine how it would be desirable to have automation built around maintaining these differences—without it, a good chunk of the manual toil involved in on-call tasks would be simply provisioning VMs, which isn’t an enviable set of responsibilities.
Using Mason to Upgrade MySQL
Upgrades at our scale are extremely high effort as the current count of our VM fleet numbers in the thousands. We decided that building additional functionality onto Mason would be the way forward to automate our MySQL upgrade, and called it the Declarative Database Topologies project. Where Mason was previously used as a solely reactive tool that only maintained a hardcoded default configuration, we envisioned its next iteration as a proactive tool—one that allows us to define a per-shard topology and do the provisioning work that reconciles its current state to a desired state. Doing this would allow us to automate provisioning of upgraded VMs, thus removing much of the toil involved in upgrading a large fleet, and automate scale-up provisioning for events such as BFCM or other high-traffic occurrences.
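The post doesn’t show Mason’s configuration format, but a per-shard desired state could be declared roughly like this (a hypothetical Ruby sketch; every name and value is illustrative):

```ruby
# Hypothetical declarative topology: Mason reconciles the actual fleet of VMs
# against a per-shard desired state like this.
DESIRED_TOPOLOGY = {
  "shard-042" => {
    mysql_version: "5.7.32",
    replicas: { "us-east1" => 3, "us-central1" => 3 }, # the minimal configuration
    machine_type: "n1-highmem-32",
  },
  "shard-117" => {
    mysql_version: "5.7.32",
    replicas: { "us-east1" => 4, "us-central1" => 3, "europe-west1" => 2 },
    machine_type: "n1-highmem-64", # larger VMs for a heavier read workload
  },
}
```

Reconciling against a declaration like this is what lets the same machinery serve both upgrades (change the version field) and BFCM scale-ups (raise the replica counts).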
The Project Plan
We had approximately eight months, before BFCM preparations would begin, to achieve the following:
- pick a new version of MySQL
- benchmark and test the new version for any regressions or bugs
- perform rollback testing and create a rollback plan so we can safely downgrade if necessary
- finally, perform the actual upgrade.
At the same time, we also needed to evolve Mason to:
- increase its stability
- move from a global hardcoded configuration to a dynamic per-shard configuration
- have it respond to scale-ups when the configuration changed
- have it care about Chef configuration, too
- … do all of that safely.
One of the first things we had to do was pick a version of Percona Server. We wanted to maximize the gains that we would get from an upgrade while minimizing our risk. This led us to choose the highest minor version of Percona Server 5.7, which was 5.7.32 at the start of the project. By doing so, we benefited from the bug and security fixes made since we last upgraded; in the words of one of our directors, “incidents that never happened” because we upgraded. At the same time, we avoided some of the larger risks associated with major version upgrades.
Once we had settled on a version, we made changes in Chef to have it handle an in-place upgrade. Essentially, we created a new Chef role with the existing provisioning code but with the new version specified for the MySQL server version variable and modified the code so that the following happens:
- Restore a backup taken from a 5.7.21 VM on a VM with 5.7.32 installed.
- Allow the VM and MySQL server process to start up normally.
- Check the contents of the mysql_upgrade_info file in the data directory. If the version differs from the installed MySQL server version, run mysql_upgrade (via a wrapper script that accounts for the unexpected behaviour of the mysql_upgrade script, which exits with return code 2, instead of the typical return code of 0, when an upgrade wasn’t required); a sketch of this check follows the list.
- Perform the necessary replication configuration and proceed with the rest of the MySQL server startup.
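As a sketch of that version check in Chef terms (the attribute names and wrapper path are hypothetical; the execute resource and not_if guard are standard Chef):

```ruby
# Hypothetical Chef recipe fragment: run mysql_upgrade only when the data
# directory was produced by a different MySQL version than the one installed.
upgrade_info = "#{node['mysql']['data_dir']}/mysql_upgrade_info"

execute "mysql_upgrade" do
  # The wrapper maps mysql_upgrade's exit code 2 ("no upgrade needed") to 0.
  command "/usr/local/bin/mysql_upgrade_wrapper"
  not_if do
    ::File.exist?(upgrade_info) &&
      ::File.read(upgrade_info).strip.start_with?(node['mysql']['version'])
  end
end
```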
After this work was completed, all we had to do to provision an upgraded version was to specify that the new VM be built with the new Chef role.
Preparing for the Upgrade
Performing the upgrade is the easy part, operationally. You can spin up an instance with a backup from the old version, let mysql_upgrade do its thing, have it join the existing replication topology, optionally take backups from this instance with the newer version, populate the rest of the topology, and then perform a takeover. Making sure the newer version performs the way we expect and can be safely rolled back to the old version, however, is the tricky part.
During our benchmarking tests, we didn’t find anything anomalous, performance-wise. However, when testing the downgrade from 5.7.32 back to 5.7.21, we found that the MySQL server wouldn’t properly start up. This is what we saw when tailing the error logs:
When we allowed the calculation of transient stats at startup to run to completion, it took over a day due to a lengthy table analyze process on some of our shards—not great if we needed to roll back more urgently than that.
A cursory look at the Percona Server source code revealed that the table_name column in the innodb_index_stats and innodb_table_stats tables changed from VARCHAR(64) in 5.7.21 to VARCHAR(199) in 5.7.32. We patched mysql_system_tables_fix.sql in our internal Percona Server fork so that the column lengths were set back to the value 5.7.21 expected, and re-tested the rollback. This time we didn’t see the errors about the column lengths; however, we still saw the analyze table process causing full table rebuilds, again leading to an unacceptable startup time. It became clear to us that by fixing these column lengths we had merely addressed a symptom of the problem.
At this point, while investigating our options, it occurred to us that one reason this analyze table process might be happening is that we run ALTER TABLE commands as part of MySQL server startup: a startup script sets a minimum AUTO_INCREMENT value on tables (needed because the auto_increment counter isn’t persisted across restarts, a long-standing bug that’s addressed in MySQL 8.0).
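A minimal sketch of such a startup task, using the mysql2 gem (table names, floor values, and credentials are made up for illustration):

```ruby
require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "root")

# Hypothetical per-table floors; real values would come from configuration.
auto_increment_floors = { "orders" => 5_000_000, "checkouts" => 9_000_000 }

auto_increment_floors.each do |table, floor|
  # On a healthy version this is an instantaneous, metadata-only change;
  # the bug investigated below made a downgraded 5.7.21 rebuild the table.
  client.query("ALTER TABLE `#{table}` AUTO_INCREMENT = #{floor.to_i}")
end
```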
Investigating the Bug
Once we had our hypothesis, we started to test it. This culminated in a group debugging session where a few members of our team found that the following steps reproduced the bug that resulted in the full table rebuild:
- On 5.7.32: A backup previously taken from 5.7.21 is restored.
- On 5.7.32: An ALTER TABLE that should just be an instantaneous metadata change, for example ALTER TABLE t AUTO_INCREMENT=n, is run on a table. The table is changed instantaneously, as expected.
- On 5.7.32: A backup is taken.
- On 5.7.21: The backup taken from 5.7.32 in the previous step is restored.
- On 5.7.21: The MySQL server is started up, and mysql_upgrade performs the in-place downgrade.
- On 5.7.21: An ALTER TABLE statement similar to the earlier one is performed. A full rebuild of the table is performed, unexpectedly and unnecessarily.
Stepping through the above steps with the GNU Debugger (GDB), we found the place in the MySQL server source code where it’s incorrectly concluded that indexes have changed in a way that required a table rebuild (from Percona Server 5.7.21 in the has_index_def_changed function in sql/sql_table.cc):
We saw, while inspecting in GDB, that the flags for the old version of the table (table_key->flags above) don’t match those of the new version of the table (new_key->flags above), despite the fact that only a metadata change was applied:
Digging deeper, we found past attempts to fix this bug. In the 5.7.23 release notes, there’s the following:
“For attempts to increase the length of a VARCHAR column of an InnoDB table using ALTER TABLE with the INPLACE algorithm, the attempt failed if the column was indexed. If an index size exceeded the InnoDB limit of 767 bytes for COMPACT or REDUNDANT row format, CREATE TABLE and ALTER TABLE did not report an error (in strict SQL mode) or a warning (in nonstrict mode). (Bug #26848813)”
A fix was merged for the bug, however we saw that there was a second attempt to fix this behaviour. In the 5.7.27 release notes, we see:
“For InnoDB tables that contained an index on a VARCHAR column and were created prior to MySQL 5.7.23, some simple ALTER TABLE statements that should have been done in place were performed with a table rebuild after an upgrade to MySQL 5.7.23 or higher. (Bug #29375764, Bug #94383)”
A fix was merged for this bug as well, but it didn’t fully address the issue of some ALTER TABLE statements that should be simple metadata changes instead leading to a full table rebuild.
My colleague Akshay filed a bug report with a proposed patch; however, the patch ultimately wasn’t accepted by the MySQL team. To safely upgrade past this bug, we still needed MySQL to behave reasonably on downgrade, so we ended up patching Percona Server in our internal fork. We tested our patched version successfully in our final rollback tests, unblocking our upgrade.
What are “Packed Keys” Anyway?
The PACK_KEYS feature of the MyISAM storage engine allows keys to be compressed, thereby making indexes much smaller and improving performance. This feature isn’t supported by the InnoDB storage engine, as its index layout and expectations are completely different. In MyISAM, when indexed VARCHAR columns are expanded past eight bytes, thus converting from unpacked keys to packed keys, it (rightfully) triggers an index rebuild.
However, as the first attempt to fix the bug in 5.7.23 shows, the same type of change triggered the same behaviour in InnoDB, even though packed keys aren’t supported there. To remedy this, from 5.7.23 onwards, the HA_PACK_KEY and HA_BINARY_PACK_KEY flags weren’t set if the storage engine didn’t support them.
That, however, meant that for tables created prior to 5.7.23, these flags were unexpectedly set even on storage engines that didn’t support them. So upon upgrade to 5.7.23 or higher, metadata-only ALTER TABLE commands executed on such an InnoDB table would incorrectly conclude that a full index rebuild was necessary. This brings us to the second attempt at a fix, in which the flags were removed entirely if the storage engine didn’t support them. Unfortunately, that second fix didn’t account for the case where the flags differ between table versions but the difference should be ignored when evaluating whether the indexes need to be rebuilt in earlier versions, and that’s what we addressed in our proposed patch: during downgrade, if the old version of the table (from 5.7.32) doesn’t specify the flag but the new version (in 5.7.21) does, we bypass the index rebuild.
Meanwhile, in the Mason Project…
While all of this rollback testing work was in progress, another part of the team was hard at work shipping new features in Mason to let it handle the upgrades. These were some of the requirements we had that guided the project work:
- The creation of a “priority” lane—self-healing should always take precedence over a scale-up related provisioning request.
- We needed to throttle the scale-up provisioning queue to limit how much work was done simultaneously.
- Feature flags were required to limit the number of shards to release the scale-up feature to, so that we could control which shards were provisioned and release the new features carefully.
- A dry-run mode for scale-up provisioning was necessary to allow us to test these features without making changes to the production systems immediately.
Underlying all of this was an abundance of caution in shipping the new features. Because of our large fleet size, we didn’t want to risk provisioning a lot of VMs we didn’t need or VMs in the incorrect configuration that would cost us either way in terms of GCP resource usage or engineering time spent in decommissioning resources.
In the initial stages of the project, stabilizing the service was important since it played a critical role in maintaining our MySQL topology. Over time, it had turned into a critical component of our infrastructure that significantly improved our on-call quality of life. Some of the early tasks were simply about making it a first-class citizen among the services we owned: we stabilized the staging environment it was deployed into, improved existing monitoring and added more, and started using it to emit metrics to Datadog indicating when the topology was underprovisioned (in cases where Mason failed to do its job).
Another challenge was that Mason itself talks to many disparate components in our infrastructure: the GCP API, Chef, the Kubernetes API, ZooKeeper, Orchestrator, as well as the database VMs themselves. It was often a challenge to anticipate failure scenarios—often, the failure experienced was completely new and wouldn’t have been caught in existing tests. This is still an ongoing challenge, and one that we hope to address through improved integration testing.
Later on, as we onboarded new people to the project and started introducing more features, it also became obvious that the application was quite brittle in its current state; adding new features became more and more difficult due to the existing complexity, especially when they were being worked on concurrently. It brought to the forefront the importance of breaking down streams of work that have the potential to become hard blockers, and highlighted how much a well-designed codebase can decrease the chances of this happening.
We faced many challenges, but ultimately shipped the project on time. Now that the project is complete, we’re dedicating time to improving the codebase so it’s more maintainable and developer-friendly.
The Upgrade Itself
Throughout the process of rollback testing, we had already been running 5.7.32 for a few months on several shards reserved for canary testing. A few of those shards are load tested on a regular basis, so we were reasonably confident that this, along with our own benchmarking tests, made it ready for our production workload.
Next, we created a rollback plan in case the new version was unstable in production for unforeseen reasons. One of the early suggestions for risk mitigation was to maintain a 5.7.21 VM per-shard and continue to take backups from them. However, that would have been operationally complex and also would have necessitated the creation of more tooling and monitoring to make sure that we always have 5.7.21 VMs running for each shard (rather toilsome when the number of shards reaches the hundreds in a fleet). Ultimately, we decided against this plan, especially considering the fact that we were confident that we could roll back to our patched build of Percona Server, if we had to.
Our intention was to do everything we could to de-risk the upgrade by performing extensive rollback testing, but ultimately we preferred to fix forward whenever possible. That is, the option to roll back was expected to be taken only as a last resort.
We started provisioning new VMs with 5.7.32 in earnest on August 25th using Mason, after our tooling and rollback plan were in place. We decided to stagger the upgrades by creating several batches of shards. This allowed the upgraded shards to “bake” and not endanger the entire fleet in the event of an unforeseen circumstance. We also didn’t want to provision all the new VMs at once due to the amount of resource churn (at the petabyte-scale) and pressure it would put on Google Cloud.
On September 7th, the final shards were completed, marking the end of the upgrade project.
What Did We Take Away from This Upgrade?
This upgrade project highlighted the importance of rollback testing. Without the extensive testing that we performed, we would have never known that there was a critical bug blocking a potential rollback. Even though needing to rebuild the fleet with the old version to downgrade would have been toilsome and undesirable, patching 5.7.21 gave us the confidence to proceed with the upgrade, knowing that we had the option to safely downgrade if it became necessary.
Mason, the tooling we relied on, also became more important over time. In the past, Mason was considered a lower-tier application, and simply turning it off was a band-aid solution when it behaved in unexpected ways; fixing it often wasn’t a priority when bugs were encountered. However, as time has gone by, we’ve recognized how large a role it plays in toil mitigation and in maintaining healthy on-call expectations, especially as the size of our fleet has grown. We have invested more time and resources into it by improving test coverage and refactoring key parts of the codebase to reduce complexity and improve readability. We also plan to improve the local development environment and streamline its deployment pipeline.
Finally, investing in the documentation and easy repeatability of upgrades has been a big win for Shopify and for our team. When we first started planning for this upgrade, finding out how upgrades were done in the past was a bit of a scavenger hunt and required a lot of institutional knowledge. By developing guidelines and documentation, we paved the way for future upgrades to be done faster, more safely, and more efficiently. Rather than an intense and manual context-gathering process every time that pays no future dividends, we can now treat a MySQL upgrade as simply a series of guidelines to follow using our existing tooling.
Next up: MySQL 8!
Yi Qing Sim is a Senior Production Engineer and brings nearly a decade of software development and site reliability engineering experience to the Database Backend team, where she primarily works on Shopify’s core database infrastructure.
Debugging Systems in the Cloud: MySQL, Kubernetes, and Cgroups
By Rodrigo Saito, Akshay Suryawanshi, and Jeremy Cole
KateSQL, Shopify’s custom-built Database-as-a-Service platform running on top of Google Cloud’s Kubernetes Engine (GKE), currently manages several hundred production MySQL instances across different Google Cloud regions and many GKE clusters.
Earlier this year, we found a performance-related issue with KateSQL: some Kubernetes Pods running MySQL would start up and shut down more slowly than other similar Pods with the same data set. This partially impaired our ability to replace MySQL instances quickly when executing maintenance tasks like configuration changes or upgrades. While investigating, we found several factors that could be contributing to this slowness.
The root cause was a bug in the Linux kernel memory cgroup controller. This post provides an overview of how we investigated the root cause and leveraged Shopify’s partnership with Google Cloud Platform to help us mitigate it.
The Problem
KateSQL has an operational procedure called instance replacement. It involves creating a new MySQL replica (a Kubernetes Pod running MySQL) and then stopping the old one, repeating this process until all the running MySQL instances in the KateSQL Platform are replaced. KateSQL’s instance replacement operations revealed inconsistent MySQL Pod creation times, ranging from 10 to 30 minutes. The Pod creation time includes the time needed to:
- spin up a new GKE node (if needed)
- create a new Persistent Disk with data from the most recent snapshot
- create the MySQL Kubernetes Pod
- initialize the mysql-init container (which completes InnoDB crash recovery)
- start the mysqld in the mysql container.
We started by measuring the time of the mysql-init container and the MySQL startup time, and then compared the times between multiple MySQL Pods. We noticed a huge difference between these two MySQL Pods that had the exact same resources (CPU, memory, and storage) and dataset:
| KateSQL instance | Initialization | Startup |
| --- | --- | --- |
| katesql-n4sx0 | 2120 seconds | 1104 seconds |
| katesql-jxijq | 74 seconds | 17 seconds |
Later, we discovered that the MySQL Pods with slow creation times also showed a gradual decrease in performance. Evidence of this was an increased number of slow queries for queries that utilized temporary memory tables:
Immediate Mitigation
A quick spot-check analysis revealed that newly provisioned Kubernetes cluster nodes performed better than those that had been up and running for a few months. With this information in hand, our first mitigation strategy was to replace the older Kubernetes cluster nodes with new ones using the following steps:
- Cordon (disallow any new Pods) the older Kubernetes cluster nodes.
- Replace instances using KateSQL to move MySQL Pods to new Kubernetes nodes, allowing GKE to autoscale the cluster by adding new cluster nodes as necessary.
- Once instances are moved to new Kubernetes nodes, drain the cordoned cluster nodes to scale down the cluster (automatically, through GKE autoscaler).
This strategy was applied to production KateSQL instances, and we observed performance improvements on the new MySQL Pods.
Further Investigation
We began a deeper analysis to understand why newer cluster nodes performed better than older cluster nodes. We ruled out differences in software versions like kubelet, the Linux kernel, Google’s Container-optimized OS (COS), etc. Everything was the same, except their uptimes.
Next, we started a resource analysis of each resource subsystem to narrow down the problem. We ruled out the storage subsystem, as the MySQL error log provided a vital clue as to where the slowness was. So we examined timestamps from InnoDB’s Buffer Pool initialization:
We analyzed the MySQL InnoDB source code to understand the operations involved during InnoDB’s Buffer Pool initialization. Most importantly, memory allocation during initialization is single-threaded, as confirmed using top, which showed it consuming approximately 100% CPU. We subsequently captured strace output of the mysqld process while it was starting up:
We see that each mmap() system call took around 100 ms to allocate approximately 128MB sized chunks, which in our opinion is terribly slow for the memory allocation process.
We also did an on-CPU perf capture during this initialization process; below is a snapshot of the flamegraph:
A quick analysis of the flamegraph shows how the MySQL (InnoDB) buffer pool initialization task is delegated to the memory allocator (jemalloc in our case), which then spends most of its time in a kernel function called mem_cgroup_commit_charge.
We further analyzed what the mem_cgroup_commit_charge function does: it seems to be part of memcg (the memory control group) and is responsible for charging (claiming ownership of) pages from one cgroup (an unused/dead or root cgroup) to the cgroup of the allocating process. Unfortunately, memcg isn’t well documented, so it was hard to understand what was causing the slowdown.
Another unusual thing we spotted (using the slabtop command) was an abnormally high dentry cache, sometimes around 20GB for about 64 Pods running on a given Kubernetes cluster node:
While investigating whether a large dentry cache could be slowing the entire system down, we found this post by Sysdig that provided useful information. After further analysis following the steps from the post, we confirmed that it wasn’t the same case as we were experiencing. However, we noticed immediate performance improvements (similar to a restarted Kubernetes cluster node) after dropping the dentry cache using the following command:
echo 2 > /proc/sys/vm/drop_caches
Continuing the unusual slab allocation investigation, we ruled out its side effects, like memory fragmentation, since enough higher-order free pages were available (verified using the output of /proc/buddyinfo). Also, this memory is reclaimable during memory pressure events.
A Breakthrough
Going through various bug reports related to cgroups, we found a command to list the number of cgroups in a system:
We compared the memory cgroup count of a good node and an affected node, and concluded that approximately 50K memory cgroups is more than expected, even accounting for some of our short-lived tasks! This indicated to us that there could be a cgroup leak. It also helped make sense of the perf output that we had flamegraphed previously: there could be a performance impact if the cgroup commit charge has to traverse many cgroups for each page charge. We also saw from source code analysis that it locks the page cache least recently used (LRU) list.
We evaluated a few more bug reports and articles, especially the following:
- A bug report unrelated to Kubernetes but pointing to the underlying cause related to cgroups. This bug report also helped point to the fix that was available for such an issue.
- An article on lwn.net related to almost our exact issue. A must read!
- A related workaround to the problem in Kubernetes.
- A thread in the Linux kernel mailing list that helped a lot.
These posts were a great signal that we were on the right track to understanding the root cause of the problem. To confirm our findings and understand whether we were seeing a symptom of this cgroup leak that hadn’t yet been observed in the Linux community, we met with Linux kernel engineers at Google.
Together, we evaluated an affected node and the nature of the problem. The Google engineers were able to confirm that we were in fact hitting another side effect of reparenting slab memory on cgroup removal.
To Prove the Hypothesis
After evaluating the problem, we tested a possible solution. We invoked a switch file for the kubepods cgroup (the parent cgroup for Kubernetes Pods) to force it to empty zombie/dead cgroups:
$ echo 1 | sudo tee /sys/fs/cgroup/memory/kubepods/memory.force_empty
This caused the number of memory cgroups to decrease rapidly to approximately 1,800, in line with the good node we had compared against earlier:
We quickly tested a MySQL Pod restart to see if there were any improvements in startup time performance. An 80G InnoDB buffer pool was initialized in five seconds:
A Possible Workaround and Fixes
There were multiple workarounds and fixes for this problem that we evaluated with engineers from Google:
- Reboot or cordon the cluster node VMs, identifying them by monitoring /proc/cgroups output.
- Set up a cronjob to drop the SLAB and page caches. It’s an old-school DBA/sysadmin technique that might work but could have a performance penalty on read IO.
- Move short-lived Pods to a dedicated node pool to isolate them from more critical Pods like the MySQL ones.
- Run echo 1 > /sys/fs/cgroup/memory/memory.force_empty in a preStop hook of short-lived Pods.
- Upgrade to COS 85, which has upstream fixes to the cgroup SLAB re-parenting bugs. Upgrading from GKE 1.16 to 1.18 should get us Linux kernel 5.4 with the relevant bug fixes.
Since we were due a GKE version upgrade, we created new GKE clusters with GKE 1.18 and started creating new MySQL Pods on those new clusters. After several weeks of running on GKE 1.18, we saw consistent MySQL InnoDB Buffer Pool initialization time and query performance:
This was one of the lengthiest investigations that the Database Platform group has carried out at Shopify. The time taken to identify the root cause was due to the nature of the problem, difficult reproducibility, and absolutely no out-of-the-book methodology to follow. However, there are multiple ways to solve the problem and that’s a very positive outcome.
Special thanks to Roman Gushchin from Facebook’s Kernel Engineering team, whom we connected with via LinkedIn to discuss this problem, and Google’s Kernel Engineering team who helped us confirm and solve the root cause of the problem.
Rodrigo Saito is a Senior Production Engineer on the Database Backend team, where he primarily works on building and improving KateSQL, Shopify's Database-as-a-Service, using his more than a decade of software engineering experience. Akshay Suryawanshi is a Staff Production Engineer on the Database Backend team, who helped build KateSQL at Shopify along with Jeremy Cole, a Senior Staff Production Engineer in the larger Database Platform group. Both of them bring decades of database administration and engineering experience to managing petabyte-scale infrastructure.
Shard Balancing: Moving Shops Confidently with Zero-Downtime at Terabyte-scale
High Availability by Offloading Work Into the Background
Unpredictable traffic spikes, slow requests to a third-party payment gateway, or time-consuming image processing can easily overwhelm an application, making it respond slowly or not at all. Over Black Friday Cyber Monday (BFCM) 2020, Shopify merchants made sales of over 5 billion USD, with peak sales of over 100 million USD per hour. At such a massive scale, high availability and short response times are crucial. But even for smaller applications, availability and response times are important for a great user experience.
High Availability
High availability is often conflated with a high server uptime. But it’s not sufficient that the server hasn’t crashed or shut down. In the case of Shopify, our merchants need to be able to make sales. So a buyer needs to be able to interact with the application. A banner saying “come back later” isn’t sufficient, and serving only one buyer at a time isn’t good enough either. To consider an application available, the community of users needs to have meaningful interactions with the application. Availability can be considered high if these interactions are possible whenever the users need them to be.
Offloading Work
In order to be available, the application needs to be able to accept incoming requests. If the external-facing part of the application (the application server) is also doing the heavy lifting required to process the requests, it can quickly become overwhelmed and unavailable for new incoming requests. To avoid this, we can offload some of the heavy lifting into a different part of the system, moving it outside of the main request response cycle to not impact the application server’s availability to accept and serve incoming requests. This also shortens response times, providing a better user experience.
Commonly offloaded tasks include:
- sending emails
- processing images and videos
- firing webhooks or making third party requests
- rebuilding search indexes
- importing large data sets
- cleaning up stale data
The benefits of offloading a task are particularly large if the task is slow, consumes a lot of resources, or is error-prone.
For example, when a new user signs up for a web application, the application usually creates a new account and sends them a welcome email. Sending the email is not required for the account to be usable, and the user doesn’t receive the email immediately anyway. So there’s no point in sending the email from within the request response cycle. The user shouldn’t have to wait for the email to be sent; they should be able to start using the application right away, and the application server shouldn’t be burdened with the task of sending the email.
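In a Rails application this is a few lines with Active Job (the class and mailer names here are illustrative, not from the post):

```ruby
class WelcomeEmailJob < ApplicationJob
  queue_as :default

  def perform(user_id)
    user = User.find(user_id)
    UserMailer.welcome(user).deliver_now
  end
end

# In the signup controller: respond to the user immediately, send the email later.
WelcomeEmailJob.perform_later(user.id)
```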
Any task not required to be completed before serving a response to the caller is a candidate for offloading. When uploading an image to a web application, the image file needs to be processed, and the application might want to create thumbnails in different sizes. Successful image processing is often not required for the user to proceed, so it’s generally possible to offload this task. However, the server can no longer respond with “the image has been successfully processed” or “an image processing error has occurred.” All it can say is “the image has been uploaded successfully; it will appear on the website later if things go as planned.” Given the time-consuming nature of image processing, this trade-off is often well worth the massive improvement in response time and availability.
Background Jobs
Background jobs are an approach to offloading work. A background job is a task to be done later, outside of the request response cycle. The application server delegates the task, for example, the image processing, to a worker process, which might even be running on an entirely different machine. The application server needs to communicate the task to the worker. The worker might be busy and unable to take on the task right away, but the application server shouldn’t have to wait for a response from the worker. Placing a message queue between the application server and worker solves this dilemma, making their communication asynchronous. Sender and receiver of messages can interact with the queue independently at different points in time. The application server places a message into the queue and moves on, immediately becoming available to accept more incoming requests. The message is the task to be done by the worker, which is why such a message queue is often called a task queue. The worker can process messages from the queue at its own speed. A background job backend is essentially some task queues along with some broker code for managing the workers.
Features
Shopify queues tens of thousands of jobs per second in order to leverage a variety of features.
Response times
Using background jobs allows us to decouple the external-facing request (served by the application server) from any time-consuming backend tasks (executed by the worker), thus improving response times. What improves response times for individual requests also improves the overall availability of the system.
Spikeability
A sudden spike in, say, image uploads doesn’t hurt if the time-consuming image processing is done by a background job. The availability and response time of the application server are constrained by the speed with which it can queue image processing jobs. But the speed of queueing more jobs is not constrained by the speed of processing them. If the worker can’t keep up with the increased amount of image processing tasks, the queue grows. The queue serves as a buffer between the worker and the application server, so users can continue uploading images as usual. With Shopify facing traffic spikes of up to 170k requests per second, background jobs are essential for maintaining high availability despite unpredictable traffic.
Retries and Redundancy
When a worker encounters an error while running a job, the job is requeued and retried later. Since all of that happens in the background, it doesn’t affect the availability or response times of the external-facing application server. This makes background jobs a great fit for error-prone tasks like requests to an unreliable third party.
Parallelization
Several workers might process messages from the same queue, allowing more than one task to be worked on simultaneously and distributing the workload. We can also split a large task into several smaller tasks and queue them as individual background jobs so that several of these subtasks are worked on simultaneously.
Prioritization
Most background job backends allow for prioritizing jobs. They might use priority queues that don’t follow the first-in, first-out approach, so that high-priority jobs end up cutting the line. Or they set up separate queues for jobs of different priorities and configure workers to prioritize jobs from the higher-priority queues. No worker needs to be fully dedicated to high-priority jobs: whenever there’s no high-priority job in the queue, the worker processes lower-priority jobs. This makes efficient use of resources, significantly reducing worker idle time.
Event-based and Time-based Scheduling
Background jobs aren’t always queued by an application server; a worker processing a job can also queue another job. While application servers and workers queue jobs based on events like user interaction or mutated data, a scheduler might queue jobs based on time (for example, a daily backup).
Simplicity of Code
The background job backend encapsulates the asynchronous communication between the client requesting the job and the worker executing the job. Placing this complexity into a separate abstraction layer keeps the concrete job classes simple. A concrete job class only implements the task to be done (for example, sending a welcome email or processing an image). It’s not aware of being run in the future, being run on one of several workers, or being retried after an error.
Challenges
Asynchronous communication poses some challenges that don’t disappear by encapsulating some of its complexity. Background jobs aren’t any different.
Breaking Changes to Job Parameters
The client queuing the job and the worker processing it don’t always run the same software version; one of them might already have been deployed with a newer version. This situation can last for a significant amount of time, especially when practicing canary deployments. Changes to the job parameters can break the application if a job has been queued with a certain set of parameters but the worker processing it expects a different set. Breaking changes to job parameters need to roll out through a sequence of changes that preserve backward compatibility until all legacy jobs in the queue have been processed.
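One hedged sketch of such a sequence: first deploy a worker that accepts both the old and the new parameter shape, and only change the enqueuing side once that worker is everywhere (the names are illustrative):

```ruby
class ExportJob < ApplicationJob
  # Transitional signature: legacy jobs queued a bare shop id, newer jobs
  # queue a hash with extra options. The legacy branch can be removed once
  # all old-style jobs have drained from the queue.
  def perform(arg)
    shop_id, options = arg.is_a?(Hash) ? [arg.fetch("shop_id"), arg] : [arg, {}]
    # ... perform the export for shop_id with options ...
  end
end
```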
No Exactly-once Delivery
When a worker completes a job, it reports back that it’s now safe to remove the job from the queue. But what if the worker processing the job remains silent? We can allow other workers to pick up such a job and run it. This ensures that the job runs even if the first worker has crashed. But if the first worker is simply a little slower than expected, our job runs twice. On the other hand, if we don’t allow other workers to pick up the job, the job will not run at all if the first worker did crash. So we have to decide what’s worse: not running the job at all, or running it twice. In other words, we have to choose between at-least-once and at-most-once delivery.
For example, not charging a customer is not ideal, but charging them twice might be worse for some businesses. At-most-once delivery sounds right in this scenario. However, if every charge is carefully tracked and the job checks those states before attempting a charge, running the job a second time doesn’t result in a second charge. The job is idempotent, allowing us to safely choose at-least-once delivery.
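A sketch of such an idempotent job (the model and gateway names are made up):

```ruby
class ChargeJob < ApplicationJob
  def perform(order_id)
    order = Order.find(order_id)
    return if order.charged? # idempotency guard: a redelivered job is a no-op

    PaymentGateway.charge(order.amount, order.payment_token)
    order.update!(charged: true)
  end
end
```

A production version would also need a row lock or unique constraint on top of the guard, since two workers could both read `charged? == false` concurrently.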
Non-Transactional Queuing
The job queue is often in a separate datastore. Redis is a common choice for the queue, while many web applications store their operational data in MySQL or PostgreSQL. When a transaction for writing operational data is open, queuing the job won’t be part of this enclosing transaction: writing the job into Redis isn’t part of a MySQL or PostgreSQL transaction. The job is queued immediately and might end up being processed before the enclosing transaction commits (or even if it rolls back).
When accepting external input from user interaction, it’s common to write some operational data with very minimal processing, and queue a job performing additional steps processing that data. This job may not find the data it needs unless we queue it after committing the transaction writing the operational data. However, the system might crash after committing the transaction and before queuing the job. The job will never run, the additional steps for processing the data won’t be performed, leaving the system in an inconsistent state.
The outbox pattern can be used to create transactionally staged job queues. Instead of queuing the job right away, the job parameters are written into a staging table in the operational data store. This can be part of a database transaction writing operational data. A scheduler can periodically check the staging table, queue the jobs, and update the staging table when the job is queued successfully. Since this update to the staging table might fail even though the job was queued, the job is queued at least once and should be idempotent.
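A rough sketch of the pattern in Rails terms (the StagedJob model, its columns, and the job names are hypothetical):

```ruby
# 1. Stage the job in the same transaction as the operational write.
ActiveRecord::Base.transaction do
  image = Image.create!(image_params)
  StagedJob.create!(job_class: "ProcessImageJob", arguments: [image.id])
end

# 2. A scheduler periodically drains the staging table.
StagedJob.where(enqueued_at: nil).find_each do |staged|
  staged.job_class.constantize.perform_later(*staged.arguments)
  # If this update fails after enqueuing, the job will be enqueued again on
  # the next pass, which is why staged jobs must be idempotent
  # (at-least-once delivery).
  staged.update!(enqueued_at: Time.current)
end
```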
Depending on the volume of jobs, transactionally staged job queues can result in quite some load on the database. And while this approach guarantees the queuing of jobs, it can’t guarantee that they will run successfully.
Local Transactions
A business process might involve database writes from the application server serving a request and workers running several jobs. This creates the problem of local database transactions. Eventual consistency is reached when the last local transaction commits. But if one of the jobs fails to commit its data, the system is again in an inconsistent state. The SAGA pattern can be used to guarantee eventual consistency. In addition to queuing jobs transactionally, the jobs also report back to the staging table when they succeed. A scheduler can check this table and spot inconsistencies. This results in an even higher load on the database than a transactionally staged job queue alone.
Out of Order Delivery
The jobs leave the queue in a predefined order, but they can end up on different workers and it’s unpredictable which one completes faster. And if a job fails and is requeued, it’s processed even later. So if we’re queueing several jobs right away, they might run out of order. The SAGA pattern can ensure jobs are run in the correct order if the staging table is also used to maintain the job order.
A more lightweight alternative can be used if consistency guarantees aren’t a concern. Once a job has completed its tasks, it can queue another job as a follow-up. This ensures the jobs run in the predefined order. The approach is quick and easy to implement, since it doesn’t require a staging table or a scheduler and doesn’t generate extra load on the database. But the resulting system can become hard to debug and maintain, since it pushes all its complexity down a potentially long chain of jobs queueing other jobs, with little observability into where exactly things might have gone wrong.
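The chaining approach is just the last line of one job enqueuing the next (illustrative names):

```ruby
class ResizeImageJob < ApplicationJob
  def perform(image_id)
    # ... produce the resized variants ...
    PublishImageJob.perform_later(image_id) # follow-up runs strictly after this job
  end
end
```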
Long Running Jobs
A job doesn’t have to be as fast as a user-facing request, but long-running jobs can cause problems. For example, the popular Ruby background job backend Resque prevents workers from being shut down while a job is running, which means a worker running a long job blocks deployment. It’s also not very cloud-friendly, since resources are required to be available for a significant stretch of time in a row. Another popular Ruby background job backend, Sidekiq, aborts and requeues the job when a shutdown of the worker is initiated. However, the next time the job runs, it starts over from the beginning, so it might be aborted again before completion. If deployments happen faster than the job can finish, the job has no chance to succeed. With the core of Shopify deploying about 40 times a day, this is not an academic discussion but an actual problem we needed to address.
Luckily, many long-running jobs are similar in nature: they iterate over a huge dataset. Shopify has developed and open sourced an extension to Ruby on Rails’s Active Job framework, making this kind of job interruptible and resumable. It sets a checkpoint after each iteration and requeues the job. Next time the job is processed, work continues at the checkpoint, allowing for safe and easy interruption of the job. With interruptible and resumable jobs, workers can be shut down any time, which makes them more cloud-friendly and allows for frequent deployments. Jobs can be throttled or halted for disaster prevention, for example, if there’s a lot of load on the database. Interrupting jobs also allows for safely moving data between database shards.
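That open-sourced extension is the job-iteration gem; a job written against its documented interface looks roughly like this (the model and per-record work are illustrative):

```ruby
# Gemfile: gem "job-iteration"
class BackfillProductsJob < ActiveJob::Base
  include JobIteration::Iteration

  def build_enumerator(cursor:)
    # Enumerate records with a cursor so the job can be interrupted and
    # resumed from the last checkpoint instead of starting over.
    enumerator_builder.active_record_on_records(Product.all, cursor: cursor)
  end

  def each_iteration(product)
    product.update!(backfilled: true) # hypothetical per-record work
  end
end
```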
Distributed Background Jobs
Background job backends like Resque and Sidekiq in Ruby usually queue a job by placing a serialized object, an instance of the concrete job class, into the queue. This implies that both the client queuing the job and the worker processing it need to be able to work with this object and have an implementation of this class. This works great in a monolithic architecture where clients and workers run the same codebase. But if we want to, say, extract the image processing into a dedicated image processing microservice, maybe even written in a different language, we need a different approach to communication.
It’s possible to use Sidekiq with separate services, but the workers still need to be written in Ruby, and the client has to choose the right Redis queue for a job. So this approach isn’t easily applied to a large-scale microservices architecture, though it avoids the overhead of adding a message broker like RabbitMQ.
A message-oriented middleware like RabbitMQ places a purely data-based interface between the producer and consumer of messages, such as a JSON payload. The message broker can serve as a distributed background job backend where a client can offload work onto workers running an entirely different codebase.
Instead of simple point-to-point queues, topics add powerful routing on top of task queues. In contrast to HTTP, this routing is not limited to 1:1. Beyond delegating specific tasks, messaging can be used for event messages whenever communication between microservices is needed. However, with messages removed after processing, there’s no way to replay the stream of messages and no source of truth for system-wide state.
Event streaming systems like Kafka take an entirely different approach: events are written into an append-only event log. All consumers share the same log and can read it at any time. The broker itself is stateless; it doesn’t track event consumption. Events are grouped into topics, which provides some publish-subscribe capabilities that can be used for offloading work to different services. These topics aren’t based on queues, and events aren’t removed. Since the log of events can be replayed, it can serve, for example, as a source of truth for event sourcing. With a stateless broker and append-only writes, throughput is incredibly high—a great fit for real-time applications and data streaming.
Background jobs allow the user-facing application server to delegate tasks to workers. With less work on its plate, the application server can serve user-facing requests faster and maintain a higher availability, even when facing unpredictable traffic spikes or dealing with time-consuming and error-prone backend tasks. The background job backend encapsulates the complexity of asynchronous communication between client and worker into a separate abstraction layer, so that the concrete code remains simple and maintainable.
High availability and short response times are necessary for providing a great user experience, making background jobs an indispensable tool regardless of the application’s scale.
Kerstin is a Staff Developer transforming Shopify’s massive Rails code base into a more modular monolith, building on her prior experience with distributed microservices architectures.
Dynamic ProxySQL Query Rules
How Shopify Dynamically Routes Storefront Traffic
In 2019 we set out to rewrite the Shopify storefront implementation. Our goal was to make it faster. We talked about the strategies we used to achieve that goal in a previous post about optimizing server-side rendering and implementing efficient caching. To build on that post, I’ll go into detail on how the Storefront Renderer team tweaked our load balancers to shift traffic between the legacy storefront and the new storefront.
First, let's take a look at the technologies we used. For our load balancer, we’re running nginx with OpenResty. We previously discussed how scriptable load balancers are our secret weapon for surviving spikes of high traffic. We built our storefront verification system with Lua modules and used that system to ensure our new storefront achieved parity with our legacy storefront. The system to permanently shift traffic to the new storefront, once parity was achieved, was also built with Lua. Our chatbot, spy, is our front end for interacting with the load balancers via our control plane.
At the beginning of the project, we predicted the need to constantly update which requests were supported by the new storefront as we continued to migrate features. We decided to build a rule system that allows us to add new routing rules easily.
Starting out, we kept the rules in a Lua file in our nginx repository, and kept the enabled/disabled status in our control plane. This allowed us to quickly disable a rule without having to wait for a deploy if something went wrong. It proved successful, and at this point in the project, enabling and disabling rules was a breeze. However, our workflow for changing the rules was cumbersome, and we wanted this process to be even faster. We decided to store the whole rule as a JSON payload in our control plane. We used spy to create, update, and delete rules in addition to the previous functionality of enabling and disabling the rules. We only needed to deploy nginx to add new functionality.
The Power of Dynamic Rules
Fast continuous integration (CI) and deployment times are great ways to increase the velocity of getting changes into production. However, for time-critical use cases like ours, removing CI and deployment altogether is even better. Moving the rule system into the control plane and using spy to manipulate the rules removed the entire CI and deployment process. We still require a "code review" on enabled spy commands or before enabling a new command, but that's a trivial amount of time compared to the full deploy process used prior.
Before diving into the different options available for configuration, let's look at what it's like to create a rule with spy: creating it, inspecting it, and then deleting it. The rule in this example was never enabled, and enabling it would require approval from another member of the team. We are affecting a large share of real traffic on the Shopify platform when running the command `spy storefront_renderer enable example-rule`, so the rules of good code review still apply.
Configuring New Rules
Now let’s review the different options available when creating new rules.
| Option Name | Description | Default | Example |
| --- | --- | --- | --- |
| rule_name | The identifier for the rule. | | products-json |
| filters | A comma-separated list of filters. | | is_product_list_json_read |
| shop_ids | A comma-separated list of shop ids to which the rule applies. | all | |
The `rule_name` is the identifier we use. It can be any string, but it's usually descriptive of the type of request it matches.
The `shop_ids` option lets us choose to have a rule target all shops or target a specific shop for testing. For example, test shops allow us to test changes without affecting real production data. This is useful to test rendering live requests with the new storefront because verification requests happen in the background and don't directly affect client requests.
Next, the `filters` option determines which requests match a rule. This allows us to partition the traffic into smaller subsets and target individual controller actions from our legacy Ruby on Rails implementation. The filters are kept in a Lua file, and the filters option is a comma-separated list of function names that are applied to the request in a functional style; if all filters return true, the rule matches the request. A change to the filters list does require us to go through the full deployment process.
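As a rough sketch, a filter is just a predicate over the request; the request interface shown here is an assumption for illustration, not the real module's API:

```lua
-- Hedged sketch of a routing-rule filter; the request table is assumed.
local function is_product_list_path(request)
  -- match HTTP GET requests for the storefront products JSON API
  return request.method == "GET"
    and string.find(request.path, "^/products%.json$") ~= nil
end
```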
The filter above, `is_product_list_path`, is implemented in Lua and lets us target HTTP GET requests to the storefront products JSON API.
| Option Name | Description | Default | Example |
| --- | --- | --- | --- |
| render_rate | The rate at which we render allowed requests. | 0 | 1 |
| verify_rate | The rate at which we verify requests. | 0 | 0 |
| reverse_verify_rate | The rate at which requests are reverse-verified when rendering from the new storefront. | 0 | 0.001 |
Both `render_rate` and `verify_rate` allow us to target a percentage of traffic that matches a rule. This is useful for doing gradual rollouts of rendering a new endpoint or verifying a small sample of production traffic.
The `reverse_verify_rate` is the rate used when a request is already being rendered with the new storefront. It lets us first render the request with the new storefront and then send it to the legacy implementation asynchronously for verification. We call this scenario a reverse-verification, as it's the opposite, or reverse, of the original flow, where the request was rendered by the legacy storefront and then sent to the new storefront for verification; we call that original flow forward-verification. We use forward-verifications to find issues as we implement new endpoints and reverse-verifications to help detect and track down regressions.
| Option Name | Description | Default | Example |
| --- | --- | --- | --- |
| self_verify_rate | The rate at which we verify requests in the nearest region. | 0 | 0.001 |
Now is a good time to introduce self-verification and the associated `self_verify_rate`. One limitation of the legacy storefront implementation was that our Shopify pod architecture gave only one region access to the MySQL writer at any given time, which meant all requests had to go to the active region of a pod. With the new storefront, we decoupled the storefront rendering service from the database writer and now serve storefront requests from any region where a MySQL replica is present.
However, as we started decoupling dependencies on the active region, we found ourselves wanting to verify requests not only against the legacy storefront but also against the active and passive regions with the new storefront. This led us to add the `self_verify_rate`, which allows us to sample requests bound for the active region and verify them against the storefront deployment in the local region.
We have found the routing rules flexible, and they made it easy to add new features or prototypes that are usually quite difficult to roll out. You might be familiar with how we generate load for testing the system's limits. However, these load tests often fall victim to our load shedder if the system gets overwhelmed: we drop any request coming from our load generator to avoid negatively affecting a real client's experience. Before BFCM 2020, we wondered how the application would behave if the connections to our dependencies, primarily Redis, went down. We wanted to be as resilient as possible to those types of failures, but this isn't quite the same as testing with a load generation tool because these tests could affect real traffic.

The team decided to stand up a whole new storefront deployment, and instead of routing any traffic to it, we used the verifier mechanism to send duplicate requests to it. We then disconnected the new deployment from Redis and turned our load generator up to max. Now we had data about how the application performed under partial outages, and we were able to dig in and improve the resiliency of the system before BFCM. These are just some of the ways we leveraged our flexible routing system to quickly and transparently change the underlying storefront implementation.
Implementation
I’d like to walk you through the main entry point for the storefront Lua module to show more of the technical implementation. First, here is a diagram showing where each nginx directive is executed during the request processing.
During the rewrite phase, before the request is proxied upstream to the rendering service, we check the routing rules to determine which storefront implementation the request should be routed to. Next, during the header filter phase, we check whether the request should be forward-verified (if it went to the legacy storefront) or reverse-verified (if it went to the new storefront). Finally, if we're verifying the request (forward or reverse), in the log phase we queue a copy of the original request to be made to the opposite upstream after the request cycle has completed.
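A condensed sketch of what that looks like follows; everything except `find_matching_rule`, the `routing_method` values, and the nginx phases is an assumption for illustration:

```lua
-- Hedged sketch of the module's entry points; module layout, upstream
-- names, and context fields are assumed.
local rules = require("storefront.rules")
local verifier = require("storefront.verifier")

local function rewrite_phase()
  -- "render" routing_method: pick the upstream implementation
  local rule = rules.find_matching_rule("render")
  if rule then
    ngx.var.upstream = "new_storefront"
  else
    ngx.var.upstream = "legacy_storefront"
  end
end

local function header_filter_phase()
  -- "verify" routing_method: decide if this request gets verified
  local rule = rules.find_matching_rule("verify")
  ngx.ctx.should_verify = rule ~= nil
end

local function log_phase()
  if ngx.ctx.should_verify then
    -- queue a duplicate request to the opposite upstream
    verifier.send_verification_request_in_background(ngx.ctx.request_args,
      ngx.ctx.response_state)
  end
end
```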
In the sketch above, the renderer module (used in the rewrite and header filter phases) and the verifier module (used in the header filter and log phases) call the same function, `find_matching_rule`, from the storefront rules module to get the matching rule from the control plane. The `routing_method` parameter is passed in to determine whether we're looking for a rule that matches for rendering or for verifying the current request.
Lastly, the verifier module uses nginx timers to send the verification request out of band of the original request, so we don't leave the client waiting on both upstreams. The `send_verification_request_in_background` function shown below is responsible for queuing the verification request. To duplicate the request and verify it, we keep track of the original request arguments and the response state from either the legacy storefront (in the case of a forward-verification) or the new storefront (in the case of a reverse-verification), and pass them as arguments to the timer, since we won't have access to this information in the timer's context.
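A hedged sketch of that function, assuming OpenResty's `ngx.timer.at`; the helper and argument names are assumptions:

```lua
-- Only ngx.timer.at and the overall shape come from the description
-- above; the rest is assumed for illustration.
local function send_verification_request_in_background(request_args, response_state)
  -- a zero-delay timer fires after the client response has been sent
  local ok, err = ngx.timer.at(0, function(premature, args, state)
    if premature then
      return
    end
    -- replay the original request against the opposite upstream and
    -- compare the two responses
    verify_against_opposite_upstream(args, state)
  end, request_args, response_state)

  if not ok then
    ngx.log(ngx.ERR, "failed to queue verification request: ", err)
  end
end
```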
The Future of Routing Rules
At this point, we're starting to simplify this system because the new storefront implementation serves almost all storefront traffic. Once the migration is complete, we'll no longer need to verify or render traffic with the legacy storefront implementation, so we'll be undoing the work we've done and going back to the hardcoded rules approach of the early days of the project. That doesn't mean the routing rules are going away completely, though: their flexibility allowed us to build the verification system and stand up a separate deployment for load testing, features that weren't possible with the legacy storefront implementation. While we won't be changing the routing between storefront implementations, the rule system will evolve to support new features.
Derek Stride is a Senior Developer on the Storefront Renderer team. He joined Shopify in 2016 as an intern and transitioned to a full-time role in 2018. He has worked on multiple projects and has been involved with the Storefront Renderer rewrite since the project began in 2019.
Read Consistency with Database Replicas
At Shopify, we’ve long used database replication for redundancy and failure recovery, but only fairly recently started to explore the potential of replicas as an alternative read-only data source for applications. Using read replicas holds the promise of enhanced performance for read-heavy applications, while alleviating pressure on primary servers that are needed for time-sensitive read/write operations.
There's one unavoidable factor that can cause problems with this approach: replication lag. In other words, applications reading from replicas might be reading stale data—maybe seconds or even minutes old. Depending on the specific needs of the application, this isn’t necessarily a problem. A query may return data from a lagging replica, but any application using replicated data has to accept that it will be somewhat out of date; it’s all a matter of degree. However, this assumes that the reads in question are atomic operations.
In contrast, consider a case where related pieces of data are assembled from the results of multiple queries. If these queries are routed to various read replicas with differing amounts of replication lag and the data changes in midstream, the results could be unpredictable.
For example, a row returned by an initial query could be absent from a related table if the second query hits a more lagging replica. Obviously, this kind of situation could negatively impact the user experience and, if these mangled datasets are used to inform future writes, then we’re really in trouble. In this post, I’ll show you the solution the Database Connection Management team at Shopify chose to solve variable lag and how we solved the issues we ran into.
Tight Consistency
One way out of variable lag is by enforcing tight consistency, meaning that all replicas are guaranteed to be up to date with the primary server before any other operations are allowed. This solution is expensive and negates the performance benefits of using replicas. Although we can still lighten the load on the primary server, it’s at the cost of delayed reads from replicas.
Causal Consistency
Another approach we considered (and even began to implement) is causal consistency based on global transaction identifier (GTID). This means that each transaction in the primary server has a GTID associated with it, and this GTID is preserved as data is replicated. This allows requests to be made conditional upon the presence of a certain GTID in the replica, so we can ensure replicated data is at least up to date with a certain known state in the primary server (or a replica), based on a previous write (or read) that the application has performed. This isn’t the absolute consistency provided by tight consistency, but for practical purposes it can be equivalent.
The main disadvantage to this approach is the need to implement software running on each replica which would report its current GTID back to the proxy so that it can make the appropriate server selection based on the desired minimum GTID. Ultimately, we decided that our use cases didn’t require this level of guarantee, and that the added level of complexity was unnecessary.
Our Solution to Read Consistency
Other models of consistency in replicated data necessarily involve some kind of compromise. In our case, we settled on a form of monotonic read consistency: successive reads will follow a consistent timeline, though not necessarily reading the latest data in real time. The most direct way to ensure this is for any series of related reads to be routed to the same server, so successive reads will always represent the state of the primary server at the same time or later, compared to previous reads in that series.
In order to simplify implementation and avoid unnecessary overhead, we wanted to offer this functionality on an opt-in basis, while at the same time avoiding any need for applications to be aware of database topology and manage their own use of read replicas. To see how we did this, let’s first take a step back.
Application access to our MySQL database servers is through a proxy layer provided by ProxySQL using the concept of hostgroups: essentially pools of interchangeable servers which look like a single data source from the application’s point of view.
When a client application connects with a user identity assigned to a given hostgroup, the proxy routes its individual requests to any randomly selected server within that hostgroup. (This is somewhat oversimplified in that ProxySQL incorporates considerations of latency, load balancing, and such into its selection algorithm, but for purposes of this discussion we can consider the selection process to be random). In order to provide read consistency, we modified this server selection algorithm in our fork of ProxySQL.
Any application which requires read consistency within a series of requests can supply a unique identifier common to those requests. This identifier is passed within query comments as a key-value pair:
```sql
/* consistent_read_id:<some unique ID> */ SELECT <fields> FROM <table>
```
The ID we use to identify requests is always a universally unique identifier (UUID) representing a job or any other series of related requests. This `consistent_read_id` forms the basis for a pseudorandom but repeatable index into the list of servers, replacing the default random indexing that takes place in the absence of this identifier. Let's see how.
First, a hashing algorithm is applied to the `consistent_read_id` to yield an integer value. We calculate the modulo of this number and the number of servers, and that becomes our pseudorandom index into the list of available servers. Repeated application of this algorithm yields the same pseudorandom result, maintaining read consistency over a series of requests that specify the same `consistent_read_id`. This explanation is simplified in that it ignores the server weighting, which is configurable in ProxySQL. Here's what an example might look like, including the server weighting (an illustrative sketch; the real logic lives in our fork of ProxySQL, and the trivial hash below is a stand-in):
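```lua
-- Illustrative sketch only: the production implementation lives in
-- Shopify's ProxySQL fork; this hash is a stand-in for the real one.
local function select_server(servers, consistent_read_id)
  -- expand the list so each server appears once per unit of weight
  local expanded = {}
  for _, server in ipairs(servers) do
    for _ = 1, server.weight do
      expanded[#expanded + 1] = server
    end
  end

  -- hash the consistent_read_id into an integer
  local hash = 0
  for i = 1, #consistent_read_id do
    hash = (hash * 31 + string.byte(consistent_read_id, i)) % 2147483647
  end

  -- the modulo gives a pseudorandom but repeatable index
  return expanded[(hash % #expanded) + 1]
end
```

Repeated calls with the same `consistent_read_id` land on the same server, while different IDs spread across the servers in proportion to their weights.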
A Couple of Bumps in the Road
I’ve covered the basics of our consistent-read algorithm, but there were a couple of issues to be addressed before the team got it working perfectly.
The first one surfaced during code review and relates to situations where a server becomes unavailable between successive consistent read requests. If the unavailable server is the one that was previously selected (and therefore would’ve been selected again), a data inconsistency is possible—this is a built-in limitation of our approach. However, even if the unavailable server isn’t the one that would’ve been selected, applying the selection algorithm directly to the list of available servers (as ProxySQL does with random server selection) could also lead to inconsistency, but in this case unnecessarily. To address this issue, we index into the entire list of configured servers in the host group first, then disqualify the selected server and reselect if necessary. This way, the outcome is affected only if the selected server is down, rather than having the indexing potentially affected for others in the list. Discussion of this issue in a different context can be found on ebrary.net.
The second issue was discovered as an intermittent bug that led to inconsistent reads in a small percentage of cases. It turned out that ProxySQL was doing an additional round of load balancing after initial server selection. For example, in a case where the target server weighting was 1:1 and the actual distribution of server connections drifted to 3:1, any request would be forcibly rerouted to the underweighted server, overriding our hash-based server selection. By disabling the additional rebalancing in cases of consistent-read requests, these sneaky inconsistencies were eliminated.
Currently, we're evaluating strategies for incorporating flexible use of replication lag measurements as a tuneable factor that we can use to modify our approach to read consistency. Hopefully, this feature will continue to appeal to our application developers and improve database performance for everyone.
Our approach to consistent reads has the advantage of relative simplicity and low overhead. Its main drawback is that server outages (especially intermittent ones) will tend to introduce read inconsistencies that may be difficult to detect. If your application is tolerant of occasional failures in read consistency, this hash-based approach to implementing monotonic read consistency may be right for you. On the other hand, if your consistency requirements are more strict, GTID-based causal consistency may be worth exploring. For more information, see this blog post on the ProxySQL website.
Thomas has been a software developer, a professional actor, a personal trainer, and a software developer again. Currently, his work with the Database Connection Management team at Shopify keeps him learning and growing every day.
Managing Google Cloud Platform Project-Wide SSH Keys
Resiliency Planning for High-Traffic Events
On January 27, 2021, Shipit!, our monthly event series, presented Building a Culture of Resiliency at Shopify. Learn about creating and maintaining resiliency plans for large development teams, testing and tooling, developing incident strategies, and incorporating and improving feedback loops. The video is now available.
Each year, Black Friday Cyber Monday weekend represents the peak of activity for Shopify. Not only is this the most traffic we see all year, but it’s also the time our merchants put the most trust in our team. Winning this weekend each year requires preparation, and it starts as soon as the weekend ends.
Load Testing & Stress Testing: How Does the System React?
When preparing for a high traffic event, load testing regularly is key. We have discussed some of the tools we use already, but I want to explain how we use these exercises to build towards a more resilient system.
While we use these tests to confirm that we can sustain required loads or probe for new system limits, we can also use regular testing to find potential regressions. By executing the same experiments on a regular basis, we can spot any trends at easily handled traffic levels that might spiral into an outage at higher peaks.
This same tool allows us to run similar loads against differently configured shops and look for differences caused by the theme, configuration, and any other dimensions we might want to use for comparison.
Resiliency Matrix: What are Our Failure Modes?
If you've read How Complex Systems Fail, you know that "Complex systems are heavily and successfully defended against failure" and "Catastrophe requires multiple failures - single point failures are not enough.” For that to be true, we need to understand our dependencies, their failure modes, and how those impact the end-user experience.
We ask teams to construct a user-centric resiliency matrix, documenting the expected user experience under various scenarios. For example:
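A hypothetical matrix, purely for illustration, might look like this:

| Scenario | User can browse storefront | User can search | User can check out |
| --- | --- | --- | --- |
| Redis down | Unaffected | Unaffected | Down |
| MySQL replica lag | Unaffected | Stale results | Unaffected |
| CDN outage | Degraded (slow assets) | Unaffected | Degraded |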
The act of writing this matrix serves as a very basic tabletop chaos exercise. It forces teams to consider how well they understand their dependencies and what the expected behaviors are.
This exercise also provides a visual representation of the interactions between dependencies and their failure modes. Looking across rows and columns reveals areas where the system is most fragile. This provides the starting point for planning work to be done. In the above example, this matrix should start to trigger discussion around the ‘User can check out’ experience and what can be done to make this more resilient to a single dependency going ‘down’.
Game Days: Do Our Models Match?
So, we’ve written our resilience matrix. This is a representation of our mental model of the system, and when written, it's probably a pretty accurate representation. However, systems change and adapt over time, and this model can begin to diverge from reality.
This divergence is often unnoticed until something goes wrong, and you’re stuck in the middle of a production incident asking “Why?”. Running a game day exercise allows us to test the documented model against reality and adjust in a controlled setting.
The plan for the game day will derive from the resilience matrix. For the matrix above, we might formulate a plan like:
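For illustration, such a plan might look like this (the scenarios and expectations are hypothetical):

| Scenario | Method | Expected behaviour |
| --- | --- | --- |
| Redis down | Block traffic to Redis at the network layer for 10 minutes | Checkout is down; browsing unaffected; Redis availability alert fires and the on-call team is paged |
| MySQL replica lag | Inject artificial replication delay on one replica | Search returns stale results; replication lag monitor triggers a warning |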
Here, we are laying out what scenarios are to be tested, how those will be accomplished, and what we expect to happen.
We're not only concerned with external effects (what works, what doesn't) but also with internal ones: do the expected alerts fire, are the appropriate on-call teams paged, and do those folks have the information they need to understand what is happening?
If we refer back to How Complex Systems Fail, the defences against failure are technical, human, and organizational. On a good game day, we’re attempting to exercise all of these.
- Do any automated systems engage?
- Do the human operators have the knowledge, information and tools necessary to intervene?
- Do the processes and procedures developed help or hinder responding to the outage scenario?
By tracking the actual observed behavior, we can then update the matrix as needed or make changes to the system in order to bring our mental model and reality back into alignment.
Incident Analysis: How Do We Get Better?
During the course of the year, incidents happen which disrupt service in some capacity. While the primary focus is always in restoring service as fast as possible, each incident also serves as a learning opportunity.
This article is not about why or how to run a post-incident review; there are more than enough well-written pieces by folks who are experts on the subject. But to refer back to How Complex Systems Fail, one of the core tenets in how we learn from incidents is “Post-accident attribution to a ‘root cause’ is fundamentally wrong.”
When focusing on a single root cause, we stop at easy, shallow actions to resolve the ‘obvious’ problem. However, this ignores deeper technical, organizational, and cultural issues that contributed to the issue and will again if uncorrected.
What’s Special About BFCM?
We’ve talked about the things we’re constantly doing, year-round to ensure we’re building for reliability and resiliency and creating an anti-fragile system that gets better after every disruption. So what do we do that’s special for the big weekend?
We’ve already mentioned How Complex Systems Fail several times, but to go back to that well once more, “Change introduces new forms of failure.” As we get closer to Black Friday, we slow down the rate of change.
This doesn’t mean we’re sitting on our hands and hoping for the best, but rather we start to shift where we’re investing our time. Fewer new services and features as we get closer, and more time spent dealing with issues of performance, reliability, and scale.
We review defined resilience matrices carefully, run more frequent game days and load tests, and work on any issues or bottlenecks those reveal. This means updating runbooks, refining internal tools, and shipping fixes for issues that this activity brings to light.
All of this comes together to provide a robust, reliable platform to power over $5.1 billion in sales.
Shipit! Presents: Building a Culture of Resiliency at Shopify
Watch Ryan talk about how we build a culture of resiliency at Shopify to ensure a robust, reliable platform powering over $5.1 billion in sales.
Capacity Planning at Scale
By Kathryn Tang and Kir Shatrov
The fourth Thursday in November is Thanksgiving in the United States. The day after, Black Friday (coined in 1961), is the first day of the Christmas shopping season and since 2005 it’s the busiest shopping day of the year in North America. Cyber Monday is a more recent development. Getting its name in 2005, it refers to the Monday after the Thanksgiving weekend where retailers focus on sales offered online. At Shopify, we call the weekend including Black Friday and Cyber Monday BFCM.
From the engineering team’s point of view, every BFCM challenges the platform and all the things we’ve shipped throughout the year:
- Would our clusters handle two times the number of virtual machines?
- Would we hit some sort of limitation on the new network design?
- Would the new logging pipeline handle such an increase in traffic?
- What’s going to be the next scalability bottleneck that we hit?
The other challenge is planning capacity. We need to understand the magnitude of traffic ahead of us and how many resources, like CPUs and storage, we'll need to handle BFCM sales. On top of that, we need enough headroom in case something unexpected happens or we need to perform a regional failover.
Since 2017, we’ve partnered with Google Cloud Platform (GCP) as our main vendor for the cloud. Over these years, we’ve worked closely with their team on our capacity models, and prior to every BFCM that collaboration gets even closer.
In this post, we’ll cover our approaches to capacity planning, and how we rolled it out across the org and to dozens of teams. We’ll also share how we validated our capacity plans with scalability tests to make sure they work.
Capacity Planning
Our Google Cloud resource needs depend on how much traffic our merchants see during BFCM. We worked with our data scientists to forecast traffic levels and set those levels as a bar for our platform to scale to. Additionally, we looked into historical numbers, applied a safety margin, and projected how many buyers would check out or view online stores.
We created a master resourcing plan for our Google Cloud implementation and estimated how things like CPUs and storage would scale to BFCM traffic levels. Owners for our top 10 or so resource areas were tasked to estimate what they needed for BFCM. These estimates were detailed breakdowns of the machine types, geographic locations, and quantities of resources like CPUs. We also added buffers to our overall estimates to allow flexibility to change our resourcing needs, move machines across projects, or failover traffic to different regions if we needed to. What also helps is that we partition each component into a separate GCP project, which makes it a lot easier to think of quotas per every project.
2020 is an exceptionally difficult year to plan for. Normally, we’d look at BFCM trends from years prior and predict BFCM traffic with a fairly high level of confidence. This year, COVID-19 lockdowns drove a rapid shift to selling online this spring, and we didn't know what to expect. Would we see a massive increase in online traffic this BFCM, or a global economic depression where consumers stopped buying much at all? To manage heightened uncertainty, we forecasted multiple scenarios and their respective needs for our cloud deployment.
From an investment perspective, planning for the largest scale scenario means spending a lot of money very quickly to handle sales that might not happen. Alternatively, not deploying enough machines means having too little computing power and putting our merchant storefronts at risk of outages. It was absolutely vital to avoid anything that would put our merchants at risk of downtime. We decided to scale to our more aggressive growth scenarios to ensure our platform is stable regardless of what happens. We’re transparent with our partners, finance teams, and internal teams about how we thought through these scenarios which helps them make their own operating decisions.
Scalability Testing
A sheet with a capacity plan is just a starting point. Once we start scaling to projected numbers, there’s a high chance that we’ll hit limits throughout our tech stack that need resolving. In a complex system, there’s always a limit like:
- the number of VMs in a network
- the number of packets that a busy Memcached server can accept
- the number of MB/s your logging pipeline can handle
Historically, every BFCM brought us some scalability surprises, and what’s worse, we’d only notice them when fully scaled prior to BFCM. That left too little time to come up with mitigation plans.
Back in 2018, we decided that a “faux” BFCM in the middle of the year would increase our resilience as an organization and push us to find unknowns that we’d otherwise only discover during the real thing. As we started doing that, it allowed us to find problems at scale more often and created that mental muscle of preparing for critical events and finding unknowns. If you’re exercising and something feels hard, you train more and eventually your muscles get better. Shopify treats BFCM the same way.
We’ve started the practice of regular scale-up testing at Shopify, and of course we made sure to come up with fun names for each. We’ve had Mayday (2019), Spooky scale-up (2019), and Oktoberfest scale-up (2020). Another fun fact is that our Waterloo teams play a large part in running this testing, and the dates of our Oktoberfest matched the city of Kitchener-Waterloo’s Oktoberfest festivities (It’s the second-largest Oktoberfest in the world).
Oktoberfest scale-up's goal was to simulate this year's expected BFCM load based on the traffic forecasts from the data science team. Running Shopify in the cloud on Google Kubernetes Engine allowed us to grab extra compute capacity just for the window of the exercise, and to pay only for the hours we needed it.
Investment in our internal load testing tooling over the years is fundamental to our ability to run such large-scale, platform-wide load tests. We've talked about go-lua, an open source project that powers our load testing tool. Thanks to embedded Lua, we feed it a high-level set of steps for what we want to test: actions like browsing the storefront, adding a product to a cart, proceeding to checkout, and processing the transaction through a mock payment gateway.
Thanks to Oktoberfest scale-up, we identified and then fixed some bottlenecks that could have become an issue for the real BFCM. Doing the test in early October gave us time to address issues.
After addressing all the issues, we repeated the scale-up test to see how our mitigations helped. Seeing that going smoothly increased our confidence levels about the upcoming Black Friday and reduced stress levels for all teams.
We strive for a smooth BFCM and spend a lot of time preparing for it, from capacity planning, to setting the expectations for our vendors, to load testing, and failover simulations. Beyond delivering a smooth holiday season for our merchants, BFCM is time to reflect on the future. As Shopify continues to grow, BFCM traffic levels can become the normal everyday loads we see in the next year. Our job is to bring lessons from events like BFCM to make our systems even more automated, more dynamic, and more resilient. We relish this opportunity to think about where Shopify is going and to architect our platform to scale with it.
Kir Shatrov is an Engineering Lead who’s been with Production Engineering at Shopify for the past five years, working on areas from CI/CD infrastructure to sharding and capacity planning.
Kathryn Tang is an Engineering Program Manager who manages our Google Cloud relationship. She has been at Shopify for 4 years, working with a multitude of R&D and commercial teams to derive business insights and guide operating decisions to help us scale.
Pummelling the Platform–Performance Testing Shopify
Developing a product or service at Shopify requires care and consideration. When we deploy new code at Shopify, it’s immediately available for merchants and their customers. When over 1 million merchants rely on Shopify for a successful Black Friday Cyber Monday (BFCM), it’s extremely important that all merchants—big and small—can run sales events without any surprises.
We typically classify a "sales event" as any single event that attracts a high volume of buyers to Shopify merchants. This could be a product launch, a promotion, or an event like BFCM. In fact, BFCM 2020 was the largest single sales event Shopify has ever seen, and many of the largest merchants on the planet ran some of the biggest flash sales ever observed.
In order to ensure that all sales are successful, we regularly and repeatedly simulate large sales internally before they happen for real. We proactively identify and eliminate issues and bottlenecks in our systems using simulated customer traffic on representative test shops. We do this dozens of times per day using a few tools and internal processes.
I’ll give you some insight into the tools we use to raise confidence in our ability to serve large sales events. I’ll also cover our experimentation and regression framework we built to ensure that we’re getting better, week-over-week, at handling load.
We use “performance testing” as an umbrella term that covers different types of high-traffic testing including (but not limited to) two types of testing that happen regularly at Shopify: “load testing” and “stress testing”.
Load testing verifies that a service under load can withstand a known level of traffic or specific number of requests. An example load test is when a team wants to confirm that their service can handle 1 million requests per minute for a sustained duration of 15 minutes. The load test will confirm (or disconfirm) this hypothesis.
Stress testing, on the other hand, is when we want to understand the upper limit of a particular service. We do this by increasing the amount of load—sometimes very quickly—to the service being tested until it crumbles under pressure. This gives us a good indication of how far individual services at Shopify can be pushed, in general.
We condition our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like BFCM. Through proactive load tests and stress tests, we have a really good picture of what to expect even before a flash sale kicks off.
Enabling Performance Testing at Scale
A platform as big and complex as Shopify has many moving parts, and each component needs to be finely tuned and prepared for large sales events. Not unlike a sports car, each individual part needs to be tested under load repeatedly to understand performance and throughput capabilities before assembling all the parts together and taking the entire system for a test drive.
Individual teams creating services at Shopify are responsible for their own performance testing on the services they build. These teams are best positioned to understand the inner workings of the services they own and potential bottlenecks or situations that may be overwhelmed under extreme load, like during a flash sale. To enable these teams, performance testing needs to be approachable, easy to use and understand, and well-supported across Shopify. The team I lead is called Platform Conditioning, and our mission is to constantly improve the tooling, process, and culture around performance testing at Shopify. Our team makes all aspects of the Shopify platform stronger by simulating large sales events and making high-load events a common and regular occurrence for all developers. Think of Platform Conditioning as the personal trainers of Shopify. It’s Platform Conditioning that can help teams develop individualized workout programs and set goals. We also provide teams with the tools they need in order to become stronger.
Generating Realistic Load
At the heart of all our performance testing, we create "load": requests that hit specific endpoints of the app or service under test. The resulting stress is what, in the end, makes the service stronger.
Not all requests are equal though, and so stress testing and load testing are never as easy as tracking the sheer volume of requests received by a service. It’s wise to hit a variety of realistic endpoints when testing. Some requests may hit CDNs or layers of caching that make the response very lightweight to generate. Other requests, however, can be extremely costly and include multiple database writes, N+1 queries, or other buried treasures. It’s these goodies that we want to find and mitigate up front, before a sales event like BFCM 2020.
For example, a request to a static CSS file is served from a CDN node in 40ms without creating any load to our internal network. Comparatively, making a search query on a shop hits three different layers of caching and queries Redis, MySQL, and Elasticsearch with total round-trip time taking 1.5 seconds or longer.
Another important factor to generating load is considering the shape of the traffic as it appears at our load balancers. A typical flash sale is extremely spiky and can begin with a rush of customers all trying to purchase a limited product simultaneously. It’s very important to simulate this same traffic shape when generating load and to run our tests for the same duration that we would see in the wild.
A systems diagram showing how we generate load with go-lua
When generating load, we use a homegrown internal tool that generates raw requests to other services and receives responses from them. There are two main pieces to this tool: the coordinator and the group of workers that generate the load. Our load generator is written in Go and executes small scripts written in Lua called "flows". Each worker runs a Go binary and uses a very fast and lightweight Lua VM for executing the flows (the go-lua VM we use is open source and can be found on GitHub). Through this, the steps of a flow can scale to issue tens of millions of requests per minute or more. This technique stresses (or overwhelms) specific endpoints of Shopify and allows us to conduct formal tests using the generated load.
We use our internal ChatOps tool, ‘Spy’, to enqueue tests directly from Slack, so everyone can quickly see when a load test has kicked off and is running. Spy will take care of issuing a request to the load generator and starting a new test. When a test is complete, some handy links to dashboards, logs, and overall results of the test are posted back in Slack.
Here’s a snippet of a flow, written in Lua, that browses a Shopify storefront and logs into a customer account—simulating a real buyer visiting a Shopify store:
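The sketch below stands in for that flow; the session helper API is an assumption rather than the internal tool's real interface:

```lua
-- Hedged sketch of a flow; helper names are assumed.
local function browse_and_login(session)
  -- browse the storefront like a real buyer
  session:get("/")
  session:get("/collections/all")
  session:get("/products/example-product")

  -- log into a customer account; cookies are kept for later requests
  session:get("/account/login")
  session:post("/account/login", {
    ["customer[email]"] = "buyer@example.com",
    ["customer[password]"] = "not-a-real-password",
  })
end

return browse_and_login
```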
Just like a web browser, when a flow is executing it sends and receives headers, makes requests, receives responses and simulates many browser actions like storing cookies for later requests. However, an executing flow won’t automatically make subsequent requests for assets on a page and can’t execute Javascript returned by the server. So our load tests don’t make any XMLHttpRequest (XHR) requests, Javascript redirects or reloads that can happen in a full web browser.
So our basic load generator is extremely powerful for generating a great deal of load, but in its purest form it can only hit the specific endpoints defined by the author of a flow. What we create as "browsing sessions" are only a streamlined series of instructions, including just a few specific requests for each page. We want all our performance testing to be as realistic as possible, simulating real user behaviour in simulated sales events and generating all the same requests that actual browsers make. To accomplish this, we needed to bridge the gap between scripted load generation and the realistic functionality provided by real web browsers.
Simulating Reality with HAR-based Load Testing
Our first attempt at simulating real customers and adding realism to our load tests was an excellent idea, but fairly naive when it came to how much computing power it would require. We spent a few weeks exploring browser-based load testing. We researched tools that were already available and created our own using headless browsers and tools like Puppeteer. My team succeeded in making realistic browsing sessions, but unfortunately the overhead of using real browsers dramatically increased both computing costs and real money costs. With browser-based solutions, we could only drive thousands of browsing sessions at a time, and Shopify needs something that can scale to tens of millions of sessions. Browsers provide a lot of functionality, but they come with a lot of overhead.
After realizing that browser-based load generation didn’t suit our needs, my team pivoted. We were still driving to add more realism to our load tests, and we wanted to make all the same requests that a browser would. If you open up your browser’s Developer Tools, and look at the Network tab while you browse, you see hundreds of requests made on nearly every page you visit. This was the inspiration for how we came up with a way to test using HTTP Archive (HAR) files as a solution to our problems.
A small sample of requests made by a single product page
HAR files are detailed JSON representations of all the network requests and responses made by most popular browsers. You can export HAR files easily from your browser or from web proxy tools like Charles Proxy. A single HAR file includes all of the requests made during a browsing session and is easy to save, examine, and share. We leveraged this concept and created a HAR-based load testing solution. We even gave it a tongue-in-cheek name: Hardy Har Har.
Hardy Har Har (or simply HHH for those who enjoy brevity) bridges the gap between simple, lightweight scripted load tests and full-fledged, browser-based load testing. HHH will take a HAR file as input and extract all of the requests independently, giving the test author the ability to pick and choose which hostnames can be targeted by their load test. For example, we nearly always remove requests to external hostnames like Google Analytics and requests to static assets on CDN endpoints (They only add complexity to our flows and don’t require load testing). The resulting output of HHH is a load testing flow, written in Lua and committed into our load testing repository in Git. Now—literally at the click of a button—we can replay any browsing session in its full completeness. We can watch all the same requests made by our browser, scaled up to millions of sessions.
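To give a feel for the approach, here's a hedged sketch of the HAR-to-requests extraction; the json helper and output shape are assumptions, while the log.entries structure comes from the HAR format itself:

```lua
-- Hedged sketch of the HAR-to-flow idea; not HHH's real code.
local json = require("json")

local function har_to_requests(har_text, allowed_hosts)
  local har = json.decode(har_text)
  local requests = {}
  for _, entry in ipairs(har.log.entries) do
    local host = string.match(entry.request.url, "^https?://([^/]+)")
    if host and allowed_hosts[host] then
      -- keep only requests targeting hostnames chosen by the test author
      requests[#requests + 1] = {
        method = entry.request.method,
        url = entry.request.url,
      }
    end
  end
  return requests
end
```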
Of course, there are some aspects of a browsing session that can’t be simply replayed as-is. Things like logging into customer accounts and creating unique checkouts on a Shopify store need dynamic handling that HHH recognizes and intelligently swaps out the static requests and inserts dynamic logic to perform the equivalent functionality. Everything else lives in Lua and can be ripped apart or edited manually giving the author complete control of the behaviour of their load test.
Taking a Scientific Approach to Performance Testing
The final step to having great performance testing leading up to a sales event is clarity in observations and repeatability of experiments. At Shopify, we ship code frequently, and anyone can deploy changes to production at any point in time. Similarly, anyone can kick off a massive load test from Slack whenever they please. Given the tools we’ve created and the simplicity in using them, it’s in our best interest to ensure that performance testing follows the scientific method for experimentation.
Applying the scientific method of experimentation to performance testing
Developers are encouraged to develop a clear hypothesis relating to their product or service, perform a variety of experiments, observe the results of various experiment runs, and formulate a conclusion that relates back to their hypothesis.
All this formality in process can be a bit of a drag when you’re developing, so the Platform Conditioning team created a framework and tool for load test experimentation called Cronograma. Cronograma is an internal Rails app making it easy for anyone to set up an experiment and track repeated runs of a performance testing experiment.
Cronograma enforces the formal use of experiments to track both stress tests and load tests. The Experiment model has several attributes, including a hypothesis and one or more orchestrations that are coordinated load tests executed simultaneously in different magnitudes and durations. Also, each experiment has references to the Shopify stores targeted during a test and links to relevant dashboards, tracing, and logs used to make observations.
Once an experiment is defined, it can be run repeatedly. The person running an experiment (the runner) starts it from Slack with a command that creates a new experiment run. Cronograma kicks off the experiment and assigns a dedicated Slack channel for the tests, allowing multiple people to participate. During an experiment run, any number of things can happen, including exceptions, elevated traffic levels, and, in some cases, actual people being paged. We want to record all of these things. It's nice to have all of the details visible in Slack, especially when working with teams that are Digital by Default. Observations can be made by anyone, and comments are captured from Slack and added to a timeline for the run. Once the experiment completes, the runner terminates the run and logs a conclusion, based on their observations, that relates back to the original hypothesis.
We also included additional fanciness in Cronograma. The tool automatically detects whether any important monitors or alerts were triggered during the experiment by internal or third-party data monitoring applications. Whenever an alert is triggered, it's logged in the timeline for the experiment. We also retrieve metrics from our data warehouse automatically and consume them in Cronograma, allowing developers to track observed metrics between runs of the same experiment. For example:
- the response times of the requests made
- how many 5xx errors were observed
- how many requests per minute (RPM) were generated
All of this information is captured automatically so that running an experiment is useful and any run can be compared with any other run of the same experiment. It's imperative to understand whether a service is getting better or worse over time.
Cronograma is the home of formal performance testing experiments at Shopify. This application provides a place for all developers to conduct experiments and repeat past experiments. Hypotheses, observations, and conclusions are available for everyone to browse and compare to. All of the tools mentioned here have led to numerous performance improvements and optimizations across the platform, and they give us confidence that we can handle the next major sales event that comes our way.
The Best Things Go Unnoticed
Our merchants count on Shopify being fast and stable for all their traffic—whether they’re making their first sale, or they’re processing tens of thousands of orders per hour. We prepare for the worst case scenarios by proactively testing the performance of our core product, services, and apps. We expose problems and fix them before they become a reality for our merchants using simulations. By building a culture of load testing across all teams at Shopify, we’re prepared to handle sales events like BFCM and flash sales. My team’s tools make performance testing approachable for every developer at Shopify, and by doing so, we create a stronger platform for all our merchants. It’s easy to go unnoticed when large sales events go smoothly. We quietly rejoice in our efforts and the realization that it’s through strength and conditioning that we make these things possible.
Performance Scale at Shopify
Join Chris as he chats with Anita about how Shopify conditions our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like flash sales and Black Friday Cyber Monday.
Chris Inch is a technology leader and development manager living in Kitchener, Ontario, Canada. By day, he manages Engineering teams at Shopify, and by night, he can be found diving head first into a variety of hobbies, ranging from beekeeping to music to amateur mycology.
Using DNS Traffic Management to Add Resiliency to Shopify’s Services
If you're unfamiliar with DNS, traffic management, or why we would even use it, read Part 1: Introduction to DNS traffic management.
Distributed systems are only as resilient as we build them to be, and Domain Name System (DNS) traffic management (TM) is a well-used approach to making them so. In this second part of the two-part series, we're sharing Shopify's DNS traffic management journey, from numerous manually set-up, maintained, and updated traffic management approaches to the fully automated, self-served system used by 40+ domains owned by 12+ different teams, handling 100M+ requests per day.
Shopify’s Previous Approaches to DNS Traffic Management
DNS traffic management isn't entirely new at Shopify. Before we automated in 2019, a number of different teams had their own ways of doing traffic management through DNS changes, each bringing a different set of features and techniques for updating records.
Streaming Platform Team
The team handling our Kafka pipelines used Kubernetes ConfigMaps to define target clusters. So, making changes required
- creating a pull request (PR) to the git repository holding the deployment configuration
- getting the PR approved
- merging the PR
- waiting for the tests to pass
- waiting for the shipping pipeline to run and deploy the change, which can take a few minutes depending on the circumstances
Beyond the duration of this process, which isn't ideal for failover time, the manual approach also rules out any active/active configuration (where we share the traffic between two active clusters), since that would require using two target clusters at once while the ConfigMap only allows the one defined in it. At the time, being able to share traffic wasn't considered necessary for Kafka.
Search Platform Team
The team handling Elasticsearch set up the beginnings of DNS update automation. They used our chatops bot, spy, to run commands requesting failovers. The bot creates a PR in our record-store repository (which uses the record_store open source project) with the requested DNS change, which then needs to be approved, merged, and deployed after passing the tests. Except for the automation, it's a similar approach to the manual one, so it shares the same limitations around failover delays and active/active capabilities.
Edgescale Team
Part of the Edgescale team's responsibilities is to handle our assets (images and static files) and Content Delivery Networks (CDN), which bring the assets as close to the client as possible thanks to a network of distributed servers that store those files. To succeed in this mission, the team wanted more control and active/active capabilities, so they used DNS providers that allow weighted traffic management: they set weights on their records to define which share of traffic goes to which endpoint, letting them split traffic between their two CDN providers. To set this up, they used the DNS providers' APIs with weights that could span from 0 (disabled) to 15, using DNS A records (hostname to IP address). To make their lives easier, they wrote a spy cdn command responsible for making the API calls to the two DNS providers. This reduced the failover-time limitation and provided active/active capabilities.

However, we couldn't produce a stable, easy-to-reproduce, and version-controlled configuration of the providers. Adding endpoints to the traffic management had to be done manually, and was thus prone to errors. Moreover, the providers we use for our CDN needs don't perform the same way in every region of the world and incur different costs in those regions. Using different traffic shares depending on the geographical location of the requesters is one of the traffic-management use cases we presented in Part 1: Introduction to DNS traffic management, but that wasn't available here.
Other Teams
Finally, a few other teams decided to manually create records in one of the DNS providers used by the team handling our assets. They had fast failover and active/active capabilities but didn’t have a stable configuration, nor redundancy from using a single DNS provider.
We had too many setups going all over the place, which created maintainability problems, and any impactful change needed a lot of coordination and moving pieces. All of these setups had similar use cases and needs, so we started working on how to improve and consolidate our approach to DNS traffic management to
- reduce the limitations encountered by teams
- connect other teams with similar needs
- improve maintainability
- create ownership
Consolidation and Ownership of DNS Traffic Management
The first step of creating a better service used by many teams at Shopify was to define the requirements and goals. We wanted a reliable and redundant system that would provide
- regionalized traffic
- fine-grained traffic sharing
- failover capabilities
- easy setup and updating by the teams owning traffic-managed domains
The final state of our setup provides a unified way of setting up and handling our services. We use a git repository to store each domain's configuration, which is then deployed to two DNS providers. The configuration can be tweaked quickly and easily for both providers through a set of spy commands, allowing for efficient failovers. Let's talk about the choices we made to build this system, and how we built it.
Establish Reliability and Redundancy
Each domain name has a set of nameservers; when using a DNS client, one of those nameservers is selected and queried first, and another is used if a timeout occurs. Shopify used a single DNS provider until 2016, when a large DNS outage happened while our DNS provider was under a distributed denial of service (DDoS) attack, effectively dropping a large number of legitimate requests. We learned from our mistakes and increased our reliability and redundancy by using more than one DNS provider.
When CDN traffic management was set up, it used two different DNS providers, following in the steps of our static DNS records in the record_store. The decision for our new system was easy to make: since two providers prevent being dependent on a single one, we wanted to follow the same approach and adopt two providers as our new standard.
Define The Traffic Management Layers
Two of our DNS providers allowed for regionalized and weighted traffic management, as well as multiple failover layers. It was just a matter of defining how we wanted things to work and building the equivalent approach with both providers. We defined our approach in layers of traffic management, where each layer makes a decision that reduces the set of options the next layer can choose from.
Layer 1: Geographical Fencing
The geographical fencing layer selects the endpoints that fit the region the request originates from, choosing the best geographical match for the client's request. For instance, if we set a rule to answer with endpoint A for Canada and endpoint B for Quebec, a request originating from Montreal returns B, while a request originating from Ottawa returns A. The layer also supports globally matching endpoints, which are mandatory: we defined a global region that is selected when nothing more specific matches, so we always have an answer to a DNS request for an existing traffic-managed domain, even if no region specifically matches the requester.
Layer 2: Endpoint Status
We provided a way to enable and disable the use of endpoints depending on their status, which is set manually or automatically. Automatically setting an endpoint's status relies on a process called health checking or monitoring, where we regularly try to reach the endpoint to verify whether it does (healthy) or does not (unhealthy) answer. We added a layer of traffic management based on endpoint status, aimed at selecting only the endpoints currently considered healthy. However, we don't want the requester to receive an empty answer for a domain that does exist, as that would trigger the negative TTL, which is most of the time higher than our traffic-managed domain TTL. If any of the endpoints are healthy, then only the healthy endpoints are returned by Layer 2. If none of the endpoints are healthy, all the endpoints are returned. The logic behind this is simple: returning something that doesn't work is better than returning nothing, as it allows the client to resume using the service as soon as endpoints come back online.
Layer 3: Endpoint Priority
Another aspect we want control over is the failover approach for our endpoints. We allow users to define levels of priority for the traffic shares of their domains. For instance, they could define, as the highest priority, that three endpoints A, B, and C receive 100%, 0%, and 0% of the traffic respectively, and, as a second priority, that B receives 100% of the traffic when A is unhealthy. This can't be done without a layer that selects endpoints based on their priority: without it, we'd either keep sending a share of the normal traffic to B, or have no automated control over how B and C share the traffic when A is failing. Endpoints with a weight of 0 (not receiving traffic) at a higher-priority level are also considered down for that level. This means that when the endpoints receiving traffic become unhealthy, through health checking, the higher-priority endpoints get discarded at Layer 2, allowing Layer 3 to keep only the highest-priority endpoints left in the returned list.
Layer 4: Weighted Selection
This final layer deals with the weights defined for the endpoints. Each endpoint E reaching this layer has a probability P(E) of being selected as the answer, where P(E) = <weight of E> / <sum of the weights of all endpoints reaching Layer 4>. Any 0-weighted endpoints are automatically discarded, unless only 0-weighted endpoints remain, in which case all endpoints have an equal chance of being selected and returned to the requester.
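To make the layered decision process concrete, here's a minimal sketch in Go of how the four layers compose. This isn't our production implementation (the real logic lives in the DNS providers' traffic-management engines); the Endpoint fields and the exact-match region lookup are simplifying assumptions for illustration.

```go
package trafficmanagement

import "math/rand"

// Endpoint is a simplified view of a traffic-managed record.
type Endpoint struct {
	Address  string
	Region   string // e.g. "global", "na", "na/ca", "na/ca/qc"
	Healthy  bool   // Layer 2: set by health checks or manually
	Priority int    // Layer 3: lower value = higher priority
	Weight   int    // Layer 4: 0 (disabled) to 15
}

// Resolve walks the four layers and picks the endpoint to answer with.
func Resolve(endpoints []Endpoint, clientRegion string) Endpoint {
	// Layer 1: best geographical match, falling back to the
	// mandatory "global" region so we always have an answer.
	candidates := filter(endpoints, func(e Endpoint) bool { return e.Region == clientRegion })
	if len(candidates) == 0 {
		candidates = filter(endpoints, func(e Endpoint) bool { return e.Region == "global" })
	}

	// Layer 2: keep only healthy endpoints, unless none are healthy,
	// in which case everything is kept (better than an empty answer).
	if healthy := filter(candidates, func(e Endpoint) bool { return e.Healthy }); len(healthy) > 0 {
		candidates = healthy
	}

	// Layer 3: keep only the highest-priority endpoints left.
	best := candidates[0].Priority
	for _, e := range candidates {
		if e.Priority < best {
			best = e.Priority
		}
	}
	candidates = filter(candidates, func(e Endpoint) bool { return e.Priority == best })

	// Layer 4: weighted random selection; if only 0-weighted
	// endpoints remain, they all get an equal chance.
	total := 0
	for _, e := range candidates {
		total += e.Weight
	}
	if total == 0 {
		return candidates[rand.Intn(len(candidates))]
	}
	n := rand.Intn(total)
	for _, e := range candidates {
		if n < e.Weight {
			return e
		}
		n -= e.Weight
	}
	return candidates[len(candidates)-1] // unreachable; keeps the compiler happy
}

func filter(in []Endpoint, keep func(Endpoint) bool) []Endpoint {
	var out []Endpoint
	for _, e := range in {
		if keep(e) {
			out = append(out, e)
		}
	}
	return out
}
```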
Deploying and Maintaining The Traffic-Managed Domains
We try to build tooling in a self-serve way, and creating a new standard requires making that tooling easily accessible for other teams to deploy their traffic-managed domains. Since we use Terraform with Atlantis for a number of our deployments, we built a Terraform module that receives only the required parameters from an application owner and hides most of the work happening behind the scenes to configure our providers.
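The code sample from the original post doesn't reproduce here; the following is a hypothetical sketch of what such a module invocation could look like. The module source and parameter names are illustrative assumptions, not the module's actual interface.

```hcl
# Hypothetical invocation of the traffic-management Terraform module.
# Module source and parameter names are illustrative only.
module "traffic_managed_domain" {
  source = "../../modules/traffic-managed-domain"

  # zone and subdomain are derived from this file's path (see below),
  # so the owner only describes their endpoints and weights.
  endpoints = {
    "central-4" = { address = "central-4.example.com", weight = 95 }
    "central-5" = { address = "central-5.example.com", weight = 5 }
  }

  enable_monitoring = true
}
```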
A module declaration along those lines is the gist of what an application owner needs to provide to deploy their own traffic-managed domain.
We work to keep our deployments organized, so we derive the zone and subdomain parameters from the path of the domain being terraformed. For example, the path terraform/tm.shopifysvc.com/test/domain.tf lets us derive that the zone is tm.shopifysvc.com and the subdomain is test.
When an application owner wants to make changes to their traffic-managed domain, they just need to update the domain.tf file and apply the Terraform change. There are a number of extra features that control
- automated monitoring and failover for their domains
- monitoring configuration for domains
- paging when a failover is automatically triggered or not.
When we make changes to how traffic-managed domains are deployed, or add new features, we update the module and move domains to the new module version one by one. Everything stays transparent for the application owners and easy to maintain for us, the Edgescale team.
Everyday Traffic Steering Operations
We wanted users of our new standard to be able to make fast, easily applied changes to their traffic-managed domains. So we built a new command in our chatops bot, spy endpoints, to perform operations on the traffic-managed domains.
These commands operate on relative and absolute domains; relative domains automatically receive our default traffic-managed zone as a suffix. It's also possible to specify which regions a change should apply to by using square brackets. For example, cdn[us-*,na] targets the cdn traffic-managed domain, but only in the na region and the regions starting with us-.
The spy endpoints get command returns the current traffic shares between endpoints for a given domain. If all providers hold the same data (which should be the case most of the time), the results make no mention of the DNS providers. When the results differ (the providers went out of sync), the data is shown per provider to make sure the current traffic shares are known.
The spy endpoints set command changes the traffic shares using specific weight values we provide. It updates every provider, then runs the spy endpoints get command to show the new traffic shares. Instead of specifying the weights for each endpoint, it's possible to use one of a number of defined profiles that set predefined traffic shares. For example, for our test domain with its two endpoints, mostly[central-4] defines weights of 95 for central-4 and 5 for central-5.
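As an illustration, a session with the bot could look something like the following. The exact argument and output formats are hypothetical; only the spy endpoints get/set commands and the mostly[...] profile are described above.

```
spy endpoints get test
> central-4: 95, central-5: 5

spy endpoints set test mostly[central-5]
> central-4: 5, central-5: 95
```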
Our Success Stories
Moving to Elasticsearch 7.0.0 - We talked about the process used by the Elasticsearch team and the fact that it was limited to failovers and didn't allow traffic shares. When we moved our internal Elasticsearch clusters to Elasticsearch 7.0.0, the team was able to use the weighted load balancing provided by our tools to move the traffic chunk by chunk and ensure everything was working properly. It allowed them to keep the regular traffic going and mitigate any issue they might have encountered along the way, making the transition to Elasticsearch 7.0.0 seamless to the different systems using it.
Recovering from Kafka overload during a flash sale - During a large flash sale, the Kafka brokers in one of our clusters started overloading from the traffic they had to handle. Once the problem was identified, it took a few minutes for the Kafka team to realize that they now had traffic share capabilities from our DNS traffic manager. They used it to divert half the traffic from the overloaded region and send it to another available region. Less than five minutes after making that change, the Kafka queues started recovering.
Relieving on-call stress - Being on-call is stressful, especially when you're out running errands and don't want to be stuck at home waiting for your phone to ring. Even with the great on-call culture at Shopify, and people always happy to cover parts of your shift, being able to use the DNS traffic manager to steer an application's traffic to another cluster when something happens helps in so many cases. One aspect the different teams appreciated is that the work can easily be done from a phone (thanks spy!). Another is that Shopifolk can stay serene while solving incidents that are mitigated through traffic management and don't impact our merchants and their clients. In summary, easy-to-use tooling and practical features together improve the experience of both our merchants and coworkers.
Since the creation of our new DNS traffic management standard in the middle of 2019, we’ve onboarded more than 40 different domains across more than 12 different teams.
Why Ownership Is Important, Demonstrated by Example
A few months ago, while we were moving teams to the initial version of our DNS traffic manager, we got an email from one of our DNS providers letting us know that they would discontinue their service because it was being merged with the services of the company that had bought them a few years prior. Of course, we weren't so lucky: merging their systems together required action on our part, and we needed to manually migrate our zones to the new provider.
As a result, we launched a project to find our next DNS provider. Since we needed to manually migrate our zones, and consequently update all of our tooling, we figured we might as well evaluate our options. We looked at more than 40 providers, keeping in mind our needs for our static zones and our traffic management requirements. We selected a few providers that fit our needs and decided which one to sign a contract with.
Once we chose the provider, the big migration happened. First, we updated our Terraform module to support the new provider and deployed the traffic-managed domains to all three providers. Then we updated our spy endpoints tooling to update all providers when making changes, so everything was ready and in sync. Next, we moved the nameservers of our traffic-managed zone one by one from the DNS provider we were leaving to the new DNS provider, making sure that in case of a problem only a controlled share of the traffic would be affected. We explained our migration plans to the different teams owning domains in the traffic manager, letting them know when it would happen, that it should be transparent to them, and that they should contact us if anything unexpected seemed to be happening. We also told the on-call incident manager about the changes and the timelines.
Everything was in place for the change to happen. However, 30 minutes before change time, the provider we were leaving had an incident that prevented us from moving traffic, at the same time as one of the application owners had an incident they wanted to mitigate with the traffic manager. This pushed our timeline back, but we then continued with the change without any issue, and it was fully transparent to all the application owners.
Looking back at how things were before we rolled out our new standard for DNS traffic management, we can easily say that moving to a new DNS provider wouldn't have been as smooth. We would have had to
- contact every team using their own approach to gather their needs and usage so we could find a good alternative to our current DNS provider (luckily this was done while preparing and building this project)
- coordinate between those teams for the change to happen, and then chase after them to make sure they updated any tooling used
The change couldn't have been handled for all of them as a whole, as there wasn't one product handled by one team, but many products handled by many teams.
With our DNS traffic management system, we brought ownership to this aspect of our infrastructure: we understand the capabilities and requirements of our teams, and we can maintain and evolve the system as their needs evolve, improving the experience of our merchants and their customers.
Our DNS traffic management journey took us from many manually setup, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.
An Introduction to DNS Traffic Management
Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management is a well-used approach to do so. In this first part of a two-part series, we aim to give a broad overview of DNS and how it’s used for traffic management, as well as the different reasons why we want to use DNS traffic management.
If you already have context on what DNS is, what traffic management is, and the reasons why you would need DNS traffic management, you can skip directly to where we share our journey and improvements made regarding DNS traffic management at Shopify in Part 2: Shopify's DNS Traffic Management.
A Summarized History of DNS
Everything started with humans trying to communicate and a plain text file, even before the advent of the modern internet.
The Advanced Research Projects Agency Network (ARPANET) was conceived in 1966 to enable access to remote computers. In 1969, the first computers were connected to ARPANET, followed by the implementation of the Network Control Program (NCP) in 1970. Driven by the need to connect more and more computers together, and building on the work on the Transmission Control Protocol (TCP) that started in 1974, TCP/IP was created in the late 1970s to provide the ability to join separate networks into a network of networks; it replaced NCP in ARPANET on January 1st, 1983.
At the beginning of ARPANET there were just a handful of computers from four different universities connected together, so it was easy for people to remember their addresses. This became challenging as new computers joined the network. The Stanford Research Institute provided, through file sharing, a manually maintained file containing the hostnames and related addresses of hosts, as provided by member organizations of ARPANET. This file, originally named HOSTS.TXT, is now also broadly known as the /etc/hosts file on Unix and Unix-like systems.
A growing network with an increasingly large number of computers meant an increasingly large file to download and maintain. By the early 1980s, this process became slow and an automated naming system was required to address the technical and personnel issues of the current approach. The Domain Name System (DNS) was born, a protocol converting human-readable (and rememberable) domain names into Internet Protocol (IP) addresses.
What is DNS?
Let's consider that DNS is a very large library where domains are organized from the least to the most meaningful parts of their names. For instance, if you (the client) want to resolve shops.myshopify.com, you would consider .com the least meaningful part, as it's shared with many domain names, and shops the most meaningful part, as it specifies the subdomain you're requesting. Finding shops.myshopify.com in this DNS Library would thus mean going to the .com shelves and finding the myshopify book. Once we have the book in hand, we would open it to the shops page and see something that looks like the following:
DNS Library Book
The image is telling us that shops.myshopify.com corresponds to the IP address 23.227.38.64. Also, our DNS Library provides us with the equivalent of a Due Date, which is called Time To Live (TTL). It corresponds to the amount of time the association of hostname to IP address is valid. We remember or cache that information for the given amount of time. If we need this information after expiration, we have to “find that book” again to verify if the association is still valid.
The opposite concept also exists: if you're trying to find a page in the book and can't find it, chances are you won't wait there until someone writes it for you. In DNS, this concept is driven by the negative TTL, which represents how long a NXDOMAIN (non-existing domain) answer can be cached. This means that the author of a new page in this book can't consider their update known by everyone until that period of time has elapsed.
Another relevant element is that the DNS Library doesn't necessarily hold only one myshopify.com book but multiple identical ones, from different editors, enabling others to be consulted if one copy is unavailable.
In DNS terms, the editors are DNS providers. The shelves contain multiple sections for each domain's nameservers, the servers that provide DNS resolution as a source of truth for a domain. The books are zones in the domain nameservers, and the book pages are DNS records, the direct relation between the queried record and the value it should resolve to. The DNS Library is what we call the root servers, a set of 13 nameservers (named from a to m) that hold the keys to the root of the hostnames. The root servers are responsible for helping to locate the shelves: the nameservers of the Top Level Domains (TLD), the domains at the highest level in the DNS hierarchy.
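To tie the analogy back to practice, here's a minimal Go program that performs this resolution through the standard resolver. The hostname and example address come from the page shown above; the actual answer depends on the records at query time.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// The stub resolver asks a recursive resolver, which walks the
	// library for us: root servers, then the .com nameservers, then
	// the myshopify.com nameservers.
	addrs, err := net.LookupHost("shops.myshopify.com")
	if err != nil {
		fmt.Println("resolution failed:", err)
		return
	}
	fmt.Println(addrs) // e.g. [23.227.38.64]
}
```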
What is Traffic Management?
Traffic management is a key branch of logistics that aims to plan and control everything required to provide for the safe, orderly, and efficient movement of persons and goods. Traffic management helps to manage situations such as congestion or roadblocks, by redirecting traffic or sharing traffic between multiple routes. For instance, some navigation applications use data they get from their users (current location, current speed, etc.) to know where congestion is happening and improve the situation by suggesting alternative routes instead of sending them to the already overloaded roads.
A more generic description is that traffic management uses data to decide where to direct the traffic. We could have different paths depending on the country of origin (think country border waiting lines for the booths, where the checks are different depending on the passport you hold), different paths depending on vehicle size (bike lanes, directions for trucks vs. cars, etc.) or any other information we find relevant.
DNS + Traffic Management = DNS Traffic Management
Bringing the concept of traffic management to DNS means serving data-driven answers to DNS queries resulting in different answers depending on the location of the requester or for each request. For instance, we could have two clusters of servers and want to split the traffic between the two: we can decide to answer 50% of the requests with the first cluster and the other 50% with the second. The clients obtaining the answers would connect to the cluster they got directed to, without any other action on their part.
However, from the previous section, DNS queries are cached to avoid overloading servers with queries. Each time a query is cached by a resolver, it won’t be repeated by that resolver for the duration of its TTL. Using a low TTL will make sure that the information is kept around but not for too long. For example, returning a TTL of 15 seconds means that after 15 seconds the client needs to resolve the record again, and can get a different answer than before.
A low TTL needs careful consideration, as the time it takes to obtain the DNS record's content from the DNS servers, called the DNS resolution time, can sometimes dominate the time it takes to retrieve a resource like a webpage. Connection performance and accuracy of the result are thus often at odds. For instance, suppose I want my changes to appear to users in at most 15 seconds (hence setting a 15 second TTL), but DNS resolution takes 1 second: every 15 seconds, users take 1 extra second to reach the service they're connecting to. Over a day, this added resolution time adds up to 5760 seconds, or 1 hour and 36 minutes. If we slightly sacrifice accuracy by moving the TTL to 60 seconds, the added resolution time becomes 1440 seconds over a day, or only 24 minutes, improving the overall performance.
The use of caching and TTLs implies that DNS traffic management isn't instant: there's a short delay in refreshing the record, which should be at most the TTL we configured. In practice it can be slightly more, as some DNS resolvers, unbeknownst to the client, cache results for longer than the TTL as they see fit. Such overrides shouldn't happen often, but they're something to be aware of when choosing DNS for traffic management.
Examining Four DNS Traffic Management Use Cases
DNS traffic management is interesting when handling systems that don't necessarily have load-balancing capabilities at the network level, whether through an IP-level load balancer or any front-facing proxy, that is, load balancing applied once a client is already connected to the service it's trying to reach. There are many reasons to use DNS traffic management in front of services, and multiple reasons why we use it at Shopify.
Easy Failover
One of Shopify's use cases is easily failing over a service when the live instance crashes or is rendered unavailable for any reason. With DNS traffic management ready to target two clusters but using only one by default, we simply redirect the traffic to the second cluster whenever the first one crashes, then redirect it back when the first recovers. This is commonly called active-passive. If you're able to identify the unavailability of your main cluster in a timely fashion, this approach is almost seamless (considering the TTL) to the clients using the service, as they use the still-working cluster while the issue is solved, either automatically or through the intervention of the responsible on-call team (as a last resort). It relieves pressure on those on-call teams, as they know that clients can still use the service while they solve the issue, sometimes even pushing the work to the next working day.
Share Traffic Between Endpoints
Services inevitably grow and end up receiving requests from many clients, and those requests need to be shared between available endpoints offering the exact same service. This is called active-active. Another motivation behind this approach is financial: when using external vendors (an external company contracted to provide your users a service) with minimum usage commitments, sharing your traffic load between those vendors lets you make sure you reach those commitments. The percentage of traffic sent to a given endpoint corresponds to the percentage of DNS requests answered with that endpoint.
Deploy a Change Progressively
When developing production services, you sometimes need to make a potentially disruptive change, such as deploying a new feature, changing the behavior of an existing one, or updating a system to a new version. In such cases, deploying your change and crossing your fingers while hoping for success is, at best, risky. DNS traffic management can help by allowing you to move a small percentage of your traffic to a cluster that's already updated, then move more and more chunks of traffic until all of it has been moved to the cluster with the new feature. This approach is called blue-green deployment. You can then update the other cluster, leaving you ready for the next update or failover.
Regionalize Traffic Decisions
You might find cases where some endpoints are more performant in some regions than others, which can happen when using external vendors. If performance is important for users, as it is for Shopify's merchants and their customers, you want to make sure the most performant endpoints are used for users in each region by basing DNS answers on the client's location. Most of the time, geolocation can be fine-grained down to the country, state, or province level, or applied to a broader region of the world. Routing rules define what should be answered depending on the origin of the request. Once done, a client connecting from a given location gets the answer that fits them.
Our DNS traffic management journey took us from many manually set-up, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.
A Brief History of TLS Certificates at Shopify
Transport Layer Security (TLS) encryption may be commonplace in 2020, but this wasn’t always the case. Back in 2014, our business owner storefront traffic wasn’t encrypted. We manually provisioned the few TLS certificates that were in production. In this post, we’ll cover Shopify’s journey from manually provisioning TLS certificates to the fully automated system that supports over 1M business owners today.
In the Beginning
Up to 2014, only business owner shop administration and checkout traffic were encrypted. All checkouts were on the checkout.shopify.com domain. Secured shop administration used the *.myshopify.com wildcard certificate, and checkout used a single-domain certificate for checkout.shopify.com. Our Operations team renewed the certificates manually as needed. During this time, teams began researching what it would take to offer TLS encryption to all business owners in an automated fashion.
Shopify Plus
We launched Shopify Plus in early 2014. One of Plus’s earliest features was TLS encrypted storefronts. We manually provisioned certificates, adding new domains to the Subject Alternative Name (SAN) list as required. As our certificate authority placed a limit on the number of domains per certificate, certificates were added to support the new domains being onboarded. At the time, Internet Explorer on Windows XP was still used by a significant number of users, which prevented our use of the Server Name Indication (SNI) extension.
While this addressed our immediate needs, there were several drawbacks:
- Manual certificate updates and provisioning were labor-intensive and needed to be handled with care.
- Additional IP addresses were needed to support new certificates.
- Having domains for non-related shops in a single certificate wasn’t ideal.
The pace of onboarding was manageable at first. As we onboarded more merchants, it was apparent that this process wasn’t sustainable. At this point, there were dozens of certificates that all had to be manually provisioned and renewed. For each Plus account onboarded, the new domains had to be manually added. This was labor-intensive and error-prone. We worked on a fully automated system during Shopify’s Hack Days, and it became a fully staffed project in May 2015.
Shopify’s Notary System
Automating TLS certificates had to address multiple facets of the process, including:
- How are the certificates provisioned from the certificate authority?
- How are the certificates served at scale?
- What other considerations are there for offering encrypted storefronts?
Shopify's Notary System
Provisioning Certificates
Our Notary system provisions certificates. When a business owner adds a domain to their shop, the system receives a request for a certificate to be provisioned. The certificate provisioning is fully automated via Application Programming Interface (API) calls to the certificate authority. This includes the order request, domain ownership verification, and certificate/private key pair delivery. Certificate renewals are performed automatically in the same fashion.
While it would make sense to group a shop's domains into one certificate, the system handles all domains separately for simplicity: each certificate covers a single domain and has its own unique private key. The certificate and private key are stored in a relational database, which is accessible by the load balancers for terminating TLS connections.
Scaling Up Certificate Provisioning
At the time, we hosted our nginx load balancers in our datacenters. Storing the TLS certificates on disk and reloading nginx whenever certificates changed wasn't feasible. In a past article, we talked about our use of nginx and OpenResty Lua modules. Using OpenResty allowed us to script nginx to serve dynamic content outside of the nginx configuration. In addition, browser support for the TLS SNI extension was almost universal by then. By leveraging the TLS SNI extension, we dynamically load TLS certificates from our database in a Lua middleware via the ssl_certificate_by_lua module. Certificates and private keys are directly accessible from the relational database via a single SQL query, and an in-memory Least Recently Used (LRU) cache reduces the latency of TLS handshakes for frequently accessed domains.
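Our middleware is written in Lua, but the same SNI-driven lookup is easy to sketch in Go using the crypto/tls GetCertificate hook. This is an analogue of the idea, not our actual implementation; the table and column names are hypothetical, and the cache is deliberately simplistic.

```go
package tlscerts

import (
	"crypto/tls"
	"database/sql"
	"sync"
)

var (
	// certCache stands in for the in-memory LRU cache; a real one would
	// bound its size and expire entries when certificates are renewed.
	certCache sync.Map // hostname -> *tls.Certificate
	db        *sql.DB  // the relational store of certificate/key pairs
)

// forSNI returns the certificate matching the SNI hostname the client sent.
func forSNI(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
	host := hello.ServerName // SNI extension from the TLS handshake
	if c, ok := certCache.Load(host); ok {
		return c.(*tls.Certificate), nil
	}
	var certPEM, keyPEM []byte
	// One SQL query per uncached handshake, as described above.
	// Table and column names are hypothetical.
	err := db.QueryRow(
		"SELECT certificate, private_key FROM certificates WHERE domain = ?", host,
	).Scan(&certPEM, &keyPEM)
	if err != nil {
		return nil, err
	}
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, err
	}
	certCache.Store(host, &cert)
	return &cert, nil
}

// Config selects a certificate per connection instead of serving a static one.
var Config = &tls.Config{GetCertificate: forSNI}
```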
Solving Mixed Content Warnings
With TLS certificates in place for business owner shop domains, we could offer encrypted storefronts for all shops. However, there was still a significant hurdle to overcome. Each shop’s theme could have images or assets referencing non-encrypted Uniform Resource Locators (URLs). Mixing of encrypted and unencrypted content would cause the browser to display a Mixed Content warning, denoting that some resources on the page are not encrypted. To resolve this problem, we had to process all the shop themes to replace references to HTTP with HTTPS.
With all the infrastructure in place, we realized the goal of supporting encrypted storefronts for all merchants in February 2016. The same system is still in place and has scaled to provide TLS certificates for all of our 1M+ merchants.
Let’s Encrypt!
Let's Encrypt is a non-profit certificate authority that provides TLS certificates at no charge; Shopify has been and is currently a sponsor. The service launched in April 2016, shortly after our Notary went into production. With the exception of Extended Validation (EV) certificates and other special cases, we've migrated away from our paid certificate authority in favor of Let's Encrypt.
Move to the Cloud
In June 2019, our network edge moved from our datacenter to a cloud provider. Our requirement to support a very large number of TLS certificates drastically reduced the list of viable vendors. Once the cloud provider was selected, our TLS provisioning system had to be adapted to work with their systems. There were two paths forward: using the cloud provider's managed certificates, or continuing to provision Let's Encrypt certificates and uploading them. The initial migration leveraged the provider's certificate provisioning.
Using managed certificates from the cloud provider has the advantage of being maintenance-free after they’ve been provisioned. There are no storage concerns for certificates and private keys. In addition, certificates are automatically renewed by the vendor. Administrative work was required during the migration to guide merchants to modify their domain’s Certification Authority Authorization (CAA) Domain Name System (DNS) records as needed. Backfilling the certificates for our 1M+ merchants took several weeks to complete.
After the initial successful migration to our cloud provider, we revisited the certificate provisioning strategy. As we maintain an alternate edge network for contingency, the Notary infrastructure is still in place to provide certificates for that infrastructure. The intent was for provider-managed certificates to be a stepping stone to deprecating Notary in the future. While the cloud provider-provisioned certificates worked well for us, they meant keeping two sets of certificates synchronized. To simplify certificate state and operational load, we now use the Notary-provisioned certificates for both edge networks: instead of provisioning certificates on our cloud provider, certificates from Notary are uploaded as new ones are required.
Outside of our business owner shop storefronts, we rely on nginx for other services that are part of our cloud infrastructure. Some of our Lua middleware, including the dynamic TLS certificate loading code, was contributed to the ingress-nginx Kubernetes project.
Our TLS certificate journey took us from a handful of manually provisioned certificates to a fully automated system that can scale up to support over 1M merchants. If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Learn about the actions we’re taking as we continue to hire during COVID‑19
Your Circuit Breaker is Misconfigured
Circuit breakers are an incredibly powerful tool for making your application resilient to service failure. But they aren’t enough. Most people don’t know that a slightly misconfigured circuit is as bad as no circuit at all! Did you know that a change in 1 or 2 parameters can take your system from running smoothly to completely failing?
I’ll show you how to predict how your application will behave in times of failure and how to configure every parameter for your circuit breaker.
At Shopify, resilient fallbacks are integrated into every part of the application. A fallback is a backup behavior that activates when a particular component or service is down. For example, when the Redis that stores Shopify's sessions is down, the user doesn't have to see an error. Instead, the problem is logged and the page renders with sessions soft disabled, resulting in a much better customer experience. This behavior is achievable in many cases; however, it's not as simple as catching the exceptions raised by a failing service.
Imagine Redis is down and every connection attempt is timing out, with each timeout 2 seconds long. Response times become incredibly slow, since requests are waiting for the service to time out. Additionally, during that time the request is doing nothing useful and keeps the thread busy.
Utilization, the percentage of a worker’s maximum available working capacity, increases indefinitely as the request queue builds up, resulting in a utilization graph like this:
Utilization during a service outage
A worker which had a request processing rate of 5 requests per second now can only process half a request per second. That’s a tenfold decrease in throughput! With utilization this high, the service can be considered completely down. This is unacceptable for production level standards.
Semian Circuit Breaker
At Shopify, this fallback utilization problem is solved by Semian Circuit Breaker. This is a circuit breaker implementation written by Shopify. The circuit breaker pattern is based on a simple observation: if a timeout is observed for a given service one or more times, it’s likely to continue to timeout until that service recovers. Instead of hitting the timeout repeatedly, the resource is marked as dead and an exception is raised instantly on any call to it.
I'm looking at this from the configuration perspective of the Semian circuit breaker, but another notable circuit breaker library is Netflix's Hystrix. The core functionality of their circuit breaker is the same; however, it has fewer parameters available for tuning, which means, as you'll learn below, it can completely lose its effectiveness for capacity preservation.
A circuit breaker can take the above utilization graph and turn it into something more stable.
Utilization during service outage with a circuit breaker
The utilization climbs for some time before the circuit breaker opens. Once open, the utilization stabilizes, so users may only experience slight request delays, which is much better.
Semian Circuit Breaker Parameters
Configuring a circuit breaker isn't a trivial task. It's seemingly trivial because there are just a few parameters to tune:
- name
- error_threshold
- error_timeout
- half_open_resource_timeout
- success_threshold
However, these parameters cannot just be assigned to arbitrary numbers or even best guesses without understanding how the system works in detail. Changes to any of these parameters can greatly affect the utilization of the worker during a service outage.
At the end, I'll show you a configuration change that drops the additional utilization requirement from 263% to 4%. That's the difference between a complete outage and a slight delay. But before I get to that, let's dive into what each parameter does and how it affects the circuit breaker.
Name
The name identifies the resource being protected. Each name gets its own personal circuit breaker. Every different service type, such as MySQL, Redis, etc., should have its own unique name to ensure that excessive timeouts in one service only open the circuit for that service.
There is an additional aspect to consider here. The worker may be configured with multiple service instances of a single type. In certain environments, there can be dozens of Redis instances that a single worker can talk to. We would never want a single Redis instance outage to take down all Redis connections, so we must give each instance a different name.
For this example, see the diagram below. We model a total of 3 Redis instances. Each instance is given a name "redis_cache_#{instance_number}".
3 Redis instances. Each instance is given a name “redis_cache_#{instance_number}”
You must understand how many services your worker can talk to, as each failing service has an aggregating effect on the overall utilization. In the examples below, the maximum number of failing services you'd like to account for is defined by failing_services. For example, if you have 3 Redis instances, but you only need to know the utilization when 2 of them go down, failing_services should be 2.
All examples and diagrams in this post are from the reference frame of a single worker. None of the circuit breaker state is shared across workers so we can simplify things this way.
Error Threshold
The error_threshold defines the number of errors to encounter within a given timespan before opening the circuit and starting to reject requests instantly: if the circuit is closed and error_threshold errors occur within a window of error_timeout, the circuit opens.
The larger the error_threshold, the longer the worker is stuck waiting for input/output (I/O) before reaching the open state. The following diagram models a simple scenario where we have a single Redis instance failure.
error_threshold = 3, failing_services = 1
A single Redis instance failure
3 timeouts happen one after the other for the failing service instance. After the third, the circuit becomes open and all further requests raise instantly.
3 timeouts must occur during the timespan before the circuit becomes open. Simple enough; 3 times the timeout isn't so bad. The utilization will spike, but the service reaches a steady state soon after. This graph is a real-world example of this spike at Shopify:
A real world example of a utilization spike at Shopify
The utilization begins to increase when the Redis service goes down; after a few minutes, the circuits open for each failing service and the utilization lowers to a steady state.
Furthermore, if there’s more than 1 failing service instance, the spike will be larger, last longer, and cause more delays for end users. Let’s come back to the example from the Name section with 3 separate Redis instances. Consider all 3 Redis instances being down. Suddenly the time until all circuits are open triples.
error_threshold = 3, failing_services = 3
3 failing services and each service has 3 timeouts before the circuit opens
There are 3 failing services and each service has 3 timeouts before the circuit opens. All the circuits must become open before the worker will stop being blocked by I/O.
Now we take longer to reach a steady state, because each circuit breaker wastes utilization waiting for timeouts. Imagine 40 Redis instances instead of 3: with a timeout of 1 second and an error_threshold of 3, there's a minimum of around 2 minutes before all the circuits open.
The reason this estimate is a minimum is that the order in which the requests come in can't be guaranteed. The diagram above simplifies the scenario by assuming the requests arrive in a perfect order.
To keep the initial utilization spike low, the error_threshold should be reduced as much as possible. However, the probability of false positives must be considered: blips can cause the circuit to open despite the service not being down. The lower the error_threshold, the higher the probability of a false-positive circuit open.
Assuming a steady state timeout error rate of 0.1% within your error_timeout window, an error_threshold of 3 gives you a 0.0000001% chance of getting a false positive. You must balance this probability against your error_timeout to reduce the number of false-positive circuit opens, because when the circuit opens, every request made during error_timeout raises instantly.
Error Timeout
The error_timeout is the amount of time until the circuit breaker tries to query the resource again. It also determines the window over which the error_threshold count is measured. The larger this value, the longer the circuit takes to recover after an outage, and the longer a false-positive circuit open affects the system.
error_threshold = 3, failing_services = 1
The circuit will stay open for error_timeout amount of time
After the failing service causes the circuit to open, the circuit stays open for error_timeout amount of time. The Redis instance comes back to life, error_timeout passes, and requests start being sent to Redis again.
It's important to consider the error_timeout in relation to half_open_resource_timeout. These 2 parameters are the most important in your configuration: getting them right determines how well the circuit breaker's resiliency mechanism protects your application during an outage.
Generally, we want to minimize the error_timeout because the higher it is, the longer the recovery time. However, the primary constraints come from its interaction with the other parameters. I'll show you that maximizing error_timeout actually preserves worker utilization.
Half Open Resource Timeout
The circuit is in the half-open state when it's checking to see if the service is back online. It does this by letting a real request through. The circuit becomes half-open after error_timeout amount of time has passed. When the service is completely down for an extended period of time, a steady-state behavior arises.
failing_services = 1
Circuit becomes half-open after error_timeout amount of time has passed
The error_timeout expires, but the service is still down. The circuit becomes half-open, a request is sent to Redis, and it times out. The circuit opens again, and the process repeats for as long as Redis remains down.
This flip flop between the open and half-open state is periodic which means we can deterministically predict how much time is wasted on timeouts.
By this point, you may already be speculating on how to reduce wasted utilization. The error_timeout can be increased to reduce the total time wasted in the half-open state! Awesome, but the higher it goes, the slower your application recovers, and false positives keep the circuit open for longer. Not good, especially if we have many service instances: 40 Redis instances with a timeout of 1 second means 40 seconds wasted on timeouts every cycle!
So how else do we minimize the time wasted on timeouts? The only other option is to reduce the service timeout: the lower it is, the less time is wasted waiting for timeouts. However, this can't always be done, since the timeout is highly dependent on how long the service needs to provide the requested data. We have a fundamental problem here: we can't reduce the service timeout because of application constraints, and we can't increase the error_timeout because the recovery time would be too slow.
Enter half_open_resource_timeout, the timeout used for the resource while the circuit is in the half-open state, replacing the original timeout. Simple enough! Now we have another tunable parameter to help adjust utilization: to reduce wasted utilization, error_timeout and half_open_resource_timeout can be tuned together. The smaller half_open_resource_timeout is relative to error_timeout, the better the utilization will be.
If we have 3 failing services, our circuit diagram looks something like this:
failing_services = 3
A total of 3 timeouts before all the circuits are open
In the half-open state, each service has 1 timeout before the circuit opens. With 3 failing services, that’s a total of 3 timeouts before all the circuits are open. All the circuits must become open before the worker will stop being blocked by I/O.
Let's solidify this example with the following timeout parameters:
- error_timeout = 5 seconds
- half_open_resource_timeout = 1 second

The total steady state period will be 8 seconds, with 5 of those seconds spent doing useful work and the other 3 wasted waiting for I/O. That's 37% of total utilization wasted on I/O.
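Making the arithmetic explicit (my own decomposition, but consistent with the numbers above):

$$T_{\text{cycle}} = \text{error\_timeout} + \text{failing\_services} \times \text{half\_open\_resource\_timeout} = 5 + 3 \times 1 = 8 \text{ seconds}$$

$$\text{wasted fraction} = \frac{3\text{ s}}{8\text{ s}} \approx 37\%$$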
Note: Hystrix doesn't have an equivalent parameter for half_open_resource_timeout, which may make it impossible to tune a usable steady state for applications with a high number of failing_services.
Success Threshold
The success_threshold is the number of consecutive successes required for the circuit to close again, that is, to start accepting all requests to the circuit.
The success_threshold impacts behavior during outages with an error rate of less than 100%. Imagine a resource error rate of 90%: with a success_threshold of 1, the circuit flip flops between open and closed quite often, since there's a 10% chance of it closing when it shouldn't. Flip flopping also adds strain on the system, because the circuit must spend time on I/O to re-close.
Instead, if we increase the success_threshold to 3, the likelihood of a premature close becomes significantly lower. Now 3 successes must happen in a row to close the circuit, reducing the chance of a flip flop to 0.1% per cycle.
Note: Hystrix doesn't have an equivalent parameter for success_threshold, which may make it difficult to reduce flip flopping during partial outages for certain applications.
Lowering Wasted Utilization
Each parameter affects wasted utilization in some way. Semian can easily be configured into a state where a service outage consumes more utilization than the capacity allows. To calculate the additional utilization required, I've put together an equation that models all of the parameters of the circuit breaker. Use it to plan for outages effectively.
The Circuit Breaker Equation
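The equation itself appeared as an image in the original post and doesn't reproduce here. Working backwards from the two worked configurations below (which yield roughly 263% and 4%), the additional utilization required comes out to approximately:

$$\text{additional utilization} \approx \frac{\text{failing\_services} \times \text{half\_open\_resource\_timeout}}{\text{error\_timeout} \times \text{threads}}$$

Treat this as a reconstruction inferred from the post's numbers rather than the published formula itself.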
This equation applies to the steady state failure scenario in the last diagram, where the circuit is continuously checking the half-open state. Additional threads reduce the time spent blocked on I/O; however, the equation doesn't account for the time it takes to context switch a thread, which could be significant depending on the application. The larger the context switch time, the lower the thread count should be.
I ran a live test to check the validity of the equation, and the observed utilization closely matched the utilization it predicted.
Tuning Your Circuit
Let’s run through an example and see how the parameters can be tuned to match the application needs. In this example, I’m integrating a circuit breaker for a Rails worker configured with 2 threads. We have 42 Redis instances, each configured with its own circuit and a service timeout of 0.25s.
As a starting point, let’s go with the following parameters. Failing instances is 42 because we are judging behaviour in the worst case, when all of the Redis instances are down.
| Parameter | Value |
| --- | --- |
| failing_instances | 42 |
| service_timeout | 0.25 seconds |
| error_threshold | 3 |
| error_timeout | 2 seconds |
| success_threshold | 2 |
| half_open_resource_timeout | 0.25 seconds (same as service timeout) |
Plugging into The Circuit Breaker Equation, we require an extra utilization of 263%. Unacceptable! Ideally we should have something less than 30% to account for regular traffic variation.
So what do we change to drop this number?
From production metrics, I know that 99% of Redis requests have a response time of less than 50ms. With a value that low, we can easily drop the half_open_resource_timeout to 50ms and still be confident that the circuit will close when Redis comes back up from an outage. Additionally, we can increase the error_timeout to 30 seconds. This means a slower recovery time, but it reduces the worst-case utilization.
With these new numbers, the additional utilization required drops to 4%!
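As a quick sanity check of those numbers, here's a small sketch in Go that plugs both configurations into the reconstructed equation above (an approximation, not the published formula):

```go
package main

import "fmt"

// additionalUtilization follows the reconstructed steady-state equation:
// time spent on half-open probes per cycle, relative to useful thread time.
func additionalUtilization(failingServices int, halfOpenTimeout, errorTimeout float64, threads int) float64 {
	return float64(failingServices) * halfOpenTimeout / (errorTimeout * float64(threads))
}

func main() {
	fmt.Printf("%.1f%%\n", 100*additionalUtilization(42, 0.25, 2, 2))  // 262.5%, the ~263% above
	fmt.Printf("%.1f%%\n", 100*additionalUtilization(42, 0.05, 30, 2)) // 3.5%, the ~4% above
}
```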
I use this equation as something concrete to relate back to when making tuning decisions. I hope this equation helps you with your circuit breaker configuration as it does with mine.
Author's Edit: "I fixed an error with the original circuit breaker equation in this post. success_threshold does not have an impact on the steady state utilization because it only takes 1 error to keep the circuit open again."
If this sounds like the kind of problems you want to solve, we're always on the lookout for talent and we'd love to hear from you. Visit our Engineering career page to find out about our open positions.
Four Steps to Creating Effective Game Day Tests
At Shopify, we use Game Day tests to practice how we react to unpredictable situations. Game Day tests involve deliberately triggering failure modes within our production systems, and analyzing whether the systems handle these problems in the ways we expect. I’ll walk through a set of best practices that we use for our internal Shopify Game Day tests, and how you can apply these guidelines to your own testing.
Shopify’s primary responsibility is to provide our merchants with a stable ecommerce platform. Even a small outage can have a dramatic impact on their businesses, so we put a lot of work into preventing them before they occur. We verify our code changes rigorously before they’re deployed, both through automated tests and manual verification. We also require code reviews from other developers who are aware of the context of these changes and their potential impact to the larger platform.
But these upfront checks are only part of the equation. Inevitably, things will break in ways that we don’t expect, or due to forces that are outside our control. When this happens, we need to quickly respond to the issue, analyze the situation at hand, and restore the system back to a healthy state. This requires close coordination between humans and automated systems, and the only way to ensure that it goes smoothly is to practice it beforehand. Game Day tests are a great way of training your team to expect the unexpected.
1. List All the Things That Could Break
The first step to running a successful Game Day test is to compile a list of all the potential failure scenarios that you’re interested in analyzing. Collaborate with your team to take a detailed inventory of everything that could possibly cause your systems to go haywire. List all the problem areas you know about, but don’t stop there—stretch your imagination!
- What are the parts of your infrastructure that you think are 100% safe?
- Where are your blind spots?
- What would happen if your servers started inexplicably running out of disk space?
- What would happen if you suffered a DNS outage or a DDoS attack?
- What would happen if all network calls to a host started timing out?
- Can your systems support 20x their current load?
You’ll likely end up with too many scenarios to reasonably test during a single Game Day testing session. Whittle down the list by comparing the estimated impact of each scenario against the difficulty you’d face in trying to reasonably simulate it. Try to avoid weighing particular scenarios based on your estimates of the likelihood that those scenarios will happen. Game Day testing is about insulating your systems against perfect storm incidents, which often hinge on failure points whose danger was initially underestimated.
2. Create a Series of Experiments
At Shopify, we’ve found that we get the best results from our Game Day tests when we run them as a series of controlled experiments. Once you’ve compiled a list of things that could break, you should start thinking about how they will break, as a list of discrete hypotheses.
- What are the side effects that you expect will be triggered when you simulate an outage during your test?
- Will the correct alerts be dispatched?
- Will downstream systems manifest the expected behaviors?
- When you stop simulating a problem, will your systems recover back to their original state?
If you express these expectations in the form of testable hypotheses, it becomes much easier to plan the actual Game Day session itself. Use a separate spreadsheet (in a tool like Google Sheets or similar) to catalogue each of the prerequisite steps that your team will walk through to simulate a specific failure scenario. Below those steps, indicate the behaviors that you hypothesize will occur when you trigger that scenario, along with an indicator for whether each behavior occurred. Lastly, make sure to list the steps necessary to restore your system back to its original state.
Example spreadsheet for a Game Day test that simulates an upstream service outage. A link to this spreadsheet is available in the “Additional Resources” section below.
3. Test Your Human Systems Too
By this point, you've compiled a series of short experiments that describe how you expect your systems to react to a list of failure scenarios. Now it's time to run your Game Day test and validate your experimental hypotheses. There are a lot of different ways to run a Game Day test, and one approach isn't necessarily better than another. How you approach the testing should be tailored to the types of systems you're testing, the way your team is structured and communicates, the impact your testing poses to production traffic, and so on. Whatever approach you take, just make sure that you track your experiment results as you go along!
However, there is one common element that should be present regardless of the specifics of your particular testing setup: team involvement. Game Day tests aren’t just about seeing how your automated systems react to unexpected pressures—you should also use the opportunity to analyze how your team handles these situations on the people side. Good team communication under pressure can make a huge difference when it comes to mitigating the impact of a production incident.
- What are the types of interactions that need to happen among team members as an incident unfolds?
- Is there a protocol for how work is distributed among multiple people?
- Do you need to communicate with anyone from outside your immediate team?
Make sure you have a basic system in place to prevent people from doing the same task twice, or incorrectly assuming that something is already being handled.
4. Address Any Gaps Uncovered
After running your Game Day test, it’s time to patch the holes that you uncovered. Your experiment spreadsheets should be annotated with whether each hypothesis held up in practice.
- Did your off hours alerting system page the on-call developer?
- Did you correctly switch over to reading from the backup database?
- Were you able to restore things back to their original healthy state?
For any gaps you uncover, work with your team to determine why the expected behavior didn’t occur, then establish a plan for how to correct the failed behavior. After doing so, you should ideally run a new Game Day test to verify that your hypotheses are now valid with the new fixes in place.
This is also the opportunity to analyze any gaps in communication within your team, or problems you identified with how people distribute work among themselves under pressure. Set aside some time for a follow-up discussion with the other Game Day participants to discuss the results of the test, and ask for their input on what went well versus what could use improvement. Finally, make any necessary changes to your team’s guidelines for responding to these incidents going forward.
In Conclusion
Using these best practices, you should be able to execute a successful Game Day test that gives you greater confidence in how your systems—and the humans that control them—will respond during unexpected incidents. And remember that a Game Day test isn’t a one-time event: you should periodically update your hypotheses and conduct new tests to make sure that your team remains prepared for the unexpected. Happy testing!
Additional resources
- How Complex Systems Fail—a list of high-level characteristics of systemic failure that can help you think about how to model your failure scenarios.
- Common Ground and Coordination in Joint Activity—an investigation into how humans coordinate work within technically complex situations.
- List of Resilience Engineering papers and publications—a curated list of further reading materials about Resilience Engineering.
- Example spreadsheet for tracking Game Day test scenarios
How Shopify Manages Petabyte Scale MySQL Backup and Restore
At Shopify, we run a large fleet of MySQL servers, with numerous replica-sets (internally known as “shards”) spread across three Google Cloud Platform (GCP) regions. Given the petabyte-scale size and criticality of the data, we need a robust and efficient backup and restore solution. We drastically reduced our Recovery Time Objective (RTO) to under 30 minutes by redesigning our tooling to use disk-based snapshots, and we want to share how it was done.
Challenges with Existing Tools
For several years, we backed up our MySQL data using Percona’s XtraBackup utility, stored its output in files, and archived them on Google Cloud Storage (GCS). While pretty robust, this approach posed a significant challenge when backing up and restoring data. The amount of time taken to back up a petabyte of data spread across multiple regions was too long, and increasingly hard to improve. We perform backups in all availability regions to decrease the time it takes to restore data cross-region. However, the restore time for each of our shards was more than six hours, which forced us to accept a very high RTO.
While this lengthy restore time was painful when using backups for disaster recovery, we also leverage backups for day-to-day tasks, such as re-building replicas. Long restore times also impaired our ability to scale replicas up and down in a cluster for purposes like scaling our reads to replicas.
Overcoming Challenges
Since we run our MySQL servers on GCP’s Compute Engine VMs using Persistent Disk (PD) volumes for storage, we invested time in leveraging PD’s snapshot feature. Using snapshots was simple enough, conceptually. In terms of storage, each initial snapshot of a PD volume is a full copy of the data, whereas the subsequent ones are automatically incremental, storing only data that has changed.
In our benchmarks, an initial snapshot of a multi-terabyte PD volume took around 20 minutes and each incremental snapshot typically took less than 10 minutes. The incremental nature of PD snapshots allows us to snapshot disks very frequently, helps us with having the latest copy of data, and minimizes our Mean Time To Recovery.
Modernizing our Backup Infrastructure
Taking a Backup
We built our new backup tooling around the GCP API to invoke PD snapshots. It takes into account availability regions and zones, the role of each MySQL instance (replica or master), and other MySQL consistency variables. We deployed this tooling in our Kubernetes infrastructure as CronJobs, which gives the jobs a distributed nature, avoids tying them to individual MySQL VMs, and spares us from handling coordination in case of a host failure. The CronJob is scheduled to run every 15 minutes across all the clusters in all of our available regions, helping us avoid costs related to snapshot transfer across different regions.
Backup workflow selecting replica and calling disk API to snapshot, per cron schedule
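To make this concrete, here’s a minimal sketch of the kind of call such tooling makes through the Compute Engine Go client. This is an illustration rather than Shopify’s actual code, and the project, zone, and disk names are placeholders:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    compute "google.golang.org/api/compute/v1"
)

func main() {
    ctx := context.Background()
    // Authenticates via Application Default Credentials, for example the
    // service account attached to the CronJob's pod.
    svc, err := compute.NewService(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Encode the shard and timestamp into the snapshot name so retention
    // tooling can later decide what to keep.
    snap := &compute.Snapshot{
        Name: fmt.Sprintf("shard-042-data-%d", time.Now().Unix()),
    }

    // CreateSnapshot is asynchronous: it returns an operation that can be
    // polled until the snapshot is READY.
    op, err := svc.Disks.CreateSnapshot("example-project", "us-east1-b", "shard-042-data", snap).Context(ctx).Do()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("snapshot %s started: operation %s is %s", snap.Name, op.Name, op.Status)
}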
The backup tooling creates snapshots of our MySQL instances nearly 100 times a day across all of our shards, totaling thousands of snapshots every day with virtually no failures.
Since we snapshot so frequently, snapshot storage could easily cost thousands of dollars every day if snapshots aren’t deleted correctly. To ensure we only keep (and pay for) what we actually need, we built a framework to establish a retention policy that meets our disaster recovery needs. The tooling enforcing our retention policy is deployed and managed using Kubernetes, similar to the snapshot CronJobs. We create thousands of snapshots every day, but we also delete thousands of them, keeping only the latest two snapshots for each shard, plus dailies, weeklies, and so on in each region, per our retention policy.
Backup retention workflow, listing and deleting snapshots outside of retention policy
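Continuing the same illustrative sketch, retention enforcement amounts to listing a disk’s snapshots and deleting those that fall outside the policy. The filter expression and 30-day window below are made up for the example:

// pruneSnapshots deletes snapshots of one source disk that fall outside a
// 30-day retention window. It reuses the svc handle from the earlier sketch.
func pruneSnapshots(ctx context.Context, svc *compute.Service, project, disk string) error {
    cutoff := time.Now().AddDate(0, 0, -30)
    call := svc.Snapshots.List(project).
        Filter(fmt.Sprintf("sourceDisk eq .*/%s", disk)) // illustrative filter
    return call.Pages(ctx, func(page *compute.SnapshotList) error {
        for _, s := range page.Items {
            created, err := time.Parse(time.RFC3339, s.CreationTimestamp)
            if err != nil {
                return err
            }
            if created.After(cutoff) {
                continue // still within the retention window
            }
            // Delete is asynchronous too; real tooling would track the
            // returned operations and alert on failures.
            if _, err := svc.Snapshots.Delete(project, s.Name).Do(); err != nil {
                return err
            }
        }
        return nil
    })
}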
Performing a Restore
Having a very recent snapshot always at the ready lets us clone replicas with the most recent data possible. Because exporting a snapshot to a new PD volume takes so little time, this has brought our RTO down to typically less than 30 minutes, including recovery from replication lag.
Backup restore workflow, selecting a latest snapshot and exporting to disk and attaching to a VM
Additionally, restoring a backup is now quite simple: the process involves creating new PD volumes using the latest snapshot as the source and starting MySQL on top of them. Since our snapshots are taken while MySQL is online, after a restore the server must go through InnoDB instance recovery, and within a few minutes the instance is ready to serve production queries.
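Under the same assumptions as the earlier sketches, the restore path boils down to creating a fresh PD volume whose source is the chosen snapshot:

// restoreDisk creates a new PD volume from a snapshot. Once the disk is
// READY it can be attached to a MySQL VM, which then performs InnoDB
// instance recovery before serving queries. Names are placeholders.
func restoreDisk(ctx context.Context, svc *compute.Service, project, zone string) error {
    disk := &compute.Disk{
        Name:           "shard-042-data-restore",
        SourceSnapshot: "global/snapshots/shard-042-data-1560000000",
        Type:           fmt.Sprintf("zones/%s/diskTypes/pd-ssd", zone),
    }
    // Insert is asynchronous; poll the returned operation before attaching
    // the disk to a VM.
    _, err := svc.Disks.Insert(project, zone, disk).Context(ctx).Do()
    return err
}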
Assuring Data Integrity and Reliability
While PD snapshot-based backups are obviously fast and efficient, we needed to ensure that they are reliable, as well. We run a backup verification process for all of the daily backups that we retain. This means verifying two daily snapshots per shard, per region.
In our backup verification tooling, we export each retained snapshot to a PD volume, attach it to a Kubernetes Job, and verify the following:
- if a MySQL instance can be started using the backup
- if replication can be started using MySQL Global Transaction ID (GTID) auto-positioning with that backup
- if there is any InnoDB page-level corruption within the backup
Backup verification process, selecting daily snapshot, exporting to disk and spinning up a Kubernetes job to run verification steps
This verification process restores and verifies more than a petabyte of data every day utilizing fewer resources than expected.
PD snapshots are fast and efficient, but they exist only inside GCP and can only be exported to new PD volumes. To ensure data availability in case of catastrophe, we also needed to store backups offsite. We created tooling that exports a selected snapshot to a new PD volume and runs Kubernetes Jobs to compress, encrypt, and archive the data, before transferring it as files to an offsite location operated by another provider.
Evaluating the Pros and Cons of Our New Backup and Restore Solution
Pros
- Using PD snapshots allows for faster backups compared to traditional file-based backup methods.
- Backups taken using PD snapshots are faster to restore, as they can leverage vast computing resources available to GCP.
- The incremental nature of snapshots results in reduced backup times, making it possible to take backups more frequently.
- The performance impact on snapshot donors is noticeably lower than the impact XtraBackup-based backups have on their donors.
Cons
- Using PD snapshots is more expensive for storage compared to traditional file-based backups stored in GCS.
- The snapshot process itself doesn’t perform any integrity checks (for example, scanning for InnoDB page corruption or ensuring data consistency), which means additional tools may need to be built.
- Because snapshots aren’t inherently stored as a conveniently packaged backup, it’s more tedious to copy, store, or send them offsite.
We undertook this project at the start of 2019 and, within a few months, we had a very robust backup infrastructure built around Google Cloud’s Persistent Disk snapshot API. This tooling has served us well and has opened up possibilities beyond disaster recovery, like quickly scaling replicas up and down for reads using these snapshots.
If database systems are something that interests you, we're looking for Database Engineers to join the Datastores team! Learn all about the role on our career page.
A New Kubectl Plugin for Kubernetes Ingress Controller ingress-nginx
Shopify makes extensive use of ingress-nginx, an open source Kubernetes Ingress controller built upon NGINX. Nearly every request Shopify serves is handled at one point by ingress-nginx, and we are active contributors to the project. Since we make use of ingress-nginx and its many features (like annotations and configmap) so heavily, the quality of its debugging experience is very important to us. While debugging ingress-nginx was always possible using a complex litany of kubectl commands, the experience was frequently frustrating. To help solve this issue, I recently contributed a kubectl plugin to the project. It provides a number of features that make ingress-nginx much easier to upgrade and debug, saving us time and increasing our confidence while working with it.
Easier Upgrades
ingress-nginx is a fast-moving project. New releases happen every few weeks, and usually contain one or two breaking changes. When running a very large cluster, it can be difficult to know whether any configuration changes need to be made to remain compatible with the new version. Our usual process for upgrading ingress-nginx before the plugin existed was to read the CHANGELOG for the new version, export every single ingress in a cluster as YAML, then manually grep through those YAMLs to find anything that would be broken by the new version. With the plugin, it's as simple as running its lint subcommand, kubectl ingress-nginx lint, to get a nice, formatted list of everything you might need to change.
Improved Ingress Listing
When running the vanilla kubectl get ingresses, a fairly minimal amount of information is returned: little more than each ingress’s name, hosts, address, and ports. Often you don’t just care about the hosts and addresses of a whole ingress; you care about the individual paths inside the ingress, as well as the services they point to. Using the plugin’s ingresses subcommand, kubectl ingress-nginx ingresses, you can get a more detailed view of the contents of those ingresses (each path, the service and port it points to, and whether TLS is configured) without inspecting each one.
Using the plugin, it is much easier to answer questions like “what service is this path hitting?” or “does this site have TLS configured?”.
Better Debugging
There are many common debugging strategies for ingress-nginx that can become tedious to carry out manually. Usually, you’re required to find and select an ingress pod to inspect, then filter the output of whatever commands you run to find the information you’re looking for. The plugin provides convenient wrappers for many kubectl commands that make these tasks quicker and easier, automatically selecting a single ingress pod to run the command in. As an example, looking up an ingress pod’s name and running kubectl logs against it can be replaced by a single command: kubectl ingress-nginx logs.
Likewise, inspecting the internal state of the controller is much easier. Reading the controller’s generated nginx.conf for a particular host, which previously meant exec-ing into a pod and grepping through the file by hand, can be replaced by kubectl ingress-nginx conf --host <hostname>.
ingress-nginx stores some of its configuration state dynamically, making use of the openresty lua-nginx-module to add additional request handling logic. The plugin can be used to inspect this state as well. As an example, if you are using the session affinity annotations to add a session cookie, but the cookie doesn’t seem to be applied to requests, you can first check whether the controller is registering the annotation correctly by dumping the dynamic backend configuration with kubectl ingress-nginx backends. If the session affinity settings appear in that output, the annotation is correctly reflected in the controller’s dynamic configuration.
My Experience
Since the addition of the plugin, I have found that the time it takes me to upgrade or debug ingress-nginx has decreased substantially. When I first arrived as an intern here at Shopify in January 2019, I was tasked with upgrading our ingress-nginx deployments to version 0.22.0. I spent a long time grepping through ingress manifest dumps looking for breaking changes. I spent time trying to come up with the kubectl invocation that would allow me to inspect nginx.conf inside the controller. I didn’t even know that the dynamic configuration information existed, let alone the arcane incantations that would allow me to read it. There existed no way at all to read certificate information. It took me days to fully roll out the new version.
Near the end of my internship, I upgraded our deployments to ingress-nginx version 0.24.1. Finding breaking changes required only a few invocations of the lint subcommand. Debugging the controller configuration was similarly quick. I had the confidence to ship the new version much more quickly, and did so in a fraction of the time it had taken me a few months ago. Much of this can be attributed to the fact that there was now a single tool that allowed me to easily perform every debugging function I had previously been doing, plus a few more that I hadn’t even known existed. In addition, having all of these previously unintuitive and little-documented debugging tricks collected together in one easily usable tool will make it far easier to get started with debugging ingress-nginx for those who are unfamiliar with the project, as I was.
It’s also true that much of this time difference is due to my growth in both confidence and competence at Shopify. I’ve learned a great deal about Kubernetes, especially the nuts and bolts of how requests to Shopify services get from client to server and back. I’ve had the privilege of being paid to work on an open-source project. I’ve learned the skills, both technical and interpersonal, to function as part of a team in a large organization. Changes that I’ve made, code that I’ve written, have helped to process literally billions of web requests during my time here. This has been an extremely productive four months.
The plugin was released as part of ingress-nginx version 0.24.0, but should be compatible with version 0.23.0 as well. You can find the full plugin documentation and install instructions in the ingress-nginx docs. To get involved with the ingress-nginx project, or to ask questions, drop into the #ingress-nginx channel on the Kubernetes Slack.
If solving problems like these interests you, consider interning at Shopify. The intern application window is now open and closes on Wednesday, May 15th at 9:00 AM EST. Applications will not be accepted after this date.
Engineering a Historic Moment: Shopify Gets Ready for Cannabis in Canada
The biggest change for Shopify was the requirement to store personal information in Canada. This required Canadian-specific infrastructure that we were able to develop through our recent move to the Cloud and Google Cloud Platform’s new region in Montreal. Using this platform as our foundation, we created a new instance of Shopify, in an entirely new region, to meet the needs of this industry. In our migration, we built several new Google Cloud Platform projects (all based in the Montreal region) which included key projects housing Shopify’s core infrastructure such as PCI compliant payment processing infrastructure and a regional data warehouse.
The core infrastructure, which runs on a mixture of Google Kubernetes Engine and Google Compute Engine, already existed in our other regions which meant adding another region was relatively straightforward. We used Terraform to declare and configure all parts of the underlying infrastructure, like networks and Kubernetes Engine clusters. We also took advantage of improved resiliency features in Google Cloud Platform, such as regional clusters. We structured our compute node clusters to segregate workloads, minimizing the noisy neighbour problem to ensure maximum stability and reliability. After a few months of building out this infrastructure, configuring and testing it, we had the first working version of this new regional infrastructure running test shops with a functional storefront and admin. That’s when we faced our next major challenge: scaling.
A major factor in our scaling work was social: determining the behavior of cannabis consumers, an area with little to no available research. Most existing research focused on cannabis producers, whereas Shopify needed to understand consumers. We modeled a number of different traffic scenarios and provisioned enough infrastructure to ensure we could handle the peak traffic from each one. Some of the possibilities we considered included:
- A strong initial, worldwide surge of interest on storefront pages as curiosity about a government-run online cannabis store peaks
- Waves of traffic based on multiple days of media coverage across the world, with local timezone spikes
- Very strong initial sales in the first minutes and hours of store openings as Canadians rush to be one of the first to legally purchase recreational cannabis
- Possible bursts of denial of service attacks from malicious actors
We went through multiple cycles of load testing using a mix of different storefront traffic patterns, varying the relative percentages of search, product browsing, collection browsing and checkout actions to stress the system in different ways. Each cycle included different fixes and configuration changes to improve the performance and throughput of the system until we were satisfied that we would be able to handle all possible traffic scenarios. In addition, we modeled and tested different types of bot attacks to ensure our platform defenses were effective. Finally, we conducted multiple pre-mortem discussions and built out mitigation plans to address any scenario which would cause downtime for our merchants.
At the same time, we were solving how to keep personal information contained in Canada. This was extremely challenging as Shopify was built from day one with a number of storage and communications systems located outside of Canada, such as our data warehouse and network infrastructure. We examined each system for personal information to ensure that this information remains stored in Canada.
We ensured there were protections for regional storage in multiple places: inside the application, within the hosts, and at the network/infrastructure level. For our main Ruby on Rails application, we:
- Built a library which captured network requests and verified the requested host belonged to a list of known safe endpoints.
- Utilized strict network firewall rules and minimized interconnections to ensure that data wouldn’t accidentally traverse into other jurisdictions.
- Deployed the containers which house the main application with the absolute minimum number of secrets necessary for the service to function in order to ensure that any service outside the jurisdiction reached in error would simply reject the request due to insufficient credentials.
- Ensured the infrastructure used unique SSL certificates so data would not cross-pollinate between internal pieces of the system.
- Deployed all these protections, in combination with monitoring and alerting, ensuring the teams involved were notified of potential issues.
As launch day neared, we reduced the amount of change we applied to the environment to minimize risk. While the merchants were in their final testing cycles, we continued to perform load testing to ensure that the environment was optimally configured and ready. Having a successful launch day was critical for our merchants and we decided to scale the environment to handle five times the traffic and sales volume projections for launch day. Internally, we ran a series of game days (a form of fault injection where we test our assumptions about the system by degrading its dependencies under controlled conditions) for core infrastructure teams to validate that system performance and alerting was sufficient.
On launch day, merchants chose to take full advantage of the excitement and opened their stores one minute after midnight in their local time zones. That meant we’d see both retail and online launches starting at 10:31 PM EDT on October 16th (Newfoundland and Labrador) and continue through every hour until 3:01 AM EDT on October 17th (British Columbia). And at 12:01 NST, the first legal sale of cannabis in Canada was made on Shopify’s point of sale in Newfoundland followed by successful launches in Prince Edward Island, Ontario and British Columbia — all with zero downtime, excellent performance and secure storage and transmission of personal information within Canada.
Being part of launching a new retail industry and acting as a trusted partner with multiple licensed sellers while building infrastructure with regional data storage requirements, all on a strict deadline, was quite a challenge which required coordination across all Shopify departments. We learned a lot about what it takes to support regulated industries and restricted markets, knowledge which will help us support similar markets in the future, both in Canada and throughout the world. A number of the technologies and processes we developed during this project will continue to be improved and reused to support future deployments with similar requirements. Overall, it was incredibly rewarding to be part of a historic launch by contributing to and supporting the success of licensed recreational cannabis retailers throughout the country.
Intrigued? Shopify is hiring and we’d love to hear from you. Please take a look at the Production Engineer and Senior Technical Security Analyst roles available.
Preparing Shopify for Black Friday and Cyber Monday
Making commerce better for everyone is a challenge we face on a daily basis. For our Production Engineering team, it means ensuring that our 600,000+ merchants have a reliable and scalable platform to support their business needs. We need to be able to support everything our merchants throw at us, including the influx of holiday traffic during Black Friday and Cyber Monday (BFCM). All of this needs to happen without an interruption in service. We’re proud to say that the effort we put into deploying, scaling, and launching new projects on a daily basis gives our merchants access to a platform with 99.98% uptime.
Black Friday Cyber Monday 2018
To put the impact of this into perspective, Black Friday and Cyber Monday is what we refer to as our World Cup. Each year, our merchants push the boundaries of our platform to handle more traffic and more sales. This year alone, merchants made over $1.5 billion USD in sales throughout the holiday weekend.
What people may not realize is that Shopify is made up of many different internal services and interaction points with third-party providers, like payment gateways and shipping carriers. The performance and reliability of each of these dependencies can affect our merchants and buyers in different ways. That’s why our Production Engineering team’s preparations for BFCM run the entire gamut.
To increase the chances of success on BFCM, Production Engineering runs “game days” on our systems and their dependencies. Game days are a form of fault injection where we test our assumptions about the system by degrading its dependencies under controlled conditions. For example, we’ll introduce artificial latency into the code paths that interact with shipping providers to ensure that the system keeps working and does something reasonable, such as falling back to another third party or to hard-coded defaults if a third-party dependency becomes slow for any reason. We’ll also verify that a particular service responds as expected to a problem with its main datastore.
Besides fault injection work, Production Engineering also runs load testing exercises, where traffic volumes similar to what we expect during BFCM are generated synthetically and sent to the different applications to ensure that the system and its components behave well under the onslaught of requests they’ll serve on BFCM.
At Shopify, we pride ourselves on continuous and fast deploys that deliver features and fixes as quickly as we can; however, a high rate of change on a system increases the probability of issues that can affect our users. During the ramp-up period for BFCM, we manage the normal cadence of the company by establishing both a feature freeze and a code freeze. The feature freeze starts several weeks before BFCM and means no meaningful changes to user-facing features are deployed, preventing disruption to merchants’ workflows. At that point in the year, changes, even improvements, can have an unacceptable learning curve for merchants who are diligently getting ready for the big event.
A few days before BFCM, and during the event itself, an actual code freeze takes effect, meaning that only critical fixes can be deployed and everything else must remain in stasis. The idea is to reduce the possibility of introducing bugs and unexpected system interactions that could compromise the service during the peak days of the holiday season.
Did all of our preparations work out? With BFCM in the rearview mirror, we can say yes. This BFCM weekend was a record breaker for Shopify. We saw nearly 11,000 orders created per minute and around 100,000 requests per second served for extended periods during the weekend. All in all, most system metrics came in at roughly 1.8 times what they were in 2017.
The somewhat unsurprising conclusion is that running towards the risk by injecting faults, load testing, and role-playing possible disaster scenarios pays off. Also, reliability goes beyond your “own” system: most complex platforms these days have to deal with third parties to provide the best service possible. We’ve learned to trust our partners, but we also understand that any system can have downtime, and in the end, Shopify is responsible to our merchants and buyers.
How an Intern Released 3 Terabytes Worth of Storage Before BFCM
Hi there! I’m Gurpreet and currently finishing up my second internship at Shopify. I was part of the Products team during both of my internships. The team is responsible for building and maintaining the products area of Shopify admin. As a developer, every day is another opportunity to learn something new. Although I worked on many tasks during my internship, today I will be talking about one particular problem I solved.
The Problem
As part of the Black Friday Cyber Monday (BFCM) preparations, we wanted to make sure our database was resilient enough to smoothly handle increased traffic during flash sales. After completing an analysis of our top SQL queries, we realized that the database was scanning a large number of fixed-size storage units, called InnoDB pages, just to return a single record. We identified the records, historically kept for reporting purposes, that caused this excess scanning. After talking among different teams and making sure that these records were safe to delete, the team decided to write a background job to delete them.
So how did we accomplish this task which could have potentially taken our database down, resulting in downtime for our merchants?
The Background Job
I built the Rails background job using existing libraries that Shopify built to avoid overloading the database while performing different operations, including deletion. A naive way to perform deletions is to send either one batch delete query or one delete query per record. MySQL operations aren’t easy to interrupt, and the naive approach would easily overload the database with thousands of operations. The job-iteration library, one of the Shopify libraries I leveraged to overcome this, allows background jobs to run in iterations: the job runs in small chunks and can be paused between iterations to let other, higher-priority jobs run first or to perform certain checks. There are two parts to the job: the enumerator and the iterator. The enumerator fetches records in batches and passes one batch at a time to the iterator. The iterator then fetches the records in the given batch and deletes them. While this made sure we weren’t deleting a large number of records in a single SQL query, we still needed to make sure we weren’t deleting batches too quickly. Deleting batches too fast creates high replication lag and can affect the availability of the database. Thankfully, we have an existing internal throttling enumerator, which I also leveraged when writing the job.
After each iteration, the throttling enumerator checks if we’re starting to overload the database. If so, it automatically pauses the job until the database is back in a healthy state. We ensured our fetch queries used proper indexes and that the enumerator used a proper cursor for batches to avoid timeouts. A cursor can be thought of as a flag marking the last record in the previous batch: the next batch is fetched using the flagged record as the pivot, which avoids re-fetching previous records and includes only new ones in the current batch.
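To make that shape concrete, here’s a minimal sketch of such a job built on the open source job-iteration library. The model name, batch size, and deletion details are illustrative, not the actual job we ran:

# Illustrative sketch: a batched, interruptible deletion job.
class DeleteLegacyReportRecordsJob < ActiveJob::Base
  include JobIteration::Iteration

  def build_enumerator(cursor:)
    # The enumerator yields the records to delete in small batches,
    # resuming from the cursor if the job is interrupted between iterations.
    enumerator_builder.active_record_on_batch_relations(
      LegacyReportRecord.all, # hypothetical model
      cursor: cursor,
      batch_size: 1_000,
    )
  end

  def each_iteration(batch)
    # One small DELETE per iteration instead of a single huge query.
    batch.delete_all
  end
end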
The Aftermath
We ran the background job approximately two weeks before BFCM. It was a big deal: not only did it free up three terabytes of storage and result in large cost savings, it also made our database more resilient to flash sales.
For example, after the deletion, as seen in the chart below, our database was scanning around 3x fewer pages to return a single record. Since the database was reading fewer pages per record, it could serve an increased number of requests during flash sales without getting overloaded by unnecessary page scans. This also meant our merchants get the best BFCM experience with minimal technical issues during flash sales.
Database Scanning After Deletion
Truth be told, I was very nervous watching the background job run, because anything going wrong would have meant downtime for merchants, which is the last thing we want (and man, what a horrible intern experience that would be). At the peak, we were deleting approximately six million records a minute. The Shopify libraries I leveraged helped make deleting over 🔥5 billion records🔥 look like a piece of cake 🎂.
5 billion Records Deleted
What I Learned
I learned so much from this project. I got vital experience with open source projects by using Shopify’s job-iteration library. I also did independent research to better understand MySQL indexes and how cursors work. For example, I didn’t know how MySQL makes partial use of a composite index: it picks a subset of the index’s key parts, based on the longest prefix match with the predicates in the WHERE clause, to evaluate the query. Suppose we have an index on (A,B,C). A query with predicates (A,C) in the WHERE clause will only use key A from the index, but a query with predicates (A,B) will use both keys A and B. I also learned how to use SQL EXPLAIN to analyze SQL queries. It shows which indexes the database considered using, which index it ended up using, roughly how many rows it expected to examine, and a lot of other useful information. Apart from improving my technical skills, working on this project made me realize the importance of collecting as much context as possible before even attempting to solve a problem. My mentor helped me with cross-team communication. Overall, context gathering allowed me to identify possible complications ahead of time and make sure the background job ran smoothly.
Can you see yourself as one of our interns? Applications for the Summer 2019 term will be available at shopify.com/careers/interns from January 7, 2019. The deadline for applications is Monday, January 21, 2019, at 9:00 AM EST!
Running Apache Kafka on Kubernetes at Shopify
In the Beginning, There Was the Data Center
Shopify is a leading multi-channel commerce platform that powers over 600,000 businesses in approximately 175 countries. We first adopted Apache Kafka as our data bus for reliable messaging in 2014 and mainly used it for collecting events and log aggregation across the systems.
In that first year, our primary focus was building trust in the platform with our data analysts and developers by automating all aspects of cluster management, creating the proper in-house tooling needed for our daily operations, and helping them use it with minimum friction. Initially, our deployment was a single regional Kafka cluster in each of our data centers and one aggregate cluster for our data warehouse. The regional clusters would mirror their data to the aggregate using Kafka’s mirrormaker.
Apache Kafka deployment in the data center
Fast forward to 2016, and we’re managing many multi-tenant clusters in all our regions. These clusters are the backbone of our data superhighway, delivering billions of messages every day to our data warehouse and other application-specific Kafka consumers. Chef provisioned, configured, and managed our Kafka infrastructure in the data center, and we could deploy a configuration change to all clusters at once by updating one or more files in our Chef GitHub repository.
Moving to the Cloud
In 2017, Shopify started moving some services from our data centers to the cloud. We took on the task of migrating our Kafka infrastructure to the cloud with zero downtime. Our target was to achieve reliable cluster deployment with predictable and scalable performance and do all this without sacrificing ease of use and security. Migration was a three-step process:
- Deploy one regional Kafka cluster in each cloud region we use, and deploy an aggregate Kafka cluster in one of the regions.
- Mirror all regional clusters in the data center and in the cloud to both aggregate clusters in the data center and in the cloud. This guarantees both aggregate clusters will have the same data.
- Move Kafka clients (both producers and consumers) from the data center clusters and configure them to point to the cloud clusters.
Apache Kafka deployment during our move to the cloud
By the time we migrated all clients to the cloud clusters, the regional clusters in the data center had zero incoming traffic and we could safely shut them down. That was followed by a safe shutdown of the aggregate Kafka cluster in the data center as no more clients were reading from it.
Virtual-Machines or Kubernetes?
We compared running Kafka brokers in Google Cloud Platform (GCP) as Virtual Machines (VMs) against running them in containers managed by Kubernetes, and we decided to use Kubernetes for the following reasons.
The first option, using GCP VMs, is closer in concept to how we managed physical machines in the data center. We’d have full control of the individual servers, but we’d also need to write our own tooling to monitor and manage the state of the cluster as a whole, and to execute deployments in a way that doesn’t impact Kafka availability. For example, we can’t perform a configuration change and restart all Kafka brokers at once, as this results in a service outage.
Kubernetes, on the other hand, offers abstract constructs to manage a set of containers together as a stateless or stateful cluster. Kubernetes manages a set of Pods. Each Pod is a set of functionally related containers deployed together on a server called a Node. To manage a stateful set of nodes like a Kafka cluster, we used Kubernetes StatefulSets to control deployment and scaling of containers with an ordered and graceful deployment of changes including guarantees to prevent compromising the overall service availability. And to implement our own custom behavior that’s not provided by Kubernetes, we extended it using Custom Resources and Controllers, an extension for Kubernetes API to create user-defined resources and implement actions when these resources are updated.
This is an example of a Kubernetes StatefulSet template used to configure a Kafka cluster of 30 nodes:
Kubernetes StatefulSet template
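The template appeared as an image in the original post; as a rough, abbreviated stand-in (the image name, storage size, and labels are illustrative), it had approximately this shape:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 30               # one Pod per Kafka broker
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      affinity:
        podAntiAffinity:     # never co-locate two brokers on one node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: kubernetes.io/hostname
      containers:
        - name: kafka
          image: registry.example.com/kafka:2.1   # self-hosted image
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:      # each broker keeps its own persistent volume
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Ti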
Containerizing Kafka
Running Kafka in a Docker container is straightforward. The simplest setup stores the Kafka server configuration in a Kubernetes ConfigMap and mounts the configuration file in the container by referencing the proper ConfigMap key. But pulling a third-party Kafka image is risky: depending on an image from an external registry risks application failure if the image is changed or removed! We highly recommend hosting your own container registry and building your own Kafka image. In a critical software environment where you want to minimize sources of failure, it’s more reliable to build the image yourself and host it in your own registry, giving you more control over its content and availability.
Best Practices
Our Kafka Pods contain the Kafka container itself and another resource-monitoring container. Kafka isn’t friendly with frequent server restarts, because restarting a Kafka broker or container means terabytes of data shuffling around the cluster. Restarting many brokers at the same time risks offline partitions and, consequently, data loss. These are some of the best practices we learned and implemented to tune cluster availability:
- Node Affinity and Taints: schedules Kafka containers on nodes with the required specifications; taints guarantee that other applications can’t use nodes reserved for Kafka containers.
- Inter-pod Affinity and Anti-Affinity: prevents the Kubernetes scheduler from scheduling two Kafka containers on the same node.
- Persistent Volumes: provides durable storage for Kafka Pods and guarantees that a Pod always mounts the same disk volume when it restarts.
- Kubernetes Custom Resources: extends Kubernetes functionality; we use custom resources to automate and manage Kafka topic provisioning, cluster discovery, and SSL certificate distribution.
- Kafka broker rack-awareness: reduces the impact of a single Kubernetes zone failure by mapping Kafka containers to multiple Kubernetes zones.
- Readiness Probes: control how fast configuration changes roll out to cluster nodes.
We successfully migrated all our Kafka clusters to the cloud. We run multiple regional Kafka clusters and an aggregate one to mirror all other clusters before feeding its data into our data warehouse. Today, we stream billions of events daily across all clusters — these events are key to our developers, data analysts, and data scientists to build a world-class, data-driven commerce platform.
If you are excited about working on similar systems join our Production-Engineering team at Shopify here: Careers at Shopify
Iterating Towards a More Scalable Ingress
Shopify, the leading cloud-based, multi-channel commerce platform, is growing at an incredibly fast pace. Since the beginning of 2016, the number of merchants on the platform increased from 375,000 to 600,000+. As the platform scales, we face new and exciting challenges such as implementing Shopify’s Pod architecture and future proofing our cloud storage usage. Shopify’s infrastructure relies heavily on Kubernetes to serve millions of requests every minute. An essential component of any Kubernetes cluster is its ingress, the first point of entry in a cluster that routes incoming requests to the corresponding services. The ingress controller implementation we adopted at the beginning of the year is ingress-nginx, an open source project.
Before ingress-nginx, we used the Google Cloud Load Balancer Controller (glbc). We opted out of glbc because it underperformed at our scale: we observed subpar load balancing and request queueing, particularly during deployments. Shopify currently deploys around 40 times per day without scheduling downtime. At the time we identified these problems, glbc wasn’t endpoint aware while ingress-nginx was. Endpoint awareness allows the ingress to implement alternative load balancing solutions rather than relying on the solution offered by Kubernetes Services through kube-proxy. These reasons, together with the NGINX expertise Shopify acquired through running and maintaining its NGINX (supercharged with Lua) edge load balancers, led the Edgescale team to migrate the ingress on our Kubernetes clusters from glbc to ingress-nginx.
Even though we now leverage endpoint awareness through ingress-nginx to enhance our load balancing solution, there are still additional performance issues that arise at our scale. The Edgescale team, which is in charge of architecting, building and maintaining Shopify’s edge infrastructure, began contributing optimizations to the ingress-nginx project to ensure it performs well at Shopify’s scale and as a way to give back to the ingress-nginx community. This post focuses on the dynamic configuration optimization we contributed to the project which allowed us to reduce the number of NGINX reloads throughout the day.
Now’s the perfect time to introduce myself 😎— my name is Francisco Mejia, and I’m a Production Engineering Intern on the Edgescale team. One of my major goals for this internship was to learn and become familiar with Kubernetes at scale, but little did I know that I would spend most of my internship contributing to a Kubernetes project!
One of the first performance bottlenecks we identified when using ingress-nginx was the high frequency of NGINX reloads during application deployments. Whenever application deployments occurred on the cluster, we observed increased latencies for end users, which led us to investigate and find a solution to this problem.
NGINX uses a configuration file to store the active endpoints for every service it routes traffic to. During deployments to our clusters, Pods running the older version are killed and replaced with Pods running the updated version. A single deployment may trigger multiple reloads as the controller receives updates for the endpoint changes. Any time NGINX reloads, it reads the configuration file into memory, starts new worker processes, and signals the old worker processes to shut down gracefully.
Although NGINX reloads gracefully, reloads are still detrimental from a performance perspective. Old worker processes being shut down results in increased memory consumption, and the reset of keepalive connections and load balancing state. Clients that previously had open keepalive connections with the old worker processes now need to open new connections with the new worker processes. In addition, opening connections at a faster rate means that the server will need to allocate more resources to handle connection requests. We addressed this issue by introducing dynamic configuration to the ingress controller.
To reduce the number of NGINX reloads when deployments occur, we added the ability for ingress-nginx to update application endpoints by maintaining them in memory, eliminating the need for NGINX to regenerate the configuration file and issue a reload. We accomplished this by creating an HTTP endpoint inside NGINX, using lua-nginx-module, that receives endpoint configuration updates from the ingress controller and modifies an internal Lua shared dictionary storing the endpoint configuration for all services. This mechanism enabled us both to skip NGINX reloads during deployments and to significantly improve request latencies.
Here’s a more granular look at the general flow when we instruct the controller to dynamically configure endpoints:
- A Kubernetes resource is modified, created or deleted.
- The ingress controller sees the changes and sends a POST request to /configuration/backends containing the up-to-date list of endpoints for every service (sketched in the code after this list).
- NGINX receives a POST request to /configuration/backends which is served by our Lua configuration module.
- The module handles the request by receiving the list of endpoints for all services and updates a shared dictionary that keeps track of the endpoints for all backends.
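As a rough sketch of step 2 from the controller’s side, the update is just an HTTP POST. The port, payload shape, and type names below are simplified stand-ins, not ingress-nginx’s exact wire format:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// Backend is a simplified stand-in for the controller's backend model.
type Backend struct {
    Name      string   `json:"name"`
    Endpoints []string `json:"endpoints"` // "ip:port" for each ready Pod
}

// postBackends pushes the full, current endpoint list to the Lua handler
// listening inside NGINX, so no configuration rewrite or reload is needed.
func postBackends(backends []Backend) error {
    payload, err := json.Marshal(backends)
    if err != nil {
        return err
    }
    resp, err := http.Post(
        "http://127.0.0.1:18080/configuration/backends", // illustrative port
        "application/json",
        bytes.NewReader(payload),
    )
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("dynamic configuration update failed: %s", resp.Status)
    }
    return nil
}

func main() {
    // Example invocation; in the real controller this runs whenever
    // endpoint changes are observed from the Kubernetes API.
    err := postBackends([]Backend{
        {Name: "default-myapp-80", Endpoints: []string{"10.0.0.12:8080", "10.0.0.13:8080"}},
    })
    if err != nil {
        fmt.Println(err)
    }
}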
My team carried out tests to compare the latency of requests between glbc and ingress-nginx with dynamic configuration enabled. The test consisted of the following:
- Find a request rate for the load generator where the average request latency is under 100ms when using glbc to access an endpoint.
- Use the same rate to generate load on an endpoint behind ingress-nginx and compare latencies, standard deviation and throughput.
- Repeat step 1, but this time carry out application deploys while load is being generated to endpoints.
The latencies were distributed as follows:
Up until the 99.9th percentile of request latencies, both ingresses are very similar, but at the 99.99th percentile and beyond, ingress-nginx outperforms glbc by multiple orders of magnitude. It’s vital to minimize request latency as much as possible, as it directly impacts merchants’ success.
We also compared the request latencies when running the ingress controller with and without dynamic configuration. The results were the following:
From the graph, we can see that the 99th percentile of latencies when using dynamic configuration is comparable to the 99th percentile when using the vanilla ingress controller.
We also carried out the previous test, but this time during application deploys - here’s where we really get to see the impact of the dynamic configuration feature. The results are depicted below:
It’s clear from the graph that there’s a huge performance improvement beyond the 80th percentile for ingress-nginx with dynamic configuration.
When operating at Shopify’s scale a whole new world of engineering challenges and opportunities arise. Together with my teammates, we have the opportunity to find creative ways to solve optimization problems involving both Kubernetes and NGINX. We contributed our NGINX expertise to the ingress-nginx project and will continue doing so. The contribution explained throughout this post wouldn’t have been possible without the support of the ingress-nginx community, massive kudos to them 🎉! Keep an eye out for more ingress-nginx updates on its GitHub page!
E-Commerce at Scale: Inside Shopify's Tech Stack - Stackshare.io
9 minute read
Before 2015, we had an Operations and Performance team. Around this time, we decided to create the Production Engineering department and merge the teams. The department is responsible for building and maintaining common infrastructure that allows the rest of product development teams to run their code. Both Production Engineering and all the product development teams share responsibility for the ongoing operation of our end user applications. This means all technical roles share monitoring and incident response, with escalation happening laterally to bring in any skill set required to restore service in case of problems.
Shopify’s Infrastructure Collaboration with Google
We’re always working to deliver the best commerce experience to our merchants and their customers. We provide a seamless merchant experience while shaping the future of retail by building a platform that can handle the traffic of a Kylie Cosmetics flash sale (they sell out in 20 seconds), ship new features into production hundreds of times a day, and process more than double the number of orders year over year.
For Production Engineering to meet these needs, we regularly review our technology stack to ensure we’re using the best tools for the job, and our journey to the Cloud is a perfect example. That’s why we’re excited to share that Shopify is now building our Cloud with Google. But before sharing the details of this announcement, we want to provide some context on our journey.
Shopify has been a cloud company since day one. We provide a commerce cloud to our merchants, solving their worries about hiring full-time IT staff to manage the infrastructure side of the business. Cloud is part of our DNA and our public cloud connection goes back to 2006, the same year both Shopify and Amazon Web Services (AWS) launched. Early on, we leveraged the public cloud as a small piece of our commerce cloud. It was great for hosting some of our smaller services, but we found the public cloud wasn’t a great fit for our main Rails monolith.
We’re pragmatic about how we evolve and invest in our infrastructure. In our startup days, with a small team, we valued simplicity and chose to focus on shipping the foundations of a commerce platform, deferring more complex infrastructure like database sharding. As we grew in scale and engineering expertise, we took on more complex patterns. With each major infrastructure scalability feature we shipped, like database sharding, application sharding, and production load testing, we continued to revisit how to horizontally scale our Rails application across thousands of servers. Over the years, we moved more and more of our supporting services to the Cloud, gaining additional context that fed into our developing monolith Cloud strategy.
Our latest push to the Cloud started over two years ago, when Google launched Google Kubernetes Engine (GKE) (formerly Google Container Engine) just as we finished production-hardening Docker. In 2014, Shopify invested in Docker to capitalize on the benefits of immutable infrastructure: predictable, repeatable builds and deployments; simpler and more robust rollbacks; and elimination of configuration management drift. Once you’re running containers, the next natural step is to take inspiration from Google’s Borg and start building out a dynamic container management and orchestration system. Being early adopters of Docker meant there weren’t many open-source options available, so we decided to build minimal container management features ourselves. The community and codebase were in their infancy and changing rapidly; building these features ourselves allowed us to focus on application scalability and resilience while avoiding additional complexity as the Docker community matured.
In 2016, internal discussions began around what Shopify would look like in the future. The infrastructure changes from 2012 to 2016 allowed us to lay the foundation for using the Cloud in a pragmatic way via database sharding, application sharding, perf testing and automated failovers, but we were still missing an orchestration solution. Luckily, several exciting developments were happening, and the most promising one for Shopify was Kubernetes, an open-source container management system created by the teams at Google that built Borg and GKE.
After 12 years of building and running the foundation of our own commerce cloud with our own data centers, we are excited to build our Cloud with Google. We are working with a company who shares our values in open-source, security, performance and scale. We are better positioned to change the face of global commerce while providing more opportunities to the 600,000+ merchants on our platform today.
Since we began our Google Cloud migration, we have:
- Built Shop Mover, a selective database migration tool that lets us rebalance shops between database shards with an average of 2.5s of downtime per shop
- Migrated over 50% of our data center workloads, and counting, to Google Cloud
- Contributed to and leveraged Grafeas, Google’s open source initiative to define a uniform way of auditing and governing the modern software supply chain
- Grown to over 400 production services and built a platform as a service (PaaS) to consolidate all production services on Kubernetes
- Joined the Cloud Native Computing Foundation (CNCF) and participated in the Kubernetes Apps Special Interest Group and Application Definition Working Group
By leveraging Google’s deep understanding of global infrastructure at scale, we’re able to ensure that every engineer we hire focuses on building and shaping the future of commerce on a global scale.
Stay tuned. We’re excited to share more stories about Shopify’s journey to Google Cloud with you.
Dale Neufeld, VP of Production Engineering
A Pods Architecture To Allow Shopify To Scale
In 2015, it was no longer possible to continue buying a larger database server for Shopify. We finally had no choice but to shard the database, which allowed us to horizontally scale our databases and continue our growth. However, what we gained in performance and scalability we lost in resilience. Throughout the Shopify codebase was code like this:
Sharding.with_each_shard do
some_action
end
If any of our shards went down, that entire action would be unavailable across the platform. We realized this would become a major problem as the number of shards continued to increase. In 2016 we sat down to reorganize Shopify’s runtime architecture.
Future Proofing Our Cloud Storage Usage
How we reduced error rates, and dropped latencies across merchants’ flows
Reading Time: 6 Minutes
Shopify merchants trust that when they build their stores on our platform, we’ve got their back. They can focus on their business, while we handle everything else. Any failures or degradations that happen put our promise of a sturdy, battle-tested platform at risk.
To do so, we need to ensure that the platform stays up and stays reliable. Since 2016, Shopify has grown from 375,000 merchants to over 600,000. As of today, an average of 450,000 S3 operations per second are made through our platform. However, that rapid growth also came with an increased S3 error rate and increased read and write latencies.
While we use S3 at Shopify, if your application uses any flavor of cloud storage, and its use of cloud storage strongly correlates with the growth of your user base—whether it’s storing user or event data—I’m hoping this post provides some insight into how to optimize your cloud storage!
Implementing ChatOps into our Incident Management Procedure
8 minute read
Production engineers (PE) are expected to be incident management experts. Still, incident handling is difficult, often messy, and exhausting. We encounter new incidents, search high and low for possible explanations, sometimes tunnel on symptoms, and, under pressure, forget some best practices.
At Shopify, we care not only about handling incidents quickly and efficiently, but also PE well-being. We have a special IMOC (incident manager on call) rotation and an incident chatbot to assist IMOCs. This post provides an overview of incident management at Shopify, the responsibility of different roles during an incident, and how our chatbot works to support our team.
Upgrading Shopify to Rails 5
Today, Shopify runs on Rails 5.0, the latest version. It’s important to us to stay on the latest version so we can improve the performance and stability of the application without increasing the maintenance cost of applying monkey patches. This guarantees we’re always on a version maintained by the community, and that we get access to new features sooner.
Upgrading the Shopify monolith—one of the oldest and largest Rails applications in the industry—from Rails 4.2 to 5.0 took us nearly a year. In this post, I’ll share our upgrade story and the lessons we learned. If you’re wondering what the Shopify scale looks like, or you’re planning a major Rails upgrade, this post is for you.
Surviving Flashes of High-Write Traffic Using Scriptable Load Balancers (Part II)
7 minute read
In the first post of this series, I outlined Shopify’s history with flash sales, our move to Nginx and Lua to help manage traffic, and the initial attempt we made to throttle traffic that didn’t account sufficiently for customer experience. We had underestimated the impact of not giving preference to customers who’d entered the queue at the beginning of the sale, and now we needed to find another way to protect the platform without ruining the customer experience.
Surviving Flashes of High-Write Traffic Using Scriptable Load Balancers (Part I)
7 minute read
This Sunday, over 100 million viewers will watch the Super Bowl. Whether they're catching the match-up between the Falcons and the Patriots, or are just there for the commercials between the action, that's a lot of eyeballs—and that's only counting America. But all that attention doesn't just stay on the screen; it gets directed to the web, and if you're not prepared, curious visitors could be rewarded with a sad error page.
The Super Bowl makes us misty-eyed because our first big flash sale happened in 2007, after the Colts beat the Bears. Fans rushed online for T-shirts celebrating the win, giving us a taste of what can happen when a flood of people converges on one site in a very short period of time. Since then, we've been continually levelling up our ability to handle flash sales, and our merchants have put us to the test: on any given day, they'll hurl Super Bowl-sized traffic at us, often without notice.
Why Shopify Moved to The Production Engineering Model
6 minute read
The traditional model of running large-scale computer systems divides work into Development and Operations as distinct and separate teams. This split works reasonably well for computer systems that are changed or updated very rarely, and organizations sometimes require this if they’re deploying and operating software built by a different company or organization. However, this rigid divide fails for large-scale web applications that are undergoing frequent or even continuous change. DevOps is the term for a movement that’s gathered steam in the past decade to bring together these disciplines.
Adventures in Production Rails Debugging
5 minute read
At Shopify we frequently need to debug production Rails problems. Adding extra debugging code takes time to write and deploy, so we've learned how to use tools like gdb and rbtrace to quickly track down these issues. In this post, we'll explain how to use gdb to retrieve a Ruby call stack, inspect environment variables, and debug a really odd warning message in production.
We recently ran into an issue where we were seeing a large number of similar warning messages spamming our log files:
/artifacts/ruby/2.1.0/gems/rack-1.6.4/lib/rack/utils.rb:92: warning: regexp match /.../n against to UTF-8 string
This means we are trying to match an ASCII (/n) regular expression against a UTF-8 source string.
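As a minimal illustration (not our application code), the same warning can be triggered like so:
# A regexp built with the /n flag has no encoding (ASCII-8BIT), so
# matching it against a UTF-8 string containing multibyte characters
# makes MRI emit exactly this warning.
pattern = /foo/n
pattern =~ 'foo café' # warning: regexp match /.../n against to UTF-8 string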
Secrets at Shopify - Introducing EJSON
This is a continuation of our series describing our evolution of Shopify toward a Docker-powered, containerized data centre. Read the last post in the series here.
One of the challenges along the road to containerization has been establishing a way to move application secrets like API keys, database passwords, and so on into the application in a secure way. This post explains our solution, and how you can use it with your own projects.
Tuning Ruby's Global Method Cache
I was recently profiling a production Shopify application server using perf and noticed a fair amount of time being spent in a particular function, st_lookup, which is used by Ruby's MRI implementation for hash table lookups.
Hash tables are used all over MRI, and not just for the Hash object: global variables, instance variables, classes, and the garbage collector all use MRI's internal hash table implementation, st_table. Unfortunately, what this profile did not show were the callers of st_lookup. Is this some application code that has gone wild? Is this an inefficiency in the VM?
Docker at Shopify: How We Built Containers that Power Over 100,000 Online Shops
This is the second in a series of blog posts describing our evolution of Shopify toward a Docker-powered, containerized data center. This instalment will focus on the creation of the container used in our production environment when you visit a Shopify storefront.
Read the first post in this series here.
Why containerize?
Before we dive into the mechanics of building containers, let's discuss motivation. Containers have the potential to do for the datacenter what consoles did for gaming. In the early days of PC gaming, each game typically required video or sound driver massaging before you got to play. Gaming consoles, however, offered a different experience:
- predictable: cartridges were self-contained fun, always ready to run, with no downloads or updates.
- fast: cartridges used read-only memory for lightning-fast speeds.
- easy: cartridges were robust and largely child-proof - they were quite literally plug-and-play.
Predictable, fast, and easy are all good things at scale. Docker containers provide the building blocks to make our data centers easier to run and more adaptable by placing applications into self-contained, ready-to-run units much like cartridges did for console games.
Building an Internal Cloud with Docker and CoreOS
This is the first in a series of posts about adding containers to our server farm to make it easier to scale, manage, and keep pace with our business.
The key ingredients are:
- Docker: container technology for making applications portable and predictable
- CoreOS: provides a minimal operating system, systemd for orchestration, and Docker to run containers
Shopify is a large Ruby on Rails application that has undergone massive scaling in recent years. Our production servers are able to scale to over 8,000 requests per second by spreading the load across 1700 cores and 6 TB RAM.
Kafka Producer Pipeline for Ruby on Rails
Three things led us to Kafka:
- We were looking for a reliable way to collect event data and send it to our data warehouse.
- We were considering a more service-oriented architecture, and needed a standardized way of message passing between the components.
- We were starting to evaluate containerization of Shopify, and were searching for a way to get logs out of containers.
We were intrigued by Kafka due to its highly available design. However, Kafka runs on the JVM, and its primary user, LinkedIn, runs a full JVM stack. Shopify is mainly Ruby on Rails and Go, so we had to figure out how to integrate Kafka into our infrastructure.
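As a rough sketch of what the producing side looks like (illustrative only: this uses the ruby-kafka gem, and the broker addresses, client id, and topic name are assumptions, not our actual pipeline):
require 'json'
require 'kafka'

# Connect to the cluster and create an async producer that flushes
# batches in the background rather than blocking the request path.
kafka = Kafka.new(['kafka1:9092', 'kafka2:9092'], client_id: 'shopify-events')
producer = kafka.async_producer(delivery_interval: 1)

producer.produce({ event: 'checkout_completed' }.to_json, topic: 'events')
producer.shutdown # drain any buffered messages before the process exits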
Building a Rack Middleware
I'm Chris Saunders, one of Shopify's developers. I like to keep journal entries about the problems I run into while working on the various codebases within the company.
Recently we ran into an issue with authentication in one of our applications, and as a result I ended up learning a bit about Rack middleware. I feel the experience was worth sharing with the world at large, so here's a rough transcription of my entry. Enjoy!
I'm looking at invalid form submissions for users who were trying to log in via their Shopify stores. The issue was actually at the middleware level: we were passing invalid data off to OmniAuth, which would then choke because it was dealing with invalid URIs.
The bug in particular was that we were generating the shop URL based on the data the user submitted. Normally we'd expect something like mystore.myshopify.com or simply mystore, but of course forms can be confusing, and people put stuff in there like http://mystore.myshopify.com or, even worse, my store. We'd build up a URL, end up passing something like https://http::/mystore.myshopify.com.myshopify.com, and cause an exception to get raised.
Another caveat is that we aren't even able to sanitize the input before passing it off to OmniAuth, unless we add more code to the lambda that we pass into the setup initializer.
Adding more code to an initializer is definitely less than optimal, so we figured we could implement this in a better way: adding a middleware that runs before OmniAuth, so that we could attempt to recover the bad form data, or simply kill the request before we get too deep.
We took a bit of time to learn about how Rack middlewares work, and looked to the OmniAuth code for inspiration since it provides a lot of pluggability and is what I'd call a good example of how to build out easily extendable code.
We decided that our middleware would be initialized with a series of routes to run a set of sanitization strategies on. Based on how OmniAuth works, I gleaned that the arguments after config.use MyMiddleWare would be passed into the middleware during the initialization phase - perfect! We whiteboarded a solution: a middleware, initialized with routes and strategies, that would clean up matching requests before they ever reached OmniAuth.
Now that we had a goal, we just had to implement it. We started off by building out the strategies, since they were extremely easy to test. The interface we decided upon was the following:
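# (A sketch of that interface, reconstructed from the description below;
# the class name is an assumption.) A strategy is any object that
# responds to #call with the Rack::Request to be sanitized.
class Strategy
  def call(request)
    raise NotImplementedError
  end
end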
We decided that the actions would be destructive, so instead of creating a new Rack::Request at the end of our strategy calls, we'd change values on the object directly. It simplifies things a little bit, but we need to be aware that the order of operations might set some of our keys to nil, and we'd have to anticipate that.
The simplest sanitizer we need is one that cleans up whitespace. Because we're building these for .myshopify.com domains, we know the convention they follow: dashes are used as separators between words if the shop was created with spaces. For example, if I signed up with my super awesome store when creating a shop, that would be converted into my-super-awesome-store. So if a user accidentally puts in my super awesome store, we can totally recover that!
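Here's a sketch of that strategy (assuming the shop name arrives in a "shop" param, and using Rack::Request#update_param for the destructive write):
class WhitespaceSanitizer
  # Destructively rewrites the shop param in place, converting the
  # spaces a user might type into the dashes .myshopify.com expects.
  def call(request)
    shop = request.params['shop']
    return if shop.nil?
    request.update_param('shop', shop.strip.gsub(/\s+/, '-'))
  end
end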
Now that we have a sanitization strategy written up, let's work on our actual middleware implementation.
According to the Rack spec, all we really need to do is ensure that we return the expected result: an array consisting of three things: a response code, a hash of headers, and an iterable that represents the content body. An example of the most basic Rack response is:
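# (Reconstructed example.) Status code, headers hash, iterable body:
[200, { 'Content-Type' => 'text/plain' }, ['Hello, world!']]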
Per the Rack spec, middlewares are always initialized with a Rack app as the first argument, followed by whatever else they need. So let's get to the actual implementation:
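# (A plausible reconstruction; the class name, keyword arguments, and
# exact route matching are assumptions.)
class RequestSanitizer
  def initialize(app, routes: [], strategies: [])
    @app = app                # the downstream Rack app (OmniAuth and friends)
    @routes = routes          # paths whose params we want to clean up
    @strategies = strategies  # destructive sanitizers, run in order
  end

  def call(env)
    request = Rack::Request.new(env)
    @strategies.each { |strategy| strategy.call(request) } if @routes.include?(request.path)
    @app.call(env)            # hand the (possibly cleaned-up) request downstream
  end
end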
That's pretty much it! We've written up a really simple middleware that takes care of cleaning up bad user input, which isn't necessarily a bad thing. People make mistakes, and we should try as much as possible to react to this data in a way that isn't jarring to the users of our software.
You can check out our implementation on GitHub and install it via RubyGems. Happy hacking!
What Does Your Webserver Do When a User Hits Refresh?
Your web application is likely rendering requests even when the requesting client has already disconnected. Eric Wong helped us devise a patch for the Unicorn webserver that tests the client connection before calling the application, effectively dropping disconnected requests before wasting app server rendering time.
The Flash Sale
A common traffic pattern we see at Shopify is the flash sale, where a product is discounted heavily or only available for a very short period of time. Our customers' flash sales can cause traffic spikes an order of magnitude above our typical traffic rate.
This blog post highlights one of the problems dealing with these traffic surges that we solved during our preparation for the holiday shopping season.
In a flash sale scenario, with our app servers under high load, response time grows. As our response time increases, customers attempting to buy items will hit refresh in frustration. This was causing a snowball effect that would contribute to reduced availability.
Connection Queues
Each of our application servers runs Nginx in front of many Unicorn workers running our Rails application. When Nginx receives a request, it opens a queued connection on the shared socket used to communicate with Unicorn. The Unicorn workers work off requests in the order they're placed on the socket's connection backlog.
The worker process looks something like:
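# (A rough sketch of the loop, not Unicorn's actual source; the socket
# path is assumed.)
require 'socket'

server = UNIXServer.new('/tmp/unicorn.sock') # shared socket Nginx proxies to
loop do
  client = server.accept                # 1. pull the next connection off the backlog
  request = client.readpartial(16_384)  # 2. read the request and render a response
  body = 'rendered page'                #    (rendering dominates the time)
  client.write("HTTP/1.1 200 OK\r\nContent-Length: #{body.bytesize}\r\n\r\n#{body}")
  client.close                          # 3. write the response and close
end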
The second step takes the vast majority of the time spent processing a request. Under load, the queue of pending requests sitting on the UNIX socket from Nginx grows until it reaches maximum capacity (SOMAXCONN). When the queue reaches capacity, Nginx will immediately return a 502 to incoming requests, as it has nowhere to queue the connection.
Pending Requests
While the app worker is busy rendering a request, the pending requests in the socket backlog represent users waiting for a result. If a user hits refresh, their browser closes the current connection and their new connection enters the end of the queue (or Nginx returns a 502 if the queue is full). So what happens when the application server gets to the user's original request in the queue?
Nginx and HTTP 499
The HTTP 499 response code is not part of the HTTP standard. Nginx logs this response code when a user disconnects before the application has returned a result. Check your logs: an abundance of 499s is a good indication that your application is too slow or over capacity, as people are disconnecting instead of waiting for a response. Your Nginx logs will always have some 499s, due to clients disconnecting before even a quick request finishes.
HTTP 200 vs HTTP 499 Responses During a Flash Sale
When Nginx logs an HTTP 499 it also closes the downstream connection to the application, but it is up to the application to detect the closed connection before wasting time rendering a page for a client who already disconnected.
Detecting Closed Sockets
With the asynchronous nature of sockets, detecting a closed connection isn't straightforward. Your options are:
- Call select() on the socket. If a connection is closed, it will return as "data available" but a subsequent read() call will fail.
- Attempt to write to the socket.
Unfortunately, it is typical for web applications to find out that the client socket is closed only after spending the time and resources to render the page, when they attempt to write the response. This is what our Rails application was doing. The net effect was that every time a user pressed refresh, we would render that page even if the user had already disconnected. This caused a snowball effect: eventually our app workers were doing little but rendering pages and throwing them away, and our service was effectively down.
What we wanted to do was test the connection before calling the application, so we could filter out closed sockets and avoid wasting time. The first detection option above is not great: select() requires a timeout, and even select() with the shortest timeout generally takes a fraction of a millisecond to complete. So we went with the second option: write something to the socket to test it before calling the application. This is typically the best way to deal with resources anyway: just attempt to use them, and an error will surface if something is in the way. Unicorn was already acting that way, just not until after wasting time rendering the page.
Just write an 'H'
Thankfully all HTTP responses start with "HTTP/1.1", so (rather cheekily) our patch to Unicorn writes this string to test the connection before calling the application. If writing to the socket fails, Unicorn moves on to process the next request and only a trivial amount of time is spent dealing with the closed connection.
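Conceptually, the check looks something like this sketch (illustrative; not the actual patch):
def client_alive?(socket)
  socket.write('HTTP/1.1 ') # every response starts with these bytes anyway
  true
rescue Errno::EPIPE, Errno::ECONNRESET
  false # the client has gone away; skip rendering entirely
end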
Eric Wong merged this change into Unicorn master and soon after released Unicorn v4.5.0. To use this feature, add check_client_connection true to your Unicorn configuration.
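That is, in your Unicorn config file (conventionally config/unicorn.rb):
# config/unicorn.rb
check_client_connection true # requires Unicorn >= 4.5.0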
StatsD at Shopify
What is StatsD good for?
In my experience, there are two things that StatsD really excels at. The first is getting a high-level overview of some custom piece of data. We use NewRelic to tell us about the performance of our apps. NewRelic provides a great overview of our performance as a whole, even down to which of our controller actions are slowest, and though it has an API for custom instrumentation, I've never used it. For custom metrics, we're using StatsD.
We use lots of memcached, and one metric we track with StatsD is cache hits vs. cache misses on our frontend. On every request that hits a cacheable action we send an event to StatsD to record a hit or miss.
Caching Baseline (Green: cache hits, Blue: cache misses)
Note: The graphs in this article were generated by Graphite, the real-time graphing system that StatsD runs on top of.
As an example of how this is useful, we recently added some data to a cache key that wasn't properly converted to a string, so that piece of the key appeared unique far more often than it actually was. The net result was more cache misses than usual. Looking at our NewRelic data we could see that performance was affected, but it was difficult to see exactly where. The response time from our memcached servers was still good, the response time from the app was still good, but our number of cache misses had doubled, our number of cache hits had halved, and overall user-facing performance was down.
A problem
It wasn't until we looked at our StatsD graphs that we fully understood the problem. Looking at our caching trends over time we could clearly see that on a specific date something was introduced that was affecting caching negatively. With a specific date we were able to track down the git commit and fix the issue. Keeping an eye on our StatsD graphs we immediately saw the behaviour return to the normal trend.
Return to Baseline
The second thing that StatsD excels at is proving assumptions. When we're writing code we're constantly making assumptions. Assumptions about how our web app may be used, assumptions about how often an interaction will be performed, assumptions about how fast a particular operation may be, assumptions about how successful a particular operation may be. Using StatsD it becomes trivial to get real data about this stuff.
For instance, we push a lot of products to Google Product Search on behalf of our customers. There was a point where I was seeing an abnormally high number of failures returned from Google when we were posting these products via their API. My first assumption was that something was wrong at the protocol level and most of our API requests were failing. I could have done some digging around in the database to get an idea of how many failures we were getting, cross-referenced with how many products we were trying to publish and how frequently, and so on. But using our StatsD client (see below), I was able to add a simple success/failure metric to give me a high-level overview of the issue. Looking at the graph from StatsD I could see that my assumption was wrong, so I was able to eliminate that line of thinking.
statsd-instrument
We were excited about StatsD as soon as we read Etsy's announcement. We wrote our own client and began using it immediately. Today we're releasing that client. It's been in use in production since then and has been stalwartly collecting data for us. On an average request we're sending ~5 events to StatsD and we don't see a performance hit. We're actually using StatsD to record the raw number of requests we handle over time.
statsd-instrument provides some basic helpers for sending data to StatsD, but we don't typically use those directly. We definitely didn't want to litter our application with instrumentation details so we wrote metaprogramming methods that allow us to inject that instrumentation where it's needed. Using those methods we have managed to keep all of our instrumentation contained to one file in our config/initializers folder. Check out the README for the full API or pull down the statsd-instrument rubygem to use it.
A sample of our instrumentation shows how to use the library and the metaprogramming methods:
# Liquid
Liquid::Template.extend StatsD::Instrument
Liquid::Template.statsd_measure :parse, 'Liquid.Template.parse'
Liquid::Template.statsd_measure :render, 'Liquid.Template.render'

# Google Base
GoogleBase.extend StatsD::Instrument
GoogleBase.statsd_count_success :update_products!, 'GoogleBase.update_products'

# Webhooks
WebhookJob.extend StatsD::Instrument
WebhookJob.statsd_count_success :perform, 'Webhook.perform'
That being said, there are a few places where we do make use of the helpers directly (sans metaprogramming), still within the confines of our instrumentation initializer:
ShopAreaController.after_filter do
  StatsD.increment 'Storefront.requests', 1, 0.1
  return unless request.env['cacheable.cache']

  if request.env['cacheable.miss']
    StatsD.increment 'Storefront.cache.miss'
  elsif request.env['cacheable.store'] == 'client'
    StatsD.increment 'Storefront.cache.hit_client'
  elsif request.env['cacheable.store'] == 'server'
    StatsD.increment 'Storefront.cache.hit_server'
  end
end
Today we're recording metrics on everything from the time it takes to parse and render Liquid templates, to how often our Webhooks succeed, the performance of our search server, average response times from the many payment gateways we support, and the success or failure of user logins.
As I mentioned, we have many tools in our data toolbox, and StatsD is a low-friction way to easily collect and inspect metrics. Check out statsd-instrument on GitHub.