At Shopify, we use Game Day tests to practice how we react to unpredictable situations. Game Day tests involve deliberately triggering failure modes within our production systems, and analyzing whether the systems handle these problems in the ways we expect. I’ll walk through a set of best practices that we use for our internal Shopify Game Day tests, and how you can apply these guidelines to your own testing.
Shopify’s primary responsibility is to provide our merchants with a stable ecommerce platform. Even a small outage can have a dramatic impact on their businesses, so we put a lot of work into preventing outages before they occur. We verify our code changes rigorously before they’re deployed, both through automated tests and manual verification. We also require code reviews from other developers who are aware of the context of these changes and their potential impact on the larger platform.
But these upfront checks are only part of the equation. Inevitably, things will break in ways that we don’t expect, or due to forces that are outside our control. When this happens, we need to respond quickly, analyze the situation at hand, and restore the system to a healthy state. This requires close coordination between humans and automated systems, and the only way to ensure that it goes smoothly is to practice it beforehand. Game Day tests are a great way of training your team to expect the unexpected.
1. List All the Things That Could Break
The first step to running a successful Game Day test is to compile a list of all the potential failure scenarios that you’re interested in analyzing. Collaborate with your team to take a detailed inventory of everything that could possibly cause your systems to go haywire. List all the problem areas you know about, but don’t stop there—stretch your imagination!
- What are the parts of your infrastructure that you think are 100% safe?
- Where are your blind spots?
- What would happen if your servers started inexplicably running out of disk space?
- What would happen if you suffered a DNS outage or a DDoS attack?
- What would happen if all network calls to a host started timing out?
- Can your systems support 20x their current load?
You’ll likely end up with too many scenarios to reasonably test during a single Game Day testing session. Whittle down the list by comparing the estimated impact of each scenario against the difficulty you’d face in trying to simulate it. Avoid ranking scenarios by how likely you think they are to happen. Game Day testing is about insulating your systems against perfect-storm incidents, which often hinge on failure points whose danger was initially underestimated.
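One way to make this comparison concrete is to score each scenario and sort the list. The following is a minimal sketch, not a description of how Shopify does it: the `Scenario` class, the 1 to 5 scales, and the impact-to-difficulty ratio are all assumptions chosen for illustration. Note that there is deliberately no likelihood field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    impact: int      # estimated impact, 1 (minor) to 5 (severe); hypothetical scale
    difficulty: int  # effort needed to simulate it, 1 (easy) to 5 (hard)


def shortlist(scenarios, limit):
    """Return the `limit` scenarios with the highest impact per unit of difficulty."""
    ranked = sorted(scenarios, key=lambda s: s.impact / s.difficulty, reverse=True)
    return ranked[:limit]


# Example scores are made up for illustration.
candidates = [
    Scenario("Servers run out of disk space", impact=4, difficulty=1),
    Scenario("DNS outage", impact=5, difficulty=4),
    Scenario("All network calls to a host time out", impact=4, difficulty=2),
    Scenario("20x current load", impact=5, difficulty=5),
]

for scenario in shortlist(candidates, limit=2):
    print(f"{scenario.name}: impact {scenario.impact}, difficulty {scenario.difficulty}")
```

The scores are only as good as your team’s estimates, so treat the ranking as a conversation starter rather than a final answer.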
2. Create a Series of Experiments
At Shopify, we’ve found that we get the best results from our Game Day tests when we run them as a series of controlled experiments. Once you’ve compiled a list of things that could break, start thinking about how they will break, and write those expectations down as a list of discrete hypotheses.
- What are the side effects that you expect will be triggered when you simulate an outage during your test?
- Will the correct alerts be dispatched?
- Will downstream systems manifest the expected behaviors?
- When you stop simulating a problem, will your systems return to their original state?
If you express these expectations in the form of testable hypotheses, it becomes much easier to plan the actual Game Day session itself. Use a separate spreadsheet (in Google Sheets or a similar tool) to catalogue each of the prerequisite steps that your team will walk through to simulate a specific failure scenario. Below those steps, list the behaviors that you hypothesize will occur when you trigger that scenario, along with an indicator for whether each behavior actually occurred. Lastly, make sure to list the steps needed to restore your system to its original state.
Example spreadsheet for a Game Day test that simulates an upstream service outage. A link to this spreadsheet is available in the “Additional Resources” section below.
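If your team prefers to keep experiments alongside code, the same layout can be expressed as a small data structure. This is a rough sketch that mirrors the spreadsheet described above; the `Experiment` and `Hypothesis` classes, their field names, and the example steps are hypothetical and not taken from the actual spreadsheet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    description: str
    observed: Optional[bool] = None  # None until someone checks it during the test


@dataclass
class Experiment:
    scenario: str
    setup_steps: List[str]        # how to simulate the failure
    hypotheses: List[Hypothesis]  # behaviors we expect to see
    restore_steps: List[str]      # how to return to the original state

    def record(self, description: str, observed: bool) -> None:
        """Mark whether an expected behavior actually occurred."""
        for hypothesis in self.hypotheses:
            if hypothesis.description == description:
                hypothesis.observed = observed
                return
        raise KeyError(description)

    def pending(self) -> List[str]:
        """Hypotheses that nobody has checked yet."""
        return [h.description for h in self.hypotheses if h.observed is None]


# Illustrative content only.
experiment = Experiment(
    scenario="Upstream service outage",
    setup_steps=["Block outbound traffic to the upstream service"],
    hypotheses=[
        Hypothesis("Correct alerts are dispatched"),
        Hypothesis("Downstream systems show the expected behavior"),
        Hypothesis("Systems recover once the simulation stops"),
    ],
    restore_steps=["Unblock outbound traffic to the upstream service"],
)

experiment.record("Correct alerts are dispatched", True)
print(experiment.pending())
```

Keeping `observed` as three-valued (confirmed, failed, or not yet checked) makes it obvious when part of an experiment was skipped during the session.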
3. Test Your Human Systems Too
By this point, you’ve compiled a series of short experiments that describe how you expect your systems to react to a list of failure scenarios. Now it’s time to run your Game Day test and validate your experimental hypotheses. There are a lot of different ways to run a Game Day test, and one approach isn’t necessarily better than another. How you approach the testing should be tailored to the types of systems you’re testing, the way your team is structured and communicates, the impact your testing poses to production traffic, and so on. Whatever approach you take, just make sure that you track your experiment results as you go along!
However, there is one common element that should be present regardless of the specifics of your particular testing setup: team involvement. Game Day tests aren’t just about seeing how your automated systems react to unexpected pressures—you should also use the opportunity to analyze how your team handles these situations on the people side. Good team communication under pressure can make a huge difference when it comes to mitigating the impact of a production incident.
- What are the types of interactions that need to happen among team members as an incident unfolds?
- Is there a protocol for how work is distributed among multiple people?
- Do you need to communicate with anyone from outside your immediate team?
Make sure you have a basic system in place to prevent people from doing the same task twice, or incorrectly assuming that something is already being handled.
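Such a system can be very simple. As a minimal sketch of the idea, assuming an in-memory `TaskBoard` class invented here for illustration (in practice this is more likely a shared document or a chat channel), the rule is that the first person to claim a task owns it, and everyone else is told who that is:

```python
class TaskBoard:
    """Tracks who has claimed which incident task (hypothetical helper)."""

    def __init__(self):
        self._owners = {}

    def claim(self, task, person):
        """Claim a task and return its owner.

        Returns `person` if the claim succeeded, or the existing owner
        if someone else already claimed the task.
        """
        return self._owners.setdefault(task, person)

    def unclaimed(self, tasks):
        """Return the tasks that nobody has picked up yet."""
        return [task for task in tasks if task not in self._owners]


# Task and people names are made up for illustration.
tasks = ["Notify support team", "Check upstream dashboards", "Update status page"]
board = TaskBoard()

print(board.claim("Check upstream dashboards", "alice"))  # alice
print(board.claim("Check upstream dashboards", "bob"))    # alice
print(board.unclaimed(tasks))
```

The second call shows the duplicate-work case being caught, and `unclaimed` surfaces anything that people may have wrongly assumed was already handled.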
4. Address Any Gaps Uncovered
After running your Game Day test, it’s time to patch the holes that you uncovered. Your experiment spreadsheets should be annotated with whether each hypothesis held up in practice.
- Did your off-hours alerting system page the on-call developer?
- Did you correctly switch over to reading from the backup database?
- Were you able to restore things to their original healthy state?
For any gaps you uncover, work with your team to determine why the expected behavior didn’t occur, then establish a plan for how to correct the failed behavior. After doing so, you should ideally run a new Game Day test to verify that your hypotheses are now valid with the new fixes in place.
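To decide what the follow-up test needs to cover, it can help to sort the recorded results into three groups. This is an illustrative sketch only: the `results` dictionary and the `summarize` function are assumptions, with `True` meaning the hypothesis held, `False` meaning it failed, and `None` meaning it was never checked.

```python
# Hypothetical results copied from an annotated experiment spreadsheet.
results = {
    "Off-hours alerting pages the on-call developer": True,
    "Reads switch over to the backup database": False,
    "System returns to its original healthy state": None,  # not checked
}


def summarize(results):
    """Group hypotheses into confirmed, failed, and unverified."""
    summary = {"confirmed": [], "failed": [], "unverified": []}
    for hypothesis, observed in results.items():
        if observed is None:
            summary["unverified"].append(hypothesis)
        elif observed:
            summary["confirmed"].append(hypothesis)
        else:
            summary["failed"].append(hypothesis)
    return summary


summary = summarize(results)
# Anything that failed or was skipped belongs in the next Game Day test.
retest = summary["failed"] + summary["unverified"]
print(retest)
```

Treating unverified hypotheses as retest candidates avoids quietly counting a skipped check as a pass.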
This is also the opportunity to analyze any gaps in communication within your team, or problems that you identified regarding how people distribute work among themselves when they’re under pressure. Set aside some time for a follow-up session with the other Game Day participants to go over the results of the test, and ask for their input on what they thought went well versus what could use some improvement. Finally, make any necessary changes to your team’s guidelines for how to respond to these incidents going forward.
Using these best practices, you should be able to execute a successful Game Day test that gives you greater confidence in how your systems—and the humans that control them—will respond during unexpected incidents. And remember that a Game Day test isn’t a one-time event: you should periodically update your hypotheses and conduct new tests to make sure that your team remains prepared for the unexpected. Happy testing!
Additional Resources
- How Complex Systems Fail—a list of high-level characteristics of systemic failure that can help you think about how to model your failure scenarios.
- Common Ground and Coordination in Joint Activity—an investigation into how humans coordinate work within technically complex situations.
- List of Resilience Engineering papers and publications—a curated list of further reading materials about Resilience Engineering.
- Example spreadsheet for tracking Game Day test scenarios