Production engineers (PE) are expected to be incident management experts. Still, incident handling is difficult, often messy, and exhausting. We encounter new incidents, search high and low for possible explanations, sometimes tunnel on symptoms, and, under pressure, forget some best practices.
At Shopify, we care not only about handling incidents quickly and efficiently, but also about PE well-being. We have a dedicated incident manager on call (IMOC) rotation and an incident chatbot to assist IMOCs. This post provides an overview of incident management at Shopify, the responsibilities of the different roles during an incident, and how our chatbot supports our team.
Incident Management at Shopify
The IMOC’s role is to lead the incident response. Their main focus is on communication, on ensuring the response continues to move forward and escalates as required, and on follow-through. Our incident response process springs from the Incident Command System (ICS), which provides a common hierarchy within which responders from multiple agencies can be effective.
In Shopify’s version of the ICS model, the hierarchy is simplified to three roles: the Incident Commander (which we call the IMOC), who leads the incident response; the Public Information Officer, called a Support Response Manager (SRM) at Shopify, who takes care of public communication; and an operations section that directs all the actions needed to solve the incident, usually the component experts in our case.
It’s essential to note that the IMOC is on call for coordinating the incident response, not for fixing production issues (which is the component expert’s mission). They ensure that the incident goes through each step of the incident response funnel.
At Shopify, this roughly translates into:
- When we detect that something’s broken, this usually results in a page.
- We evaluate and confirm the issue, and then the IMOC starts an incident.
- The IMOC ensures that communication with other teams (or with the operations section experts in the outage area) and with merchants has occurred, and confirms that a proper fix is found and pushed.
- If the fix is working as expected, the IMOC will stop the incident, and then document the service disruption.
- An investigation happens, followed by a root cause analysis (RCA) meeting, which usually leads to action items.
- Addressing all action items results in the final resolution of the incident.
During an incident, response steps shouldn't be left to memory, especially if we want to consistently offer an effective, streamlined experience. Our chatbot, Spy, makes this easier by assisting the IMOC through the incident response. Spy features a set of incident commands that help reduce manual effort and context switching.
We integrated the bot with our conversation tool and several third-party tools (PagerDuty, StatusPage, GitHub) to send timely reminders. Here is an overview of our current ChatOps setup:
ChatOps is all about conversation-driven operations and uses group chat tools to go beyond basic conversation with context and actions taken from within the chat tool itself. It allows the infrastructure to be brought into the conversation. Communication between those third-party services and our chatbot is usually done via webhooks.
At Shopify, we use Slack as our chat app, and our main bot Spy stems from Lita. Lita is open source and written in Ruby, and can be extended easily. Behaviors can be added by simply defining modules.
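To make the idea of adding behaviors concrete, here is a minimal sketch of a chat-command router in plain Ruby. It is neither Lita's actual API nor Spy's code: `CommandRouter`, `route`, and `dispatch` are hypothetical names, used only to illustrate how a bot might map a message pattern to a handler.

```ruby
# Hypothetical, simplified command router (NOT Lita's real API).
class CommandRouter
  Route = Struct.new(:pattern, :handler)

  def initialize(bot_name)
    @bot_name = bot_name
    @routes = []
  end

  # Register a handler block for commands matching `pattern`.
  def route(pattern, &handler)
    @routes << Route.new(pattern, handler)
  end

  # Return the first matching handler's reply, or nil when the message
  # is not addressed to the bot or no route matches.
  def dispatch(message)
    prefix = "#{@bot_name} "
    return nil unless message.start_with?(prefix)

    command = message[prefix.length..-1]
    @routes.each do |r|
      match = r.pattern.match(command)
      return r.handler.call(match) if match
    end
    nil
  end
end

router = CommandRouter.new("spy")
router.route(/\Apage (?<target>\S+) (?<reason>.+)\z/) do |m|
  "Paging #{m[:target]}: #{m[:reason]}"
end

puts router.dispatch("spy page imoc order notifications not going out")
```

A real bot framework would also handle authentication, help text, and asynchronous replies; the sketch only shows the pattern-to-handler mapping.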
With the above overview of our ChatOps setup and the previous IMOC duties, we can now look into how Spy specifically helps with incident response.
Spy has three main sets of commands that help with IMOC duties:
- spy page
- spy incident
- spy status
Together, they help the IMOC go through the incident response funnel steps.
Step 1: Failure detection
An alert will page the IMOC. Someone who notices the failure may also page the IMOC with the `spy page` command. Example: `spy page imoc order notifications not going out`
The IMOC can also use a Spy command to acknowledge the page (e.g. `spy pager imoc ack 125`).
Step 2: Start incident
Once the issue is confirmed, the IMOC starts an incident with the `spy incident start` command. Spy will then bind the incident to a #war-room Slack channel where all the discussions will take place. This is to focus all incident response discussion in one place and allow all participants to have a shared understanding of what is going on. Without this focus, we can easily end up with independent parallel discussions and lots of confusion about who is doing what and why.
Spy also automatically notifies the SRM in case they get related merchant calls or chats.
If there was any prior ongoing issue, Spy will display that in the #war-room Slack channel.
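The start-of-incident behavior described in this step can be sketched roughly as follows. This is an illustration under assumptions, not Spy's implementation: `Incident`, `IncidentRegistry`, and the in-memory `notifications` list (a stand-in for real chat messages) are all hypothetical.

```ruby
# Hypothetical in-memory incident registry (illustration only).
Incident = Struct.new(:id, :summary, :imoc, :channel, :started_at, :stopped_at) do
  def ongoing?
    stopped_at.nil?
  end
end

class IncidentRegistry
  attr_reader :notifications

  def initialize(channel: "#war-room")
    @channel = channel
    @incidents = []
    @notifications = [] # stand-in for messages sent to chat
  end

  def ongoing
    @incidents.select(&:ongoing?)
  end

  # Start an incident: surface prior ongoing incidents in the channel,
  # bind the new incident to the channel, and notify the SRM.
  def start(summary:, imoc:, now: Time.now)
    ongoing.each do |prior|
      @notifications << [@channel, "Already ongoing: ##{prior.id} #{prior.summary}"]
    end
    incident = Incident.new(@incidents.size + 1, summary, imoc, @channel, now, nil)
    @incidents << incident
    @notifications << ["srm", "Incident ##{incident.id} started: #{summary}"]
    incident
  end
end

registry = IncidentRegistry.new
registry.start(summary: "checkout errors", imoc: "alice")
second = registry.start(summary: "order notifications delayed", imoc: "alice")
registry.notifications.each { |to, text| puts "#{to}: #{text}" }
```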
Step 3: Communication
After the IMOC starts the incident, communication is crucial to ensure that every stakeholder is aware of the issue. Through the `spy incident tldr` command, anyone at the company can ask Spy at any given moment what incidents are going on and see who is involved, when it started, and consult a brief summary.
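As a rough illustration of what a `tldr`-style summary could contain, the sketch below formats ongoing incidents into one line each. The `incident_tldr` function, the hash keys, and the sample data are assumptions made for the example; the actual output format of Spy is not described here.

```ruby
# Hypothetical summary formatter for a `tldr`-style command.
def incident_tldr(incidents, now: Time.now)
  return "No ongoing incidents." if incidents.empty?

  incidents.map do |i|
    minutes = ((now - i[:started_at]) / 60).floor
    "##{i[:id]} #{i[:summary]} (started #{minutes} min ago, IMOC: #{i[:imoc]}, " \
      "responders: #{i[:responders].join(', ')})"
  end.join("\n")
end

# Sample data, invented for the example.
now = Time.utc(2017, 1, 1, 12, 30)
incidents = [
  { id: 7, summary: "order notifications not going out", imoc: "alice",
    responders: %w[bob carol], started_at: Time.utc(2017, 1, 1, 12, 0) }
]
puts incident_tldr(incidents, now: now)
```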
Communication with other teams can be done right from the war room. Here again, there is no need to switch context: we can use the `spy incident tell` or `spy page` commands. Some actions can also be performed right from #war-room for everyone to see, whether it’s a data center failover or a deploy lock. This also prevents confusion around what actions have been performed, as they all occur in one space.
Occasionally we communicate with some third-party services that have a hosted status page and pager, to resolve incidents or update the page.
Step 4: Fix and mitigate
Spy can perform different mitigation actions because it is closely embedded in our infrastructure. Some examples include rebalancing traffic, failing over data centers, blackholing jobs, and locking deploy stacks.
Step 5: Stop incident
Once the fix has been shipped and verified in production, the IMOC can use the `spy incident stop` command, which will generate a service disruption document to verify and post once ready.
Step 6: Document the service disruption
Spy will add any #war-room notes tagged with a notepad emoji (📝) or prefixed with the `spy incident note` command to the service disruption document and post the resulting document in a direct message to the IMOC.
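The note-collection rule above can be sketched as a small filter over channel messages. This is a simplified illustration, not Spy's code: `extract_note`, `build_disruption_notes`, and the message hash shape are hypothetical.

```ruby
# Hypothetical note extraction from channel messages (illustration only).
NOTE_EMOJI = "\u{1F4DD}" # the notepad emoji

# Return the note text if the message is tagged as a note, else nil.
def extract_note(text)
  stripped = text.strip
  if (m = stripped.match(/\Aspy incident note\s+(.+)\z/m))
    m[1].strip
  elsif stripped.include?(NOTE_EMOJI)
    stripped.sub(NOTE_EMOJI, "").strip
  end
end

# Build the notes section of a service disruption document.
def build_disruption_notes(messages)
  lines = messages.map do |msg|
    note = extract_note(msg[:text])
    "- [#{msg[:at]}] #{msg[:user]}: #{note}" if note && !note.empty?
  end
  lines.compact.join("\n")
end

# Sample messages, invented for the example.
messages = [
  { at: "12:05", user: "alice", text: "spy incident note failed over to the second data center" },
  { at: "12:07", user: "bob",   text: "anyone seeing errors still?" },
  { at: "12:10", user: "carol", text: "#{NOTE_EMOJI} error rate back to normal" }
]
puts build_disruption_notes(messages)
```

Only the first and third messages end up in the notes; ordinary chatter is left out of the document.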
Taking care of IMOCs
Spy also sends timely reminders. For instance, if an incident has been ongoing for a while and the status page hasn’t been updated, Spy will send the IMOC a reminder. It also has a built-in mechanism to prevent on-call fatigue: if the IMOC has been handling an incident for a pre-specified amount of time, Spy will reach out to the IMOC squad to help the current IMOC.
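A reminder policy like this one could be expressed as a pure function over the incident's timestamps. The sketch below is an assumption-laden illustration: `due_reminders`, the hash keys, and both thresholds are hypothetical, and the values chosen are not Shopify's.

```ruby
# Hypothetical reminder policy; thresholds are illustrative, not Shopify's.
STATUS_PAGE_STALE_AFTER = 30 * 60     # seconds without a status page update
IMOC_FATIGUE_AFTER      = 2 * 60 * 60 # seconds of continuous incident handling

# Return a list of [recipient, message] pairs that are due at `now`.
def due_reminders(incident, now: Time.now)
  reminders = []
  last_update = incident[:status_page_updated_at] || incident[:started_at]
  if now - last_update >= STATUS_PAGE_STALE_AFTER
    reminders << [:imoc, "Status page has not been updated recently."]
  end
  if now - incident[:started_at] >= IMOC_FATIGUE_AFTER
    reminders << [:imoc_squad, "Current IMOC may need relief."]
  end
  reminders
end

# Sample incident, invented for the example.
started = Time.utc(2017, 1, 1, 12, 0)
incident = { started_at: started, status_page_updated_at: nil }
p due_reminders(incident, now: started + 10 * 60)
p due_reminders(incident, now: started + 3 * 60 * 60)
```

Keeping the policy free of side effects makes it easy to run on a timer and to test with fixed timestamps.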
Our chatbot supports best practices and streamlines incident response. Spy binds the incident to a discussion channel where all communications happen, allows status page updates directly from the chat room, keeps notes and records event times, and generates an incident summary. Our bot helps PEs stay focused during the incident, reduces post-incident toil, and supports detailed post-mortems. Spy has become an important member of our team: always available and ready to help, enabling us to lead an incident response effectively.
To learn more, catch my 2017 talk at SREcon.