- When to use this runbook
- What is an incident?
- Incident Roles and Responsibilities
- Steps to follow during an incident
(The French version is in development)
When to use this runbook
When an issue happens on a product we support (application down, bugs in production, product doing funny things that make you go 🤨)
When a new ticket comes into the support backlog with an “urgent” priority (and has been validated)
If this is a reported vulnerability, or something sent to security@cds-snc.ca, refer to our Vulnerability Disclosure: CDS Standard Operating Procedure ↗️ [it’s still an incident, just more sensitive]
If you have not been trained on the incident commander (IC) role, this document is close to a “one stop shop” for getting you ready; however, some formal training will help you be more effective. If you are expected to act as an IC and have not yet had training, reach out to the SRE team for IC training as soon as you can.
What is an incident?
We know the word “incident” sounds scary, but if it makes you look at something sideways or say “that’s odd”, err on the side of caution. You will never get in trouble for calling something an incident that turns out not to be.
Incident Roles and Responsibilities
See CDS Incident Management Handbook – Roles
Steps to follow during an incident
What to do when there is an incident:
Acknowledge
As the incident commander, acknowledge the incident in the channel it came from:
- If it came from an Opsgenie alert, tell Opsgenie you are on it
- If it came from a Freshdesk customer issue, let them know you are on it
- If someone emailed / slacked / sent you a carrier pigeon about the incident, respond to them, via that communication channel, that you are looking into it
Open communications
The goal at this stage is to establish and focus all incident team communications in well-known places.
We use a bot developed by the CDS SRE Team. You call the bot in any Slack channel by typing /incident and filling in the information. It will automatically create a channel, an incident report, and a Google Meet link for you.
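For the curious, here’s a minimal sketch of what a slash command like this can do under the hood, using Slack’s Bolt framework. The channel-naming scheme and the omitted report/Meet steps are illustrative assumptions, not the actual CDS bot’s implementation:

```typescript
import { App } from "@slack/bolt";

const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

// Responds to "/incident <short description>"
app.command("/incident", async ({ ack, command, client }) => {
  await ack(); // Slack requires an acknowledgement within 3 seconds

  // Derive a channel name from the description, e.g. "incident-login-errors"
  const slug = command.text.toLowerCase().replace(/[^a-z0-9]+/g, "-").slice(0, 60);
  const { channel } = await client.conversations.create({ name: `incident-${slug}` });

  // Give responders immediate context when they join
  await client.chat.postMessage({
    channel: channel!.id!,
    text: `:rotating_light: Incident declared by <@${command.user_id}>: ${command.text}`,
  });

  // The real bot also creates the incident report and Google Meet link,
  // e.g. via the Google Drive/Calendar APIs (omitted here).
});

(async () => {
  await app.start(Number(process.env.PORT) || 3000);
})();
```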
With a dedicated channel, we can invite only the people who need to be there and reduce the noise for others in the main product channel. After the incident is over, we’ll want to know what happened, when it happened, etc., and Slack is a great place to find that information if you aren’t so good at documenting as you go. If someone new jumps in on the incident, they’ll have all the background information at their fingertips.
Being public has other benefits; it increases:
- Psychological safety, since incidents aren’t things to hide or be ashamed of; they are normal things that happen.
- Knowledge transfer, since folks can see what is happening and learn from the incident.
- The chance an incident gets fixed faster, since someone with relevant knowledge who isn’t directly involved may jump in to help.
Set up the channel
You should set the topic of the channel to the current incident status (investigating, onfire, all-clear, etc.) and keep it up to date. You will also pin any important messages to the channel (anything someone joining the conversation would want to know, or that people in the conversation want to refer back to often), such as the description of the issue, observations, changes, and decisions.
The Slack bot will automatically bookmark the incident report and the video link. The IC can add more links, such as the support ticket or any other useful URLs.
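As a rough illustration, here’s how the topic, bookmarks, and pins above map onto Slack’s Web API; the channel ID, URLs, and message text are placeholders:

```typescript
import { WebClient } from "@slack/web-api";

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

async function setupIncidentChannel(channelId: string) {
  // Keep the topic in sync with the current incident status
  await slack.conversations.setTopic({
    channel: channelId,
    topic: "Status: investigating",
  });

  // Bookmark the incident report (the bot does this automatically);
  // the IC can add more, like the support ticket URL
  await slack.bookmarks.add({
    channel_id: channelId,
    title: "Incident report",
    type: "link",
    link: "https://docs.google.com/document/d/...", // placeholder
  });

  // Pin the incident description so newcomers see it immediately
  const posted = await slack.chat.postMessage({
    channel: channelId,
    text: "Description: users report errors on login since 14:05 ET",
  });
  await slack.pins.add({ channel: channelId, timestamp: posted.ts! });
}

setupIncidentChannel("C0123456789"); // placeholder channel ID
```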
You will then pull in folks that should be involved in the incident:
- Operations Lead
- Security
- Product Manager
- Someone from Outreach (tbd)
If there is a security element to the incident, send an email to security+securite@cds-snc.ca with a brief description of the situation and a link to the newly created Slack channel.
This is where you may want to start delegating the incident response roles to other team members.
Assess
After the incident team has their communication channels set up, the next step is to assess the incident’s severity so the team can decide what level of response is appropriate.
To do that, we ask:
- What is the impact on customers?
- Is this impacting a single client, multiple clients on a single service, or multiple services?
- Prioritize the incident to determine who works on it, how, and when. Is it a SEV1 issue that we work on 24/7 until it’s fixed? Or can this be flagged and worked on the next morning? Check with the PM and business unit director; if they are unavailable, make your best guess and document your decision-making.
Here’s a recommended description of the severity levels:
| Severity | Description | Examples |
| --- | --- | --- |
| 1 | A critical incident with very high impact | Service is down for all users; confidentiality or privacy is breached; user data loss |
| 2 | A major incident with significant impact | Service is unavailable for a subset of users; core functionality is significantly impacted |
| 3 | A minor incident with low impact | A minor inconvenience to users, workaround available; usable performance degradation |
Once you establish the impact of the incident, adjust or confirm the severity of the incident and communicate that severity to the team.
Severity 3 incidents are assigned to the responsible teams for resolution, normally during business hours, whereas severity 1 and 2 incidents require an immediate response and continuous 24/7 management through to resolution.
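If you ever want this triage logic in tooling, here’s a minimal sketch that encodes the severity table above; the input shape is an assumption for illustration, not an existing CDS type:

```typescript
type Severity = 1 | 2 | 3;

// Hypothetical impact checklist mirroring the examples in the table
interface Impact {
  serviceDownForAllUsers: boolean;
  confidentialityBreachOrDataLoss: boolean;
  subsetOfUsersImpacted: boolean;
  coreFunctionalityDegraded: boolean;
}

function assessSeverity(impact: Impact): Severity {
  if (impact.serviceDownForAllUsers || impact.confidentialityBreachOrDataLoss) return 1;
  if (impact.subsetOfUsersImpacted || impact.coreFunctionalityDegraded) return 2;
  return 3; // minor inconvenience, workaround available
}

// SEV1/SEV2: immediate, continuous response; SEV3: business hours are fine
const needsImmediateResponse = (severity: Severity): boolean => severity <= 2;

console.log(assessSeverity({
  serviceDownForAllUsers: false,
  confidentialityBreachOrDataLoss: false,
  subsetOfUsersImpacted: true,
  coreFunctionalityDegraded: false,
})); // 2
```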
Send initial comms
If the incident is considered a security event, stop here and follow this procedure.
When you’re reasonably confident that the incident is real, you need to communicate it externally. The Communication Lead 🔒 ↗️ informs external folks who need to know. If this is a partner product, that means talking to the partner. If this is a CDS product, that means updating the status page (tbd). Communicating quickly and openly about all our incidents helps to build trust with our teams and users.
Summarize what is happening in Slack into words we can share externally. For example:
- This is what we know.
- This is what we’re doing next.
- You can expect to hear an update from us by this time.
We know you won’t want to tell the world about every incident we have, but which is better: that people have the information, or that they find out later we swept it under the rug?
Write a sample response for the incident in case new users reach out to report it. There are templates if you want to use them; they are here. 🔒 ↗️
Manage and reply to all incoming tickets on the incident, and continually update external stakeholders throughout.
Iterate
Based on information learned in “Assess”, Operations Lead and Incident Commander 🔒 ↗️ discuss the incident and decide on how the incident should be mitigated and/or resolved. When in doubt, the incident commander should have the final say.
It’s highly recommended you have an open Google Meet for the duration of the incident. If the folks involved on the call agree, turn on captions ↗️ and use the Chrome extension ↗️ to save captions to a document to help with writing the post-mortem report.
Your biggest challenges as an IC are around maintaining the team’s discipline:
- Is the team communicating effectively?
- What are the current observations, theories, and streams of work?
- Are we making decisions effectively?
- Are we making changes intentionally and carefully? Do we know what changes we’re making? Are team members communicating their changes properly? Are team members confirming with IC before changes are made?
- Are roles clear? Are people doing their jobs? Do we need to escalate to more teams?
In any case, don’t panic; it doesn’t help. Stay calm and the rest of the team will take that cue.
You have to keep an eye on team fatigue and plan team handovers. A dedicated team risks burning themselves out when resolving a complex incident; look out for how long members have been awake and how long they’ve been working on the incident, and decide who’s going to fill their roles next.
Resolve
This is the bulk of the incident response. An incident is resolved when the business impact has ended.
The Operations Lead mitigates and/or resolves the issue. We recommend the following process:
- Before making any changes, review them with the Incident Commander
- Mitigate first to reduce the impact as you resolve the complete issue
- If a temporary technical solution exists:
- Determine the risk of this solution and get approval from the IC. If it has a low chance of making the problem worse, implement the fix.
- Get the code reviewed by a colleague before pushing it to production.
- In a bind, push to production
- If a manual workaround exists:
- Instruct the client on the manual workaround (or implement the manual workaround).
- Keep working on a longer-term fix.
- Write what you’re doing in the Slack channel to keep everyone up to date
- What you have tried
- What happened when you tried it
- What you are going to try next
- Write what you need others to do in the Slack channel
- Before making any changes, review them with the Incident Commander
- Identify the root cause and resolve it
- Get a colleague to review the code before pushing to production.
- If you have the time, write tests for the problem. These tests need to be written eventually, but they can wait until the next day if the problem is severe and immediate action is required. A sketch of one such regression test follows below.
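As a rough illustration, here’s what such a regression test might look like; the handleLogin module and the failure mode are hypothetical, and the Jest test runner is assumed:

```typescript
import { handleLogin } from "./auth"; // hypothetical module in the affected service

describe("regression test for the incident's root cause", () => {
  it("returns 400 instead of crashing when the password is missing", async () => {
    const response = await handleLogin({ email: "user@example.com", password: "" });
    expect(response.status).toBe(400); // previously threw, surfacing as a 500
  });
});
```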
Blameless post-mortem
We do a retrospective to ensure we’ve done the work to identify any and all root causes and come up with action items to reduce the probability / impact of it happening again.
We also share these reports with new hires to help them get up to speed on lessons we’ve learned prior to them joining.
“Umm. That sounds great but I don’t want to get in trouble because of my actions.” Never fear. If you do a retrospective and learn from the incident you won’t get in trouble.
The way it works:
- The IC finds a facilitator, ideally someone who wasn’t involved in the incident
- Within a week of incident resolution, a session is booked with the team involved in the incident plus the facilitator
- The IC fills in the incident report prior to the session
- The team puts the action items from the session into the appropriate tools (Trello, GitHub, or diary)
Archive the incident channel
Once all the information from the channel has been documented in your incident report, you can archive the channel to reduce the noise in our Slack.
Also, in this step, archive or clean up any other things that were created during the incident.