Incident Management Handbook

(The French version is under development.)

Introduction

When something goes wrong, whether it’s an outage or a broken feature, team members need to respond immediately and restore service. This process is called incident management.

In service delivery, change is the only constant. This means systems will continually be stressed in new and different ways. Teams that understand this also understand that it’s not a matter of if, but when, systems will fail. Taking steps to prepare for these failures and to run blameless postmortem processes should be recognized as a critical element of ongoing success, and integrated into the DNA of all delivery teams.

Objectives

This handbook outlines the stakeholders and actions required to ensure that incidents (including security events) are addressed in a consistent, coordinated, and timely fashion CDS-wide. The handbook will be tested and reviewed frequently, and modified as required.

We want to be able to consistently:

  • Clarify roles and responsibilities.
  • Document and test procedures.
  • Train staff.
  • Simplify decision-making for people and teams during incidents and blameless postmortems.
  • Build a consistent culture across teams of how we identify, manage, and learn from incidents.
  • Create a fast and predictable response to every incident, every time.

We want to comply with the following ITSG-33 ↗️ incident management controls (14):

  1. IR-1: Incident Response Policy And Procedures
  2. IR-2: Incident Response Training
  3. IR-2.1: Incident Response Training | Simulated Events
  4. IR-3: Incident Response Testing And Exercises
  5. IR-3.2: Incident Response Testing And Exercises | Coordination With Related Plans
  6. IR-4: Incident Handling
  7. IR-4.4: Incident Handling | Information Correlation
  8. IR-4.9: Incident Handling | Dynamic Response Capability
  9. IR-5: Incident Monitoring
  10. IR-6: Incident Reporting
  11. IR-6.2: Incident Reporting | Vulnerabilities Related To Incidents
  12. IR-7: Incident Response Assistance
  13. IR-8: Incident Response Plan
  14. SA-15.10: Development Process, Standards, And Tools | Incident Response Plan

Overview

Who is this guide for?

If you’re on a team that offers services to any type of users, this handbook is for you.

What is an incident?

At CDS, we define an incident as an event that causes a disruption to, or a reduction in the quality of, a service and that requires a response.

Our incident values

A process for managing incidents can’t cover all possible situations, so we empower our teams with general guidance in the form of values. At CDS, we work with the following values:

  • Put people at the heart of services: We’re driven by empathy and put people’s needs first.
  • Do the hard work to make things easier: We do not shy away from hard conversations but approach them with courage, humility, compassion, and integrity.
  • Work in the open to help clear a path: We share our work, progress, and failures to help others learn from what we do.
  • Take care of each other: We work to create a space where people feel they belong, take risks, learn from mistakes, and speak openly.

Incident Stages

  • Detect: A service includes enough monitoring and alerting to detect incidents before our users do. The best monitoring alerts us to problems before they even become incidents.
  • Respond & Assess: Declare an incident. Triage and prioritize. Recognize cyber security events.
  • Recover: Our users don’t care why their service is down, only that we restore service as quickly as possible. Never hesitate to get an incident resolved quickly so that we can minimize impact to our users.
  • Learn: Always blameless. Incidents are part of running services. We improve services by holding teams accountable, not by apportioning blame.
  • Improve: Never have the same incident twice. Identify the root cause and the changes that will prevent the whole class of incident from occurring again.

Tools

Opsgenie

We use Opsgenie to manage on-call rotations and escalations. To get access, reach out to #sre-and-tech-ops in Slack.
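
For illustration, here is a minimal sketch of checking who is currently on call through the Opsgenie REST API. The schedule name and API key below are placeholders, not real CDS values:

    import requests  # assumes the requests library is installed

    OPSGENIE_API_KEY = "<your-genie-key>"       # placeholder: an Opsgenie API integration key
    SCHEDULE_NAME = "example-product_schedule"  # placeholder: your team's on-call schedule name

    def current_on_call(schedule_name: str) -> list:
        """Return the recipients currently on call for an Opsgenie schedule."""
        response = requests.get(
            f"https://api.opsgenie.com/v2/schedules/{schedule_name}/on-calls",
            params={"scheduleIdentifierType": "name", "flat": "true"},
            headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()["data"]["onCallRecipients"]

    print(current_on_call(SCHEDULE_NAME))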

Slack Bot

The CDS Site Reliability Engineering (SRE) team built a bot to support incident response 🔒 ↗️. Launch the SRE Incident Bot in any Slack channel by typing /incident and filling in the information.

You can also post !incident in any Slack channel or thread; an auto-reply will remind you where to find the incident response runbook.

Slack, Google Meet and Drive

To instantly create a Slack channel, a Google Meet, and a post-mortem report, launch the SRE Incident Bot in any Slack channel by typing /incident and filling in the information.

[Screenshot: the /incident command on Slack]

Once the channel is created, use /sre incident roles in that channel to assign the roles. 

All incident reports are tracked here: Incidents 🔒 ↗️.

Status Page

Communicating status with both internal stakeholders and customers through a status page ↗️ helps keep everyone in the loop.


Roles

Incident Commander (IC)

Each incident is driven by the incident commander (IC), who has overall responsibility and authority for the incident. The incident commander is empowered to take any action necessary to resolve the incident, which includes paging anyone in the organization and keeping those involved in an incident focused on restoring service as quickly as possible.

The incident commander is a role, rather than an individual on the incident. The advantage of defining roles during an incident is that it allows people to become interchangeable. As long as a given person knows how to perform a certain role, they can take that role for any incident.

The IC should:
  • Command and coordinate the incident response.
  • Delegate roles as needed. By default, the IC assumes all roles that have not yet been delegated.
  • Communicate effectively.
  • Brief senior management.
  • Stay in control of the incident response.
  • Work with other responders to resolve the incident.
  • Appoint others to incident command positions as needed.
  • Provide information to and coordinate with the crisis communications or media relations team.
  • Terminate the response and demobilize resources when the situation has stabilized.
  • Transition to a new IC.
    • The current IC should hand the open incident over to a new IC if their on-call schedule is over or they feel they cannot continue for any reason.
    • Provide enough context for the new IC:
      • The current state of the incident
      • The next steps
      • The people currently working on the incident

Communications Lead (CL)

The CL is the public face of the incident and should be a person familiar with public communications (usually from the Outreach team). The CL should:

  • With the IC, determine if the incident requires public communications
  • Provide periodic updates to the incident response team and stakeholders (includes SCMA)
  • Develop a plain-language summary of the incident in collaboration with the IC
  • Work with the IC and PL to develop communications to users
  • Manage inquiries about the incident with SCMA

Policy Lead (PL)

The PL leads policy analysis related to the incident and makes recommendations to the IC on the policy implications of the incident, how to respond, and how to communicate this to clients, government stakeholders, and the public, if necessary. This work should predominantly happen alongside the Respond & Assess and Recover stages and in post-incident communication. On Platform, the PL is almost always the policy advisor already embedded on the product team. If your policy advisor is away, or your team does not have a policy advisor, please contact Nisa Malli, Delivery Policy Lead, and she will assign one to the incident. The PL should:

  • Notify the Delivery Policy Team and Team Lead; flag if there are cross-product, interoperability, or CDS-wide implications. Convene an ad hoc policy consultation as needed.
  • Assess potential policy implications of the incident (e.g., privacy, PT-M, etc), advise the IC and CL on the appropriate response, and support IC’s briefings to senior management.
  • Communicate incident to OCIO PDPD (Privacy and Data Protection Division) and TBS LSU (Legal Service Unit) when needed (see Chapter 9 for details)
  • In collaboration with the CL and product team: translate technical incident documentation into plain language, which could include editing the incident response report, drafting and sending emails to other parts of TBS, and preparing public communications such as emails to clients or updates on the product or CDS website.
  • Update product policy documentation (e.g., SLA, ToU) if necessary post-incident.
  • May act as the IC or PO alongside their PL role.

Operations Lead (OL)

The OL, with the help of others, responds to the incident by applying operational tools to mitigate or resolve it. This person is responsible for developing theories about what’s broken and why, deciding on changes, and running the technical team. The OL works closely with the IC, and the role can also be filled by the IC.

Post Mortem Owner (PO)

The IC nominates one person to be accountable for completing the postmortem. The PO drives the postmortem through drafting and approval, all the way until it’s published. 

Everyone in the incident

  • Document all work being done in the incident Slack channel.
  • Verify with the Incident Commander before doing work on production or staging systems (code changes, merging PRs, communicating with external stakeholders, configuration changes, deployments).

Respond

Set up an on-call team

Being on-call means being available during a set period of time, and being ready to respond to production incidents during that time with appropriate urgency. On-call is a large and complex topic, saddled with many constraints and a limited margin for trial and error. Every team is different, and we’ve put together a dedicated guide for it: Product support guidance – Guide for product teams ↗️.

The on-call welcome email

A small checklist of items to review is automatically sent to on-call individuals:

  • Do I have OpsGenie installed on my phone?
  • Do I have access to Freshdesk?
  • Do I have SSO access to the AWS accounts?
  • Do I have access as admin to the product console?
  • Do I know the phone book and the Slack channels to watch or call for help?

Response

We developed a list of steps to follow when an incident happens (Incident Response Runbook), which lets team members involved in the incident know what to do.

For a security-related incident, see How to Report a Security and Privacy Event or Incident 🔒 ↗️.

Escalation

Your first responders might be all the people you need in order to resolve the incident, but more often than not, you need to bring other teams into the incident by paging them. We call this escalation. 

We currently leverage Opsgenie to define on-call rosters so that any given team has a rotation of staff who are expected to be contactable to respond in an emergency. 
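
As a sketch of what paging another team can look like in practice, the snippet below creates an Opsgenie alert addressed to a responder team. The team name, message, and API key are placeholders, and your team’s escalation policy may differ:

    import requests  # assumes the requests library is installed

    OPSGENIE_API_KEY = "<your-genie-key>"  # placeholder: an Opsgenie API integration key

    def page_team(team_name: str, message: str, priority: str = "P1") -> str:
        """Create an Opsgenie alert that notifies the named team's on-call rotation."""
        response = requests.post(
            "https://api.opsgenie.com/v2/alerts",
            headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
            json={
                "message": message,
                "priority": priority,
                "responders": [{"type": "team", "name": team_name}],
            },
            timeout=10,
        )
        response.raise_for_status()
        return response.json()["requestId"]

    # Hypothetical escalation of a severity 1 incident to a platform team.
    print(page_team("example-platform-team", "SEV1 - API returning errors for all users"))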

Communication

During an emergency, proper communication is often the key element of a successful response. You have two audiences to keep updated: CDS staff and our clients.

If the incident impacts our users, send an initial communication to alert them that a service is experiencing some type of outage or degraded performance. At CDS, we have an outreach team that can help you craft and translate a clear message. You can ping them at #commsandoutreach 🔒 ↗️.

Next, updating internal teams regularly creates a consistent shared truth about the incident. When something goes wrong, information is often scarce, and if you don’t establish a reliable source of truth about what’s happened and how you’re responding, people will tend to jump to their own conclusions. This creates confusion.

Here’s a pattern to follow:

  • In Slack, start a thread with just a title: <severity> – <incident summary>
  • Open with a 1-2 sentence summary of the incident’s current state and impact
  • Add a current status section with 2-4 bullet points
  • Add a next steps section with 2-4 bullet points
  • State when and where the next round of communications will be sent out
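
As an illustration of this pattern, the sketch below formats a hypothetical update and posts it to a Slack channel through an incoming webhook. We normally draft these updates by hand in the thread; the code only shows the structure, and the webhook URL and every detail of the message are placeholders:

    import requests  # assumes the requests library is installed

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder incoming webhook

    # Hypothetical update following the pattern above.
    update = "\n".join([
        "SEV2 – Notification emails delayed for some users",
        "",
        "Since 14:02 ET, roughly 20% of notification emails have been delayed by up to 30 minutes. No data has been lost.",
        "",
        "Current status:",
        "• The on-call team has identified a backed-up queue in the notification service",
        "• A fix is being tested in staging",
        "",
        "Next steps:",
        "• Deploy the fix and monitor queue depth",
        "• Confirm that delayed emails are delivered",
        "",
        "Next update: 16:00 ET in this thread.",
    ])

    response = requests.post(SLACK_WEBHOOK_URL, json={"text": update}, timeout=10)
    response.raise_for_status()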

Before sending, we review the communications for completeness using this checklist:

  • Have we described the actual impact on customers?
  • Did we say how many internal and external customers are affected?
  • If the root cause is known, what is it?
  • If there is an ETA for restoration, what is it?
  • When & where will the next update be?

It is important to be clear about what we do and don’t know. You need to be explicit about unknowns. This reduces uncertainty. For example, if you don’t know what the root cause is yet, it’s far better to say “the root cause is currently unknown” than to simply omit any mention of it. 

For external communications, we try to keep updates “short and sweet” because external customers usually aren’t interested in the technical details of the incident; they just want to know whether it’s fixed and, if not, when it will be. For a longer explanation, we should aim to publish a technical blog post after the incident post-mortem.

Incident communications is an art, and the more practice you have, the better you’ll be. In our incident management training, we role-play a hypothetical incident, draft communications for it, and read them out loud to the rest of the class. This is a good way to build this skill before doing it for real.

Iterate

There’s no single prescriptive process that will resolve all incidents—if there were, we’d simply automate that and be done with it. Instead, we iterate on the following process to quickly adapt to a variety of incident response scenarios:

  • Observe what’s going on. Share and confirm observations. 
  • Develop theories about why it’s happening.
  • Develop and conduct experiments that prove or disprove those theories.
  • Repeat.

You will recognize this as a generalization of the “Plan-Do-Check-Act” cycle, the “Observe-Orient-Decide-Act” cycle, or simply the scientific method. 

The biggest challenges for the IC at this point are around maintaining the team’s discipline:

  • Is the team communicating effectively?
  • What are the current observations, theories, and streams of work?
  • Are we making decisions effectively?
  • Are we making changes intentionally and carefully? Do we know what changes we’re making?
  • Are roles clear? Are people doing their jobs? Do we need to escalate to more teams?

Health Check

It is important throughout the incident response to run a health check on the team. Look out for strong emotions, noise, burnout, etc. The IC has to keep an eye on team fatigue and plan team handovers. A dedicated team risks burning itself out when resolving complex incidents. ICs should keep track of how long members have been awake and how long they’ve been working on the incident, and decide who will fill their roles next.

Resolve

An incident is resolved when the current or imminent business impact has ended. At that point, the emergency response ends and the team transitions onto any cleanup tasks and the postmortem.

We send final internal and external communications when the incident is resolved. The internal communications recap the incident’s impact and duration, including how many support cases were raised and other important incident dimensions, and clearly state that the incident is resolved and there will be no further communications about it. The external communications are usually brief, telling customers that service has been restored and that we will follow up with a postmortem. The final responsibility of the incident commander is to get accountability for completion of the postmortem. See the next chapter for how we do that.


The Post Mortem

Postmortems are a tool for determining what happened, documenting the timelines and discovering what went right and what went wrong. When adopted properly, they help us understand how a given outcome occurred and, if necessary, provide the action items to ensure that outcome doesn’t happen again in the future, thereby strengthening the services delivered.

At CDS we have defined a set of principles that reflects the spirit of the organization’s values and the importance of postmortems. You can find them at CDS Principles for running incident response and doing postmortems ↗️.

By providing a space for our colleagues to share these insights, we can learn from their experiences—and are all empowered to get better at what we do.

When is a postmortem needed?

We always do postmortems for severity 1 and 2 (“major”) incidents. For minor incidents they’re optional. We encourage people to use the postmortem process for any situation where it would be useful.

Who completes the postmortem?

During or shortly after resolving the issue, the IC finds a facilitator (ideally someone who wasn’t involved in the incident) who is accountable for completing the postmortem.

Why should postmortems be blameless?

When things go wrong the natural human reaction is to ask “who is to blame” and to insulate ourselves from being blamed. But blame actually jeopardizes the success of the postmortem because:

  • When people feel a risk to their standing in the eyes of their peers or to their career prospects, it usually outranks “my employer’s corporate best interests” in their personal hierarchy, so they will naturally dissemble or hide the truth in order to protect their basic needs;
  • Blaming individuals is unkind and, if repeated often enough, will create a culture of fear and distrust;
  • Even if a person took an action that directly led to an incident, what we should ask is not “why did individual X do this”, but “why did the system allow them to do this, or lead them to believe this was the right thing to do.”

As the person responsible for the postmortem you need to actively work against this natural tendency to blame. The postmortem needs to honestly and objectively examine the circumstances that led to the fault so we can find the true causes and mitigate them. We assume good intentions on the part of our staff and never blame people for faults.

In our postmortems, we use these techniques to create personal safety for all participants:

  • In the meeting and the report, make an opening comment stating that this is a blameless postmortem and why;
  • Refer to individuals by role (e.g., “the on-call widgets engineer”) instead of name (while remaining clear and unambiguous about the facts); and
  • Ensure that the postmortem timeline, causal chain, and mitigations are framed in the context of systems, processes, and roles, not individuals.

Postmortem process overview

For postmortems to be effective, the process has to make it easy for teams to identify causes and fix them. We’ve found the following methods useful:

  • Single-point accountability for postmortem results makes responsibilities clear. The Postmortem Owner is always the person accountable.
  • Use video conference meetings to speed up analysis, quickly create a shared understanding, and align the team on what needs fixing.
  • Postmortem review and approval of high severity (1 and 2) incidents by our SRE leadership teams helps to set the right level of rigor and priority.
  • Significant mitigations have an agreed Service Level Objective (SLO) for completion (8 weeks in most cases), with reminders and reports to ensure they are completed.

The postmortem owner follows these steps:

  1. Edit the postmortem report auto-generated by the SRE bot and complete the fields/descriptions
  2. Schedule the postmortem meeting. Invite the delivery team, impacted teams and other stakeholders using the meeting invitation template.
  3. Meet with the team and run through the agenda (see “Postmortem meetings” below)
  4. Follow up with the responsible team leads to get time-bound commitments to the actions that were agreed on in the meeting.

Postmortem report

Our postmortem report has several sections to collect all the important details. You will find the description of each in the template 🔒 ↗️. 


Training

We recommend running drills at regular intervals to ensure that each individual on the incident response team knows how to perform their duties during an incident.


Performance Indicators

We should track every incident’s start-of-impact time, detection time, and end-of-impact time. We use these fields to calculate time-to-recovery (TTR), the interval between start-of-impact and end-of-impact, and time-to-detect (TTD), the interval between start-of-impact and detection. The distribution of your incident TTD and TTR is often an important business metric.
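
As a worked example (with hypothetical timestamps), TTD and TTR can be computed directly from the three tracked fields:

    from datetime import datetime

    # Hypothetical incident timestamps (UTC).
    start_of_impact = datetime(2024, 3, 1, 14, 2)
    detection_time = datetime(2024, 3, 1, 14, 20)
    end_of_impact = datetime(2024, 3, 1, 16, 45)

    time_to_detect = detection_time - start_of_impact   # TTD = detection - start of impact
    time_to_recovery = end_of_impact - start_of_impact  # TTR = end of impact - start of impact

    print(f"TTD: {time_to_detect}")   # TTD: 0:18:00
    print(f"TTR: {time_to_recovery}") # TTR: 2:43:00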


Problem management

An important tool in the diagnosis of incidents is the known error database (KEDB). The KEDB identifies any problems or known errors that have caused incidents in the past and provides information about any workarounds that have been identified.
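
A KEDB entry can be as simple as a structured record. The sketch below shows one possible shape; the field names and example values are illustrative, not a prescribed schema:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class KnownError:
        """One entry in the known error database (illustrative fields only)."""
        identifier: str
        symptoms: str                 # what responders observe during the incident
        root_cause: str               # what is actually broken, once understood
        workaround: str               # how to restore service before a permanent fix
        related_incidents: List[str] = field(default_factory=list)

    # Hypothetical example entry.
    ke_001 = KnownError(
        identifier="KE-001",
        symptoms="Notification queue backs up; users report delayed emails",
        root_cause="Worker processes exhaust memory under sustained load",
        workaround="Restart the workers and temporarily scale them up",
        related_incidents=["INC-2024-012"],
    )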


Service Level

The breach of a service level is itself an incident.

tbd


Business Continuity Plan

tbd


Privacy Breach Management

How CDS handles personal information is governed by the Privacy Act. To ensure CDS is following that law, TBS has established guidance and tools ↗️ for teams to use to manage privacy breaches. If an incident involves the improper or unauthorized collection, use, disclosure, retention or disposal of personal information, the privacy breach management process is required. The process has the following steps: 

Step 1: Preliminary Assessment and Containment ↗️

Step 2: Full Assessment ↗️

Step 3: Notification ↗️

Step 4: Mitigation and Prevention ↗️

Step 5: Notification to the Office of the Privacy Commissioner and the Treasury Board of Canada Secretariat ↗️

Step 6: Lessons Learned ↗️

The Policy Lead can help assess if the information involved is personal/protected and support the privacy breach management process, including notifying OCIO PDPD (Privacy and Data Protection Division) and TBS LSU (Legal Service Unit) that an incident has occurred (Step 1). Many incidents do not rise to the threshold of breach of privacy (e.g., the incident occurred in staging, the information involved is not personal information, or no information was released) but all incidents need to be assessed.


Appendix

Severity Levels

  • Severity 1: A critical incident with very high impact. Examples: the service is down for all users; confidentiality or privacy is breached; user data is lost.
  • Severity 2: A major incident with significant impact. Examples: the service is unavailable for a subset of users; core functionality is significantly impacted.
  • Severity 3: A minor incident with low impact. Examples: a minor inconvenience to users with a workaround available; usable performance degradation.

Post mortem meeting invitation template

Please join me for a blameless postmortem of <link to incident report>, where we <summary of incident>.

The goal of a postmortem is to maximize the value of an incident by understanding all contributing causes, documenting the incident for future reference and pattern discovery, and enacting effective preventative actions to reduce the likelihood or impact of recurrence.

In this meeting we’ll review the incident, determine its significant causes and decide on actions to mitigate them.


Resources ↗️
