{"id":1313,"date":"2022-10-27T20:47:51","date_gmt":"2022-10-27T20:47:51","guid":{"rendered":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/?page_id=1313"},"modified":"2022-11-29T16:18:13","modified_gmt":"2022-11-29T16:18:13","slug":"incident-management-handbook","status":"publish","type":"page","link":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/incident-management-handbook\/","title":{"rendered":"Incident Management Handbook"},"content":{"rendered":"\n<ul class=\"toc wp-block-list\"><li><a href=\"#h-introduction\">Introduction<\/a><\/li><li><a href=\"#h-objectives\">Objectives<\/a><\/li><li><a href=\"#h-overview\">Overview<\/a><\/li><li><a href=\"#h-respond\">Respond<\/a><\/li><li><a href=\"#h-the-post-mortem\">The Post Mortem<\/a><\/li><li><a href=\"#h-training\">Training<\/a><\/li><li><a href=\"#h-performance-indicators\">Performance Indicators<\/a><\/li><li><a href=\"#h-problem-management\">Problem Management<\/a> <\/li><li><a href=\"#h-service-level\">Service Level<\/a> <\/li><li><a href=\"#h-business-continuity-plan\">Business Continuity Plan<\/a> <\/li><li><a href=\"#h-privacy-breach-management\">Privacy Breach Management<\/a><\/li><li><a href=\"#h-appendix\">Appendix<\/a><\/li><li><a href=\"#h-resources\">Resources<\/a><\/li><\/ul>\n\n\n\n<p><em>(La version fran\u00e7aise est en cours d&#8217;\u00e9laboration)<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-introduction\">Introduction<\/h2>\n\n\n\n<p>When something goes wrong, whether it&#8217;s an outage or a broken feature, team members need to respond immediately and restore service. This process is called incident management.<\/p>\n\n\n\n<p>In service delivery, change is the only constant. This means systems will continually be stressed in new and different ways. Teams that understand this, also understand that it\u2019s not a matter of if &#8211; but when &#8211; systems will fail. 
Taking steps to prepare for these failures and run blameless postmortem processes should be recognized as a critical element of ongoing success, and integrated into the DNA of all delivery teams.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-objectives\">Objectives<\/h2>\n\n\n\n<p>This handbook outlines the stakeholders and actions required to ensure that incidents (including security events) are addressed in a consistent, coordinated, and timely fashion CDS-wide. The handbook will be tested and reviewed frequently, and modified as required.<\/p>\n\n\n\n<p>We want to be able to consistently:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Clarify roles and responsibilities.<\/li><li>Document and test procedures.<\/li><li>Train staff.<\/li><li>Simplify decision-making by people and teams in incidents and blameless postmortems.<\/li><li>Build a consistent culture across teams of how we identify, manage, and learn from incidents.<\/li><li>Create a fast and predictable response to every incident every single time.<\/li><\/ul>\n\n\n\n<p>We want to comply with the following <a href=\"https:\/\/cyber.gc.ca\/en\/guidance\/it-security-risk-management-lifecycle-approach-itsg-33\" target=\"_blank\" rel=\"noreferrer noopener\">ITSG-33<\/a> \u2197\ufe0f incident management controls (14):<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>IR-1:<\/strong> Incident Response Policy And Procedures<\/li><li><strong>IR-2:<\/strong> Incident Response Training<\/li><li><strong>IR-2.1:<\/strong> Incident Response Training | Simulated Events<\/li><li><strong>IR-3:<\/strong> Incident Response Testing And Exercises<\/li><li><strong>IR-3.2:<\/strong> Incident Response Testing And Exercises &gt; Coordination with related Plans<\/li><li><strong>IR-4:<\/strong> Incident Handling<\/li><li><strong>IR-4.4:<\/strong> Incident Handling &gt; Information Correlation<\/li><li><strong>IR-4.9:<\/strong> Incident Handling &gt; Dynamic Response Capability<\/li><li><strong>IR-5:<\/strong> Incident 
Monitoring<\/li><li><strong>IR-6:<\/strong> Incident Reporting<\/li><li><strong>IR-6.2:<\/strong> Incident Reporting &gt; Vulnerabilities related to incidents<\/li><li><strong>IR-7:<\/strong> Incident Response Assistance<\/li><li><strong>IR-8:<\/strong> Incident Response Plan<\/li><li><strong>SA-15.10:<\/strong> Development Process, Standards, And Tool &gt; Incident Response Plan<\/li><\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-overview\">Overview<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-who-is-this-guide-for\">Who is this guide for?<\/h3>\n\n\n\n<p>If you\u2019re on a team that offers services to any type of users, this handbook is for you.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-what-is-an-incident\">What is an incident?<\/h3>\n\n\n\n<p>At CDS, we define an incident as an event that causes <strong>disruption<\/strong> to or a <strong>reduction in the quality<\/strong> of a service which requires a response. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-our-incident-values\">Our incident values<\/h3>\n\n\n\n<p>A process for managing incidents can\u2019t cover all possible situations, so we empower our teams with general guidance in the form of values. 
At CDS, we work with the following values:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Put people at the heart of services: We\u2019re driven by empathy and put people\u2019s needs first.<\/li><li>Do the hard work to make things easier:&nbsp; We do not shy away from hard conversations but approach them with courage, humility, compassion, and integrity<\/li><li>Work in the open to help clear a path: We share our work, progress, and failures to help others learn from what we do.&nbsp;<\/li><li>Take care of each other: We work to create a space where people feel they belong, take risks, learn from mistakes, and speak openly.&nbsp;&nbsp;<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-incident-stages\">Incident Stages<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Stage<\/strong><\/td><td><strong>Rationale<\/strong><\/td><\/tr><tr><td><em>Detect<\/em><\/td><td>A service includes enough monitoring and alerting to detect incidents before our users do. The best monitoring alerts us to problems before they even become incidents.<\/td><\/tr><tr><td><em>Respond &amp; Assess<\/em><\/td><td>Declare an incident<br>Triage and prioritize. Recognize cyber security events.&nbsp;<\/td><\/tr><tr><td><em>Recover<\/em><\/td><td>Our users don\u2019t care why their service is down, only that we restore service as quickly as possible.<br><br>Never hesitate in getting an incident resolved quickly so that we can minimize impact to our users.<\/td><\/tr><tr><td><em>Learn<\/em><\/td><td>Always blameless. Incidents are part of running services. We improve services by holding teams accountable, not by apportioning blame.<\/td><\/tr><tr><td><em>Improve<\/em><br><\/td><td>Never have the same incident twice. 
Identify the root cause and the changes that will prevent the whole class of incident from occurring again.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-tools\">Tools<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-opsgenie\">Opsgenie<\/h4>\n\n\n\n<p>We use <a href=\"https:\/\/cds-snc.app.opsgenie.com\/teams\/list\" target=\"_blank\" rel=\"noreferrer noopener\">Opsgenie<\/a> to manage on-call rotations and escalations. To get access, reach out to <a href=\"https:\/\/gcdigital.slack.com\/archives\/CS2L5CHKK\">#sre-and-tech-ops<\/a> in Slack.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-slack-bot\">Slack Bot<\/h4>\n\n\n\n<p>The CDS Site Reliability Engineering team built a <a href=\"https:\/\/github.com\/cds-snc\/sre-bot\" target=\"_blank\" rel=\"noreferrer noopener\">bot for site reliability engineering<\/a> \ud83d\udd12 \u2197\ufe0f. Launch the SRE Incident Bot in any Slack channel by typing <strong>\/incident<\/strong> and filling in the information.<\/p>\n\n\n\n<p>You can also post <strong>!incident<\/strong> in any Slack channel or thread; an auto-reply will remind you where to find the incident response runbook.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-slack-google-meet-and-drive\"><strong>Slack, Google Meet and Drive<\/strong><\/h4>\n\n\n\n<p>To create a Slack channel, a Google Meet and a postmortem report instantly, launch the SRE Incident Bot in any Slack channel by typing <strong>\/incident<\/strong> and filling in the information.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"644\" height=\"135\" src=\"https:\/\/articles.alpha.canada.ca\/uploads\/sites\/5\/2022\/11\/incident-slack-command.png\" alt=\"incident command on Slack\" class=\"wp-image-1396\" srcset=\"https:\/\/articles.alpha.canada.ca\/uploads\/sites\/5\/2022\/11\/incident-slack-command.png 
644w, https:\/\/articles.alpha.canada.ca\/uploads\/sites\/5\/2022\/11\/incident-slack-command-300x63.png 300w\" sizes=\"auto, (max-width: 644px) 100vw, 644px\" \/><\/figure>\n\n\n\n<p>Once the channel is created, use <strong>\/sre incident roles<\/strong> in that channel to assign the roles.<\/p>\n\n\n\n<p>All incident reports are tracked here: <a href=\"https:\/\/drive.google.com\/drive\/folders\/13CqZaprwAqebkrcFnOdwkAt3JT4Fs2It\" target=\"_blank\" rel=\"noreferrer noopener\">Incidents<\/a> \ud83d\udd12 \u2197\ufe0f.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-status-page\">Status Page<\/h4>\n\n\n\n<p>Communicating status to both internal stakeholders and customers through a <a href=\"https:\/\/status-statut.cds-snc.ca\/\">status page<\/a> \u2197\ufe0f helps keep everyone in the loop.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-roles\">Roles<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"h-incident-commander-ic\">Incident Commander (IC)<\/h4>\n\n\n\n<p>Each incident is driven by the incident commander (IC), who has overall responsibility and authority for the incident. The incident commander is empowered to take any action necessary to resolve the incident, which includes paging anyone in the organization and keeping those involved in an incident focused on restoring service as quickly as possible.<\/p>\n\n\n\n<p>The incident commander is a role, rather than an individual on the incident. The advantage of defining roles during an incident is that it allows people to become interchangeable. As long as a given person knows how to perform a certain role, they can take that role for any incident.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">The IC should:<\/h5>\n\n\n\n<ul class=\"wp-block-list\"><li>Command and coordinate the incident response<\/li><li>Delegate roles as needed. 
By default, the IC assumes all roles that have not been delegated yet.<\/li><li>Communicate effectively.&nbsp;<\/li><li>Brief Senior Management<\/li><li>Stay in control of the incident response.<\/li><li>Work with other responders to resolve the incident.<\/li><li>Appoint others to incident command positions as needed<\/li><li>Provide information to and coordinate with crisis communications or media relations team<\/li><li>Terminate the response and demobilize resources when the situation has been stabilized<\/li><li>Transition to a new IC<ul><li>The current IC should transition the open incident to a new IC if their on-call schedule is over or they feel they could not continue for various reasons.&nbsp;<\/li><li>Provide enough context for the new IC<ul><li>Current state of the Incident<\/li><li>What next steps are<\/li><li>Names of People currently in the Incident<\/li><\/ul><\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-communications-lead-cl\">Communications Lead (CL)<\/h3>\n\n\n\n<p>CL is the public face of the incident and a person familiar with public communications (usually from the Outreach team). The CL should:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>With the IC, determine if the incident requires public communications<\/li><li>Provide periodic updates to the incident response team and stakeholders (includes SCMA)<\/li><li>Develop a plain-language summary of the incident in collaboration with the IC<\/li><li>Work with the IC and PL to develop communications to users<\/li><li>Manage inquiries about the incident with SCMA<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-policy-lead-pl\">Policy Lead (PL)<\/h3>\n\n\n\n<p>The PL leads policy analysis related to the incident and makes recommendations to the IC on the policy implications of the incident, how to respond, and how to communicate this to clients, government stakeholders, and the public, if necessary. 
This work should predominantly happen alongside the Respond\/Assess and Recover phases and in post-incident communication. On Platform, the PL is almost always the policy advisor already embedded on the product team. If your policy advisor is away, or your team does not have a policy advisor, please contact Nisa Malli, Delivery Policy Lead, and she will assign one to the incident. The PL should:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Notify the Delivery Policy Team and Team Lead; flag if there are cross-product, interoperability, or CDS-wide implications. Convene an ad hoc policy consultation as needed.<\/li><li>Assess potential policy implications of the incident (e.g., privacy, PT-M), advise the IC and CL on the appropriate response, and support the IC\u2019s briefings to senior management.<\/li><li>Communicate the incident to OCIO PDPD (Privacy and Data Protection Division) and TBS LSU (Legal Service Unit) when needed (see Chapter 9 for details).<\/li><li>In collaboration with the CL and product team, translate technical incident documentation into plain language. This could include editing the incident response report, drafting and sending emails to other parts of TBS, and preparing public communications such as emails to clients or updates on the product or CDS website.<\/li><li>Update product policy documentation (e.g., SLA, ToU) if necessary post-incident.<\/li><li>May act as the IC or PO alongside their PL role.<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-operations-lead-ol\">Operations Lead (OL)<\/h3>\n\n\n\n<p>The OL, with the help of others, responds to the incident by applying operational tools to mitigate or resolve it. This person is responsible for developing theories about what\u2019s broken and why, deciding on changes, and running the technical team. 
The OL works closely with the IC, and the IC can also fill this role.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-post-mortem-owner-po\">Post Mortem Owner (PO)<\/h3>\n\n\n\n<p>The IC nominates one person to be accountable for completing the postmortem. The PO drives the postmortem through drafting and approval, all the way until it\u2019s published.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-everyone-in-the-incident\">Everyone in the incident<\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>Document all work being done in the incident Slack channel.<\/li><li>Verify with the Incident Commander before doing work on production or staging systems (code changes, merging PRs, communicating with external stakeholders, configuration changes, deployments).<\/li><\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-respond\">Respond<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-set-up-an-on-call-team\">Set up an on-call team<\/h3>\n\n\n\n<p>Being on-call means being available during a set period of time, and being ready to respond to production incidents during that time with appropriate urgency. On-call is a large and complex topic, saddled with many constraints and a limited margin for trial and error. 
Every team is different, so we\u2019ve put together a dedicated guide: <a href=\"https:\/\/cds-snc.github.io\/guide-product-teams-equipes-produits\/product_support_guidance\/\" target=\"_blank\" rel=\"noreferrer noopener\">Product support guidance &#8211; Guide for product teams<\/a> \u2197\ufe0f.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-the-on-call-welcome-email\">The on-call welcome email<\/h3>\n\n\n\n<p>A short checklist of items to review, automatically sent to on-call individuals:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Do I have Opsgenie installed on my phone?<\/li><li>Do I have access to Freshdesk?<\/li><li>Do I have SSO access to the AWS accounts?<\/li><li>Do I have admin access to the product console?<\/li><li>Do I know the phone book and the Slack channels to watch or call for help?<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-response\">Response<\/h3>\n\n\n\n<p>We developed a list of steps to follow when an incident happens, the <a href=\"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/incident-management-handbook\/incident-response-runbook\/\">Incident Response Runbook<\/a>, so that team members involved in the incident know what to do.<\/p>\n\n\n\n<p>For a security-related incident, see <a href=\"https:\/\/docs.google.com\/document\/d\/1rxNeqsH9agVdWuQrzy5WCT5xpQjOG7W2Foy-rJMNKKg\/edit#\" target=\"_blank\" rel=\"noreferrer noopener\">How to Report a Security and Privacy Event or Incident<\/a> \ud83d\udd12 \u2197\ufe0f.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-escalation\">Escalation<\/h3>\n\n\n\n<p>Your first responders might be all the people you need in order to resolve the incident, but more often than not, you need to bring other teams into the incident by paging them. 
We call this escalation.<\/p>\n\n\n\n<p>We currently use Opsgenie to define on-call rosters so that any given team has a rotation of staff who are expected to be reachable in an emergency.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-communication\">Communication<\/h3>\n\n\n\n<p>During an emergency, proper communication is often the key element of a successful response. You have two audiences to keep updated: CDS staff and our clients.<\/p>\n\n\n\n<p>If the incident impacts our users, the initial communication alerts them that a service is experiencing some type of outage or degraded performance. At CDS we have an outreach team that can help you craft and translate a clear message. You can ping them at <a href=\"https:\/\/gcdigital.slack.com\/archives\/C5HNME1QC\" target=\"_blank\" rel=\"noreferrer noopener\">#commsandoutreach<\/a> \ud83d\udd12 \u2197\ufe0f.<\/p>\n\n\n\n<p>Next, updating <strong>internal<\/strong> teams regularly creates a consistent shared truth about the incident. 
When something goes wrong, information is often scarce, and if you don\u2019t establish a reliable source of truth about what\u2019s happened and how you\u2019re responding, then people will tend to jump to their own conclusions. This creates confusion.<\/p>\n\n\n\n<p>Here\u2019s a pattern to follow:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>In Slack, start a thread with just a title: &lt;severity&gt; &#8211; &lt;incident summary&gt;<\/li><li>Open with a 1-2 sentence summary of the incident\u2019s current state and impact<\/li><li>Add a current status section with 2-4 bullet points<\/li><li>Add a next steps section with 2-4 bullet points<\/li><li>State when and where the next round of communications will be sent out<\/li><\/ul>\n\n\n\n<p>Before sending, we review the communications for completeness using this checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Have we described the actual impact on customers?<\/li><li>Did we say how many internal and external customers are affected?<\/li><li>If the root cause is known, what is it?<\/li><li>If there is an ETA for restoration, what is it?<\/li><li>When and where will the next update be?<\/li><\/ul>\n\n\n\n<p>Be clear about what we do and don\u2019t know, and be explicit about unknowns; this reduces uncertainty. For example, if you don\u2019t know what the root cause is yet, it\u2019s far better to say \u201cthe root cause is currently unknown\u201d than to simply omit any mention of it.<\/p>\n\n\n\n<p>For external communications, we try to keep updates \u201cshort and sweet\u201d because external customers usually aren\u2019t interested in the technical details of the incident. They just want to know if it\u2019s fixed and, if not, when it will be. For a longer explanation, we should aim to publish a technical blog post after the incident postmortem.<\/p>\n\n\n\n<p>Incident communications is an art, and the more practice you have, the better you&#8217;ll be. 
In our incident management training, we role-play a hypothetical incident, draft communications for it, and read them out loud to the rest of the class. This is a good way to build this skill before doing it for real.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Iterate<\/h3>\n\n\n\n<p>There\u2019s no single prescriptive process that will resolve all incidents\u2014if there were, we\u2019d simply automate that and be done with it. Instead, we iterate on the following process to quickly adapt to a variety of incident response scenarios:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Observe<\/strong> what\u2019s going on. Share and confirm observations.&nbsp;<\/li><li><strong>Develop theories<\/strong> about why it\u2019s happening.<\/li><li><strong>Develop and conduct experiments<\/strong> that prove or disprove those theories.<\/li><li><strong>Repeat<\/strong>.<br><\/li><\/ul>\n\n\n\n<p>You will recognize this as a generalization of the \u201cPlan-Do-Check-Act\u201d cycle, the \u201cObserve-Orient-Decide-Act\u201d cycle, or simply the scientific method.&nbsp;<\/p>\n\n\n\n<p>The biggest challenges for the IC at this point are around maintaining the team\u2019s discipline:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Is the team communicating effectively?<\/li><li>What are the current observations, theories, and streams of work?<\/li><li>Are we making decisions effectively?<\/li><li>Are we making changes intentionally and carefully? Do we know what changes we\u2019re making?<\/li><li>Are roles clear? Are people doing their jobs? Do we need to escalate to more teams?<br><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Health Check<\/h3>\n\n\n\n<p>It is important throughout the incident response to run a health check on the team. Look out for strong emotions, noise, burnout, etc. The IC has to keep an eye on team fatigue and plan team handovers. 
A dedicated team can risk burning themselves out when resolving complex incidents. ICs should track how long members have been awake and how long they\u2019ve been working on the incident, and decide who will fill their roles next.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Resolve<\/h3>\n\n\n\n<p>An incident is resolved when the current or imminent business impact has ended. At that point, the emergency response ends and the team transitions to any cleanup tasks and the postmortem.<\/p>\n\n\n\n<p>We send final internal and external communications when the incident is resolved. The internal communications have a recap of the incident\u2019s impact and duration, including how many support cases were raised and other important incident dimensions, and clearly state that the incident is resolved and there will be no further communications about it. The external communications are usually brief, telling customers that service has been restored and we will follow up with a postmortem. The final responsibility of the IC is to assign accountability for completion of the postmortem. See the next chapter for how we do that.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-post-mortem\">The Post Mortem<\/h2>\n\n\n\n<p>Postmortems are a tool for determining what happened, documenting the timelines and discovering what went right and what went wrong. When adopted properly, they help us understand how a given outcome occurred and, if necessary, provide the action items to ensure that outcome doesn\u2019t happen again in the future, thereby strengthening the services delivered.<\/p>\n\n\n\n<p>At CDS we have defined a set of principles that reflects the spirit of the organization\u2019s values and the importance of postmortems. 
You can find them at <a href=\"https:\/\/github.com\/cds-snc\/docs\/blob\/main\/development\/rfcs\/0010-postmortem-principles.md\" target=\"_blank\" rel=\"noreferrer noopener\">CDS Principles for running incident response and doing postmortems<\/a> \u2197\ufe0f.<\/p>\n\n\n\n<p>By providing a space for our colleagues to share these insights, we can learn from their experiences, and we are all empowered to get better at what we do.<\/p>\n\n\n\n<p><strong>When is a postmortem needed?<\/strong><\/p>\n\n\n\n<p>We always do postmortems for severity 1 and 2 (\u201cmajor\u201d) incidents. For minor incidents, they\u2019re optional. We encourage people to use the postmortem process for any situation where it would be useful.<\/p>\n\n\n\n<p><strong>Who completes the postmortem?<\/strong><\/p>\n\n\n\n<p>During or shortly after resolving the issue, the IC finds a facilitator (ideally someone who wasn\u2019t involved in the incident) who is accountable for completing the postmortem.<\/p>\n\n\n\n<p><strong>Why should postmortems be blameless?<\/strong><\/p>\n\n\n\n<p>When things go wrong, the natural human reaction is to ask \u201cwho is to blame\u201d and to insulate ourselves from being blamed. 
But blame actually jeopardizes the success of the postmortem because:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>When people feel the risk to their standing in the eyes of their peers or to their career prospects, it usually outranks \u201cmy employer\u2019s corporate best interests\u201d in their personal hierarchy, so they will naturally dissemble or hide the truth in order to protect their basic needs;<\/li><li>Blaming individuals is unkind and, if repeated often enough, will create a culture of fear and distrust; <\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li>Even if a person took an action that directly led to an incident, what we should ask is not \u201cwhy did individual X do this\u201d, but \u201cwhy did the system allow them to do this, or lead them to believe this was the right thing to do.\u201d<\/li><\/ul>\n\n\n\n<p>As the person responsible for the postmortem you need to actively work against this natural tendency to blame. The postmortem needs to honestly and objectively examine the circumstances that led to the fault so we can find the true causes and mitigate them. We assume good intentions on the part of our staff and never blame people for faults.<\/p>\n\n\n\n<p>In our postmortems, we use these techniques to create personal safety for all participants:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>In the meeting and the report, make an opening comment stating that this is a blameless postmortem and why;<\/li><li>Refer to individuals by role (e.g., \u201cthe on-call widgets engineer\u201d) instead of name (while remaining clear and unambiguous about the facts); and<\/li><li>Ensure that the postmortem timeline, causal chain, and mitigations are framed in the context of systems, processes, and roles, not individuals.<\/li><\/ul>\n\n\n\n<p><strong>Postmortem process overview<\/strong><\/p>\n\n\n\n<p>For postmortems to be effective, the process has to make it easy for teams to identify causes and fix them. 
We\u2019ve found the following methods useful:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Single-point accountability<\/strong> for postmortem results makes responsibilities clear. The Postmortem Owner is always the person accountable.<\/li><li>Use <strong>video conference meetings<\/strong> to speed up analysis, quickly create a shared understanding, and align the team on what needs fixing.<\/li><li><strong>Postmortem review and approval<\/strong> of high severity (1 and 2) incidents by our SRE leadership teams helps to set the right level of rigor and priority.<\/li><li>Significant mitigations have an agreed <strong>Service Level Objective<\/strong> (SLO) for completion (8 weeks in most cases), with reminders and reports to ensure they are completed.<\/li><\/ul>\n\n\n\n<p>The postmortem owner follows these steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Edit the postmortem report auto-generated by the SRE bot and complete the fields and descriptions.<\/li><li>Schedule the postmortem meeting. Invite the delivery team, impacted teams and other stakeholders using the meeting invitation template.<\/li><li>Meet with the team and run through the agenda (see \u201cPostmortem meetings\u201d below).<\/li><li>Follow up with the responsible team leads to get time-bound commitments to the actions that were agreed in the meeting.<\/li><\/ol>\n\n\n\n<p><strong>Postmortem report<\/strong><\/p>\n\n\n\n<p>Our postmortem report has several sections to collect all the important details. 
You will find the description of each in the <a href=\"https:\/\/docs.google.com\/document\/d\/1Nb9Zh0OR2HulVtpqDpFnRTcq6VwsWSaKK9QS3Y6o0TE\/edit\" target=\"_blank\" rel=\"noreferrer noopener\">template<\/a> \ud83d\udd12 \u2197\ufe0f.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-training\">Training<\/h2>\n\n\n\n<p>We recommend running drills at regular intervals so that each member of the incident response team knows how to perform their duties during an incident.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-performance-indicators\">Performance Indicators<\/h2>\n\n\n\n<p>We should track every incident\u2019s start-of-impact time, detection time, and end-of-impact time. We use these fields to calculate time-to-recovery (TTR), the interval between start of impact and end of impact, and time-to-detect (TTD), the interval between start of impact and detection. The distribution of your incident TTD and TTR is often an important business metric.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-problem-management\">Problem Management<\/h2>\n\n\n\n<p>An important tool in the diagnosis of incidents is the known error database (KEDB). 
The KEDB identifies any problems or known errors that have caused incidents in the past and provides information about any workarounds that have been identified.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-service-level\">Service Level<\/h2>\n\n\n\n<p>The breach of a service level is itself an incident.<\/p>\n\n\n\n<p>tbd<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-business-continuity-plan\">Business Continuity Plan<\/h2>\n\n\n\n<p>tbd<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-privacy-breach-management\">Privacy Breach Management<\/h2>\n\n\n\n<p>How CDS handles personal information is governed by the <em>Privacy Act<\/em>. To ensure CDS is following that law, <a href=\"https:\/\/www.canada.ca\/en\/treasury-board-secretariat\/services\/access-information-privacy\/privacy\/breach-management.html\" target=\"_blank\" rel=\"noreferrer noopener\">TBS has established guidance and tools<\/a> \u2197\ufe0f for teams to use to manage privacy breaches. If an incident involves the improper or unauthorized collection, use, disclosure, retention or disposal of personal information, the privacy breach management process is required. 
The process has the following steps:&nbsp;<\/p>\n\n\n\n<p><a href=\"https:\/\/www.canada.ca\/en\/treasury-board-secretariat\/services\/access-information-privacy\/privacy\/breach-management.html#step1\" target=\"_blank\" rel=\"noreferrer noopener\">Step 1: Preliminary Assessment and Containment<\/a> \u2197\ufe0f<\/p>\n\n\n\n<p><a href=\"https:\/\/www.canada.ca\/en\/treasury-board-secretariat\/services\/access-information-privacy\/privacy\/breach-management.html#step2\" target=\"_blank\" rel=\"noreferrer noopener\">Step 2: Full Assessment<\/a> \u2197\ufe0f<\/p>\n\n\n\n<p><a href=\"https:\/\/www.canada.ca\/en\/treasury-board-secretariat\/services\/access-information-privacy\/privacy\/breach-management.html#step3\" target=\"_blank\" rel=\"noreferrer noopener\">Step 3: Notification<\/a> \u2197\ufe0f<\/p>\n\n\n\n<p><a href=\"https:\/\/www.canada.ca\/en\/treasury-board-secretariat\/services\/access-information-privacy\/privacy\/breach-management.html#step4\" target=\"_blank\" rel=\"noreferrer noopener\">Step 4: Mitigation and Prevention<\/a> \u2197\ufe0f<\/p>\n\n\n\n<p><a href=\"https:\/\/www.canada.ca\/en\/treasury-board-secretariat\/services\/access-information-privacy\/privacy\/breach-management.html#step5\" target=\"_blank\" rel=\"noreferrer noopener\">Step 5: Notification to the Office of the Privacy Commissioner and the Treasury Board of Canada Secretariat<\/a> \u2197\ufe0f<\/p>\n\n\n\n<p><a href=\"https:\/\/www.canada.ca\/en\/treasury-board-secretariat\/services\/access-information-privacy\/privacy\/breach-management.html#step6\" target=\"_blank\" rel=\"noreferrer noopener\">Step 6: Lessons Learned<\/a> \u2197\ufe0f<\/p>\n\n\n\n<p>The Policy Lead can help assess if the information involved is personal\/protected and support the privacy breach management process, including notifying OCIO PDPD (Privacy and Data Protection Division) and TBS LSU (Legal Service Unit) that an incident has occurred (Step 1). 
Many incidents do not rise to the threshold of a privacy breach (e.g., the incident occurred in staging, the information involved is not personal information, or no information was released), but all incidents need to be assessed.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-appendix\">Appendix<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Severity Levels<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Severity<\/strong><\/td><td><strong>Description<\/strong><\/td><td><strong>Examples<\/strong><\/td><\/tr><tr><td>1<\/td><td>A critical incident with very high impact<\/td><td>Service is down for all users<br \/>Confidentiality or privacy is breached<br \/>User data loss<\/td><\/tr><tr><td>2<\/td><td>A major incident with significant impact<\/td><td>Service is unavailable for a subset of users<br \/>Core functionality is significantly impacted<\/td><\/tr><tr><td>3<\/td><td>A minor incident with low impact<\/td><td>A minor inconvenience to users, workaround available<br \/>Usable performance degradation<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Post mortem meeting invitation template<\/h3>\n\n\n\n<p><em>Please join me for a blameless postmortem of <\/em><strong><em>&lt;link to incident report&gt;<\/em><\/strong><em>, where we <\/em><strong><em>&lt;summary of incident&gt;<\/em><\/strong><em>.<\/em><\/p>\n\n\n\n<p><em>The goal of a postmortem is to maximize the value of an incident by understanding all contributing causes, documenting the incident for future reference and pattern discovery, and enacting effective preventative actions to reduce the likelihood or impact of recurrence.<\/em><\/p>\n\n\n\n<p><em>In this meeting we\u2019ll review the incident, determine its significant causes and decide on actions to mitigate them.<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\" \/>\n\n\n\n<h2 
class=\"wp-block-heading\" id=\"h-resources\">Resources  \u2197\ufe0f<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/www.canada.ca\/en\/government\/system\/digital-government\/online-security-privacy\/security-identity-management\/government-canada-cyber-security-event-management-plan.html\" target=\"_blank\" rel=\"noreferrer noopener\">Government of Canada Cyber Security Event Management Plan (GC CSEMP) 2018<\/a><\/li><li><a href=\"https:\/\/www.cyber.gc.ca\/en\/guidance\/developing-your-incident-response-plan-itsap40003\" target=\"_blank\" rel=\"noreferrer noopener\">Developing your incident response plan (ITSAP.40.003) &#8211; Canadian Centre for Cyber Security<\/a>&nbsp;<\/li><li><a href=\"https:\/\/www.ncsc.gov.uk\/collection\/incident-management\/\" target=\"_blank\" rel=\"noreferrer noopener\">UK &#8211; National Cyber Security Centre &#8211; Incident Management<\/a><\/li><li><a href=\"https:\/\/nvlpubs.nist.gov\/nistpubs\/SpecialPublications\/NIST.SP.800-61r2.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">NIST &#8211; Computer Security Incident Handling Guide (pdf)<\/a>&nbsp;<\/li><li><a href=\"https:\/\/www.atlassian.com\/incident-management\/handbook\" target=\"_blank\" rel=\"noreferrer noopener\">The Atlassian Incident Management Handbook<\/a>&nbsp;<\/li><li><a href=\"https:\/\/sansorg.egnyte.com\/dl\/6Btqoa63at\" target=\"_blank\" rel=\"noreferrer noopener\">SANS Incident Handbook<\/a>&nbsp;<\/li><li><a href=\"https:\/\/sre.google\/sre-book\/table-of-contents\/\" target=\"_blank\" rel=\"noreferrer noopener\">Google &#8211; Site Reliability Engineering (SRE) book<\/a>&nbsp;<\/li><li><a href=\"https:\/\/sre.google\/workbook\/table-of-contents\/\" target=\"_blank\" rel=\"noreferrer noopener\">Google &#8211; Site Reliability Engineering (SRE) workbook<\/a><\/li><li><a href=\"https:\/\/cds-snc.github.io\/guide-product-teams-equipes-produits\/product_support_guidance\/\" target=\"_blank\" rel=\"noreferrer noopener\">Product support 
guidance &#8211; Guide for product teams<\/a><\/li><li><a href=\"https:\/\/gds-way.cloudapps.digital\/standards\/incident-management.html\" target=\"_blank\" rel=\"noreferrer noopener\">How to manage technical incidents &#8211; The GDS Way<\/a><\/li><li><a href=\"https:\/\/sre.google\/sre-book\/postmortem-culture\/\" target=\"_blank\" rel=\"noreferrer noopener\">Google SRE &#8211; Postmortem Culture: Learning from Failure<\/a>&nbsp;<\/li><li><a href=\"https:\/\/team-manual.cloud.service.gov.uk\/incident_management\/incident_process\/\" target=\"_blank\" rel=\"noreferrer noopener\">Incident Process &#8211; PaaS Team Manual<\/a>&nbsp;<\/li><li><a href=\"https:\/\/postmortems.pagerduty.com\/culture\/blameless\/\" target=\"_blank\" rel=\"noreferrer noopener\">PagerDuty &#8211; The Blameless Postmortem<\/a>&nbsp;<\/li><li><a href=\"https:\/\/www.pagerduty.com\/resources\/learn\/incident-postmortem\/\" target=\"_blank\" rel=\"noreferrer noopener\">What is an Incident Postmortem? | Articles | PagerDuty<\/a>&nbsp;<\/li><li><a href=\"https:\/\/www.etsy.com\/codeascraft\/blameless-postmortems\/\" target=\"_blank\" rel=\"noreferrer noopener\">Etsy Engineering | Blameless PostMortems and a Just Culture<\/a>&nbsp;<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Objectives Overview Respond The Post Mortem Training Performance Indicators Problem Management Service Level Business Continuity Plan Privacy Breach Management Appendix Resources (La version fran\u00e7aise est en cours d&#8217;\u00e9laboration) Introduction When something goes wrong, whether it&#8217;s an outage or a broken feature, team members need to respond immediately and restore service. 
This process is called\u2026 <a class=\"read-more\" href=\"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/incident-management-handbook\/\">Read more<span class=\"wb-sl\"> of Incident Management Handbook<\/span><\/a><\/p>\n","protected":false},"author":79,"featured_media":0,"parent":0,"menu_order":1,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1313","page","type-page","status-publish","hentry"],"slug_en":"incident-management-handbook","slug_fr":null,"id_en":1313,"id_fr":null,"lang":"en","_links":{"self":[{"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/pages\/1313","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/users\/79"}],"replies":[{"embeddable":true,"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/comments?post=1313"}],"version-history":[{"count":17,"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/pages\/1313\/revisions"}],"predecessor-version":[{"id":1473,"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/pages\/1313\/revisions\/1473"}],"wp:attachment":[{"href":"https:\/\/articles.alpha.canada.ca\/cds-intranet-employee-guide\/wp-json\/wp\/v2\/media?parent=1313"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}