Incident Report
What is an Incident Postmortem?
A postmortem (or post-mortem) is a process intended to help you learn from past incidents. It typically involves a blameless analysis and discussion soon after an incident or event has taken place.
Why do Postmortems?
The postmortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that otherwise would be lost. Without a postmortem you fail to recognize what you are doing right, where you could improve, and most importantly, how to avoid making the same mistakes in the future.
An effective incident postmortem plan
For a postmortem to be effective and to help you build a culture of continuous improvement, you want to implement a simple, repeatable process that everyone can participate in. How you do this will depend on your culture and your team.
Structure of Postmortem
The structure is surprisingly simple yet powerful. The report is made up of five parts: an issue summary, a timeline, a root cause analysis, resolution and recovery, and lastly, corrective and preventative measures.
Issue Summary:
short summary (5 sentences)
list the duration along with start and end times (include timezones)
state the impact (most user requests resulted in 500 errors)
close with the root cause
Timeline:
list the timezone
covers the outage duration
when outage began
when staff was notified
actions, events….
when service was restored
Root Cause:
give a detailed explanation of the event
do not sugarcoat
Resolution and Recovery:
give a detailed explanation of actions taken (include times)
Corrective and Preventative Measures:
an itemized list of ways to prevent it from happening again
what can we do better next time
Here’s a sample postmortem report for code6ix.tech
The following is the incident report for the 100% request failure on the Code6ix.tech website that occurred on April 10th, 2023.
Issue Summary:
From 12:08 PM to 12:30 PM WAT (GMT+1), requests to the code6ix.tech website resulted in 500 error responses. Every user request sent to the server received a server error with status code 500. At its peak, the issue affected 100% of web server traffic: the site was completely down and users were unable to access the domain.
The root cause of the outage was that a DNS A record mapping the domain name code6ix.tech to the new server's IP address was not created after the website was moved to a new web server.
Timeline (all times West African Time):
11:47 AM: Transfer to a new web server was carried out.
12:08 PM: Website returned an Internal Server Error for requests to code6ix.tech.
12:09 PM: Datadog, monitoring the server, sent a Sev-1 alert.
12:10 PM: Pagers alerted the on-call teams.
12:12 PM: Engineers investigated the web server configuration in the sites-enabled directory under /etc/nginx/.
12:15 PM: Calls were escalated to all SRE engineers.
12:21 PM: Engineers detected that the DNS A record had not been configured with the new web server's IP address.
12:24 PM: The A record was configured to map the domain to the new web server's IP address.
12:30 PM: 100% of website traffic was back online.
Root Cause and Resolution:
At 11:48 AM WAT, the configuration to transfer the domain to a new web server was carried out. The transfer was completed and implemented, but the DevOps engineer did not create a DNS A record mapping the domain to the new web server. This caused a full outage once requests started reaching the new server at 12:08 PM WAT.
No users could access the website, because the domain had been moved to the new server without an A record mapping the domain to the server's IP address.
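A root cause like this can be spotted quickly by resolving the domain and comparing the answer against the new server's address. Below is a minimal sketch in Python; the expected IP address shown is a placeholder, not the actual server address.

```python
# check_a_record.py - minimal sketch: verify that the domain's A record
# resolves to the new web server's IP address.
import socket

DOMAIN = "code6ix.tech"
EXPECTED_IP = "203.0.113.10"  # placeholder, not the real server address

try:
    # getaddrinfo returns every IPv4 address the A record currently points to
    resolved = {info[4][0] for info in socket.getaddrinfo(DOMAIN, 80, socket.AF_INET)}
except socket.gaierror as err:
    # A missing A record (the failure described above) shows up as a lookup error
    print(f"DNS lookup for {DOMAIN} failed: {err}")
else:
    if EXPECTED_IP in resolved:
        print(f"{DOMAIN} points at the new server ({EXPECTED_IP}).")
    else:
        print(f"{DOMAIN} resolves to {resolved}, not the new server {EXPECTED_IP}.")
```

A check like this, run right after the transfer, would have flagged the missing record before users ever saw a failed request.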
Resolution:
The monitoring system alerted our engineers at 12:09 PM WAT, and they investigated and quickly escalated the issue. At 12:21 PM WAT, the incident response team identified the cause: no A record mapped the domain to the new web server.
The engineers on call quickly created this record and mapped the domain to the new server's IP address at 12:24 PM WAT. The website was 100% up by 12:30 PM WAT, and user requests to the web server returned a status code of 200.
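The final recovery check can be scripted in the same spirit. The sketch below, assuming the site is reachable over plain HTTP, simply confirms that a request to the domain comes back with a 200 status instead of a 500.

```python
# verify_recovery.py - minimal sketch: confirm the site answers with HTTP 200.
from urllib import error, request

URL = "http://code6ix.tech/"  # assumes the site is served over plain HTTP

try:
    with request.urlopen(URL, timeout=10) as resp:
        print(f"{URL} returned {resp.status}: service looks healthy.")
except error.HTTPError as err:
    # A 500 here would match the behaviour seen during the outage
    print(f"{URL} returned {err.code}: service is still failing.")
except error.URLError as err:
    print(f"Could not reach {URL}: {err.reason}")
```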
Corrective and Preventive Measures:
In the last two days, we’ve conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying causes of the issue and to help prevent recurrence and improve response times:
Add configuration checks to ensure A records are put in place whenever a domain is transferred to a new web server.
Add canonical name (CNAME) records that can redirect requests to backup servers during downtime of the active server.
Improve the process for auditing all high-risk configuration options.
Add a faster rollback mechanism and improve the traffic ramp-up process, so any future problems of this type can be corrected quickly.
Develop a better mechanism for quickly delivering status notifications during incidents.
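As a rough illustration of the last point, a status update could be pushed to a chat or status-page webhook as soon as an incident is confirmed. The sketch below is a minimal example only; the webhook URL and the JSON field name are placeholders, not a real integration.

```python
# notify_status.py - minimal sketch: post an incident status update to a webhook.
# The webhook URL and the "text" field name are placeholders for illustration only.
import json
from urllib import request

WEBHOOK_URL = "https://example.com/incident-webhook"  # placeholder endpoint


def send_status_update(message: str) -> int:
    """POST a short status message and return the HTTP status code."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = request.Request(
        WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    send_status_update("code6ix.tech: investigating elevated 500 errors, updates to follow.")
```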
This article is my submission as part of a technical writing project for the ALX Africa Software Engineering Program. If you have opinions or critiques, please comment below.
Thank You.