When something goes wrong, it’s important to find out why – and how to avoid similar incidents in the future. This is the premise behind Post-Incident Reviews (PIRs). In an accompanying article, Andy Brodie explains why Access Worldpay values PIRs; here, he describes how we conduct them to maximise their benefits.
To summarise my
The PIR itself centres on a weekly meeting. How long this takes depends on:
|What’s a ‘problem record’?|
|We create a problem record when we are first aware of an underlying problem. We may create it reactively, after an incident, or proactively, as a result of events.|
Events are messages that state something meaningful about a service. They are generated either by monitoring software or by the service itself. For example, a busy service may still be operating within its latency budget while its latency creeps up over time; eventually it will exceed the SLA, causing an incident. Monitoring events that report the 95th percentile latency for calls – e.g. every minute – means we can spot the trend, raise a problem and resolve it proactively, before an incident actually occurs.
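As a rough illustration of the latency example above, the sketch below fits a straight line to per-minute 95th percentile readings and estimates how long it would take the trend to reach the SLA. The function names, the sample readings and the 500 ms SLA are assumptions made for illustration, not a description of Access Worldpay's monitoring.

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until_breach(p95_ms, sla_ms):
    """Estimate minutes until the fitted p95 trend reaches sla_ms.

    p95_ms holds one reading per minute, oldest first.
    Returns 0.0 if the latest reading is already at or over the SLA,
    and None if there is too little data or the trend is not rising.
    """
    if p95_ms and p95_ms[-1] >= sla_ms:
        return 0.0
    if len(p95_ms) < 2:
        return None
    slope, intercept = fit_line(p95_ms)
    if slope <= 0:
        return None
    breach_minute = (sla_ms - intercept) / slope
    # Subtract the index of the latest reading to get "minutes from now".
    return max(0.0, breach_minute - (len(p95_ms) - 1))


# Hypothetical readings: p95 latency rising by 10 ms per minute.
readings = [300, 310, 320, 330, 340]
remaining = minutes_until_breach(readings, sla_ms=500)
if remaining is not None:
    print(f"Trend reaches the SLA in about {remaining:.0f} minutes")
```

A straight-line fit is the simplest possible trend model; a real alert would probably smooth over a longer window before raising a problem record.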
We only close problem records once we have proved that the underlying problem has been resolved. Complex problems may take weeks to resolve, so problem records exist to document the essential facts and to track the resolution process.
To maximise productivity, we try to discuss only those incidents that meet certain criteria:
PIR meetings are led by a moderator, who introduces each incident and invites someone to manage the discussion – usually the on-call engineer who led the investigation and any subsequent fix.
The discussion begins with a thorough description of the incident, often by walking through the timeline. Timelines can only be completed once the incident has been closed, so it’s helpful to keep a simple audit trail – e.g. messages in a chat channel – that can be used to add new events or information to the timeline.
Specifically, the review tries to identify four key stages in the incident lifecycle:
A large gap between Incident Started and Incident Detected could suggest a need to improve monitoring, perhaps by making alerts more sensitive or changing where they’re directed. A young service often records ‘false’ incidents because its alert thresholds are too sensitive.
Equally, large gaps between Incident Detected and Squad Response could suggest that we need to examine how we notify squads; that an incident was new enough to miss the alert system; or that a squad member needs a louder alarm!
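The gaps described above can be read straight off a completed timeline. The following is a minimal sketch that uses only the stage names mentioned in the text; the `stage_gaps` helper and the timestamps are hypothetical.

```python
from datetime import datetime

# Stage names as they appear in the text; only those named there are listed.
STAGES = ["Incident Started", "Incident Detected", "Squad Response"]


def stage_gaps(timeline):
    """Return {(earlier, later): timedelta} for each consecutive pair of stages."""
    gaps = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        gaps[(earlier, later)] = timeline[later] - timeline[earlier]
    return gaps


# Hypothetical timeline for a single incident.
timeline = {
    "Incident Started": datetime(2021, 3, 1, 9, 0),
    "Incident Detected": datetime(2021, 3, 1, 9, 25),
    "Squad Response": datetime(2021, 3, 1, 9, 31),
}

for (earlier, later), gap in stage_gaps(timeline).items():
    print(f"{earlier} -> {later}: {gap.total_seconds() / 60:.0f} min")
```

In this made-up case, detection takes far longer than the squad's response, which would point the review towards monitoring rather than notification.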
During the incident, the team will investigate its cause and conduct tests, after which it proposes a fix. The process of fixing the incident involves three separate stages:
PIRs examine three specific metrics:
When assessing this third metric, we examine the original rationale for the fix. In a PIR, we need to differentiate between what we knew at the time and what we know now, in hindsight. The aim of the PIR is to learn as much as possible so that we can apply this valuable, 20/20 post-event knowledge to future incidents while they are still active.
It’s not unknown for PIRs to reach surprising conclusions – “like the big reveal of a whodunnit,” according to one regular. Such revelations frequently identify problem tasks that apply across multiple services. This is why bringing disparate people together in a PIR can add genuine value.
|What’s a ‘problem task’?|
|A problem task is an activity we carry out to understand or resolve the root cause of a problem. Any and all activity that contributes to resolving a problem is recorded as a task and a problem cannot be closed until every relevant task has been completed.|
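The closure rule can be expressed as a small data model. This is only a sketch: the class and field names are hypothetical, and it combines the rule above (every task completed) with the requirement that the resolution has been proved.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemTask:
    description: str
    completed: bool = False


@dataclass
class ProblemRecord:
    summary: str
    tasks: list = field(default_factory=list)
    resolution_proven: bool = False  # set once the fix is shown to work

    def can_close(self):
        """A record may close only when the fix is proven and no task is open."""
        return self.resolution_proven and all(t.completed for t in self.tasks)


# Hypothetical record with one finished and one outstanding task.
record = ProblemRecord(
    summary="p95 latency creeping towards the SLA",
    tasks=[
        ProblemTask("Profile the slow endpoint", completed=True),
        ProblemTask("Roll out the connection pool fix"),
    ],
)
print(record.can_close())  # False: a task is still open and nothing is proven
```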
Broadly, PIRs uncover two types of problem, each of which prompts its own questions.
Functional issues: When the code doesn’t behave in line with customer requirements or expectations, we need to know:
Non-functional issues: These cover scenarios where the service is affected by an external influence; where there’s a partial failure; or where the incident impacts security, availability, latency, resilience and/or recovery.
In such situations, the universe of potential questions is enormous and choosing the right ones often comes from combining experience with analysis of the incident’s symptoms and impact. Often, we ask questions like these:
Could a different action have allowed the service to continue, at least partially?
a. Would tweaked configuration of load shedding or circuit breakers reduce the impact?
b. If a service instance failed, could we redirect to working instances more quickly?
c. Were there any warning signs before it broke; for example, if memory or disk were running out, could we have flagged this earlier in our alerts?
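To make the circuit breaker question concrete, here is a minimal single-threaded sketch. The class, its parameter names and its default values are assumptions for illustration; `failure_threshold` and `reset_timeout_s` are the kind of configuration a PIR might decide to tweak.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: fall through and let this call act as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers fail fast instead of queueing behind a broken dependency, which is one way a service can keep working partially. A lower threshold trips sooner but risks opening on transient errors.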
Such questions demonstrate why it helps to have representatives from each squad attending PIRs. Experience shows that most incidents and problems affect more than one service. What’s more, even if a single fix would resolve the underlying problem across the board, wide attendance reduces the likelihood of another squad finding themselves in the same situation in future.
The description of PIRs contained here may give the impression that incidents are discussed entirely dispassionately. Despite everybody’s best efforts, this doesn’t always happen. To keep the response appropriate and helpful, moderators listen out for certain types of comment, such as the four below.
|Comment|Meaning and Response|
|---|---|
|“I made a mistake.”|It’s important to appreciate the openness being offered here and consider whether we could automate the process or add an extra review step.|
|“I thought that was Team X’s responsibility.”|This indicates a misunderstanding between teams that’s usually fairly quick and easy to resolve.|
|“Team X didn’t do what they’re meant to.”|Similar to the previous comment, this defensive tone can easily happen when an incident is high-impact and people are worried, frustrated or even angry. A useful response might be to ask Team X to confirm their responsibility and examine their actions, or whether there’s a gap in understanding. This gives Team X the opportunity to present their own understanding of the incident. Once a conversation starts between teams, a resolution is usually quick to follow.|
|“We haven’t been able to work out why.”|Nobody likes unanswered questions so this can be demoralising for the teams. However, it is common with incidents that ‘just seem to go away’. The PIR provides an opportunity to ask for help from other teams who may have different skills or offer new insights. A fresh perspective is often incredibly useful.|
The success of PIRs rests largely on participants’ transparency, honesty and desire to improve. And as a tribe, Access Worldpay appreciates their broader benefits: “We’re not just trying to improve the quality of our products, we’re trying to improve the quality in our entire way of working, too,” explains Jonathan, a Quality Assurance Engineer. “Within our operating model, it’s down to us to react to and address every incident that our products encounter,” he continues. “To this end, PIRs can be an underrated investment in future time saving.”
Speaking from my own point of view, as we follow a
Andy Brodie is Senior Product Manager for Quality of Service at Access Worldpay.