How post-incident reviews happen at Access Worldpay

Written by Andy Brodie
15 July 2020

When something goes wrong, it's important to find out why - and how to avoid similar incidents in the future. This is the premise behind Post-Incident Reviews (PIRs). In an accompanying article, Andy Brodie explains why Access Worldpay values PIRs; here, he describes how we conduct them to maximize their benefits.

To summarize my first article, Access Worldpay holds PIRs to maximize the learning opportunity afforded by the service failures that inevitably happen from time to time. Successful PIRs do this in a way that avoids blame, which damages our culture and service quality. They also let us examine how we can collectively improve in future, whether by dealing with future incidents faster or more efficiently, or by avoiding them altogether.

How do they work?

The PIR itself centres on a weekly meeting. How long this takes depends on:

How many incidents are ready for review - each one usually takes 10-15 minutes.
The complexity of each incident.
How much pre-review information has been gathered - this can include documentation of the incident's timeline and impact, as well as a problem record.

What's a 'problem record'?

We create a problem record when we are first aware of an underlying problem. We may create it reactively, after an incident, or proactively, as a result of events.
Events are messages that state something meaningful about a service. They are generated either by monitoring software or by the service itself. For example, a busy service may still be operating within its latency budget while its latency slowly creeps up over time; eventually it will exceed the SLA, causing an incident. Monitoring the events that state the 95th percentile latency for calls - e.g. every minute - means we can spot trends, raise a problem and resolve it proactively, before an incident actually occurs.
We only close problem records once we have proved that the underlying problem has been resolved. Complex problems may take weeks to resolve, so problem records exist to document the essential facts and to track the resolution process.
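
The latency example above can be sketched in a few lines of Python. This is only an illustration, not a description of Access Worldpay's tooling: the function names, the SLA value, the look-ahead horizon and the choice of a straight-line (least squares) trend are all assumptions made for the example. It takes the 95th percentile latency reported each minute and estimates how long the service has before the trend crosses the SLA.

```python
def minutes_until_breach(p95_per_minute, sla_ms):
    """Estimate minutes until a straight-line trend through the readings hits sla_ms.

    Returns 0.0 if the latest reading is already at or over the SLA, and
    None if there are too few readings or latency is flat or falling.
    """
    n = len(p95_per_minute)
    if n < 2:
        return None
    if p95_per_minute[-1] >= sla_ms:
        return 0.0
    # Ordinary least squares fit of latency against minute index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(p95_per_minute) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(p95_per_minute))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    breach_minute = (sla_ms - intercept) / slope
    # Measure from the most recent reading, never returning a negative value.
    return max(breach_minute - (n - 1), 0.0)


def should_raise_problem(p95_per_minute, sla_ms, horizon_minutes):
    """True if the trend suggests an SLA breach within the look-ahead horizon."""
    remaining = minutes_until_breach(p95_per_minute, sla_ms)
    return remaining is not None and remaining <= horizon_minutes


# Made-up readings: p95 latency in milliseconds, one value per minute.
readings = [100, 110, 120, 130, 140]
print(minutes_until_breach(readings, sla_ms=200))
print(should_raise_problem(readings, sla_ms=200, horizon_minutes=30))
```

A real service would probably need something more robust than a straight line, such as smoothing over noisy minutes or allowing for daily traffic patterns, but the principle is the same: the problem is raised from the trend, before the SLA is actually breached.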

To maximize productivity, we try to discuss only those incidents that meet certain criteria:

All the relevant people can attend - those who managed and/or resolved the incident, and engineering and product representatives.
We've documented the incident's timeline and impact.
We've created a problem record which is up to date with the current investigation / resolution.
We'll discuss proactive problems - where something bad could happen in future - if this can help the wider team.

Describing the incident

PIR meetings are led by a moderator, who introduces each incident and invites someone to manage the discussion - usually the on-call engineer who led the investigation and any subsequent fix.

The discussion begins with a thorough description of the incident, often by walking through the timeline. Timelines can only be completed once the incident has been closed, so it's helpful to keep a simple audit trail - e.g. messages in a chat channel - that can be used to add new events or information to the timeline.

Specifically, the review tries to identify four key stages in the incident lifecycle:

Incident Started - the earliest point of the lifecycle.
Incident Detected - when the incident is first reported by a person or automated monitoring.
Squad Response - the point when the first squad member began addressing the incident.
Customer behavior - whether the incident prompted customer behavior to change, e.g. by switching to a backup provider or by retrying failing operations.

A large gap between Incident Started and Incident Detected could suggest a need to improve monitoring, perhaps by making alerts more sensitive or changing where they're directed. A young service often records 'false' incidents because its alert thresholds are too sensitive.

Equally, large gaps between Incident Detected and Squad Response could suggest that we need to examine how we notify squads; that an incident was new enough to miss the alert system; or that a squad member needs a louder alarm!
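
As a rough sketch of how these two gaps might be pulled out of a timeline, the Python below computes them from three timestamps and suggests talking points for the review. The thresholds, function name and example times are invented for illustration; they are not values used by Access Worldpay.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; sensible values would differ from service to service.
MAX_TIME_TO_DETECT = timedelta(minutes=5)
MAX_TIME_TO_RESPOND = timedelta(minutes=15)


def review_gaps(started, detected, responded):
    """Return both gaps plus a list of talking points for the PIR."""
    time_to_detect = detected - started
    time_to_respond = responded - detected
    talking_points = []
    if time_to_detect > MAX_TIME_TO_DETECT:
        # Long Incident Started -> Incident Detected gap.
        talking_points.append("review monitoring and alert sensitivity")
    if time_to_respond > MAX_TIME_TO_RESPOND:
        # Long Incident Detected -> Squad Response gap.
        talking_points.append("review how squads are notified")
    return time_to_detect, time_to_respond, talking_points


# Made-up timeline for a single incident.
started = datetime(2020, 7, 1, 9, 0)
detected = datetime(2020, 7, 1, 9, 20)
responded = datetime(2020, 7, 1, 9, 25)

to_detect, to_respond, points = review_gaps(started, detected, responded)
print(to_detect, to_respond, points)
```

In this made-up case detection took 20 minutes and the response five, so only the monitoring question is flagged for discussion.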

Assessing the fix

During the incident, the team will investigate its cause and conduct tests, after which it proposes a fix. The process of fixing the incident involves three separate stages:

Fix Applied - when a change was made to resolve the incident.
Fix Confirmed - when it was confirmed that the fix worked.
Incident Resolved - when it was confirmed that the impact of the incident had ended, particularly if an incident required several fixes.

PIRs examine three specific metrics:

The time elapsed between Incident Started and Fix Confirmed is important because an active incident is still impacting something or somebody. If this is a customer, we need to stop the incident as quickly and safely as possible.
The gap between Fix Applied and Fix Confirmed - a long gap could indicate that we need to improve our monitoring.
Fix Applied does not lead to Fix Confirmed - in this scenario, the fix may have no effect or make the situation worse. This is where PIRs can offer enormous value.

When assessing this third metric, we examine the original rationale for the fix. In a PIR, we need to differentiate between what we knew at the time and what we know now, in hindsight. The aim of the PIR is to learn as much as possible so that we can apply this valuable, 20/20 post-event knowledge to future incidents while they are still active.
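
The three metrics can be sketched as a small calculation over the fixes recorded in a timeline. The Python below is a minimal illustration under stated assumptions: the `Fix` class, the `fix_metrics` function and the example times are hypothetical, and where several fixes were confirmed it simply takes the earliest confirmation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Fix:
    applied: datetime
    # None means the fix was never confirmed to have worked.
    confirmed: Optional[datetime] = None


def fix_metrics(started: datetime, fixes: List[Fix]) -> dict:
    """Summarise the three fix-related metrics for one incident."""
    confirmed = [f for f in fixes if f.confirmed is not None]
    # Assumption: use the earliest confirmed fix if there are several.
    first = min(confirmed, key=lambda f: f.confirmed) if confirmed else None
    return {
        # Incident Started -> Fix Confirmed
        "started_to_confirmed": first.confirmed - started if first else None,
        # Fix Applied -> Fix Confirmed
        "applied_to_confirmed": first.confirmed - first.applied if first else None,
        # Fixes where Fix Applied did not lead to Fix Confirmed
        "unconfirmed_fixes": len(fixes) - len(confirmed),
    }


# Made-up incident: the first fix had no effect, the second one worked.
started = datetime(2020, 7, 1, 10, 0)
fixes = [
    Fix(applied=datetime(2020, 7, 1, 10, 30)),
    Fix(applied=datetime(2020, 7, 1, 11, 0), confirmed=datetime(2020, 7, 1, 11, 10)),
]
metrics = fix_metrics(started, fixes)
print(metrics)
```

The count of unconfirmed fixes is the number worth dwelling on in the review: each one is a decision whose original rationale can be revisited with hindsight.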

It's not unknown for PIRs to contain surprising conclusions - "like the big reveal of a whodunnit," according to one regular. Such revelations often identify problem tasks, many of which apply across multiple services. This is why bringing disparate people together in a PIR can add genuine value.

What's a 'problem task'?
A problem task is an activity we carry out to understand or resolve the root cause of a problem. Any and all activity that contributes to resolving a problem is recorded as a task and a problem cannot be closed until every relevant task has been completed.
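
The relationship between problem records and problem tasks can be pictured as a simple data structure. The Python below is a sketch only: the class and field names are hypothetical and do not describe the tooling Access Worldpay uses. It enforces the two closure rules described in this article: every task must be completed, and the resolution must have been proved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemTask:
    description: str
    completed: bool = False


@dataclass
class ProblemRecord:
    summary: str
    tasks: List[ProblemTask] = field(default_factory=list)
    resolution_proved: bool = False
    closed: bool = False

    def open_tasks(self) -> List[ProblemTask]:
        """Tasks that still need to be completed."""
        return [t for t in self.tasks if not t.completed]

    def close(self) -> None:
        """Close the record, refusing if either closure rule is not met."""
        if self.open_tasks():
            raise ValueError("cannot close: tasks still open")
        if not self.resolution_proved:
            raise ValueError("cannot close: resolution not yet proved")
        self.closed = True


# Made-up example record with two tasks.
record = ProblemRecord(summary="p95 latency creeping towards SLA")
record.tasks.append(ProblemTask("Profile the slow endpoint"))
record.tasks.append(ProblemTask("Add an alert on the latency trend"))

try:
    record.close()
except ValueError as err:
    print(err)

for task in record.tasks:
    task.completed = True
record.resolution_proved = True
record.close()
print(record.closed)
```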

What causes incidents?

Broadly, PIRs uncover two types of problem, each of which prompts its own questions.

Functional issues
When the code doesn't behave in line with customer requirements or expectations, we need to know:
1. Do we need a new test case to fail if this problem happens again?
2. Do we have an existing test case that picked this up but the problem wasn't detected?
3. Were there missing acceptance criteria in the feature or story?
4. Were any omissions deliberate or accidental? In other words, did something happen that should never happen, or did we miss something?

Non-functional issues
These cover scenarios where the service is affected by an external influence; where there's a partial failure; or where the incident impacts security, availability, latency, resilience and/or recovery.
In such situations, the universe of potential questions is enormous and choosing the right ones often comes from combining experience with analysis of the incident's symptoms and impact. Often, we ask questions like these:
1. Could a different action have allowed the service to continue, at least partially?
a. Would tweaked configuration of load shedding or circuit breakers reduce the impact?
b. If a service instance failed, could we redirect to working instances more quickly?
c. Were there any warning signs before it broke; for example, if memory or disk were running out, could we have flagged this earlier in our alerts?
2. How easy was it to recover?
a. Do we know the current state of the customer and downstream third parties or do we need to reconcile?
b. Did we need to restore any data from backup and, if so, did we hit our Recovery Point Objective (amount of data lost)?
3. Is this problem specific to the service or could it apply elsewhere? For example, if we tripped over a problem in a third-party library or service, could it affect another of our services later?

Such questions demonstrate why it helps to have representatives from each squad attending PIRs. Experience shows that most incidents and problems affect more than one service. What's more, even if a single fix would resolve the underlying problem across the board, wide attendance reduces the likelihood of another squad finding themselves in the same situation in future.

The human factor

The description of PIRs contained here may give the impression that incidents are discussed entirely dispassionately. Despite everybody's best efforts, this doesn't always happen. To ensure that the response is appropriate and helpful, therefore, moderators listen out for certain types of comments, such as these four below.

Comment	Meaning and Response
"I made a mistake."	It's important to appreciate the openness being offered here and consider whether we could automate the process or add an extra review step.
"I thought that was Team X's responsibility."	This indicates a misunderstanding between teams that's usually fairly quick and easy to resolve.
"Team X didn't do what they're meant to."	Similar to the previous comment, this defensive tone can easily happen when an incident is high-impact and people are worried, frustrated or even angry. A useful response might be to ask Team X to confirm their responsibility and examine their actions, or whether there's a gap in understanding. This gives Team X the opportunity to present their own understanding of the incident. Once a conversation starts between teams, a resolution is usually quick to follow.
"We haven't been able to work out why."	Nobody likes unanswered questions so this can be demoralising for the teams. However, it is common with incidents that 'just seem to go away'. The PIR provides an opportunity to ask for help from other teams who may have different skills or offer new insights. A fresh perspective is often incredibly useful.

Improving, investing, discovering

The success of PIRs rests largely on participants' transparency, honesty and desire to improve. And as a tribe, Access Worldpay appreciates their broader benefits: "We're not just trying to improve the quality of our products, we're trying to improve the quality in our entire way of working, too," explains Jonathan, a Quality Assurance Engineer. "Within our operating model, it's down to us to react to and address every incident that our products encounter," he continues: "To this end, PIRs can be an underrated investment in future time saving."

Speaking from my own point of view: because we follow a Test Driven Development (TDD) approach, I know we do a lot of testing before code is allowed into production. So, whatever the incident, it's likely to be the product of multiple factors that we thought were independent but turned out not to be. As a result, almost all incidents have an "Ah-ha!" moment when the root cause is discovered, followed by an "Oh wow, we didn't think THAT could happen!" Which just goes to show that PIRs can be not only useful, but fascinating, too!

Andy Brodie is Senior Product Manager for Quality of Service at Access Worldpay.