Why post-incident reviews matter to Access Worldpay
Just sometimes, something goes wrong. In IT operations, we call this an ‘incident’. After high-impact incidents, it’s industry best practice to hold Post-Incident Reviews (PIRs). These examine what happened, why, and how we can improve. In the first of two articles, Andy Brodie explains why Access Worldpay values PIRs so highly that it now uses them to learn from every single incident.
It’s simply a fact that computers and the services they run are fallible. This is partly because the humans who create and program them are imperfect, and partly because it’s impossible to test the infinite combinations of scenarios that can arise in live service. Naturally, we still want our creations to be as close to perfect as possible, so when things do go wrong, we’re determined to find out why.
Like many organisations, Worldpay from FIS uses the
For high-impact incidents, this includes the Post-Incident Review (PIR), which provides a process to discuss what went wrong and why, plus how we might prevent things going wrong in the future – in particular, by specifying new covering requirements and acceptance criteria for future work.
|What’s an ‘incident’?|
|It’s a record we create whenever something has gone wrong and has been detected by anything or anybody inside or outside Worldpay (e.g. a customer or supplier). The incident is closed once its impact has ended, but this doesn’t mean the underlying problem has been solved. For example: an incident with a service’s operation in a single AWS Availability Zone (AZ) can be mitigated and closed by redirecting traffic to another AZ, but that hasn’t fixed the problem (or ‘root cause’) in the first AZ.|
Expand and adapt
"Every incident is a learning opportunity for the teams. Not only is it important to identify and resolve the root cause but also to learn from it. We're looking at monitoring and alerting improvements, increasing engineer training, plus expanding knowledge and experience sharing across the whole tribe.We have seen invaluable conversations happening within Access Worldpay since starting the PIRs for every incident. This will help stop some of the future incidents" Sophie Hirst - Technical Service Owner.
Within FIS, the Access Worldpay tribe decided to hold PIRs for every incident that happened, whatever its size or impact, simply because the process yields such valuable outcomes. Specifically:
- Access Worldpay teams work in a BRO (Build-Release-Operate) model in which the team building a service also operates it once it has been released. By requiring manual intervention in a service’s operation, incidents take engineers away from writing new code. To maximise productivity, therefore, we must minimise both the number and impact of incidents.
- Embedding high-quality incident management into our culture has its own value, particularly for new engineers who may be unfamiliar with the benefits of best-practice incident management, especially when it comes to minimising customer impact.
- Low-impact incidents or near-misses often act like compiler warnings – a ‘smell’ that something is wrong which, left unaddressed, could lead to a high-impact incident in the future.
However, PIRs can be extremely time-consuming, which also takes engineers away from writing new code. As a result, Access Worldpay adapted the PIR process to capture the benefits of the process at a lower time-cost.
Three key principles
Access Worldpay applies three principles to incidents and, by extension, to PIRs. These are:
- Incidents are inevitable – no service or system will ever be perfect.
- Incidents are a learning opportunity to improve and ensure that we can continue to innovate and build world-class services.
- Blame is counter-productive and erodes trust, thereby damaging our culture and ultimately undermining the quality of our services.
The second and third points may sound rather trite. However, experience shows that this approach is the most effective way to get the best outcome – high-quality services developed and operated by skilled, motivated teams, both of which are always improving.
On the third principle, an unspoken and troublesome kind of blame is the self-inflicted kind. Everybody makes mistakes, so it’s important for everybody involved in PIRs to remember this and continue to promote these three key points so that confidence is built up, not knocked down.
Towards better services
Except for long-running incidents, a PIR is held after the relevant incident is closed, but often before the underlying problem is resolved.
|What’s a ‘problem’?|
|It’s the term used to describe the root cause of an incident, or the potential root cause of a future incident. This might be:|
1. a code or infrastructure problem that requires a change;
2. a documentation problem that requires a clarification or correction;
3. a problem with a process that needs a step added – or better yet, removed.
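The distinction between an incident and a problem can be made concrete with a small data model. What follows is only an illustrative sketch, not Access Worldpay's actual tooling: the class names, fields and the `ready_for_pir` rule are all assumptions invented for this example, and the rule deliberately ignores the long-running-incident exception.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProblemKind(Enum):
    """The three kinds of problem listed above (names are hypothetical)."""
    CODE_OR_INFRASTRUCTURE = "requires a change"
    DOCUMENTATION = "requires a clarification or correction"
    PROCESS = "needs a step added or removed"


@dataclass
class Problem:
    """A root cause, or potential root cause, of an incident."""
    summary: str
    kind: ProblemKind
    resolved: bool = False


@dataclass
class Incident:
    """A record that something has gone wrong and been detected."""
    summary: str
    impact_ended: bool = False
    problems: list[Problem] = field(default_factory=list)

    @property
    def closed(self) -> bool:
        # An incident closes when its impact ends, whether or not
        # the underlying problems have been resolved.
        return self.impact_ended

    def ready_for_pir(self) -> bool:
        # Simplifying assumption: a PIR can be held once the incident is closed.
        return self.closed

    def open_problems(self) -> list[Problem]:
        return [p for p in self.problems if not p.resolved]


# The Availability Zone example: traffic is redirected, so the impact ends
# and the incident closes, but the fault in the first AZ is still open.
incident = Incident("Service degraded in a single Availability Zone")
incident.problems.append(
    Problem("Fault in the first AZ", ProblemKind.CODE_OR_INFRASTRUCTURE)
)
incident.impact_ended = True
```

In this sketch the incident is closed and ready for review while `open_problems()` still returns the unresolved root cause, which mirrors the point that a PIR is often held before the problem itself is fixed.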
The main goal of the PIR is to work out, as a team, whether there’s anything we can do better in future. This is deliberately broad: it covers not only how to avoid future incidents completely but also, if the problem has yet to be resolved, how to deal with future incidents faster or with less human intervention.
In summary, Access Worldpay holds PIRs to maximise our productivity and optimise our services and culture. PIRs achieve this most successfully when they promote learning, eschew blame and recognise that prevention is better – and far cheaper – than cure. This first article has sought to explain why we value PIRs in principle and, broadly, what we seek to gain from them.
Andy Brodie is Senior Product Manager for Quality of Service at Access Worldpay.