Just sometimes, something goes wrong. In IT operations, we call this an ‘incident’. After high-impact incidents, it’s industry best practice to hold Post-Incident Reviews (PIRs). These examine what happened, why, and how we can improve. In the first of two articles, Andy Brodie explains why Access Worldpay values PIRs so highly that it now uses them to learn from every single 'incident'.
It’s simply a fact that computers and the services they run are fallible. This is partly because the humans who create and program them are imperfect, and partly because it’s impossible to test every combination of scenarios that can arise in live service. Naturally, we still want our creations to be as close to perfect as possible, so when things do go wrong, we’re determined to find out why.
Like many organisations, Worldpay from FIS uses the
For high-impact incidents, this includes the Post-Incident Review (PIR), which provides a process to discuss what went wrong and why, plus how we might prevent things going wrong in the future – in practice, by defining new requirements and acceptance criteria for future work.
|What’s an ‘incident’?|
|It’s a record we create whenever something has gone wrong and has been detected by anything or anybody inside or outside Worldpay (e.g. a customer or supplier). The 'incident' is closed once its impact has ended, but this doesn’t mean the underlying problem has been solved. For example: an incident with a service’s operation in a single AWS Availability Zone (AZ) can be mitigated and closed by redirecting traffic to another AZ, but that hasn’t fixed the problem (or ‘root cause’) in the first AZ.|
"Every incident is a learning opportunity for the teams. Not only is it important to identify and resolve the root cause but also to learn from it. We're looking at monitoring and alerting improvements, increasing engineer training, plus expanding knowledge and experience sharing across the whole tribe.We have seen invaluable conversations happening within Access Worldpay since starting the PIRs for every incident. This will help stop some of the future incidents" Sophie Hirst - Technical Service Owner.
Within FIS, the Access Worldpay tribe decided to hold PIRs for every incident that happened, whatever its size or impact, simply because the process yields such valuable outcomes. Specifically:
However, PIRs can be extremely time-consuming, and that time takes engineers away from writing new code. Access Worldpay therefore adapted the PIR process to capture its benefits at a lower time cost.
Access Worldpay has three principles by which it treats incidents and, by extension, PIRs. These are:
The second and third points may sound rather trite. However, experience shows that this approach is the most effective way to get the best outcome – high-quality services developed and operated by skilled, motivated teams, both of which are always improving.
On the third principle, one troublesome and often unspoken kind of blame is the self-inflicted kind. Everybody makes mistakes, so everyone involved in PIRs should remember this and keep promoting the three principles, so that confidence is built up, not knocked down.
For everything but long-running incidents, a PIR is held after the relevant 'incident' is closed but often before the underlying problem is resolved.
|What’s a ‘problem’?|
|It’s the term used to describe the root cause of an 'incident', or the potential root cause of a future 'incident'. This might be:|
1. a code or infrastructure problem that requires a change;
2. a documentation problem that requires a clarification or correction;
3. a problem with a process that needs a step added – or better yet, removed.
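The distinction between an incident and a problem can be sketched as a small data model. The Python below is only an illustration under stated assumptions, not Access Worldpay's actual tooling: the class names, fields and the `pir_can_start` helper are all hypothetical, and the sketch ignores the special case of long-running incidents.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ProblemKind(Enum):
    # The three kinds of problem listed above.
    CODE_OR_INFRASTRUCTURE = "code or infrastructure"
    DOCUMENTATION = "documentation"
    PROCESS = "process"


@dataclass
class Problem:
    """Hypothetical record of a root cause, or potential root cause."""
    summary: str
    kind: ProblemKind
    resolved: bool = False


@dataclass
class Incident:
    """Hypothetical record created when something has gone wrong."""
    summary: str
    impact_ended: bool = False
    problems: List[Problem] = field(default_factory=list)

    def close(self) -> None:
        # Closing only records that the impact has ended;
        # it says nothing about whether the root cause is fixed.
        self.impact_ended = True

    def open_problems(self) -> List[Problem]:
        # Problems linked to this incident that are still unresolved.
        return [p for p in self.problems if not p.resolved]


def pir_can_start(incident: Incident) -> bool:
    # Simplification: a PIR is possible once the incident is closed,
    # even if linked problems are still open.
    return incident.impact_ended


# Example based on the Availability Zone scenario described earlier.
# The problem kind chosen here is an assumption for illustration.
incident = Incident("Service degraded in a single Availability Zone")
incident.problems.append(
    Problem("Fault in the first AZ", ProblemKind.CODE_OR_INFRASTRUCTURE)
)
incident.close()  # traffic redirected to another AZ, so the impact is over
```

In this sketch the incident is closed and a PIR could be held, yet `incident.open_problems()` still returns the unresolved fault in the first AZ. Keeping the two records separate is what lets the review happen while the fix is still pending.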
The main goal of the PIR is to work out, as a team, if there’s anything we can do better in future. This is deliberately broad; it covers not only how to avoid future incidents completely but also, if the problem has yet to be resolved, how to deal with them faster or with less human intervention.
In summary, Access Worldpay holds PIRs to maximise our productivity and optimise our services and culture. PIRs achieve this most successfully when they promote learning, eschew blame and recognise that prevention is better – and far cheaper – than cure. This first article has sought to explain why we value PIRs in principle and, broadly, what we seek to gain from them.
Andy Brodie is Senior Product Manager for Quality of Service at Access Worldpay.