Creating a new way of working
The genesis of Build, Release, Operate
Ways of Working
Written by Patrick Bateman
16 July 2020
Asked to build a set of cutting-edge payments services, the Access Worldpay leadership team had a blank slate when deciding how their new organization could and should operate. Patrick Bateman, Head of Engineering, explains how they drew on first principles, sector research and painful past experience to create a new, hybrid model: Build, Release, Operate.
Once upon a time, not so long ago, I was part of a team given the green light to build a brand new product. Working within a large organization, we were an independent outfit, able to operate as we saw fit.
First time: eager
We made a running start. Within a year, we had a useful product that had all the essentials and was built using modern SDLC standards, continuous integration and deployment automation. Then we hit a brick wall. By the time we finally limped into production, our teams were demoralized and our customers' confidence was seriously undermined. What went wrong? Simply this: we'd failed to understand how our organization worked.
We hadn't appreciated that other areas decided what we could or couldn't do; that we were not a high priority; or that our developers weren't permitted to do even simple tasks. Ironically, we could have succeeded had we chosen to work within the organization's structure. It was our choice to go our own, Agile way that made us very Waterfall.
Next time: wiser
This experience still felt fresh when we were asked to assemble another new organization to build a modern eCommerce Gateway. What's more, the world had moved on. The increasing importance of DevOps cultures and the cloud made existing ways of working obsolete and gave us the perfect opportunity to start from scratch.
As we started building our Tribe [1], therefore, we went back to first principles. Looking at other operating models, we asked ourselves: what sort of organization did we want? How did we need to work?
Autonomy plus equality
Perhaps it was the lingering pain of burned fingers, but our starting ethos was to give our squads enough autonomy to deliver rapid, safe change. We also wanted to eliminate the friction between those building the product and those operating it, which tended to result in high lead times and low product quality.
Instead, we wanted our Tribe to give equal status to both operational and product quality. In fact, we wanted there to be no distinction at all - instead of 'product requirements' and 'non-functional requirements', there would simply be 'requirements', whose excellence would be prioritized not only by engineers, but also by product owners.
Data drives decisions
Thinking back through our own experiences running and operating products, and reviewing industry research (particularly Accelerate [2], which remains our primary reference), we identified three more objectives.
First, we wanted our products to be as autonomous as possible, so that we wouldn't need an army to operate them. We also needed them to be resilient - able to cope with anything that could happen, especially since they'd be based in the cloud. Combining this environment with service excellence required us to place operational requirements right up front and to design products that could cope well with chaos.
Finally, data. We never wanted to be in the position of making wild guesses about our service quality; we always had to know more about this than our customers. Therefore, we knew that every decision we made had to be driven by the data.
Where we landed
After looking at some other models, including Google's Site Reliability Engineering (SRE) model, we created a hybrid and gave it a name: Build, Release, Operate (BRO). Although it's not the neatest acronym, it does express what we do.
Each squad takes full accountability for their services. Our Tribe Principles state: "If you build it, you run it." As a result, the owning squad is responsible not only for the development, release and deployment of the service, but also for its 24/7 support. Although this sounds oppressive, many support functions - dashboards, monitoring and so on - are integrated into the product from the start, so the reality is rather less scary.
People and processes
We also make sure that each squad has the right people to cover their product's full lifecycle. And instead of keeping their expertise in separate silos, we encourage our engineers to become T-shaped. While valuing their core skills, we cross-train our software engineers to take on cloud and operational tasks, and train our cloud engineers in techniques such as TDD and ATDD.
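To make the idea concrete, here's a minimal sketch of the kind of test-first (TDD) exercise that might feature in cross-training. The mask_pan function and its tests are purely illustrative assumptions for this example, not code from our services.

```python
# A minimal, hypothetical test-first (TDD) exercise: the tests describe the
# behaviour the function must satisfy, and the implementation is only done
# when they pass. Names and values are illustrative only.
import unittest


def mask_pan(pan: str) -> str:
    """Return a card number with all but the last four digits masked."""
    return "*" * (len(pan) - 4) + pan[-4:]


class TestMaskPan(unittest.TestCase):
    def test_masks_all_but_last_four_digits(self):
        self.assertEqual(mask_pan("4111111111111111"), "************1111")

    def test_short_values_keep_last_four(self):
        self.assertEqual(mask_pan("12345"), "*2345")


if __name__ == "__main__":
    unittest.main()
```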
One of the lessons learned from before was the importance of bringing the organization with us. Consequently, we worked to help existing IT functions prepare for our new operating model. Because we were, effectively, integrating the concerns of infrastructure, deployment and other areas within our squads, we had to demonstrate that we weren't dodging the associated risks, such as letting haste overcome safety. Wherever we couldn't own a function, such as change management, we found new ways to share responsibility and refine the processes themselves.
The Tribe today
Fast forward to now and we have nine squads running 19 services, averaging 9.3 deployments per week and a 7.75-day lead time for changes. This throughput is possible partly thanks to our BRO model, but also because of the importance we attach to high service quality.
Together, they've helped us to deliver the resilience we always intended. On average, our squads experience just two out-of-hours call-outs each week, none of which have ever required a code change. Most simply involve acknowledging an issue; although some need a minor intervention, they've never been complicated or time-consuming.
The services we've built are highly reliable, too. Our plan was always to build 5-9s services - i.e. 99.999% availability. This imperative - along with the desire to maintain our strong customer relationships - breeds quick reactions. Wherever we depend on downstream networks or platforms, we monitor these carefully and engineer our services specifically to cope with any downstream problems. Today, we're consistently meeting our 5-9s target.
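For a sense of scale, 99.999% availability leaves only around five minutes of downtime a year. The short calculation below is a generic illustration of that budget, not our actual error-budget tooling.

```python
# Generic illustration: the downtime budget implied by an availability target.
# (Not our actual error-budget tooling.)

def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Minutes of allowed downtime over the given period."""
    return (1.0 - availability) * days * 24 * 60


for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.5%} -> {downtime_budget_minutes(target):.1f} min/year")

# 99.90000% -> 525.6 min/year
# 99.99000% -> 52.6 min/year
# 99.99900% -> 5.3 min/year  (roughly five minutes of downtime a year)
```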
Self-improving squads
As our squads have matured, they've discovered new ways to make themselves more effective. Initially, an on-call engineer alerted to an event wouldn't have the right tools to diagnose the issue properly. This prompted our squads to insert more operational stories into their backlogs; they've now become increasingly confident with their operational capability and far more effective at understanding how to tune their monitoring thresholds and restrict alerts to actionable reasons.
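As an illustration of what "actionable" can mean in practice, the sketch below shows a hypothetical rule that pages on-call only when a threshold breach is sustained across several consecutive samples rather than on a single spike. The class, thresholds and window size are assumptions invented for this example.

```python
# Hypothetical sketch of an "actionable alert" rule: page on-call only when an
# error-rate threshold is breached for several consecutive samples, rather
# than on a single spike. Thresholds and window size are illustrative.
from collections import deque


class SustainedBreachAlert:
    def __init__(self, threshold: float, window: int):
        self.threshold = threshold          # e.g. 0.01 = 1% error rate
        self.window = window                # consecutive samples required
        self.samples = deque(maxlen=window)

    def record(self, error_rate: float) -> bool:
        """Record a sample; return True only when every sample in the
        window breaches the threshold (i.e. the problem is sustained)."""
        self.samples.append(error_rate)
        return (len(self.samples) == self.window
                and all(s > self.threshold for s in self.samples))


alert = SustainedBreachAlert(threshold=0.01, window=5)
for rate in (0.002, 0.05, 0.003, 0.02, 0.02, 0.02, 0.02, 0.02):
    if alert.record(rate):
        print(f"page on-call: error rate {rate:.1%} sustained for 5 samples")
```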
Then there's observability. Although it's now a buzzword, we were in early, before any buzz began, looking for gaps and weak spots. Our decision to give our squads autonomy, ownership and decent diagnostic tooling has paid dividends as they've demonstrated enormous diligence and truly forensic inquisitiveness about service operations.
Learn, don't blame
In several cases, for example, some unexpected activity has prompted squad members to dig far deeper to discover behaviors that the service was never designed for. This has then led to a conversation with the customer and the release of a new feature. Situations like these also reinforce our desire to get more information to the right people to achieve our other objective of being completely data-driven.
Such curiosity has improved more than our service quality. When things do go wrong, we operate a 'learn and improve' culture. This aligns with squad accountability and removes the old "blame culture" behaviors. Each week, we run our own Post Incident Review (PIR) process, fostering a community that takes an interest in uncovering root causes and sharing common problems.
Where we're going
This is good, but we want much more. Specifically, we want to become still more responsive and make our products and services even more resilient. As a result, we're putting ever more focus on automation.
A central priority is to make our lead times even shorter. To achieve this, we need to increase the flow of features through our pipelines, which in turn means tackling the constraints that still exist. One of these is change management, so we're looking at ways to automate change approval. We're also looking at how we can automate QSA (Qualified Security Assessor) evidence reporting within our standard pipeline in order to speed up our PCI-DSS audit process.
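To show the kind of gate we have in mind, here's a simplified, hypothetical sketch of an automated change-approval check a pipeline might run before promoting a release. The criteria and field names are assumptions for this example, not our actual change-management integration.

```python
# Simplified, hypothetical sketch of an automated change-approval gate that a
# pipeline could run before promoting a release. The criteria and the
# ChangeRequest fields are illustrative, not our actual integration.
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    tests_passed: bool
    rollback_plan: bool
    open_incidents: int
    risk_score: float  # 0.0 (low) to 1.0 (high), e.g. from change history


def auto_approve(change: ChangeRequest, max_risk: float = 0.3) -> bool:
    """Approve automatically only when every low-risk criterion holds;
    anything else falls back to a human approver."""
    return (change.tests_passed
            and change.rollback_plan
            and change.open_incidents == 0
            and change.risk_score <= max_risk)


change = ChangeRequest(tests_passed=True, rollback_plan=True,
                       open_incidents=0, risk_score=0.1)
print("auto-approved" if auto_approve(change) else "needs manual approval")
```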
Faster, stronger, better
Another focus is our tooling. Given how much we depend on our tools, it's essential that we make them as reliable as possible, even going so far as to place some redundancy into our monitoring capabilities. We build redundancy and resilience into our services, so why shouldn't we expect the same of our tools? We want to build ever more automation into our processes, specifically into repeatable activities, such as incident management, that occur when we integrate our products into existing corporate processes.
Ultimately, our ambition is to reduce our time to market while also increasing quality at every level. We can currently build and release a new service in three months - we want to cut this to four weeks (two sprints) [3]. When we can do this while maintaining the resilience, reliability and overall quality of our services, we may take a second to celebrate before identifying some new targets. Because, as it turns out, our BRO model has produced a Tribe that is always trying to learn and improve and, therefore, is never wholly satisfied.
[1] The term Tribe comes from the organizational model Spotify developed to provide ownership and autonomy.
[2] Nicole Forsgren, Jez Humble and Gene Kim, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations.
[3] Disclaimer - this aim is for non-PCI-DSS scoped services. Combining soaring ambition with a smidgen of realism has always been important.