Five principles of circuit breakers
Circuit breakers provide a complementary protection to timeouts. Whereas timeouts exist primarily to protect a client, circuit breakers primarily protect the service. In this article, that partners the
Circuit breakers take their name from electrical components such as fuses, or RCDs (
This analogy is why circuit breakers use the unintuitive terms of closed (good, everything is working) and open (bad, everything is stopped).
In services, circuit breakers are generally wrapped around operations that might behave unpredictably or badly if further calls are made to them. These operations are typically calls to downstream services or third-party libraries, and they are analogous to the electrical components that RCDs protect.
In their default state, circuit breakers are closed. The decision to open a circuit is based on analysis of recent historical behavior: once a specified failure threshold is reached, the circuit is opened, which stops any further calls being made. The circuit remains open either for a period of time or until other criteria are met, such as letting a small percentage of calls go through and seeing whether they succeed. Then the circuit is closed again and normal service is resumed.
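The lifecycle described above can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation. The class name, the parameters `failure_threshold` and `open_seconds`, and the "half-open" label for the trial period are all assumptions made for the example. It also simplifies "recent historical behavior" to a count of consecutive failures, and it lets one trial call through after the waiting period rather than a percentage of traffic.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical threshold
        self.open_seconds = open_seconds            # hypothetical waiting period
        self.clock = clock                          # injectable for testing
        self.failures = 0        # consecutive failures seen so far
        self.opened_at = None    # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.open_seconds:
            return "half-open"   # waiting period over: allow a trial call
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            # The protected operation is not even attempted.
            raise CircuitOpenError("circuit is open; call not attempted")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(lambda: send_request(payload))`, where `send_request` stands in for whatever operation is being protected.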
Circuit breakers are very different from timeouts, but the two are often mistakenly used interchangeably. So, before you use a circuit breaker, it is important to understand what circuit breakers are for. The analogy with electrical circuits is extremely relevant and useful: they are for when things are (perhaps literally) on fire. Consider the impact of a single failure when calling a downstream service, and then consider the impact of not even bothering to try the call in the first place, i.e. assuming failure. Circuit breakers are for situations where even attempting to do something could do more damage than not trying at all.
To help use circuit breakers effectively, here are some principles to consider.
Circuit breakers are primarily for protecting downstream services
If a downstream service is failing then in most cases (see next principle) it is not logical to stop trying to call it. A failing service will be fixed and we want to maximize availability to customers, which means doing our best to fulfil their requests. This is especially relevant when you're writing a client to a third-party service, such as one that calls Access Worldpay. The vast majority of the responsibility for a service to defend itself lies with the service itself, not with the clients.
Circuit breakers MAY be used to protect the clients against impact
Here’s an exception to the above rule. A client may open a circuit breaker if the act of sending traffic causes damage that the downstream service is unaware of.
For example: a downstream service with side-effects (such as card payment authorization) continues to accept requests but does not respond. In this case it is acceptable for the calling service to open the circuit to prevent wider damage to Worldpay.
However, it is important to ensure that the owners of the calling service have considered:
- Can timeouts be relaxed to wait for an answer, in case one comes, without risking the client's stability?
- Does the operation require it? (e.g. does it have side-effects, is it easily recoverable/reconcilable)?
You can open and close circuits manually
It is impossible to predict every kind of catastrophic failure that may need a circuit to open. In other words, you don’t know exactly how the downstream service will (mis)behave.
Human operators may also have access to information that the client does not. There must therefore be a "manual override" switch on a circuit breaker, so that operators can open or close the circuit to either prevent damage or continue operations.
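One way to picture a manual override is as an operator setting that takes precedence over the breaker's automatic decision. The sketch below is illustrative only: the names `Override`, `OverridableCircuit`, `automatic_open` and `allows_calls` are assumptions for this example, and a real system would also need a safe way for operators to change the setting at runtime.

```python
from enum import Enum


class Override(Enum):
    NONE = "none"                  # follow the automatic decision
    FORCE_OPEN = "force_open"      # operator stops all calls
    FORCE_CLOSED = "force_closed"  # operator keeps calls flowing


class OverridableCircuit:
    def __init__(self):
        self.automatic_open = False    # set by the breaker's own failure analysis
        self.override = Override.NONE  # set by a human operator

    def allows_calls(self):
        # The operator's choice, when present, wins over the automatic state.
        if self.override is Override.FORCE_OPEN:
            return False
        if self.override is Override.FORCE_CLOSED:
            return True
        return not self.automatic_open
```

Keeping the override separate from the automatic state means that clearing it returns control to the breaker's normal logic.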
A simpler way of expressing this principle is: "Always give humans an override switch".
Circuit breaker thresholds must be agreed between owners of both client and service
Circuit breakers break the rules of service encapsulation, as services should be responsible for defending themselves in the event of failures. By opening a circuit, the calling service is agreeing that:
"I will sacrifice the availability of my service, by not even attempting to call you for a period of time, to minimize the overall business impact of a failure."
When a circuit breaker is opened, it counts against the calling service's availability (when part of a critical path)
This clarifies the "sacrifice" described in the previous principle. If the calling service does not attempt to make the call then it cannot state that the downstream service is unavailable. It possibly is, but not definitely. Therefore, when a circuit is open, the calling service takes the hit by default.
This assumes, of course, that the calling service doesn't have a fall-back process, or the ability to queue requests for later processing.
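As a rough illustration of the queueing option, a calling service might defer requests while the circuit is open instead of failing them outright. The function `submit`, the exception `CircuitOpenError` and the in-memory `deque` are assumptions for this sketch; a real service would more likely use a durable queue, and would need to decide whether the operation is safe to replay later.

```python
from collections import deque


class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises instead of attempting the call."""


def submit(request, guarded_call, retry_queue):
    """Try the guarded call; if the circuit is open, queue the request for later."""
    try:
        return ("completed", guarded_call(request))
    except CircuitOpenError:
        retry_queue.append(request)  # deferred rather than lost
        return ("queued", None)


# Example wiring: an in-memory queue, drained once the circuit closes again.
pending = deque()
```

Requests that come back as "queued" have not failed from the customer's point of view, which is why a fall-back of this kind changes how an open circuit counts against the calling service.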
Circuit breakers are a powerful tool. They can help make your system more resilient and minimize the impact of major incidents. However, they must be used with care so that they do not cause more harm than good.