the what, the why, the how & the when
Timeouts. In software development, they're a necessary evil – a choice to inflict minor pain on client code in order to avoid something worse. Although simple to apply, their impact on service quality means that they require careful thought. In this article, Andy Brodie explains timeouts and outlines five principles to help set them effectively when writing clients to web services, such as Access Worldpay.
Whether you're a developer, a customer or just a member of the public, you'll have experienced timeouts. They play a key role in building robust web service clients, such as those that call our Payments, Wallets or Fraudsight services. When all's going well, it's like they aren't there – your client code makes a request, the service responds quickly and tells you whether the request succeeded or failed.
If there's a problem, though, their existence becomes all too apparent. Your code times out – simply because it's decided that it's better to not know the outcome than be kept on hold for any longer, waiting to find out.
No harm done?
How big a deal this is depends on the nature of the original request. If it's something that doesn't involve any changes to the state of the web service, then the client code can simply try again after a period of time. No harm done.
It's a different story when the request involves an operation such as a payment sale, which both authorizes a payment and initiates a settlement. Clearly, success or failure impacts the state of the service (and beyond) – it's important to know whether the sale did or didn't happen before making the same request again and risking a second, duplicate sale.
In situations like this, you need to consider how to reconcile the state of the client and also the state of the service, along with the ensuing complexity. For example, you may need to execute a query against the service to discover the outcome that you missed because of the timeout and only then decide whether to proceed with a retry.
That's a simple example, though – the complexity increases exponentially with the number of reconciliation operations each timeout requires. In other words, for every operation that timed out, you must also resolve all the failure cases for all the reconciliation operations.
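The query-then-retry flow described above can be sketched as follows. This is a minimal illustration, not Access Worldpay's API: `FakePaymentService`, its `sale` and `query` methods, the `fault` switch and the `Outcome` values are all hypothetical stand-ins, and the sketch assumes every request carries a client-chosen `request_id` that the service can look up afterwards.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    NOT_FOUND = "not_found"


class FakePaymentService:
    """Hypothetical in-memory stand-in for a payment service (not a real API)."""

    def __init__(self, fault=None):
        self.fault = fault          # None, "lost_request" or "lost_response"
        self.sales = {}             # request_id -> Outcome
        self.sales_applied = 0      # how many sales actually took effect

    def sale(self, request_id, amount):
        fault, self.fault = self.fault, None   # inject the fault only once
        if fault == "lost_request":
            raise TimeoutError("request never reached the service")
        self.sales[request_id] = Outcome.SUCCEEDED
        self.sales_applied += 1
        if fault == "lost_response":
            raise TimeoutError("sale happened, but the response was lost")
        return Outcome.SUCCEEDED

    def query(self, request_id):
        return self.sales.get(request_id, Outcome.NOT_FOUND)


def sale_with_reconciliation(service, request_id, amount):
    try:
        return service.sale(request_id, amount)
    except TimeoutError:
        # The outcome is unknown, so ask the service before retrying.
        outcome = service.query(request_id)
        if outcome is Outcome.NOT_FOUND:
            # The service never saw the sale, so a retry cannot duplicate it.
            return service.sale(request_id, amount)
        return outcome
```

Whichever half of the exchange was lost, exactly one sale is applied. Note what the sketch leaves out: the query and the retry can themselves time out, and a real service might still be processing the original request when the query arrives. Those are the extra failure cases that make reconciliation grow so quickly.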
Why do requests time out, anyway?
Given all this complexity, it's reasonable to ask: why do timeouts happen at all? There are two main reasons:
- Packet drops or mis-routing – when packets of data don't reach their destination, either to the service or back to the client. These failures affect single requests and seem random. Because they're so short and unpredictable, they're typically very hard to diagnose but are often caused by faulty network components.
- A service problem – the service might be experiencing unusually heavy load or a problem with a downstream dependency, and therefore can't respond to the request as quickly as necessary.
A single failure, such as one caused by a packet drop, usually makes reconciliation straightforward. You discover the state of the service, reconcile it with the state of the client and carry on as before.
Life becomes more complicated with a service problem. For instance, if the original request timed out because of a service overload, there's a chance that your follow-up reconciliation operations will also time out.
Why do clients need timeouts?
When configuring client code, there are good reasons to include timeouts. They're usually included to protect the client from performance issues or from resource starvation.
- Performance issues: in these situations, the client shouldn't rush to time out and leave itself not knowing what happened. If timing out is the only way to free up the client to do something else, then consider decoupling the client's response to its customer from the response it is waiting for itself from the service (see service chains, below).
- Resource starvation: if this client code is processing lots of concurrent requests and volume is ramping up, then you don't want it consuming large amounts of memory, CPU or I/O resources. Techniques such as asynchronous I/O can help minimise the overheads of each call, enabling you to safely wait longer for a reply.
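As a rough illustration of the asynchronous I/O point, the sketch below uses Python's standard `asyncio` library. `call_service` is a hypothetical stand-in that simulates a non-blocking request with a sleep, and the delays and the 0.2 second timeout are arbitrary values chosen for the example.

```python
import asyncio


async def call_service(delay):
    # Hypothetical stand-in for a non-blocking HTTP request.
    await asyncio.sleep(delay)
    return "ok"


async def call_with_timeout(delay, timeout):
    try:
        return await asyncio.wait_for(call_service(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return "timed out"


async def main():
    # 100 fast calls and one slow one all wait concurrently on a single thread.
    delays = [0.01] * 100 + [2.0]
    calls = (call_with_timeout(d, timeout=0.2) for d in delays)
    return await asyncio.gather(*calls)


results = asyncio.run(main())
```

Each waiting call is a lightweight coroutine rather than a blocked thread, so holding a request open for longer costs the client relatively little.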
Five key principles
Given how impactful timeouts can be, it's important to think carefully before setting their parameters. We've identified five principles to help achieve the best possible outcome in web service clients.
1. Prioritise service quality
Think first about the availability, latency or cost of the service. A process should only time out if that damages service quality less than the alternative – in other words, if not knowing the outcome of the process is better than waiting any longer for that process to finish. Different operations will, therefore, need different timeout parameters. Operations with no side-effects, and idempotent requests, can afford to have more aggressive timeouts because repeating them changes nothing beyond what the first attempt may already have done.
When an operation does time out, it must have a pre-defined recovery and reconciliation procedure – in other words, code that reconciles the view of the service with the view of the client code.
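One way to make both points explicit in client code is a per-operation policy table. The sketch below is an assumption-laden illustration: the operation names, the timeout values and the `TimeoutPolicy` type are hypothetical, not recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    timeout_seconds: float
    retry_on_timeout: bool        # only safe when a repeat has no further effect
    reconcile_on_timeout: bool    # query the service before doing anything else

    def __post_init__(self):
        # Refuse to define an operation that has no recovery procedure at all.
        if not (self.retry_on_timeout or self.reconcile_on_timeout):
            raise ValueError("every operation needs a recovery procedure")


# Hypothetical operations and values, for illustration only.
POLICIES = {
    "query_payment": TimeoutPolicy(2.0, retry_on_timeout=True, reconcile_on_timeout=False),
    "payment_sale": TimeoutPolicy(30.0, retry_on_timeout=False, reconcile_on_timeout=True),
}
```

The read-only query gets the aggressive timeout and a plain retry; the sale waits longer and must reconcile before anything is repeated.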
2. Mind your SLAs
Many services specify a maximum latency in the Service Level Agreement (SLA): "This service will respond in 900ms", for example. In this example, if the service responds after 900ms then it's considered unavailable from an SLA perspective, regardless of the response itself.
Should a client time out before the service has breached its SLA, it is the client that is unavailable, not the service.
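In code, this amounts to treating the SLA latency as a floor for the client's timeout. The helper below is a small sketch; the function name and the `network_margin_ms` allowance for time spent on the network are assumptions added for illustration.

```python
def client_timeout_ms(preferred_ms, sla_latency_ms, network_margin_ms=100):
    """Return a timeout that never fires before the SLA could be breached."""
    # network_margin_ms is an assumed allowance for transit time.
    floor_ms = sla_latency_ms + network_margin_ms
    return max(preferred_ms, floor_ms)
```

With the 900ms SLA from the example above, a client that would prefer to give up after 500ms is held to 1000ms, while a client already prepared to wait 5000ms is unaffected.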
3. Don't be rigid
The period after which a client times out should be a function of resource consumption and availability. Therefore, it needs to be flexible. A client experiencing a very low load may be able to afford to time out after a few minutes; one with a very high load may time out much faster. For example, if most requests complete in less than a second, it may time out after just a few seconds.
Clients are expected to look after their own best interests; the service cannot be expected to do this for them. They should, therefore, adopt the principle that they will wait for a response to their request up to the point at which this impacts their own effectiveness. Once they reach this point, they should time out.
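A flexible timeout of this kind might be sketched as a function of the client's current load. The linear scaling, the use of in-flight requests as the load measure, and the default bounds of 3 and 120 seconds are all assumptions chosen for illustration; a real client would pick a measure and bounds that reflect its own resources.

```python
def adaptive_timeout(in_flight, capacity, min_timeout=3.0, max_timeout=120.0):
    """Scale the timeout down linearly as the client approaches capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Clamp load to the range 0.0 (idle) to 1.0 (at or over capacity).
    load = min(max(in_flight / capacity, 0.0), 1.0)
    return max_timeout - load * (max_timeout - min_timeout)
```

An idle client waits up to the maximum, a client at or beyond capacity falls back to the minimum, and anything in between is interpolated.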
4. You can break a service chain
There are often chains of service requests within a service architecture. For example: Client X calls Service A, which itself is a client to Service B. If Client X times out waiting for Service A because Service A is awaiting a response from Service B, this doesn't mean that Service A should time out waiting for Service B.
Instead, Service A should wait as long as it can to find out whether or not the operation completed so it can be reported, logged or stored as appropriate. In this way, when Client X comes back to Service A asking what happened, it gets the answer without any need for a subsequent reconciliation between Service A and Service B.
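The sketch below models Service A's side of this with Python's `asyncio`. Everything named here is hypothetical: `call_service_b` simulates the downstream call with a sleep, `outcomes` stands in for whatever store Service A uses, and the deadlines are arbitrary. The key assumption is that Service A can reply to Client X with a "pending" status while its own call to Service B carries on.

```python
import asyncio

outcomes = {}             # Service A's record: request_id -> outcome
background_tasks = set()  # keep references so pending tasks aren't garbage collected


async def call_service_b(request_id):
    await asyncio.sleep(0.1)       # hypothetical slow downstream call
    return "sale completed"


async def run_operation(request_id):
    outcomes[request_id] = await call_service_b(request_id)
    return outcomes[request_id]


async def handle_request(request_id, client_deadline):
    task = asyncio.create_task(run_operation(request_id))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    try:
        # shield() stops the timeout from cancelling the call to Service B.
        return await asyncio.wait_for(asyncio.shield(task), client_deadline)
    except asyncio.TimeoutError:
        return "pending"           # Client X's wait ends, the operation doesn't


async def main():
    first = await handle_request("req-1", client_deadline=0.02)
    await asyncio.sleep(0.5)       # Client X comes back later to ask what happened
    return first, outcomes.get("req-1")


first_reply, later_reply = asyncio.run(main())
```

Here `first_reply` is "pending" and `later_reply` is "sale completed": Client X's deadline ended its own wait, but Service A still learned and recorded the outcome.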
5. Avoid too much repetition
Timeouts aren't very helpful if they just keep on happening – i.e. when every request fails. To avoid this, you should set your parameters to maximise the number of successful requests. If a process sees a sustained slowdown on a service that is causing repeated timeouts, it should consider how to increase the timeout period without jeopardising its own stability. For example:
- it could process fewer requests concurrently;
- it could prioritise this process by degrading another part of the client;
- it could give up on the service entirely for a specified period and try later in the expectation that the service's latency may, by then, have regained its normal level.
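The last option is often implemented as a circuit breaker. The class below is a deliberately simplified sketch of that pattern (it has no half-open probing state, for example); the class name, the threshold of three consecutive timeouts and the 30 second cool-down are assumptions for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a service for a cool-down period after repeated timeouts."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock                 # injectable so tests can control time
        self._consecutive_timeouts = 0
        self._open_until = None             # None means requests are allowed

    def allow_request(self):
        if self._open_until is None:
            return True
        if self._clock() >= self._open_until:
            # Cool-down is over: start calling the service again.
            self._open_until = None
            self._consecutive_timeouts = 0
            return True
        return False

    def record_success(self):
        self._consecutive_timeouts = 0

    def record_timeout(self):
        self._consecutive_timeouts += 1
        if self._consecutive_timeouts >= self._threshold:
            self._open_until = self._clock() + self._cooldown
```

The client checks `allow_request()` before each call and reports the result with `record_success()` or `record_timeout()`. While the breaker is open, requests fail immediately instead of queuing up behind a service that is already struggling.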
Finding a balance
By their very nature, timeouts are a blunt solution to a bad situation. As a result, they need to be considered with care and then configured so as to avoid causing problems that are bigger than those they're intended to solve. I hope that this article and the principles we've established at Access Worldpay will help you find the right balance for your own web service clients – one that achieves the best possible outcomes for your customers.
Andy Brodie is Senior Product Manager for Quality of Service at Access Worldpay.