Timeouts. In software development, they're a necessary evil – a choice to inflict minor pain on client code in order to avoid something worse. Although simple to apply, their impact on service quality means that they require careful thought. In this article, Andy Brodie explains timeouts and outlines five principles to help set them effectively when writing clients to web services, such as Access Worldpay.
Whether you're a developer, a customer or just a member of the public, you'll have experienced timeouts. They play a key role in building robust web service clients, such as clients of our Payments, Wallets or Fraudsight services. When all's going well, it's like they aren't there – your client code makes a request, the service responds quickly and tells you whether the request succeeded or failed.
If there's a problem, though, their existence becomes all too apparent. Your code times out – simply because it's decided that it's better to not know the outcome than be kept on hold for any longer, waiting to find out.
How big a deal this is depends on the nature of the original request. If it's something that doesn't involve any changes to the state of the web service, then the client code can simply try again after a period of time. No harm done.
It's a different story when the request involves an operation such as a payment sale, which both authorizes a payment and initiates a settlement. Clearly, success or failure impacts the state of the service (and beyond) – it's important to know whether the sale did or didn't happen before making the same request again and risking a second, duplicate sale.
In situations like this, you need to consider how to reconcile the state of the client with the state of the service, and how to handle the complexity that follows. For example, you may need to execute a query against the service to discover the outcome that you missed because of the timeout and only then decide whether to proceed with a retry.
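The query-then-retry flow described above might look something like the following minimal sketch. The `submit_sale` and `query_outcome` callables are hypothetical stand-ins for real client calls, not part of any Access Worldpay SDK, and the sketch assumes that a query returning no record means the original request never reached the service.

```python
from typing import Callable, Optional


def sale_with_reconciliation(
    submit_sale: Callable[[str], str],
    query_outcome: Callable[[str], Optional[str]],
    reference: str,
) -> str:
    """Submit a sale; on timeout, discover the outcome before retrying.

    submit_sale and query_outcome are hypothetical client calls.
    Either may raise TimeoutError.
    """
    try:
        return submit_sale(reference)
    except TimeoutError:
        # The outcome is unknown: the sale may or may not have happened.
        pass

    # Ask the service what happened. If this query also times out, the
    # TimeoutError propagates so the caller can escalate rather than guess.
    outcome = query_outcome(reference)
    if outcome is not None:
        return outcome  # the original request did reach the service

    # Assumption: no record means the sale never happened, so retry once.
    return submit_sale(reference)
```

Note that the retry only happens after the query has answered; if the query itself times out, nothing is resubmitted.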
That's a simple example, though – the complexity increases exponentially with the number of reconciliation operations each timeout requires, because each of those operations can itself fail or time out. In other words, for every operation that timed out, you must also resolve all the failure cases for all the reconciliation operations.
Given all this complexity, it's reasonable to ask: why do timeouts happen at all? There are two main reasons: a single, isolated failure or a wider problem with the service.
A single failure, such as one caused by a packet drop, usually makes reconciliation straightforward. You discover the state of the service, reconcile it with the state of the client and carry on as before.
Life becomes more complicated with a service problem. For instance, if the original request timed out because of a service overload, there's a chance that your follow-up reconciliation operations will also time out.
When configuring client code, there are good reasons to include timeouts. They're usually included to protect the client from performance issues or from resource starvation.
Given how impactful timeouts can be, it's important to think carefully before setting their parameters. We've identified five principles to help achieve the best possible outcome in web service clients.
Think first about the availability, latency or cost of the service. A process should only time out if that damages service quality less than the alternative – in other words, if not knowing the outcome of the process is better than waiting any longer for that process to finish. Different operations will, therefore, need different timeout parameters. Operations with no side-effects can afford more aggressive timeouts, as can idempotent requests, because retrying them does not change the outcome.
When an operation does time out, it must have a pre-defined recovery and reconciliation procedure – in other words, code that reconciles the view of the service with the view of the client code.
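One way to make this explicit is to keep a timeout and a recovery step per operation. The sketch below is illustrative only: the operation names, the timeout values and the `TimeoutPolicy` type are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    timeout_seconds: float
    safe_to_retry: bool  # True when repeating the request cannot duplicate an effect


# Illustrative values only; real numbers depend on the service and the client.
POLICIES = {
    "query_payment": TimeoutPolicy(timeout_seconds=2.0, safe_to_retry=True),
    "payment_sale": TimeoutPolicy(timeout_seconds=30.0, safe_to_retry=False),
}


def on_timeout(operation: str) -> str:
    """Return the pre-defined recovery step for an operation that timed out."""
    policy = POLICIES[operation]
    return "retry" if policy.safe_to_retry else "reconcile"
```

Here a read-only query gets a short timeout and a plain retry, while a sale gets a longer timeout and must go through reconciliation before anything is resubmitted.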
Many services specify a maximum latency in the Service Level Agreement (SLA): "This service will respond in 900ms", for example. In this example, if the service takes longer than 900ms to respond then it's considered unavailable from an SLA perspective, regardless of the response itself.
If a client times out before the service has breached its SLA, it is the client that is unavailable, not the service.
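A simple guard can stop a client from being configured with a timeout shorter than the SLA latency. This is a minimal sketch; the function name and the extra margin for network transit are assumptions, not anything an SLA prescribes.

```python
def effective_timeout(
    requested_seconds: float,
    sla_latency_seconds: float,
    margin_seconds: float = 0.1,
) -> float:
    """Never time out before the SLA latency (plus a margin) has elapsed.

    margin_seconds is an assumed allowance for network transit time.
    """
    floor = sla_latency_seconds + margin_seconds
    return max(requested_seconds, floor)
```

With a 900ms SLA, a requested timeout of 0.5 seconds would be raised to roughly one second, while a requested timeout of five seconds would be left alone.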
The period after which a client times out should be a function of resource consumption and availability. Therefore, it needs to be flexible. A client experiencing a very low load may be able to afford to time out after a few minutes; one with a very high load may time out much faster. For example, if most requests complete in less than a second, a heavily loaded client may time out after just a few seconds.
Clients are expected to look after their own best interests; the service cannot be expected to do this for them. They should, therefore, adopt the principle that they will wait for a response to their request up to the point at which this impacts their own effectiveness. Once they reach this point, they should time out.
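As a rough illustration, the timeout can be computed from how busy the client is. The linear interpolation, the default bounds and the idea of measuring load as in-flight requests against a capacity are all assumptions chosen for the sketch; a real client would pick whatever measure reflects its own resource limits.

```python
def timeout_for_load(
    in_flight: int,
    capacity: int,
    min_timeout: float = 3.0,
    max_timeout: float = 120.0,
) -> float:
    """Interpolate linearly: an idle client waits max_timeout seconds,
    a saturated client waits only min_timeout seconds."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Clamp utilisation to the range 0..1 so overload cannot go below the minimum.
    utilisation = min(max(in_flight / capacity, 0.0), 1.0)
    return max_timeout - utilisation * (max_timeout - min_timeout)
```

An idle client gets the full two minutes, a half-loaded client gets 61.5 seconds, and a saturated client drops to three seconds.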
There are often chains of service requests within a service architecture. For example: Client X calls Service A, which itself is a client to Service B. If Client X times out waiting for Service A because Service A is awaiting a response from Service B, this doesn't mean that Service A should time out waiting for Service B.
Instead, Service A should wait as long as it can to find out whether or not the operation completed so it can be reported, logged or stored as appropriate. In this way, when Client X comes back to Service A asking what happened, it gets the answer without any need for a subsequent reconciliation between Service A and Service B.
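The chain above can be simulated in a few lines. In this sketch, which uses only Python's standard library, a thread pool stands in for Service A, an in-memory dictionary stands in for its durable store, and `call_service_b` is a hypothetical slow downstream call. None of these names come from a real service.

```python
import concurrent.futures
import time

outcomes = {}  # request id -> outcome; stands in for Service A's durable store


def call_service_b(request_id: str) -> str:
    """Hypothetical downstream call that is slower than Client X will wait."""
    time.sleep(0.3)
    return "completed"


def service_a_handle(request_id: str) -> str:
    """Service A waits for B on its own terms and records what happened."""
    outcome = call_service_b(request_id)
    outcomes[request_id] = outcome
    return outcome


def service_a_lookup(request_id: str):
    """What Client X calls later to ask what happened."""
    return outcomes.get(request_id)


def client_x(executor, request_id: str, patience_seconds: float) -> str:
    """Client X gives up after patience_seconds; Service A's work carries on."""
    future = executor.submit(service_a_handle, request_id)
    try:
        return future.result(timeout=patience_seconds)
    except concurrent.futures.TimeoutError:
        return "timed out"
```

Client X giving up does not cancel Service A's call to Service B, so a later `service_a_lookup` can return the recorded outcome once that call has finished.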
Timeouts aren't very helpful if they just keep on happening – i.e. when every request fails. To avoid this, you should set your parameters to maximise the number of successful requests. If a process sees a sustained slowdown on a service that is causing repeated timeouts, it should consider how to increase the timeout period without jeopardising its own stability. For example:
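One possible approach, sketched below, is to lengthen the timeout after a run of consecutive timeouts while capping it at a ceiling the client can safely tolerate. The class name, the doubling factor and the threshold of three are assumptions made for illustration.

```python
class AdaptiveTimeout:
    """Raise the timeout after repeated timeouts, up to a hard ceiling."""

    def __init__(self, initial: float, ceiling: float,
                 growth: float = 2.0, threshold: int = 3):
        self.current = initial
        self.ceiling = ceiling      # the most the client can afford to wait
        self.growth = growth        # multiplier applied on a sustained slowdown
        self.threshold = threshold  # consecutive timeouts that count as sustained
        self._consecutive_timeouts = 0

    def record_timeout(self) -> None:
        self._consecutive_timeouts += 1
        if self._consecutive_timeouts >= self.threshold:
            # Sustained slowdown: wait longer, but never beyond the ceiling.
            self.current = min(self.current * self.growth, self.ceiling)
            self._consecutive_timeouts = 0

    def record_success(self) -> None:
        # A success breaks the run of timeouts; the current value is kept.
        self._consecutive_timeouts = 0
```

The sketch deliberately leaves out shrinking the timeout again once the service recovers; how quickly to do that depends on the client's own stability requirements.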
By their very nature, timeouts are a blunt solution to a bad situation. As a result, they need to be considered with care and then configured so as to avoid causing problems that are bigger than those they're intended to solve. I hope that this article and the principles we've established at Access Worldpay will help you find the right balance for your own web service clients – one that achieves the best possible outcomes for your customers.
Andy Brodie is Senior Product Manager for Quality of Service at Access Worldpay.