In this text I discuss some practical aspects of the “multiple 9” uptime percentages that vendors advertise for their services. Oh, and how to achieve 100% uptime (NOT).
SLA stands for Service Level Agreement: a binding contract between the vendor and the customer. It is usually expressed as the percentage of a reference time window (e.g. a year) during which the Service should function normally, delivering its desired output.
The Uptime of the Service is the numeric portion of the agreement above, expressed either as a percentage or in time units.
Note: the uptime/downtime definitions above do not fully apply to Services provided through separate infrastructure sets, e.g. served to different geographical regions from different data centers. Downtime in one geographical region does not mean the Service is unavailable to every customer, so a different calculation method is needed. One solution is to estimate, from historical data, the number of requests that went unserved during the downtime and derive the Service availability figure from that.
How many 9s?
If one wants to do some calculations, there is a website for this. But roughly:
99.9% allows for almost 9 hours (about 8 hours and 46 minutes) of downtime throughout the whole year.
99.95% allows for about 4 hours and 23 minutes of downtime throughout the whole year.
99.99% allows for a little less than 1 hour (about 53 minutes) of downtime throughout the whole year.
99.999% allows for just over 5 minutes of downtime throughout the whole year.
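The figures above follow from simple arithmetic; a minimal sketch of the calculation (ignoring leap years):

```python
# Downtime allowed per year for a given availability percentage.

MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by the given SLA percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for sla in (99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% -> {minutes / 60:.2f} hours ({minutes:.1f} minutes) per year")
```

Running this prints roughly 8.76 hours for three 9s down to about 5.3 minutes for five 9s.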
What are the downtime causes?
One can divide such factors into two main categories: planned and unplanned. Obviously, the first category should be explicitly accounted for, downtime-wise, in order to fit within the SLA (perhaps with a so-called “error budget”). The second category should also be considered and mitigated, but not all risks can be averted.
Among the planned outages, e.g. for a web server running on Linux, one can think of:
Operating system updates: if a restart is required due to a kernel upgrade, a single restart can amount to roughly 2 minutes of downtime. Three such restarts within a year are already enough to break a 99.999% uptime target.
Service updates: depending on the service, the downtime can be as low as a few seconds, but sometimes it can run into minutes or a significant fraction of an hour.
The unplanned outages can range from hardware issues (server failure) to software issues (e.g. critical bugs in the latest release). One should have mitigation plans in place for these types of issues in order to preserve the SLA. There are situations when the SLA cannot be preserved regardless of the effort put in; this must also be considered.
Note: when thinking about availability one must also consider the uptime of the infrastructure between the end user and the Service itself. Even with an ideal 100% Service availability measured next to the server, a random user out there only benefits from, at best, the uptime of their internet service provider compounded with the uptime of every other infrastructure component on the path to the Service. From this perspective, offering more than a 99.9% SLA might not bring tangible benefits to the average user. From an income-loss perspective this may not be enough, though.
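The compounding mentioned above is multiplicative: the end-to-end availability is the product of the availabilities of every component on the path, assuming independent failures (a simplification). The figures below are made-up examples, not real measurements:

```python
# End-to-end availability of serially dependent components (independent
# failures assumed). Input and output are percentages.

def end_to_end_availability(*component_availabilities: float) -> float:
    """Compound availability of components that must all be up."""
    result = 1.0
    for availability in component_availabilities:
        result *= availability / 100
    return result * 100

# Hypothetical path: a 99.99% Service behind a 99.9% ISP and a 99.95% transit link
print(f"{end_to_end_availability(99.99, 99.9, 99.95):.3f}%")  # about 99.840%
```

So even a four-9s Service appears to this hypothetical user as a roughly 99.84% Service, which is why the extra 9s on the server side may go unnoticed.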
Mitigating unplanned outages
What should one do when the unexpected happens? Maybe something along the lines of:
Procedures for reverting to the last known good software version. This can be as easy as reverting to a previous snapshot or running the deploy scripts with attributes pointing to the old software version. The DevOps team should be able to do it without any assistance from the Devs; perhaps some automation can be put in place so that anyone can do it, at any hour of the day or night.
Having a failover architecture in place (e.g. Service replication) – this is actually the method Amazon suggests for being able to provide an SLA above 99.95%.
Provisioning a cold standby or having procedures in place to quickly deploy a new node with the configuration of the failed one. This process can be fully automated, triggered by a health check.
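As a rough illustration of that last point, here is a minimal sketch of a health-check loop that triggers an automated node replacement. The health-check URL, thresholds, and the `deploy_replacement_node.sh` script are all hypothetical placeholders, not a real tool:

```python
# Sketch: poll a health endpoint and trigger a replacement deploy after
# several consecutive failures. All names and thresholds are assumptions.
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://service.internal/health"  # hypothetical endpoint
FAILURES_BEFORE_FAILOVER = 3
CHECK_INTERVAL_SECONDS = 10

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

def monitor() -> None:
    failures = 0
    while True:
        if is_healthy(HEALTH_URL):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                # Hypothetical provisioning script that deploys a replacement
                # node with the failed node's configuration.
                subprocess.run(["./deploy_replacement_node.sh"], check=True)
                failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)
```

Requiring several consecutive failures before acting avoids replacing a node over a single transient network blip.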
Achieving a 100% SLA is not possible in any real-world scenario. Having a failover architecture (e.g. multiple nodes delivering the Service in different availability zones/regions) can really help with the “9”s, though.
Note: This text was written by an AWS Certified Solutions Architect (Associate). Please do always work with an expert when setting up production environments.