In this text I intend to discuss some practical aspects related to the “multiple 9” percentages that are advertised by vendors regarding their reliability. Oh, and how to achieve 100% uptime (NOT).
SLA stands for Service Level Agreement and it is a binding contract between the vendor and the customer. Its central commitment is usually expressed as the percentage of a reference time window (e.g. a year) during which the Service should be functioning normally, delivering its desired output.
The Uptime of the Service represents the numeric portion of the agreement above, expressed either as a percentage or by using time units.
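To make the "percentage or time units" equivalence concrete, here is a minimal sketch that converts an uptime percentage into the downtime it allows. The function name and the choice of a 365-day year as the reference window are assumptions made for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # assumed reference window; leap years ignored


def allowed_downtime_hours(uptime_percent, window_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by a given uptime percentage."""
    return window_hours * (1 - uptime_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

Each extra 9 cuts the allowance by a factor of ten: roughly 87.6 hours a year at 99%, under 9 hours at 99.9%, under an hour at 99.99% and about five minutes at 99.999%.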
Note: the uptime/downtime definitions above do not fully apply to Services provided through separate infrastructure sets, e.g. to different geographical regions, from different data centers. A downtime in one geographical region does not mean that the Service is unavailable to every customer out there, so a different calculation method must be figured out. One solution is to look up historical data, estimate the number of requests that were not served during the downtime, and then derive the Service availability from that figure.
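The request-based estimate described in the note can be sketched as follows. The function names and all the traffic figures are hypothetical; the only idea taken from the text is to count the requests that historical data says would have arrived during the outage.

```python
def estimate_failed_requests(historical_rate_per_minute, outage_minutes):
    """Requests the affected region would normally have served during the outage."""
    return historical_rate_per_minute * outage_minutes


def request_availability(expected_requests, failed_requests):
    """Availability as the percentage of expected requests that were served."""
    if expected_requests <= 0:
        raise ValueError("expected_requests must be positive")
    return 100 * (1 - failed_requests / expected_requests)


# Hypothetical numbers: one region that normally serves 200 requests/minute
# was down for 30 minutes; all regions together expect 10 million requests
# over the reference window.
failed = estimate_failed_requests(200, 30)
availability = request_availability(10_000_000, failed)
print(f"{failed} requests lost, availability {availability:.2f}%")
```

Counting requests instead of minutes means that an outage in a small region weighs less than one in a busy region, which is closer to what customers actually experienced.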
How many 9s?
AWS provides a complete monitoring engine called CloudWatch; it works with metrics – including custom, user-provided metrics – and is able to raise alarms when any such metric crosses a certain threshold. This is the tool that is used for all performance monitoring tasks within AWS.
This text will cover a monitoring scenario in which an arbitrary application is deployed to the cloud and one needs to determine what causes the performance limits to be hit, be it the application code itself or resource limits enforced by Amazon.
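As a rough illustration of the threshold alarms mentioned above, the sketch below builds the arguments for CloudWatch's PutMetricAlarm call and, separately, sends them with boto3. The alarm name, the "MyApp" namespace and the "ResponseTimeMs" custom metric are hypothetical placeholders, and the call itself needs boto3 plus valid AWS credentials.

```python
def build_alarm_request(alarm_name, namespace, metric_name, threshold,
                        period_seconds=300, evaluation_periods=2):
    """Return keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": period_seconds,            # length of one sample window
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(request):
    """Send the request to CloudWatch; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed
    boto3.client("cloudwatch").put_metric_alarm(**request)


# Hypothetical custom metric: alarm when the average response time stays
# above 500 ms for two consecutive 5-minute periods.
request = build_alarm_request("slow-responses", "MyApp", "ResponseTimeMs", 500)
print(request)
```

Keeping the request builder apart from the network call makes it possible to inspect or unit-test the alarm definition without touching an AWS account.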
Let’s assume that one has just started using Amazon Web Services and is deploying applications on free tier or other general purpose (T2) instances. One learns that the general purpose instances work with “credits” that allow dealing with short spikes through performance bursting – but once the credits are exhausted, performance drops back to a baseline. The particular details may not make a lot of sense at first, but one needs to know whether the application can meet the desired service limits with this setup.
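A simplified model of the credit mechanism helps to reason about how long a burst can last. The sketch below assumes that one credit equals one vCPU running at 100% for one minute, and uses an earn rate of 6 credits per hour with a cap of 144 credits, figures AWS has documented for the t2.micro; treat them as assumptions and check the current documentation for your instance type. The function name is made up for this example.

```python
def minutes_until_throttled(cpu_utilization, start_credits,
                            earn_per_hour=6.0, max_credits=144.0):
    """Minutes of sustained load before credits run out, or None if they never do.

    cpu_utilization is a fraction between 0.0 and 1.0 of a single vCPU.
    """
    balance = min(start_credits, max_credits)
    earn_per_minute = earn_per_hour / 60
    spend_per_minute = cpu_utilization  # assumed: 1 credit = 1 vCPU-minute at 100%
    net_spend = spend_per_minute - earn_per_minute
    if net_spend <= 0:
        return None  # load is at or below the baseline; the balance never drains
    return balance / net_spend


# With 30 credits in the bank, how long can the instance run flat out?
print(minutes_until_throttled(1.0, 30))
# A load below the baseline can be sustained indefinitely.
print(minutes_until_throttled(0.05, 30))
```

Under these assumptions the earn rate of 6 credits per hour corresponds to a baseline of 10% CPU, so any sustained load above that eventually exhausts the balance and the application falls back to baseline performance.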
More than a year ago, during a time when I barely knew anything about Cloud Computing or AWS, I was assigned, along with a couple of colleagues, to bring an existing code base from “alpha” to “production” and ensure a smooth deployment to the Amazon Cloud. The customer wanted to “go live” in less than 3 months and be able to handle tens of thousands of visitors that would click on banners and fill the customer’s bank accounts; well, most likely they were just wishing for a good exit. On a side note, one of the photoplasty sections of the cracked.com website has an image about this type of business.
Starting the project
Things initially moved along well enough – we dealt with many functionality issues, fixing and testing more than 100 bugs and glitches; after all, this was the thing we knew best how to do and we also put in the long hours required for getting things done. We weren’t bothered by the cloud setup issues – the customer fiercely guarded the “keys to the kingdom” and agreed on instance and resource set-ups on a case-by-case basis only, all with the desire of keeping the Amazon bill as low as possible. We thought this was fine – it was their home, they knew best what they needed.