Category Archives: Monitoring
On SLAs and Uptime Guarantees

In this text I intend to discuss some practical aspects related to the “multiple 9” percentages that are advertised by vendors regarding their reliability. Oh, and how to achieve 100% uptime (NOT).

Definitions

SLA stands for Service Level Agreement and it is a binding contract between the vendor and the customer. It is usually expressed as the percentage of the time reference window (e.g. a year) when the Service should be functioning normally, delivering its desired output.

The Uptime of the Service represents the numeric portion of the agreement above, expressed either as a percentage or by using time units.

Note: the uptime/downtime definitions above do not completely apply with Services provided through different infrastructure sets, e.g. to different geographical regions, from different data centers. A downtime in one geographical region does not mean that the Service is unavailable to just every customer out there so a different calculation method must be figured out. A solution may be to estimate the number of requests not served during the downtime by looking historical data up and then do the Service availability estimates from that particular numeric figure.

How many 9s?

Continue Reading →

Instance Performance Monitoring in AWS

AWS provides a complete monitoring engine called CloudWatch. This works with metrics, including custom, user-provided metrics and it’s able to raise alarms when any such metric crosses a certain threshold. This is the only tool used for perfomance monitoring tasks within AWS.

This text will cover a monitoring scenario regarding deploying an arbitrary application to the “Cloud” and then being able to determine what causes performance limiting, be it in the application code itself or coming from limits enforced by Amazon.


Scenario

Let’s assume that you have just started using Amazon Web Services and are deploying applications on this free tier or by using general purpose (T2) instances. You quickly learn that the general purpose instances work with “credits” that allow dealing with short load spikes through performance bursting, but when these credits are exhausted, instance performance is reverted to some baseline. These particular details do not make a lot of sense, but you need to know if the application can meet the desired service targets while sticking to this setup.

Continue Reading →

A lesson on failure: “Going Cloud”

About a year ago, during a time when I was barely aware about Cloud Computing or Amazon Web Services, I got assigned, along with a couple of colleagues from this consulting company I was working for, to bring an existing codebase from “alpha” to “production” and then ensure its smooth deployment to Amazon Cloud.

The customer wanted to go “live” in less than 3 months; they also wanted to be able to handle tens of thousands of visitors that would obviously click on banners and make them money. What it’s actually more probable is that they were hoping for a good exit, that is passing the hot potato to somebody else while walking out with a proft. On a side note, there is a term that could be used for these people, but this is not a meme-text so I won’t go more on that route.

Starting on a new project

With this project, things initially went to some direction: we had to incrementally deal with quite a few functionality issues and in the end we were able to put fixes for more than 100 bugs and glitches. Actually, this was all that we could do, along with the long hours required to get things done.

We could not be bothered by any setup issues with the “Cloud” configuration: we knew near to nothing on the topic and the customer fiercely guarded the “keys to the kingdom”; they would only agree on instance and resource set-ups on a case-by-case basis anyway. They were probably thinking they were paying way too much for those pesky Eastern European contractors (us), so I kind of get the “why” on keeping a close eye over the Amazon bill. It was fine by us; at the end of the day it was their home, with their needs and their rules.

Continue Reading →

Previous Page · Next Page