Tag Archives: linux
Book Review: How Google does SRE

I’d like to present a book I have been trying to finish reading for some time now: a very dense book, full of good practices and interesting details on how a group of very well-prepared people keeps planet-wide systems up and running.

Site Reliability Engineering: How Google Runs Production Systems (SRE)

What lessons should one walk away with from this book? A few bullets:


MySQL Monitoring with Amazon CloudWatch

Amazon CloudWatch is the monitoring tool for Amazon's cloud services. It offers both white-box and black-box monitoring for services managed by Amazon and can be extended to work with user-generated monitoring data.

This text covers the integration of a simple MySQL monitoring script with Amazon CloudWatch.

MySQL Monitoring

Let’s assume that we want to monitor the number of active connections to the MySQL server and get an indication when this figure approaches the maximum value defined in the configuration file (max_connections). To keep the script portable, we may also want to report this limit to the monitoring engine, even though it is unlikely to change without explicit human intervention.
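As a rough illustration of the idea, the sketch below turns two MySQL figures into a list of data points in the shape accepted by the CloudWatch `put_metric_data` call. It assumes that `Threads_connected` is used as the active connection count, that the input is the tab-separated output of the `mysql` client run with `--batch --skip-column-names`, and that `boto3` is available for publishing. The metric names and the `Custom/MySQL` namespace are made up for this example.

```python
def parse_mysql_pairs(output: str) -> dict:
    """Parse 'name<TAB>value' lines, e.g. from:
    mysql --batch --skip-column-names -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';
                                          SHOW GLOBAL VARIABLES LIKE 'max_connections'"
    """
    values = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        values[name.strip().lower()] = int(value)
    return values


def build_metric_data(values: dict) -> list:
    """Build CloudWatch data points (metric names are hypothetical)."""
    used = values["threads_connected"]
    limit = values["max_connections"]
    return [
        {"MetricName": "ActiveConnections", "Value": used, "Unit": "Count"},
        # Reported as well so that alarms do not hard-code the limit.
        {"MetricName": "MaxConnections", "Value": limit, "Unit": "Count"},
        {"MetricName": "ConnectionUsage", "Value": 100.0 * used / limit, "Unit": "Percent"},
    ]


def publish(metric_data: list, namespace: str = "Custom/MySQL") -> None:
    """Send the data points to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=metric_data)
```

Reporting the usage as a percentage next to the raw counts is one possible design choice: an alarm on the percentage keeps working unchanged if max_connections is later raised or lowered.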

MySQL provides 3 numeric figures we may be interested in:


On SLAs and Uptime Guarantees

In this text I intend to discuss some practical aspects of the “multiple 9” reliability percentages that vendors advertise. Oh, and how to achieve 100% uptime (NOT).

Definitions

SLA stands for Service Level Agreement and it is a binding contract between the vendor and the customer. It is usually expressed as the percentage of a reference time window (e.g. a year) during which the Service should be functioning normally, delivering its desired output.

The Uptime of the Service represents the numeric portion of the agreement above, expressed either as a percentage or by using time units.
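To make the two ways of expressing uptime concrete, here is a minimal sketch that converts an uptime percentage into the downtime it allows within a reference window. The function name and the 365-day default window are assumptions made for this example.

```python
from datetime import timedelta


def allowed_downtime(uptime_percent: float,
                     window: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime permitted by an uptime percentage over a window."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return window * (1.0 - uptime_percent / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours = allowed_downtime(pct).total_seconds() / 3600
        print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

For instance, 99.9% over a 365-day year leaves roughly 8.76 hours of downtime.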

Note: the uptime/downtime definitions above do not fully apply to Services provided through separate infrastructure sets, e.g. to different geographical regions, from different data centers. A downtime in one geographical region does not mean that the Service is unavailable to every customer out there, so a different calculation method is needed. One solution may be to estimate, from historical data, the number of requests not served during the downtime and then derive the Service availability from that figure.
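The request-based estimate described in the note could be sketched as follows. It assumes that a historical request rate for the affected region is known and that the total number of requests expected across all regions in the reference window is available. The function names and all numbers are illustrative only.

```python
def estimate_unserved(requests_per_minute: float, outage_minutes: float) -> float:
    """Estimate requests lost in one region from its historical request rate."""
    return requests_per_minute * outage_minutes


def request_availability(expected_requests: float, unserved_requests: float) -> float:
    """Availability in percent, measured in requests instead of wall-clock time."""
    if expected_requests <= 0:
        raise ValueError("expected_requests must be positive")
    return 100.0 * (1.0 - unserved_requests / expected_requests)


if __name__ == "__main__":
    # Illustrative: one region normally serves 1,200 requests/minute and is down for 30 minutes.
    lost = estimate_unserved(1200, 30)
    # Illustrative: 10,000,000 requests expected across all regions in the window.
    print(f"{request_availability(10_000_000, lost):.2f}% of requests served")
```

With this approach an outage in a small or quiet region weighs less than the same outage in a busy one, which is closer to what customers actually experience than a single global uptime clock.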

How many 9s?

