I’d like to present a book I have been trying to finish reading for some time now: a very dense book, full of good practices and interesting details on how a team of very well-prepared people keeps planet-wide systems up and running.
What lessons should one walk away with from this book? A few bullets:
SREs should be software engineers at their very core, since infrastructure these days is expressed as code. Their operational load should not exceed 50% of their time, as they serve a double purpose: keeping things up and running and, at the same time, building the systems of the future.
The “rite of passage” of an SRE is going on call; it is the outcome of thorough preparation during the first few months of employment. There is also a “sweet spot” for the weekly number of issues an engineer gets paged on: with too few incidents the engineer starts to disconnect from the system; with too many, they may ignore some and/or burn out.
Google looks for a few specific qualities in prospective SREs (this is a reality of its own), but the whole environment is designed to get the best out of the people involved and, at the same time, the best result for the company. Some core values that come out of this book are collaboration, support, continuous improvement and blameless failure analysis.
A couple of chapters cover the technologies Google uses to scale its services to the whole planet (and beyond). The technical details can get confusing at times; I personally found the chapter on consensus algorithms pretty hard to follow. Also, the chapters on load balancing are a mandatory read for anyone seeking to enter this field, not necessarily at Google.
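To make the 50% rule from the first bullet a bit more concrete, here is a minimal sketch of how a team might check it. This is my own illustration, not code from the book: the `WeekLog` record, its `ops_hours` and `project_hours` fields, and the idea of tracking hours per week are all assumptions made for the example.

```python
from dataclasses import dataclass

# The 50% ceiling on operational work mentioned in the first bullet.
OPS_LOAD_CAP = 0.5


@dataclass
class WeekLog:
    """Hypothetical weekly time log for one engineer (invented for this sketch)."""
    ops_hours: float      # on-call, tickets, manual operational work
    project_hours: float  # engineering work on future systems


def ops_fraction(logs):
    """Return the share of logged time spent on operational work (0.0 if no time logged)."""
    total_ops = sum(week.ops_hours for week in logs)
    total = sum(week.ops_hours + week.project_hours for week in logs)
    return total_ops / total if total else 0.0


def exceeds_cap(logs, cap=OPS_LOAD_CAP):
    """True when operational work takes up more than the allowed share of time."""
    return ops_fraction(logs) > cap


if __name__ == "__main__":
    quarter = [WeekLog(ops_hours=10, project_hours=30),
               WeekLog(ops_hours=25, project_hours=15)]
    print(f"ops share: {ops_fraction(quarter):.0%}, over cap: {exceeds_cap(quarter)}")
```

Averaging over several weeks rather than judging a single week is a choice I made here; one heavy on-call week should probably not raise an alarm by itself.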
One can purchase this book from Amazon (product link). Go buy it and prepare to spend many hours “digesting” its content.
Later edit: Google has since made the book publicly available (link).