Tag Archives: performance
XFS corruption – repairing

Note: The “magical” tool for XFS is (obviously) xfs_repair. Getting it to run can sometimes be the hard part.

Introduction

How can file system corruption happen? There are several likely causes:

  • Kernel bugs: they are infrequent, but they have happened many times in the past and will happen again in the future. Not much can be done about them, other than applying patches and keeping the kernel up to date;

  • Memory issues, e.g. memory errors that propagate into the file system's control structures: ECC memory usually mitigates them, but they can never be ruled out;

  • Underlying storage issues: quite unlikely but nevertheless possible;

  • Using the reset button on running servers: journaling file systems are almost always able to recover from such an incident;

  • RAID controller issues: this could be the leading cause, and it is not easy to mitigate, even though a firmware upgrade is sometimes possible.


When there’s no route forward (or so you fear)

Note: This text is about project work at one of my previous employers.

Introduction

These days everybody talks about Agile, Automation, DevOps and Continuous (whatever), without truly understanding why things have gone in this direction. After all, for many years a software project had a few well-known steps that needed to be followed, like:

  • Full, thorough planning at the very beginning and from time to time, before significant milestones;

  • Development, lots of development behind closed doors;

  • Lots of manual QA work, with little automation beyond some custom testing framework written from scratch by one of the developers;

  • Infrequent releases (e.g. every year or even every other year or so); releases were thoroughly prepared and tested, with code freezes for (sometimes) months before the Day.

Even if everybody knows these days that such an approach may have been a bad way of doing things, it actually worked for many years because that was the way the world expected things to work. There were many constraints, e.g.:


Book Review: How Google does SRE

I’d like to present a book I have been trying to finish for some time now: a very dense book, full of good practices and interesting details on how to keep planet-wide systems up and running with a group of very well-prepared people.

Site Reliability Engineering: How Google Runs Production Systems (SRE)

What lessons should one take away from this book? A few bullets:

