How can filesystem corruption happen? There are several likely causes:
Kernel bugs: they are infrequent, but they have happened many times in the past and will happen again. There is not much to be done about them other than applying patches and keeping the kernel up to date;
Memory issues, e.g. memory errors that propagate into the file system's control structures: ECC memory usually mitigates them, but they can never be ruled out;
Underlying storage issues: quite unlikely but nevertheless possible;
Using the reset button on running servers: journaling file systems are almost always able to recover from such an incident;
RAID controller issues: this could be the leading cause and is not easy to mitigate, even if a firmware upgrade is sometimes possible.
Any filesystem type can be affected by any of the causes above. XFS seems more visible in corruption-related search results, probably because it is more widely used.
One wakes up in the morning, happy and with plans laid out for the entire day. Then some error pops up on the pager:
“Corruption of in-memory data detected. Shutting down filesystem”
Yep, this one is bad. If the filesystem contains databases or web server files, things will go south within minutes: services will go down or stop serving users. That’s why smart people replicate functionality, use load balancers, or run a hot-replica type of architecture.
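To find out which filesystem was shut down, one can search the kernel log for the message above. The following is only a minimal sketch: the function name find_xfs_shutdown is made up for illustration, and it matches just the first half of the message because the exact prefix and spacing of the log line may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example): print the log lines
# that report the XFS shutdown message quoted above.
# Reads log text from stdin, or from any files given as arguments.
find_xfs_shutdown() {
    # -F treats the pattern as a fixed string, not a regular expression.
    grep -F 'Corruption of in-memory data detected' "$@"
}

# Example usage (reading the kernel log may require root on some systems):
#   dmesg | find_xfs_shutdown
```

The matching line normally names the affected device, which is what the repair needs later on.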
The only complication in the repair is the need for access to the server's real keyboard and video (e.g. through a KVM switch). This may mean going to the physical location of the server to do the work.
Assuming this condition is met, the repair process is straightforward:
Boot the system in single-user mode, e.g. edit the kernel line in GRUB and add the keyword single at the very end. This typically gives a root console with no services started, without asking for a password;
Unmount the corrupted file system (umount);
Run xfs_repair on the device holding that particular file system.
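Once at the single-user console, the last two steps could look roughly like the sketch below. The device /dev/sdb1 and the mount point /data are placeholders, and the run and repair_xfs helpers are invented for this example. The xfs_repair -n pass (no-modify mode, which only reports problems) is an optional addition and not part of the steps above. By default the script only prints the commands, so nothing is touched until DRY_RUN=0 is set explicitly.

```shell
#!/bin/sh
# Sketch of the unmount-and-repair steps. The device and mount point used in
# the usage comment are placeholders; substitute the real ones.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: repair_xfs DEVICE MOUNTPOINT
repair_xfs() {
    device="$1"
    mountpoint="$2"

    # xfs_repair must not run on a mounted file system, so stop if umount fails.
    run umount "$mountpoint" || return 1

    # Optional preview: -n reports problems without modifying anything.
    # Its exit status is non-zero when corruption is found, so it is not chained.
    run xfs_repair -n "$device"

    # The actual repair.
    run xfs_repair "$device"
}

# Example usage, as root from the single-user console:
#   repair_xfs /dev/sdb1 /data              # print the commands only
#   DRY_RUN=0 repair_xfs /dev/sdb1 /data    # really run them
```

Printing the commands first is a deliberate choice here: it makes it harder to point xfs_repair at the wrong device in a hurry.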
Minutes later, as xfs_repair is usually quite fast, a normal reboot will restore the system to its former glory. There might be missing files, corrupted content, or corrupted database tables, but hey, that’s life. At least everything started normally for now.
Easy? It is easy. Well, until the next time…