Crazy DevOps interview questions (3)

Note: The first 2 episodes of the interview series can be found here and here.


Question 1:

The interviewer comes in and hands you the following ls -la listing:

# ls -la
total 108
dr-xr-x---.  7 root root  4096 Sep  5 07:16 .
dr-xr-xr-x. 22 root root  4096 Sep  1 17:43 ..
-rw-------.  1 root root 15432 Sep  5 06:36 .bash_history
-rw-r--r--.  1 root root    18 May 20  2009 .bash_logout
-rw-r--r--.  1 root root   176 May 20  2009 .bash_profile
-rw-r--r--.  1 root root   176 Sep 23  2004 .bashrc
-rw-r--r--   1 root root     0 Sep  5 07:16 -f
...

They say the -f file must go; would you please delete it?

I suspect most candidates would try one of the following approaches:

  • Use quotes (e.g. rm "-f"). This does not work: the shell strips the quotes and -f is passed as such to the rm command. The command will not even return an error, though, since -f tells rm to ignore non-existent files.

  • Escape the minus (e.g. rm \-f). This does not work either: the backslash is consumed by the shell and, again, -f is passed as such to the rm command.

  • Double-escape the minus (e.g. rm \\-f). This does not work: the literal string \-f is passed to rm, which returns an error, since a file named \-f does not exist while -f does.

At this point many candidates just give up, yielding a poor interview rating, which is truly a shame.

Anyway, how does one delete such a file? There are 2 solutions to this problem, both found by RTFM, i.e. man rm:

  1. Note the "--" (double minus) end-of-options marker, e.g.:

    # rm -- -f
    
  2. Prefix the file name with the current directory, e.g.:

    # rm ./-f
    
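Both solutions are easy to rehearse in a scratch directory (a throw-away sketch; the directory is created just for the demo):

```shell
cd "$(mktemp -d)"        # scratch directory for the demo

touch -- -f              # create the awkward file ('--' helps here too)
rm -- -f                 # '--' ends option parsing, so '-f' is an operand
[ ! -e ./-f ] && echo "deleted with --"

touch ./-f               # recreate it
rm ./-f                  # './-f' does not start with '-': no option parsing
[ ! -e ./-f ] && echo "deleted with ./"
```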

A natural follow-up to this interview question is enumerating the characters that are not allowed in file names by the filesystem. There are not many: on Linux only \0 (NUL, the string terminator) and / (the path separator) are forbidden, while every other byte is allowed. On Windows the forbidden list is considerably longer, though.
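This is easy to check in a scratch directory: everything except NUL and the slash really is accepted (a quick sketch):

```shell
cd "$(mktemp -d)"

# tabs, newlines, globs and leading dashes are all legal file-name bytes
touch -- $'a\tb' $'new\nline' '*' '-f'
ls -la

# '/' cannot appear in a name: it is always treated as a path separator
touch 'no/such' 2>/dev/null || echo "slash rejected"
```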


Question 2:

You are asked to get to the whiteboard and draw a cron definition to be put in a /etc/cron.d file for writing the date/time into a file every Wednesday at 5 PM. The date/time format is given to you, e.g. %Y-%m-%d %H:%M:%S.

An experienced candidate may immediately start writing something along the lines of:

0 17 * * 3 root date +'%Y-%m-%d %H:%M:%S' >> /tmp/logfile

All good? … Nope.

What is wrong, you may ask? You may remember a couple of details about how cron operates:

  • The commands are run with the shell configured in /etc/crontab rather than with your favorite shell, but for such a simple command it should not matter.

  • Local PATH changes are not visible, although “/bin” usually gets into PATH through /etc/crontab.

  • Local environment variables are not visible – but none are used.

At this point most candidates just give up.

The answer lies in the way cron interprets the % (percent) character: per crontab(5), an unescaped % marks the end of the command, and everything after it is supplied to the command as standard input, with the remaining % signs converted to newlines. Quotes do not help at all, since cron knows nothing about shell quoting; the command is effectively chopped up like this:

date +'
Y-
m-
d 
H:
M:
S' >> /tmp/logfile
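The truncation at the first percent sign can be simulated in plain shell (a sketch of the effect, not what cron literally executes):

```shell
line="date +'%Y-%m-%d %H:%M:%S' >> /tmp/logfile"

# cron hands /bin/sh only the part before the first unescaped '%'
cmd=${line%%\%*}
echo "$cmd"      # prints: date +'
```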

Such a mangled command obviously generates an error. The solution is to escape the percent signs in the cron command definition:

0 17 * * 3 root date +'\%Y-\%m-\%d \%H:\%M:\%S' >> /tmp/logfile

This finally looks good (well, sort of – but at least it does what it should do).
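For completeness, the whole /etc/cron.d fragment might look like this (the file name and log path are illustrative):

```
# /etc/cron.d/log-timestamp  (hypothetical file name)
SHELL=/bin/sh
PATH=/usr/bin:/bin

# m  h  dom mon dow user  command
0   17 *   *   3   root  date +'\%Y-\%m-\%d \%H:\%M:\%S' >> /tmp/logfile
```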


Question 3:

A situation comes along: a mission-critical process writes data to some file whose contents must be preserved at all costs, only some junior administrator has just run rm -f on it. The process is still up and running, but in a few hours log rotation will kick in and send it a HUP signal.

Let’s just ignore the chill one may feel running down the spine – this is actually quite a simple issue to solve:

  1. Identify the pid of the process, then list /proc/_pid_/fd to find the file descriptor of the deleted file, e.g.:

    # ls -la /proc/966/fd/10
    lrwx------ 1 root root 64 Sep  5 08:56 /proc/966/fd/10 -> /var/lib/critical/process.out (deleted)
    
  2. Grab the contents of the deleted file:

    # dd if=/proc/966/fd/10 of=/tmp/recovered.out
    
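The whole procedure can be rehearsed safely against the current shell, with fd 9 standing in for the critical process’s descriptor (all paths here are made up for the demo; Linux only):

```shell
f=$(mktemp)
exec 9>"$f"                      # hold the file open on fd 9
echo "precious data" >&9
rm -f "$f"                       # ...and the junior admin strikes

readlink /proc/$$/fd/9           # shows: /tmp/tmp.XXXXXX (deleted)

# reading through /proc re-opens the (deleted) inode from offset 0
dd if=/proc/$$/fd/9 of=/tmp/recovered.out 2>/dev/null
cat /tmp/recovered.out           # prints: precious data

exec 9>&-                        # release the descriptor when done
```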

Hold on: how about the contents that the process keeps on writing after we issue the dd command?

In order to avoid data loss, one may consider either of the following:

  • Issuing a SIGSTOP to freeze the process, grabbing the contents with dd, then terminating the process (kill) or restarting the service.

  • Grabbing the contents and issuing a SIGHUP immediately afterwards.

Neither of these is guaranteed to preserve everything if the application buffers data (then again, a mission-critical application should flush its buffers after every write and not cache anything; this is a good point to raise in the interview). Nevertheless, there is a clean solution, involving the tail command:

# tail -f -n +0 /proc/966/fd/10 > /tmp/recovered.out

Here -n +0 makes tail start from the very beginning of the file, while -f keeps following new writes. The command can be left running until the log rotation event, in case a quick service restart is not desirable.


Question 4:

The interviewer comes into the room and tells you that a node has reached 100% storage usage; an administrator analysed it but could not identify the cause, as all file sizes seem in order. Restarting the node is not possible due to the mission-critical software running on it.

Sounds bad? The diagnosis is actually simple: if no visible file is large enough, then some process must be keeping a reference to a very large deleted file (or to several deleted files that add up to a very large size). Finding the process and the file is the tough part, but the lsof command helps:

# lsof | grep deleted | awk '{print $2,$4,$7,$9}'
2006 34w 24967 /tmp/vteAIZ4MY
2006 35u 28708 /tmp/vteXJZ4MY
2006 36w 28985 /tmp/vteYU9XMY
....

What did I do here? I filtered the lsof output for deleted files and printed only four of its fields:

  1. The pid of the process that keeps the deleted file open;

  2. The file descriptor (with the access type);

  3. The file size (lsof’s SIZE/OFF column) – the interesting field one should keep an eye on;

  4. The file name.

Once we have identified the file, there are 2 options:

  • Reload or restart the process, e.g. kill -HUP _pid_;

  • Truncate the file, e.g. truncate -s 0 /proc/_pid_/fd/_fd_.
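If lsof is not available, the same hunt can be done straight from /proc. The sketch below sums the sizes of deleted-but-open files for a single pid (the current shell stands in for the demo; real usage would loop over /proc/*/fd with root privileges):

```shell
pid=$$                                   # stand-in pid for the demo
total=0
for fd in /proc/"$pid"/fd/*; do
    case "$(readlink "$fd" 2>/dev/null)" in
        *' (deleted)')
            # stat -L follows the symlink to the still-open inode
            size=$(stat -Lc %s "$fd" 2>/dev/null) && total=$((total + size)) ;;
    esac
done
echo "deleted-but-open bytes: $total"
```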

That’s it for today, thank you for your read!

