Crazy DevOps interview questions

Who likes interviews? Me neither. Well, depends…

If you get any of the following questions during an interview, then either the interviewer has read this post or is getting information from the same sources as I am. Either way, let’s get one step ahead.

Question 1:

Suppose you log into a node and find the following situation:

# ls -la /bin/chmod
-rw-r--r-- 1 root root 56032 Jan 14  2015 /bin/chmod

Is it fixable?

… yes. Most of the time. Let’s remember how dynamically linked executables are started on Linux: by the dynamic loader, ld-linux-something. Let’s check:

# ldd /bin/chmod
	 =>  (0x00007ffdf27fc000)
	 => /lib/x86_64-linux-gnu/ (0x00007fb11a650000)
	/lib64/ (0x00007fb11aa15000)

OK, got it:

# /lib64/ /bin/chmod +x /bin/chmod

Fun, isn’t it? The loader is itself an executable that maps the program passed as its first argument, so only read permission on /bin/chmod is needed. The obvious follow-up question is “what if the loader’s permissions are improperly set” or something similar. The answer is that not all filesystem issues are fixable with easy commands. One may even have to mount the file system on a different installation (e.g. from a Live CD, or by attaching the virtual storage to another, “good” node) and fix things up from there.
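To try the trick without touching the real /bin/chmod, the same steps can be replayed on a scratch copy. This is only a sketch: it assumes a dynamically linked chmod and an ldd that prints the loader as the only line whose first field is an absolute path (which is how glibc and musl print it). The scratch directory and the variable names are made up for the illustration.

```shell
#!/bin/sh
# Replay the "chmod lost its x bit" fix on a scratch copy.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

cp /bin/chmod "$work/chmod"
chmod a-x "$work/chmod"            # break the copy on purpose

# Assumption: the loader is the only ldd line whose first field is an absolute path.
loader=$(ldd /bin/chmod | awk '$1 ~ /^\// { print $1; exit }')
echo "loader: $loader"

# The loader maps the file itself, so the missing x bit does not matter.
"$loader" "$work/chmod" +x "$work/chmod"
ls -l "$work/chmod"
```

If the scratch directory sits on a filesystem mounted noexec, the loader will most likely refuse to map the copy, so pick another location in that case.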

Question 2:

A running process gets stuck reading a socket (variations: it may be blocked, even deadlocked, reading from or writing to the descriptor, or it may keep receiving EAGAIN and spin in some sort of livelock). How can one close this bad socket/file descriptor without killing the process?

… so, let’s ask ourselves: where do we get to see from the outside the files or sockets opened by a certain process? Yes, in /proc:

# ps -e | grep myprocess
  979 pts/0    00:00:00 /usr/bin/myprocess

# ls -la /proc/979/fd
total 0
dr-x------ 2 root   root  0 oct 22 21:37 .
dr-xr-xr-x 9 nobody dip   0 oct 22 21:37 ..
lrwx------ 1 root   root 64 oct 22 21:37 0 -> /dev/null
lrwx------ 1 root   root 64 oct 22 21:37 1 -> /dev/null
lrwx------ 1 root   root 64 oct 22 21:37 2 -> /dev/null
lrwx------ 1 root   root 64 oct 22 21:37 3 -> socket:[9814]
lrwx------ 1 root   root 64 oct 22 21:37 4 -> socket:[9816]
lrwx------ 1 root   root 64 oct 22 21:37 5 -> socket:[9817]
lrwx------ 1 root   root 64 oct 22 21:37 6 -> socket:[9818]
lr-x------ 1 root   root 64 oct 22 21:37 7 -> pipe:[9825]
l-wx------ 1 root   root 64 oct 22 21:37 8 -> pipe:[9825]
lrwx------ 1 root   root 64 oct 22 21:37 9 -> socket:[9827]

Now we’re onto something; at least we now have the pid and the fd (file descriptor). Removing open sockets or interacting with them from outside of the process is not possible; this means we have to (basically) get in. How? With gdb. Let’s assume that pid == 979 and fd == 9:

# gdb
  ... (lots of output) ...
(gdb) attach 979
Attaching to process 979
  ... (lots of output) ...
(gdb) p (int)close(9)
$1 = 0
(gdb) detach
Detaching from program: /usr/bin/myprocess, process 979
(gdb) quit

That was tricky, wasn’t it? I suspect that whoever asks such a question in an interview is looking for a real expert, though.
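The same session can be scripted. The sketch below is an assumption-laden illustration, not a polished tool: list_fds and close_fd_in_process are hypothetical helper names, readlink is an external command, and the gdb call only works where gdb is installed and ptrace attachment is permitted. The cast to int is there because newer gdb versions tend to refuse calling close() without knowing its return type.

```shell
#!/usr/bin/env bash
# List the open descriptors of a pid, the scripted twin of "ls -la /proc/<pid>/fd".
list_fds() {
    local pid=$1 entry
    for entry in /proc/"$pid"/fd/*; do
        [ -L "$entry" ] || continue        # unmatched glob or vanished entry
        printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry" 2>/dev/null)"
    done
}

# Hypothetical helper: close one descriptor inside a running process.
# gdb attaches, calls close(fd) in the target, then detaches when the batch ends.
close_fd_in_process() {
    local pid=$1 fd=$2
    gdb -p "$pid" -batch -ex "call (int)close($fd)"
}

list_fds "$$"      # descriptors of the shell running this script
```

With the numbers from the example above, the call would be close_fd_in_process 979 9. The process itself still believes descriptor 9 is valid, so its next read or write on it will fail with EBADF; whether it copes with that is up to the program.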

Question 3:

How would you fix an issue like the following:

# ls
bash: fork: retry: Resource temporarily unavailable

The short explanation for the behaviour above is that the system is out of pids and the only access left is through this particular shell interface.
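How close a machine is to that state can be checked with builtins alone, which matters once nothing can be forked any more. A rough sketch, assuming bash and a mounted /proc; the variable names are arbitrary, and the count covers processes only, while threads also consume pids.

```shell
#!/usr/bin/env bash
# Builtins only: none of these lines needs a new process.
read -r pid_max < /proc/sys/kernel/pid_max    # kernel-wide ceiling for pid values
procs=(/proc/[0-9]*)                          # one glob match per live process
echo "processes: ${#procs[@]}, pid_max: $pid_max"

printf 'per-user process limit (ulimit -u): '
ulimit -u                                     # prints a number or "unlimited"
```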

NB: Some of you might have already noticed, this is the classical fork bomb question that is sometimes asked when interviewing with Google or another big company. They do not seek a definitive solution (there is no universal one) but rather to evaluate one’s approach to solving it.

Well, the real answer is that in such situations the lag may be so bad (after all, there are many processes competing for run time) that the only resolution for such an issue is a hard reboot. But in a typical interview setup one may be expected to provide a “smart” answer that involves a few shell scripting tricks.

There are some issues to be solved before the “smart” answer can be expressed:

  • How do you figure out the process name? In the situation of a fork bomb it may be the same process name taking up all the available pids.

  • Once you have the process name, how do you identify all the pids?

  • Once you have a (very large) pid list, how do you run the kill command?

The first question is not usually asked by the interviewer. The name is usually “known” or can be figured out through trial and error by browsing /proc/. The browsing part is not trivial, though: you can’t run ls or cat, as they are external commands. But one can do it with shell builtins:

# read t < /proc/1091/cmdline ; echo $t
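One detail worth knowing: the arguments in /proc/<pid>/cmdline are separated by NUL bytes, and a plain read drops them, so a command with arguments comes out glued together. A small sketch, assuming bash, that stops at the first NUL and therefore returns only argv[0]; proc_argv0 is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print argv[0] of a pid using only bash builtins.
# read -d '' uses NUL as the delimiter, so it stops after the first argument.
proc_argv0() {
    local name=
    { IFS= read -r -d '' name < "/proc/$1/cmdline"; } 2>/dev/null
    printf '%s\n' "$name"
}

proc_argv0 "$$"    # argv[0] of the shell running this script
```

For kernel threads and for pids that vanished in the meantime the function prints an empty line, which is convenient when looping over all of /proc.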

The list of pids can also be walked with shell builtins alone (NB: piping into a while loop, as in “… | while read …”, forks a subshell, so a plain for loop over a glob is the safer choice):

# for p in /proc/[0-9]* ; do echo ${p#/proc/} ; done
 ... long pid list, one for every line ...
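Which words bash handles internally is better checked than guessed, ideally on a healthy machine before the interview. A minimal sketch, assuming bash; the list of words is just a sample.

```shell
#!/usr/bin/env bash
# Ask bash how it would run each word: "builtin" and "keyword" need no new
# process, "file" is an external program that has to be forked and exec'ed.
for word in read echo kill '[' '[[' for while ls; do
    printf '%-6s ' "$word"
    type -t "$word"
done
```

On a stock bash, read, echo, kill and [ are reported as builtins, [[, for and while as keywords, and ls as a file, so only the last one costs a pid.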

Now for the kill part. A plain kill -9 is easy to issue but would not resolve the problem: every killed process only frees a pid for the bomb to spawn into. This calls for a two-step process:

  1. Putting all malicious processes in “suspend” state (kill -SIGSTOP);

  2. Actually removing them (kill -SIGKILL);

The code can now be written as follows (NB: “[[” is a bash keyword and “[” is a bash builtin, so neither forks; “[[” is used here because it does not word-split the unquoted $cmd):

# for p in /proc/[0-9]*/cmdline ; do read -r cmd < "$p" ; if [[ $cmd = 'bomb_process_name' ]]; then t=${p%/cmdline} ; kill -SIGSTOP "${t#/proc/}" ; fi ; done
# for p in /proc/[0-9]*/cmdline ; do read -r cmd < "$p" ; if [[ $cmd = 'bomb_process_name' ]]; then t=${p%/cmdline} ; kill -SIGKILL "${t#/proc/}" ; fi ; done


Notes:

  • The string comparison relies on whitespace being present around the equal sign (e.g. $a˽=˽"xxx" is a comparison, while $a="xxx" is a single word that [[ merely tests for being non-empty).

  • The equal (=) string comparison operator can be replaced with ==; in this scenario both of them achieve the same result.

  • The string manipulation operators (# and %) remove the shortest match; in this scenario (a static string, not a pattern), ## and %%, which remove the longest match, would have produced the same result.

  • The cmdline in /proc/<pid> is not a reliable source for the process name. It works for basic bash fork bombs or for the results of poor programming. A motivated attacker will most likely randomize the process command line, e.g. snprintf(argv[0], size, "randomstuff").

  • The other sources for the process name (/proc/<pid>/comm, /proc/<pid>/status) can also be overridden, e.g. with prctl(PR_SET_NAME, (char*) "randomstuff") – yes, yes, const_cast, I know.

Follow ups:

  • What if it’s a bash fork bomb and you are in a bash shell? Wouldn’t the first command also lock you out?

  • What if kill is not built in and only the external version is available?

The answer to those is that not everything can be solved cleanly. One may think about /proc/self, a symlink to our own pid; reading the link is possible with readlink, but that is an external command. Luckily bash expands $$ to our own pid, which we can then exclude in the if condition above.

With kill as an external command I’m not even sure this problem can be solved in all situations.
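Excluding our own shell from the two passes can be sketched as follows. This assumes bash with its builtin kill and a bomb that is recognisable by argv[0]; signal_by_name and defuse are hypothetical helper names and bomb_process_name remains a placeholder.

```shell
#!/usr/bin/env bash
# Two passes over /proc with builtins only: freeze every match, then kill it.
signal_by_name() {
    local sig=$1 name=$2 f pid argv0
    for f in /proc/[0-9]*/cmdline; do
        pid=${f#/proc/}; pid=${pid%/cmdline}
        [[ $pid == "$$" ]] && continue            # never signal our own shell
        argv0=
        # NUL-delimited read: compare argv[0] only, ignore the arguments.
        { IFS= read -r -d '' argv0 < "$f"; } 2>/dev/null
        [[ $argv0 == "$name" ]] && kill -s "$sig" "$pid" 2>/dev/null
    done
    return 0
}

defuse() {
    signal_by_name STOP "$1"    # pass 1: a stopped process cannot fork
    signal_by_name KILL "$1"    # pass 2: actually free the pids
}
```

Usage would be defuse bomb_process_name. If the bomb is a bash function, its argv[0] is the same as that of every other bash on the machine, so the call also takes down unrelated bash sessions; only the shell issuing it is spared.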

If the fork bomb was triggered by a normal user, a process limit can be appended to /etc/security/limits.conf right away (echo and the >> redirection are handled by the shell itself, so no fork is needed):

echo "forkuser hard nproc 20" >> /etc/security/limits.conf

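This caps the number of processes the user may own, which is meant to stop the active component of the bomb (even if we issue an “exec …” on the terminal, the pid will not be claimed by the fork bomb). One caveat: limits.conf is applied by pam_limits when a session is opened, so it constrains sessions opened afterwards, while processes that are already running keep the limits they started with. Once the limit is in force, just executing the second kill command will most likely free enough pids so that the normal operation of the system can be resumed.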

If the user is root or has certain capabilities set (e.g. CAP_SYS_ADMIN), this option is not available. Falling back to collecting all the pids in a shell variable and issuing an exec /usr/bin/kill locks us out by the time the second command of the pair needs to be run (and, unfortunately, the first command has to be run in order to defuse the active component of the fork bomb). Cleaning up the (now practically stale) pids is the tough issue.

Note: there is no guarantee that the pid freed by the now gone shell remains free in order for another process within our direct control to claim it. There may be multiple daemons running on that machine competing for this resource type.

Nevertheless, one may try to explore:

  • cron, but this is a “best effort” solution as there is no guarantee that cron is able to run.

  • inittab, if available on the system (it is gone on newer distributions): put the kill commands in the file so that they run on a switch to a given runlevel (e.g. 4), then force the runlevel change with telinit. Again, this is not guaranteed to work.

On systems we have physical access to, there are usually several login consoles, accessible with key combinations like ALT + function key, which lets us get back in. For systems that we access through pseudo-terminals (e.g. through ssh, xterm, GNOME Terminal) this is not possible, so the question does not really have an answer.

Those were the “crazy” interview questions for today. Hope you enjoyed them!

Note: There are 3 more episodes of the interview series here, here and here.
