Some of our current software runs on Unicorn which, if you aren’t the target audience for this post, is a process-based Ruby webserver that has:
- a very old-school website,
- a GitHub mirror,
- and an architecture based on a master process that forks a number of child worker processes to handle requests.
We recently got interested in having exactly one of a set of Unicorn workers spawn a background thread that would report some periodic healthcheck data. The idea was that the healthcheck results would be identical for all workers, so we only needed to report the data once per Unicorn master process. But we didn’t want to run a reporting thread in the master process, since forking a multithreaded process is generally discouraged. (See for example Thorsten Ball’s Why Threads Can’t Fork, rachelbythebay’s Don’t mix threads and forks, or more recently byroot’s Why does everyone hate fork?).
As the fork(2) manpage explains, when you fork, every thread except the one that called fork() dies and is not resumed in the child process:
- The child process is created with a single thread—the one that called fork(). The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause.
- After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2).
Stopping all background threads when forking might be what you want, in some cases, but it can leave resources dangling, connections unreleased, and so on, depending on what was happening in the other threads at the time.
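To make that concrete, here is a small standalone Ruby sketch (mine, not from the manpage or our codebase) showing that a background thread started before fork simply isn’t there in the child:

```ruby
# Hypothetical demo: a thread started before fork does not exist in the child.
bg = Thread.new do
  loop do
    sleep 0.5
    puts "background tick from pid #{Process.pid}"
  end
end

child = fork do
  # Only the thread that called fork survives into the child.
  sleep 1
  puts "child #{Process.pid} has #{Thread.list.size} thread(s)"  # => 1
end

sleep 1
puts "parent #{Process.pid} has #{Thread.list.size} thread(s)"   # => 2
Process.wait(child)
bg.kill
```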
Anyway - if we don’t want to report healthcheck data from the master
process, and we want to report it from only one of n worker processes, then this raises an interesting interprocess coordination problem.
How can you guarantee that out of a pool of n workers, exactly one will run a given observability task at any given time? And how can you guarantee that if one worker dies, another will automatically start running the observability task?
It kind of reminds me of Zookeeper - a cluster coordination problem - except that in this case, we aren’t trying to coordinate processes across a whole cluster; we are only trying to coordinate processes within a particular container.
Naive approach
The first thing that occurred to me was this:
1. At boot time, each child process will check for the existence of a file at a standard path (let’s say `/tmp/coordination.pid`).
2. If `/tmp/coordination.pid` is not found, then create it and write the current pid to it. Whichever process does this first is volunteering to run the healthcheck task.
3. If `/tmp/coordination.pid` is already present, then check if a process with that pid is running.
   - If so, then sleep for a while and then check again.
   - If not, then proceed from step 2 as if the file were not found.
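Sketched in (hypothetical) Ruby, with `run_healthcheck_task` standing in for the actual reporting work, the naive approach would look roughly like this:

```ruby
PID_FILE = "/tmp/coordination.pid"

# Signal 0 doesn't deliver anything; it only checks whether the pid exists.
def pid_running?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

loop do
  if File.exist?(PID_FILE)
    other_pid = File.read(PID_FILE).to_i
    if other_pid > 0 && pid_running?(other_pid)
      sleep 30 # another worker has volunteered; poll again later
      next
    end
  end
  # Claim the role by writing our own pid. Note the race: another worker can
  # do the same thing between the existence check above and this write.
  File.write(PID_FILE, Process.pid.to_s)
  run_healthcheck_task # hypothetical placeholder for the reporting thread's work
  break
end
```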
Problems with this approach:
- There is some chance of a race condition in between steps 1 and 2, wherein two processes simultaneously find that `/tmp/coordination.pid` is absent and then each try to write their pids to the same path. The chances of this could perhaps be mitigated by waiting for a random interval before attempting step 1.
- For the numerous workers that are sleeping, it’s inefficient that they have to wake up every so often to recheck step 3. This imposes a pointless polling cost.
My colleague Dmytro suggested that we use `flock` instead, which essentially delegates the whole coordination problem to the operating system and solves both of these problems.
I had never heard of it before.
Flock(2)
I found `flock` hard to learn about. There are manpages (flock(2)) and Hacker News discussions, but they don’t cover the set of use cases for file locking very clearly. I think the core use case is “several processes want to write to the same shared file and need to cooperate with each other.”
In any case, it is a system call that comes with some caveats. The first two I found:
- The relationship between file descriptors and file locks is slightly confusing to me in the context of forks (see the short demonstration after this list).
- Per this discussion, `flock` is handled poorly over NFS, although I don’t think that edge case is very relevant to our Kubernetes cluster. Fortunately, the edge case pertains to how flock is handled differently for processes running on the NFS server itself than for NFS clients, so even if our ops team started to run NFS for our web workers without telling me, the edge case would not affect us.
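Here is the demonstration I mean, a hypothetical sketch (the path `/tmp/flock-demo.lock` is just an example) showing that a flock lock belongs to the open file description rather than to the process: a child forked while the parent holds the lock shares it through the inherited descriptor, while a fresh open of the same path does not.

```ruby
File.open("/tmp/flock-demo.lock", File::RDWR | File::CREAT, 0644) do |f|
  f.flock(File::LOCK_EX)

  pid = fork do
    # The child inherited f; it refers to the same open file description,
    # so as far as the kernel is concerned the child already holds the lock.
    puts "child re-flock on inherited fd: #{f.flock(File::LOCK_EX | File::LOCK_NB).inspect}" # => 0
    # A fresh open of the same path is a new open file description, so a
    # non-blocking lock attempt fails while the parent still holds the lock.
    File.open("/tmp/flock-demo.lock") do |g|
      puts "child lock via new fd: #{g.flock(File::LOCK_EX | File::LOCK_NB).inspect}" # => false
    end
  end
  Process.wait(pid)
end
```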
In any event, `flock` can nicely be used to coordinate only-once semantics among a set of worker processes. The way it works for our use case is this:
- Open a file at a standard path.
- Attempt to acquire an exclusive lock (`LOCK_EX`) on the file. Use the blocking form of `flock` that just blocks the caller until the lock can be acquired.
- If you acquire the lock, then you can go ahead and run your instrumentation task, or whatever only-once activity you want to conduct.
- When the process that is currently holding the lock eventually exits or is killed, the operating system will automatically wake up the next process in line and give it the lock. You never have to poll, in this approach.
Ruby implementation
Ruby provides a standard (though platform-dependent) interface to `flock`, available at `File#flock`.
One can write an implementation roughly like this in a Unicorn configuration file:
```ruby
TMP_FILE_PATH = "/tmp/coordination.pid"

after_fork do |server, worker|
  Thread.new do
    File.open(TMP_FILE_PATH, File::RDWR | File::CREAT, 0644) do |f|
      f.flock(File::LOCK_EX) # will block indefinitely if the lock is not acquired
      # now run whatever background task you want here, such as reporting system health.
    end
  end
end
```
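As a side note, a non-blocking probe (`LOCK_NB`) can be handy for checking, say from a console, whether some worker currently holds the lock. This is just a debugging sketch of mine, not part of the Unicorn config above:

```ruby
File.open("/tmp/coordination.pid", File::RDWR | File::CREAT, 0644) do |f|
  if f.flock(File::LOCK_EX | File::LOCK_NB)
    # We got the lock, which means no worker was holding it; release it again.
    f.flock(File::LOCK_UN)
    puts "no process is holding the lock"
  else
    puts "another process is holding the lock"
  end
end
```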
So far, this has worked quite well for us, and it seems likely to be much more robust than any DIY solution I could have come up with.
Further reading
- kernel source > fs/locks.c
- interesting kernel commit that introduces a tree of dependent requests for a given file lock
(Standard disclaimer: I am absolutely not an expert on the Linux kernel, although I do enjoy trying to read the source code from time to time.)