Another short post with a surprising thing to learn about Ruby.
We had a threaded service that called require on a certain directory at runtime. The service was expected to run on only one thread at a time.
My mental model of this was:
require is basically a no-op once a file is already loaded, so it doesn’t cost much to run require each time the threaded service executes.
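That mental model holds on a single thread. As a minimal sketch (the file name `noop_demo.rb`, the global `$noop_demo_loads`, and the helper `require_twice` are all made up for this demo), requiring the same file twice only evaluates it once, and the second call just returns false:

```ruby
require "tmpdir"

# Requires a throwaway file twice and reports what each call returned.
# The file name and the global counter are invented for this demo.
def require_twice
  Dir.mktmpdir do |dir|
    path = File.join(dir, "noop_demo.rb")
    # The file bumps a counter every time its body is actually evaluated.
    File.write(path, "$noop_demo_loads = ($noop_demo_loads || 0) + 1\n")

    first  = require(path) # true: the file body runs
    second = require(path) # false: already loaded, body is skipped
    [first, second, $noop_demo_loads]
  end
end

p require_twice
```

The cheap path only applies once the file has finished loading. While a load is still in progress on another thread, a second require of the same file has to wait for it.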
One day, due to some weirdness in a testing environment, this threaded service ran on multiple threads simultaneously.
Our Sidekiq worker came to a halt.
My colleague Dmytro found out that, if you try to require the same directory from multiple threads simultaneously, the process can deadlock.
Reproducing
A simple reproduction looks like this:
# alpha.rb
puts "[Thread A] Inside alpha.rb, sleeping before requiring beta..."
sleep 1
require_relative 'beta'
puts "[Thread A] alpha.rb finished."
# beta.rb
puts "[Thread B] Inside beta.rb, sleeping before requiring alpha..."
sleep 1
require_relative 'alpha'
puts "[Thread B] beta.rb finished."
# experiment.rb
puts "--- Starting Deadlock Reproduction ---"

t1 = Thread.new do
  puts "Thread 1: Requiring 'alpha'..."
  require_relative 'alpha'
end

t2 = Thread.new do
  puts "Thread 2: Requiring 'beta'..."
  require_relative 'beta'
end

[t1, t2].each(&:join)
puts "This will never print because of the deadlock."
This prints the following when I test it:
--- Starting Deadlock Reproduction ---
Thread 1: Requiring 'alpha'...
[Thread A] Inside alpha.rb, sleeping before requiring beta...
Thread 2: Requiring 'beta'...
[Thread B] Inside beta.rb, sleeping before requiring alpha...
experiment.rb:13:in 'Thread#join': No live threads left. Deadlock? (fatal)
3 threads, 3 sleeps current:0x0000000aa933b400 main thread:0x00000001030b86e0
* #<Thread:0x00000001023e8d18 sleep_forever>
rb_thread_t:0x00000001030b86e0 native:0x0000000200c23080 int:0
experiment.rb:13:in 'Thread#join'
experiment.rb:13:in 'Array#each'
experiment.rb:13:in '<main>'
* #<Thread:0x0000000121df67c8 experiment.rb:3 sleep_forever>
rb_thread_t:0x0000000aa933b200 native:0x000000016dcbf000 int:0 mutex:3 cond:1
depended by: tb_thread_id:0x00000001030b86e0
/Users/eli/scratch/require-deadlock/alpha.rb:3:in 'Kernel#require_relative'
/Users/eli/scratch/require-deadlock/alpha.rb:3:in '<top (required)>'
experiment.rb:5:in 'Kernel#require_relative'
experiment.rb:5:in 'block in <main>'
* #<Thread:0x0000000121df66b0 experiment.rb:8 sleep_forever>
rb_thread_t:0x0000000aa933b400 native:0x000000016ddcb000 int:0
/Users/eli/scratch/require-deadlock/beta.rb:3:in 'Kernel#require_relative'
/Users/eli/scratch/require-deadlock/beta.rb:3:in '<top (required)>'
experiment.rb:10:in 'Kernel#require_relative'
experiment.rb:10:in 'block in <main>'
from experiment.rb:13:in 'Array#each'
from experiment.rb:13:in '<main>'
Why does this happen?
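Short answer: there is a lock inside require, and multiple threads can deadlock when they contend to require the same files. In the reproduction, thread 1 is in the middle of loading alpha and waits for beta, while thread 2 is in the middle of loading beta and waits for alpha, so neither can finish.
Long answer: I have not had time to look into the Ruby internals around this, but the Ruby Hacking Guide does describe some of the mechanisms.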
From Chapter 18, Loading:
The problem comes after. Like the comment says “the loading of Ruby programs is serialised”. In other words, a file can only be loaded from one thread, and if during the loading another thread tries to load the same file, that thread will wait for the first loading to be finished […]
The process to enter the waiting state is simple. A st_table is created in loading_tbl, the association “feature=>waiting thread” is recorded in it. curr_thread is in eval.c’s functions, its value is the current running thread.
The mechanism to enter the waiting state is very simple. A st_table is created in the loading_tbl global variable, and a “feature=>loading thread” association is created. curr_thread is a variable from eval.c, and its value is the currently running thread. That makes an exclusive lock. And in rb_feature_p(), we wait for the loading thread to end like the following.
When rb_thread_schedule() is called, the control is transferred to an other thread, and this function only returns after the control returned back to the thread where it was called. When the file name disappears from loading_tbl, the loading is finished so the function can end. The curr_thread check is not to lock itself (figure 1).
(I have not checked if this is still 100% accurate; the Ruby Hacking Guide is old.)
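The serialisation the guide describes can be modelled in a few lines of plain Ruby. This is a toy model only, not how Ruby implements require: `ToyLoader`, `load_once`, and `toy_loader_demo` are invented names, and a Mutex per feature stands in for the loading table. It shows the behaviour from the quote: a second thread asking for a feature that is mid-load waits, then finds the work already done.

```ruby
# Toy model only: this is NOT Ruby's actual require implementation.
class ToyLoader
  def initialize
    @table_lock = Mutex.new
    @locks  = {} # feature name => Mutex held while that feature loads
    @loaded = {} # feature name => true once its body has finished
  end

  # Runs the block at most once per feature. Returns true if this call
  # ran it, false if the feature was already loaded (mirroring require).
  # Unlike the real thing, a nested call for the same feature on the
  # same thread would raise ThreadError, because Mutex is not reentrant.
  def load_once(feature)
    lock = @table_lock.synchronize { @locks[feature] ||= Mutex.new }
    lock.synchronize do
      # A thread that had to wait for the lock finds the work already done.
      return false if @table_lock.synchronize { @loaded[feature] }
      yield
      @table_lock.synchronize { @loaded[feature] = true }
      true
    end
  end
end

# Two threads ask for the same feature at the same time.
def toy_loader_demo
  loader = ToyLoader.new
  runs = 0
  threads = 2.times.map do
    Thread.new do
      loader.load_once("alpha") do
        sleep 0.1 # pretend the file takes a while to evaluate
        runs += 1 # only ever executed while holding alpha's lock
      end
    end
  end
  [threads.map(&:value), runs]
end

results, runs = toy_loader_demo
puts "return values: #{results.inspect}, body ran #{runs} time(s)"
```

With one lock per feature, two threads that each hold one feature's lock and then ask for the other's end up waiting on each other forever. That matches the shape of the alpha and beta reproduction.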
Fix
We changed our code to avoid dynamic require calls at runtime. Instead, we load the necessary code at startup.
(It is code that does not need to be loaded in all scenarios, so that’s why it was not loaded by default in the first place. It is pointless to load code when it isn’t needed; but then the hard part is to reliably know when it is needed.)
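Loading at start time could look something like the sketch below. It is not our actual code: `eager_require_dir` and the `optional_plugins` directory are hypothetical names. It assumes it is called once during boot, before any worker threads exist, so every require happens on a single thread.

```ruby
# Requires every .rb file directly inside `dir`, in a stable (sorted) order,
# and returns the list of paths it went through.
# Assumption: called once during boot, before worker threads are started.
def eager_require_dir(dir)
  Dir.glob(File.join(dir, "*.rb")).sort.map do |path|
    require File.expand_path(path) # no-op for files that are already loaded
    path
  end
end

# Hypothetical usage in a boot file:
# eager_require_dir(File.join(__dir__, "optional_plugins"))
```

Once everything is loaded up front, later require calls from worker threads return immediately without evaluating anything, so there is no in-progress load left for another thread to wait on.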
Anyway, the point is, I just had no idea that require had an underlying lock. It’s hard to plan around things you don’t know exist.