We had an interesting case for performance analysis this week at work. Our staging server started to display a strange pattern of intermittent slow responses.

It was an unusual response pattern because it wasn’t localized to certain endpoints, nor was it a general system slowdown. Overall resource metrics like CPU usage and memory looked fine across all the components of our system. However, the slowness was scattered randomly across almost all of our service endpoints.

The problem manifested as a very particular 5-second delay, intermittently added to the server response time. So instead of having the usual response time graph — where most of the endpoints are fast and then there is a long, slightly random tail of slow responses — in this case we were looking at our regular response times, mixed together with bursts of requests that took about 5 seconds.

The response time logs looked about like this:

GET / 201ms
GET / 130ms
GET / 86ms
GET / 160ms
GET / 5102ms
GET / 5073ms
GET / 95ms
GET / 188ms

I’m just making up these exact numbers, but they give you the general pattern.

The affected endpoints included everything except the service health check endpoint. Upon investigation, it turned out that the health check uses a different controller base class from all the other endpoints. That was a useful finding, because it allowed us to narrow the problem down roughly to “functionality included in our standard controller base class.”
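
To make that concrete, the controller layout was shaped roughly like this (a hypothetical sketch with made-up controller names, not our real code):

# Every normal endpoint inherits the shared base class, which pulls in
# whatever per-request functionality lives there.
class ApplicationController < ActionController::Base
end

class OrdersController < ApplicationController
  def index
    # ...
  end
end

# The health check deliberately skips all of that.
class HealthCheckController < ActionController::Base
  def show
    head :ok
  end
end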

Early investigation steps

We checked out everything we could think of and couldn’t find a root cause. Not even a good clue. Lots of dead ends.

So.

What do you do when you’ve checked all the usual observability tools and found nothing useful?

Automated reproduction of the bug

After a day of getting nowhere, I wrote a quick concurrent load test that attempted to reproduce the problem.

It looked roughly like this:

#!/usr/bin/env ruby

require "net/http"

# Invoke with ruby test.rb "https://path.to.server/endpoint" 20 10
# where 20 is the number of threads and 10 is the number of requests per thread

url = URI(ARGV[0])
num_threads = ARGV[1].to_i
requests_per_thread = ARGV[2].to_i

results = []
mutex = Mutex.new # guards concurrent appends to the shared results array

# Sends a single GET request and returns its start time, status code, and duration
def send_request(url)
  start = Time.now
  print "."
  code = Net::HTTP.get_response(url).code

  { start_time: start, status: code, duration: Time.now - start }
end

threads = num_threads.times.map do
  Thread.new do
    requests_per_thread.times do
      result = send_request(url)
      mutex.synchronize { results << result }
    end
  end
end

threads.each(&:join)

puts "\nSent #{num_threads * requests_per_thread} requests to #{url}"
puts "  #{num_threads} threads x #{requests_per_thread} requests per thread"

results.sort_by { _1[:start_time] }.each do |r|
  puts "%s: %.2f ms - HTTP %d" % [
    r[:start_time].strftime("%H:%M:%S.%L"),
    r[:duration] * 1000,
    r[:status]
  ]
end

(Disclaimer - not meant to be production-ready code, just a quick script to find a problem.)

The script instantly reproduced the problem, and gave us some quantitative metrics about its extent, which was super useful.

I ran it a bunch of times against different endpoints, with varying thread counts and requests per thread. I think the smallest useful test run was 50-75 requests (5 threads), and the largest batch was 4000 requests (200 threads).
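
The per-request output was enough to see the extent of the problem, but if you want the script to do the counting, a few extra lines like these (not in the original script, and the 4-second cutoff is arbitrary) would summarize the slow tail:

# Rough summary of the slow tail - append after the per-request output above.
durations_ms = results.map { _1[:duration] * 1000 }.sort
slow_count = durations_ms.count { _1 > 4000 } # arbitrary cutoff, just below the ~5s bursts

p50 = durations_ms[durations_ms.size / 2]
p95 = durations_ms[(durations_ms.size * 0.95).floor]

puts "p50: %.0f ms, p95: %.0f ms, over 4s: %d of %d" % [p50, p95, slow_count, durations_ms.size]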

The two initial findings from the script were:

Later on, as we tried different things, I also used the load test script to check “Did this change make things better or worse? Is the problem still happening?” We tried a few infrastructure changes, like increasing the number of web workers and changing some Karpenter node scaling configuration. They didn’t help, or at least not much.

Profiling your code

Unfortunately, being able to reproduce the problem still wasn’t enough to pinpoint the root cause. And the staging system’s users were eager for answers.

I tried to reproduce the problem in a few other dev/test environments. I could not reproduce it on any environment besides staging.

Eventually, I started to want a more detailed profile of what gets executed in our controller base class (ApplicationController in Rails terminology). I set up some profilers locally (not on staging) and ran them against our server (in particular, I ran ruby-prof and stackprof). The local profiler data helped me get a sense of what normal operation looks like, but, of course, it didn’t find the root cause of the staging issue.
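
I won’t go into the exact local setup, but wrapping a suspect chunk of code with ruby-prof looks roughly like this (the profiled call is a hypothetical stand-in for whatever you want to measure):

require "ruby-prof"

# Profile a block of code and print a flat report of where the time went.
result = RubyProf.profile do
  SomeBaseClassConcern.new.run # hypothetical stand-in
end

RubyProf::FlatPrinter.new(result).print($stdout, min_percent: 1)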

It started to seem like the only thing that would help was to run the profiler on the staging system. We consider our staging system to be production-like, so we don’t normally have those kinds of tools available. We don’t usually add debugging output to our staging deployments, either.

I could have gone down that route. I would probably have added StackProf::Middleware to the Rack stack, redeployed, re-run the load test, and dug through the output.
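
That would have looked something like the following in the Rails config (the settings here are illustrative, not tuned values):

# In config/application.rb (or an environment file):
# sample every request in wall-clock mode and dump profiles to tmp/.
config.middleware.use(
  StackProf::Middleware,
  enabled: true,
  mode: :wall, # wall-clock, since the delay looked like waiting rather than CPU work
  interval: 1000, # sample every 1000 microseconds
  save_every: 5 # write a dump file every 5 requests
)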

But instead I found something better.

Profiling your code in prod

I remembered that Julia Evans had worked on a Ruby profiler a few years ago. This led me to the fantastic tool she built, rbspy.

What’s great about rbspy is that you don’t have to change anything about your application itself. You just run rbspy, tell it the PID of your chosen process, and let it profile your service for a while. You re-run your load test to reproduce the problem while the profiler is running. Then you inspect the results.
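
The whole thing is a one-liner, roughly like this (the PID, duration, and output path are placeholders, and you usually need root to attach to another process):

sudo rbspy record --pid 12345 --duration 120 --format flamegraph --file /tmp/worker-profile.svg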

The first time I tried to run rbspy, it crashed the web worker container by running out of memory.

But we increased system resources and I tried again.

All bugs are shallow with the right data

The first rbspy profiler run spotted the problem instantly. (It outputs results as a huge SVG flame graph that you can click around in.)

The random slowness came from getaddrinfo, the libc function that does DNS lookups. The call site was inside our Postgres database connection routine, which is in a library that isn’t covered by our standard instrumentation tooling.

Some of the time, looking up the database hostname was taking ~5 seconds. Our application does a lot of per-request database connection switching (because reasons), so the problem could have affected almost any endpoint. Except the health check.
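
If you want to sanity-check a theory like this from a console on the affected box, you can time the lookup directly; the hostname below is a placeholder:

require "socket"

# Exercise the same libc getaddrinfo call that showed up in the profile.
10.times do
  start = Time.now
  Addrinfo.getaddrinfo("db.staging.internal", 5432) # placeholder hostname
  puts "getaddrinfo took %.0f ms" % ((Time.now - start) * 1000)
end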

I told this to my colleague in DevOps, and he remembered that they had recently changed the hosting setup for our local DNS service (CoreDNS). They rolled back the change and the 5-second delays went away instantly.

Lessons learned