We had an interesting case for performance analysis this week at work. Our staging server started to display a strange pattern of intermittent slow responses.

It was an unusual response pattern because it wasn’t localized to certain endpoints, nor was it a general system slowdown. Overall resource metrics like CPU usage and memory looked fine across all the components of our system. However, the slowness was scattered randomly across almost all of our service endpoints.

The problem manifested as a very particular 5-second delay, intermittently added to the server response time. So instead of having the usual response time graph — where most of the endpoints are fast and then there is a long, slightly random tail of slow responses — in this case we were looking at our regular response times, mixed together with bursts of requests that took about 5 seconds.

The response time logs looked about like this:

GET / 201ms
GET / 130ms
GET / 86ms
GET / 160ms
GET / 5102ms
GET / 5073ms
GET / 95ms
GET / 188ms

I’m just making up these exact numbers, but they give you the general pattern.

The affected endpoints included everything except the service health check endpoint. Upon investigation, it turned out that the health check uses a different controller base class from all the other endpoints. That was a useful finding, because it allowed us to narrow the problem down roughly to “functionality included in our standard controller base class.”
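
To make that concrete, the controller layout was shaped roughly like this (a hypothetical sketch with made-up controller names, not our real code):

# Every normal endpoint inherits the shared base class, which pulls in
# whatever per-request functionality lives there.
class ApplicationController < ActionController::Base
end

class OrdersController < ApplicationController
  def index
    # ...
  end
end

# The health check deliberately skips all of that.
class HealthCheckController < ActionController::Base
  def show
    head :ok
  end
end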

Early investigation steps

We checked out everything we could think of and couldn’t find a root cause. Not even a good clue. Lots of dead ends.

So.

What do you do when you’ve checked all the usual observability tools and found nothing useful?

Automated reproduction of the bug

After a day of getting nowhere, I wrote a quick concurrent load test that attempted to reproduce the problem.

It looked roughly like this:

#!/usr/bin/env ruby

require "net/http"

# Invoke with ruby test.rb "https://path.to.server/endpoint" 20 10
# where 20 is the number of threads and 10 is the number of requests per thread

url = URI(ARGV[0])
num_threads = ARGV[1].to_i
requests_per_thread = ARGV[2].to_i

results = []
mutex = Mutex.new # guards concurrent appends to the shared results array

# Sends a single GET request and returns its start time, status code, and duration
def send_request(url)
  start = Time.now
  print "."
  code = Net::HTTP.get_response(url).code

  { start_time: start, status: code, duration: Time.now - start }
end

threads = num_threads.times.map do
  Thread.new do
    requests_per_thread.times do
      result = send_request(url)
      mutex.synchronize { results << result }
    end
  end
end

threads.each(&:join)

puts "\nSent #{num_threads * requests_per_thread} requests to #{url}"
puts "  #{num_threads} threads x #{requests_per_thread} requests per thread"

results.sort_by { _1[:start_time] }.each do |r|
  puts "%s: %.2f ms - HTTP %d" % [
    r[:start_time].strftime("%H:%M:%S.%L"),
    r[:duration] * 1000,
    r[:status]
  ]
end

(Disclaimer - not meant to be production-ready code, just a quick script to find a problem.)

The script instantly reproduced the problem, and gave us some quantitative metrics about its extent, which was super useful.

I ran it a bunch of times against different endpoints, with varying thread counts and requests per thread. I think the smallest useful test run was 50-75 requests (5 threads), and the largest batch was 4000 requests (200 threads).
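
The per-request output was enough to see the extent of the problem, but if you want the script to do the counting, a few extra lines like these (not in the original script, and the 4-second cutoff is arbitrary) would summarize the slow tail:

# Rough summary of the slow tail - append after the per-request output above.
durations_ms = results.map { _1[:duration] * 1000 }.sort
slow_count = durations_ms.count { _1 > 4000 } # arbitrary cutoff, just below the ~5s bursts

p50 = durations_ms[durations_ms.size / 2]
p95 = durations_ms[(durations_ms.size * 0.95).floor]

puts "p50: %.0f ms, p95: %.0f ms, over 4s: %d of %d" % [p50, p95, slow_count, durations_ms.size]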

The two initial findings from the script were:

Later on, as we tried different things, I also used the load test script to check “Did this change make things better or worse? Is the problem still happening?” We tried a few infrastructure changes, like increasing the number of web workers and changing some Karpenter node scaling configuration. They didn’t help, or at least not much.

Profiling your code

Unfortunately, being able to reproduce the problem still wasn’t enough to pinpoint the root cause. And the staging system’s users were eager for answers.

I tried to reproduce the problem in a few other dev/test environments. I could not reproduce it on any environment besides staging.

Eventually, I started to want a more detailed profile of what gets executed in our controller base class (ApplicationController in Rails terminology). I set up some profilers locally (not on staging) and ran them against our server (in particular, I ran ruby-prof and stackprof). The local profiler data helped me get a sense of what normal operation looks like, but, of course, it didn’t find the root cause of the staging issue.
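
I won’t go into the exact local setup, but wrapping a suspect chunk of code with ruby-prof looks roughly like this (the profiled call is a hypothetical stand-in for whatever you want to measure):

require "ruby-prof"

# Profile a block of code and print a flat report of where the time went.
result = RubyProf.profile do
  SomeBaseClassConcern.new.run # hypothetical stand-in
end

RubyProf::FlatPrinter.new(result).print($stdout, min_percent: 1)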

It started to seem like the only thing that would help was to run the profiler on the staging system. We consider our staging system to be production-like, so we don’t normally have those kinds of tools available. We don’t usually add debugging output to our staging deployments, either.

I could have gone down that route. I would probably have added StackProf::Middleware to the Rack stack, redeployed, re-run the load test, and dug through the output.
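
That would have looked something like the following in the Rails config (the settings here are illustrative, not tuned values):

# In config/application.rb (or an environment file):
# sample every request in wall-clock mode and dump profiles to tmp/.
config.middleware.use(
  StackProf::Middleware,
  enabled: true,
  mode: :wall, # wall-clock, since the delay looked like waiting rather than CPU work
  interval: 1000, # sample every 1000 microseconds
  save_every: 5 # write a dump file every 5 requests
)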

But instead I found something better.

Profiling your code in prod

I remembered that Julia Evans had worked on a Ruby profiler a few years ago. This led me to the fantastic tool she built, rbspy.

What’s great about rbspy is that you don’t have to change anything about your application itself. You just run rbspy, tell it the PID of your chosen process, and let it profile your service for a while. You re-run your load test to reproduce the problem while the profiler is running. Then you inspect the results.
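
The whole thing is a one-liner, roughly like this (the PID, duration, and output path are placeholders, and you usually need root to attach to another process):

sudo rbspy record --pid 12345 --duration 120 --format flamegraph --file /tmp/worker-profile.svg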

The first time I tried to run rbspy, it crashed the web worker container by running out of memory.

But we increased system resources and I tried again.

All bugs are shallow with the right data

The first rbspy profiler run spotted the problem instantly. (It outputs results as a huge SVG flame graph that you can click around in.)

The random slowness came from getaddrinfo, the libc function that does DNS lookups. The call site was inside our Postgres database connection routine, which is in a library that isn’t covered by our standard instrumentation tooling.

Some of the time, looking up the database hostname was taking ~5 seconds. Our application does a lot of per-request database connection switching (because reasons), so the problem could have affected almost any endpoint. Except the health check.
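
If you want to sanity-check a theory like this from a console on the affected box, you can time the lookup directly; the hostname below is a placeholder:

require "socket"

# Exercise the same libc getaddrinfo call that showed up in the profile.
10.times do
  start = Time.now
  Addrinfo.getaddrinfo("db.staging.internal", 5432) # placeholder hostname
  puts "getaddrinfo took %.0f ms" % ((Time.now - start) * 1000)
end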

I told this to my colleague in DevOps, and he remembered that they had recently changed the hosting setup for our local DNS service (CoreDNS). They rolled back the change and the 5-second delays went away instantly.

Lessons learned