One of the things we did at work, after our strange performance issue in September, was to add some instrumentation to DNS lookups in Ruby.
As I mentioned last time, it isn’t easy to get performance data directly from getaddrinfo in Linux, which is the libc function that does DNS lookups. You would have to run a local DNS proxy service and do the logging from there, which isn’t the recommended configuration for our Kubernetes-based system, as far as I understand.
Indirectly, this also means that you can’t easily instrument DNS lookups from C extensions in Ruby, if they call getaddrinfo directly. It probably is not worthwhile to fork every library you use and add your own custom instrumentation code.
However, some libraries with C extensions still do their DNS lookups in Ruby, since Ruby 3.1 added helpful Fiber-based concurrency handling for address lookups. Fortunately, this happens to include our database adapter, pg.
This means that we can add helpful instrumentation directly in Ruby. It’s easy to measure all DNS lookups that use Addrinfo.getaddrinfo.
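As a quick illustration of the measurement primitive, here is how you might time a single lookup by hand in irb (a throwaway sketch, not part of our actual patch):

require "socket"

# Time one lookup with a monotonic clock; the result is in seconds.
start_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
Addrinfo.getaddrinfo("www.decasia.org", 80)
puts Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_at

The instrumentation below is essentially this measurement, wrapped around every call.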
Sampling methodology
Methodologically speaking, if you are trying to instrument your DNS lookup calls, it doesn’t necessarily matter whether you capture every request. Even if you can manage to do that, you might not want to, as instrumentation has a runtime (and resource) cost. Particularly if you have a high volume of DNS lookups, you might want to use a random sampling approach, and only instrument X percent of DNS requests.
The approach I came up with was this:
- Instrument a small percentage of all DNS lookups (like 1%).
- Also, instrument all DNS lookups that are slower than a certain threshold (say 150ms, if you are expecting fast local DNS resolution).
If you record those two data streams separately, then you can get a good sense of aggregate performance, plus good coverage of the worst case scenario.
Implementation
Monkey patching the Ruby standard library is something to approach cautiously: it risks breaking shared functionality, and it adds a little latency to every request if the instrumentation layer itself takes any time. With that caveat, here is the patch:
require "socket"
module DNSInstrumentation
SAMPLING_RATE = 0.01
SLOW_QUERY_REPORTING_THRESHOLD = 0.15
def getaddrinfo(...)
start_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = super
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_at
if duration > SLOW_QUERY_REPORTING_THRESHOLD || Kernel.rand < SAMPLING_RATE
instrument(:dns_query, duration)
end
result
end
end
# Apply the patch
# At boot time:
Addrinfo.singleton_class.prepend(DNSInstrumentation) if dns_instrumentation_enabled?
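One quick way to check that the patch took effect, say from a console (this check is a suggestion, not part of the patch itself):

# With prepend, the module sits ahead of Addrinfo's own singleton methods,
# so it shows up in the ancestor chain:
Addrinfo.singleton_class.ancestors.include?(DNSInstrumentation) # => true

Using prepend rather than redefining the method outright also means that super still reaches the original implementation, which is what keeps the wrapper so small.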
(Side note: it drives me slightly crazy that we have to resolve our database server’s IP address for every request. For tenant partitioning reasons we establish a new database connection on every request, and pg looks up the database server’s IP from its hostname every single time… There are ways around this, like pinning tenants to specific web workers or adding application-level DNS caching (sketched below), but we aren’t currently looking into them.)
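To make the caching idea concrete, here is a rough, hypothetical sketch of application-level DNS caching. We have not built or tested anything like this; the module name and TTL are invented, and a real version would want to respect DNS TTLs and handle lookup failures:

require "socket"

# Hypothetical TTL cache in front of Addrinfo.getaddrinfo (illustration only).
module CachedDNS
  TTL_SECONDS = 30
  @cache = {}
  @mutex = Mutex.new

  def self.getaddrinfo(host, port)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @mutex.synchronize do
      entry = @cache[[host, port]]
      if entry.nil? || now - entry[:at] > TTL_SECONDS
        entry = { at: now, result: Addrinfo.getaddrinfo(host, port) }
        @cache[[host, port]] = entry
      end
      entry[:result]
    end
  end
end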
Back to the patch itself: suppose your instrument method (defined inside DNSInstrumentation) just prints to stdout:
def instrument(type, duration)
  puts "DNS query took: #{duration}"
end
Then you can see results like this:
irb(main):008> Addrinfo.getaddrinfo("www.decasia.org", 80)
DNS query took: 0.048221999779343605
=>
[#<Addrinfo: [2606:4700:3031::6815:51e0]:80 UDP (www.decasia.org)>,
#<Addrinfo: [2606:4700:3031::6815:51e0]:80 TCP (www.decasia.org)>,
#<Addrinfo: [2606:4700:3037::ac43:a564]:80 UDP (www.decasia.org)>,
#<Addrinfo: [2606:4700:3037::ac43:a564]:80 TCP (www.decasia.org)>,
#<Addrinfo: 172.67.165.100:80 UDP (www.decasia.org)>,
#<Addrinfo: 172.67.165.100:80 TCP (www.decasia.org)>,
#<Addrinfo: 104.21.81.224:80 UDP (www.decasia.org)>,
#<Addrinfo: 104.21.81.224:80 TCP (www.decasia.org)>]
(48ms is slow, but the second time I look up the same hostname, it takes only 1.7ms, presumably because macOS, unlike Linux, runs a local DNS cache by default.)
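In production you would presumably hand the measurement to a metrics pipeline instead of printing it. As a sketch, in a Rails app you might publish it through ActiveSupport::Notifications; the event naming here is invented, and MyMetrics is a placeholder for whatever metrics client you actually use:

# Publish an event per measurement instead of printing.
def instrument(type, duration)
  ActiveSupport::Notifications.instrument("#{type}.dns", duration: duration)
end

# Elsewhere (e.g. in an initializer), forward events to the metrics client.
ActiveSupport::Notifications.subscribe(/\.dns\z/) do |name, _start, _finish, _id, payload|
  MyMetrics.timing(name, payload[:duration])
end

Since the patch passes a different type for slow queries than for sampled ones, the two data streams stay distinct all the way through.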
Commentary
There’s nothing very technically remarkable about this instrumentation strategy, except that I didn’t realize we could do this until we did it.
Like a lot of things in software: you aren’t sure in advance whether something is possible, and then when you need to know, you just have to figure it out at the time.
On-demand knowledge, one could call it.