Thoughts on URL path routing

Oct 7, 2022

URL path routing is one of those things that gets more interesting the longer you think about it.

(This post is geared toward mid-level web developers. It won’t teach anything to the NGINX core developers.)

By “URL path routing,” I mean the part of a web server process that parses incoming HTTP requests, looks at the request path (and the HTTP verb), and ensures that your request is handled by the right handler function for that path.

The first line of an HTTP request looks like this: GET /path/to/something HTTP/1.1

So the question is, how does path/to/something get handled?

(The HTTP 1.1 spec calls this part of the request the request-target. I’ll call it the “request path” here, which is a term widely used in practice.)

The file system is an implicit router

If you look at the most basic, old-school, static web server, path routing is largely invisible.

Every URL path is mapped to a file on a server. You put files into your document root folder, and the web server serves them right back to the users. If your document root is /var/www, then you can put story.html into that folder, and it automatically shows up at website.com/story.html.

In this case, request path routing is almost a one to one map onto a directory tree. Almost one to one, but not completely. There are questions to answer. Options to configure.

The first special case here is this: What do we return if someone requests a directory instead of a file? Here we find our old friend index.html, a convention that you see in a basic NGINX configuration like this:

location / {
  index index.html;
  try_files $uri $uri/ =404;
}

Which tells the NGINX index module that if you request a path ending with a slash, the index.html file beneath that path will be returned. It also says that if you request /folder with no trailing slash, then the server will check if /folder/ (the corresponding directory) is available instead.

I must say that this approach to URL routing is surprisingly effective and scalable, up to a point. You can have folders and subfolders many levels deep. Maybe you can even use symlinks. I’ve seen huge archives (tens of thousands of files?) served up from this sort of configuration.

Back in PHP: a fractal of bad design, Eevee described filesystem routing as amounting to “No Routing,” period. But I think I would say, instead, that the file system itself is also a sort of routing system, which maps file system paths to different physical or virtual storage handlers. In a sense, a static web server just delegates path handling to another routing system.

But file-based routing breaks down as soon as you want to respond to a request with something other than the contents of a file.

Why do we route, anyway?

What if you want to send back to a user the result of an arbitrary function?

(And, let’s say, you don’t want to do anything as silly as wrap it in a Linux kernel module.)

Technically, a path routing layer is not even required for a web server. You wouldn’t need it for an application that handles every request the exact same way. The most basic Rack application just looks like this:

# config.ru - Version 1
run -> (env) { [200, {}, ["The meaning of life is 42\n"]] }

That’s just a single function* that always returns the same response, no matter what the input parameters. You can request any path and this application will return the same output. (*OK OK, technically speaking this is a Ruby stabby lambda, since Ruby doesn’t quite have first-class functions.)

So now what if you want to handle two paths differently?

# config.ru - Version 2
run -> (env) {
  case env["REQUEST_PATH"]
  when "/secret"
    [401, {}, ["The real meaning of life is a secret\n"]]
  else
    [200, {}, ["Officially, the meaning of life is 42\n"]]
  end
}

Now we have a handler function that inspects the input and responds differently depending what path you send it.

OK, but what if you decide that this is an ugly, un-extensible bit of code? What if you refactor this?

I’ve written test code that looks like this:

# config.ru - Version 3
run -> (env) {
  case env["REQUEST_PATH"]
  when "/config"
    config_response()
  when "/info"
    info_response()
  when "/normal"
    normal_response()
  else
    error_response()
  end
}

Now you have a function that just maps between request paths and handler functions.

Congrats, we just reinvented the use case for a routing layer! It’s just a standard way of mapping possible inputs onto handler functions. Instead of handling every request with the same function, a path router gives us a layer of indirection so we can send different requests to different places.

Routing, fundamentally, is a software design pattern, a permutation of what Kwindla Hultman Kramer calls the dispatcher pattern. It addresses the general problem, How do you map a complex input set to an arbitrarily large set of possible handler functions, when a case statement is inadequate? The specific implementations we’ll see here are all just possible solutions to this general question.

You can deduce the need for a routing layer by considering two problems with Version 3.

It’s full of hardcoded string constants and hardcoded handler names. What if we wanted to make this function configurable, so you don’t have to edit the source code every time you change the handler configuration?
How do you make this function as performant as possible?

Routing algorithms

Considering Version 3, the first question is: what is the performance of a big case statement?

Looks like it runs in O(n) where n is the number of possible path handlers. Not the best, especially as n gets larger. And we really want URL routing performance to be optimal, since this is a component of our web server that gets called on every single request, and it’s just overhead, taking time before we can even start generating the right response.

There are various approaches that people use at this point. Here are two common ones:

A routing search tree
A routing hash table

Let’s take a look at the search tree approach first.

Dynamic routing 1: An NGINX location tree

I’ve used NGINX as a general purpose webserver for a while. It’s highly configurable; the application is organized around the concept of different modules and multiple phases of request handling.

Let’s take a quick look at the way that static (non-regex) location routing is handled by the NGINX core http module.

The route handlers you see in something like NGINX are usually only accessible indirectly, in normal use cases. You don’t write a route mapping function like the one we wrote above; its functionality has been abstracted into an operation on a configuration tree. You don’t, directly, write a handler function for specific routes either; you invoke an NGINX module that handles some specific path that might, then, pass along a request to your actual application code. All you need to do as an ordinary developer is to generate the needed configuration data, which tells NGINX which modules should handle which paths, and with which configuration options.

Fun Fact

Technically, it is also possible to write ad hoc handler functions in NGINX configuration, if you really want to do this. You can abuse if and return from the rewrite module to build arbitrary logic:

if ($the_world_is_round) {
  return 200 "Welcome to a heliocentric universe, ${user_name}!";
}
return 500 "Hello ${user_name}, did you still believe in geocentrism?";

There are also modules that implement full-fledged scripting languages inside NGINX, such as njs. You could develop arbitrarily complex code this way, though it’s probably a terrible idea, and I’ve never seen that strategy used in practice. We usually do configuration in NGINX, and then put all our business logic in some other framework upstream of it.

For path routing purposes, each separate configuration node in NGINX is called a Location. A location can be defined as a prefix string, an exact string match, or a regular expression. The general strategy is to find the best possible prefix/exact match first and then test all the regular expressions. (The regular expressions are all tested in the order they appear in the configuration folder; I don’t know how to optimize a search through a list of arbitrary regular expressions.)

NGINX parses its static (non-regex) locations into a ternary search tree with this kind of structure:

         [root element]    <== represented by `/`
                |
                |
               egg
   box -----/   |    \------ house
  /   \         |           /     \
 bat   cat      |        glass    plate
                |
              plant

This would be a root with seven locations nested beneath it (/bat, /box, /cat, /egg, /house, /glass, and /plate), and one child location, /egg/plant.

It’s sorted lexicographically with lower values to the left, higher ones to the right, and child trees (termed inclusive) in the middle. The parent of each subtree is set to the middle element of the relevant subset of location blocks. It all descends from the root location.

Looks like the search time here is is O(log n) for n sibling location blocks, so we can see why they use this structure. The parent-child relations are just a linear search as far as I know; it will collapse into O(n) in the pathological case where you had only one gigantic parent-child structure (say a single folder tree 50 levels deep, with no sibling folders).

Once NGINX finds the correct location block, the location block (and other configuration) will invoke the relevant NGINX modules to build and return a response, handle headers, check authentication, and so on.

(You can learn a lot more about how this works by compiling NGINX with --with-debug and setting the debug log level; it then will report precisely how it searches through the static location tree.)

Dynamic routing 2: A Drupal URL Alias table

Meanwhile, consider an alternative approach that I’ve also seen in the wild: the routing table.

The routing table is most viable if you are using uniform request handling functions with no dynamic path inspection or parent-child directory hierarchy. Suppose your routing table looks like this:

| path       | handler_function | id |
|------------|------------------|----|
| /about     | page             | 1  |
| /company   | page             | 2  |
| /faq       | page             | 3  |
| /contact   | form             | 1  |
| /buy       | form             | 2  |
| /buy/now   | form             | 3  |

You only have two handler functions, page and form, so all your router has to do is build this hash table, use path for the keys, and then call the named type handler function with the given id parameter. Lookup is theoretically O(1), it’s perfect! You could use such an architecture to map URL paths to files, to database rows, to anything with functions and arguments!

What’s the big use case for a routing table?

In short: User contributed path structures. You don’t want end users to have to write a Ruby method. You don’t want to let them anywhere near an NGINX configuration file. You just want to show them a text field: “What path should this page have?”

In that context, a routing table is cheap; it’s safe; and it’s flexible without needing code or infrastructure changes.

Drupal 7 does something like this with a url_alias table (see system schema). This is just a relational database table with an alias column and then some other columns telling the system what to route to. Users get to specify the alias value. If they want it to look like a directory tree, they can just put slashes into the path value. It’s very brittle because end users usually don’t do a good job of maintaining a tree-like structure. (Predictably, there is an additional module you can use to auto-populate these values, helpfully named pathauto.)

This routing table can’t be a hash table in memory, because unlike NGINX, Drupal is a PHP application that doesn’t persist configuration data between requests. So it’s still a O(1) indexed database query, but it’s a database query per request to figure out the right path. I’m pretty sure it’s cached, at least, so it is probably quick when the cache is warm.

The actual schema looks like this:

| pid | alias     | source       | lang |
|-----|-----------|--------------|------|
| 100 | about     | node/1       | en   |
| 101 | home      | node/2       | en   |
| 102 | faq       | node/3       | en   |
| 103 | contact   | form/1       | en   |
| 104 | buy       | form/2       | und  |

With indexes:

index alias_language_pid on ('alias', 'language', 'pid')
index source_language_pid on ('source', 'language', 'pid')

Note that it includes a locale parameter lang (so you can route the same paths differently in different locales). It has bidirectional indexes (you can look up both by alias and by source). And the source field points, curiously, not to a function but to an internal path.

It’s all a bit funky. It turns out that Drupal’s url alias system is bolted on top of a whole separate routing system, the strangely termed “menu” system which handles Drupal’s internal paths. So the url alias table just tells Drupal what system path should handle a given request, and then the menu system figures out how to actually call the right handler (node or form in this case).

You have to register internal URL handlers by providing “configuration” in PHP code like this one for the node module:

  $items['node/%node'] = array(
    'title callback' => 'node_page_title',
    'title arguments' => array(1),
    'page callback' => 'node_page_view',
    'page arguments' => array(1),
    'access callback' => 'node_access',
    'access arguments' => array('view', 1),
  );

Believe it or not, this PHP configuration then gets persisted in yet another database table, menu_router. The full implementation uses 25 columns of settings and arguments in addition to the path spec itself, and it’s not very clearly designed, since “menus” in Drupal are an awkward mashup of a path routing system with UI navigation menu configuration.

I’m not going to go any farther into the weeds here. Drupal 7 is generally agreed to have really ugly technical architecture. I will say that this whole awkward system actually does work in production, and has powered innumerable large websites. I used to help maintain the website for a billion-dollar-a-year home building company built mainly on Drupal 7. It was a huge mess of spaghetti code too, but that wasn’t even the fault of the framework itself. And it lets non-technical users easily generate their own URL path structures. That’s arguably a big core feature for a CMS.

The Daily WTF of routers

Here’s another fun thing I found at a previous place I worked:

A homegrown content management system built on Ruby on Rails where URL routing was really, unusably slow.

It turned out to be implemented with a recursive path lookup function that generated n x m database queries to produce the routing table, where n was the number of pages and m was the depth of the routing tree.

To make matters worse, this was for a multisited system that didn’t assume that the site root started with “/”.

The algorithm was something like this:


class Router
  def route(request_path)
    # Note: This cache worked OK in prod, but was disabled in development:
    routes = Rails.cache.fetch("page_routes") do
      Page.all.map {|page| [page.id, page.path] }
    end

    routes.find { |route| request_path == route }
  end
end

class Page < ActiveRecord::Base
  belongs_to :parent, class_name: "Page", optional: true

  def path(site_id)
    if parent
      # This does a recursive lookup of parent paths,
      # and loading each parent is a separate db call:
      [self.filename, parent.path(site_id)].join("/")
    else
      [self.root_path(site_id), self.filename].join("/")
    end
  end

  # this is also a database call:
  def self.root_path(site_id)
    self.find_by(site_id: site_id, is_site_root: true)
  end
end

In development, the whole routing table was reloaded on every page load. It ended up causing huge delays during development on large sites.

I improved the performance of this by roughly 50% with some basic improvements, like not looking up the root_path again for each route. I then suggested shifting the design to use a Drupal-like routing table in the database. However, I believe they may have since abandoned the whole product, which would have rendered any further architectural improvements irrelevant.

Sometimes the products cease to exist long before they can be improved.

Well, we all know the problems with premature optimization.

Conclusions

Back in the day, I used to understand web servers as being fundamentally document-based. Static Apache website style. Request a path; it matches a document in a directory tree on a disk; and the document gets sent back to you.

But you can learn a lot from decoupling your understanding of a web server that speaks HTTP from the concept of a document. In a more abstract way, you can think of an HTTP request as just being a function call with some input parameters. And one of the parameters just happens to be something we call “a path.” It’s just hard to think of it this way when you start out by staring at the configuration layer of something like NGINX. The “functional” part of it is deeply buried by that point.

What’s more interesting is the huge set of design tradeoffs you can make here. What’s the ratio between convention, configuration, and pure flexibility? How much technical expertise will be needed to create a new route? Will your user want to write a request handler from scratch? Will they want to use a DSL to configure routing without needing to execute absolutely arbitrary code? Or will you make routing that’s so simple that even a nontechnical user can use it, as in a CMS?

You can see here how each framework draws the lines differently, and then it’s just up to the users to work with the constraints — or struggle against them.