In large technical systems, there’s a pervasive tension between separation and connection. What I mean by this is:
- There are architectural advantages to having strong walls between things.
- However, there are also architectural advantages to having tight connections between things.
These two things often push in opposite directions, making system design awkward, or at least delicate.
Let’s consider a few obvious examples.
Tenant silos, or horizontal boundaries
Suppose we are designing a multitenanted system. We might need to go in one of two contrary directions:
- Put every tenant in its own technical silo, keep everything highly separate. Use separate databases, separate workers, separate infrastructure for each tenant.
- or: Build shared multi-tenant systems that cross those boundaries and work with data across tenants. Share resources (which might reduce infrastructure costs).
Silos can be good for security, for privacy, and possibly for operational flexibility. Maybe we want to allocate more resources for certain tenants. Maybe we want to define totally separate user accounts for every tenant.
But silos can also be inefficient and operationally frustrating. Suppose we want to let all the tenants share certain resources; with hard silos, that's impossible.
Or suppose we want to know “across all tenants, how many records of type X do we have?” It sucks when the answer is “sorry, we don’t even have tooling to find that out right now.”
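For a sense of what that asymmetry looks like in practice, here is a sketch in Ruby. The model, tenant list, and database names are all hypothetical; the point is only that the connected design answers the question with one query, while the siloed design needs a loop that someone has to have built.

```ruby
require "active_record"

# Hypothetical model and tenant list, just to make the contrast concrete.
class Record < ActiveRecord::Base; end

TENANT_DATABASE_CONFIGS = {
  "acme"   => { adapter: "postgresql", database: "acme_production" },
  "globex" => { adapter: "postgresql", database: "globex_production" },
  # ... one entry per tenant
}

# Connected (shared-schema) design: all tenants live in one database,
# so the cross-tenant question is a single query:
#
#   total = Record.count

# Separated (siloed) design: one database per tenant, so the same question
# means visiting every silo in turn -- and this loop only exists if someone
# has built the tooling for it.
total = TENANT_DATABASE_CONFIGS.sum do |_tenant, config|
  ActiveRecord::Base.establish_connection(config)
  Record.count
end

puts "Records of type X across all tenants: #{total}"
```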
Or consider the costs of releasing updates - do you want deployment to be O(1) or O(N) for N tenants? Do you want to run database migrations 500 times or just once?
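As a hedged sketch of the O(N) version (the tenant names and environment variables here are made up):

```ruby
# O(N) releases for the siloed design: every schema change has to be applied
# once per tenant database. The tenant list and env var names are hypothetical.
TENANT_DATABASE_URLS = {
  "acme"   => ENV.fetch("ACME_DATABASE_URL"),
  "globex" => ENV.fetch("GLOBEX_DATABASE_URL"),
  # ... 500 tenants later ...
}

TENANT_DATABASE_URLS.each do |tenant, url|
  puts "Migrating #{tenant}..."
  # Point the standard Rails migration task at this tenant's database.
  migrated = system({ "DATABASE_URL" => url }, "bin/rails", "db:migrate")
  abort("Migration failed for #{tenant}") unless migrated
end

# In the connected (shared) design, the same change is one
# `bin/rails db:migrate` against a single database.
```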
Or consider the problem of users with access to multiple tenants — do you want them to have to authenticate one time or N times to get access to N tenants?
Of course, you can always try to hack your way through this problem space, and get some of the benefits of both approaches. You can build approach 1 (premised on separation) and then add some hacks — ahem, I mean extensions — that get you some of the benefits of approach 2 (premised on connection). Or vice versa.
Web stack components, or vertical boundaries
Here’s a second example. Suppose you have a web stack that is organized into different layers like this:
- External load balancer
- NGINX
- Ruby application server
- Rack middleware stack
- Ruby on Rails application
Requests normally start at the first layer and, assuming everything is OK, get passed down to the last layer. Your application layer then handles the request and sends a response back up through all the other layers to the client.
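For concreteness, here is a minimal sketch of what the bottom two layers can look like as a Rack config.ru; the middleware here is a do-nothing placeholder, and real stacks chain many of them.

```ruby
# config.ru -- a minimal sketch of layers 4 and 5. The load balancer, NGINX,
# and the application server sit above this file; the app server loads it and
# hands each request to the middleware stack, which hands it to the app.
require "rack"

class PassthroughMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Sees the request on the way down and the response on the way back up.
    @app.call(env)
  end
end

use PassthroughMiddleware

# Stand-in for the Rails application at the bottom of the stack.
run ->(env) { [200, { "content-type" => "text/plain" }, ["hello from the application layer"]] }
```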
This scenario also has a design space with a tradeoff between separation and connection, though not exactly the same kind as the horizontal tenant sharding case.
Essentially: Everything is cleaner and simpler when different layers of this stack are indifferent to each other. Everything is swappable. There are no hard dependencies. You want to change load balancers? Go ahead, it doesn’t affect anything in the application server.
But. When the architectural separation between layers is too absolute, certain use cases become impossible.
A use case we've looked at recently involves performance logging. Our default observability stack runs at the application level; it only kicks in once a request has already made it through all the layers above it.
This means it has blind spots.
- If a request never makes it to the application, it is invisible.
- If the application hits a request timeout, the observability tools crash with it, and you don’t get logs of what happened.
- If you want to know how long a request took, you can only start measuring from the moment the observability tools kick in. What if you want to measure the time starting at layer 1 all the way down to layer 5?
To fix these problems, you have to build observability tooling that crosses layers of the web stack.
It’s great - now you have more information.
But you also have more dependencies, more tight coupling, more complex (often ad hoc) contracts, and thus more brittleness.
Suppose we started setting a custom request header at the load balancer, to measure total request time.
Now if you switch to a new load balancer, you have to remember to migrate your custom header, or the custom observability tooling will break. You have a new constraint. You've moved away from total separation towards greater connection between components. At your peril.
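As a sketch of what that coupling might look like at the Rack layer (the X-Request-Start header name and its milliseconds-since-epoch format are assumptions borrowed from a common convention, not a description of any particular setup):

```ruby
require "logger"

# Rack middleware that depends on an upstream layer (load balancer or NGINX)
# setting a hypothetical X-Request-Start header containing a Unix timestamp
# in milliseconds. If the layer above stops setting the header -- say, after
# a load balancer migration -- the cross-layer timing silently disappears.
class RequestTimingMiddleware
  def initialize(app, logger: Logger.new($stdout))
    @app = app
    @logger = logger
  end

  def call(env)
    arrived_at_app_ms = (Time.now.to_f * 1000).round
    started_at_edge_ms = env["HTTP_X_REQUEST_START"]&.to_i

    status, headers, body = @app.call(env)

    if started_at_edge_ms && started_at_edge_ms > 0
      total_ms   = (Time.now.to_f * 1000).round - started_at_edge_ms
      pre_app_ms = arrived_at_app_ms - started_at_edge_ms
      @logger.info("request status=#{status} total=#{total_ms}ms pre_app=#{pre_app_ms}ms")
    end

    [status, headers, body]
  end
end
```

The middleware only keeps working for as long as the layer above it honors that informal contract.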
Every way you go, there are perils.
Discussion
Needless to say, architectural purism is rarely optimal. You go in one direction and then tack back in the other. You want to satisfy competing constraints to the extent you can.
But the fact that we often end up in some middle zone of this design space doesn’t mean that there aren’t competing pressures in opposite directions. In this sense, large technical systems rarely reach a stable equilibrium in their own design space. Rather, they exhibit a moving set of tradeoffs that can shift suddenly under your feet.
This problem space is closely related to the better-known problem of “standardization vs autonomy,” which is also a major tradeoff space. But if you think carefully about it, the two are orthogonal axes.
                    autonomy
                        ^
                        |
    separation <========+========> connection
                        |
                        v
                 standardization
All your teams can align on a model of tenant silos - that’s standardizing on separation. Or some of your teams can build global systems and some can build tenant systems - that’s more autonomous, but it’s orthogonal to whether systems are more or less connected to each other.
Coda: Leaky abstractions
I keep thinking this is related to the problem of leaky abstractions, though I haven’t quite found the words to talk about this. Every time we design for “separation,” we are always also designing interfaces for connecting across our silos, our layers. These interfaces somehow are never good enough, and then we are tempted to realign the architecture, or at least the interfaces, to plug the gaps in our system.
In any case, I constantly see people building “tenanted silos” and then, a few years later, going back towards “global services,” or vice versa. Like we’re all caught in some giant pendulum of unstable requirements.