A rat’s nest of configuration management

Feb 20, 2025

I decided to clean up our environment type taxonomy a little bit at work. This involved a little journey into configuration management.

Configuration strategies

Obviously, there isn’t just one way to configure your software. A few patterns I’ve noticed lately:

Hardcoded constants in the codebase
Conditional logic in the codebase that dynamically generates configuration values
YAML configuration files with global scope
YAML configuration files scoped by environment type
Direct access to environment variables from application code
Dynamic configuration loading from external configuration services

At a high level, we can distinguish between static configuration and dynamic configuration. (This might be more of a spectrum than a binary distinction; for example, dynamic configuration at initialization time is logically distinct from dynamic configuration that can change at runtime.)

I think it’s pretty reasonable to use static configuration methods like code constants for values you are certain will never change dynamically. But beware: if you write a code constant that says TIMEOUT_SECONDS = 3 and one day the production system urgently needs TIMEOUT_SECONDS = 10, you will wish you had provided a dynamic configuration hook for this value. At the same time, too much configurability is also bad: it inevitably leads to inconsistency across runtime contexts that you don’t want.
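A minimal sketch of such a hook in Ruby, assuming the override arrives through an environment variable of the same name (that choice is mine for illustration; a config service would work just as well). The hardcoded default stays in the code, but production can change the value without a code change:

```ruby
# Hypothetical example: a constant with an escape hatch.
DEFAULT_TIMEOUT_SECONDS = 3

# Reads the override from the given env (ENV by default, or any hash in tests).
# Integer(..., 10) raises ArgumentError on garbage such as "ten", so a bad
# override fails loudly at boot instead of silently turning into 0.
def timeout_seconds(env = ENV)
  Integer(env.fetch("TIMEOUT_SECONDS", DEFAULT_TIMEOUT_SECONDS.to_s), 10)
end

puts timeout_seconds({})                            # falls back to the default
puts timeout_seconds({ "TIMEOUT_SECONDS" => "10" }) # uses the override
```

Passing the env in as an argument is a small design choice that keeps the function testable without mutating the real ENV.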

Well, anyway, here I’m just interested in configuration that’s based on environment type, because what I wanted to do was clean up our environment types.

Rails configuration

Our applications tend to have complicated settings files organized by RAILS_ENV, the minimalist environment taxonomy provided by Rails. Rails ships with three environment types (development, test, and production) and lets you add anything else you want, such as staging.

According to the default “Rails way,” you can configure your application by setting environment-specific settings in an environment-specific configuration file located at config/environments/<ENVIRONMENT_NAME>.rb.

In our case, we have a large number of business logic flags and settings that don’t fit neatly into that approach. We handle this case using Rails’ config_for, which provides support for nicely loading a YAML file organized (by default) by RAILS_ENV. The YAML file in question can also evaluate ERB templates, meaning that you can put arbitrary Ruby logic into it.
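To make the mechanics concrete, here is a rough plain-Ruby approximation of what loading such a file involves: render the ERB, parse the YAML, then layer the environment’s section over shared. This is a sketch of the idea, not Rails’ actual implementation (in a Rails app you would simply call Rails.application.config_for(:settings)). The load_settings helper and the sample keys are made up for illustration, and the merge here is shallow.

```ruby
require "erb"
require "yaml"

# Hypothetical helper approximating an environment-scoped settings loader.
def load_settings(yaml_source, env_name)
  rendered = ERB.new(yaml_source).result                 # evaluate <%= ... %> tags first
  all = YAML.safe_load(rendered, aliases: true) || {}    # aliases: true permits &anchor / *alias
  shared = all.fetch("shared", {})
  shared.merge(all.fetch(env_name, {}))                  # the environment section wins over shared
end

SAMPLE = <<~YML
  shared:
    send_mail_from_domain: example.com
    user_rate_limit: 25
  staging:
    user_rate_limit: <%= ENV["USER_RATE_LIMIT"] || 10 %>
YML

settings = load_settings(SAMPLE, "staging")
puts settings["send_mail_from_domain"]
puts settings["user_rate_limit"]
```

The important detail is the ordering: ERB runs before the YAML parser ever sees the file, so whatever the Ruby expression prints has to be valid YAML on its own.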

In practice, we mainly use ERB to read environment variables. Naturally, this ends up as a tangled mess, something like this:

# settings.yml
shared:
  allow_user_registration: true
  send_mail_from_domain: example.com
  cleanup_old_records: <%= ENV.fetch("CLEANUP_OLD_RECORDS", "false") == "true" %>
development:
  allow_user_registration: false
  cleanup_old_records: <%= ENV.fetch("CLEANUP_OLD_RECORDS", "false") == "true" %>
  user_rate_limit: 25
# staging has to appear before qa: a YAML anchor must be defined before it is aliased
staging: &staging
  allow_user_registration: <%= ENV.fetch("ALLOW_USER_REGISTRATION", "true") == "true" %>
  cleanup_old_records: <%= ENV.fetch("CLEANUP_OLD_RECORDS", "true") == "true" %>
  user_rate_limit: <%= ENV["SECONDARY_USER_RATE_LIMIT"] || 10 %>
qa:
  <<: *staging
  allow_user_registration: <%= ENV.fetch("ALLOW_USER_REGISTRATION", "true") == "true" %>
  cleanup_old_records: <%= ENV["CLEANUP_POLICY"] == "strict" &&
    ENV.fetch("CLEANUP_OLD_RECORDS", "false") == "true" %>
  user_rate_limit: <%= ENV["USER_RATE_LIMIT"] || 5 %>

Now suppose you have multiple Rails services built at different times using different configuration structures. And you would like to clean them up and make them consistent. Imagine that one of your services uses RAILS_ENV="qa" in the QA environment, but all the others use RAILS_ENV="staging".

This can be cleaned up parsimoniously by removing RAILS_ENV=qa from the one service that is not like the others.

This, however, means that you have to read all the configuration currently under qa and move it into the staging configuration context. A project which, though I’m no Hercules, reminds me vaguely of cleaning the Augean stables.

Fixing inconsistencies one by one

Let’s consider the cases here one by one.

allow_user_registration

Both qa and staging are set to the same environment variable ALLOW_USER_REGISTRATION with the same default. In this case, it should be fine to switch to staging without doing anything.

But it still took a minute to read both lines of code, compare them, and make sure that they’re identical. “Manual toil,” we call this at work. (This specific case could be handled by an LLM, I think, but you would still have to verify the results, which would take as long as doing it manually, because errors are impermissible.)
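One way to reduce the mechanical part of this toil is to compare resolved values rather than source lines. The sketch below assumes the settings for each environment have already been loaded into plain hashes; the diff_settings helper and the sample values are hypothetical. It reports every key whose value differs between the two:

```ruby
# Hypothetical helper: list the keys on which two resolved settings hashes disagree.
# A key that is absent on one side is reported as :missing for that side.
def diff_settings(left, right)
  (left.keys | right.keys).sort.each_with_object({}) do |key, diff|
    l = left.fetch(key, :missing)
    r = right.fetch(key, :missing)
    diff[key] = [l, r] unless l == r
  end
end

qa      = { "allow_user_registration" => true, "user_rate_limit" => 5 }
staging = { "allow_user_registration" => true, "user_rate_limit" => 10 }

diff_settings(qa, staging).each do |key, (qa_value, staging_value)|
  puts "#{key}: qa=#{qa_value.inspect} staging=#{staging_value.inspect}"
end
# prints: user_rate_limit: qa=5 staging=10
```

A comparison like this only holds for one particular set of env vars, so it has to be run with the values actually present in the qa environment to mean anything.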

cleanup_old_records

The qa config checks two different conditions based on two different env vars (CLEANUP_POLICY, CLEANUP_OLD_RECORDS). Meanwhile, staging just checks one env var, CLEANUP_OLD_RECORDS, with a different default from qa.

What you have to do in this case to consolidate them is to discover what the use case is here, check what the current env vars are actually set to in the qa environment, and then, if possible, consolidate on only using CLEANUP_OLD_RECORDS, so that you can fall back to the staging config safely. You might have to alter the env vars themselves to make this work (we usually set env vars in code, using Helm charts, so this is fine).
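Before touching any env vars, it can help to enumerate where the two expressions actually disagree. The following sketch transcribes the qa and staging rules from the example file into two lambdas and tries every combination of a few representative values. The candidate value lists are my assumptions ("lenient" is a made-up non-strict policy), and nil stands for an unset variable.

```ruby
# Rules transcribed from the example settings.yml; env is a plain hash standing in for ENV.
QA_RULE = lambda do |env|
  env["CLEANUP_POLICY"] == "strict" &&
    env.fetch("CLEANUP_OLD_RECORDS", "false") == "true"
end
STAGING_RULE = ->(env) { env.fetch("CLEANUP_OLD_RECORDS", "true") == "true" }

POLICIES = [nil, "strict", "lenient"].freeze # "lenient" is hypothetical
RECORDS  = [nil, "true", "false"].freeze

# Collect every combination on which the two rules give different answers.
MISMATCHES = POLICIES.product(RECORDS).filter_map do |policy, records|
  # compact drops nil entries so that fetch falls through to its default
  env = { "CLEANUP_POLICY" => policy, "CLEANUP_OLD_RECORDS" => records }.compact
  qa = QA_RULE.call(env)
  staging = STAGING_RULE.call(env)
  [env, qa, staging] if qa != staging
end

MISMATCHES.each { |env, qa, staging| puts "#{env} qa=#{qa} staging=#{staging}" }
```

In this toy case the rules agree whenever CLEANUP_OLD_RECORDS is explicitly "false", or when it is "true" and the policy is strict. In every mismatch qa has cleanup off while staging has it on, which tells you which real-world values to go and check first.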

This is becoming a research project, because you have to understand the use case and check env vars from a different system. I’m not convinced an LLM can do this yet, because it requires digging around elsewhere in a large technical system.

user_rate_limit

Here qa reads USER_RATE_LIMIT with a default of 5, while staging reads SECONDARY_USER_RATE_LIMIT with a default of 10. In this case you have to check what value qa currently resolves to, and then either ensure that SECONDARY_USER_RATE_LIMIT is populated in qa with that value (5, if USER_RATE_LIMIT was never set) or deliberately accept the staging default of 10.

This isn’t really complicated but, again, manual toil.

Obviously these aren’t real examples, but imagine doing this project again and again across a large set of configuration points, and you can imagine the experience.

Virtual patch panels

It occurred to me that what we were building in settings.yml was the virtual equivalent of a patch panel.

The function of a patch panel is to allow you to route arbitrary outputs to arbitrary inputs. It is a sort of abstraction layer that allows for flexibility and configurability, while (hopefully) keeping the mess contained to a single zone.

In this case, our virtual patch panel takes inputs from the env vars or from a set of predefined values and then routes them into a settings table exposed to application developers. And that can be more or less of a huge mess. It takes discipline to keep it from becoming a huge mess.

We’d like to keep this fundamentally messy and arbitrary zone as clean and understandable as possible.

Best practices

I started thinking about things to do, and not to do, in a configuration layer.

Don’t duplicate configuration code (see allow_user_registration).
Prefer flat configuration systems; try not to build inheritance trees of configuration when possible.
Prefer consistency across contexts. If you need a similar configuration system in two separate services, make it as identical as you can. Standardization reduces cognitive load on maintainers and therefore reduces the risk of configuration errors.
Don’t allow direct environment variable access from the codebase (that bypasses what the “patch panel” is for). Keep all env var access centralized in one place, so it is instantly apparent to maintainers what env vars are required by the program. If this is impossible, at least minimize the number of places that access env vars.
Don’t write to the env vars as if they were a convenient set of mutable global variables. (Exceptions may be allowable in test code, maybe, sometimes. Ugh.)
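As a sketch of the centralization point, the env var reads can be gathered into one small module, so the full list of inputs is visible in a single table. The module name, the variable list, the defaults, and the cast helpers below are all hypothetical; none of this comes from our codebase or from Rails.

```ruby
# Hypothetical single point of env var access (the "patch panel").
module AppEnv
  # Every env var the program reads, with its type and default, in one flat table.
  INPUTS = {
    allow_user_registration: ["ALLOW_USER_REGISTRATION", :bool, "true"],
    cleanup_old_records:     ["CLEANUP_OLD_RECORDS",     :bool, "false"],
    user_rate_limit:         ["USER_RATE_LIMIT",         :int,  "10"]
  }.freeze

  CASTS = {
    bool: ->(raw) { raw == "true" },
    int:  ->(raw) { Integer(raw, 10) }
  }.freeze

  # Resolve all inputs once and return a frozen hash, so the result
  # cannot be treated as a set of mutable globals later on.
  def self.load(env = ENV)
    INPUTS.to_h do |name, (var, type, default)|
      [name, CASTS.fetch(type).call(env.fetch(var, default))]
    end.freeze
  end
end

settings = AppEnv.load({ "USER_RATE_LIMIT" => "5" })
puts settings[:user_rate_limit]         # prints 5
puts settings[:allow_user_registration] # prints true
```

Application code would then read from the resolved settings and never touch ENV directly, which also makes it easy to grep for stragglers.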

Coda

Recently we’ve started switching to a feature flag SaaS solution which costs a lot and nobody really loves. It has heavy process overhead from non-engineering staff, a confusing UX, a limited number of licenses because of the high cost per seat, and generally all the modern luxuries we’ve come to expect.

It is fine for its specific use case (runtime evaluation of boolean flags) and it’s good for managing release-specific flags, but it’s not a complete replacement for settings.yml. So now, of course, we heavily use both systems.

#winning?

Posted under: best practices