Just where do env vars come from?

Mar 6, 2023

Famously, Linux processes accept an array of arguments at start time. In C, this looks like int main(int argc, char *argv[]).

But as we all learn sometime after writing hello world for the first time, these arguments aren’t the only arguments passed to your program at startup. There’s also a second set of arguments, termed the environment. These are the things we know colloquially as “env vars.”

# Passing into argv:
$ my_program --name emma

# Passing into an env var:
$ NAME="emma" my_program

(You can access them in C with getenv, but you can also declare them as a third argument to main, like int main(int argc, char *argv[], char *envp[]). This makes them available as a local variable.)

Environment variables are a complex system of their own. In an organization like mine, managing environment variables is a huge endeavor.

I started to get curious: What is an environment, technically speaking? And where does the environment come from?

Data structure

The arguments are an array (an ordered list), whereas the environment sometimes acts like a dictionary.

Until I wrote this post, I imagined that in Ruby, the environment was literally a hash (ENV). It turns out that no, Ruby just wraps the OS’s environment implementation in a hash-like interface.

Generally speaking, in Unix-like systems, the environment variables are not implemented as a hash table. They are just an unordered array of null-terminated strings, where each string is a name and a value joined by the character =. (Therefore, you can’t use the = character in an env var name, though it is perfectly valid as part of the value.) The final item in the list of env vars is a null pointer.
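
To make that layout concrete, here is a minimal sketch that walks such an array and splits one entry at its first =. The helper names env_count and env_split are made up for this illustration; they are not part of glibc.

```c
#include <stddef.h>
#include <string.h>

/* Count the entries in a NULL-terminated environment array. */
size_t env_count(char *const envp[])
{
    size_t n = 0;
    while (envp[n] != NULL)
        n++;
    return n;
}

/*
 * Split one "NAME=value" entry at its first '='.
 * Copies the name into name_buf and returns a pointer to the value,
 * or NULL if the entry has no '=' or the name does not fit in name_buf.
 */
const char *env_split(const char *entry, char *name_buf, size_t buf_size)
{
    const char *eq = strchr(entry, '=');
    if (eq == NULL)
        return NULL;

    size_t name_len = (size_t)(eq - entry);
    if (name_len >= buf_size)
        return NULL;

    memcpy(name_buf, entry, name_len);
    name_buf[name_len] = '\0';
    return eq + 1; /* everything after the first '=', later '=' included */
}
```

You could pass either the envp argument of main or the global environ pointer (declared as extern char **environ;) to env_count. Because only the first = is treated as the separator, an entry such as EQUATION=a=b splits into the name EQUATION and the value a=b.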

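You can then use some common accessor functions provided by glibc. The most important ones are setenv, putenv, and getenv. setenv can only add or update env vars, while putenv can also remove them (in glibc, passing it a bare name with no = deletes that variable, though unsetenv is the portable way to do this), and getenv looks up a value by name.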
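
Here is a small sketch of these accessors in use. The variable name DEMO_NAME and the function env_accessor_demo are invented for the example. It also uses unsetenv, the POSIX function for removing a variable.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>

/*
 * Exercise the accessors on a hypothetical variable DEMO_NAME.
 * Returns 0 if every step behaved as the comments describe, -1 otherwise.
 */
int env_accessor_demo(void)
{
    /* putenv keeps a pointer to this buffer, so it must outlive the call. */
    static char assignment[] = "DEMO_NAME=ada";
    const char *v;

    /* Add the variable; overwrite=1 would also replace an existing value. */
    if (setenv("DEMO_NAME", "emma", 1) != 0)
        return -1;

    /* With overwrite=0, an existing value is left alone. */
    if (setenv("DEMO_NAME", "ignored", 0) != 0)
        return -1;
    v = getenv("DEMO_NAME");
    if (v == NULL || strcmp(v, "emma") != 0)
        return -1;

    /* putenv takes a whole "NAME=value" string and replaces the entry. */
    if (putenv(assignment) != 0)
        return -1;
    v = getenv("DEMO_NAME");
    if (v == NULL || strcmp(v, "ada") != 0)
        return -1;

    /* Remove the variable again. */
    if (unsetenv("DEMO_NAME") != 0)
        return -1;
    return getenv("DEMO_NAME") == NULL ? 0 : -1;
}
```

One difference worth noticing: setenv copies its arguments, whereas putenv stores the pointer you hand it, which is why the string passed to putenv here is a static buffer rather than a local one.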

Every time you get an env variable from glibc, it does a linear search through the current list of env vars to find a match. (For a slight performance boost, the current implementation filters by the first two characters before doing a full string comparison.)
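
The linear search is easy to sketch. The function below is an illustration written for this post, not glibc's actual code, and it leaves out the two-character prefilter; env_lookup is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Look up name in a NULL-terminated array of "NAME=value" strings.
 * Returns a pointer to the value inside the matching entry, or NULL
 * if no entry has exactly that name.
 */
const char *env_lookup(char *const envp[], const char *name)
{
    size_t len = strlen(name);
    if (envp == NULL || len == 0)
        return NULL;

    for (size_t i = 0; envp[i] != NULL; i++) {
        /* The entry must start with name, immediately followed by '='. */
        if (strncmp(envp[i], name, len) == 0 && envp[i][len] == '=')
            return envp[i] + len + 1;
    }
    return NULL;
}
```

The check for = right after the name is what keeps a lookup for PATH from matching an entry like PATHEXT=x. The cost grows with the number of variables, which is usually small enough not to matter.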

Are environment variables part of the operating system?

I started to wonder: Are environment variables a fundamental feature of the Linux kernel? Are they part of the definition of a process? Are env variables handled by the system task manager?

Answer: Not really. If you look at what’s stored in the Linux kernel for each process, it doesn’t contain anything like a list of environment variables. (At least that’s how it looks to me from taking a glance at task_struct, the kernel data structure that represents a process.)

However, it turns out that the environment is part of the calling conventions of program execution in Unix systems. For example, it’s common to use a Linux system call named execve to execute new programs. (execve is what the Bash shell uses to execute a command.) And when you call execve, you must pass the environment variables as an argument: int execve(const char *pathname, char *const argv[], char *const envp[]).

Thus, Linux absolutely does expect that every new process will be invoked with environment variables (even if the environment variables are an empty array). The environment variables aren’t used for process management by the kernel; they are just provided to your program as part of the program data (stored on the stack). You can then use that data for anything you want.
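
As a sketch of what that looks like from the caller's side, the helper below forks and then calls execve with an explicit environment. The name run_with_env is invented for this example, and error handling is kept to a minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at path with exactly the environment given in envp.
 * Returns the child's exit status, or -1 if the child could not be
 * started or did not exit normally.
 */
int run_with_env(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace the process image. The new program sees only
           what is in envp, nothing else from the parent's environment. */
        execve(path, argv, envp);
        _exit(127); /* reached only if execve itself failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For instance, running /bin/sh with the arguments -c 'test "$NAME" = emma' and an envp of { "NAME=emma", NULL } should make the shell exit successfully, while an envp containing only the terminating NULL gives the program an empty environment.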

Where does the environment come from?

One of the things you learn as a working software developer is that usually the env vars are inherited from the parent process by default. Of course, the environment can be modified when you invoke the child process, but it’s often the case that, for instance, the PATH and other crucial env vars are propagated down through the process tree, unchanged unless you explicitly change them. There is kind of an implicit tree of env vars, starting at a parent process and propagating across all the child processes.

This being said, there are plenty of special cases where the child environment is reset to blank. Most often that would be for security reasons of one kind or another. As a result, environment variables aren’t really a tree structure; they are a sort of broken tree, logically speaking.
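
Both cases, inheritance and a reset to blank, can be sketched in a few lines. The variable DEMO_INHERITED and the function child_sees_variable are made up for this example, and it assumes /bin/sh exists. execv passes the caller's current environment along, while execve takes whatever array you give it.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set the hypothetical variable DEMO_INHERITED in this process, start
 * /bin/sh as a child, and report whether the child can see the variable.
 *   inherit != 0: execv hands the parent's environment to the child.
 *   inherit == 0: execve is given an empty environment instead.
 * Returns 1 if the child saw the variable, 0 if not, -1 on error.
 */
int child_sees_variable(int inherit)
{
    char *const child_argv[] = {
        "sh", "-c", "test -n \"$DEMO_INHERITED\"", NULL
    };
    char *const empty_env[] = { NULL };

    if (setenv("DEMO_INHERITED", "yes", 1) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (inherit)
            execv("/bin/sh", child_argv);
        else
            execve("/bin/sh", child_argv, empty_env);
        _exit(127); /* reached only if the exec call failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    /* test -n exits 0 when the variable is non-empty, 1 otherwise. */
    if (WEXITSTATUS(status) == 0)
        return 1;
    if (WEXITSTATUS(status) == 1)
        return 0;
    return -1;
}
```

The second branch is the "broken tree" case: the parent has the variable, but nothing of it reaches the child.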

Even so, we can still try to follow the tree up as far as we can. The question becomes: where does our environment get its initial state?

1. The shell

We often run Linux programs through a shell. Thus when you invoke a process, it’s common to get the initial set of env vars from the shell. You might customize these env vars in your shell configuration, typically with export FOO="bar".

A shell like Bash has its own variable handling system (bash:variables.c) that’s separate from the glibc environment handling system. But this variable handling system is itself initialized from the parent environment in #initialize_shell_variables.

So where does your shell session get its initial env vars from?

2. The login process

The shell gets its env vars from its parent process. If you log in from a console, your shell will be spawned by a process called login (the process that checks your credentials and then invokes your designated shell process). If you log in with SSH, your shell will be spawned by the sshd process.

OpenSSH provides a function called do_setup_env that initializes the basic environment variables before loading your shell. These would include HOME, USER, SHELL, TERM, and PATH (see openssh-portable:session.c). The analogous function in login would be init_environ, which does similar operations (see util-linux:login-utils/login.c).
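
To give a feel for what such a setup function does, here is a heavily simplified sketch that builds a small environment from a user's passwd entry. It is not the OpenSSH or util-linux code: struct login_env, build_login_env, the 256-byte entry size, and the default PATH of /usr/bin:/bin are all assumptions made for the illustration.

```c
#include <pwd.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define LOGIN_ENV_MAX 5

struct login_env {
    char storage[LOGIN_ENV_MAX][256]; /* backing memory for the strings */
    char *envp[LOGIN_ENV_MAX + 1];    /* NULL-terminated, usable with execve */
};

/* Append one "NAME=value" entry (truncated if too long); returns new count. */
static size_t login_env_add(struct login_env *env, size_t n,
                            const char *name, const char *value)
{
    snprintf(env->storage[n], sizeof env->storage[n], "%s=%s", name, value);
    env->envp[n] = env->storage[n];
    env->envp[n + 1] = NULL;
    return n + 1;
}

/*
 * Fill env with USER, HOME, SHELL, and PATH from the passwd entry,
 * plus TERM if the caller knows the terminal type.
 * Returns the number of variables set.
 */
size_t build_login_env(const struct passwd *pw, const char *term,
                       struct login_env *env)
{
    size_t n = 0;
    env->envp[0] = NULL;

    n = login_env_add(env, n, "USER", pw->pw_name);
    n = login_env_add(env, n, "HOME", pw->pw_dir);
    n = login_env_add(env, n, "SHELL", pw->pw_shell);
    n = login_env_add(env, n, "PATH", "/usr/bin:/bin"); /* assumed default */
    if (term != NULL && strlen(term) > 0)
        n = login_env_add(env, n, "TERM", term);
    return n;
}
```

A real login path would typically obtain the passwd entry with getpwnam or getpwuid and then pass env.envp to execve when starting the user's shell. The real functions handle many more cases than this.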

But if you read the code, you’ll see that the sshd process also propagates its own env vars into the child shell processes. Where do those env vars come from?

3. Init

All userspace processes in Linux descend from an init process, which has PID 1 and is the ancestor of every other userspace process. On systems I use, the init process is generally systemd.

It looks to me like systemd builds the initial env for a child process from several sources in systemd:src/core/execute.c.

accum_env = strv_env_merge(params->environment,
   our_env,
   joined_exec_search_path,
   pass_env,
   context->environment,
   files_env);

The systemd man page has more details on what those different sources are. When running services like sshd, systemd usually prefers to spawn new processes with a blank environment (except for env vars configured for that specific service). But when running interactive user programs, systemd will generally pass through its own environment vars by default. (See systemd:src/core/manager.c#manager_default_environment.)
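
Merging several sources into one environment can be sketched roughly as follows. This is not systemd's strv_env_merge. It is a simplified illustration that assumes later sources take precedence over earlier ones with the same name, and the name env_merge is invented here.

```c
#include <stddef.h>
#include <string.h>

/* Length of the name part of "NAME=value" (up to the first '=' or the end). */
static size_t name_len(const char *entry)
{
    const char *eq = strchr(entry, '=');
    return eq ? (size_t)(eq - entry) : strlen(entry);
}

/*
 * Merge NULL-terminated environment arrays into out, whose capacity
 * out_cap includes the terminating NULL. Sources are applied in order,
 * and a later entry replaces an earlier one that has the same name.
 * The entries are not copied; out points into the source arrays.
 * Returns the number of entries written, or (size_t)-1 if out is too small.
 */
size_t env_merge(char *const *sources[], size_t n_sources,
                 const char *out[], size_t out_cap)
{
    size_t count = 0;
    if (out_cap == 0)
        return (size_t)-1;

    for (size_t s = 0; s < n_sources; s++) {
        if (sources[s] == NULL)
            continue; /* a missing source contributes nothing */

        for (size_t i = 0; sources[s][i] != NULL; i++) {
            const char *entry = sources[s][i];
            size_t len = name_len(entry);
            size_t j;

            /* Look for an entry with the same name already in out. */
            for (j = 0; j < count; j++) {
                if (name_len(out[j]) == len &&
                    strncmp(out[j], entry, len) == 0)
                    break;
            }
            if (j == count) {
                /* New name: need room for it and the final NULL. */
                if (count + 1 >= out_cap)
                    return (size_t)-1;
                count++;
            }
            out[j] = entry;
        }
    }
    out[count] = NULL;
    return count;
}
```

With defaults of { "PATH=/usr/bin", "LANG=C" } followed by service settings of { "LANG=en_US.UTF-8", "NAME=emma" }, the merged result keeps PATH, takes LANG from the service settings, and adds NAME.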

If even systemd has environment variables, just like every other process, then where do those come from?

4. The Linux kernel

In the end, they have to come from the kernel. There’s nowhere else at this point, right?

The init process is invoked via the very simple function run_init_process. It executes the init process with execve, using a provided set of argv and envp values:

kernel_execve(init_filename, argv_init, envp_init);

What is the value of envp_init here?

In linux:init/main.c, we finally find the most basic default values for envp_init. They are the following:

HOME="/"
TERM="linux"

There you are: the default env vars set for a Linux system. They’re pretty useless, honestly.

(These values have long since been overwritten by the time you log in with SSH. In practice, sshd is the top level source of env vars for your interactive sessions with remote systems.)

5. Arguments to the kernel

But there’s one last funny detail. It turns out that if you pass arguments of the form NAME=value to the kernel at boot time (docs), and the kernel doesn’t recognize them as its own parameters, they will magically become env vars added to the default envp_init values, and then they will be passed down into the init process (see unknown_bootoption).
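
The rule can be sketched in userspace C. This is a simplified illustration, not the kernel's unknown_bootoption: the struct, the function names, and the limits of 8 are assumptions, and it treats a boot parameter whose name matches an existing variable as a replacement for it, which is how the kernel source appears to behave.

```c
#include <stddef.h>
#include <string.h>

#define MAX_INIT_ENVS 8 /* hypothetical limits for this sketch */
#define MAX_INIT_ARGS 8

struct init_params {
    const char *argv[MAX_INIT_ARGS + 2]; /* last slot always stays NULL */
    const char *envp[MAX_INIT_ENVS + 2]; /* last slot always stays NULL */
};

/* Start from the defaults: argv = { "init" }, envp = { HOME=/, TERM=linux }. */
void init_params_defaults(struct init_params *p)
{
    for (size_t i = 0; i < MAX_INIT_ARGS + 2; i++)
        p->argv[i] = NULL;
    for (size_t i = 0; i < MAX_INIT_ENVS + 2; i++)
        p->envp[i] = NULL;
    p->argv[0] = "init";
    p->envp[0] = "HOME=/";
    p->envp[1] = "TERM=linux";
}

/*
 * File one unrecognized boot parameter: "NAME=value" goes into envp
 * (replacing an entry with the same name), anything else goes into argv.
 * Returns 0 on success, -1 if the relevant array is full.
 */
int add_unknown_boot_param(struct init_params *p, const char *param)
{
    const char *eq = strchr(param, '=');
    size_t i;

    if (eq != NULL) {
        size_t len = (size_t)(eq - param) + 1; /* name plus the '=' */
        for (i = 0; p->envp[i] != NULL; i++) {
            if (strncmp(p->envp[i], param, len) == 0)
                break; /* same name: overwrite this slot */
        }
        if (i > MAX_INIT_ENVS)
            return -1;
        p->envp[i] = param;
        return 0;
    }

    for (i = 0; p->argv[i] != NULL; i++)
        ;
    if (i > MAX_INIT_ARGS)
        return -1;
    p->argv[i] = param;
    return 0;
}
```

So a boot parameter like NAME=emma ends up in init's environment next to HOME and TERM, while a bare word such as single ends up in init's argument list.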

So in the end, the very distinction between env vars and arguments breaks down. argv can magically become envp.

It’s unintuitive, but if you think about it, there’s no hard categorical distinction between args and env vars in the first place. You can pass values into your program either way, with only minor adjustments to your code. The distinction between the two is largely a matter of convention and semantics.

An “environment” is a fundamentally complex thing. It makes sense to me that there’s something arbitrary about how we represent it.