Bikeshedding: a method to refresh os.environ

steve.dower · July 8, 2024, 2:32pm

There was a discussion in Ideas (no need to read, I’ll summarise here) about adding a method to os.environ to allow regenerating its contents from the current process environment. There is no clear agreement on the name.

This is important because os.environ is technically a cache - variables are read at initialisation and are not updated again. Generally, a Python-only app is going to make changes through os.environ, which means it will be written to the cache and everything is fine. No need for this method.

However, if you call os.putenv() directly, or have native code that updates the environment without going through os.environ (e.g. using ctypes to call native putenv, or a library that updates it), those changes will not be visible to Python code. Currently, there’s just no way to get at them other than bypassing os.environ yourself to read the actual environment.
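A minimal sketch of the stale-cache situation described above. The variable name `GR_DEMO_PUTENV` is made up for the demo; everything else is the documented `os` API.

```python
import os

NAME = "GR_DEMO_PUTENV"  # hypothetical variable name, used only for this demo

os.environ.pop(NAME, None)               # start from a known state
os.putenv(NAME, "set-behind-the-cache")  # changes the process environment only

# os.environ was not told about the change, so the lookup misses.
stale_lookup = os.environ.get(NAME)
print(stale_lookup)
```

A child process started after the `os.putenv()` call would still inherit the new value, because it is the process environment that is passed on, not the `os.environ` mapping.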

So there’s no dispute about the importance of this functionality - we are the only ones who can realistically provide it, and so we should. However, there’s significant dispute about the naming. You might like to see this message where @vstinner took a vote in Ideas on the options, with os.environ.refresh() being most popular, followed by os.environ.reload() and os.environ.reload_from_process().

The concerns about the name of the method are that the functionality is quite obscure, and generally is not something you would ever use until you’ve identified that your app has this issue. A name like refresh() or reload() potentially suggests a lot of things to a range of potential callers, but it does not suggest “reload if non-Python code in your app is changing the environment without telling you” to someone who’s browsing code completions.

(One obvious thing it may suggest is to refresh or reload the environment from the user profile. This is very much out of scope - it’s been discussed, attempted and dismissed already. There’s no way to know the parent process didn’t modify the environment before launching Python, and so resetting/changing that state may cause new problems.)

The function has already been added to 3.14 as os.environ.refresh() (see third paragraph in the os.environ docs), despite the naming controversy on the Ideas thread and the fact that it was never raised here.
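For readers who have not seen the method, a hedged usage sketch follows. It assumes Python 3.14 or later for `os.environ.refresh()` and guards the call with `hasattr` so the snippet also runs on older versions, where the cache simply stays stale. The variable name is made up.

```python
import os

NAME = "GR_DEMO_REFRESH"  # hypothetical variable name, used only for this demo

os.environ.pop(NAME, None)
os.putenv(NAME, "from-putenv")   # bypasses the os.environ cache
before = os.environ.get(NAME)    # the cache has not seen the change

# refresh() exists only on Python 3.14+, so guard the call.
has_refresh = hasattr(os.environ, "refresh")
if has_refresh:
    os.environ.refresh()         # rebuild the cache from the process environment

after = os.environ.get(NAME)
print(before, after)
```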

So we’re raising it here to look for core developer consensus on renaming[1] this function. Does refresh() adequately describe what is happening here? Will it be a tempting footgun with that name? (And any brilliant ideas on handling the fact that environ is not really thread-safe at all on any platform…)

[1] Or moving - one proposal was to put it directly in os rather than on os.environ, though people are more likely to stumble blindly upon it in os so I don’t think we gave it much weight.

vstinner · July 8, 2024, 3:18pm

This is incorrect. os.environ[key] = value calls os.putenv(key, value) and del os.environ[key] calls unsetenv(key).
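One way to observe this write-through behaviour is to ask a freshly started child process what it inherited. This is only a sketch: the helper `read_in_child` and the variable name are invented for the demo.

```python
import os
import subprocess
import sys

NAME = "GR_DEMO_WRITE_THROUGH"  # hypothetical variable name


def read_in_child(name):
    """Return the value a freshly started child process sees for `name`."""
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


os.environ[NAME] = "hello"            # updates the cache and the process environment
seen_after_set = read_in_child(NAME)

del os.environ[NAME]                  # removes it from both again
seen_after_del = read_in_child(NAME)

print(seen_after_set, seen_after_del)
```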

davidhewitt · July 8, 2024, 3:45pm

Somewhat related here if discussing putenv and with the prospect of freethreading / subinterpreters now on the table.

My understanding is that putenv is not thread-safe and multiple threads calling in parallel can cause issues like use after free.

This is a repeated topic on the Rust language forums. Rust used a lock in its stdlib to protect calls to putenv but this wasn’t sufficient in the face of external software calling putenv directly.

Presumably CPython is mostly saved from concurrent modification at present by the GIL.

It might be that there’s not much CPython can do to mitigate this risk other than join Rust in waiting for libc to provide safer primitives. I’m not up to date on this topic so I’m unsure if there are newer ideas.

kknechtel · July 8, 2024, 3:53pm

Honestly, I think the design problem is more fundamental.

If someone wants an up-to-date environment variable, calling something to refresh the cache and then accessing it creates a needless race condition. The right way, presumably, is os.getenv.

If the goal is to have the dictionary interface, then it should auto-update - or rather, implement __getitem__ to check each time. It violates the principle of least surprise to hear that a dict could have a stale cache. If it’s “backed by” something external, it should automatically be in sync with that.

(I’m sure, changing how os.environ.__getitem__ works is off the table. But there could be a new method and perhaps eventually a deprecation cycle.)

Performance shouldn’t be an issue:

$ python -m timeit --setup "import os" -- "os.getenv('PATH')"
500000 loops, best of 5: 701 nsec per loop

(on 10-year-old hardware btw)

Those who need to make a cache for program correctness can .copy the dict. (Yes, there’s theoretically a race condition here as well. It might be necessary to override .copy to filter out keys that go missing at the exact right instant.)

steve.dower · July 8, 2024, 3:54pm

The process variables are changed, but the values stored in os.environ are only updated if the caller directly updates them (which also updates the process variables). If the caller calls os.putenv themselves, os.environ does not get updated.

Apologies if I wasn’t clear.

steve.dower · July 8, 2024, 3:57pm

Performance can be an issue on other platforms. Anyone who’s using an environment variable in a tight loop is probably already caching it (if they’re worried about it changing partway through), but a repeated operation that regularly queries the environment could notice the difference. (Source: it’s happened to me.)

storchaka · July 8, 2024, 5:31pm

os.getenv is just a wrapper around os.environ:

def getenv(key, default=None):
    return environ.get(key, default)

putenv is not thread-safe. os.environ.refresh() is not thread-safe. If we add an os.getenv() which returns the actual current value from the environment, it will also not be thread-safe. The current os.getenv() and environ.get() are thread-safe only because they return a cached value, which cannot be broken by a concurrent putenv in C code. But they are not thread-safe if os.environ.refresh() is called concurrently. Changing os.getenv() to read the non-cached value will make it less thread-safe.
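A short check of the point that `os.getenv()` reads the same cache as `os.environ`, so it is no fresher. The variable name and the `"<not cached>"` default are made up for the demo.

```python
import os

NAME = "GR_DEMO_GETENV"  # hypothetical variable name

os.environ.pop(NAME, None)
os.putenv(NAME, "only-in-process-env")  # bypasses the cache

# Both lookups consult the cached mapping, so both fall back to the default.
via_getenv = os.getenv(NAME, "<not cached>")
via_mapping = os.environ.get(NAME, "<not cached>")
print(via_getenv, via_mapping)
```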

Concurrent use of putenv in C code should be avoided. Concurrent use of os.environ.refresh() in Python code should be avoided. We cannot prevent or even detect if they were used concurrently. We can only add warnings.

And I think this is OK. There are many non-thread-safe functions in the stdlib. The user should use them only if this is safe. For example, call os.environ.refresh() only if they are sure that no other thread reads or modifies the environment or os.environ (including extensions).

barry · July 8, 2024, 5:43pm

That was my immediate thought as well, i.e. keeping os.environ “live”, which gets you os.getenv() for free.

As an aside, the mention of os.environ.refresh() in the 3.14 docs is pretty hidden, which also leads me to consider that maybe we need a different approach.

Let’s assume we can’t change os.environ for backward compatibility reasons. Why not implement a new object called say os.environment that has all the behavior we want?

set, del on os.environment update the process’s environment immediately
get on os.environment pulls its value from the process environment
No caching - you can cache values yourself if performance is critical
Locking underneath the hood for thread safety
No connection with os.putenv(), os.getenv() or os.unsetenv()
No .refresh() needed because it’s always fresh
No os.environmentb - however a method could be added such as os.environment.as_bytes() which would return a bytes-view of the process environment and would act exactly like os.environment.
Document the os.environment mapping-like object separately for clarity.

I don’t think you’d have to worry about os.system(), etc because changes to os.environment would directly modify the process environment anyway, so subprocesses should inherit it in the “normal” way.

We might end up keeping both os.environ and os.environment, allowing users to choose between performance and synchronicity.

gpshead · July 8, 2024, 7:25pm

The “environment” is a legacy C “API” that is not thread safe upon any writes and contains no notifications about changes. Being “live” would involve re-parsing the entire potentially huge C array of C strings upon each access. Yes that could be implemented, but it’d be an O(n) walk upon every access. And would be far less safe than our existing dict cache. There’s no good way to do this in a manner that is appropriate for software that is used to fast hash table lookup access to the environment. I don’t think most code actually wants this.

I like the .refresh() API that it sounds like has been implemented for 3.14. The name is fine; the details of why belong in documentation. It is very rarely needed. Most C/C++/Fortran/Rust extension module code does not write to the raw char ** C data structure, especially not in a way that expects other in-process language VMs to pick up on the changes. Which is why this hasn’t come up until now.

As for thread safety, there isn’t much Python can do other than attempt to prevent such issues from the limited view of CPython itself (IIUC we don’t currently explicitly attempt this, as subinterpreters and free threading may be noticing). There is literally nothing that can be done to prevent embedding or extension module code elsewhere in the process from scribbling over it messing everything up.

Somewhat related:

github.com/python/cpython

Avoid modifying the process global environment (not thread safe)

opened 12:47AM - 18 Jan 20 UTC

gpshead

type-bug interpreter-core 3.9 topic-subinterpreters

BPO: [39376](https://bugs.python.org/issue39376) · Nosy: @gpshead, @ericsnowcurrently · Type: crash · Stage: needs patch · Status: open · Versions: Python 3.9

github.com/python/cpython

Documentation for os.environ should say that os.environ caching behavior is undefined if other threads are setting environment values

opened 05:59PM - 14 Jun 24 UTC

yurivict

docs

# Documentation

It says [here](https://docs.python.org/3/library/os.html):

> This mapping is captured the first time the [os](https://docs.python.org/3/library/os.html#module-os) module is imported, typically during Python startup as part of processing site.py. Changes to the environment made after this time are not reflected in [os.environ](https://docs.python.org/3/library/os.html#os.environ), except for changes made by modifying [os.environ](https://docs.python.org/3/library/os.html#os.environ) directly.

It should mention the caveat that such caching would cause undefined behavior and/or crashes when other threads are setting environment values. This would be true at least on Linux and FreeBSD.

The same should be mentioned in the documentation of the function that initializes the Python interpreter and caches the environment. These functions are not thread-safe.

github.com/python/cpython

`subprocess.run` docs should recommend copying `os.environ` on Windows

opened 02:27PM - 21 Jun 24 UTC

ncoghlan

docs topic-subprocess

As described in https://stackoverflow.com/questions/78652758/, when passing custom environment variables to `subprocess.run` or `subprocess.Popen` on Windows, you will usually need to include the existing keys from `os.environ` in order to reliably run arbitrary processes (exactly which keys are required to be present depends on what the subprocess is doing).

The `subprocess.Popen` docs include a "Note" about this problem, but the `subprocess.run` docs omit it. Even the `subprocess.Popen` note is a bit vague, and only specifically mentions `SystemRoot`:

> Note: If specified, env must provide any variables required for the program to execute. On Windows, in order to run a side-by-side assembly the specified env must include a valid SystemRoot.

The common documentation for the `env` parameter reads:

> If *env* is not None, it must be a mapping that defines the environment variables for the new process; these are used instead of the default behavior of inheriting the current process’ environment. It is passed directly to [Popen](https://docs.python.org/3/library/subprocess.html#subprocess.Popen). This mapping can be str to str on any platform or bytes to bytes on POSIX platforms much like [os.environ](https://docs.python.org/3/library/os.html#os.environ) or [os.environb](https://docs.python.org/3/library/os.html#os.environb).

Rather than re-using the existing `subprocess.Popen` note, my suggestion would be to add the following sentence to the above paragraph:

> On Windows, many applications require specific environment variables (such as `SystemRoot`) to be set as expected in order to work correctly. Accordingly, it is recommended to start from `os.environ.copy()` when customizing the subprocess environment on Windows, rather than starting from an empty dictionary (otherwise the executed subprocesses may fail with various Windows OS errors).
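The recommendation in that issue can be sketched as follows. The variable name `GR_DEMO_CHILD_ONLY` is made up; the pattern itself (copy `os.environ`, add overrides, pass the result as `env`) works the same on any platform.

```python
import os
import subprocess
import sys

# Start from the current (cached) environment instead of an empty dict,
# so variables such as SystemRoot on Windows are still present.
child_env = os.environ.copy()
child_env["GR_DEMO_CHILD_ONLY"] = "1"  # hypothetical extra variable

result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['GR_DEMO_CHILD_ONLY'])"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
child_output = result.stdout.strip()
print(child_output)
```

Because `os.environ.copy()` returns a plain dict, the extra variable reaches only the child and leaves the parent's `os.environ` untouched.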

brettcannon · July 8, 2024, 7:33pm

Any reason the name can’t be honest about what os.environ is and call it rebuild_cache() or something?

yoavdw · July 8, 2024, 7:48pm

This is why I think the name is not fine. I went over this a lot in the Ideas discussion, so I’ll try to avoid repeating this point after this post, but I don’t think most users are going to be able to tell what it does from the name, especially (beginner) Windows users, and this isn’t what they (or anyone who hasn’t personally faced this issue) are going to think “refreshing the environment” does.

I don’t think a very rare and niche use case should be solved by a function with such a general name.

barry · July 8, 2024, 8:01pm

I think you could avoid that. Let’s add “on demand” to the “live” aspect. E.g. os.environment['FOO'] would map directly to getenv("FOO"), and so on. Iterating over keys would have to iterate over **environ [1] and that might be expensive to support thread-safety or at least a consistent snapshot of name=value entries. That should be rare, and again, os.environment wouldn’t replace os.environ, but instead provide an alternative interface with different semantics.

[1] on POSIX; modulo Windows and *_NSGetEnviron()
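To make the "on demand" idea concrete, here is a rough sketch, not a proposed implementation. It assumes a POSIX system where `ctypes.CDLL(None)` exposes the C library's `getenv`. The class name `LiveEnvironment` is hypothetical, and the lock only coordinates Python callers that go through this object; it cannot protect against native code calling `putenv` on its own.

```python
import ctypes
import os
import threading

# Assumption: POSIX, where CDLL(None) gives access to libc's getenv().
_libc = ctypes.CDLL(None)
_libc.getenv.argtypes = [ctypes.c_char_p]
_libc.getenv.restype = ctypes.c_char_p


class LiveEnvironment:
    """Hypothetical uncached view of the process environment."""

    _lock = threading.Lock()  # guards Python callers of this class only

    def __getitem__(self, key):
        with self._lock:
            raw = _libc.getenv(os.fsencode(key))  # read the real environment
        if raw is None:
            raise KeyError(key)
        return os.fsdecode(raw)

    def __setitem__(self, key, value):
        with self._lock:
            os.putenv(key, value)  # write without touching os.environ

    def __delitem__(self, key):
        with self._lock:
            os.unsetenv(key)


environment = LiveEnvironment()
os.putenv("GR_DEMO_LIVE", "x")         # change made behind os.environ
live_value = environment["GR_DEMO_LIVE"]
cached_value = os.environ.get("GR_DEMO_LIVE")
print(live_value, cached_value)
```

Iteration is left out on purpose: it would need the walk over the C array mentioned above, which is where the cost and consistency questions come in.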

bwoodsend · July 8, 2024, 10:10pm

I reckon that if you give it any name short of os.environ.refresh_cache_and_no_that_does_not_mean_that_externally_set_environment_variables_will_be_updated_it_only_includes_changes_made_by_this_process_to_this_process_if_you_dont_know_what_that_means_then_this_is_not_the_right_method_for_you() then most Windows users are going to think this is a way to propagate changes they made using Windows’s environment variable editor into already running processes.

Likewise with any variant of os.not_cached_environ. Any mention of the word cache will put anyone who doesn’t understand process environment inheritance down the wrong path of thinking that a cache is to blame for their global environment changes not propagating to existing processes.

Perhaps a name that somehow implies that it’s interfacing directly with the C API (os.c_environ)? I’m hoping to make it sound gutsy enough that anyone who doesn’t know exactly what they’re looking for will steer clear of it (or at least read the docstring).

storchaka · July 9, 2024, 2:16pm

I am afraid that it will be less thread-safe. Currently, getting a value is safe – at worst you can get an outdated value, but never crash. If you read it directly from the process environment, what happens if another thread concurrently modifies the process environment? You can get an incorrect, partially modified value if it is overwritten in place, or read from freed memory if the buffer for the process environment was relocated. I don’t know the implementation details, but there are no guarantees.

kknechtel · July 9, 2024, 3:16pm

I guess your idea is, creating the cache in the first place avoids this issue, because there aren’t other threads running in the Python process environment yet…? Does that cache creation happen at Python startup, or only when os is first imported? Because without running site, os won’t be imported automatically. So I can imagine that someone has two threads that both try to import os for the first time, etc.

barry · July 9, 2024, 4:00pm

It doesn’t have to be. With a new object like os.environment we can define the semantics we want, so thread locking around critical sections would be possible. Since in my mind, os.environment wouldn’t replace os.environ I don’t think performance concerns would be as important. The user can choose whichever API they want for the task at hand.

gpshead · July 9, 2024, 5:17pm

Is there significant demand for this to even exist? Several others on the original Ideas thread pointed out how this just doesn’t seem to be needed often enough to have a big public API for it.

The status quo of the dict cache has been the norm for so long. I feel like os.environment vs os.environ could just add confusion in a place where most people need never think about it.

Then the question from library maintainers becomes which one do they use? Maybe some users need updates, so they should use the slow unsafe one just in case. That’s a bad decision for any library maintainer to have to make and will result in a proliferation of libraries gaining options around what environment to use and having to decide how to plumb that through and not doing it consistently and … yuck.

The explicitness of a specific API to refresh the global environ cache is nice. That isolates the places where a crash could happen to one particular call made by someone with an explicit need.

The locking isn’t sufficient fwiw, there is no technically wholly safe way to access the environment from C other than doing it at process startup before any threads have been spawned[1]. We can only protect Python users from other Python code.

In modern software stacks other threads must be assumed to exist in-process from code in non-Python languages that are never going to coordinate with CPython’s lock.

[1] which isn’t necessarily even possible in the face of C++ static initializers which run before main and someone with IMNSHO poor design senses could spawn threads from…

barry · July 9, 2024, 6:24pm

I dunno, but this topic does have “bikeshedding” in the subject line. FWIW[1], I’ve never needed anything approaching os.environment or os.environ.refresh() so it’s all the same to me!

[1] for all intents and purposes, indistinguishable from zero

storchaka · July 10, 2024, 8:17am

This is not possible. The problem is that the process environment may be changed by the code we do not control.

storchaka · July 10, 2024, 8:24am

When os is first imported. To mitigate possible problems you can import os early, when you feel that it is safe. BTW, threading imports os, so no worry about Python threads.