Consideration of `pureread()`, `purewrite()`, `purefunc()` builtins

Consideration of pureread(), purewrite(), purefunc() builtins

This is one of the broken out pieces from a previous discussion

Impetus and Introduction

I tried writing a ProxyType class, and an implementation using it which had the goal of fully transparently having an on-disk and in memory representation of a data structure as .yaml. I found a number of stumbling blocks which I believe can be used to explore ways to improve the language. I’m first going to establish the technical basis for these. Afterwards, I’ll attempt to sufficiently prove their merit by covering the value which they could bring. Lastly, I will propose concepts for how they could be implemented.

Technical Basis

Enabling optimizations by detecting function purity with pureread(), purewrite(), and purefunc() builtins

There has long been want of a way to detect degrees of function purity. Purity is hard to introduce into the typing system as in many ways it is a characteristic or meta-attribute which doesn’t map cleanly to the typing system. Rather, by establishing relation about external state to a function’s execution, we can characterize functions. The basic way to do this is to establish whether the function does or does not perform a read or write from external state. pureread() reports True iff the function only reads data sent in through it’s parameters. purewrite() reports True iff state external to the function is not possible to alter external state (relevant to parameters like lists). purefunc() operates as return pureread(f) && purewrite(f) to reduce boilerplate.

Value and Merits

This section attempts to establish sufficient value for each of these changes through high-level use-case examples. It is not meant to be comprehensive and is surely going to be the most critical section for debate. Rather than be thorough to an extreme degree, I intent to expand on this through follow-up discussion. I believe doing so is more efficient and robust.

Use-case example for pureread(), purewrite(), and purefunc() builtins

functools.cache would immediately benefit for correctness by being able to test for function purity with purefunc(). The need that I’ve discovered for pureread() and purewrite() stem from the experience I gained with the ProxyType. The ProxyType needs specific detection for locking needs. If a proxied function is a pure function, it does not need locking. If a proxied function needs to read external state, it needs a read lock since there can safely be multiple readers. If a proxied function writes to external state, readers must be blocked and it must obtain a write lock. I believe this will be the strongest changes to propose in terms of acceptability.

Conceptual Implementation

This section proposes a way to go about implementing these proposals conceptually. These proposals are not sacred. They are merely a way.

An implementation for pureread(), purewrite(), and purefunc()

While expected to be the least controversial, this might also be the largest actual change. This will require expanding both the inspect module and the codeobject section of the datamodel. During the parsing of functions, parameters need to be marked as copied. Then, statement by statement, each is marked for reads and writes and whether the read or write is on a variable which is tagged as copied or not. Nested calls are recursively assessed on similar criteria. On finishing parsing, the entire function itself is tagged as being pure for reads and pure for writes. To address functions which conditionally are pure for reads and pure for writes, tracing for possible values in parameters in recursively assessed functions will allow for more precise reporting on purity. This set of changes really works against types which are pass by reference. To address this aspect, because otherwise passing list’s or dict’s would unexpectedly alter purity, a function qualifier of either pure or impure could be introduced to explicitly allow old behavior with being able to alter state through parameters which are passed by reference with the default changing to COW or visa versa.

My preference from experience of least surprising behavior is to change all pass by reference to be COW and introduce the impure keyword to restore the current exact parameter behavior. It is important to note that, even with full COW, objects which have functions on them which fail purity tests will still cause the function they’re used on to fail purity tests if those functions are called.

It seems to me that you are proposing to fundamentally change the nature of Python for niche applications in a way that I am 99+% sure will not happen. You might do better writing a purity module with the changes you need. This might be easier if you only run purity in a subinterpreter.

1 Like

The want and utility of reporting function pureness has been a long standing feature set. I do not see why reporting of such a common want falls under “fundamentally change the nature of Python for niche applications.”

OK, so here’s a question. The following is a complete Python file:

def f(x):
    return x.value

Is the function f “pureread” in the sense you mean? How can the interpreter establish that[1]?

If you can’t establish that f is “pureread”, then IMO the whole concept is pretty much worthless. It’s hard to imagine any useful function that wouldn’t be harder to analyze than this one, so where does that leave the idea?


  1. hint: properties ↩︎

4 Likes

This, I think, is what Terry means by “fundamentally change the nature of Python”. Either that, or the proposal is based on a fundamental misunderstanding of the current nature of Python. “Pass by reference” is not really a concept. When you assign one thing to another, including passing of parameters, it’s the same object. It is always the same object. There are not certain types that get passed by reference and others that get passed by value.

I have no idea what it would mean for parameters to be passed in some sort of COW form, but it’s not anything whatsoever like what Python currently is. At that point, you’re making an entirely new language.

3 Likes

It is following a reference to state outside of the function, and so would resolve as false. To me, this would be correct and desirable behavior. In a concurrent scenario, value could change between the function call and when the value is read and so fails the notion of pureread(). An implementation could mark the reference passed in as copied but none of what it points to would be tagged as copied. When the read from a non-copied value is detected in the parsing/evaluation of the function this would be detected.

That whole section was about discussing what an implementation could look like. It is already stated as ‘not sacred’. I’m tired, boss. I’m trying here.

Okay. Let’s take a step back. In what circumstances is this even useful? What would you be able to do with the knowledge that a function is pure? So vanishingly few functions will qualify that I can’t see any sort of value in it.

1 Like

It’s Copy On Write (COW). It’s a feature in MATLAB, polars and pandas>=3.0.

It’s actually really useful. Before COW was introduced in pandas, you had to chose between calling f(df) or f(df.copy()). In the former case you’re relying on f not to change the function input, and in the latter case you take a significant performance hit. The bad cases would be that either you write f(df) and the function author writes

def f(df):
  df["x"] = 1

I’ve had cache corruption due to that kind of thing.[1]

The other bad case is where you both call .copy() and then you’re just wasting RAM and CPU.[2]

As of pandas 3.0, we can write f(df[:]) and that’s an almost free operation which prevents any changes from back-propagating. I think the documentation refers to df[:] as ‘creating a view’ but I’m not entirely sure.


In MATLAB, all objects implicitly have this behavior. If I translate it to Python, it looks like

c = C()

c.p = "hello world"

def f(c: C):
  c.p = 0

f(c)
d = c
d.p = 1

assert c.p == "hello world"

ie: Simply passing a variable into a function or assigning it to a new name creates a kind of ‘view’ where modifying one version of the object does not change the others.[3]


Anyway this is a really cool features, that makes for example protecting your cache from corruption much easier and cheaper (CPU wise). But it is also a breaking change to implement it in Python, and it would break some kinda cool (but generally hard-to-maintain) Python code patterns.


  1. (That was actually with a normal Python dict, and the point stands.) ↩︎

  2. (Fear of CPU wastage was why my cache wasn’t properly protected by .copy() calls.) ↩︎

  3. (OK I’m actually not 100% certain what happens in MATLAB when you change the original object and look at whether the view has changed. Nor do I know what the pandas or polars behaviour would be.) ↩︎

I know what COW means, what I don’t know is how that would work with Python’s execution model. Pandas uses views. Most of Python doesn’t.

But without a real use-case, there’s not a lot to discuss other than the implementation. So we have to discuss the implementation.

1 Like

I think it would be possible to change the Python interpreter to interpret a = b as a = View(b) and f(a) as f(View(a)) no?

And I think I could write a View class that has the properties a view should.

It wouldn’t really be Python anymore, and performance would probably be rather bad compared with languages that are actually designed from the ground up with this aim in mind. But what fundamental problem in Python’s execution model am I missing?


I already gave you use-cases.

@functools.cache
def f(*args) -> list: ...

In normal Python, this is asking for trouble to the point you shouldn’t even consider it. With COW, this would just work. Wouldn’t that be nice :upside_down_face: ?

Then if I say a = None then I have a view to None, but not None itself. Which means that a is None won’t be true any more, because a is not None, it’s a view to None.

This basically breaks everything that you would expect about object identity.

Exactly.

Quite frankly, I don’t know! Why are you caching this function? What does it even mean? Does every call with the same args result in views to the same list? What happens when you have lots of views to the same list? This is so vague that I have NO CLUE what “just work” means.

Clearly you have in mind a set of semantics that have nothing to do with Python’s execution model. I wish you the very best in the development of your programming language.

2 Likes

So can you give me an example of a function that would be “pureread”? I’m unable to think of anything meaningfully useful.

2 Likes

The ones that jump out at me are Python free-threading, caching, reasoning about code, and general concurrency. Now some of this seems obvious to me (still possibly wrong), and I really don’t want to put words in your mouth. What about these use-case categories seem niche or unserious?

If I may be pithy, sounds kinda like the ProxyType I mentioned earlier.

A literal example I’m now working with is that I have some OpenAPIv3 document transforms that I am building for building a CLI with autocomplete using the prompt_toolkit module. I would like checks as I construct these that the source data structures are unaltered. These transforms do not lend themselves to comprehensions, and so a more general check for purity would be helpful in ensuring correct programming. I’d also like to multithread some of these autocompletions for variables when filling in arguments which can be satisfied by file paths on the system, and this would be able to be check for safety if I had these purity checks.

I’m going to repeat my earlier question. Can you give an example of an actual Python function (the code, not a description) that would be pureread() in your terms? It doesn’t have to be complex - something trivial would do as a starting point for discussion. But at the moment, I simply have no idea what you’re proposing because I don’t know what a pureread function would even look like.

You’ve already acknowledged that having any sort of attribute access means that a function isn’t pureread, because properties mean that attribute access can run arbitrary code. And I’m struggling to imagine any sort of meaningfully useful function that doesn’t access object attributes in some way.

1 Like

Using this snippet with the full file below. This code is not yet tested, it is only pseudocode at this time. The main highlight here is for correctness, where is something modifies values through passed in parameters it could be very destructive. A purewrite() test here included as a decorator would catch programming errors.

def resolve_all_references(oa_doc: Dict[str, Dict | List | str]):
    refs: List[List[str]] = all_ref_paths(oa_doc)


    is_purewrite: bool = False
    if is_purewrite:
        oa_doc = deepcopy(oa_doc)

    for ref_path in refs:
        replacement_path: str = ref_path[-1]
        if not replacement_path.startswith("#/"):
            raise RuntimeError("Reference path is either invalid or relative.")
        rp: List[str] = filter(bool, replacement_path[2:].split("/"))
        if not rp:
            raise RuntimeError("Empty reference path.")
        ds = oa_doc
        for key in rp:
            ds = ds[key]

        is_purewrite: bool = False
        if not is_purewrite:
            mod_oa_doc = oa_doc
        else:
            mod_oa_doc = deepcopy(oa_doc)

        for key in ref_path:
            if key == "$ref":
                break
            mod_oa_doc = mod_oa_doc[key]

        mod_oa_doc[rp[-1]] = ds
        del mod_oa_doc["$ref"]

    return mod_oa_doc

From

"""
Utilities needed to transform the OpenAPI documents into a CLI argument
friendly composable structure.
"""

from collections import defaultdict
from copy import deepcopy
from pathlib import Path
from typing import Dict, List, TypeAlias
import json


OAServerBlock: TypeAlias = List[Dict[str, str | Dict]]
ServerEntries: TypeAlias = Dict[str, Dict[str | Dict]]


def never():
    raise RuntimeError("Reached an impossible branch; program error.")


def transform_servers_block(servers: OAServerBlock) -> ServerEntries:
    server_entry_collector: Dict[str, List] = defaultdict(dict)
    entry: dict
    for entry in servers:
        server_entry_collector[entry["description"]][entry["url"]]: {
            varName: [val for val in varVals.values()]
            for varName, varVals in entry["variables"].items()
        }
    return server_entry_collector


def apply_lone_option_substitutions(
    base: str, vars: Dict[str, List[str | float | int | bool]]
) -> Dict[str, List[str | float | int | bool]]:
    for varName, vals in vars.items():
        if base.count(varName) == 0:
            raise RuntimeError(
                f"variable name {varName!r} does not appear in value pattern {base!r}."
            )
        elif base.count(varName) != 1:
            raise RuntimeError(
                f"variable name {varName!r} appears {base.count(varName)} in value pattern {base!r}."
            )
        for val_name, allowedSubsts in vals.items():
            if not isinstance(val_name, str):
                raise RuntimeError()
            for subst in allowedSubsts:
                if not isinstance(subst, str | float | int | bool):
                    raise RuntimeError()

    lone_options: Dict[str, str] = {}
    remaining_options: Dict[str, List[str | float | int | bool]] = {}
    for name, options in vars.items():
        if len(options) == 0:
            raise RuntimeError(
                "'variables' block has no possible values; OpenAPI document is malformed."
            )
        elif len(options) == 1:
            lone_options[name] = options[0]
        else:
            remaining_options[name] = options
    return base.format(**lone_options), remaining_options


def simplify_server_records(se: ServerEntries) -> ServerEntries:
    return {
        desc: {
            apply_lone_option_substitutions(url, vars) for url, vars in url_vars.items()
        }
        for desc, url_vars in se.items()
    }


def all_ref_paths(
    tree: List[Dict | List | str] | Dict[str, Dict | List | str],
) -> List[List[str]]:
    if isinstance(tree, str):
        return [tree]
    elif isinstance(tree, list):
        return [branch for branch in all_ref_paths(tree) if "$ref" in branch]
    elif isinstance(tree, dict):
        return list(
            filter(
                bool,
                [
                    [key, *branch] if "$ref" in [key, *branch] else []
                    for key, value in tree.items()
                    for branch in all_ref_paths(value)
                ],
            )
        )
    else:
        never()


def resolve_all_references(oa_doc: Dict[str, Dict | List | str]):
    refs: List[List[str]] = all_ref_paths(oa_doc)


    is_purewrite: bool = False
    if is_purewrite:
        oa_doc = deepcopy(oa_doc)

    for ref_path in refs:
        replacement_path: str = ref_path[-1]
        if not replacement_path.startswith("#/"):
            raise RuntimeError("Reference path is either invalid or relative.")
        rp: List[str] = filter(bool, replacement_path[2:].split("/"))
        if not rp:
            raise RuntimeError("Empty reference path.")
        ds = oa_doc
        for key in rp:
            ds = ds[key]

        is_purewrite: bool = False
        if not is_purewrite:
            mod_oa_doc = oa_doc
        else:
            mod_oa_doc = deepcopy(oa_doc)

        for key in ref_path:
            if key == "$ref":
                break
            mod_oa_doc = mod_oa_doc[key]

        mod_oa_doc[rp[-1]] = ds
        del mod_oa_doc["$ref"]

    return mod_oa_doc


def read_document(fp: Path) -> Dict:
    with open(fp, "r", encoding="utf-8") as f:
        raw_doc = json.load(f)

    referenceless_components = resolve_all_references(raw_doc)

    (
        transform_servers_block(raw_doc["servers"])
        if "servers" in referenceless_components
        else None
    )


def read_documents(*args: List[Path]):
    return [read_document(fp) for fp in args]

I don’t think you’re understanding the question here.You want a builtin called pureread which will return a boolean. Show us a function for which pureread(func) would return True. Otherwise, the functions you’re asking for are actually really easy - watch:

def pureread(func): return False
def purewrite(func): return False
def pure(func): return pureread(func) and purewrite(func)

There. Done. Unless there are examples where my implementations are wrong.

5 Likes

I mean,

def fib(n): 
    if n <= 1:
        return n
    return fib(n-1) + fib(n-2)

But that seems too trivial to be of interest. Everyone here saw that. So what are you getting after?