A massive PEP 649 update, with some major course corrections

This might be the opportunity to consider it then. Otherwise, the current state of Sphinx may influence CPython to implement a vastly more complex system than what is required.

Just to throw an idea out there, what if the runtime annotations made the original indexes into the source file available (which should be on the AST?), but not necessarily the original string? That would require users to make the sources available if they want accurate documentation, but that seems like a reasonable expectation.
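As a hedged sketch of what that could look like today when sources are available (the helper name here is hypothetical), the AST already carries position information precise enough to slice an annotation's text back out of the source:

```python
import ast

def annotation_source(src, func_name, param):
    # Hypothetical helper: recover a parameter's annotation text
    # from module source, using the AST's position information.
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            for arg in node.args.args:
                if arg.arg == param and arg.annotation is not None:
                    return ast.get_source_segment(src, arg.annotation)
    return None

src = "def demo(x: dict[str, int]) -> None: ...\n"
print(annotation_source(src, "demo", "x"))  # dict[str, int]
```

The runtime would only need to keep (filename, line, column) triples alive; the text itself stays on disk.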

Or perhaps there’s some other approach here that would make it sufficiently easy for you to retrieve the original annotation without it necessarily having to be kept alive in memory (or stored independently in cached bytecode files) for all users all the time?

1 Like

Sphinx will adapt to PEP 649 – we would most likely either use the VALUE or HYBRID format of inspect.get_annotations().

From Sphinx’s perspective as long as we are ourselves able to re-constitute something resembling the original annotation we’d be happy.

Sorry if my original comment sounded like I was advocating strongly in favour of the ‘original source code mode’ – Sphinx would consider using it if available, but is ambivalent towards its existence.

A

Yeah. Of course, normally a class or module will itself contain code objects, and you could look at the flags of those code objects to see whether or not a __future__ was active. But now I have to write code that iterates over namespaces looking for functions with code objects. And maybe those callables actually were defined in a different module and were only imported into this one. So I have to try and detect that, etc etc etc.

It’ll be simpler to just take the object, find its module, and look to see whether the module has an attribute named annotations whose value is a __future__._Feature instance.
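A minimal sketch of that heuristic (the function name is made up; it relies on the fact that from __future__ import annotations leaves a __future__._Feature instance bound to the name annotations in the module namespace):

```python
import __future__
import types

def future_annotations_active(module):
    # Heuristic: did this module do `from __future__ import annotations`?
    feature = getattr(module, "annotations", None)
    return isinstance(feature, type(__future__.annotations))

# Simulate a module compiled with the future import active:
mod = types.ModuleType("demo")
mod.annotations = __future__.annotations
print(future_annotations_active(mod))  # True
```

Of course this can be fooled by a module that happens to define its own annotations global, which is part of why it's only a heuristic.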

If this is an opening gambit in a “do we really need STRING format” conversation, we discussed that in the thread. I think the answer is yes.

  • Unlike STRING format, computing HYBRID format can fail. All it really handles are NameError exceptions. But it does that by creating a proxy object, which it substitutes for the real value in the annotation expression. It’s true that our world is a bright, happy place where type hints don’t mind if you give them a proxy in lieu of a real value. And I expect the adoption of 649 will reinforce that going forward. But it remains true that users could annotate something with a live constructor that doesn’t like proxies.
     def foo(a: int(undefined_module.value)): pass
  • As I think I touched on somewhere in my gigantic top post, we know we have to implement HYBRID format. This means we have to implement all the machinery needed to also render STRING format, and FORWARDREF format too in case we wanted it. Forgoing STRING and FORWARDREF formats would hardly allow us to simplify the implementation at all. (The difference between HYBRID format and FORWARDREF format is literally "do we allow __compute_annotations__ to see existing symbols, or do we create proxies for all symbols?")

  • STRING format is my guarantee that all users of PEP 563 “stringized annotations” will find their needs met by PEP 649, no matter their use case. They can still get the stringized annotations they need, and if they require total 563 compatibility, they could even store them back on the object (o.__annotations__ = inspect.get_annotations(o, STRING)). I’ll admit to some anxiety about removing STRING format, as I worry I/we have overlooked some important real-world use case for stringized annotations that requires STRING format or something like it.
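To make the failure mode in the first bullet concrete, here's a toy proxy (a stand-in for the kind of object HYBRID evaluation substitutes for an unresolvable name): typical type hints just store it, but a live constructor called inside the annotation can reject it.

```python
class ForwardRefProxy:
    # Toy stand-in for a name that couldn't be resolved.
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

# Typical type hints just store the proxy, so they don't mind it:
alias = list[ForwardRefProxy("SomeType")]

# ...but a constructor invoked inside the annotation may choke on it:
try:
    int(ForwardRefProxy("undefined_module.value"))
except TypeError:
    print("int() refused the proxy")
```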

1 Like

It was more trying to get to the core of whether we need to be able to accurately reconstruct the original string vs. just preserving it vs. just preserving enough information to find it later.

Either way, it seems we’re going to end up with a massively complicated implementation. I just don’t want the complication to be caused by type checkers (which obtain all their information out-of-band) and documentation generators (which conceivably could obtain all their information out-of-band).

It’s been four days since the last reply in this topic. I’m not sure whether or not people are still voting, but, well, y’all have had plenty of time to vote. I’m going to close the polls, and declare this discussion concluded, approximately 24 hours from now.

Last looks, everybody!

I just closed the polls. Your votes provide a pretty clear signal, and I think the winning votes all make sense. I’m already incorporating those results in the next draft of PEP 649.

Please consider this topic closed. Thanks for your input, everybody!

2 Likes

I find that surprising. I assumed documentation users would want STRING format, which in the latest revision I renamed to SOURCE format. And here you’ve chosen exactly the opposite :wink:

I’m curious both directions:

  • Why wouldn’t Sphinx (et al) just ask for the annotations in SOURCE format and paste them directly into the relevant spot in the docs?
  • What use would Sphinx have for real values, possibly sprinkled with ForwardRef proxies?

In the latest PEP revision I asserted that documentation use cases were most likely to use SOURCE format. Maybe I need to update that!

2 Likes

The answer to both is that we want to cross-reference the annotation type to the documentation entry for said object – for example within this page, the types of the arguments are linked back to the definition site (either within the same set of documentation or by using intersphinx).

I could see pydoc (or similar tools without automatic cross-referencing functionality) using the SOURCE format, though, if it simply wishes to faithfully represent the annotation and do nothing more with it.

A

2 Likes

If accepted, this PEP would supersede PEP 563, and PEP 563’s behavior would be deprecated and eventually removed.

I don’t want to deprecate PEP 563 when PEP 649 is accepted.

  • PEP 563 poses no problem for static typing users. They use from __future__ import annotations for better performance and/or forward compatibility with new typing features (e.g. str | int).
  • We have not measured the memory / GC overhead of PEP 649 yet. PEP 563 is really lightweight, and I do not know whether PEP 649’s overhead is negligible.

I think deprecating PEP 563 should be discussed after Python 3.11 reaches EOL and everyone can use PEP 649. At that point, many users will be able to compare PEP 563 and 649 for real-world use cases.

2 Likes

On the contrary, PEP 563 is causing problems for static typing users, because they often become runtime annotation users too. Please see the revised PEP and the problems 563 is causing, e.g. problems with TypedDict and dataclasses.

Keeping 563 means, at the very least, an ongoing maintenance headache for wrappers, which are themselves runtime annotation users, used by static typing users.

Although we don’t have a modern implementation of 649 (yet) to experiment with, experiments with my most recent co_annotations (Apr 19 2021) should be reasonably informative. The runtime cost of 649 is a new code object per annotated object, possibly enclosed in a tuple with a reference to either a class dict, a closure tuple, or both. That’s true in the old branch, and it won’t change much in the final implementation. (We add one parameter, and if <fastlocal> != 1: raise NotImplementedError() boilerplate, to every __annotate__ function generated by the compiler.)

If 649 is judged to be too resource-intensive, we already know of ways to address that. For example, we’ve talked for a while about adding a lazy-loading feature for .pyc files; it’s not difficult, we just haven’t needed it enough to actually do it. So for example we could make the __annotate__ code object lazily-loaded, which would greatly reduce its runtime cost when __annotate__ is never examined.

Ongoing work on Python will make 649 cheaper, too. Perhaps the Faster CPython team will make code objects smaller; this would benefit everybody, including 649. (I dimly recall they had an idea on how to do so, but I’m not sure that memory is correct, and I don’t remember any specifics.)

3 Likes

It seems the “documentation” use case called out in the PEP has uses for all three formats! We should update the PEP to reflect this.

1 Like

Yes. Then users need to remove from __future__ import annotations from the modules where they use runtime types.

What I really like about PEP 563 is that it creates zero GC objects. I cannot estimate how many GC-tracked objects and references will be created by PEP 649 before Python 3.12 becomes beta.

I am proposing to deprecate PEP 563 after we confirm that PEP 649’s overhead is negligible.
If lazy loading or other optimizations reduce PEP 649’s overhead, we can deprecate PEP 563 at that time instead of in Python 3.12.

2 Likes

So you contend that PEP 563 should be kept forever, for the limited number of users who have forward reference and circular reference problems with annotations, but who never make any use of annotations at runtime, and solely because it’s currently more efficient at runtime to do so? I don’t agree that that’s a good idea.

Historically, Python’s experiments with “here’s an alternate way to do an existing thing, because of performance” (e.g. __slots__) often become outmoded. But by adding them to the language we become obligated to support them forever. This causes a continual trickle of problems we collectively have to contend with, when the quirks of these infrequently-used alternate approaches collide with new code and new language technologies. For example, dataclasses was added in Python 3.7, but didn’t support __slots__ until Python 3.10.

I’m gonna quote the Zen here:

There should be one-- and preferably only one --obvious way to do it.

If we’re going to go with 649, I definitely think we should deprecate and remove 563 too. I believe if 649 causes resource issues we can fix them.

Why not? As I mentioned, the basic mechanism of PEP 649 hasn’t changed since my old co_annotations branch. Surely you could make your estimates based on that. PEP 649 adds:

  • a new code object (GC-tracked)
  • possibly a reference to the existing class dict (not created by 649)
  • possibly a reference to the closure tuple (GC-tracked, may or may not already be needed by other objects)
  • possibly a tuple to store them in (GC-tracked)

So this is at least one but no more than three new GC-tracked objects per annotated object created by PEP 649.

Isn’t the tuple used to store the alternating key/value strings in the modern 563 implementation a GC-tracked object?

1 Like

Oh, I didn’t say “never”. I am just saying Python 3.12 is too soon.

I am proposing to deprecate PEP 563 after we confirm PEP 649 is acceptable for most users, including WASM, RasPi, and huge scale server applications.

If the overhead of PEP 649 is small enough, it’s OK to deprecate PEP 563 in Python 3.13.
If the overhead of PEP 649 is not small enough, PEP 563 should be kept until some optimization lands.

I never said never.

Because there are only 2 weeks before 3.11 becomes beta. I do not have enough time to evaluate the reference implementation by then. That’s why 3.11 is too early for me.

Tuples containing only constants are not GC tracked. PEP 563 has zero GC time overhead.

Additionally, identical constants in a module are merged into one constant by the compiler. So PEP 563 doesn’t even use one tuple per annotation. It is very lightweight.
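This is easy to observe from Python (behavior as of current CPython: the collector untracks tuples whose contents are all untrackable atomic values, like the strings PEP 563 stores):

```python
import gc

# A tuple built at runtime starts out tracked by the cyclic GC...
t = tuple(str(i) for i in range(4))
print(gc.is_tracked(t))   # True

# ...but after a collection pass, a tuple containing only atomic values
# is untracked, so it adds no ongoing GC traversal cost.
gc.collect()
print(gc.is_tracked(t))   # False
```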

3 Likes

I finally read the PEP. Great work! It all makes sense to me, and allows future extension where appropriate.
I do have a few thoughts:

  • 2 (exported as inspect.SOURCE)

Values are the text string of the annotation as it appears in the source code. May only be approximate; whitespace may be normalized, and constant values may be optimized.

Perhaps explicitly say that the approximation may change in future Python versions.

this PEP defines a new __locals__ attribute on functions. By default it is uninitialized, or rather is set to an internal value that indicates it hasn’t been explicitly set. It can be set to either None or a dictionary.

Will del f.__locals__ reset it to that internal value?
What does None do here?

Format identifiers are always predefined integer values.

  • inspect.VALUE = 1
  • inspect.FORWARDREF = 2
  • inspect.SOURCE = 3

This might be heading directly to the Rejected Ideas section, but: what’s the advantage over using strings, like inspect.VALUE = 'VALUE'? Strings are great when debugging, since the meaning remains clear.
Are interned strings much more expensive than ints? Is it too expensive to store the values in .pyc?

1 Like

Thank you for the kind words!

Sure. It might even happen :wink:

I admit I foolishly assumed users would see the dunders and get scared off. I think you’re right that we should work out the semantics more thoroughly.

In general, this is a place to attach a dict to a function so that the dict is used as the “slow locals” namespace. I assumed users wouldn’t use it much, because it’s hard to get your hands on a code object that interacts with slow locals.

To answer your specific questions, what I had in mind was, setting it to None was equivalent to having it unset / deleted. The mysterious “internal value that represents an uninitialized state” I had in mind was that the new func_locals field on the PyFunctionObject would be preinitialized to NULL. And, I assumed (but didn’t specify) it’d behave like other mutable fields on function objects defined on the C side, e.g. __doc__, where deleting it is equivalent to setting it to None (and I believe the internal func_doc is reset to NULL in both those cases). If the semantics I spelled out for __locals__ don’t match that, I should fix the PEP.

You make an interesting point. Interned strings aren’t really any more expensive than ints today. And in my personal code I often use strings for these sorts of ad-hoc quick enums. I picked ints simply because I knew they were the cheapest possible thing. Every compiler-generated __annotate__ function is going to start with the same boilerplate:

if format != VALUE:
    raise NotImplementedError()

and I wanted that to be as cheap as possible. But your suggestion of an (interned) string is entirely reasonable; the code is the same length, and the CPU time spent to compute it would be nearly the same. (When format is 'value', it would be exactly the same time, and when they’re not the same… maybe after identity fails it’d compare the hashes and fail quickly? I’m too lazy to go check; it’s such a chore wading through the current implementation of unicode objects.)

Spinning up a ret-conned argument :wink: future Python implementations may add optimizations around small integers. For example, they might have a LOAD_SMALL_INT bytecode that treats the oparg as a (signed) int, and pushes it on the stack. Or they might work with unboxed native integers directly, as I believe JITted code in PyPy does. Such optimizations might make __annotate__ functions even cheaper, if the format values are small integers.

But this isn’t a strong argument, and I don’t have a strong conviction beyond “format values should be cheap constants”. I see the advantage of using hard-coded strings here and I wouldn’t mind switching. Do you / does anybody feel strongly about this?

Would you consider simply disallowing assignment to this attribute? We can always allow assignment later if a use case comes up, but leaving it read-only now simplifies the interface and makes it so we don’t have to think about the semantics.

See PyMemberDef for the built-in way of doing this: T_OBJECT_EX. It sets the pointer to NULL on del, and raises AttributeError when it reads a NULL.

IOW, generally the best way to represent an “unset” state is NULL in C and missing attribute in Python.

(There’s also the historical T_OBJECT, which also sets the pointer to NULL on del, but translates NULL to None on read. That was deemed inferior.)

Well, they might also add a LOAD_COMMON_CONST. See faster-cpython/ideas#577.
(Edit: or this could compile to CALL_INTRINSIC!)

Optimizations will follow usage. IMO, the PEP should focus on a good API rather than chasing microseconds on current CPython.

2 Likes

Another option could be to define an IntEnum in inspect:

import enum

class AnnotationFormat(enum.IntEnum):
    VALUE = 1
    FORWARDREF = 2
    STRING = 3
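One nicety of IntEnum (shown here with a standalone copy of the class so the snippet runs on its own) is that members compare equal to plain ints, so compiler-generated code could keep emitting the bare constant:

```python
import enum

class AnnotationFormat(enum.IntEnum):
    VALUE = 1
    FORWARDREF = 2
    STRING = 3

# Members interoperate with plain ints, so a `format != 1` guard in
# generated code still matches AnnotationFormat.VALUE:
print(AnnotationFormat.VALUE == 1)                          # True
print(AnnotationFormat(2) is AnnotationFormat.FORWARDREF)   # True
```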

Well, I already need to set it in “user” code. The “fake globals” environment for computing SOURCE and FORWARDREF formats will rebind the code object to a new function object with synthetic globals / locals / closure.

I hadn’t thought about it, but maybe it would make sense to add a locals parameter to the function object constructor (func_new). If we did that then maybe we could make the attribute read-only.

That does look nice. I admit I wasn’t aware of it, but also, the function object doesn’t use it. __doc__ on a function object is just a garden-variety “member” (flags are 0). Now that I look more closely I don’t know who’s handling the “del on func.__doc__ sets it to NULL internally” thing… unless T_OBJECT_EX is equivalent to 0?

I don’t think that would work, because the VALUE format constant has to be emitted by the compiler. With the current state of the art in CPython, the enum would have to be implemented in C, and the compiler would have to know how to emit a reference to it as it wrote out __annotate__ code objects. At the very least, this approach would make somebody (the module containing at least one __annotate__ function? the body of the __annotate__ functions themselves?) import the runtime value of VALUE. For simplicity’s sake, and to conserve resources, I think the format values should be marshal-able constants.