PEP 683: Immortal Objects: Updates

Hi all,

@eric.snow and I would like to give some updates on the PEP, following up on the last update we sent out. For context, this is the reference implementation:

Since then, PEP 683 has been accepted with conditions. We've been able to successfully address all of the required conditions that are within the control of the PEP implementation. Some edge cases might not be fully handled yet; we've noted those below, and they are being tracked as separate issues that can be addressed upstream.

PEP Conditions Addressed:

  • Reset refcount in tp_dealloc: Partially done. We've added the tp_dealloc checks to as many of the existing immortal objects as we could. Some cases, such as deep-frozen objects, are not addressed yet. This is being tracked in: Do Not Allow Static Objects to be Deallocated · Issue #101265 · python/cpython · GitHub. Once that issue is resolved, we can fully address this point.

  • Types without tp_dealloc checks may not be immortalized: Partially Done. Refer to the point above.

  • Benchmark results: 1.03x slower geometric mean on the PyPerformance suite using GCC 11.1 (results may vary, especially with older compilers).

Details on Performance Measurements:

The result of 1.03x slower is valid as of 01/28/2023; the measurement compared the immortal hash a748e80 against the baseline hash 666c084. Python was compiled on a Linux machine using GCC 11.1. Note that the last time this was measured, in October 2022, the result was 1.02x slower geometric mean.

Regarding the increase from 1.02x → 1.03x, there are two things to consider. First, the checks added in tp_dealloc carry a slight performance cost that was not included in past measurements. Second, between the start of this PR and today there have been many improvements to runtime performance (particularly in the interpreter). Since IncRef and DecRef have a constant overhead, that overhead becomes more pronounced as the rest of the runtime gets faster. As a result, the improvements made to reduce the cost of immortalization have, so far, mostly served to hold the regression steady, making this a moving target.
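To make the constant IncRef/DecRef overhead concrete, here is a minimal sketch in plain C of what immortality adds to every refcount operation: one extra branch per incref/decref. All names and the magic value are illustrative, not CPython's actual macros.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative object header with only a 32-bit refcount. */
typedef struct { uint32_t ob_refcnt; } obj_t;

/* Illustrative magic value; a refcount at or above it marks immortality. */
#define IMMORTAL_REFCNT (UINT32_C(3) << 29)

static int is_immortal(const obj_t *op) {
    return op->ob_refcnt >= IMMORTAL_REFCNT;
}

/* Py_INCREF-like: the extra check is the constant overhead described
   above; it grows in relative cost as the interpreter gets faster. */
static void obj_incref(obj_t *op) {
    if (is_immortal(op)) {
        return;  /* the refcount of an immortal object is never changed */
    }
    op->ob_refcnt++;
}

static void obj_decref(obj_t *op) {
    if (is_immortal(op)) {
        return;
    }
    if (--op->ob_refcnt == 0) {
        /* deallocation would run here */
    }
}
```

The branch is cheap and predictable, but it sits on one of the hottest paths in the runtime, which is why it shows up as a small geometric-mean regression.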

Future Work:

While there is still a performance regression, more can be done (each option has to be analyzed for its pros and cons). Here is a list of three potential follow-ups that could happen after this PR. I will avoid giving specific numbers on potential improvements, as the end result will vary greatly once all the considerations for each of these opportunities are taken into account.

  1. Make Code Objects immortal and remove the expensive DecRef in the interpreter loop.
  2. Immortalize the startup heap, reducing the overall cost of all the known objects to be alive before any user code is executed.
  3. Create a GC “Gen3”, by immortalizing and moving objects that have lived in Gen2 for a while to the permanent generation. This will improve the performance of large/long-lived applications that end up using a constant set of common objects that live throughout the entire execution of the runtime.
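As a sketch of idea 3, one possible shape for the promotion rule is shown below. Everything here (the names, the survival threshold, the bookkeeping) is hypothetical and not part of the PR; it only illustrates the "survive Gen2 long enough, become immortal and leave the GC" idea.

```c
#include <assert.h>

/* Hypothetical "Gen3" promotion rule: an object that survives enough
   oldest-generation collections is immortalized and removed from GC
   tracking entirely. Threshold and names are illustrative. */
enum { GEN2_SURVIVALS_BEFORE_IMMORTAL = 3 };

typedef struct {
    unsigned survivals;  /* collections survived in the oldest generation */
    int immortal;        /* once set, the GC never visits this object again */
} tracked_t;

/* Called for each survivor of an oldest-generation collection.
   Returns 1 if the object should be moved to the permanent generation. */
static int maybe_promote_to_permanent(tracked_t *t) {
    if (t->immortal) {
        return 1;
    }
    if (++t->survivals >= GEN2_SURVIVALS_BEFORE_IMMORTAL) {
        t->immortal = 1;  /* would also set the immortal refcount here */
        return 1;
    }
    return 0;
}
```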

Unicode Object Leaks:

Currently, the runtime does not strictly guarantee that it does not leak objects (and more specifically, unicode objects). There's already an open bug about it: Leaks on Python's standard library at runtime shutdown. Unfortunately, fixing it is not trivial, as we need to track down the source of 1000+ refcount mismatches.

Under this condition, it is impossible to correctly deallocate immortal string objects at runtime shutdown. This is only an issue for embedded programs that re-initialize the runtime multiple times. That is, if a reference was not released at runtime shutdown (the leak), then cleaning up immortal string objects means those references become invalid in the next runtime initialization, causing a segmentation fault.

Therefore, in the PR's implementation, we have guarded the code that cleans up all the immortal string objects behind an ifdef until the bug is closed and we have a strict guarantee around leaks from the core runtime and standard library.
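The guard looks roughly like this (the macro name and function are illustrative, not the ones in the PR). With the guard off, immortal strings are deliberately left alive at shutdown so that leaked references cannot dangle after the runtime is re-initialized.

```c
#include <assert.h>

/* 1 while interned/immortal strings are still alive. */
static int interned_strings_alive = 1;

static void finalize_immortal_strings(void) {
#ifdef CLEANUP_IMMORTAL_STRINGS
    /* Only safe once the runtime guarantees no leaked references
       survive shutdown; off by default until the leak bug is closed. */
    interned_strings_alive = 0;
#else
    /* Intentionally "leak": a leaked-but-valid object is safer than a
       dangling pointer in the next runtime initialization. */
#endif
}
```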

Instagram Usage:

As an extra note, we have successfully run Instagram on the current upstream patch (sans the recent tp_dealloc changes). We did this in two steps: first, updating only the core runtime (without rebuilding the C extensions against the new headers), and second, recompiling our thousands of extensions. Note that our extensions are used as-is, without explicit use of the Limited API. Even in this scenario, and without the tp_dealloc fixes, we never hit an accidental deallocation of an immortalized object.

Context on Implementation Details:

These are helper notes for people reading through the implementation details who might be puzzled by some of the introduced changes.

  • Include/Python.h: string.h had to be included because we now use memcpy in object.h
  • Modules/gcmodule.c: These changes make sure we correctly handle objects that reach the maximum (immortal) reference count. They are excluded from cycle detection (which requires a correct reference count to work) and moved into the permanent generation.
  • Objects/bytes_methods.c: For some reason, single-character istitle checks were not working in Windows applications with this change. This removes the one special case while maintaining correctness.
  • Lib/test/ Given that we have unicode object leaks (e.g. in import unittest), running this test embedded multiple times will crash the application. By adding manual assertions we keep all the tests passing with the same checks, while avoiding the leaking unittest library.
  • Lib/test/ In address-sanitizer mode, given that we now "leak" due to the unicode problems, check_return causes the test to fail. This just updates the assumption, maintaining the essence of the test until we fix the unicode leaks.
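For the gcmodule.c point above, the behavior can be sketched as follows. This is heavily simplified with illustrative names and values (real cycle detection works on a copied gc_refs field, not the live refcount), but it shows the core decision: an object whose refcount has hit the immortal magic value is diverted to the permanent generation instead of being traversed.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative magic value, mirroring the immortality threshold. */
#define IMMORTAL_REFCNT (UINT32_C(3) << 29)

typedef struct {
    uint32_t ob_refcnt;
    int in_permanent_generation;
} gc_obj_t;

/* Returns 1 if the object was moved out of the collectable set.
   Cycle detection needs accurate refcounts, so saturated (immortal)
   objects must not participate. */
static int gc_visit(gc_obj_t *op) {
    if (op->ob_refcnt >= IMMORTAL_REFCNT) {
        op->in_permanent_generation = 1;  /* never collected, never revisited */
        return 1;
    }
    /* ... normal subtract_refs / traverse logic would run here ... */
    return 0;
}
```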

Thanks for the PEP, I am looking forward to it but I have some feedback.

I don’t understand why we need to reset the refcount in tp_dealloc for immortal objects. If I am reading the PEP correctly, it is because “this only matters in the 32-bit stable-ABI case”. To me that’s a purely hypothetical situation that we should not care about, and IMO it isn’t even possible.

I am quoting from the PEP and will inline my replies.

Hypothetically, such an extension could incref an object to a value on the next highest bit above the magic refcount value. For example, if the magic value were 2^30 and the initial immortal refcount were thus 2^30 + 2^29 then it would take 2^29 increfs by the extension to reach a value of 2^31, making the object non-immortal. (Of course, a refcount that high would probably already cause a crash, regardless of immortal objects.)

Why would anyone do that in the first place? Is there any valid reason for doing that?
Current CPython refcount semantics require every new reference to be backed by an object pointer. Taking the extreme case that each of those increfing objects occupies just 8 bytes (the object header alone, with no GC header and no stored data), 2^29 increfs need 2^29 * 8 bytes, which is the entire 4 GiB virtual address space of a 32-bit system.
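The arithmetic behind that claim, spelled out:

```c
#include <assert.h>
#include <stdint.h>

/* Back-of-the-envelope math from the argument above: 2^29 increfs,
   each backed by even a minimal 8-byte object (header only, no GC
   header, no data), already require the whole 32-bit address space. */
static uint64_t bytes_needed(uint64_t increfs, uint64_t object_size) {
    return increfs * object_size;
}
```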

The more problematic case is where such a 32-bit stable ABI extension goes crazy decref’ing an already immortal object. Continuing with the above example, it would take 2^29 asymmetric decrefs to drop below the magic immortal refcount value. So an object like None could be made mortal and subject to decref. That still wouldn’t be a problem until somehow the decrefs continue on that object until it reaches 0. For statically allocated immortal objects, like None , the extension would crash the process if it tried to dealloc the object. For any other immortal objects, the dealloc might be okay. However, there might be runtime code expecting the formerly-immortal object to be around forever. That code would probably crash.

Again, 2^29 asymmetric decrefs require 2^29 increfs in the first place; otherwise the code is breaking the current refcount semantics and is bound to crash.

I don’t see how either of these is a problem that needs addressing, and definitely not at the cost of a loss in performance.

As an extra note, we have been able to successfully run Instagram using the current upstream patch (sans the recent tp_dealloc changes).

That makes complete sense to me.

This also means that any important destructor/finalizer will not be run (in addition to potentially immortalizing very large data such as a Pandas dataframe). This is a very bad idea IMHO.

We are mostly assuming the worst possible case to ensure maximum compatibility. In some initial versions of this, we did not even address this edge case, assuming it would never happen. However, we added it explicitly since it’s one of the scenarios that @thomas (if I remember correctly) brought up at the 2022 Python Language Summit.

We also follow a similar tagging scheme on 64-bit systems, in which case it would still be valid.

We can always go back and remove this if we decide it doesn’t make sense. For what it’s worth, adding the checks in some of the tp_dealloc functions didn’t regress performance by much, so for now I think we are still fine!

Not necessarily; we can always rework the runtime shutdown mechanism to clean these up! It becomes a bit trickier since we can’t guarantee ordering, but we can probably do some more plumbing to add some topology to the order of destruction (i.e. start with certain kinds of objects first). That way we can still maintain certain guarantees for the execution of finalizers.
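One hypothetical shape for that plumbing: tag each immortal object with a destruction phase and finalize phase by phase at shutdown, so finalizers of one kind of object can still rely on later-phase objects (e.g. strings) being alive. This is entirely a sketch of the idea, not anything in the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative destruction phases; earlier phases are finalized first. */
enum phase { PHASE_USER_OBJECTS = 0, PHASE_MODULES, PHASE_STRINGS, PHASE_COUNT };

static int destruction_clock = 0;

typedef struct {
    enum phase destroy_in;
    int destroyed_at;  /* -1 while alive */
} shutdown_obj_t;

/* Finalize objects phase by phase, recording the order of destruction. */
static void run_ordered_shutdown(shutdown_obj_t *objs, size_t n) {
    for (int p = 0; p < PHASE_COUNT; p++) {
        for (size_t i = 0; i < n; i++) {
            if (objs[i].destroyed_at < 0 && objs[i].destroy_in == (enum phase)p) {
                objs[i].destroyed_at = destruction_clock++;  /* finalizer runs here */
            }
        }
    }
}
```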

On immortalizing very large data structures, that’s a fair concern. That being said, the assumption is that if an object gets to this point, it will probably live for the entire execution of the runtime. This is true in the large applications we’ve observed at Meta. Of course, that is just one data point and not an accurate representation of the world, but at least it’s some proxy.

The important point is that each of these should be considered separately to see whether it makes sense to integrate them into CPython. It may very well be that the answer is no after looking at all the pros and cons :slight_smile: