Lazy load of strings

I know that I pushed this topic 3 years ago and a number of people piped in that it was a good idea. Let's try once again to press this forward.

REMOVED SPECIFIC USE CASES AS IT WAS DISTRACTING
EDITED TO REMOVE A MUTILATION FROM C API TO PYTHON
EDITED TO REMOVE A MUTILATION OF THE PROPOSAL INTRODUCED IN AN AI CLEANUP PASS BY ROLLING BACK TO AN EARLIER DRAFT

PEP: XXXX
Title: Lazy-Loading Strings for Improved Efficiency in Python
Author: [Your Name]
Status: Draft
Type: Standards Track
Created: YYYY-MM-DD
Python-Version: TBD

Abstract

This PEP proposes the introduction of lazy-loading strings in Python by
extending the existing PyUnicodeObject structure. Lazy-loading allows strings
to be initialized with a loader function that defers their full construction
until explicitly accessed. This feature improves memory efficiency and reduces
initialization costs for applications that handle large strings conditionally.

Lazy-loaded strings behave identically to regular Python strings once evaluated,
ensuring seamless integration with Python’s str type and string-related
operations.

Motivation

Lazy-loaded strings provide a mechanism to defer the initialization of string
data until explicitly accessed. This is particularly valuable in scenarios where
memory usage and performance are critical, such as:

  • Language bridges that pass strings between environments.
  • GUI toolkits with conditionally displayed dynamic content.
  • Large-scale data processing with sparsely accessed strings.
  • Web applications generating dynamic responses.
  • Machine learning pipelines with text-heavy datasets.

By introducing lazy-loaded strings, Python can offer developers an efficient way
to handle string-heavy workloads without unnecessary overhead.

Any kit for which Python strings are immutable and can have pass-through semantics could potentially benefit from lazy loading. These include Java bindings (JPype, PyJNIus, Chaquopy), C# bindings (Pythonnet), ObjC bindings, GUI kits (Qt), and other potential pass-through libraries.

Specification

Overview

Lazy-loaded strings are a new type of string object that defers the computation
or loading of its value until accessed. They improve performance and memory
efficiency when working with external libraries or systems requiring deferred
string evaluation.

API Definition

A new API will be introduced to create lazy-loaded strings:

.. code-block:: C

    /*
     * Creates a lazy-loaded string from a loader.
     *
     * Parameters:
     *     loader: A Python proxy object which implements the ``__str__``
     *         interface and will produce a string when needed.
     *
     * Returns:
     *     A PyObject representing a lazy-loaded string that defers evaluation
     *     until accessed, or NULL if an error occurs.
     */
    PyObject* PyUnicode_FromLazyLoader(PyObject* loader);

Key Points:

  • Loader Function:
    • The loader parameter is a callable that takes no arguments and returns
      a string.
    • The loader function is invoked only when the lazy-loaded string is accessed
      for the first time.
  • Return Type:
    • The returned object is a lazy-loaded string, which behaves identically to a
      regular Python string once evaluated.

Points to consider: We may need to use something other than __str__ unless the representation is already knowable at creation time. The loader is to fill out the contents and not mutilate the other fields of the representation. Consider adding a representation enum so that the memory model can be defined at creation time.

Behavior

Lazy-loaded strings exhibit the following behaviors:

  • Deferred Evaluation:
    • The loader function is not invoked until the string is accessed (e.g., via
      str(), len(), slicing, or string methods).
    • Until evaluation, the lazy-loaded string occupies minimal memory, storing
      only the loader function.
  • String Operations:
    • Lazy-loaded strings support all standard string operations (len(),
      slicing, concatenation, etc.).
    • Accessing or using a lazy-loaded string triggers evaluation, after which it
      behaves like a regular string.
  • Caching:
    • Once evaluated, the string value is cached within the lazy-loaded string
      object for subsequent access.
    • The loader function is not invoked again.
  • Error Handling:
    • If the loader function raises an exception during evaluation, the lazy-
      loaded string becomes invalid, and subsequent access will raise the same
      exception.
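
The behaviors listed above can be modeled in pure Python. The following is a minimal sketch of the intended semantics only, not the proposed C implementation; the class name LazyStrModel and its value() method are hypothetical and do not appear in the proposal.

```python
class LazyStrModel:
    """Pure-Python model of the proposed semantics (hypothetical names)."""

    _UNSET = object()

    def __init__(self, loader):
        self._loader = loader        # zero-argument callable returning a str
        self._value = self._UNSET    # cached result once evaluated
        self._error = None           # cached exception if the loader failed

    def value(self):
        # A failed load is remembered: every later access raises it again.
        if self._error is not None:
            raise self._error
        if self._value is self._UNSET:
            try:
                result = self._loader()
                if not isinstance(result, str):
                    raise TypeError("loader must return str")
            except Exception as exc:
                self._error = exc
                self._loader = None
                raise
            self._value = result
            self._loader = None      # the loader is never invoked again
        return self._value

    def __len__(self):
        return len(self.value())

    def __getitem__(self, key):
        return self.value()[key]
```

In this model, creating the object does no work; the first len() or slice runs the loader, and later operations reuse the cached value.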

Edge Cases

  • Never Accessed:
    • If a lazy-loaded string is created but never accessed, the loader function
      is never invoked, and no string value is computed.
  • Thread Safety:
    • Lazy-loaded strings ensure thread-safe evaluation by locking the loader
      function during the first access. This prevents race conditions when the
      string is accessed concurrently.
  • Compatibility:
    • Lazy-loaded strings are fully compatible with Python’s existing string APIs
      and modules. No changes are required in existing modules to support lazy-
      loaded strings.
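
The thread-safety point can be illustrated with a lock around the first evaluation. This is a sketch under assumptions: the helper name LockedLazyValue is hypothetical, and the lock-free fast path relies on attribute reads and writes being atomic, as they are in CPython with the GIL. A real implementation inside the Unicode object would need its own analysis, particularly for free-threaded builds.

```python
import threading


class LockedLazyValue:
    """Runs a loader at most once, even under concurrent access (hypothetical helper)."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._loaded = False
        self._value = None

    def get(self):
        if not self._loaded:             # fast path: no locking once loaded
            with self._lock:
                if not self._loaded:     # re-check after acquiring the lock
                    self._value = self._loader()
                    self._loaded = True  # set last, after the value is stored
        return self._value
```

One design consequence worth noting: with a non-reentrant lock, a loader that touches its own lazy string during evaluation would deadlock, so the PEP may need to say what happens in that case.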

Implementation Details

Lazy-loaded strings will be implemented as a new type within CPython, extending
the PyUnicode type. The following changes will be made:

  • Modify Internal Type:

    • Modify the existing internal type, PyUnicode, to include the fields needed for loading.
    • This type will store:
      • A reference to the loader function (PyObject*) which will add relevant contents in
        an agreed upon representation (8, 16 or 32 bit)
      • A flag indicating whether the string has been evaluated.
      • The cached string value (if evaluated) as the pointer to the outside memory.
  • Evaluation Logic:

    • When any string operation is performed, a macro will check the string representation for null and, if it is missing, consult the loader to fetch the string.
      The resulting value is cached.
    • tp_repr should not trigger a lazy load, to avoid potential conflicts with debugging. Instead it will give <lazy string %s>, where %s is the loader type.
  • Integration:

    • Lazy-loaded strings will seamlessly integrate with Python’s existing string
      handling mechanisms.
    • No changes will be made to existing modules or APIs.

Output:

.. code-block:: text

    Before access
    Evaluating string...
    12
    After access

Output:

.. code-block:: text

    Error: Failed to load string!

Performance Considerations

Lazy-loaded strings reduce memory usage and improve performance by deferring
string evaluation until needed. Benchmarks will measure:

  • Memory usage reduction in scenarios involving large strings.
  • Performance improvements when interacting with external libraries requiring
    deferred evaluation.

Preliminary tests suggest a significant reduction in memory usage for workloads
involving large strings, such as log processing tools or machine learning
pipelines.

Security Considerations

Lazy-loaded strings do not introduce new security risks, as they rely on user-
provided loader functions. Developers must ensure that the loader function is
safe and does not execute malicious code. Additionally, thread safety is ensured
during evaluation to prevent race conditions.

Backward Compatibility

This proposal does not require changes to existing modules or APIs. Lazy-loaded
strings are fully compatible with Python’s str type and string-related
operations. Existing codebases will not be affected by this feature.

Open Questions

  1. Features:
    • Should lazy-loaded strings support additional features, such as lazy-loaded
      bytes or other types?
  2. GC Interactions:
    • How should lazy-loaded strings interact with garbage collection, especially
      when the loader function holds references to external resources?
  3. Performance Impact:
    • What is the overhead of checking for an additional flag during string
      operations, and how can it be minimized?

References

  • PEP 393: Flexible String Representation
  • PEP 573: Module State Access

This looks like a relatively well thought-out lazy object.

My question is: why is this a PEP, rather than a library?

If I understand your proposal correctly, you introduce a class

class LazyLoadedString(str):
  __slots__ = ("value", "loader")
  def __new__(cls, loader):
    # str is immutable, so the (empty) underlying value is fixed here
    return super().__new__(cls)
  def __init__(self, loader):
    self.loader = loader
    self.value = None
  def __str__(self):
    if self.value is None:
      self.value = self.loader()
    return self.value
  def __len__(self):
    return len(str(self))
  def __getitem__(self, key):
    return str(self)[key]
  ...

and you only need to repeat this pattern for literally all the string methods.

If you overwrite all 71 (?) string methods this way, I don’t see how anything could go wrong. But you’d have to override the .__repr__ method too which means that if things do go wrong you’re in debug hell.


The goal is to have support in the Python language for lazy loading because there were many interested parties. As far as I am aware, extending str won’t work that way. Like Python ints, the string type is immutable outside of the __new__ method. And intercepting every method from Python is hard. As with exceptions, there are many C-side API calls that simply skip looking at the vtables, meaning you can define the entry point but you can’t make Python use it.
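
A small experiment shows the kind of bypass meant here. It is only an illustration of current CPython behavior for one hypothetical subclass (Shadow is a made-up name): the override of __str__ is honored by str(), while concatenation and join read the underlying buffer directly.

```python
class Shadow(str):
    """A str subclass whose real text is only reachable through __str__."""

    def __new__(cls, text):
        self = super().__new__(cls)   # the underlying str buffer stays empty
        self._text = text
        return self

    def __str__(self):
        return self._text


s = Shadow("payload")
via_override = str(s)          # calls the overridden __str__
via_concat = "x" + s           # str.__add__ reads the empty underlying buffer
via_join = "-".join([s, s])    # str.join does the same

print(repr(via_override), repr(via_concat), repr(via_join))
```

Here via_override is "payload", but via_concat is just "x" and via_join is just "-", because neither operation ever consults the subclass.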

As for repr, yes, the PEP should be updated. A repr call likely should NOT trigger a lazy load if the string is not already available.
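
A sketch of what a non-loading repr could look like, using the placeholder format from the Implementation Details section. The class LazyReprModel is a hypothetical pure-Python stand-in, not the proposed C type.

```python
class LazyReprModel:
    """Model of a lazy string whose repr never triggers the loader (hypothetical)."""

    def __init__(self, loader):
        self._loader = loader
        self._value = None

    def __str__(self):
        if self._value is None:
            self._value = self._loader()
        return self._value

    def __repr__(self):
        if self._value is None:
            # Not loaded yet: describe the loader type instead of loading.
            return "<lazy string %s>" % type(self._loader).__name__
        return repr(self._value)
```

With a plain function as the loader, repr() gives a placeholder naming the loader type before the first access and the ordinary quoted string afterwards.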

Do you have a link to that previous discussion?

This is incorrect. PyObjC transparently translates NSString instances to a subclass of str.

IMHO it would be better to look into a protocol that allows implementing string-like types, just like the __index__ method allows implementing an integer-like type.

Two reasons for that related to PyObjC:

  1. Objective-C strings come in two flavors: NSString and NSMutableString; the latter is mutable. Both are “class clusters” in Objective-C, with instances being instances of subclasses of these classes, with no way to detect at runtime whether an instance is mutable. PyObjC represents both with the same subtype of str, and that gives some weird behaviour when a mutable string is actually mutated.

    I’d expect a similar challenge with C++'s std::string, which is a mutable type and is hence semantically not a perfect match with Python’s str.

  2. Representing “foreign” strings as subclasses of str will mess up the sub typing relationships with the “foreign” base class. E.g., objc.pyobjc_unicode is not a subclass of PyObjC’s NSObject proxy, even though NSString is a subclass of NSObject in Objective-C.

Finally, a protocol might reduce the need to actually copy data (similar to how the buffer protocol can allow access to buffers without copying data). Whether or not that’s truly useful depends on the use cases and the representation of both Python and “foreign” strings.
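
For comparison, here is how the existing __index__ and __fspath__ protocols work, followed by a sketch of what an analogous string protocol might look like. The as_str() function and the __as_str__ method are invented for this illustration; nothing by those names exists in Python today.

```python
import operator
import os


class Offset:
    """Integer-like: accepted wherever an index is expected."""
    def __init__(self, n):
        self._n = n
    def __index__(self):
        return self._n


class ForeignPath:
    """Path-like: os.fspath() asks for the str form on demand."""
    def __init__(self, raw):
        self._raw = raw
    def __fspath__(self):
        return self._raw


def as_str(obj):
    """Hypothetical string protocol modeled on os.fspath()."""
    if isinstance(obj, str):
        return obj
    method = getattr(type(obj), "__as_str__", None)
    if method is None:
        raise TypeError("expected str or an object with __as_str__")
    result = method(obj)
    if not isinstance(result, str):
        raise TypeError("__as_str__ must return str")
    return result


class ForeignString:
    """String-like proxy that is not a str subclass (hypothetical)."""
    def __init__(self, fetch):
        self._fetch = fetch
    def __as_str__(self):
        return self._fetch()   # conversion happens only when asked for
```

The proxy keeps its place in the foreign class hierarchy and converts only when a consumer calls as_str(). The cost is that consumers must opt in, which is the compatibility concern raised elsewhere in this thread.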


Thanks for the clarification. The AI that was searching for use cases gave me the wrong information on that one.

I should have asked it to filter only on types that have immutable strings.

I only have the mailing list archived on my local machine, as it dates from the list-service days. I can try to find the thread title and search the Python archives for it.

I will edit this post with the details when I turn it up. The original issue came up in a solicitation for evaluation of the CPython API circa 2020. It was revised by the working group in 2023. It was never shot down, but rather not given priority because the proposer (me) was unable to work on it due to an employer-based issue that has since been resolved. The Department of Energy refused to sign the Python contributor contract. I waited three years, asked again, and got a different answer.

Please don’t rely on AI for justifications for a proposal. If you want to use it for your own research that’s fine, but you need to verify its claims for anything you use to support your proposal - ideally, with links to supporting documentation (because if you can’t find any such documentation, how do you know the AI didn’t just hallucinate?)


I originally asked it to summarize a bunch of conversations that I had notes on from 5 years ago. If you follow the 5-year-old thread, I believe the same mistake was made in the original. Thus the AI failed to correct my already faulty notes when I asked it to freshen the research. It did add original material that will need to be verified before this goes from the ideas page to an active PEP; I missed that when I was reviewing its work. I spent two hours correcting hallucinations and directing it as to why one implementation worked and another failed, but I am only human, so I missed screening for immutability when it pulled the final list together.

However, I think that it is best if the experts on the other kits chime in rather than me investigating and drawing the wrong conclusions as to where this applies. I would have made the exact same mistake for ObjC given my reading of their specification. They have immutable string types, and it was brought up in prior conversations. The fact that one can’t tell an immutable from a mutable string at run time would be way too subtle for someone who doesn’t know the language details at that depth.

The other one on the list I suspect is a hallucination is the Torch reference. Unfortunately I don’t have much use of that kit beyond my numerical processing. It is clearly a pass through language but as they generally are fine with forced conversions to move between CPU and GPU, I was in favor of tossing that.


Here is the original conversation where PyObjC piped in. They mention eager conversion to a subtype, meaning they would benefit from a delayed load, but also that theirs is mutable. I would have implemented it with a lazy load and added an iscurrent() to check for mutation. I would then have had the str() method return a fresh string on mutation. But that is up to their implementation team. I tagged that as a “positive response” in my notes and asked the AI to include it for discussion.

I have a similar problem in PyObjC which proxies Objective-C classes to Python (and the other way around). For interop with Python code I proxy Objective-C strings using a subclass of str() that is eagerly populated even if, as you mention as well, a lot of these proxy objects are never used in a context where the str() representation is important. A complicating factor for me is that Objective-C strings are, in general, mutable which can lead to interesting behaviour. Another disadvantage of subclassing str() for foreign string types is that this removes the proxy class from their logical location in the class hierarchy (in my case the proxy type is not a subclass of the proxy type for NSObject, even though all Objective-C classes inherit from NSObject).

I primarily chose to subclass the str type because that enables using the NSString proxy type with C functions/methods that expect a string argument. That might be something that can be achieved using a new protocol, similar to operator.index or os.fspath. A complicating factor here is there’s a significant amount of Python code as well that explicitly tests for the str type to exclude strings from code paths that iterate over containers.

Ronald


Then don’t include it. All you achieve by including it and admitting to it is make it so that we can’t trust anything in your original post. We don’t know what you have and haven’t manually checked, so even if we trust you, we can’t trust the text of the PEP. Use AI for research, but make sure (especially for a PEP or pseudo-PEP) that every word you put out to the world would match exactly what you would have done as well.

This probably also explains why you are using C-like semantics and C-like names written in python code - which I would suggest you change.


I have deleted the specific use section as it clearly is distracting. All of the others are justified by numerous conversations in the past. Whether any particular kit chooses to use lazy loading or eager loading, or whether it is better to present a lazy load of the current state and refresh on a mutable string, is entirely up to the library developer.

I used an AI both for help in writing and to locate potential users for the API in large part because I have a severe reading disability (hence my need to edit my posts typically 5 or more times.) I felt incorrectly that having a clear statement of purpose and a complete list of potentially interested parties to start a conversation would be of aid, but clearly I am mistaken.


Crud, it changed my C API to a Python one on the last iteration. The unfortunate thing about severe dyslexia is that I see what I think is present and not what is written; only by having it read out loud, having it pointed out specifically, or waiting a few days to reread will I see it. I literally had to go over the PEP line by line twice before I could see what you were referring to. I will go fix it.

Thanks.

The choice of whether to implement a derived type or to extend the behavior of the Python base type depends entirely on the level of compatibility required. The goal is for a lazy-loaded string to pass the checks in the C API of any module it is handed to, without modification of that module. To achieve that goal, it would be best if the base type of the Python string representation supported this functionality.

I found the previous draft before the AI lost the point of the conversation and modified the PEP to be more accurate to the original intent.

Editing the proposal like you have done here is generally frowned upon here because it makes it hard to read and follow the thread (e.g. my previous response reacts to text that is no longer in the proposal).

Removing the specific use cases makes it harder to reason about the usefulness of this feature though. The list of scenarios you mention in the motivation section is too vague for this.

BTW. According to this page Qt’s QString is mutable.

Have you tried implementing your proposal? The string type has a fairly complex representation and implementation, which could make implementing this hard. The string type also is used a lot, changing its implementation like this could have a negative impact on the performance of code that doesn’t use the new mechanism.


If there are interested parties I would be happy to give it a shot. The last three proposals I made were ignored even though a full implementation was presented (fast type checking for single inheritance, moving user data to the front of the object, and a general layout for private data), because while I can implement code, writing docs and PEPs is a large challenge.

That is why I started with the sample PEP this time even though it takes me twice as long to write a single page of text as implementing the whole idea.

Is there a better forum in which to work through a PEP and hammer out details? My preferred format is a GitHub-style review page where people can comment on specific trouble spots and we work to get it into a final form. This forum, where I either have to edit and mark that edits were made, or repost, which then means it is not clear what version is being commented on, really seems like the wrong type of tool.

With regard to the mutability of strings: Qt was specifically mentioned because of this forum post.

Qt and proxy loading

Just to add another use case…

PyQt (the Python bindings for Qt) has a similar issue. Qt implements
unicode strings as a QString class which uses UTF-16 as the “native”
representation. Currently PyQt converts between Python unicode objects
and QString instances as and when required. While this might sound
inefficient I’ve never had a report saying that this was actually a
problem in a particular situation - but it would be nice to avoid it if
possible.

It’s worth comparing the situation with byte arrays. There is no problem
of translating different representations of an element, but there is
still the issue of who owns the memory. The Python buffer protocol
usually solves this problem, so something similar for unicode “arrays”
might suffice.

Phil

They listed this as a nice-to-have feature. Given their support as of 5 years ago, I assumed that they were in favor.

This falls into some of the use cases here.

  1. Places where Python is acting as a pass through to reference a string in another language. (lazy saves a lot of waste).
  2. Places where Python and the other kit have fully compatible concepts of strings for which lazy loading is the best. (lazy saves waste if Python doesn’t access the string otherwise is a wash.)
  3. Places with incompatible contracts for which the primary use is to get the current value. Those wrappers would need handling to get a fresh copy on demand, or the expectation is that the user will request a new version every time they need it. (Currently eager loaded; benefits if use case one is also in play.)
  4. Others in which there is no sense in ever doing a conversion, as it is always a mutable string requirement. In those cases a proxy object with a fetch on every load is better.

Just because something is mutable doesn’t mean it isn’t currently mapped into a Python string. The only difference is that currently all are forced to be eager if you want to map to a string. That is a poor model for something like Java, where a large string can be passing through. Java strings don’t share a common format with Python at all, as they are UTF-8 encoded UTF-16 strings (yeah, they goofed badly in Java 1.0 and just stuck with it). Thus everything must go through a translation layer. At some point I was writing a parser in which tokens were passing from Java through Python back to Java, and discovered that most of the time was spent just decoding and setting up Python Unicode objects only to throw them away. Hence this request.

I am working on the implementation. As per the suggestions, the lazy loader will take a buffer type. If a macro check finds that a load is needed, the lazy loader will request a read-only buffer. Here is where there is a decision to make. We either force the user to declare the buffer type at allocation OR we look at the buffer’s item size to decide what type of deferred string is being created.

The two have different merits. If we force declaration, then a number of Unicode functions don’t need checks, as they can use the prestored values to give the type. On the other hand, having the buffer decide later may require less logic. Also, if the buffer and the declared type mismatch, we can be in trouble. As noted, Unicode is complex and may be fragile, so it is a hard choice as to what will be safest.
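
To make the second option concrete, here is a rough Python model of choosing the string kind from the buffer's item size. It is a sketch under assumptions: the function name text_from_fixed_width is hypothetical, native byte order is assumed, and the real work would happen in C against the Unicode object's internal layout rather than by decoding.

```python
import sys
from array import array


def text_from_fixed_width(buf):
    """Build a str from a buffer of 1-, 2-, or 4-byte code units (hypothetical helper)."""
    view = memoryview(buf)
    order = "le" if sys.byteorder == "little" else "be"
    codecs_by_size = {
        1: "latin-1",            # one byte per code point
        2: "utf-16-" + order,    # note: this codec also accepts surrogate pairs,
                                 # so it is looser than a strict 16-bit kind
        4: "utf-32-" + order,
    }
    try:
        codec = codecs_by_size[view.itemsize]
    except KeyError:
        raise ValueError("unsupported item size: %d" % view.itemsize) from None
    # tobytes() copies; a real implementation would hold the buffer instead.
    return view.tobytes().decode(codec)


narrow = text_from_fixed_width(b"abc")
wide = text_from_fixed_width(array("H", [0x3B1, 0x3B2]))
```

The mismatch risk mentioned above shows up here as the ValueError branch: with a declared kind there would be a second place where the declaration and the buffer can disagree.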

An additional note: if the buffer can reference itself, then we would be forced to make the Unicode object conditionally GC-tracked, where it is only tracked if the buffer was requested. We could force the user to make the buffer object NOT be GC-tracked and thus end up skipping the GC. But this is a large restriction and will lead to memory leaks.

Alternatively, I can make deferred a direct subtype of unicode. That will make the derived string the only one that is GC-tracked. Either way the buffer will end up stored in the post data.any so that we don’t change the size and layout of the existing object in any way. This means that our string will fail an exact check. I would prefer to avoid that, but I understand adding two void* to every string is likely a bad trade.

Thus far I have added two additional bit flags to the type: one that shows there is a buffer object and one that indicates a load has happened. The load flag is slightly superfluous, as data.any not being null is a fair indication that the buffer has been loaded.

There is one point in the existing Unicode documentation that is troubling. It states there are 4 types of string objects and then only describes 3. The last is the general string but it is only implied. This is problematic because the undescribed type is the one that I will be working with.

The general plan is when a lazy load happens we request the buffer which we will hold onto. This will likely require extra memory over the buffer pointer itself as the buffer lease object is a static that needs to live somewhere.

There is also the issue of encoded buffers in which UTF-8 or some other encoding may live. For now I will be assuming that STRICT 8, 16, or 32 bit buffers without any extension bytes are supplied. It will be the job of the buffer implementer to decide on and decode their string to the proper buffer layout. We can give helper functions, but that is not in the current scope of the PEP.


If a core developer can tell me which options are most likely to be acceptable, I can continue the implementation to the point that the PEP can be modified to be accurate to the implementation details. I am slightly less comfortable with adding the derived type as I don’t understand the structure of how initial types are formed in the C Python API, but I can if directed to figure it out.

There are now “Output:” sections in the original post without corresponding code.

An important consideration here is that current C API for accessing string contents (PyUnicode_KIND, PyUnicode_DATA, PyUnicode_GET_LENGTH etc.) is infallible. What should they do if PyUnicode_FromLazyLoader raises an exception? I can’t see many alternatives: we can raise a fatal process-killing error, or we can replace/deprecate/remove all this API (which, given the wide usage, would take a while).


The current str contains these representations:

  • The “canonical” representation (a Py_UCS{1,2,4} buffer)
  • The UTF-8 representation, created on-demand and cached
  • Any additional info can be added in subclasses

As I see it, this proposal asks for one big user-visible change:

  • The “canonical” representation can also be created on demand. (And this operation may fail.)

Adding PyUnicode_FromLazyLoader is a detail in comparison to that :‍)


A PyUnicode_FromLazyLoader API would make it trivial to implement “UTF-8-first” strings, which would contain UTF-8 until some operation requires a fixed-size encoding (len, indexing, etc.). IMO, there’s a lot of appetite for such “zero-copy” UTF-8 strings.
(Operations like len & indexing could avoid extra memory by using an offset table instead of a UCSn copy, see Mark’s idea, but that would probably require removing APIs like PyUnicode_DATA entirely.)
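
A rough Python model of the “UTF-8-first” idea, to show which operations can stay on the UTF-8 bytes and which force a decode. The class Utf8First and its method names are hypothetical, and this sketch keeps a full decoded copy rather than the offset table mentioned above.

```python
class Utf8First:
    """Holds UTF-8 bytes and decodes only when code-point access is needed (hypothetical)."""

    def __init__(self, data: bytes):
        self._utf8 = data
        self._text = None          # fixed-width form, created on demand

    @property
    def decoded(self):
        return self._text is not None

    def as_utf8(self):
        return self._utf8          # no decode, no copy

    def _materialize(self):
        if self._text is None:
            # This step can fail (UnicodeDecodeError) on invalid input.
            self._text = self._utf8.decode("utf-8")
        return self._text

    def __len__(self):
        return len(self._materialize())

    def __getitem__(self, key):
        return self._materialize()[key]
```

Passing the bytes through with as_utf8() never decodes, while len() and indexing do. The decode step is also where a failure would surface, which is the fallibility problem raised for the existing accessor macros.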


IMO, loader should be a C function: int loader(PyObject *, int flags), with flags used to select whether it should fill in the UCSn representation or the utf-8 one.
