The usage pattern of those methods in the code I was reviewing suggested that those three were unreachable without first hitting a type test where the ensure-load takes place. We could insist that the user specify the kind when requesting creation of the string. The length may be available, but relying on that defeats the point of a lazy load, since we don’t want to pay decoding costs just to get the count.
The other issue I ran into was that the buffer-based implementation had a lot of overhead, because I was hauling the general-purpose Py_buffer around. I think this would need a lighter version of the buffer API to be a viable option.
My working plan seems to be to define a Py_charbuffer type with the ability to provide the required kind, length, and ownership strategy (does the data need to be freed?). Does this make sense? If so, does anyone have input on what the API should look like, or should I simply work from the buffer type?
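To make the question concrete, here is one possible shape for such a type. This is only a sketch in plain C; the name Py_charbuffer, its fields, and the helpers are all hypothetical, not existing CPython API:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical lightweight descriptor for handing character data to the
 * string machinery without dragging the full Py_buffer around. */
typedef struct {
    void     *data;    /* start of the code units */
    ptrdiff_t length;  /* number of code points */
    int       kind;    /* element width in bytes: 1, 2, or 4 */
    int       owned;   /* nonzero if the consumer must free `data` */
} Py_charbuffer;

/* Example producer: wrap a borrowed ASCII string, no ownership transfer. */
static void charbuffer_from_ascii(Py_charbuffer *buf, const char *s) {
    buf->data = (void *)s;
    buf->length = (ptrdiff_t)strlen(s);
    buf->kind = 1;
    buf->owned = 0;
}

/* Release helper: frees the data only when ownership was transferred. */
static void charbuffer_release(Py_charbuffer *buf) {
    if (buf->owned)
        free(buf->data);
    buf->data = NULL;
    buf->length = 0;
}
```

The `owned` flag is the entire ownership strategy here; a richer design might use a release callback instead, which is closer to what Py_buffer does.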
Also, please detail how you would like the UTF-8-until-needed strings to work. Where will the memory live prior to decoding, and what is the ownership model? If you can outline the specification, I will attempt to work it in.
Can you give more info on your str idea? It seems like you are saying: make a new string object, copy the flags over, and point the data at the string, then just hold that string around until the outer front end is done, so the clear function is just a decref. That would certainly be the lightest approach, if I am understanding your idea correctly.
So for UTF-8-until-needed, the loader supplies only UTF-8. When a representation is demanded, we convert and place it in the covering front end. I suppose that works as far as the ownership model goes. However, I see a wrinkle: the UTF-8 string is just going to want to return itself in response to a str request. How will I force it to a fixed representation?
I will use that as the working point for my next session. If I missed something please follow up.
Somehow, I didn’t see the IMO section when looking through my phone interface. That answers my question. I will work to make the API follow that direction.
Here is what I believe the current idea would be. We add a function PyObject* PyUnicode_NewDeferred(loader deferred) that creates a deferred version of the unicode string. The layout does not change, and the loader lives in the item data slot. There are some flags to support requests to the loader. I am not sure we really need the ready flag, as we could just check the data fields, but I will start by assuming it is there. I may also consider adding flags for the two data slots.
The three methods that must not fail will NEVER trigger a load request, but they will return bogus values if the string has not yet been loaded. Any other method that is allowed to fail may trigger a load on request. In the case where we are making a deferred string with UTF-8 content preloaded, the loader sits at the start of the elements and the UTF-8 data follows. The loader will be triggered to add the canonical section.
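As a sanity check of that contract, here is a toy model in plain C (none of this is the real PyASCIIObject layout; field and function names are invented). It shows the infallible accessor returning a bogus value before the load, and a fallible accessor triggering the loader on demand:

```c
#include <stddef.h>

/* Toy model of the proposed layout: the loader occupies a slot until the
 * string is realized. */
typedef struct deferred_str deferred_str;
typedef int (*str_loader)(deferred_str *self);

struct deferred_str {
    str_loader loader;   /* lives in the item-data slot until loaded */
    void      *data;     /* canonical buffer once loaded */
    ptrdiff_t  length;
    int        kind;
    int        ready;    /* could instead be derived from data != NULL */
};

/* The "must not fail" accessor: never calls the loader, so before the
 * load it simply reports whatever is in the struct (a bogus value). */
static ptrdiff_t deferred_get_length(const deferred_str *s) {
    return s->length;    /* meaningless until s->ready is set */
}

/* A fallible accessor: allowed to trigger the load on demand. */
static int deferred_ensure(deferred_str *s) {
    if (!s->ready) {
        if (s->loader(s) < 0)
            return -1;   /* load failed: caller sees an error */
        s->ready = 1;
    }
    return 0;
}
```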
Other questions… should the hash be computed from whichever of the UTF-8 or canonical forms is available, skipping the load entirely if one of them is present?
The diagram should have read “The following will not” rather than “This will not”. It was supposed to refer to what follows, not what precedes.
The UTF-8-only mode will help me in two ways. Aside from adding deferred loading, the decoders I have for Java only produce UTF-8. So if the loader can fill out either the UTF-8 or the canonical form on request, it will be a big win for me.
I don’t think we can change PyUnicode_GET_LENGTH to start returning -1. That would be a major backwards-incompatible change.
Even changing the meaning of PyASCIIObject.length would need an exception, which means some tough PEP debate weighing the pros and cons.
IMO, the best way for PyUnicode_LENGTH & KIND is to trigger loading, with any failure being a fatal error. Given that common loaders would only fail when out of memory, that might be acceptable. We’ll definitely need new, fallible APIs though. (But PyUnicode_DATA, PyUnicode_READ return meaningless values if you don’t know the kind, so those can return nonsense if KIND would fail.)
The hash algorithm is an internal detail, so the hash needs to be computed by Python from filled-in data. (Initially from the UCSn form; optionally using UTF-8 would be an optimization).
The flag would probably be best as a hint; for example if a loader can only produce utf-8 it should do that even when asked for UCSn.
With regard to the conversion, let’s consider the cases.
Suppose neither the UTF-8 nor the canonical data is loaded and the loader is asked to supply the canonical form. If the loader has the canonical form available, it would fill it in. But if it only has UTF-8, the proper behavior would be to fill out the UTF-8 and call an API to convert the UTF-8 to canonical. The request is served.
It gets weirder if the loader is asked to supply UTF-8 and has it. Now it just fills out the UTF-8 and declares the job done. That means a second request, for the canonical form, could be issued later; something would then be responsible for filling out the canonical form from the previously supplied UTF-8.
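Both cases reduce to “make sure the UTF-8 exists, then convert on demand.” A minimal model in plain C, with a byte copy standing in for the real UTF-8-to-UCSn decode (all names hypothetical):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal model: a string that may hold a UTF-8 form, a canonical form,
 * neither, or both. */
typedef struct {
    char *utf8;       /* NUL-terminated UTF-8, or NULL */
    char *canonical;  /* decoded form, or NULL (a byte copy here) */
} lazy_str;

/* Stand-in for the real UTF-8 -> UCSn decode API. */
static char *convert_utf8_to_canonical(const char *utf8) {
    size_t n = strlen(utf8) + 1;
    char *out = malloc(n);
    if (out)
        memcpy(out, utf8, n);   /* real code would pick a kind and decode */
    return out;
}

/* A canonical request first ensures the UTF-8 is present (asking the
 * loader if needed), then converts from it. */
static int ensure_canonical(lazy_str *s, int (*loader)(lazy_str *)) {
    if (s->canonical)
        return 0;                       /* already realized */
    if (!s->utf8 && loader(s) < 0)
        return -1;                      /* loader could not supply UTF-8 */
    s->canonical = convert_utf8_to_canonical(s->utf8);
    return s->canonical ? 0 : -1;
}
```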
As for the must-not-fail methods, perhaps we should consider the following. When the deferred string is created, the loader will be called with flags equal to zero. The loader is responsible for verifying whether the fetch will succeed without executing the fetch (is anything null, is it missing vital info, etc.). This means the majority of exceptions it can generate will happen at creation time, not request time. Does this help the issue?
Edit: perhaps we can go beyond this with flags.
0 - prefetch check.
1 - field request: may load or return -1; no exception may be set. On -1, kind = 1 char, length = 0, data points to a static “”, and the error flag is set.
2 - canonical fetch: if the error flag is set, it must set the exception and exit with -1; otherwise load, or set an exception and return -1.
3 - UTF-8 fetch: same as canonical but with the utf8 field filled out; it may also set the canonical form or not, at its option.
A field load can now never fail, and any user of the API who failed to verify readiness will get an empty string.
Python should only call the loader if both forms are missing. It can fill in the UTF-8 or UCSn itself, whenever it has the other form.
The user should do all that before calling PyUnicode_NewDeferred.
Unfortunately, memory allocation failures (MemoryError) can’t be checked for in advance, and loaders that can’t fail that way won’t be very useful.
A detail I missed: PyUnicode_NewDeferred should take the desired str (sub)class as argument.
Okay so I need two versions (or one optional argument).
PyObject* PyUnicode_NewDeferred(PyObject* type, int (*loader)(PyUnicode* str, int flags), char* utf8)
Where type is the type to create (must be unicode or a type derived from unicode), loader is the loader for the deferred string, and utf8 is an optional zero-terminated string that will be freed when the string is destroyed or the call fails. It will be preloaded into the utf8 memory slot and the length set; pass NULL if not applicable.
The loader flags will be:
0 - field fetch request. Either preload or fail. On success return 0. On failure set the error flag, do not set an exception, and set kind = 1 char, length = 0, and data = NULL (or a static “”??).
1 - canonical request. Check the error flag; if it is set, set the exception to MemoryError and return -1. On success KIND, LENGTH, and DATA must be valid. Set owns_data if the data field must be freed.
2 - UTF-8 request. Check the error flag; if it is set, set the exception to MemoryError and return -1. On success the utf8 field and UTF-8 length must be set. Set owns_utf8 if the utf8 field must be freed.
A field fetch must never set exceptions (we will clear the exception if one is set, to prevent undesirable behavior, unless it was set before the load request). There is a case in which the UTF-8 was fetched and filled before the canonical request; in that case the loader must call the appropriate function to decode the string.
This will serve both for lazy UTF-8-only strings and for lazy fetch for language bindings.
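Here is that contract as a small self-contained C model (all names invented; a real loader would set a Python exception where the comments say so). The key property is that a failed field fetch leaves safe values behind, so the infallible accessors always have something to return:

```c
/* Model of the refined contract: a field fetch (flag 0) can never raise;
 * on failure it records an error flag and leaves a safe empty string. */
enum { REQ_FIELDS = 0, REQ_CANONICAL = 1, REQ_UTF8 = 2 };

typedef struct {
    int   error;        /* set by a failed field fetch */
    int   kind;
    long  length;
    const char *data;   /* canonical data or the static "" fallback */
} model_str;

static int model_loader(model_str *s, int flags, int simulate_failure) {
    if (flags == REQ_FIELDS) {
        if (simulate_failure) {
            s->error = 1;           /* remember, but set no exception */
            s->kind = 1;
            s->length = 0;
            s->data = "";           /* safe empty string for accessors */
            return -1;
        }
        s->kind = 1;
        s->length = 5;
        s->data = "hello";
        return 0;
    }
    /* REQ_CANONICAL / REQ_UTF8: a recorded error becomes a real failure.
     * Real code would set MemoryError here before returning -1. */
    if (s->error)
        return -1;
    return 0;
}
```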
Sorry, I was trying to answer your earlier suggestion that we could late-load the UTF-8 into the canonical form. Should I remove that from consideration, or did I miss the purpose of the previous comment? If late loading of the canonical form is not required, then I will drop the optional argument from consideration and focus only on late loading from external sources.
I meant that you could write something like the following.
(This assumes that there are new API functions that set string data, meant to be called from loaders, named PyUnicode_SetUTF8 and PyUnicode_SetDataAndSize; they’d need better names and API than here).
int my_func(...) {
    PyObject *mystring = PyUnicode_NewDeferred((PyObject *)&PyUnicode_Type, utf8_loader);
    PyUnicode_SetUTF8(mystring, utf8_buffer, utf8_len);  // set the data *early*
    ...
}

int utf8_loader(PyObject *mystring, ...) {
    ...
    const char *utf8_buffer = PyUnicode_AsUTF8AndSize(mystring, &utf8_size);
    ucs_buffer = do_the_conversion(utf8_buffer, utf8_size);
    PyUnicode_SetDataAndSize(mystring, ucs_buffer, ...);
    ...
}
But, Python itself should provide utf8_loader, and call it whenever a string needs UCSn and has UTF-8. And it should use this pattern to implement PyUnicode_FromStringAndSize, or perhaps a new function like PyUnicode_FromValidUTF8.
Custom loaders should be left for formats like UTF-16[1], which Python doesn’t store natively.
I’ve read the PEP and passing around raw data using a different format (usually bytes, memoryviews or e.g. PyArrow strings) is indeed a good way to defer processing of the actual string data.
What I don’t understand is why you think this needs to be baked into CPython.
It is usually better to keep the raw data where you get it from and only start conversion to a Python string when it’s time to process the data as a Python string. Since there are many origin data formats for this and you ideally want zero-copy behavior during the “manage my string data” phase of the processing, it doesn’t really make sense to hardcode this logic into CPython.
Instead of having some wrapper object determine that it’s time to start processing the data as a Python string (using triggers on its methods), you have the application determine this via explicit conversion. And that can easily be had now (and is being done in practice) without changes to CPython.
This also allows you to work on or with the data during the “manage my string data” phase, since you’d still have access to the original data and not have to get this out of the lazy loaded Python string in some way (which, under your proposal, is not possible).
Part of my project’s goals is seamlessness, and for the most part we are seamless. Java ints are Python ints (and maintain their identity). The same goes for every class that is shared with Python, right down to exceptions. We even have integration of the stack frames in exception information (though there we have to hack Python to do it).
Strings were the only class where we couldn’t do that without running into a bottleneck when processing. Eager loading means either we convert and are seamless, except that pass-through-style use takes a hit, or we hold off and force users to call str. The late convert causes weird minor changes in design, because many interfaces in Python, such as dictionary keys in certain places, must be str. Unfortunately, Python lacks a tag to say “this thing is meant to duck type as a str, so use it like a string everywhere”, meaning calling str at key spots is needed.
Unfortunately, this left it in a state where there are two incompatible options, selected at startup by convertStrings. It was a horrible design choice in my opinion, and it far predates me. It would seem that calling the extra str to say “now is the time to convert” wouldn’t be too much. But that design decision means sometimes it eagerly loads and becomes a string and sometimes it doesn’t, selected globally. And because a string loses its identity on eager conversion, weird bugs happen on convert, where I am trying to pass a string through and the identity gets lost. Or it keeps its identity and needs str in particular places. Being a global flag, some people chose one and some chose the other. But that means if you use one JPype-using module with another JPype-using module and one needs a different flag, the one that started first works and the other fails.
The “I will duck type” flag, such that every API can take something str-like and have it work, is another option, but it likely breaks lots of other modules. So the next best option is unification of the strings. It makes the convertStrings flag meaningless, as both options will give the same behavior. And it is compatible with all other project goals.
It is a very niche problem, particular to a design choice that predates me, but “should convert or should hold off” is a common problem in many Python modules. Many other Python modules, like PyObjC and Qt, chose “should convert”, even in cases where their strings are in fact mutable (part of the confusion at the start of this thread). It is just much more convenient to let strings be used in Python APIs without the extra str call. Thus a common problem calls for a common solution.
I am not familiar with JPype, so can’t comment on that use case.
Just to clarify: The only use case I see for deferred creation of strings from raw data is when you have to deal with long strings, and there I mean 1000+ code points.
For smaller strings, you don’t really gain much and the added maintenance overhead is not worth the trouble.
For longer strings, applications usually have very clear boundaries between the “manage my string data” and the “process my string data” phases, so there should only rarely be a case where automatic triggering of the conversion would make the code better.
Also note that all such lazy evaluations suffer from a major problem: that of not only deferring doing the actual work, but deferring raising possible exceptions when the work finally gets done.
Those exceptions can lead to very hard to debug situations and also makes writing robust code harder.
E.g. your code may be expecting a ValueError from the immediate processing and provide error handling for those cases. If your deferred creation happens to get finalized in the same block, a UnicodeDecodeError (which is a subclass of ValueError) may then trigger running this error handler, which was meant for an entirely different error situation. And this can happen several stack frames down the stack, perhaps even in code which you don’t control.
I disagree with… nothing you said. I would never have created an implicit option in the first place, and I would not have chosen it as the default. (That was one of the few design decisions I reversed. And sadly, users liked the implicit behavior and deliberately select the old behavior. Ugh!)
As for how it will be used, we have to look at the current state of the implicit path… When I get a Java string, I also know the number of code points, sort of. Java uses UTF-16 with UTF-32 extensions, and then reports double-encoded UTF-16 as UTF-8, so the number of code points it reports may be less. They were clever and just call the method GetStringUTF (notice: no number). It is 8 bits only, completely incompatible with anything if it contains a 32-bit code point. On top of that, the sort of UTF-8 they hand me is not even native. To get the native form I have another call, which sends me wide UTF-16, which is also not compatible with Python. But the original author of JPype had assumed it was compatible, so that broken UTF-8 was requested and passed through to give to Python. So the current route of strings is: Java converts once, and it passes through to where Python needs it. C++ converts once to give real UTF-8, then Python converts once to give 8/16/32 fixed eagerly. Then, if you want to route back to Java because this was just a pass-through, C++ must request that Python give UTF-8, and the process must reverse, because Java natively wants UTF-16 with 32-bit extension words. If you think this is exhausting, I did too, and I wrote a comment about lazy Sun programmers making other people’s lives worse, thus promoting the heat death of the universe. That sadly got me called to HR… but that is another long story.
The planned use for the API is that I will convert my string representation into an extension class of Python string which carries through all the information, such that if used as a pass-through it simply hands over the pointer to the original code points. Depending on the length reported by Java, it will eager load or late load. Either way, the implicit path always gets a valid string to Python, and thousand-code-point strings pass through without any overhead.
Also, failure is not an option. No matter the outcome, whether it be a null pointer or total garbage, Python will get its string. Only an out-of-memory error will stop it. Well, not really, as I will just point Python at a static “conversion failed: out of memory” message, eat the exception, and pretend no one saw anything (what do you mean I attempted to convert a 2 MB string?).
This is true because Java is handing me a fully validated string handle, and I check all prerequisites at the decision point between eager and lazy. Though I understand that others may not have that luxury, so we must account for it.