In python/cpython issue #69456, “Add method to detect if a string contains surrogates”, I long ago proposed a method for detecting whether a string has surrogates, to make “look before you leap” decisions possible. The email package in particular could use this (and has a hackish solution for it) because certain logic paths need to differ depending on whether or not there are “smuggled” bytes, and these paths diverge before any encoding is actually done. The current method used by the email package is to try encoding and catch the error if it fails. We determined that this is currently the most efficient way to do the check.
Stan Ulbrych has provided a PR to implement a str method to do this check, and Marc-Andre Lemburg prefers the name ‘issurrogate’ for it. Peter Bierma is of the opinion that a PEP would be a good idea because this is a method on str and will affect other Python implementations. Marc suggested posting here first to see what the consensus about that is, before I start learning how to write a PEP.
The advantage of the method over the current function used by the email package is that it will be faster, since it only needs to search the string looking for the first surrogate, instead of invoking the full codec encoding machinery.
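For readers unfamiliar with the two approaches being compared, here is a minimal sketch. The function names are hypothetical, the first function only approximates the “try encoding and catch the error” idea rather than reproducing the email package's actual helper, and the second is a pure-Python stand-in for the scan that the proposed method would do in C.

```python
def has_surrogates_by_encoding(s: str) -> bool:
    """Approximation of the current approach: run the codec, catch the failure."""
    try:
        s.encode("utf-8")  # strict UTF-8 refuses lone surrogates
    except UnicodeEncodeError:
        return True
    return False


def has_surrogates_by_scan(s: str) -> bool:
    """Stand-in for the proposed method: stop at the first surrogate found."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# A byte "smuggled" into a str via the surrogateescape error handler.
smuggled = b"caf\xe9".decode("ascii", "surrogateescape")

for text in ("plain ascii", "café", smuggled):
    print(repr(text), has_surrogates_by_encoding(text), has_surrogates_by_scan(text))
```

In pure Python the scan is not expected to beat the codec; the speed argument applies to a C implementation that can stop at the first surrogate without building an encoded copy.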
The method isdigit checks whether the string contains only digits, the method isalpha checks whether the string contains only alphabetic codepoints, etc.
Therefore, a method called issurrogate should check whether the string contains only surrogate codepoints.
If the method actually checks whether the string contains surrogate codepoints, but the string can also contain non-surrogate codepoints, then the name would be misleading.
In that case, I would suggest hassurrogates, or something similar.
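To make the naming argument concrete, here is a small sketch contrasting the two possible semantics. Both functions are hypothetical illustrations, not the behaviour of any existing or proposed API; the “all” variant mirrors how isdigit and isalpha treat the empty string.

```python
def _is_surrogate_codepoint(ch: str) -> bool:
    return 0xD800 <= ord(ch) <= 0xDFFF


def issurrogate_all(s: str) -> bool:
    """is*-style semantics: non-empty and every codepoint is a surrogate."""
    return bool(s) and all(_is_surrogate_codepoint(ch) for ch in s)


def has_surrogates_any(s: str) -> bool:
    """has*-style semantics: at least one codepoint is a surrogate."""
    return any(_is_surrogate_codepoint(ch) for ch in s)


mixed = "abc\udcff"        # ordinary text plus one smuggled byte
only = "\udc80\udcff"      # nothing but surrogates

print("12a".isdigit())                                     # existing is* method: all-or-nothing
print(issurrogate_all(mixed), has_surrogates_any(mixed))
print(issurrogate_all(only), has_surrogates_any(only))
```

The use case described in this thread needs the “any” semantics, which is the case where an is* name would suggest the wrong thing.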
This sounds great to me. I think a str method does deserve a PEP, but if a method would be “too much”, this could go in unicodedata, or string.
I’d paint the bikeshed as has_surrogates – if it doesn’t use is*, it’s not limited by early naming conventions.
(Internally, it could even make sense to store this on the string object (as tri-state yes/no/don’t know), and then perhaps allow surrogatepass-encoded bytes in the internal UTF-8 cache.)
That might be an interesting approach to this.
And I do agree that this deserves a PEP.
The purpose of the method is to be able to detect whether a string includes surrogate code points or not. .has_surrogates() sounds like a better name.
FWIW: If there’s consensus that such a method does make sense, then I don’t think we need a PEP. That’s why I asked to open a broader discussion. A PEP would be needed in case the discussion results in diverging opinions to allow the SC to decide.
BTW: You don’t want to import unicodedata just to figure out whether a string has surrogates or not. It’s a big module to load. If not a method, the string module would be a better place.
Reiterating what I said in the PR for posterity: I personally think that a method on str is fine, but it would be good to have a PEP to discuss why that’s better than a string function, and to bikeshed the name.
But really, a PEP is just a formality when changing builtins. Other Python implementations have to mimic our decisions, so skipping the process where they get a chance to express their opinion about it doesn’t seem fair to me. (Incidentally, a PEP is also a clever way to build awareness of the function when it goes in.)
This aspect is more than trivial. It’s very easy to miss a small line in What’s New, but people will generally notice an accepted PEP. Case in point: zip executables. The feature was introduced in Python 2.6 with a bullet point under “Other Language Changes”, which is easily missed; in Python 3.5, it was enhanced with a module to create them more conveniently (a small but important improvement), which got far more publicity than the original feature because PEP 441 got a proper mention in the release highlights.
I really like the flag-on-the-string approach, especially since the
encoder knows the answer at the time the string is created. That
may be a bit more complicated to implement, though.
Thinking about this some more, a flag on string would require updating that flag every time the string was operated on to produce a new string, which would, I think, be too much overhead.
I already proposed the idea with a flag. It can be merged with the ASCII-only flag in a 2-bit value:
ASCII-only, no surrogates.
Non-ASCII, no surrogates.
Non-ASCII, has surrogates.
Non-ASCII, status of surrogates is not known.
In many cases (UCS1, strict UTF-* decoders) the status of surrogates is known. In other cases it can be left undefined to the first request.
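As an illustration of the four states listed above, here is a rough Python model. Everything in it is an assumption made for the sketch: the enum names, the numeric values, and the `FlaggedStr` wrapper are hypothetical, and a real implementation would live in the C string object rather than in a wrapper class.

```python
from enum import IntEnum


class SurrogateState(IntEnum):
    # Numeric values are illustrative; any 2-bit assignment would do.
    ASCII_NO_SURROGATES = 0
    NON_ASCII_NO_SURROGATES = 1
    NON_ASCII_HAS_SURROGATES = 2
    NON_ASCII_UNKNOWN = 3


class FlaggedStr:
    """Hypothetical stand-in for a string object carrying the 2-bit state."""

    def __init__(self, value, state=None):
        self.value = value
        if state is None:
            # ASCII-only text cannot contain surrogates; otherwise defer.
            state = (SurrogateState.ASCII_NO_SURROGATES if value.isascii()
                     else SurrogateState.NON_ASCII_UNKNOWN)
        self.state = state

    def has_surrogates(self):
        if self.state is SurrogateState.NON_ASCII_UNKNOWN:
            # Resolve the unknown state on the first request, then cache it.
            found = any(0xD800 <= ord(ch) <= 0xDFFF for ch in self.value)
            self.state = (SurrogateState.NON_ASCII_HAS_SURROGATES if found
                          else SurrogateState.NON_ASCII_NO_SURROGATES)
        return self.state is SurrogateState.NON_ASCII_HAS_SURROGATES


s = FlaggedStr("caf\udce9")
print(s.state.name, s.has_surrogates(), s.state.name)
```

A creator that already knows the answer, such as a strict decoder, could pass `state` explicitly and skip the scan entirely.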
This would also help the UTF-16 and UTF-32 encoders (and the wchar_t encoder), because if we know that there are no surrogates, the loop can be simpler (or just a memcpy()). I did not implement this years ago because I had no evidence of the benefit.
On the other hand, the original cause of these issues is the email module. It uses surrogates to represent bytes in an invalid or unknown encoding. But there are flaws in this approach. For example:
It does not work well with encodings like Shift_JIS, in which ASCII codes can be part of a multibyte character.
It does not work if the encoded text was cut in the middle of a multibyte character for folding. For a long enough non-ASCII subject, this is almost guaranteed to be an issue.
“Tagged strings”, with the original encoding attached, become the only solution for these issues. And in such a case the internal representation need not be str; it can be bytes. So I think that the email module may require significant refactoring, and we cannot predict whether surrogates will still be as useful as they are now.
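The two flaws listed above can be reproduced with the standard codecs alone. This is only a sketch of the underlying byte-level behaviour, with made-up sample text; it does not exercise the email package itself.

```python
def has_surrogates(s: str) -> bool:
    # Hypothetical helper, same check as discussed in this thread.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# Flaw 1: in Shift_JIS the second byte of a character can be an ASCII code.
# KATAKANA LETTER SO encodes to 0x83 0x5C, and 0x5C is the ASCII backslash.
sjis = "ソ".encode("shift_jis")
smuggled = sjis.decode("ascii", "surrogateescape")
# Only the first byte is smuggled as a surrogate; the second one shows up
# as a real backslash that later processing may treat as syntax.
print(sjis, ascii(smuggled))

# Flaw 2: folding that cuts a multibyte character in half.
raw = "日本語".encode("utf-8")      # 3 bytes per character
first, second = raw[:4], raw[4:]   # the cut falls inside the second character
a = first.decode("utf-8", "surrogateescape")
b = second.decode("utf-8", "surrogateescape")
# Each piece decodes on its own, but the split character turns into
# surrogates on both sides instead of reappearing as text.
print(ascii(a), ascii(b), has_surrogates(a), has_surrogates(b))
```

The original bytes are still recoverable by re-encoding with surrogateescape, but the joined str is no longer the text that was sent, which is the situation a representation that keeps the original encoding alongside the data is meant to avoid.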