Fully support C / C++ features for string/bytes literals

mrolle45 · February 24, 2025, 4:47am

It seems to be a deliberate design for Python 3 to support anything in string literals that is valid in C, and with the same meanings of course. Particularly with escape sequences. For instance, in both languages, '\n' is the same as chr(0x0A). Thus C programmers will have no trouble understanding Python literals.
Python has already moved to include some C++ features that are not in C (yet). Notably the \N{name} escape. This is new to C++ 23.

I am simply saying that Python should incorporate the other new features of C++ 23. These would be part of the unicode-escape encoding, which is used by both the Python parser and the codecs.decode() function.

See this example of new features that work and don’t work on Python currently:

>>> '\N{NBSP}'
'\xa0'
>>> '\u{0000}'
  File "<python-input-3>", line 1
    '\u{0000}'
    ^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

The NBSP is actually an alias for NO-BREAK SPACE, and is normative in the Unicode standard.
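The alias claim can be checked from Python itself. A minimal sketch using the standard `unicodedata` module (the variable name `nbsp` is only illustrative):

```python
import unicodedata

# "NBSP" is a name alias; lookup() resolves aliases as well as formal names.
nbsp = unicodedata.lookup("NBSP")

# The alias and the formal name refer to the same code point, U+00A0.
assert nbsp == "\xa0" == "\N{NO-BREAK SPACE}"

# name() always reports the formal character name, never the alias.
assert unicodedata.name(nbsp) == "NO-BREAK SPACE"
```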

The escape sequences in C++ 23 which are not in C are:

\N{name or alias}
\u{hex digit(s)}
\x{hex digit(s)}
\o{octal digit(s)}

I argue that these delimited escape sequences are somewhat easier to read and make it clear just what the escape is composed of. For instance, compare
"ab\xcdef" = "ab\x{cd}ef", 5 characters
"\12a" = "\o{12}a", 2 characters
"\1234" = "\o{123}4", 2 characters

Furthermore, these escapes can represent values larger than 255. For example, a character for the Greek letter δ can be written as "\x{3B4}", rather than "\u03B4", to convey the meaning that the value is simply a number rather than a unicode codepoint.
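Since Python does not accept these escapes today, the equivalences above can only be illustrated with a stand-in. The sketch below uses a hypothetical helper, `expand_delimited`, which is not part of Python or of the proposal. It expands `\x{...}`, `\u{...}` and `\o{...}` in a raw string and compares the result with today's literals. It deliberately ignores details such as escaped backslashes and out-of-range values.

```python
import re

# Hypothetical helper for illustration only: Python has no such function.
# Matches \x{hex}, \u{hex} or \o{octal} in text whose backslashes are literal.
_DELIMITED = re.compile(r"\\(?:([xu])\{([0-9A-Fa-f]+)\}|o\{([0-7]+)\})")


def expand_delimited(text):
    """Replace C++23-style delimited escapes with the characters they name."""
    def repl(match):
        if match.group(1):                      # \x{...} or \u{...}
            return chr(int(match.group(2), 16))
        return chr(int(match.group(3), 8))      # \o{...}
    return _DELIMITED.sub(repl, text)


# The pairs from the post, with the left side spelled as Python accepts it now.
assert expand_delimited(r"ab\x{cd}ef") == "ab\xcdef"
assert len("ab\xcdef") == 5
assert expand_delimited(r"\o{12}a") == "\12a"
assert len("\12a") == 2
assert expand_delimited(r"\o{123}4") == "\1234"
assert len("\1234") == 2
assert expand_delimited(r"\x{3B4}") == "\u03B4"
```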

Someone please adopt this proposal

I don’t have the means to do any more than just post this suggestion. So if you, dear reader, see that there is consensus for the idea, then please file an issue. Thanks.

MegaIng · February 24, 2025, 6:33am

Actually, it would be more accurate to say that C++ has moved to adopt Python features. The \N{name} escape sequences have existed in Python for more than 10 years^[1]. I suspect they were pulled from somewhere else, but I haven’t done the research.

Not true. Stuff is copied where it makes sense, but not blindly. E.g. hex escapes with only one digit are not supported.

I don’t think adding yet-another-way is going to have enough benefit to justify the effort. Maybe if a core dev just does it and makes a corresponding PR it would be approved, but I don’t think a long discussion here is going to lead to anything - probably just a long string of +0 or -0 votes (unless there is some amazing argument for one side that I can’t currently imagine).

added in Python 3.3 ↩︎

JamesParrott · February 24, 2025, 10:14am

Only on Posix. Python interprets '\n' as CR LF on Windows and CR on MacOS (0x0D). PEP 278 – Universal Newline Support | peps.python.org

picnixz · February 24, 2025, 10:23am

The C++23 features mentioned are documented in Escape sequences - cppreference.com. For a bit of additional context, two (duplicated) issues were filed and I rejected them both: Python interreter and codecs module don't recognize unicode escape \u{xxx}. · Issue #129392 · python/cpython · GitHub and codecs module doesn't recognize new C++ 23 universal-character-name \u{xxx}. · Issue #130475 · python/cpython · GitHub. I’ll expand and amend the reasons that I partially presented there:

As it was already said, \N{NAME} has been supported for years by Python.
I don’t like the fact that we’re using { with both named entities (with \N) and ordinal values (whether it’s hexadecimal or octal digits). I can see value in readability but there are downsides (see below). For me, { is really tied with the use of \N + name.
Why do we need to align ourselves with C++23? Why this feature in particular? As we already said on the issue(s), please present convincing evidence that this is needed in real-world applications. We have not had any convincing answer to that inquiry so far.
This point is new, since the issues didn’t mention it: we may have problems with "ab\x{cd}ef" due to f-strings and r-strings. For instance, how should we interpret r"ab\x{cd}ef" or cd = 42; f"ab\x{cd}ef"? Currently the first gives 'ab\\x{cd}ef' while the second fails because the decoder is applied before f-strings are substituted, even for constants. While the second isn’t a breaking change, the first may be a breaking change.

For example, a character for the Greek letter δ can be written as "\x{3B4}", rather than "\u03B4",

I would rather use \u03B4 in this case to convey the intent that it’s a unicode codepoint, because I don’t know how \delta can be a number. I fail to understand your point here.

I don’t have the means to do any more than just post this suggestion. So if you, dear reader, see that there is consensus for the idea, then please file an issue. Thanks.

As Guido said on #130475, if you care about the issue, please track it yourself. However, count me as -1 on this idea because I don’t think it’s useful at all.

In addition, Python explicitly expects \x to have exactly 2 hexadecimal characters (see 2. Lexical analysis — Python 3.13.2 documentation) which is different from what standard C does (and it is documented as such).

The same can be said for \u and \U where exactly 4 and 8 hexadecimal characters are required for them to work. If we allowed an arbitrary number of hexadecimal characters – which is what C++23 does –, then yes, it would be meaningful to allow some kind of visual separator. However, in this case, the separators become redundant. I also don’t know if users would bother using them (again, we need evidence that it’s a feature that would be used).
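A short sketch of the fixed widths described above, using only literals plus `codecs.decode` with the `unicode_escape` codec. The helper name `decodes` is illustrative, not an existing API.

```python
import codecs

# \x consumes exactly 2 hex digits, \u exactly 4, \U exactly 8.
assert "\x41" == "A"
assert "\x412" == "A2"          # the trailing "2" is an ordinary character
assert "\u03C0" == "π"
assert "\u03C00" == "π0"        # likewise, the trailing "0" is not part of the escape
assert "\U000003C0" == "π"
assert len("\U0001D11E") == 1   # one code point outside the 16-bit range


def decodes(escaped):
    """Return True if the unicode_escape codec accepts the given bytes."""
    try:
        codecs.decode(escaped, "unicode_escape")
    except UnicodeDecodeError:
        return False
    return True


# Too few digits is an error rather than a shorter escape.
assert decodes(rb"\x41")
assert not decodes(rb"\x4")
assert not decodes(rb"\u3C0")
assert not decodes(rb"\U1D11E")
```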

MegaIng · February 24, 2025, 10:29am

That is a property of the IO streams, not of the syntax. ord('\n') has the same result on all systems. (and in fact to some degree the C APIs are going to exhibit the same behavior of auto-translating \n to CR LF as needed, so this isn’t necessarily a difference between python and C, although the details are different)
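To make the distinction concrete, here is a small sketch that separates the literal from the I/O layer. It forces Windows-style line endings on an in-memory stream through the `newline` argument of `io.TextIOWrapper`, so the result does not depend on the host platform.

```python
import io

# The literal is a single LINE FEED character on every platform.
assert ord("\n") == 0x0A
assert len("\n") == 1

# Translation is done by the text I/O layer when writing.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="ascii", newline="\r\n")
text.write("a\nb")
text.flush()

written = raw.getvalue()
assert written == b"a\r\nb"     # "\n" became CR LF only on the way out
```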

r-string means that no escape sequences (except \"&\') are processed, right? So why would this be treated differently?

picnixz · February 24, 2025, 10:37am

Oh you’re right. My bad. Concerning my f-string question, currently print(f"\N{Musical Symbol G Clef}") prints 𝄞, so it should be fine to make the behaviour of \x{...} identical if we were to include that feature (but I wouldn’t support a PEP for that), namely f"\x{41}" would be equivalent to f"A" (currently it raises a SyntaxError).
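The current behaviour discussed in the last few posts can be reproduced as follows. This is a sketch of what Python does today, not of the proposal. The helper `compiles` is illustrative and simply reports whether a source fragment is accepted by `compile`.

```python
# Raw strings keep the backslash and the braces as ordinary characters.
raw = r"ab\x{cd}ef"
assert raw == "ab\\x{cd}ef"
assert len(raw) == 10

# \N{...} is already resolved inside f-strings; its braces are not a replacement field.
clef = f"\N{MUSICAL SYMBOL G CLEF}"
assert clef == "\U0001D11E"


def compiles(source):
    """Return True if the source fragment is a valid Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# \x{...} is rejected today, both in plain strings and in f-strings.
assert not compiles(r'"\x{41}"')
assert not compiles(r'f"\x{41}"')
assert compiles(r'r"\x{41}"')   # a raw string accepts the same characters
```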

storchaka · February 24, 2025, 11:01am

Different programming languages have different syntax. This is what makes them different languages. C++ is rather a late adopter. Python got a way to express Unicode characters outside of the 8-bit range in 2000 (in Python 2.0). The \uHHHH syntax was borrowed, I believe, from Java, which has had it from the beginning. The \UHHHHHHHH syntax was perhaps an original and hastily baked solution (it was not even mentioned in PEP 100). It is cumbersome, because the first two (and often three) digits after \U are always zeroes, but we have lived with this for 25 years. Most other programming languages got support for Unicode escape sequences later. Earlier adopters of Unicode usually support \uHHHH. It seems that C++ borrowed the syntax for \N{...} from Python, and \x{...} and \o{...} from Perl. In Perl, they needed \x{...} because \u was already used for another purpose.

In summary, we do not need this feature, because Python already has a way to express arbitrary Unicode characters. You can ask the C++ Working Group for support of Python syntax. Then tell us what they told you.

malemburg · February 24, 2025, 11:27am

When we added Unicode support, we had a look at what C/C++/Java had at the time (which wasn’t much) and used that as basis. Python has for a long time used C as inspiration, so this was a natural thing to do.

In those days, Unicode was pretty much UCS2, hence the \uXXXX literal. Later on, Unicode was expanded to UCS4, so we extended this to \UXXXXXXXX (why 8 X? because backwards compatibility was a concern).

That said, Python has made its own innovations since then and others have followed us.

Please don’t read too much into following C/C++/Java, though. We are only adding things which we find useful to have and there’s no incentive to just copy things to make Python compatible with C/C++/Java/etc.

JamesParrott · February 24, 2025, 11:40am

Ah right, thank you. Good to know the difference.

MRAB · February 24, 2025, 5:41pm

It’s only recently that I discovered that \x in C isn’t limited to 2 hexadecimal digits. Up to that point, I’d only ever seen it used with 2 hexadecimal digits, so I assumed that it always had to be 2.

Interestingly, Python originally followed C in allowing a variable number of digits and only later fixed it at 2 with PEP 223.

storchaka · February 24, 2025, 11:07pm

TIL that C99 supports \U.... I thought it was Python’s original syntax.

PeterL · February 27, 2025, 11:48pm

One place where this suggested syntax has a slight edge in my view is regarding delimiting. '\x412' has the appearance of a three digit hex code, but is 'A2'. And '\u03C00' is 'π0'
A small argument is that '\x{41}2' may be clearer than '\x412'
Not a fan of the '\N{digits}' part of the proposal, because it doesn’t really add anything except being able to drop leading zeros, and that doesn’t sit well with my model of 2 hex digits = a byte.