String joining design

@barry How could you forget form feed? :thinking:

Hm. Before we decide to endow the str class with a bunch of random attributes let’s think some more about whether that’s the right place. And even with the names you suggest, if someone encounters str.SPACE.join(...) they’ll probably have to look it up the first time to be sure what kind of magic it does.

FWIW, I personally prefer literals, e.g. ' '.join(...).


It might be better to split this discussion, but I think only Discourse admins can do that? @brettcannon ?

That said, I don’t think str.SPACE.join() (and friends) would be all that confusing. Yeah, maybe they have to look it up the first time, but once you know that SPACE is just a string, it – and all other such constants – should be obvious.
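To make that concrete, here is a minimal sketch of what such constants would amount to. The names SPACE, NEWLINE and COMMASPACE are hypothetical (str has no such attributes today), so plain module-level names stand in for them:

```python
# Hypothetical constants: str has no such attributes today, so plain
# module-level names stand in for the proposed str.SPACE and friends.
SPACE = " "
NEWLINE = "\n"
COMMASPACE = ", "

words = ["spam", "eggs", "ham"]

# A named constant is an ordinary str, so join() behaves exactly like
# it does on the literal.
spaced = SPACE.join(words)
listed = COMMASPACE.join(words)

assert spaced == " ".join(words) == "spam eggs ham"
assert listed == "spam, eggs, ham"
assert isinstance(NEWLINE, str)
```

Nothing about join() changes; the only difference is where the reader has to look to learn what the separator is.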

The benefit of sticking them on str is that because it’s a built-in, no imports are necessary. If they aren’t put on str I’m not sure what would be better and more obvious.

“Why is there str.SPACE but not int.ONE?”


While I would love to contribute to cpython, I don’t feel strongly about the feature. I was just asking out of curiosity, thought it had been brought up before and I wanted to know the reason it was rejected…

I also looked at the str/unicode object source code and it looked very complex, especially for someone who’s never written a python object in C before.

I have split this thread into its own topic from PEP 701 – Syntactic formalization of f-strings - #112 by guido


Or more realistically, float.PI, float.E, float.TAU, float.INF and float.NAN.

Even if there’s no intention from the core devs to establish a principle that “common constants for a type should be attributes of that type” I fully expect that if we do this for str, we’ll end up with a lot of energy spent on python-ideas arguing with people who feel that you can never have too much of a good thing :slightly_smiling_face:


Okay, maybe, but even if so: 1) is it a bad idea to add constants such as float.PI, and 2) even if there is some call for that, is that a reason not to do it for str constants?

The names for the str constants will generally be longer than the
literal, so this seems to be a foolish endeavor, taking up extra
characters in the code by spelling out the constant, and having
more names (should they be in English or Tamil?) to learn and
remember.

float.PI and float.E sound much more interesting than str.SP or str.NL,
although they can only be approximated, whereas str constants could be
exact.

Yes, most of the ASCII control characters have short abbreviations that
were standardized by ASCII, but when you have to prefix them with "str."
they are longer than the literals, even the hex literals like '\xA0', and
certainly longer than '\n' or ' '. Unicode characters have far longer
names, in general, so again the literal is simpler, shorter, and
doesn't require reference to a document to know what is meant. There
are a few characters with similar appearance, but my favorite text
editor will tell me the hex code and the Unicode name if I'm uncertain.


Enter the math module which has exactly those five constants.
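For reference, a quick sketch of those five constants as they exist in the math module today (math.tau needs Python 3.6 or later):

```python
import math

# The five constants mentioned above already live in the math module.
constants = {
    "pi": math.pi,
    "e": math.e,
    "tau": math.tau,
    "inf": math.inf,
    "nan": math.nan,
}
for name, value in constants.items():
    print(f"math.{name} = {value!r}")

# tau is defined as 2*pi, and doubling a float is exact.
assert math.tau == 2 * math.pi
assert math.isinf(math.inf)
# NaN never compares equal to itself, so test for it with math.isnan().
assert math.isnan(math.nan)
assert math.nan != math.nan
```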

Is this a point in favor of string.NL et al. (referring to the string module) or not?


That’s fine. You can always use the literal if you’re indexing on saving characters. I still think using symbolic names instead improves readability in many cases.

using symbolic names instead improves readability in many cases

and decreases it in others:

(trying, and probably failing, to pretend I don’t have decades of experience with some of these forms…)

str.NL vs '\n'  -- neutral (edge to '\n' for "already programmers" crowd)
str.NEWLINE vs '\n' -- edge to str.NEWLINE
str.SP vs ' '   -- edge to ' '
str.SPACE vs ' ' -- edge to str.SPACE
str.EMPTY vs '' -- neutral
str.COMMASPACE vs ', ' -- edge to ', '

Note that the (IMHO) more readable ones are harder to type – maybe a trade off that’s worth it.

So I don’t think this is worth the complication.

The float ones, on the other hand, I think have some real merit, as there is no literal way to express those – but to waffle, it’s also common to need the math module anyway if you are using those (or numpy).

These collections of little marks are often difficult to read and/or parse for some users. Also, think about how screen readers might handle them. Depending on the font, etc. how easy is it to visually distinguish '' from ' '? These are issues I consider when I prefer symbolic names to literals such as this.

Well, I was assuming a good font and colorization in your editor – without that, named symbols are more readable for sure.

Screen readers – I have no idea.

Point of note: In C++, there is a very important difference between using a literal and using a named token when it comes to the standard streams:

std::cout << "Hello, " << name << "\n";

std::cout << "Hello, " << name << std::endl;

The std::endl token ends the line, and also flushes the buffer. With tokens like str.NL or str.SPACE, I would be wondering if they have additional meaning, too - for instance, some_string.split(str.SPACE) could conceivably mean “split on any whitespace” rather than being exactly equivalent to ' ' . So IMO this can impair readability compared to the literal. There’s no risk of mistyping it as there is with math.PI (would anyone spot the bug if you wrote 3.141592653589783 for pi?), and no additional meaning, so it’s just a longhand way of writing a literal - unless you have some need to be able to shadow the name str and change all the literals.
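The worry about hidden meaning is not far-fetched, because str.split already behaves differently depending on whether a separator is passed at all. A small sketch with today's Python (nothing hypothetical here):

```python
text = "a  b\tc"  # two spaces between a and b, a tab before c

# An explicit ' ' separator splits on every single space character,
# so the doubled space yields an empty string and the tab is untouched.
by_space = text.split(" ")

# No separator means "split on runs of any whitespace".
by_any_whitespace = text.split()

assert by_space == ["a", "", "b\tc"]
assert by_any_whitespace == ["a", "b", "c"]
```

A reader seeing a named constant would have to know which of these two behaviours it triggers; with the literal there is nothing to guess.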

One can use a \N{} named escape if that’s easier to read, parse or remember. They’re based on the Unicode character names and aliases. For example:

>>> '\N{COMMA}\N{SPACE}'
', '
>>> '\n' == '\N{NEW LINE}' == '\N{LINE FEED}' == '\N{END OF LINE}'
True
>>> '\n' == '\N{NL}' == '\N{LF}' == '\N{EOL}'
True

Hey, this is great! '\N{NL}'.join is even longer than str.NL.join for
those that like verbosity, but it also has the definite advantage of
using standardized names that would be hard to fake or override.

I think it’s notable that I can write ' '.join(mylist) but not 10.bit_length() – but I can write (10).bit_length()! :face_with_spiral_eyes:
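A short sketch of why that is: the tokenizer reads "10." as the start of a float literal, so the method call only works once the integer literal has been terminated some other way.

```python
# 10.bit_length() is a SyntaxError because "10." starts a float literal.
# Either of these ends the integer literal before the dot:
with_parens = (10).bit_length()
with_space = 10 .bit_length()

# 10 is 0b1010, which needs four bits.
assert with_parens == with_space == 4

# String literals have no such ambiguity, so methods attach directly.
joined = " ".join(["a", "b"])
assert joined == "a b"
```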

Jumping to a discussion of constants skips an important point: making join a named method of str objects wasn't the only way that string joining could have been implemented. Just to toss out some ideas:

>>> join = str.join
>>> join(", ", "abc")
'a, b, c'
>>> class MyStr(str):
...     def __pow__(self, other):
...         return self.join(other)
...
>>> s = MyStr(",")
>>> s ** "abc"
'a,b,c'

Perhaps the history of %-formatting makes us wary of using an infix operator, but isn’t str.join part of the class of common and “fundamental” operations which might deserve to be operator-ized?


On the subject of defining constants and putting them somewhere, I would strongly prefer that they be in string, not attributes of str.

string already contains several useful constants, so it seems like a reasonable place to add more. Furthermore, to a previous point, this mirrors math having constants and parallel structures are good for comprehension.
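As a rough illustration, these are some of the constants the string module provides today, followed by a stand-in for the proposed one. string.NEWLINE does not exist, so the local NEWLINE name below is hypothetical:

```python
import string

# Constants that the string module already provides.
assert string.digits == "0123456789"
assert string.ascii_lowercase == "abcdefghijklmnopqrstuvwxyz"
assert "\n" in string.whitespace
assert " " in string.whitespace

# Hypothetical: string.NEWLINE does not exist, so a local name stands in
# for what the proposed constant would be.
NEWLINE = "\n"
joined = NEWLINE.join(["first", "second"])
assert joined == "first\nsecond"
```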

These constants are almost uniformly more verbose than the literals. So I don’t think verbosity makes sense as a reason to reject import string. If str.NEWLINE.join(...) is better than "\n".join(...), then doesn’t the same argument apply to import string; string.NEWLINE.join(...)?

I’ll also say that I am somewhat against shorthand names like NL. IMO, string.NEWLINE is self-evident. string.NL is simple once you know what it means.


Yes! Although, be careful: not every Python supports the same \N escapes. So by doing this, you’ll be stopping your code from running at all on MicroPython, and different versions of Python will have different lists of supported names. Names like “NEW LINE” will probably work on pretty much any CPython (even CPython 2.7), but errored out on PyPy 2.7.18/7.3.3 (it recognized the \N syntax but didn’t know the name “NEW LINE”), and I couldn’t even get Jython to recognize that the line was complete. I haven’t run into problems with the “\N{NL}” alias on any of the Python 3 implementations I have installed (including PyPy3 3.7.10/7.3.5), but you’d have to check other character aliases to see whether they got added.

In any case: names and aliases should never be removed, only added, but this is another compatibility factor to consider.

Support for aliases was added in Python 3.3, so none of the newline alias examples that I gave works in a 2.7 u"" string literal. I only gave those examples to demonstrate alias support. The main reason I posted was to suggest a flexible, compile-time alternative to avoid “collections of little marks”, such as using "\N{COMMA}\N{SPACE}" instead of ", ".
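A minimal sketch of both halves of this, assuming a Python 3.3+ interpreter that ships the unicodedata module. The resolves() helper is a name made up for illustration; it checks at run time whether a name or alias is known, which is useful because an unknown name inside a "\N{...}" literal fails at compile time instead:

```python
import unicodedata

# The escapes are resolved when the literal is compiled, so this is
# exactly the same string object value as ", ".
sep = "\N{COMMA}\N{SPACE}"
assert sep == ", "
assert sep.join(["a", "b", "c"]) == "a, b, c"


def resolves(name):
    """Return True if this interpreter's Unicode database knows the name."""
    try:
        unicodedata.lookup(name)
    except KeyError:
        return False
    return True


# Official character names should resolve; whether the aliases do
# depends on the interpreter and its Unicode database version.
for candidate in ("COMMA", "SPACE", "LINE FEED", "NL"):
    print(candidate, resolves(candidate))

assert resolves("COMMA")
assert not resolves("NO SUCH CHARACTER NAME")
```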