String joining design

barry · January 2, 2023, 7:39pm

OT: Irrespective of f-strings, I really dislike the style of calling .join() on a string literal. I remember ages ago when we were debating how to add a string join function, there were arguments for making it either a built-in or a string method. A string method won of course, and it does make sense, but I think it was @tim.one who suggested and preferred calling it on a variable instead of a string literal, i.e.

NL = '\n'
...
message = NL.join(bits_and_pieces)

So much more readable to my eyes, and even more so when the string literal has a semantic meaning which you want to convey in the code.

I totally get the convenience (and popularity) of using '\n'.join() but I still don’t like it so I don’t use it, which is probably why the “backslash restriction” in f-strings doesn’t bother me in practice that often.
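
For context on the "backslash restriction": before Python 3.12 (PEP 701), the expression part of an f-string could not contain a backslash, so a literal '\n'.join(...) could not appear inside the braces. A minimal sketch of how a named separator avoids the issue follows; the sample data is made up for illustration.

```python
# A named separator sidesteps the pre-3.12 rule that the expression part
# of an f-string may not contain a backslash.
NL = "\n"

bits_and_pieces = ["first line", "second line", "third line"]  # sample data

# Works on every Python version that has f-strings (3.6+).
message = f"Report:{NL}{NL.join(bits_and_pieces)}"

# On Python 3.11 and earlier the literal form below is a SyntaxError;
# it only parses from 3.12 onward:
#     message = f"Report:\n{'\n'.join(bits_and_pieces)}"

print(message)
```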

sirosen · January 3, 2023, 12:27am

FWIW, and I know this is going even further OT, the popularity/prevalence ends up mattering a lot when making small contributions on projects.

If I show up on a project with a couple dozen open PRs and want to contribute a change, I’m definitely going to write "\n".join()!

Even if there were a stdlib method or constant for me to use, like string.NL.join(), I don’t want to bog down a contribution with extra discussion. The most common way of writing it becomes normative (except when the common way is incorrect, of course).

barry · January 3, 2023, 4:25pm

That’s okay, but I’d call that out in a PR against my projects. “When in Rome…” and all that.

This was discussed too IIRC and rejected because it would require an import, and adding a (usually module-level) global when necessary is even easier.

ajoino · January 3, 2023, 5:05pm

Was it ever suggested to add some commonly used separators to the str object, like this

str.NL.join(["1", "2", "3"])    #    "1\n2\n3"
str.COMMA.join(["1", "2", "3"]) #    "1,2,3"

barry · January 3, 2023, 5:46pm

I don’t remember TBH, but that’s actually not a bad idea IMHO. Care to put together a simple PR? Here’s a list of ones I find commonly used:

NL = '\n'
SPACE = ' '
EMPTY = ''
COMMASPACE = ', '

Those are just the ones that come immediately to mind.
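
As a rough illustration of what is possible today without changing CPython: the built-in str type rejects new attributes set from Python code, so a project that wants these names has to keep them somewhere else. The `Sep` class below is a hypothetical name used only for this sketch; it is not part of the standard library or of any proposal in this thread.

```python
class Sep:
    """Hypothetical namespace for common separators (not part of the stdlib)."""

    NL = "\n"
    SPACE = " "
    EMPTY = ""
    COMMASPACE = ", "


numbers = [1, 2, 3]

# str.join() only accepts strings, so convert the items first.
as_lines = Sep.NL.join(str(n) for n in numbers)
as_list = Sep.COMMASPACE.join(map(str, numbers))

# Built-in types are immutable from Python code, so the constants cannot
# simply be attached to str itself.
try:
    str.NL = "\n"
except TypeError:
    can_patch_str = False
else:
    can_patch_str = True

print(repr(as_lines), repr(as_list), can_patch_str)
```

Putting the names on str itself, as proposed above, would therefore need a change to the C implementation of the type.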

ericvsmith · January 3, 2023, 5:59pm

@barry How could you forget form feed?

guido · January 3, 2023, 6:22pm

Hm. Before we decide to endow the str class with a bunch of random attributes let’s think some more about whether that’s the right place. And even with the names you suggest, if someone encounters str.SPACE.join(...) they’ll probably have to look it up the first time to be sure what kind of magic it does.

FWIW, I personally prefer literals, e.g. ' '.join(...).

barry · January 3, 2023, 9:59pm

It might be better to split this discussion, but I think only Discourse admins can do that? @brettcannon ?

That said, I don’t think str.SPACE.join() (and friends) would be all that confusing. Yeah, maybe they have to look it up the first time, but once you know that SPACE is just a string, it – and all other such constants – should be obvious.

The benefit of sticking them on str is that because it’s a built-in, no imports are necessary. If they aren’t put on str I’m not sure what would be better and more obvious.

Rosuav · January 3, 2023, 10:01pm

“Why is there str.SPACE but not int.ONE?”

ajoino · January 3, 2023, 10:35pm

While I would love to contribute to cpython, I don’t feel strongly about the feature. I was just asking out of curiosity, thought it had been brought up before and I wanted to know the reason it was rejected…

I also looked at the str/unicode object source code and it looked very complex, especially for someone who’s never written a python object in C before.

pablogsal · January 3, 2023, 11:05pm

I have split this thread into its own topic from PEP 701 – Syntactic formalization of f-strings - #112 by guido

pf_moore · January 3, 2023, 11:56pm

Or more realistically, float.PI, float.E, float.TAU, float.INF and float.NAN.

Even if there’s no intention from the core devs to establish a principle that “common constants for a type should be attributes of that type” I fully expect that if we do this for str, we’ll end up with a lot of energy spent on python-ideas arguing with people who feel that you can never have too much of a good thing.

barry · January 4, 2023, 12:10am

Okay, maybe, but even if so: 1) is it a bad idea to add constants such as float.PI, and 2) even if there is some call for that, is that a reason not to do it for str constants?

Glenn · January 4, 2023, 12:32am

The names for the str constants will generally be longer than the
literal, so this seems to be a foolish endeavor, taking up extra
characters in the code by spelling out the constant, and having
more names (should they be in English or Tamil?) to need to learn and
remember.

float.PI and float.E sound much more interesting than str.SP or str.NL,
although they can only be approximated, whereas str constants could be
exact.

Yes, most of the ASCII control characters have short abbreviations that
were standardized by ASCII, but when you have to prefix them with “str.”
they are longer than the literals, even than the hex literals '\xA0' and
certainly longer than '\n' or ' '. Unicode literals have far longer
names, in general, so again the literal is simpler, shorter, and
doesn’t require reference to a document to know what is meant. There
are a few characters with similar appearance, but my favorite text
editor will tell me the hex code and the Unicode name, if I’m uncertain.
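
The length difference between Unicode names and literals can be checked with the standard unicodedata module and the \N{...} escape. This is only a small sketch; the "<no name>" fallback string is an arbitrary choice made here.

```python
import unicodedata

# The official Unicode name is much longer than the character it denotes.
space_name = unicodedata.name(" ")
nbsp_name = unicodedata.name("\xa0")

# Python already accepts those names inside string literals.
same_nbsp = "\N{NO-BREAK SPACE}" == "\xa0"

# Control characters such as "\n" have no Unicode name, so a default
# must be supplied to avoid a ValueError.
newline_name = unicodedata.name("\n", "<no name>")

print(space_name, nbsp_name, same_nbsp, newline_name)
```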

jcgoble3 · January 4, 2023, 12:39am

Enter the math module which has exactly those five constants.

Is this a point in favor of string.NL et al. (referring to the string module) or not?
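
For reference, both modules mentioned here exist today. The sketch below only inspects what they currently provide: math carries the five float constants, while string holds character-set constants but no separator names.

```python
import math
import string

# The five constants pf_moore listed live in the math module.
constants = {
    "PI": math.pi,
    "E": math.e,
    "TAU": math.tau,
    "INF": math.inf,
    "NAN": math.nan,
}

# The string module already holds character-set constants, but nothing
# named NL.
has_newline_in_whitespace = "\n" in string.whitespace
has_nl_constant = hasattr(string, "NL")

for name, value in constants.items():
    print(f"math.{name.lower()} = {value!r}")
print(has_newline_in_whitespace, has_nl_constant)
```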

barry · January 4, 2023, 12:46am

That’s fine. You can always use the literal if you’re indexing on saving characters. I still think using symbolic names instead improves readability in many cases.

ChrisBarker-NOAA · January 4, 2023, 1:21am

using symbolic names instead improves readability in many cases

and decreases it in others:

(trying, and probably failing, to pretend I don’t have decades of experience with some of these forms…)

str.NL vs '\n'  -- neutral (edge to '\n' for "already programmers" crowd)
str.NEWLINE vs '\n' -- edge to str.NEWLINE
str.SP vs ' '   -- edge to ' '
str.SPACE vs ' ' -- edge to str.SPACE
str.EMPTY vs '' -- neutral
str.COMMASPACE vs ', ' -- edge to ', '

Note that the (IMHO) more readable ones are harder to type – maybe a trade off that’s worth it.

So I don’t think this is worth the complication.

The float ones, on the other hand, I think have some real merit, as there is no literal way to express those – but to waffle, it’s also common to need the math module anyway if you are using those (or numpy).
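
The "no literal" point is easiest to see for infinity and NaN, which have no literal syntax at all. A short sketch of the usual spellings:

```python
import math

# Infinity and NaN have no literal syntax; they are built by calling
# float() on a string or taken from the math module.
inf_from_str = float("inf")
nan_from_str = float("nan")

inf_matches = inf_from_str == math.inf

# NaN never compares equal to anything, including itself, so isnan()
# is the reliable check.
nan_equals_itself = nan_from_str == nan_from_str
nan_detected = math.isnan(nan_from_str)

print(inf_matches, nan_equals_itself, nan_detected)
```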

barry · January 4, 2023, 1:46am

These collections of little marks are often difficult to read and/or parse for some users. Also, think about how screen readers might handle them. Depending on the font, etc. how easy is it to visually distinguish '' from ' '? These are issues I consider when I prefer symbolic names to literals such as this.

ChrisBarker-NOAA · January 4, 2023, 1:58am

Well, I was assuming a good font and colorization in your editor – without that, named symbols are more readable for sure.

Screen readers – I have no idea.

Rosuav · January 4, 2023, 2:09am

Chris Barker:

using symbolic names instead improves readability in many cases

and decreases it in others:

(trying, and probably failing, to pretend I don’t have decades of experience with some of these forms…)
str.NL vs '\n'  -- neutral (edge to '\n' for "already programmers" crowd)

Point of note: In C++, there is a very important difference between using a literal and using a named token when it comes to the standard streams:

std::cout << "Hello, " << name << "\n";

std::cout << "Hello, " << name << std::endl;

The std::endl token ends the line, and also flushes the buffer. With tokens like str.NL or str.SPACE, I would be wondering if they have additional meaning, too - for instance, some_string.split(str.SPACE) could conceivably mean “split on any whitespace” rather than being exactly equivalent to ' '. So IMO this can impair readability compared to the literal. There’s no risk of mistyping it as there is with math.PI (would anyone spot the bug if you wrote 3.141592653589783 for pi?), and no additional meaning, so it’s just a longhand way of writing a literal - unless you have some need to be able to shadow the name str and change all the literals.
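
The two readings of "split on space" already behave differently in Python today, which is why the ambiguity matters. A small sketch with a made-up sample string, followed by a check of the mistyped pi value from the post:

```python
import math

text = "a  b \tc"  # two spaces, then a space followed by a tab

# An explicit ' ' separator splits at every single space and keeps the
# empty strings between adjacent spaces.
on_space = text.split(" ")
# No argument means "split on runs of any whitespace".
on_whitespace = text.split()

print(on_space)
print(on_whitespace)

# The mistyped value from the post differs from math.pi in one digit.
typo_pi = 3.141592653589783
print(typo_pi == math.pi, abs(typo_pi - math.pi))
```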