Include ability to do numeric based sorting of text

I opened a GitHub issue with a feature request to add a “natural sort” option to the standard library. Apparently this might be a tad controversial so it was suggested that I start a discussion here.

I think the GitHub comments are spot on–this is trivial to do for simple cases with key, and more complex cases are available via PyPI. There’s no need to add this to the stdlib. Really I’m not sure what “this” is–what’s “natural” is extremely subjective and depends on the context of the data. The user should know how to sort their data, and shouldn’t rely on Python to guess.

The idea of a “natural sort” that’s based on some heuristic ideas about the input data makes me think of Microsoft Excel :sweat_smile:

4 Likes

The optimist says the glass is 1/2 full. The pessimist says the glass is 1/2 empty. Microsoft Excel says the glass is January 2nd, and won’t let you change its mind on that point.

What for one person is “natural” is for another a great source of errant data and frustration.

Python offers you the ability to select your own key function; that should let you define your own rules for “natural” sort.

12 Likes

The fact that the natsort package has a decent amount of configurability means that a single function (in functools or maybe string I would assume) wouldn’t be enough to satisfy to real world usecases of this concept. So if you really want this to be added, someone needs to write a package for the stdlib [1], and a PEP. This isn’t just a simple feature request and would probably take a year of effort to have a chance to be added.


  1. Or consider if moving natsort into the stdlib makes sense ↩︎

2 Likes

The natural ordering of str is alphabetic (where the alphabet is the Unicode character set).

I think you’re asking for ordering by an interpretation of the strings as some other kind of object. You can do that with a key function, as has been suggested, but I suggest your wider application would benefit from a type with the ordering you find natural and a constructor from str. Sort those.

Loved this @Rosuav, but I had to check and it’s definitely 1st February.

natsort is better than Excel, but I wish to complain that it doesn’t understand the natural semantics of my application:

>>> natsorted(a)
['1 ft 5 in', '2 ft 7 in', '2 ft 11 in', '4 in', '7 ft 6 in', '10 ft 2 in']
>>> sorted(glasses)
['1/2 empty', '1/2 full', '1/4 full', '3/4 full', 'empty', 'full']
1 Like

Define “alphabetic”. Please, sort these letters into alphabetical order for me:

AaåáàäæÆ

The correct ordering for a pair of letters can depend on the language, and sometimes, on what kind of word they’re in (some languages alphabetize people’s names differently from other words, for example).

One very simple sort order is code point order where U+0001 comes before U+0002 comes before U+003C comes before U+01F8 and so on. For this to be useful, you’d have to first settle on a Unicode normalization form (NFC/NFD, or maybe NFKC/NFKD), as otherwise you could have two different representations for the same characters that end up sorting differently. But if you want to call the sort order “alphabetic”, you first have to decide which alphabet.

Thanks for checking. I don’t have Excel and wasn’t sure, but in any case, that’s what happens when a tool decides on a “natural” sort order. And, of course, every once in a while you get something that actually has a natural order, but you picked the wrong one…

https://www.reddit.com/r/ProgrammerHumor/comments/19bh1x5/whodidthis/

Exactly. It can never be perfect. Sorting things in an aesthetically pleasing way is something humans can definitely do, but computers seem to suck at…

It’s a good joke but (this is such an un-feature in Excel that) the punchline has to be regionalised.

You’re right, I should’ve written “the natural ordering of str is as a sequence of Unicode code points”. In Python it has a well-defined natural order, at least that’s the term Java uses in the same circumstances.

It all serves to demonstrate that another order is natural only with reference to a specific interpretation of the characters as encoding some kind of object with its own natural ordering.

>>> sorted(list('AaåáàäæÆ'))
['A', 'a', 'Æ', 'à', 'á', 'ä', 'å', 'æ']

… er unaturlig, som allt troll vet.

If I needed to sort in a way natural to a natural language, I might well define an str sub-type that would undoubtedly specify the language (above, nob) and normalisation to apply when “decoding” the Unicode.

Edited: because I misunderstood “you picked the wrong one”.

There’s a lot of customisation for natsort…

Yep. That’s well-defined, and doesn’t try to pretend to be lexicographical ordering for any particular language.

Fortunately you don’t have to go to that much effort! You need only provide a key function, and perfectly normal strings can be sorted in whatever way is most appropriate.

I was advocating a type (to the OP originally) because if the sort is natural to a domain, it will keep coming up in the imagined application.

We’d probably agree that a key function is good when you want to make an exception to the natural sort, but remembering to do it every time is hard.

That’s fair. Maybe it’s really a different type of collection, then? A collection which, when you ask it to sort itself, knows how different strings should be ordered?

No, I mean the type of the object to be sorted should know how to do comparison.

from functools import total_ordering

@total_ordering
class TupleStr(str):
    def __init__(self, s):
        self.t = tuple(map(int, s.split('.')))

    def __lt__(self, v):
        return self.t < v.t

    def __eq__(self, v):
        return self.t == v.t


a = ['2.7.10', '3.1', '3.8', '3.10.2', '3.11']

b = list(map(TupleStr, a))

print(sorted(a))
print(sorted(b))

prints:

['2.7.10', '3.1', '3.10.2', '3.11', '3.8']
['2.7.10', '3.1', '3.8', '3.10.2', '3.11']

This is because these strings actually encode version number tuples and it is better to represent them in a program as such, so that whatever tries to compare/sort them puts them into the semantically meaningful order.

I’d like to have given a language example but there is no locale.Locale object in CPython that I could give the class.

You can also use TupleStr as a key:

>>> list(sorted(a, key=TupleStr))
['2.7.10', '3.1', '3.8', '3.10.2', '3.11']

That I can agree with, when it comes to version numbers; those are a different type of thing. But alphabetizing a collection of book titles? Those are strings - unless you really think that BookTitle is a fundamentally different type of object?

This got me thinking of a hypothetical Book object in an ORM for a library database, and it’d make sense to define these functions in that case. Similarly you might have an Author object that has its own natural sorting (last name first).

So I can see the idea making sense in some contexts. But it’s always going to be specific to the problem domain.