READERS DIGEST CONDENSED SUMMARY: Problem solved and just speculating about using combined contributed wisdom to see if hybrid solutions are interesting.
I looked back at the discussion to see whether the various approaches might have some advantages in terms of efficiency, and I want to suggest a hybrid for the very specific, and not uncommon, problem that @c-rob raised.
To be clear, the problem being solved is sorting a series of very similar strings, such as filenames with a fixed pattern, that include one or more runs of contiguous digits which should be compared numerically. The example offered happens to use hyphens as delimiters, but more general solutions would not depend on this.
@brass75 (and others) offered a solution that solved it a bit more generally by creating a sorting key function that returned everything like so:
import re

def sort_key(s: str) -> list:
    return [int(p) if p.isdigit() else p for p in re.findall(r'\D+|\d+', s)]
Clearly this returns all parts of the filename and requires the sort algorithm to then compare up to and including the fourth item in the resulting list in order to decide which of two items sorts higher or lower:
>>> sort_key("file-page-10-table-1.csv")
['file-page-', 10, '-table-', 1, '.csv']
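To show what that key buys you, here is a small sketch comparing a plain sort with a keyed sort. The three filenames are made up for illustration and are not from the original thread.

```python
import re

def sort_key(s: str) -> list:
    # Split into alternating runs of non-digits and digits; digit runs become ints.
    return [int(p) if p.isdigit() else p for p in re.findall(r'\D+|\d+', s)]

# Hypothetical filenames following the fixed pattern discussed above.
names = [
    "file-page-10-table-1.csv",
    "file-page-2-table-10.csv",
    "file-page-2-table-3.csv",
]

plain = sorted(names)                  # character-by-character: "10" sorts before "2"
natural = sorted(names, key=sort_key)  # digit runs compared as integers

print(plain)
print(natural)
```

One caveat: this key only compares cleanly when all names share the same layout. If one name has a number where another has text at the same position, Python will raise a TypeError when it tries to compare an int with a str.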
So, if one assumes a fixed format where only the numeric regions determine the sort order, an obvious enhancement is to just drop the else clause and re-arrange:
def sort_key(s: str) -> list:
    return [int(p) for p in re.findall(r'\D+|\d+', s) if p.isdigit()]
This returns [10, 1] in the above example and sorts well.
But do we need to call isdigit() on the whole string when the regular expression already guarantees each piece is either all digits or all non-digits? isdigit() checks every character in the string. It is a built-in string method and is probably fast, but in theory it might be faster to check only the first character, either with p[0].isdigit() or by comparing it directly against the characters '0' and '9'.
def sort_key(s: str) -> list:
    return [int(p) for p in re.findall(r'\D+|\d+', s) if p[0].isdigit()]
or
def sort_key(s: str) -> list:
    return [int(p) for p in re.findall(r'\D+|\d+', s) if '0' <= p[0] <= '9']
I have no idea without doing benchmarking if such changes make any difference.
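For anyone curious enough to measure, a rough timing harness might look like the sketch below. The function names and the repeat count are my own choices for illustration, and I have added a fourth variant that asks the regular expression for digit runs only, which removes the filter entirely. I am not reporting any numbers, since they will vary by machine and Python version.

```python
import re
import timeit

NAME = "file-page-10-table-1.csv"

def key_isdigit(s: str) -> list:
    # Filter by checking every character of each piece.
    return [int(p) for p in re.findall(r'\D+|\d+', s) if p.isdigit()]

def key_first_char(s: str) -> list:
    # Filter by checking only the first character with isdigit().
    return [int(p) for p in re.findall(r'\D+|\d+', s) if p[0].isdigit()]

def key_compare(s: str) -> list:
    # Filter by comparing the first character against '0' and '9'.
    return [int(p) for p in re.findall(r'\D+|\d+', s) if '0' <= p[0] <= '9']

def key_digits_only(s: str) -> list:
    # Extra variant (not from the thread): match digit runs only, no filter needed.
    return [int(p) for p in re.findall(r'\d+', s)]

variants = {
    "isdigit": key_isdigit,
    "first_char": key_first_char,
    "compare": key_compare,
    "digits_only": key_digits_only,
}

for label, fn in variants.items():
    # Sanity check first: every variant must produce the same key.
    assert fn(NAME) == [10, 1]
    seconds = timeit.timeit(lambda: fn(NAME), number=20_000)
    print(f"{label:>12}: {seconds:.4f}s for 20,000 calls")
```

Timing a single short filename mostly measures fixed call overhead, so a fairer test would probably run each key over a realistic list of names.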
And, then there is the issue of using a regular expression for this scenario. I note @jamestwebber offered an approach tailored to a specific filename format by doing this in the example:
>>> "file-page-10-table-1.csv".rsplit(".")[0].split("-")
['file', 'page', '10', 'table', '1']
Or a bit more legibly as a pipeline:
"file-page-10-table-1.csv"\
.rsplit(".")[0]\
.split("-")
He nicely noted that if you remove the “.csv” extension and then split the rest on the hyphens, you get a list of the remaining parts. This works nicely if some assumptions about the filenames hold, such as not having any additional periods in the name. That assumption may not apply in this case if the names were harvested from a PDF, but assuming the names are pre-filtered to all match a pattern, fine.
def sort_key(s: str) -> list:
    return [int(p)
            for p in s.rsplit(".")[0].split("-")
            if '0' <= p[0] <= '9']
This works fine too. Again, for small batches the differences in efficiency may be minor. Regular expressions can be quite fast for simple cases like this, and the method above has its own overhead from calling two methods plus an indexing operation.
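If the assumptions about periods and hyphens feel too fragile, a slightly more defensive version of the split approach is sketched below. The name sort_key_split is hypothetical, and it assumes plain ASCII digits. It differs in two ways: rsplit(".", 1) removes only the final extension, and p.isdigit() skips empty pieces, whereas p[0] would raise an IndexError on a name containing two hyphens in a row.

```python
def sort_key_split(s: str) -> list:
    # Drop only the last extension, so earlier periods in the name survive.
    stem = s.rsplit(".", 1)[0]
    # Keep the hyphen-separated pieces that consist entirely of digits;
    # ''.isdigit() is False, so empty pieces from "--" are skipped safely.
    return [int(p) for p in stem.split("-") if p.isdigit()]

print(sort_key_split("file-page-10-table-1.csv"))
print(sort_key_split("file.v2-page-10-table-1.csv"))
print(sort_key_split("file--page-3.csv"))
```

Note this also silently ignores a piece such as "10a" rather than failing in int(), which may or may not be what you want.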
Is it worth doing analysis and comparison? Not really but as an academic exercise, it is nice to think of possibilities.