How to sort filenames that have numbers in them but numbers are not zero-padded?

Python 3.11 on Windows 10. I’m not a Python expert but I know a few handy things. I’m still learning.

I’m using Camelot to extract tables from a 34-page PDF. By default, Camelot outputs each table to its own CSV file. However, each CSV file is named like this and comes out in this order when I use glob:

file-page-10-table-1.csv
file-page-11-table-1.csv
file-page-1-table-1.csv
file-page-1-table-2.csv
file-page-2-table-1.csv
file-page-2-table-2.csv
... some other files

I need to read these in the correct order to do some more processing and combine them into one file, but glob returns them in the incorrect order shown above. The correct order would be:

file-page-1-table-1.csv
file-page-1-table-2.csv
file-page-2-table-1.csv
file-page-2-table-2.csv
...
file-page-10-table-1.csv
file-page-11-table-1.csv

My camelot commands to read and output the files are:

# Lattice method is the default.
tables = camelot.read_pdf(pdffile, pages='2-17')
tables.export(csvfile, f='csv', compress=False)

When I did this and specified no pages:

tables = camelot.read_pdf(pdffile) # Lattice method is default.
tables.export(csvfile, f='csv', compress=False)

I got a mostly empty CSV file with about 3 lines of something I could not find in the original PDF. Page 1 has no table, and that seems to mess up Camelot’s output.

  1. Is there a way to process all pages and output all tables to a single CSV file?
  2. Or is there a way to sort these files in the correct order?

Thank you.

Notes:

  1. There are no paths or path delimiters in my file list. The list only contains file names.

Look at packages like natsort


The trick is to parse the string into a tuple of strings and ints.
For example:

file-page-11-table-1.csv

Becomes:

('file-page-', 11, '-table-', 1, '.csv')

Now you can sort a list of those tuples.
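A quick demonstration of why such mixed tuples sort the intended way (Python compares tuples element by element, so equal text parts fall through to the integer comparison):

```python
a = ('file-page-', 2, '-table-', 1, '.csv')
b = ('file-page-', 10, '-table-', 1, '.csv')

# The equal string parts compare equal, so the decision falls to
# the ints: 2 < 10 numerically, whereas as strings '2' > '10'.
print(a < b)        # True
print('2' < '10')   # False
```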

>>> import re
>>> lines = '''file-page-10-table-1.csv
... file-page-11-table-1.csv
... file-page-1-table-1.csv
... file-page-1-table-2.csv
... file-page-2-table-1.csv
... file-page-2-table-2.csv'''.splitlines()
>>> def sort_key(s: str) -> list:
...     parts = re.findall(r'\D+|\d+', s)
...     rc = list()
...     for part in parts:
...             try:
...                     x = int(part)
...             except ValueError:
...                     rc.append(part)
...             else:
...                     rc.append(x)
...     return rc
...
>>> print('\n'.join(sorted(lines, key=sort_key)))
file-page-1-table-1.csv
file-page-1-table-2.csv
file-page-2-table-1.csv
file-page-2-table-2.csv
file-page-10-table-1.csv
file-page-11-table-1.csv

For the use case you have given, we can separate the runs of consecutive digits from the rest of the string - converting them to integers while leaving them in their original positions - and use the result as the sort key.

This is a shorter version of the same sort_key function.

import re

def sort_key(s: str) -> list:
    return [int(p) if p.isdigit() else p for p in re.findall(r'\D+|\d+', s)]

In cases like this I prefer to write something much stricter, with or without regular expressions. I’d do this because it is deliberately less flexible and will fail on unexpected formats, which I’d like to know about.

import re

# using regex
def sort_key(file_name: str) -> tuple:
    g = re.match(r'file-page-(\d+)-table-(\d+)\.csv', file_name).groups()
    return int(g[0]), int(g[1])

# or, if the number of parts is fixed, don't even use regex
def sort_key(file_name: str) -> tuple:
    parts = file_name.rsplit(".")[0].split("-")
    return int(parts[2]), int(parts[4])

These could be modified to include other parts of the filename if needed. But the idea is to break early if something doesn’t fit the expected pattern.
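As a sketch of that fail-fast behavior, here is the regex version from above choking on a hypothetical stray file, which is exactly the point:

```python
import re

def sort_key(file_name: str) -> tuple:
    # Strict match: anything that doesn't fit the pattern blows up.
    g = re.match(r'file-page-(\d+)-table-(\d+)\.csv', file_name).groups()
    return int(g[0]), int(g[1])

print(sort_key('file-page-3-table-1.csv'))  # (3, 1)
try:
    sort_key('notes.txt')  # no match, so .groups() is called on None
except AttributeError as exc:
    print('rejected:', exc)
```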


Or … write a program that iterates through the files in the directory and zero-pads the ‘short’ numbers. And, if appropriate, change the program that generates the file names. This and the above responses should give you enough choices ;-).
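A minimal sketch of that zero-padding idea, assuming names like the ones in this thread (the name `zero_pad` and the width of 3 are my own choices; the actual `os.rename` step is left out):

```python
import re

def zero_pad(name: str, width: int = 3) -> str:
    # Pad every run of digits to a fixed width so a plain string sort works.
    return re.sub(r'\d+', lambda m: m.group().zfill(width), name)

print(zero_pad('file-page-1-table-2.csv'))  # file-page-001-table-002.csv
```

With the names padded (or used as a sort key), ordinary lexicographic sorting puts the files in the right order.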


One more:

import re

def sort_key(s: str) -> list:
    a = re.split(r'(\d+)', s)
    a[1::2] = map(int, a[1::2])  # the captured digit runs land at the odd indices
    return a

There was actually a stdlib tool that would do this nicely, up until its removal in Python 3.12:

>>> from distutils.version import LooseVersion
>>> LooseVersion("file-page-11-table-1.csv").version
['file', '-', 'page', '-', 11, '-', 'table', '-', 1, 'csv']
>>> sorted("""file-page-10-table-1.csv
file-page-11-table-1.csv
file-page-1-table-1.csv
file-page-1-table-2.csv
file-page-2-table-1.csv
file-page-2-table-2.csv""".splitlines(), key=LooseVersion)
['file-page-1-table-1.csv',
 'file-page-1-table-2.csv',
 'file-page-2-table-1.csv',
 'file-page-2-table-2.csv',
 'file-page-10-table-1.csv',
 'file-page-11-table-1.csv']

If putting all the shapes into the square hole is your thing, and you want to keep it going indefinitely, I maintain a copy of this class as the looseversion package on PyPI.


I enjoyed this discussion and the various ways people offered to solve a problem. I won’t share my version, as it is fairly similar to what @brass75 did: the usual one-liner that some find artistic and others consider more "on the spectrum"-tic.


For anyone offended: just humor on the words artistic versus autistic.

Just a few comments on the variations.

Many of the solutions focus on one end: supplying a key function that a sorting algorithm can use, returning a list or tuple containing any combination of strings and integers. That is fine. I can see other approaches, including the one hinted at by @tjreedy who, I assume, would use some printf-like functionality that changes every integer back into zero-padded text of some fixed length, such as “00000012”, which, if wide enough, would work in any programming language except one like Python that supports weirdly large integers, LOL!

I note some people prefer to match just the specific filename format, as @jamestwebber points out. I, too, worry about possible edge cases. As an example, what if a file suffix happened to be .CS1 and we split it into two parts at the end of the list? Would that sort the way intended? Heck, there could be digits for unrelated reasons anywhere in the filename, and if sorting some other kind of data this could be a hassle, as the text may contain something like a serial number of S12V6W meant to be compared to, say, SX1260V. Heck, someone may be using hexadecimal!

So the more specific versions seem safer as long as you also have a step that guarantees all the entries meet the criterion.

I propose an interesting variation on this topic. I record lots of shows and organize them with file and folder names containing text like S6E21 to reflect it is from season 6 and is episode 21. Naturally, when I get around to it, I want to watch these in order.

So, what happens if a series continues into a tenth season or, like Jeopardy, goes past 99 episodes? You can, of course, edit the filenames and widen each number. But somewhat related to this exercise would be a program that reads in all the names, determines the width of the widest numeric region (say, on seeing a 10 or a 122), proceeds to calculate new, wider filenames that will still sort right, and then renames them.

This is not a request as I know how I could write this myself. Just a thought experiment based on my dozens of gigabytes of such recordings that I may never ever finish.
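A minimal sketch of that thought experiment (the function name `widen` and the old-name to new-name mapping it returns are my own choices; the renaming itself is left out):

```python
import re

def widen(names):
    # Find the width of the widest run of digits anywhere in the batch.
    width = max((len(run) for n in names for run in re.findall(r'\d+', n)),
                default=1)
    # Map each old name to a new name with every digit run padded to that width.
    return {n: re.sub(r'\d+', lambda m: m.group().zfill(width), n)
            for n in names}

print(widen(['S6E21.mkv', 'S10E102.mkv']))
```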


Note that file managers do natural sorting to avoid the zero-pad hack.

I like the new way you show to make a list. I have not seen that before.

I like to see all the ways to do this so I can better understand what needs to be done. Thanks to all. :slight_smile:


That worked! And I will use that later I’m sure.

# Web Page with many examples and use cases: https://github.com/SethMMorton/natsort?tab=readme-ov-file#natsort
# Activate your dev environment and use: pip install natsort
from natsort import natsorted
files = ['file-1-table-1', 'file-10-table-2', 'file-11-table-1', 'file-1-table-2']
sorted_files = natsorted(files)
print(sorted_files)

a = ['2 ft 7 in', '1 ft 5 in', '10 ft 2 in', '2 ft 11 in', '7 ft 6 in']
a = natsorted(a)
print(a)

a = ['version-1.9', 'version-2.0', 'version-1.11', 'version-1.10']
print(natsorted(a))

# And it handles mixed types! Which I sometimes get when reading a spreadsheet.
a = ['4.5', 6, 2.0, '5', 'a']
print(natsorted(a))

I wrote this Natural sorting task and its Python solution on RosettaCode. It sorts numeric fields numerically, as you suggest, but may need cutting down as it does much more too.

READERS DIGEST CONDENSED SUMMARY: Problem solved and just speculating about using combined contributed wisdom to see if hybrid solutions are interesting.

I looked back at the discussion to see if the various approaches might have some advantages in terms of efficiency, and want to suggest a hybrid for just the very specific, not uncommon problem that @c-rob raised.

To be clear, the problem being solved concerns a series of very similar strings (filenames with a fixed pattern) that can include one or more regions of contiguous digits and need to be sorted. The example offered happens to include hyphens as delimiters, but more general solutions would not depend on this.

@brass75 (and others) offered a solution that solved it a bit more generally by creating a sorting key function that returned everything like so:

def sort_key(s: str) -> list:
    return [int(p) if p.isdigit() else p for p in re.findall(r'\D+|\d+', s)]

Clearly this returns all parts of the filename and requires the sort algorithm to then compare up to and including the fourth item in the resulting list in order to decide which of two items sorts higher or lower:

>>> sort_key("file-page-10-table-1.csv")
['file-page-', 10, '-table-', 1, '.csv']

So, if one assumes a fixed format where only the numeric regions determine the sort order, an obvious enhancement is to just drop the else clause and rearrange:

def sort_key(s: str) -> list:
    return [int(p) for p in re.findall(r'\D+|\d+', s) if p.isdigit()]

This returns [10, 1] in the above example and sorts well.

But do we need to call isdigit() on the whole string? Normally it checks whether every character in the string is a digit, but the pattern guarantees each match is either all digits or all non-digits, so checking just the first character suffices. isdigit() is a built-in and might be fast, but in theory you might find it faster to test only the first character, either with p[0].isdigit() or by comparing it directly to the characters for zero and nine.

def sort_key(s: str) -> list:
    return [int(p) for p in re.findall(r'\D+|\d+', s) if p[0].isdigit()]

or

def sort_key(s: str) -> list:
    return [int(p) for p in re.findall(r'\D+|\d+', s) if '0' <= p[0] <= '9']

I have no idea without doing benchmarking if such changes make any difference.

And, then there is the issue of using a regular expression for this scenario. I note @jamestwebber offered an approach tailored to a specific filename format by doing this in the example:

>>> "file-page-10-table-1.csv".rsplit(".")[0].split("-")
['file', 'page', '10', 'table', '1']

Or a bit more legibly as a pipeline:

"file-page-10-table-1.csv"\
.rsplit(".")[0]\
.split("-")

He nicely noted that if you remove the “.csv” extension and then split the rest on the hyphens, you get a list of the remaining parts. This works nicely if some assumptions about the filenames hold, such as there being no additional periods in the name. That assumption may not apply here if the names were harvested from a PDF, but assuming the names are pre-filtered to all match a pattern, fine.
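To illustrate that period assumption with a hypothetical name containing an extra dot (passing a maxsplit of 1 to rsplit would remove only the extension):

```python
name = 'report.final-page-2-table-1.csv'  # hypothetical name with an extra dot

# With no maxsplit, every dot splits, and [0] keeps only the first piece.
print(name.rsplit('.')[0])     # report

# With maxsplit=1, only the rightmost dot splits, stripping just the extension.
print(name.rsplit('.', 1)[0])  # report.final-page-2-table-1
```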

def sort_key(s: str) -> list:
    return [int(p) 
            for p in s.rsplit(".")[0].split("-") 
            if '0' <= p[0] <= '9']

This works fine too. Again, for small batches the differences in efficiency may be minor. Regular expressions can be quite fast for simple cases like this, and the alternative has its own overhead from calling two methods plus an indexing operation.

Is it worth doing analysis and comparison? Not really but as an academic exercise, it is nice to think of possibilities.
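For anyone curious, a rough timing harness might look like this (a sketch only; the function names are mine and the numbers will vary by machine):

```python
import re
import timeit

NAME = 'file-page-10-table-1.csv'

def key_regex(s):
    # Regex variant: split into digit and non-digit runs, keep the numbers.
    return [int(p) for p in re.findall(r'\D+|\d+', s) if p[0].isdigit()]

def key_split(s):
    # String-method variant: strip the extension, split on hyphens.
    return [int(p) for p in s.rsplit('.', 1)[0].split('-') if p.isdigit()]

for fn in (key_regex, key_split):
    secs = timeit.timeit(lambda: fn(NAME), number=100_000)
    print(f'{fn.__name__}: {secs:.3f}s')
```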

The simpler way to handle this (just using the numerals as numbers to sort) is:

def sort_key(s: str) -> list:
    return [int(p) for p in re.findall(r'\d+', s)]

This extracts only the numbers and does not require an isdigit check at all. Again, this is useful in the case where we know everything other than the numerals is equal and consistent.

Dan, your suggestion is quite brief and readable for humans and I like it. Again, this function would work if all parts of the string/filename were identical except for the numeric parts.

I was also looking at efficiency considerations, just as an exercise but not to the point of doing benchmarks, LOL!

So I decided to look at how re.findall() is implemented.

There are endless possible ways including fairly expensive ones that keep calling functions, sometimes recursively. Here is what I found in my Python version today:

import re
import inspect
print(inspect.getsource(re.findall))

The output was:

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

It looks like this does much more than needed, based on the docstring, but otherwise the pattern is used to create an object containing some compiled encapsulation of how to do a search, and that object has a findall() method which is invoked.

So far, not all that informative, LOL!

The same technique does not show me the source of the embedded method, but I suspect it has some looping construct. Someone can educate me on how to locate the source code, or how to inspect the method from within by looking in the dictionary or something.

But my overall guess is that for this limited situation, the findall is possibly less efficient even as at the same time, it chunks the ideas better from a human programmer perspective.
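For what it’s worth, in CPython the compiled pattern’s findall() is implemented in C (the _sre extension module), which is why inspect cannot show its source. One small tweak is to hoist the compilation out of the key function, though the re module also caches compiled patterns internally, so the gain is usually modest:

```python
import re

# Compile once at module level; each sort_key call reuses the pattern object.
NUM = re.compile(r'\D+|\d+')

def sort_key(s: str) -> list:
    return [int(p) if p.isdigit() else p for p in NUM.findall(s)]

print(sort_key('file-page-10-table-1.csv'))  # ['file-page-', 10, '-table-', 1, '.csv']
```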

I have long mocked the ancient wisdom imparted by early python pioneers that there is only one obvious way to do things. This thread keeps coming up with variations.

Next, someone will suggest we replace all non-digits with spaces, then split on space and …


There is often a tradeoff between computer efficiency and human efficiency. Regex can be less efficient from the computer’s standpoint but more efficient from the human’s. If performance is not an issue - or the hit is so minor that we can afford to ignore it - we should opt for readability.

You could use that assumption, but by not making that assumption an algorithm might highlight a small portion of files that look like they follow the assumed format but don’t. And if they are mixed in with loads of other files then their difference is masked. (Been there, found that; don’t you just love people hand-editing).

My version scans the string character by character and returns a key.

def key_natural_string(s):
    key = []
    number = []
    other = []
    for ch in s:
        if ch in '0123456789':
            if other:
                key.append(''.join(other))
                other = []
            number.append(ch)
        else:
            if number:
                key.append(int(''.join(number)))
                number = []
            other.append(ch)

    if other:
        key.append(''.join(other))
    if number:
        key.append(int(''.join(number)))

    return key

print(key_natural_string('this-12-one'))
print(key_natural_string('this-12-one-34'))

Which prints:

['this-', 12, '-one']
['this-', 12, '-one-', 34]