Why is this regex not matching the filename?

I’m still a bit new to Python but not new to regex. I’m using Python 3.11 on Windows 10.

I want to check whether a filename does not end with .xlsx or .csv (that is, does not match '(xlsx|csv)$') and print an error if so. But the regex is not working as expected. I must be missing something specific to Python.

>>> import re
>>> print(re.match(r'(xlsx|csv)$', 'fedex-20241022.xlsx', re.IGNORECASE))
None

What am I doing wrong here?

EDIT: I’m getting a (correct) match on Jupyter Notebook when I use this code:

import re
fn = 'fedex-20241022.xlsx'
print(re.search(r'\.(xlsx|csv)$', fn, re.IGNORECASE))

re.match tries to match the whole string, not just a piece of it. You could use .*(xlsx|csv)$ to match the prefix or you could use re.search to search the string.

edit: As you found, match and search are different.

Thank you. Does re.search look for a match starting from the first character, or does it only return the first matching string?

I actually made a mistake. re.match only requires the pattern to match at the beginning of the string; it doesn’t need to match the whole thing. Matching the whole string is what re.fullmatch does. The documentation covers all of this.

I’m not sure what you’re asking here. It searches from the beginning of the string and returns the first match. There’s also re.findall to find multiple non-overlapping matches.
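To make the difference concrete, here is a small sketch using the filename from the question. It only uses re.match, re.search, re.fullmatch and re.findall from the standard library; the variable names and the digit example are just for illustration.

```python
import re

fn = 'fedex-20241022.xlsx'
pattern = r'\.(xlsx|csv)$'

# re.match only tries the pattern at the start of the string,
# so a suffix-only pattern cannot match here.
match_result = re.match(pattern, fn, re.IGNORECASE)

# re.search scans forward and returns the first place the pattern matches.
search_result = re.search(pattern, fn, re.IGNORECASE)

# re.fullmatch requires the pattern to cover the entire string.
fullmatch_result = re.fullmatch(r'.*\.(xlsx|csv)', fn, re.IGNORECASE)

# re.findall returns every non-overlapping match (here, runs of digits).
digit_runs = re.findall(r'\d+', 'a1b22c333')

print(match_result)              # None
print(search_result.group())     # .xlsx
print(fullmatch_result.group())  # fedex-20241022.xlsx
print(digit_runs)                # ['1', '22', '333']
```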

You are not including the “.” in the pattern, so it will match foo.notcsv as well.

Personally, I would use filename.lower().endswith((".csv", ".xlsx")) in this case.
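As a rough sketch of that approach (the helper name has_allowed_extension and the sample filenames are made up for illustration):

```python
def has_allowed_extension(filename, allowed=(".csv", ".xlsx")):
    """Return True if filename ends with one of the allowed extensions.

    Lowercasing first makes the check case-insensitive, and str.endswith
    accepts a tuple of suffixes to try.
    """
    return filename.lower().endswith(allowed)


for name in ["fedex-20241022.xlsx", "REPORT.CSV", "foo.notcsv", "notes.txt"]:
    if not has_allowed_extension(name):
        print(f"Error: {name} is not a .csv or .xlsx file")
```

Because the dot is part of each suffix, foo.notcsv is rejected here, unlike with the '(xlsx|csv)$' pattern.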

It’s often helpful to handle file paths as path objects instead of working with raw strings. That can help save you from re-inventing the wheel every time you need to do a basic path parsing or manipulation, and can help you avoid subtle bugs like the one Barry mentioned.

If you convert the file name to a Path object, you can write this simply as path.suffix not in (".csv", ".xlsx"), which I think you’ll agree is much more readable and foolproof than anything involving a regex.
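A minimal sketch of the pathlib version follows. The helper name is_supported is hypothetical, and lowercasing the suffix is my addition to mirror the re.IGNORECASE flag from the question, since Path.suffix keeps the original case.

```python
from pathlib import Path

ALLOWED_SUFFIXES = (".csv", ".xlsx")


def is_supported(filename):
    # Path.suffix is the final dot-separated component, including the dot.
    # Lowercasing it is an assumption made here for case-insensitive checks.
    return Path(filename).suffix.lower() in ALLOWED_SUFFIXES


print(is_supported("fedex-20241022.xlsx"))  # True
print(is_supported("foo.notcsv"))           # False
print(is_supported("archive.CSV"))          # True
```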

P.S. If your goal is to validate that a file is a CSV/Excel file, I’d recommend waiting until you actually read the file instead of trying to make that determination preemptively from the file name alone. There are plenty of reasons why (for example) a valid CSV file might not have a .csv extension (both technical and not; think process substitutions and email filters), and it can be quite annoying to users when a program is picky about the name of a file when all that really matters is the contents.

Thanks, I’m still learning and memorizing many details about Python. I will take notes on this one.

If .endswith() takes a tuple as a parameter, should the tuple end with a comma like this? .endswith((".csv", ".xlsx",))

One of my Python tutorials said all tuples must end with a comma after the last item.

No, tuples are not required to end with a trailing comma. A trailing comma is only strictly necessary when creating a single-element tuple like (1,) (without the comma, (1) is just an expression wrapped in parentheses):

>>> 1, 2
(1, 2)
>>> 1,
(1,)
>>> (1)
1

Can I ask what tutorial that is? Frankly, if it says that, I’d be skeptical of its quality. That’s a major misunderstanding of Python’s syntax that’s also completely trivial to check. If the tutorial gets a basic fact like that wrong, I would be concerned about the level of care it has about getting the subtler points right.

One of the most important things when learning a new programming language is to develop a good mental model for how it actually works. Conversely, one of the most harmful things is to start out with a bad mental model, which doesn’t adapt well to new information, suggests incorrect conclusions, and just overall creates new sources of confusion where there could have been none.

I’d recommend avoiding the latter by sticking to reputable, vetted sources. The official Python tutorial is one such place. For comparison, here’s its coverage of the quirks of tuple syntax: 5. Data Structures — Python 3.12.3 documentation

Danger, Will Robinson.

Automatic file format detection can be a security hazard.

Suppose you’ve been informed that there is a critical Excel bug, and for the time being you should avoid processing Excel files from external sources. Then you receive a file monthly_report.csv, and you think that it should be OK, because it’s CSV, not Excel. You feed it to an importer which inspects the file, discovers that the content is actually Excel, puts it through Excel processing, and you’re pwned.

If the automatically detected file format disagrees with the file extension, that should be a hard error. Or at the very least, there should be a warning and a confirmation prompt.
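Here is a rough sketch of what such a hard error could look like. It assumes the only detection needed is telling "looks like a ZIP container" (which is what .xlsx files are) apart from everything else, so it is a crude stand-in for real format detection. The function name check_extension_matches_content is hypothetical.

```python
from pathlib import Path

# XLSX files are ZIP containers, and ZIP archives normally begin with these bytes.
ZIP_SIGNATURE = b"PK\x03\x04"


def check_extension_matches_content(path):
    """Raise ValueError if the extension and the leading bytes disagree.

    Hypothetical helper: it only checks the first four bytes, so it cannot
    tell an .xlsx workbook from any other ZIP archive.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    with path.open("rb") as f:
        looks_like_zip = f.read(4) == ZIP_SIGNATURE

    if suffix == ".csv" and looks_like_zip:
        raise ValueError(
            f"{path.name}: named .csv but the content looks like a ZIP/Excel file"
        )
    if suffix == ".xlsx" and not looks_like_zip:
        raise ValueError(
            f"{path.name}: named .xlsx but the content is not a ZIP container"
        )
    return suffix
```

With this in place, a monthly_report.csv that actually contains Excel data is rejected before any importer sees it, rather than being silently routed to the Excel code path.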