Why is this regex not matching the filename?

c-rob · May 16, 2024, 3:34pm

I’m still a bit new to Python but not new to regex. I’m using Python 3.11 on Windows 10.

I want to check if a filename does not end with .xlsx or .csv (does not match '(xlsx|csv)$') then print an error. But regex is not working as expected. I must be missing something specific to Python.

>>> import re
>>> print(re.match(r'(xlsx|csv)$', 'fedex-20241022.xlsx', re.IGNORECASE)
... )
None

What am I doing wrong here?

EDIT: I’m getting a (correct) match on Jupyter Notebook when I use this code:

import re
fn = 'fedex-20241022.xlsx'
print(re.search(r'\.(xlsx|csv)$', fn, re.IGNORECASE))

jamestwebber · May 16, 2024, 3:36pm

re.match tries to match the whole string, not just a piece of it. You could use .*(xlsx|csv)$ to match the prefix or you could use re.search to search the string.

edit: As you found, match and search are different.

c-rob · May 16, 2024, 3:42pm

Thank you. Does re.search look for a match starting from the first character, or does it only return the first matching string?

jamestwebber · May 16, 2024, 3:46pm

I actually made a mistake. re.match tries to match at the beginning of the string, but it doesn’t need to match the whole thing. Matching the whole string is what re.fullmatch does. The documentation covers all of this.

I’m not sure what you’re asking here. It searches from the beginning of the string and returns the first match. There’s also re.findall to find multiple non-overlapping matches.
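A minimal sketch of the differences described above, using the filename from the original question. The variable names are only for illustration.

```python
import re

fn = 'fedex-20241022.xlsx'
pattern = r'\.(xlsx|csv)$'

# re.match only tries to match at position 0, so a suffix-only pattern fails.
m_match = re.match(pattern, fn, re.IGNORECASE)

# re.search scans forward from the start and returns the first match it finds.
m_search = re.search(pattern, fn, re.IGNORECASE)

# re.fullmatch requires the pattern to cover the entire string,
# so the prefix has to be matched explicitly (here with .*).
m_full = re.fullmatch(r'.*\.(xlsx|csv)', fn, re.IGNORECASE)

# re.findall returns every non-overlapping match as a list.
digit_runs = re.findall(r'\d+', fn)

print(m_match)           # None
print(m_search.group())  # .xlsx
print(m_full.group(1))   # xlsx
print(digit_runs)        # ['20241022']
```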

bschubert · May 16, 2024, 4:05pm

Reference:

barry-scott · May 16, 2024, 4:15pm

You are not including the “.” in the pattern, so it will match foo.notcsv as well.

Personally, I would use filename.lower().endswith((".csv", ".xlsx")) in this case.
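A short sketch of both points, assuming a hypothetical helper name has_allowed_extension (not from the thread). It shows the dot-less pattern accepting foo.notcsv and the endswith approach rejecting it.

```python
import re

def has_allowed_extension(filename, allowed=(".csv", ".xlsx")):
    # Hypothetical helper. str.endswith accepts a tuple of suffixes;
    # lower() makes the check case-insensitive.
    return filename.lower().endswith(allowed)

# Without the dot, the regex also accepts names that merely end in "csv".
loose = re.search(r'(xlsx|csv)$', 'foo.notcsv', re.IGNORECASE)

print(loose is not None)                             # True
print(has_allowed_extension('foo.notcsv'))           # False
print(has_allowed_extension('fedex-20241022.xlsx'))  # True
print(has_allowed_extension('REPORT.CSV'))           # True
```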

bschubert · May 16, 2024, 5:04pm

It’s often helpful to handle file paths as path objects instead of working with raw strings. That can help save you from re-inventing the wheel every time you need to do a basic path parsing or manipulation, and can help you avoid subtle bugs like the one Barry mentioned.

If you convert the file name to a Path object, you can write this simply as path.suffix not in (".csv", ".xlsx"), which I think you’ll agree is much more readable and foolproof than anything involving a regex.
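As a rough sketch of the Path approach: the function name is_unsupported and the constant ALLOWED_SUFFIXES are made up for illustration. One thing to keep in mind is that Path.suffix preserves case, so this version lowercases it to mirror the re.IGNORECASE behaviour from the original question.

```python
from pathlib import Path

# Hypothetical names, chosen for this example only.
ALLOWED_SUFFIXES = (".csv", ".xlsx")

def is_unsupported(filename):
    # Path.suffix is the final extension including the leading dot,
    # or an empty string when the name has no extension.
    return Path(filename).suffix.lower() not in ALLOWED_SUFFIXES

print(is_unsupported('fedex-20241022.xlsx'))  # False
print(is_unsupported('REPORT.CSV'))           # False
print(is_unsupported('foo.notcsv'))           # True
print(is_unsupported('noextension'))          # True
```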

P.S. If your goal is to validate that a file is a CSV/Excel file, I’d recommend waiting until you actually read the file instead of trying to make that determination preemptively from the file name alone. There are plenty of reasons why (for example) a valid CSV file might not have a .csv extension (both technical and not; think process substitutions and email filters), and it can be quite annoying to users when a program is picky about the name of a file when all that really matters is the contents.

c-rob · May 17, 2024, 9:56am

Thanks, I’m still learning and memorizing many details about Python. I will take notes on this one.

c-rob · May 17, 2024, 10:49am

If .endswith() takes a tuple as a parameter, should the tuple end with a comma like this? .endswith((".csv", ".xlsx",))

One of my Python tutorials said all tuples must end with a comma after the last item.

bschubert · May 17, 2024, 11:49am

No, tuples are not required to end with a trailing comma. A trailing comma is only strictly necessary when creating a single-element tuple like (1,) (without the comma, (1) is just an expression wrapped in parentheses):

>>> 1, 2
(1, 2)
>>> 1,
(1,)
>>> (1)
1

Can I ask what tutorial that is? Frankly, if it says that, I’d be skeptical of its quality. That’s a major misunderstanding of Python’s syntax that’s also completely trivial to check. If the tutorial gets a basic fact like that wrong, I would be concerned about the level of care it has about getting the subtler points right.

One of the most important things when learning a new programming language is to develop a good mental model for how it actually works. Conversely, one of the most harmful things is to start out with a bad mental model, which doesn’t adapt well to new information, suggests incorrect conclusions, and just overall creates new sources of confusion where there could have been none.

I’d recommend avoiding the latter by sticking to reputable, vetted sources. The official Python tutorial is one such place. For comparison, here’s its coverage of the quirks of tuple syntax: 5. Data Structures — Python 3.12.3 documentation

AndersMunch · May 17, 2024, 12:14pm

Danger, Will Robinson.

Automatic file format detection can be a security hazard.

Suppose you’ve been informed that there is a critical Excel bug, and for the time being you should avoid processing Excel files from external sources. Then you receive a file monthly_report.csv, and you think that it should be OK, because it’s CSV, not Excel. You feed it to an importer which inspects the file, discovers that the content is actually Excel, puts it through Excel processing, and you’re pwned.

If the automatically detected file format disagrees with the file extension, that should be a hard error. Or at the very least, there should be a warning and a confirmation prompt.
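A very rough sketch of that "hard error on disagreement" rule. The names sniff_format and check_format are hypothetical, and the detection is deliberately crude: it relies only on the fact that an .xlsx file is a ZIP container (which starts with the bytes PK\x03\x04) and assumes anything else is CSV. A real importer would need a more careful check.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"  # signature at the start of a ZIP archive; .xlsx files are ZIP containers

def sniff_format(head):
    # Hypothetical, crude detection: ZIP signature means "xlsx",
    # everything else is assumed to be "csv".
    return "xlsx" if head.startswith(ZIP_MAGIC) else "csv"

def check_format(filename, head):
    # Compare the format declared by the extension with the sniffed one
    # and refuse to continue when they disagree.
    declared = Path(filename).suffix.lower().lstrip(".")
    detected = sniff_format(head)
    if declared != detected:
        raise ValueError(
            f"{filename}: extension says {declared!r} but content looks like {detected!r}"
        )
    return detected

print(check_format("monthly_report.csv", b"date,total\n2024-05-01,42\n"))  # csv

try:
    check_format("monthly_report.csv", ZIP_MAGIC + b"...")
except ValueError as exc:
    print("rejected:", exc)
```

In practice the head bytes would come from reading the first few bytes of the file in binary mode before handing it to any parser.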

c-rob · May 21, 2024, 9:29am

I’m still new to Python. I’m aware that VBA code can be executed within an Excel file. I assume using the Excel app to open an Excel file would execute any VBA code in it.

But does the Python module that reads an Excel file actually execute VBA code?
Does Outlook execute VBA code in an Excel file if Outlook does a preview of the file in the Outlook pane? (We use Outlook email.)

Is there another way that Python reading an Excel file could be dangerous that I’m not thinking of?

Anyway, we do have security software on every computer and something at the network level I assume. That will control known threats, but not necessarily new threats.

AndersMunch · May 21, 2024, 11:12am

The point isn’t that Excel in particular is dangerous, the point is that if you can masquerade a file as file format A when it’s actually in file format B, then you may be able to sneak past security measures designed to guard against malicious format B files.

Excel is a complex file format read by a complex piece of software. As such, it is at risk of having security vulnerabilities, but not necessarily more so than any other complex piece of software.

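On whether the Python module that reads an Excel file executes VBA code: if you’re reading the file as bytes, as a zip file, or with openpyxl (which AFAIK doesn’t support VBA at all), then no.
If you’re using COM to remote-control Excel, then who knows (I don’t).

On whether Outlook executes VBA code when previewing an Excel file: no. If it did, that would be an epic security blunder.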

If you have more questions, you should find a more appropriate forum.