Command line interface for the `re` module

storchaka · August 17, 2023, 2:20pm

Some modules in the stdlib also have a simple CLI. For example, the gzip module can compress and decompress files, the uuid module generates UUIDs, and the sqlite3 module provides an SQL REPL. A number of modules (tokenize, ast, symtable, dis) also allow inspecting Python sources at different levels.

Using commands such as grep and sed every day, I often regret their limited regular expression syntax. I propose adding a CLI for the re module that would allow searching and replacing in files using the syntax of Python regular expressions. For searching, the obvious candidate is an implementation of the grep command. For replacing, I’m still thinking about the interface. It would be possible to implement the sed command, but I think that it may be too complex for this.

The benefits:

Windows users will have access to powerful tools out of the box.
All users will be able to use more powerful regular expression syntax in these tools.

So I have a few questions for the community:

Is it a good idea at all? I have already implemented grep, with an almost full set of the GNU grep options, all in just 250 lines.
How to invoke it?
- python3 -m re.grep – a submodule of re, separate submodules for different commands. It was one of my internal reasons for reorganizing re into a package.
- python3 -m re.tool grep – a submodule of re for the CLI, similar to json.tool. Different commands are subcommands of this module.
- A separate script in the Tools/scripts directory. But the recent tendency seems to be to wipe out this directory.
Is there an existing CLI tool similar to grep for replacing text, so that we can borrow its interface instead of designing something new?

erlendaasland · August 17, 2023, 2:32pm

It is a nice idea.

How about simply python -m grep? If not, I would prefer python -m re.grep over the tool variant.

(My favourite stdlib script is python -m calendar)

jeanas · August 17, 2023, 2:53pm

ripgrep (rg command) is quite popular these days. There are also pcregrep, ack and ag, off the top. And, of course, git grep.

Note that rg --engine=pcre2 or git grep -P or ag or ack gives you Perl-compatible regular expressions, which is about what Python supports, so I’m not sure this is worth doing. All of these support Windows.

guido · August 17, 2023, 3:36pm

Since the trend for this kind of tool seems to be towards short command names, can’t we just use

python -m re PATTERN FILE ...

as the basic incantation? It could import re.__main__ so it oughtn’t weigh down programmatic imports of the re module.
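
As a rough illustration of the `python -m re PATTERN FILE ...` idea, here is a minimal sketch of what such an entry point might do. It is not the code from the draft PR. The names `grep_lines` and `main`, the `path:lineno:text` output format, and the exit codes (borrowed from the grep convention) are all assumptions made for this example.

```python
import re
import sys


def grep_lines(pattern, lines, prefix=""):
    """Yield 'prefix + lineno:text' for every line that the pattern matches."""
    regex = re.compile(pattern)
    for lineno, line in enumerate(lines, start=1):
        if regex.search(line):
            text = line.rstrip("\n")
            yield f"{prefix}{lineno}:{text}"


def main(argv):
    """Hypothetical entry point: argv is [PATTERN, FILE, ...]."""
    if len(argv) < 2:
        print("usage: PATTERN FILE ...", file=sys.stderr)
        return 2
    pattern, *paths = argv
    found = False
    for path in paths:
        # errors="replace" keeps the sketch from failing on undecodable bytes.
        with open(path, encoding="utf-8", errors="replace") as f:
            for hit in grep_lines(pattern, f, prefix=f"{path}:"):
                print(hit)
                found = True
    # Assumed convention, as in grep: 0 if anything matched, 1 otherwise.
    return 0 if found else 1
```

A `__main__.py` inside the package would then only need to call something like `sys.exit(main(sys.argv[1:]))`, so a plain `import re` would not have to load any of this.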

I don’t think we should put any effort into perf or functionality to match tools like ripgrep or my personal standby, good old ag (Silver Searcher). It would be nice if it could search directories though.

Maybe once we have no-GIL implemented we can add a -j flag.

khs · August 17, 2023, 6:09pm

To reiterate an earlier response: what are the benefits compared with rg and other tools that are used everywhere today? ripgrep is one of the first tools I install if I’m getting a new dev system.

stoneleaf · August 17, 2023, 6:32pm

They are not used everywhere today, as not everyone is a professional developer; but even casual Python programmers can benefit from an easy-to-use and built-in grep tool.

jamestwebber · August 17, 2023, 6:39pm

This seems like it could be a package?

Adding a slow version of grep to the stdlib seems like a lot of added maintenance, documentation, and support for a small convenience ^[1]

Beginners who don’t know to install a better tool would be better served with a tutorial on how to write the appropriate python script for their task.

honestly this doesn’t even seem that convenient to me? ↩︎

AA-Turner · August 17, 2023, 7:38pm

Serhiy noted that this was implemented in 250 lines, which doesn’t seem a massive burden, and often you do just have standard-library Python on a computer, so I can see the benefits (cf. batteries-included).

A

storchaka · August 17, 2023, 7:45pm

Thank you, it is interesting. But they only support search (ripgrep also allows replacement in the output), not search & replace.

Yes, it was also my initial idea. But I want to support two different operations: “search” and “search & replace”. They need two different commands.

Better discoverability and accessibility for Python users.
More familiar regex syntax for Python users.
And the main reason – a CLI is a living test bench. Even for core devs it may be easier to use a CLI for a simple test than to use the REPL or write a script. It helps to keep both the module to which the CLI belongs and argparse in healthy condition.

I have created an issue and a draft PR (only code, without docs and tests): Command line interface for the `re` module · Issue #108095 · python/cpython · GitHub

jamestwebber · August 17, 2023, 8:08pm

250 lines might be how it starts, but the scope creep has already begun.

Even more than maintaining the code, I was thinking that “how do I use grep” is suddenly a relevant topic for Python Help and similar places. This requires documentation and continuous support to be a useful feature for beginners, and it’s a redundant feature for non-beginners.

Is there an OS that comes with python pre-installed, but doesn’t give you grep ^[1]? In any case, I thought the idea was to discourage using system-managed python installations. I use a bunch of different environments in my work, and I really don’t want to worry that the activated environment will change how my file-searching tool works.

I feel like the intersection of people who a) strongly prefer python regular expressions to other tools and b) can’t install a PyPI package is roughly no one. This feels like a perfect candidate for a package that people can install in their path. There’s no need to add the documentation and support burden to the stdlib, and it would delay availability by many years (what’s the version on that “os-provided python” you’ve got, anyway?).

Personally I lean more on the side of PEP 594, and I don’t think python needs any more miscellaneous batteries than it has.

i.e. does Windows? ↩︎

smontanaro · August 17, 2023, 8:09pm

Isn’t a regular expression search command already widely available? grep (or egrep or grep -E, depending upon when you started with a Unix CLI) in Unix-like environments is widely available, even on Windows (WSL), right?

I use sed, though only in its most elementary form (e.g., sed -e s/pattern/replacement/[g], sometimes sed -e '/pattern/d' or its complement, sed -n '/pattern/p'). I never understood all the complexity of the rest of it (hold spaces and such). I also use awk where it’s simpler (input lines conveniently broken up into words, making {print $3, $5, $7, ...} trivial).
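
For reference, the three elementary sed forms mentioned above map fairly directly onto `re` calls. This is only a sketch: the function names are made up for the example, and the pattern and replacement syntax is Python's rather than sed's (for instance, the whole match is `\g<0>` in a Python replacement string, not `&`).

```python
import re


def substitute(pattern, replacement, lines, count=0):
    """Roughly sed 's/pattern/replacement/g'; pass count=1 to drop the 'g'."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, line, count=count) for line in lines]


def delete_matching(pattern, lines):
    """Roughly sed '/pattern/d': keep only the lines that do not match."""
    regex = re.compile(pattern)
    return [line for line in lines if not regex.search(line)]


def print_matching(pattern, lines):
    """Roughly sed -n '/pattern/p': keep only the lines that match."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]
```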

In short, while providing a grep-like CLI for re might be an interesting programming exercise, I suspect that, other than for sussing out the preferred general approach to such interfaces for Python-based command line tools, CLI-ifying other modules/packages would fill more of a niche. For example, years ago I did a lot of work with CSV files. It made sense (to me) to use Unix pipelines for quick-n-dirty transmogrification of such data (moving averages, Sharpe ratios, simple plotting, etc). I found it useful enough that I coaxed my employer at the time to let me take this little toolkit with me when I moved on to my next job. I don’t claim that it’s a world-beating CLI (it could hardly be called “properly designed”), but it worked for me, and I’ve enhanced it in one way or another over the years. Providing command line access to this sort of functionality (or at least helper packages to create command pipelines) might be more useful.

pf_moore · August 17, 2023, 9:01pm

Not on Windows. Yes, tools like ripgrep exist, and there are various ways of getting Unix-like commands. But first of all, we can’t assume these tools are available - in a locked down corporate environment, it’s not impossible that Python is the only tool allowed on servers, for example. And secondly, there’s no way of writing instructions that will work for everyone - “run grep ..., or if you have ripgrep, rg ..., or if you have neither of those but you have git, find out where git is installed and run C:\Path\To\Git\bin\grep.exe ..., or…” Whereas being able to say “run python -m re ...”^[1] works for everyone.

Maybe that’s not important. But “everyone has tools like grep these days” just isn’t true, in my experience. It’s a lot closer than it used to be, and there’s usually something that’s no more than a download away, but having something that will always work is a big advantage.

For comparison, I’ve never actually used py -m zipfile or py -m tarfile, but knowing they are always available is an extremely valuable safety net when working in unknown environments.

(Actually, even more useful would be if difflib had a CLI that handled the basic features of diff, because cross-platform implementations of that are still uncommon).

Yes, “how do I run Python” is still an issue, there’s the launcher, or store Python, etc. But that’s an issue that’s in our control to solve, if we want to ↩︎

guido · August 17, 2023, 9:45pm

I would use a command-line flag for replace. In my experience it is much less common (maybe 1% of all uses).

smontanaro · August 17, 2023, 11:23pm

I’m not a Windows person, but at my last job (3+ years ago) our Windows systems had WSL. Isn’t that widely available?

Rosuav · August 17, 2023, 11:35pm

When assisting people on Mac OS, I’ve often run into the problem of “the tool exists, but THIS flag doesn’t” (because it’s not the GNU utility, it’s another with the same name and somewhat similar features). Having something I could depend on would be good, as long as I can dictate a Python command to someone.

tjreedy · August 18, 2023, 2:05am

IDLE has a grep facility called ‘find in files’ that uses os.walk to find directories, fnmatch.fnmatch to filter files, and str or re functions to search lines. I cannot remember any questions about using this feature and found essentially nothing on Stackoverflow searching [python-idle] for ‘grep’ or ‘“find in files”’. If the grep options were sufficiently well documented, I would not expect a support burden.

Searching installed 3.12 /Lib/*.py for a no-match string takes over 10 seconds on my machine. A repeat with caches full takes under 2 seconds. (I am thinking of having IDLE report the search time in the future.) Would a C-coded grep really be much faster? Finding that no non-idlelib stdlib file does ‘import idlelib’ but one 3rd-party project I have loaded does (both expected) barely took longer.

Edit: I work on Windows so this is my working grep.
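
The combination described above (os.walk for directories, fnmatch.fnmatch for file names, re for lines) can be sketched in a few lines. This is not IDLE's actual code. The function name `find_in_files`, its parameters, and the choice to skip unreadable files silently are assumptions made for illustration.

```python
import fnmatch
import os
import re


def find_in_files(pattern, root, glob="*.py"):
    """Yield (path, lineno, text) for each matching line under root."""
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes the traversal order stable
        for name in sorted(filenames):
            if not fnmatch.fnmatch(name, glob):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as f:
                    for lineno, line in enumerate(f, start=1):
                        if regex.search(line):
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                # Assumed policy for the sketch: skip files that cannot be read.
                continue
```

Something like `find_in_files(r"import\s+idlelib", "Lib")` would then cover the ‘import idlelib’ search mentioned above.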

flyinghyrax · August 18, 2023, 2:06am

Oh fun times! My previous work machine was macOS and I found this out the hard way.

Turns out that grep on macOS is BSD grep, and a very old version of BSD grep. Which is manageable, if it weren’t also multiple orders of magnitude slower than GNU grep. Literally profiled it once. We had an old automation bash script that used grep a lot internally that I ended up writing a Dockerfile for^[1] because running the script in a container was faster than natively with the macOS version of grep.

Anyway. If there’s a point there I guess it’s that it can be nice to have options.

not solely because of grep, but it was definitely a factor… ↩︎

pf_moore · August 18, 2023, 7:27am

It’s sometimes available, but using the tools from WSL in a native shell isn’t easy. They aren’t on PATH by default, and they almost certainly handle filenames in a non-native manner.

ncoghlan · August 21, 2023, 1:45am

An option like “python -m re --sub='replacement text' 'match pattern' files…” would also be substantially less cryptic than sed’s “pattern separators with trailing command” approach.

Other ideas seemed more ambiguous to me:

allowing two args to imply replacement would conflict with the convention of grep accepting the files to search at the end of the command
same rationale for associating the replacement text directly with the CLI option rather than changing the meaning of the second positional command line argument
“--sub” (rather than “--replace”) matches the name of the module function and re pattern method
a submodule would have to be called “re.replace” to avoid conflicting with the “re.sub” function
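
To make the `--sub` suggestion above concrete, here is a minimal argparse sketch. No such CLI exists in the stdlib today. The option and argument names follow the suggestion, while `build_parser`, `process_line`, and the exact behaviour are assumptions for this example only.

```python
import argparse
import re


def build_parser():
    """Hypothetical parser: the pattern comes first, files come last, as with grep."""
    parser = argparse.ArgumentParser(prog="python -m re")
    parser.add_argument("--sub", metavar="REPL", default=None,
                        help="replace each match with REPL instead of searching")
    parser.add_argument("pattern", help="Python regular expression")
    parser.add_argument("files", nargs="*", help="files to process")
    return parser


def process_line(args, line):
    """Return the output for one line, or None if the line should be skipped."""
    regex = re.compile(args.pattern)
    if args.sub is None:
        # Search mode: emit only the lines that match.
        return line if regex.search(line) else None
    # Replace mode: emit every line, with all matches substituted.
    return regex.sub(args.sub, line)
```

Because the replacement text is attached to the option, the positional arguments keep their grep meaning in both modes.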

gwerbin · August 21, 2023, 4:21am

I really like this proposal. It would also be nice if it had some flags to control output format like grep. An optional JSON output mode would be especially interesting for downstream processing.
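
One possible shape for such a JSON mode is one JSON object per match, written one per line. This is a sketch only: the record fields and the function names `match_records` and `to_json_lines` are invented for the example and are not part of any existing proposal.

```python
import json
import re


def match_records(pattern, lines, path="<stdin>"):
    """Yield one dict per match, with its position and named groups."""
    regex = re.compile(pattern)
    for lineno, line in enumerate(lines, start=1):
        for m in regex.finditer(line):
            yield {
                "path": path,
                "line": lineno,
                "start": m.start(),
                "end": m.end(),
                "match": m.group(0),
                "groups": m.groupdict(),
            }


def to_json_lines(records):
    """Serialize the records as newline-delimited JSON."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)
```

Newline-delimited output keeps the result streamable, so a downstream tool can start processing before the search finishes.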
