Command line interface for the `re` module

smontanaro · August 21, 2023, 5:26pm

Apologies if this has already been discussed. A quick skim of the entire thread (but not @storchaka’s code) suggests not. I think it would be worthwhile to discuss the standard I/O parts of such an interface. Again, putting on my old man Unix hat, I think the default input and output should be sys.stdin and sys.stdout. Explicitly specifying input and output files should not be required. If you want to override that, a standard way to specify input and output files would establish a best practice for future modules-as-commands. Examining common input/output file specs (other than stdin/stdout) in Unix pipelines would be worthwhile. Maybe:

[[-i/--input] infile]
[[-o/--output] outfile]

or

[infile [outfile]]

make sense to me.
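A minimal sketch of the first variant, assuming argparse and using "-" as the spelling for the standard streams. The parser, the `open_stream` helper, and the option names are hypothetical illustrations of the proposal, not an existing `python -m re` interface.

```python
import argparse
import contextlib
import sys


def build_parser():
    # Hypothetical parser: both streams default to "-", meaning stdin/stdout.
    parser = argparse.ArgumentParser(prog="python -m re")
    parser.add_argument("pattern")
    parser.add_argument("-i", "--input", default="-", metavar="infile")
    parser.add_argument("-o", "--output", default="-", metavar="outfile")
    return parser


def open_stream(name, mode):
    """Return a context manager for the named file, or for stdin/stdout if name is "-"."""
    if name == "-":
        # nullcontext hands back the standard stream without closing it on exit.
        return contextlib.nullcontext(sys.stdin if mode == "r" else sys.stdout)
    return open(name, mode, encoding="utf-8")
```

With this shape, a caller that passes no file options reads from sys.stdin and writes to sys.stdout, and overriding either one is a single flag.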

guido · August 21, 2023, 5:45pm

From one old UNIX hat to another, I don’t think we need a -i flag, the default is sys.stdin or any number of files you specify after the pattern on the command line.

But I like the idea of a -o flag, defaulting to sys.stdout, with the special meaning that (if this flag is given) the output file is opened after all the inputs have been fully read, so you can write e.g.

python -m re foo --sub bar file.py -o file.py

to update a file in place.

I would use that. I would probably first test it with output to stdout and eyeball whether it did the right thing before using -o, but if I still got it wrong, git checkout file.py would restore the original for me.
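The deferred-open behaviour of -o described above could look roughly like this. It is only a sketch: the `substitute` function and its parameters are hypothetical, and it assumes UTF-8 text small enough to hold in memory.

```python
import re
import sys


def substitute(pattern, repl, input_paths, output_path=None):
    """Apply re.sub to every input, then write the combined result.

    Hypothetical helper: every input is read to the end before the output
    is opened, so the output path may also appear among the inputs.
    """
    regex = re.compile(pattern)
    results = []
    for path in input_paths:
        with open(path, encoding="utf-8") as src:
            results.append(regex.sub(repl, src.read()))
    text = "".join(results)
    # Only now is the output opened (and, for a real file, truncated).
    if output_path is None:
        sys.stdout.write(text)
    else:
        with open(output_path, "w", encoding="utf-8") as dst:
            dst.write(text)
```

Calling `substitute("foo", "bar", ["file.py"], "file.py")` would then correspond to the command line shown above.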

sirosen · August 22, 2023, 4:04am

Before jq became popular, I frequently saw people use json.tool as part of their workflows, to pretty print JSON data.
I could imagine re usage becoming similarly popular as a highly portable find-and-replace. Strong +1 for this idea.

-o would work in lieu of -i, but wouldn’t users avoid it on the assumption that it has the same ordering issue as shell output redirection? Regardless of how well it’s documented, I think it may save effort over the long run, mostly spent explaining the feature, if -i were used instead.

guido · August 22, 2023, 5:05am

I’m actually not sure what you’re saying here. It sounds like you’re arguing that -i input_file is more intuitive than -o output_file? But that makes no sense (to me, anyway).

What does -i do in your mind? To me it looks like a redundant way of specifying “here come input files” – redundant because the UNIX convention is that all remaining non-flag arguments are usually input files. This convention doesn’t leave room for an output file (cp notwithstanding), hence -o if there’s a reason the tool needs to know the name of the output file.

I’m sure I learned about the idea of -o (holding the output until after all the inputs have been read) long ago from some UNIX command, and I’ve always assumed that was sed – but sed doesn’t seem to have a -o flag! I recall using temp files for years when doing global substitutions (before this was something text editors could do more easily) and then one day stumbling over some utility’s -o flag in the manual and going “Aha! Just what I always wanted.” Was it an older, non-POSIX version of sed? Or some other utility?

sirosen · August 22, 2023, 5:16am

To me, -i carries its meaning from GNU sed, where I think it stands for “in place”.
Some other commands have picked this up as well, but I can’t recall any good examples offhand (gawk, maybe?).

-o works if you wait to open the output file, but not all commands do this. So my thought was basically that -o might have an ambiguity which -i doesn’t.

If it can do in-place replacements, regardless of the exact interface, an re CLI will be valuable. IIRC, BSD sed doesn’t support -i.

EDIT: Sorry for the confusion – I just reread and realized that -i was being proposed as -i/--infile, which is very different from what I was suggesting. I was thinking of it in the GNU sed sense of that flag and didn’t even realize I had changed the meaning of the flag.

AndersMunch · August 22, 2023, 12:24pm

Sounds very ambitious. Eventually you are going to want to have every feature that ripgrep has, and then that one extra feature, search-replace. I think PyPI, not standard library.

Are you aware of grin? It doesn’t have search-replace either, but it’s written in Python, so perhaps you could contribute something on that front.

laclouis5 · August 22, 2023, 4:58pm

WSL is unfortunately not widely available. In my experience it can be difficult to install and there is a lot of troubleshooting on some less supported platforms (including activating virtualization support in the UEFI and debugging obscure error messages).

This can be nontrivial to do, and I would definitely rely on a Python CLI for grep and other tools on such systems.

gwerbin · August 22, 2023, 6:54pm

An --inplace flag could be interesting for the same reason Sed has it, because foo < data.txt > data.txt doesn’t work right. Although it puts more responsibility on the internals of re.__main__ to work correctly, because now you’re making destructive changes to people’s files. Maybe that shouldn’t go in with v1.0.
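One common way to limit the risk of a destructive in-place edit is to write to a temporary file and swap it in afterwards. The sketch below assumes that approach; `sub_in_place` is a hypothetical helper, not something proposed for re.__main__ in this thread.

```python
import os
import re
import tempfile


def sub_in_place(pattern, repl, path):
    """Rewrite `path` with substitutions applied, swapping the new file in at the end."""
    with open(path, encoding="utf-8") as src:
        new_text = re.sub(pattern, repl, src.read())
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
        # The original is replaced only after the new content is fully written.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Note that this sketch does not preserve the original file's permissions or ownership, which a real implementation would probably need to handle.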

I also don’t think anyone should expect this to emulate Grep/Ag/Ack/Ripgrep functionality like excluding Git-ignored files and precisely controlling the output format. For input, I think we can adhere to the Unix philosophy and rely on other tools for that, e.g. this contrived alternative to git grep:

git ls-files | xargs python -m re 'pattern'

For other input options like encoding, we can maybe rely on the PYTHON* env vars.

As for control of output, I think we can keep it simple:

By default, emit filenames and matched lines/text on the same line together
A “grouped” format where each filename is a heading and matched lines/text follow beneath
An option to list filenames only without matched lines
An option to omit filenames, emitting only matched lines/text
An option to list matched text only instead of the entire line (with/without filenames)
A JSON or JSON Lines format for precise unambiguous downstream processing

The JSON Lines format could look like a stream of these things, delimited by ASCII \n:

{
  "file": "...",
  "lineno": 0,
  "line": "...",
  "match": {"start": 0, "end": 2, "text": "..."},
  "groups": {
    "1": {"start": 0, "end": 1, "text": "..."},
    "foo": {"start": 1, "end": 2, "text": "..."}
  }
}

Using JSON Lines format instead of plain JSON provides much better support for streaming output. Wouldn’t hurt to add it to the stdlib json module either.
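As a rough illustration of that record shape, the following sketch builds one record per match and writes each on its own line. The function names are hypothetical, line numbers are assumed to be 1-based, and groups that did not participate in the match are emitted as null; none of these choices are settled in the thread.

```python
import json
import re


def span_info(m, group):
    """Describe one group of a match, or None if the group did not participate."""
    if m.start(group) == -1:
        return None
    return {"start": m.start(group), "end": m.end(group), "text": m.group(group)}


def match_records(regex, lines, filename):
    """Yield one JSON-serialisable dict per match in `lines`."""
    # Map group index -> name so named groups are keyed by name.
    names = {index: name for name, index in regex.groupindex.items()}
    for lineno, line in enumerate(lines, start=1):  # assumption: 1-based
        line = line.rstrip("\n")
        for m in regex.finditer(line):
            yield {
                "file": filename,
                "lineno": lineno,
                "line": line,
                "match": span_info(m, 0),
                "groups": {
                    names.get(i, str(i)): span_info(m, i)
                    for i in range(1, regex.groups + 1)
                },
            }


def write_jsonl(records, out):
    for record in records:
        # json.dumps without indent never emits a raw newline,
        # so each record stays on a single line.
        out.write(json.dumps(record) + "\n")
```

Because records are written as they are produced, a downstream consumer can start processing before the search has finished.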

kknechtel · August 22, 2023, 7:27pm

JSON can only use strings as dict (what it calls “object”) keys, so that groups specification is impossible (and text needs to be quoted). On the other hand, group names apparently may not start with a digit, so it should be fine to just stringify the numeric indices for unnamed groups. (They could be included redundantly for named groups, too, but there isn’t a way in JSON to avoid actually duplicating the data in that case.)
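Both points can be checked quickly in Python. This is just a small demonstration using the json and re modules; the group name `1st` is a made-up example.

```python
import json
import re

# JSON object keys must be strings, so json.dumps converts integer keys.
encoded = json.dumps({1: "a", "foo": "b"})
print(encoded)

# A group name starting with a digit is rejected at compile time, so a
# stringified index such as "1" cannot collide with a named group.
try:
    re.compile(r"(?P<1st>x)")
    name_accepted = True
except re.error as exc:
    name_accepted = False
    print("rejected:", exc)
```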

gwerbin · August 22, 2023, 7:54pm

Thanks for the correction, I can never remember what is and isn’t allowed in various “dict-like” representations.

PeterL · August 24, 2023, 11:05pm

Throwing my 2c in here. Many of my clients are on Windows, and I can’t install any other software. They have python, but no WSL and can’t pip install. So a CLI to re would be a boon for me.

vstinner · September 15, 2023, 10:55am

See also “Add a page to the documentation listing stdlib modules with a command-line interface (CLI)”: Python stdlib had around 49 command-line interfaces.

pylang · September 16, 2023, 3:12am

I remember looking into this a while ago. Glad to see them listed in one place. I think the following may be missing from your list, which all have interfaces via python -m <tool> --help:

dis 
doctest 
http.server
pickle 
pickletools 
pydoc 
uu

There are also diff.py and pstats, however they fit in among them.

hugovk · September 16, 2023, 7:33am

Thanks, they’re all included in the PR, except for uu which is a dead battery being removed as part of PEP 594.

Perhaps diff.py could be integrated into python -m difflib instead of _test().

orent · September 17, 2023, 5:34pm

I would prefer a tool that helps debugging and verifying python regular expressions over yet another grep.

How about rewriting the input regexp as a spaced and indented expression in re.X mode? And show exactly how an input string was matched?
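The second half of that idea, showing how an input string was matched, might start from something like the sketch below. `explain_match` is a hypothetical helper that only reports the span of the overall match and of each group; it does not attempt to rewrite a pattern into re.X form.

```python
import re


def explain_match(pattern, text, flags=0):
    """Return lines describing where the pattern and each group matched."""
    regex = re.compile(pattern, flags)
    m = regex.search(text)
    if m is None:
        return ["no match"]
    # Map group index -> name so named groups are reported by name.
    names = {index: name for name, index in regex.groupindex.items()}
    lines = [f"match [{m.start()}:{m.end()}] {m.group()!r}"]
    for i in range(1, regex.groups + 1):
        label = names.get(i, str(i))
        if m.start(i) == -1:
            lines.append(f"group {label}: did not participate")
        else:
            lines.append(f"group {label} [{m.start(i)}:{m.end(i)}] {m.group(i)!r}")
    return lines


# Usage with a pattern written in re.X (verbose) mode.
verbose_pattern = r"""
    (?P<year>\d{4}) -   # four-digit year, then a literal hyphen
    (?P<month>\d{2})    # two-digit month
"""
for line in explain_match(verbose_pattern, "released 2023-08", re.X):
    print(line)
```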

vstinner · September 18, 2023, 4:03pm

As I wrote, if we add a grep-like CLI, it should be called re.grep, since there are other potential use cases with different CLIs. I proposed a sed-like CLI