Splitting a string dynamically

mlgtechuser · June 8, 2022, 3:10pm

May I have some more long sample strings, Ross?

Is the = always on the right as shown in this example from a PREVIOUS POST?

string1 = '>= 15 FS99 <= 46 SS99 >= 0.0 SS88 >= 90 SS77 == 90 SS77 < 90'
Is this accurate, by the way? It has no parentheses and doesn’t resolve as a Boolean expression like all of the others do.

cheesebird · June 8, 2022, 3:47pm

@mlgtechuser

No, the range pattern could be anywhere. I’ll dig some out tomorrow.

BTW, did you see my regex? It works 100% if I don’t remove &&.

So a stripped string looks like this now…

18 < && FS88 < 19 HH88 > 12.0 HH55 < 11 10.5 =< && GG13 <= 21.3

Output…

18 < FS88 < 19 10.5 <= GG13 <= 21.3

I modified the posted regex to add [&]+, and it works well when integrated into the full script.
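The regex itself is not reproduced in this part of the thread. As a rough sketch of the kind of pattern that could turn the stripped string into that output, assuming the leftover && always sits between the first operator and the identifier: the pattern, the name extract_ranges, and the normalisation of =< to <= are illustrative assumptions, not the regex that was actually posted.

```python
import re

# Hypothetical pattern: number, operator, leftover "&&", identifier, operator, number.
RANGE_RE = re.compile(
    r"""
    (?<![\w.])            # do not start inside an identifier or a number
    (\d+(?:\.\d+)?)       # lower bound, integer or decimal
    \s*(<=|=<|<)\s*       # first operator ("=<" is tolerated)
    &+\s*                 # the leftover "&&" marker
    ([A-Za-z]\w*)         # identifier such as FS88
    \s*(<=|=<|<)\s*       # second operator
    (\d+(?:\.\d+)?)       # upper bound
    """,
    re.VERBOSE,
)


def extract_ranges(text):
    """Return the range patterns found in text, joined by single spaces."""
    parts = []
    for low, op1, name, op2, high in RANGE_RE.findall(text):
        # Normalise the non-standard "=<" spelling to "<=".
        op1 = op1.replace("=<", "<=")
        op2 = op2.replace("=<", "<=")
        parts.append(f"{low} {op1} {name} {op2} {high}")
    return " ".join(parts)


stripped = "18 < && FS88 < 19 HH88 > 12.0 HH55 < 11 10.5 =< && GG13 <= 21.3"
print(extract_ranges(stripped))
```

Only lower-bound ranges written with < or <= are handled here; ranges written with > or >= would need the operator alternatives extended.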

mlgtechuser · June 9, 2022, 10:18am

did you see my regex? It works 100% if I don’t remove &&.

I saw that you were making progress. Well done!

18 < FS88 < 19 10.5 <= GG13 <= 21.3

Are you now achieving 100% of the output you need or are there still snags?

cheesebird · June 9, 2022, 11:26am

@mlgtechuser

It’s as good as done.

Thanks for the help.

kenahoo · June 12, 2022, 2:12am

I’m coming late to this thread, but from my perspective this does not seem to be a good application for regexes in the traditional “find a single pattern in a string” sense.

As was mentioned earlier, this is a “parse and transform” problem, where there are two traditional options:

1. Write an ad-hoc Python-based parser that uses regexes and a bunch of custom machinery.
2. Use a parser library where you can explicitly and clearly define the structure and syntax of the mini-language you’re trying to parse.

Clearly I favor 2. =)

I’m not very familiar with the options that are available and popular for Python; most of my parser-generator work was done using ANTLR or similar tools. What would people recommend for problems like this nowadays? Still ANTLR?

steven.daprano · June 12, 2022, 3:01am

There are tons of options for writing parsers in Python.

ANTLR
Arpeggio
Canopy
Construct
Lark
Lrparsing
Parsec
Parsey
Parsimonious
PLY
PlyPlus
Pyleri
Pyparsing
PyPEG
Reparse
TatSu
WaxEye

I may have missed some. I think that Pyparsing may be the most popular among people with little or no previous parsing experience.

Read the link for more details on the libraries.

ptmcg · April 9, 2023, 10:31am

Thanks for the link to pyparsing. This is actually a pretty involved expression to process using regexen. Not only are there groups in parentheses with comparison operators, but logical operators as well.

Here is an initial pyparsing parser, using its infix_notation helper function. infix_notation takes an expression for the base operand (in this case an identifier or a real or integer number) and a list of tuples that specify the operators, their arity, and left- or right-associativity. The order of the tuples in the list indicates their precedence of operations.

import pyparsing as pp

# define expressions that will be parsed as expression operands
ident = pp.Word(pp.alphas, pp.alphanums)
integer = pp.Word(pp.nums)
real = pp.Regex("\d+\.\d+")

comparison_op = pp.one_of("> < >= <= != ==")
expr = pp.infix_notation(
    ident | real | integer,
    [
        (comparison_op, 2, pp.OpAssoc.LEFT),
        ("&&", 2, pp.OpAssoc.LEFT),
        ("||", 2, pp.OpAssoc.LEFT),
    ]
)

Pyparsing expressions have a run_tests method that makes it easy to process a number of sample strings:

expr.run_tests("""\
    (FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)
    (FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)
    (FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)  && (FS39> 15)
""")

The output can be fairly verbose, especially with deeply nested data, but an abbreviated form is:

(FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)
[[[['FS22', '>', '15'], '&&', ['FS22', '<', '46']], '||', ['FS33', '>', '0.0']]]

(FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)
[[['FS33', '>', '0.0'], '||', [['FS34', '>', '15'], '&&', ['FS22', '<', '46']], '||', ['FS33', '>', '0.0']]]

(FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)  && (FS39> 15)
[[['FS33', '>', '0.0'], '||', [['FS34', '>', '15'], '&&', ['FS22', '<', '46']], '||', [['FS33', '>', '0.0'], '&&', ['FS39', '>', '15']]]]

If you look closely, you’ll see that the “&&” operations are grouped, indicating their higher precedence. The parenthesized parts are grouped as well, and if parentheses were added to override the operator precedence, those would also be grouped.

Actually evaluating this expression can be done, but that is a much longer story than this post can cover. The pyparsing repo includes several examples using infix_notation: eval_arith.py is probably closest to this parser, and simpleBool.py includes evaluation of logical operations.
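To give a feel for what evaluation involves, here is a minimal sketch that walks nested lists of the shape shown in the output above. It is not taken from eval_arith.py or simpleBool.py. The function name evaluate, the variables dict, and the sample values are hypothetical, and it works on plain Python lists rather than pyparsing’s own result objects so that it runs without pyparsing installed.

```python
import operator

# Map each comparison operator in the grammar to a Python function.
COMPARISONS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(node, variables):
    """Evaluate a nested [operand, op, operand, ...] list against variables."""
    if isinstance(node, str):
        # Leaf: numbers start with a digit, identifiers with a letter.
        return float(node) if node[0].isdigit() else variables[node]
    if len(node) == 1:
        # Unwrap single-element groups such as the outermost list.
        return evaluate(node[0], variables)
    # Flat left-associative chain: fold the operators from left to right.
    result = evaluate(node[0], variables)
    for op, rhs in zip(node[1::2], node[2::2]):
        right = evaluate(rhs, variables)
        if op == "&&":
            result = bool(result) and bool(right)
        elif op == "||":
            result = bool(result) or bool(right)
        else:
            result = COMPARISONS[op](result, right)
    return result


# Parsed form of: (FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)
tree = [[[["FS22", ">", "15"], "&&", ["FS22", "<", "46"]], "||", ["FS33", ">", "0.0"]]]
print(evaluate(tree, {"FS22": 20, "FS33": 0.0}))
print(evaluate(tree, {"FS22": 50, "FS33": 0.0}))
```

This sketch does not short-circuit, and a chained comparison such as a < b < c would be folded as (a < b) < c rather than treated as a range check, so range patterns would need their own handling.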

ptmcg · April 10, 2023, 3:06pm

Gah! Should be real = pp.Regex(r"\d+\.\d+"). It needs to be a raw string literal (a string prefixed with an ‘r’).
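A small stdlib-only illustration of why the raw prefix matters for regex patterns in general. The sample strings are made up for the demonstration. With \d the non-raw literal happens to keep its backslash (recent Python versions emit a warning about the invalid escape), but with a recognized escape such as \b the pattern silently changes meaning.

```python
import re

# \b is a recognized escape (backspace) in a normal string literal,
# so only the raw literal passes a backslash-b on to the regex engine.
plain = "\bcat\b"
raw = r"\bcat\b"
print(len(plain), len(raw))

# The raw pattern matches whole words; the plain one looks for backspace characters.
print(re.findall(raw, "cat concat cat"))
print(re.findall(plain, "cat concat cat"))

# The corrected pattern for real numbers, written as a raw string.
real_pattern = r"\d+\.\d+"
print(re.findall(real_pattern, "HH88 > 12.0 HH55 < 11 10.5"))
```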