Splitting a string dynamically

cheesebird · June 6, 2022, 5:59am

Here is my ‘headache’ for today.
I could have a number of strings like this…

string1 = "(FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)"
string2 =  "(FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)"
string3  =   "(FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)  && (FS39> 15)"

This is the output…

print( re.sub(r'\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)', r'', string1))
>>> "(> 15) && (FS22 < 46) || (FS33 > 0.0)"
print( re.sub(r'\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)', r'', string2))
>>> "(FS33 > 0.0) || (> 15) && (FS22 < 46) || (FS33 > 0.0)"
print( re.sub(r'\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)', r'', string3))
>>> "(FS33 > 0.0) || (> 15) && (FS22 < 46) || (FS33 > 0.0)  && (FS39> 15)"

This works for matching and removing the duplicate substrings, but I now want to split the string further by taking out terms such as (FS39> 15) or (FS39 < 15). These should not be removed completely but fed into another routine.

So how would I break up the string in these cases?

cheesebird · June 6, 2022, 7:27am

So i managed this via …

helper = re.findall(r'(\b\w\d\w+\b\s.*\d+)', string1)
for match in helper:
    # remove each captured substring from the original string
    string1 = string1.replace(match, '')
    print(string1, 'replace')

output > string1 15 && FS22 < 46

output > helper FS33 > 0.0

vbrozik · June 6, 2022, 7:49am

Trying to use regexes for everything can become overly complex. If you need to process the individual terms separated by the operators && and ||, why not split the string by these operators?

>>> re.split(r'\s*(&&|\|\|)\s*', string1)
['(FS22 > 15)', '&&', '(FS22 < 46)', '||', '(FS33 > 0.0)']

Notice that the operators (in regex in a capture group) are kept in the result.

Then process the terms as needed:

- normalization (can happen before splitting): whitespace, brackets, upper/lower case, number formats
- duplicate searching: using dictionaries and/or sets

At the end compose the expression back from the processed parts.
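A minimal sketch of the split, normalize, de-duplicate and recompose steps listed above. It assumes that two terms count as duplicates when they are identical after removing whitespace and upper-casing; the function name dedupe_terms is made up for illustration. Unlike the regex in the first post, it drops the whole repeated term rather than just the identifier, and it does not check that the shortened expression is logically equivalent to the input.

```python
import re

def dedupe_terms(expression):
    """Split on && / ||, drop repeated terms, and join the rest back together."""
    # The capture group keeps the operators in the result list.
    parts = re.split(r'\s*(&&|\|\|)\s*', expression.strip())
    terms, operators = parts[0::2], parts[1::2]

    seen = set()
    kept = []  # alternating: term, operator, term, ...
    for index, term in enumerate(terms):
        # Normalization (an assumption): ignore whitespace and letter case.
        key = re.sub(r'\s+', '', term).upper()
        if key in seen:
            continue
        seen.add(key)
        if kept:
            # Re-use the operator that preceded this term in the input.
            kept.append(operators[index - 1])
        kept.append(term)
    return ' '.join(kept)

string2 = "(FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)"
print(dedupe_terms(string2))
# (FS33 > 0.0) || (FS34> 15) && (FS22 < 46)
```

A set is enough here because only membership matters; a dictionary would be the choice if each term also had to carry extra data, such as its first position.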

For more complex expressions you can use full parsing, for example with pyparsing (GitHub: pyparsing/pyparsing, a Python library for creating PEG parsers).

cheesebird · June 6, 2022, 8:12am

Splitting on && or || is not an option unfortunately.
For example, after removing duplicates I get this:
(> 15) && (FS22 < 46) || (FS33 > 0.0)
I need to keep the (> 15) && (FS22 < 46) intact, as it is later processed into FS22[15:46]. If I split with your method I lose this.

Unfortunately my regex (\b\w\d\w+\b\s.*\d+) from the previous reply
captures FS22 < 46 and FS33 > 0.0 but drops (> 15).

So i need to rework it.

vbrozik · June 6, 2022, 8:33am

I do not understand at all what the conditions for your processing are, but you are not limited to processing the list (the result of re.split()) item by item. You can group the items in a next step (into a nested list, or into tuples so that they are hashable), etc.
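One possible way to do such grouping, as a sketch only: the grouping condition is not defined in the thread, so this assumes that terms joined by && belong together and that || starts a new group. The function name group_by_or is hypothetical.

```python
import re

def group_by_or(expression):
    """Return a list of tuples; each tuple holds terms that were joined by &&."""
    parts = re.split(r'\s*(&&|\|\|)\s*', expression.strip())
    groups = [[parts[0]]]
    # Walk the (operator, term) pairs that follow the first term.
    for operator, term in zip(parts[1::2], parts[2::2]):
        if operator == '&&':
            groups[-1].append(term)   # same group as the previous term
        else:
            groups.append([term])     # '||' starts a new group
    # Tuples are hashable, so the groups can be put into sets or dict keys.
    return [tuple(group) for group in groups]

print(group_by_or("(> 15) && (FS22 < 46) || (FS33 > 0.0)"))
# [('(> 15)', '(FS22 < 46)'), ('(FS33 > 0.0)',)]
```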

cheesebird · June 6, 2022, 8:39am

OK, do you have any examples of how I could group

(> 15) && (FS22 < 46)

and (FS33 > 0.0) individually when they are in the same string?

Filtering out via regex is all I can think of.

vbrozik · June 6, 2022, 9:16am

I can create an example for you but I still do not know:

- what is the condition for grouping
- which items to group (only pairs or more?)

cheesebird · June 6, 2022, 9:32am

Ok thanks. So I’ve tried splitting the string by \s+

string1 = '>= 15 FS99 <= 46 SS99 >= 0.0   SS88 >= 90 SS77 == 90 SS77 < 90'
list_splilt = re.split(r'\s+', string1)
list_splilt ['>=', '15', 'FS99', '<=', '46', 'SS99', '>=', '0.0', 'SS88', '>=', '90', 'SS77', '==', '90', 'SS77', '<', '90']

What I need to do is then group the substring that is surrounded by > 15 xxx < 16 (5 elements) and then a second group that will contain 3 elements ‘SS77’, ‘<’, ‘90’

I’ve no idea how to do this dynamically. Substrings such as SS88 are variables, so they will change with each run.
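A rough sketch of one way to form these groups from the token list. It assumes a 5-element group always has the shape operator, number, identifier, operator, number, and that everything else comes in 3-element chunks; neither rule is confirmed in the thread. The names OPERATORS, is_number and group_tokens are made up for illustration, and the 3-element fallback is not validated.

```python
import re

OPERATORS = {'<', '<=', '>', '>=', '=='}

def is_number(token):
    return re.fullmatch(r'\d+(\.\d+)?', token) is not None

def group_tokens(text):
    """Group whitespace-separated tokens into 5-element and 3-element lists."""
    tokens = text.split()
    groups = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 5]
        # Assumed 5-element shape: operator number identifier operator number
        if (len(window) == 5
                and window[0] in OPERATORS and is_number(window[1])
                and window[2] not in OPERATORS and not is_number(window[2])
                and window[3] in OPERATORS and is_number(window[4])):
            groups.append(window)
            i += 5
        else:
            # Fallback: take the next three tokens as identifier operator number
            groups.append(tokens[i:i + 3])
            i += 3
    return groups

string1 = '>= 15 FS99 <= 46 SS99 >= 0.0   SS88 >= 90 SS77 == 90 SS77 < 90'
for group in group_tokens(string1):
    print(group)
```

Because the check looks only at token types and never at specific names or values, it keeps working when the identifiers and numbers change between runs.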

vbrozik · June 6, 2022, 10:20am

What are the strings you are processing??? First I thought they were expressions. Now I see very strange sequences in them: >= with nothing on the left, 15 FS99 (a number followed by an identifier?). What does the comma in <, mean?

Aren’t there some more fundamental problems which should be solved first? How are the strings created? What do they mean?

Why do you edit the result list_splilt? It should be different. list_splilt[-2] should be '<,' not '<'.

Regarding the grouping: I do not see any sequence > 15 or < 16 in string1. I am guessing: did you mean matching this regex r'>=?\s*15' instead of matching exactly '> 15'? What do the numbers 15 and 16 mean? Will they always be the same?

You have to exactly define (for yourself) what you need.

cheesebird · June 6, 2022, 11:04am

Hi so the original string can look something like this…

org_string1 = '(>= 15 FS99 && FS99 <= 46) && ( SS99 >= 0.0) || (SS88 >= 90 && SS77 == 90 && SS67 < 90)'

When I have a string with only this sequence in it…
(>= 15 FS99 && FS99 <= 46)
it’s no problem to use re.sub to remove the duplicates and get…
(>= 15 FS99 <= 46), giving me a between range.

When I have a string with…

SS88 >= 90 && SS77 == 90 && SS67 < 90

I simply split on && and all is well.

When I have a combination of both, i.e. org_string1, then my code can’t handle it.

So I’m looking for a way to discriminate between the two types of equation when in the same string.

Note in the org_string1 the separator between the two types of equation is || in this case but it could also be &&.

As per my previous reply, I tried building a list, searching it for <, >, <=, >= and checking what’s on either side of each by index, but this falls apart.

All these values can vary on each parse, so it has to be dynamic.

vbrozik · June 6, 2022, 12:12pm

So the input string should be a valid expression? With a subset of the C operators?

Numbers are like in C too? Identifiers too? (FS99, SS99 etc.)

So the expressions are similar to C language expressions?

If it is like that then the beginning of the org_string1: (>= is invalid. Are you mixing prefix and infix operator syntax? I.e. prefix >= 15 FS99 would be 15 >= FS99 in the infix notation? I am afraid it is still too wild and unclear.

I understand that you want to shorten the comparison expressions to the shortened Python notation x > 15 and x < 45 → 15 < x < 45. Am I right?

cheesebird · June 6, 2022, 12:50pm

No, not quite. I’m simply taking them from an XML file, reformatting them and outputting them to another XML file to be read on another system. I don’t want to evaluate the formulas in Python, but I need to be able to discriminate between a range and the other type or I will be dropping parts.

vbrozik · June 6, 2022, 1:20pm

I am sorry. I am not able to continue to help you without having the information and I cannot play the guessing game. The input strings are clearly not in the XML format.

mlgtechuser · June 6, 2022, 1:26pm

Welcome to Astronomy Club, Václav!

@cheesebird, do you have some sort of specification for the source data and the destination data? It’s extremely difficult to think with the separations and consolidations you’re needing. As I said over here, they just look like random textbook exercises.

For example, these aren’t actually equations. They’re parameters of some type, right?

If you present your scenario like…

Here’s what I have:

string1 = "(FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)"
string2 = "(FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)"
string3  = "(FS33 > 0.0) || (FS34> 15) && (FS22 < 46) || (FS33 > 0.0)  && (FS39> 15)"

Here’s what I need to turn it into:

FS22[15:46] ...

Because… <description of before and after data structure and some explanation of the elements in the strings>

Presenting the situation and its requirements in bits and pieces means there’s no endgame for anyone trying to help you. It just seems like random actions and shots in the dark toward some unknown end result.

To successfully automate a process, one must FIRST understand how to do the process manually.

rob42 · June 6, 2022, 1:58pm

Kinda brings to mind the OP with the ‘Excel’ conundrum. He had ‘CS’ in his stuff; this has ‘FS’: coincidence?

mlgtechuser · June 6, 2022, 2:10pm

There are some parallels. I think the biggest difference is that @cheesebird has a better handle on the data, the process, and a good bit of the coding.

Let’s see if we can get some visibility on the encoding of the source and destination data.

cheesebird · June 6, 2022, 2:20pm

@mlgtechuser

I’ll try to explain again.

If I have only an equation like this

(10 < DD99) && (DD99 < 15) or even ( > 15 DD67 && DD67 18 <)

The string is reformatted to this by moving the string by index and removing the duplicated substring.

DD99 10:15
DD67 15:18

This gives me a range output, so I’ll refer to this as a range from now on. This is fine if the string only has range values/equations.
If I have only this in the string.

DD88 > 10 && DD11 >= 12

This can also be dealt with in my code easily, and I’ll call this type nonrange as I can’t think of a better name. The nonrange type doesn’t require any special formatting and is fine as is.

DD88 > 10 
DD11 >= 12

When I have a string combined with both types then my code explodes.

(10 < DD99 && DD99 < 15) && (DD88 > 20) &&
 ( > 15 DD67 && DD67 18 <) || (DD11 <= 15)

So I am hoping for some way to discriminate between these two types and split the string accordingly.

As with a previous reply I had it partially working but my regex wasn’t robust enough and picked out nonrange with range values and vice versa.
Note that matching by brackets also didn’t work as some equations use brackets and others don’t.
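A minimal sketch of one way to tell the two types apart. It assumes every term has the plain infix shape "variable operator number" or "number operator variable"; the prefix and postfix layouts such as > 15 DD67 or DD67 18 < are not handled and raise an error. It also assumes a range is a lower bound followed by an upper bound on the same variable, joined by &&. All names (FLIP, COMPARISON, parse_comparison, classify) are hypothetical, brackets are simply discarded, and the && / || between groups are not kept.

```python
import re

# Operator to use when the two sides of a comparison are swapped.
FLIP = {'<': '>', '<=': '>=', '>': '<', '>=': '<=', '==': '=='}
COMPARISON = re.compile(r'(\w+(?:\.\d+)?)\s*(<=|>=|==|<|>)\s*(\w+(?:\.\d+)?)')

def is_number(text):
    return re.fullmatch(r'\d+(\.\d+)?', text) is not None

def parse_comparison(term):
    """Return (variable, operator, number) with the variable on the left."""
    match = COMPARISON.fullmatch(term.strip())
    if match is None:
        raise ValueError(f'unsupported term: {term!r}')
    left, operator, right = match.groups()
    if is_number(left) == is_number(right):
        raise ValueError(f'need one variable and one number: {term!r}')
    if is_number(left):
        return right, FLIP[operator], left
    return left, operator, right

def classify(expression):
    """Return ('range', var, low, high) and ('nonrange', var, op, num) tuples."""
    # Brackets are dropped because the data uses them inconsistently.
    flat = expression.replace('(', ' ').replace(')', ' ')
    parts = re.split(r'\s*(&&|\|\|)\s*', flat.strip())
    terms = [parse_comparison(part) for part in parts[0::2]]
    operators = parts[1::2]

    result = []
    i = 0
    while i < len(terms):
        var, op, num = terms[i]
        if i + 1 < len(terms) and operators[i] == '&&':
            next_var, next_op, next_num = terms[i + 1]
            # Lower bound (> or >=) followed by upper bound (< or <=).
            if var == next_var and op[0] == '>' and next_op[0] == '<':
                result.append(('range', var, num, next_num))
                i += 2
                continue
        result.append(('nonrange', var, op, num))
        i += 1
    return result

for item in classify("(10 < DD99 && DD99 < 15) && (DD88 > 20) || (DD11 <= 15)"):
    if item[0] == 'range':
        print(f'{item[1]} {item[2]}:{item[3]}')
    else:
        print(f'{item[1]} {item[2]} {item[3]}')
# DD99 10:15
# DD88 > 20
# DD11 <= 15
```

Like the DD99 10:15 output format described above, the range tuple does not record whether each bound was inclusive.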

Btw, how do you format code in a post on your phone? There is no option.

cheesebird · June 6, 2022, 2:49pm

They are parsed from a tag within the XML and output to a list within my script. This is not an issue; in my posts the string values I am giving have already been extracted from the XML. The XML is not needed anymore, we have the values to process.

mlgtechuser · June 6, 2022, 3:17pm

One of the hardest things to do when communicating is to recognize what your listener doesn’t know–but needs to–out of all the things that you already take for granted about the topic.

These strings you’re processing are pretty esoteric but there is a nonobvious system in its syntax, operators, sequence, etc. What can you tell us about this system?

Try this: Step back and provide the overview of what the source data is and how it’s structured. You (and we) need to find the ‘Zen’ of the data encoding (&&, ||, etc.) and syntax in order to parse and consolidate the data elements. In other words, start by explaining the data structure and encoding instead of addressing any process for now. For example, you refer to these as “equations” but they don’t appear to be. This is confusing. That confusion carries forward into any further explanation, so clear it up from the outset.

The source data is some sort of encoding. If it really is an equation, then how is the expression calculated? (It looks more like an encoding of parameter constraints than an equation expression, though).

Please re-read my “Welcome to Astronomy Club” post above and then take another pass at presenting the data structure and encoding. The next step after that is to explain what you need to convert them to (but not how, yet) and a bit about that data structure and its elements. Working out how to get from source to destination comes a VERY distant third.

It’s always tempting to just jump in and start coding but on anything other than the simplest of processes, that leads to problems because the design step was skipped. Design-on-the-fly (sometimes known as “rectal extrapolation”) doesn’t often work well–if at all. If it were possible to code the required conversion process by discussing the detailed mechanics of the conversion process, we would have made much more progress than we have at this point.

You may have heard of a technique called “rubber ducking”. Explaining the situation clearly and simply to someone clarifies one’s own understanding of the situation.

Also, if one defines the problem sufficiently, then the solution becomes obvious. The problem is “We have <this source data> and we need to convert it to <this form>.” That full discussion, and the complete definition of <this data>, sets the stage for HOW.

vbrozik · June 6, 2022, 3:25pm

I will try to summarize the useful information and my opinion. We would need to confirm it and add possible missing details.

The expressions and the required transformations are too complex to be done primarily by using regexes. IMHO the expression should be parsed, processed and then reconstructed.
What you are going to process: expressions in text strings. Their properties:
- have boolean result (True | False)
- contain variables with alphanumeric identifiers, probably only in the form LLdd (two letters and two digits)
- contain integer and floating point numeric values
- Use C-like boolean operators && || - with normal infix notation
- Use C-like comparison operators: > >= < <=
- Use special comparison operator : defining a non-inclusive numeric range in the form start_value:end_value
  - ! Not sure how to describe an inclusive range.
- The comparison operators use a very special notation (my wild guess)
  - There is a term defining a limit value consisting of an operator O and a number(s) n:
    - operators > >= < <= a) O n b) n O
    - operator : c) nOn
  - This term can be placed in front of or after a variable
    - Examples with the same meaning: 5 > AB12, AB12 5 >, < 5 AB12, AB12 < 5
    - Examples of the numeric range operator: 0:5 AB12, AB12 0:5
    - Equivalent expression for the range above: 5 > AB12 && 0 < AB12
- The boolean operators have lower precedence than the comparison operators
- use () brackets to override the operator precedence
What you want to do.
- Simplify the expression to get the range operators when possible.
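To make the guessed term notation from the list above concrete, here is a small sketch that normalizes the four single-limit layouts to one form. It encodes only that guess: "O n" is read as "variable O n" and "n O" as "n O variable", wherever the variable sits. Whether this matches the real data is unconfirmed in the thread, and the names FLIP and normalize_term are hypothetical.

```python
import re

# Operator to use when the two sides of a comparison are swapped.
FLIP = {'<': '>', '<=': '>=', '>': '<', '>=': '<='}

def normalize_term(term):
    """Return (variable, operator, number) meaning `variable operator number`."""
    tokens = term.split()
    if len(tokens) != 3:
        raise ValueError(f'expected three tokens: {term!r}')
    operators = [t for t in tokens if t in FLIP]
    numbers = [t for t in tokens if re.fullmatch(r'\d+(\.\d+)?', t)]
    variables = [t for t in tokens if re.fullmatch(r'[A-Za-z]+\d+', t)]
    if not (len(operators) == len(numbers) == len(variables) == 1):
        raise ValueError(f'need one operator, one number, one variable: {term!r}')
    operator, number, variable = operators[0], numbers[0], variables[0]
    # Guessed rule: a number written before the operator ("n O") means
    # `n O variable`, so the operator is flipped to put the variable first.
    if tokens.index(number) < tokens.index(operator):
        operator = FLIP[operator]
    return variable, operator, number

for term in ['5 > AB12', 'AB12 5 >', '< 5 AB12', 'AB12 < 5']:
    print(term, '->', normalize_term(term))
# each line ends with ('AB12', '<', '5')
```

Once every term is in this single form, spotting a lower and an upper bound on the same variable becomes a plain comparison of tuples.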