Splitting a string dynamically

mlgtechuser · June 6, 2022, 5:51pm

Not necessarily. Just thinking through how to pick up the comparison symbols and their operands.

We might be able to strip out all spaces and then process the rest. Would you like to see how the strings look without spaces while I get some sleep, @cheesebird ?

P.S. This is quite an intriguing puzzle.

mlgtechuser · June 6, 2022, 5:55pm

I’m thinking: find each type of comparison symbol, then check for a space to its left (inserting one if there is none) and run the same check to its right.

Notice that I didn’t throw out any code as a reply. I’m just designing an algorithm at this point.

It’s almost 3am here, though, so I’m going to need to sign off for some sleep.

This may be similar to a data science process where you spend 60% or more of your time preparing the data for processing.

All of these separate rules are starting to look like they want a bunch of match:case branches.

mlgtechuser · June 6, 2022, 6:03pm

QUESTION: Is this a one-time process where you have a bunch of archived files that you just need to process once, no matter how ugly that process might be, or will you have more source XML files to process in the future?

cheesebird · June 6, 2022, 6:14pm

The idea was to get it 100% functional so that anyone can use it.

It certainly will be used again in the future when we get more files.

cheesebird · June 7, 2022, 6:48am

Here is one possible solution, but I’m lacking the know-how to complete it…

string1 = "(FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)"
s1 = (re.sub ('\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)') , r'', string1)
s2 = (re.findall('\((.*?)\)',s1)

['( >= 15', 'FS22< 46', 'FS33> 0.0 '] s2

fv = list(filter(lambda v: match('^[0-9<>\(\)]', v),s2))
# print (s2.index(fv))
print(fv,'fv')

['( >= 15'] fv


I would then need to find the index in the list where the match was found and join that item with the one at the following index, but I get this error and can’t currently see a way around it.

print (s2.index(fv))
ValueError: ['0.0 > '] is not in list
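
One reading of the error: fv is itself a list, and list.index() looks for a single item equal to its argument, so passing the whole filtered list can never succeed. A minimal sketch that records positions instead, assuming s2 holds the three strings printed above (the names clauses, flagged and pairs are illustrative, not from the original code):

```python
import re

# Sample list copied from the s2 output shown above.
clauses = ['( >= 15', 'FS22< 46', 'FS33> 0.0 ']

# Keep the position of every item that starts with a digit, a comparison sign or a bracket.
flagged = [i for i, clause in enumerate(clauses) if re.match(r'^[0-9<>()]', clause)]

# Pair each flagged item with the item that follows it, when there is one.
pairs = [(clauses[i], clauses[i + 1]) for i in flagged if i + 1 < len(clauses)]

print(flagged)
print(pairs)
```

Working with positions from enumerate() avoids the second lookup entirely, so there is nothing left for index() to fail on.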

vbrozik · June 7, 2022, 7:22am

It looks like you are creating s1 as a tuple of strings and s2 as a tuple of a list and a string. I strongly suggest that you start using meaningful, explanatory variable names. Otherwise no one (including you a few weeks later) will understand your code. The effort to analyze the code could become greater than the effort it took to write it.

For developing similar code in such a highly experimental way I recommend using notebooks like JupyterLab. You can run and modify pieces of code there repeatedly. It can even be used from within editors like VS Code.

Your code above cannot be executed. Please post code which executes without syntactic errors.

>>> s1 = (re.sub('\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)') , r'', string1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sub() missing 2 required positional arguments: 'repl' and 'string'

cheesebird · June 7, 2022, 7:38am

This should be

s1 = (re.sub(r'\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)') , r'', string1)

I missed the r.
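
For what it’s worth, the TypeError quoted above seems to come from the parenthesis that closes the re.sub( call right after the pattern, so repl and string never reach the function. A hedged sketch with all three arguments inside the call, reusing the pattern unchanged (the names pattern, without_first_duplicate and clauses are illustrative):

```python
import re

string1 = "(FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)"

# Pattern copied from the post; only the closing parenthesis of re.sub() has moved.
pattern = r'\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)'

# Blank out a token when the lookahead finds it repeated later in the string.
without_first_duplicate = re.sub(pattern, '', string1)

# Collect the text inside each pair of brackets.
clauses = re.findall(r'\((.*?)\)', without_first_duplicate)

print(without_first_duplicate)
print(clauses)
```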

mlgtechuser · June 7, 2022, 9:18am

Ross, will you please post several “worst case” examples like this one?

(ES33 > 0.0) || (ES34 > 5) && (ES34 < 16) || (EZ99 > 0.0)  && (ES39 > 15)

And to clarify, you said the expressions sometimes use inconsistent spaces like this, yes?

(ES34> 5)

It’s possible that the AND and OR operators can each appear both within and between parenthetical clauses. However, I haven’t seen any || inside of any parentheticals. I’m operating on that assumption, at least, and have updated the reverse-engineered “spec” in my post above. Will you please (re)read the structural rules there and confirm or refute their accuracy? (And please quote the rule you’re clarifying when you reply.)

Lastly, YOUR POST HERE still needs proper backtick fencing around the code blocks.

mlgtechuser · June 7, 2022, 9:42am

ALL: The admin team has enabled a CodeCOPY feature in Discord. You can now click on the ‘copy’ icon that appears when you hover over a code block. (Code blocks only, not in-line code.)

cheesebird · June 7, 2022, 9:48am

This is a good worst-case scenario:

(ES33 > 0.0) || (ES34 > 5) && (ES34 < 16) || (EZ99 > 0.0) && (ES39 > 15)

or
(ES33 > 0.0) || (ES34 > 5) && (ES34 < 16) || (EZ99 > 0.0) && (ES39 > 15) || (ES24 > 5) && (ES24 < 26)

When I have a standalone range case
(ES34 > 5) && (ES34 < 16) && (ES24 > 5) && (ES24 < 26)
My existing code works fine. It breaks it down like this…

( > 5) && (ES34 < 16)
5 ES34 16
ES34 <> 5:16
(> 5) && (ES24 < 26)
5 ES24 26
ES24 <> 5:26

For any cases where we see the <|> signs joined to the alphanumeric substring or the digit
FF99> 10 or FF99 >10 etc
I sort this out before I get to the splitting-up routine with .replace("<", " < ").replace(">", " > ")

So this might add some extra whitespace but I split the string on whitespace once it gets to here…
5 ES24 26
and then format it to this…
ES24 <> 5:26
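
The padding, splitting and formatting steps described above could be strung together roughly like this. It is only a sketch: format_range is a hypothetical helper, and it assumes the duplicate name has already been removed so that exactly three tokens (low value, name, high value) remain.

```python
def format_range(expression: str) -> str:
    """Turn '( > 5) && (ES24 < 26)' into 'ES24 <> 5:26' (hypothetical helper)."""
    # Pad the comparison signs so 'ES24<26' and 'ES24 < 26' tokenize the same way.
    padded = expression.replace("<", " < ").replace(">", " > ")

    # Brackets and the AND operator carry no information at this stage.
    for token in ("(", ")", "&&"):
        padded = padded.replace(token, " ")

    # Split on whitespace and drop the comparison signs themselves.
    values = [part for part in padded.split() if part not in ("<", ">")]

    # Assumption: exactly three tokens are left; anything else raises ValueError.
    low, name, high = values
    return f"{name} <> {low}:{high}"


print(format_range("( > 5) && (ES24 < 26)"))
print(format_range("(>5)&&(ES24<26)"))
```

The extra whitespace introduced by the padding is harmless because str.split() with no argument collapses runs of spaces.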

mlgtechuser · June 7, 2022, 10:00am

Thank you. Now we can get some view of the full challenge.

[EDIT: ] These also don’t contain any GEQ or LEQ operators.

There was also this bugger that you sent me with the GT/LT on the outsides… It doesn’t “read” logically and isn’t even internally consistent (using Polish Notation for the first triplet and Reverse Polish Notation for the last).

(> 15 DD67 && DD67 18 <)

Is this syntax really in your source file? If so, then collapsing the strings by removing spaces is a no-go, since the 67 and the 18 will merge and become unparseable as two numbers.

I was thinking that we might be able to use all non-alphanumeric characters as delimiters for parsing, but the number merging kills this option.
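
The merging problem is easy to reproduce. A small sketch using the sample above (the variable names are illustrative):

```python
sample = "(> 15 DD67 && DD67 18 <)"

# Collapsing every space glues the name DD67 to the number 18 that follows it.
collapsed = sample.replace(" ", "")
print(collapsed)  # (>15DD67&&DD6718<)

# Splitting on whitespace instead keeps the tokens apart.
tokens = sample.strip("()").split()
print(tokens)  # ['>', '15', 'DD67', '&&', 'DD67', '18', '<']
```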

mlgtechuser · June 7, 2022, 10:10am

We need more examples to see the variations (since the spec hasn’t been forthcoming–what’s the status on that?).

To see, for example, if the ‘ X > ’ and ‘ X < ’ are always in the same order or if they also appear in reversed order.

It also looks like the OR operator is a top-level delimiter (so it should be the first place to split up the string). Is this true?
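
If the OR operator really is the top-level delimiter, the first split could look something like this sketch. It assumes || never appears inside a parenthetical clause, which has not been confirmed yet; the names expression and groups are illustrative.

```python
expression = "(ES33 > 0.0) || (ES34 > 5) && (ES34 < 16) || (EZ99 > 0.0) && (ES39 > 15)"

# Split on OR first, then split each OR-group on AND.
groups = [
    [clause.strip() for clause in group.split("&&")]
    for group in expression.split("||")
]

for group in groups:
    print(group)
```

Each inner list is then one AND-group, so a range such as ES34 between 5 and 16 shows up as a two-item list whose clauses share a name.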

rob42 · June 7, 2022, 10:10am

@cheesebird
Could you clarify the objective, please?

Are we simply (yeah, right) trying to get to the ‘output’ in any way possible, or are we restricted to using Regex here?

Thanks.

cheesebird · June 7, 2022, 10:59am

There can be cases where we see this, but they are rare…

((< 10 FF99) && (FF99 11 >))

Another syntax maybe :-

((10 < FF99) && (FF99 < 11)
((FF99 > 10) && (FF99 > 11))

Did you see my previous post? Did you try my previous code, posted this morning? I had it partially working.

cheesebird · June 7, 2022, 11:04am

Hi @rob42

No restrictions on anything. I have used regex because it was the only way I could match this syntax…

((FF99 <10) && (FF99 > 20))

i.e. FF99
Then the duplicate is removed.

( <10) && (FF99 > 20))

And stripped down to this…

10 FF99 20

And then to the final desired output…

FF99 <> 10:20

rob42 · June 7, 2022, 11:09am

Thanks.

Okay, well, this is a little on the rough side, but it cracks ‘string1’.

Not sure if I’ll have any time today to move this forward, but here’s what I have so far; like I say, a little rough, but maybe it could be polished up:

string1 = "(FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)"
string2 = "(FS33 > 0.0) || (FS99 > 15) && (FS99 < 46) || (FS38 > 0.0)"

sample = "((FF99 <10) && (FF99 > 20)) || (FS33 > 0.0)"

inProcess = sample
output = ''

getIndex = inProcess.find('(')+1
test = inProcess[getIndex]
if test == '(':
    index = getIndex+1
else:
    index = getIndex

offset = 4

andProcess = inProcess.index('&&')
getNum = andProcess-offset

orProcess  = inProcess.index('||')

find = inProcess[index:index+offset] # a str that holds chrs 1 to 4
output += find
index = getNum
output += ' <> ' + inProcess[getNum:getNum+2]
getNum = andProcess+11
output += ':' + inProcess[getNum:getNum+2]

output += '\n'+inProcess[orProcess+offset:orProcess+14]
print(output)

{code edit to take care of the sample ((FF99 <10) && (FF99 > 20))}

This could be a blind alley, but take it for what it’s worth

mlgtechuser · June 7, 2022, 11:14am

I did. I’m not familiar enough with regex to do any better than you have and regex may not offer the full functionality needed.

Also- I’m still on STEP 1 of finding what “doesn’t change” in terms of defining the structures and variations. If you’ll give me more sample data strings, I can get into a deeper Zen state of consulting the character of the strings and their various forms and facets.

Regex may well be the proper tool, but it’s impossible to tell before fully consulting the Oracle: the data itself. Design comes first, building comes second.

cheesebird · June 7, 2022, 3:30pm

That looks promising. I haven’t had a chance to run it fully, but it’s something to play with.

Thanks

rob42 · June 7, 2022, 3:44pm

No worries. My thoughts are that we’re not trying to ‘pattern match’ per se, that is to say we’re not looking for some random alphanumeric characters; we know what we’re looking for, so why use Regex when we can string.find() for what we know is going to be there.

I think the way forward with my approach is to do some string splitting and then work on each sub-string to extract what’s needed, building an output string as we go.
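
A rough sketch of that split-then-extract idea, using only string methods. Everything here is an assumption layered on the samples in this thread: parse_clause and summarise_expression are hypothetical helpers, each clause is assumed to read name, operator, value, and only the plain < and > operators are handled.

```python
def parse_clause(clause):
    """Split '(FS22 > 15)' into ('FS22', '>', '15'). Assumes name-operator-value order."""
    inner = clause.strip().strip("()")
    for op in (">", "<"):
        if op in inner:
            name, value = inner.split(op, 1)
            return name.strip(), op, value.strip()
    raise ValueError(f"no comparison operator in {clause!r}")


def summarise_expression(expression):
    """Build one output line per clause, merging same-name AND pairs into a range."""
    lines = []
    for group in expression.split("||"):  # OR is treated as the top-level delimiter
        parts = [parse_clause(clause) for clause in group.split("&&")]
        if len(parts) == 2 and parts[0][0] == parts[1][0]:
            # Two clauses on the same name are written as a range.
            (name, _, first), (_, _, second) = parts
            lines.append(f"{name} <> {first}:{second}")
        else:
            lines.extend(f"{name} {op} {value}" for name, op, value in parts)
    return lines


print(summarise_expression("(FS22 > 15) && (FS22 < 46) || (FS33 > 0.0)"))
print(summarise_expression("((FF99 <10) && (FF99 > 20)) || (FS33 > 0.0)"))
```

Because the clauses are located by splitting rather than by fixed offsets, names and numbers of different lengths should not need special handling, though the reversed forms such as (> 15 DD67) would still need their own rule.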

cheesebird · June 7, 2022, 7:23pm

Here’s a regex I made to match the range pattern …
I would first run the original code and delete all brackets, ANDs and ORs.

And this will exclude other patterns.

I would have to do a re.findall to write to a new list and then delete the matches from the original string removing the range pattern and leaving only the non range patterns.

Or is there another method besides re.findall followed by re.sub?
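
One alternative to running re.findall and then re.sub is a single re.sub call whose replacement is a function: it is called for every match, so it can record the match and return an empty string in one pass. The sketch below is assumption-heavy: the range regex is not shown in the post, so RANGE is a hypothetical stand-in for "number, name, number", and extract_ranges is an invented helper.

```python
import re

# Hypothetical range pattern: "<number> <name> <number>", e.g. "5 ES34 16".
RANGE = re.compile(r"\b(\d+(?:\.\d+)?) ([A-Z]+\d+) (\d+(?:\.\d+)?)\b")


def extract_ranges(text):
    """Return (formatted ranges, leftover text) using one re.sub pass."""
    found = []

    def collect(match):
        low, name, high = match.groups()
        found.append(f"{name} <> {low}:{high}")
        return ""  # delete the matched range from the string

    remainder = RANGE.sub(collect, text)
    # Tidy up the double spaces left behind by the deletions.
    return found, " ".join(remainder.split())


ranges, leftover = extract_ranges("ES33 0.0 5 ES34 16 EZ99 0.0")
print(ranges)
print(leftover)
```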