Not necessarily. Just thinking through how to pick up the comparison symbols and their operands.
We might be able to strip out all spaces and then process the rest. Would you like to see how the strings look without spaces while I get some sleep, @cheesebird ?
I’m thinking: find each type of comparison symbol, then check for a space to its left (inserting one if there is none), and run the same check to its right.
Notice that I didn’t throw out any code as a reply. I’m just designing an algorithm at this point.
It’s almost 3am here, though, so I’m going to need to sign off for some sleep.
This may be similar to a data science process where you spend 60% or more of your time preparing the data for processing.
All of these separate rules are starting to look like they want a bunch of match:case branches.
QUESTION: Is this a one-time process where you have a bunch of archived files that you just need to process once, no matter how ugly that process might be, or will you have more source XML files to process in the future?
I would then need to find the index of where the match was found in list2 and join it with the following index, but I get this error and can’t see a way around it currently.
print (s2.index(fv))
ValueError: ['0.0 > '] is not in list
It looks like you are creating s1 as a tuple of strings and s2 as a tuple of a list and a string. I strongly suggest you start using meaningful, explanatory variable names. No one (including you a few weeks later) will understand your code. The effort to analyze the code could become greater than the effort to create it.
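To illustrate the ValueError above, here is a minimal sketch with made-up data (the names `tokens` and `found` are hypothetical, not taken from the original code). It assumes the value being searched for is a one-element list, as the error message suggests, while the list being searched holds plain strings.

```python
# Hypothetical data: a list of string fragments produced by an earlier split.
tokens = ["ES24", "0.0 > ", "5"]

# A one-element list, e.g. what re.findall() returns for a single hit.
found = ["0.0 > "]

try:
    tokens.index(found)          # searches for a *list* among strings
except ValueError as err:
    print("lookup failed:", err)

# Searching for the string inside the list works.
position = tokens.index(found[0])

# Join the matched fragment with the fragment that follows it.
joined = tokens[position] + tokens[position + 1]
print(position, repr(joined))
```

If the lookup can legitimately miss, wrapping the second `index()` call in the same try/except keeps the script from stopping on the first odd string.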
For developing similar code in such a highly experimental way I recommend using notebooks like JupyterLab. You can run and modify pieces of code there repeatedly. It can even be used from within editors like VS Code.
Your code above cannot be executed. Please post code which executes without syntactic errors.
>>> s1 = (re.sub('\b[^\.|^\d|^\d\.](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)') , r'', string1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: sub() missing 2 required positional arguments: 'repl' and 'string'
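For reference, a small sketch of the call shape that `re.sub()` expects. The TypeError comes from the closing parenthesis sitting right after the pattern, so `repl` and `string` end up outside the call. The pattern and the sample string below are simplified stand-ins, not the regex from the post. The second half shows a separate point that affects the posted pattern: without an `r` prefix, `'\b'` is a backspace character rather than a word boundary.

```python
import re

string1 = "(ES34> 5)"  # hypothetical sample string

# Broken shape:  (re.sub(pattern), r'', string1)  -> sub() receives one argument.
# Working shape: pattern, replacement and input all inside the same call.
cleaned = re.sub(r"[()]", "", string1)
print(cleaned)

# '\b' in a normal string literal is a backspace; r'\b' is a word boundary.
backspace_hits = re.findall("\bES34\b", string1)
word_boundary_hits = re.findall(r"\bES34\b", string1)
print(backspace_hits, word_boundary_hits)
```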
And to clarify, you said the expressions sometimes use inconsistent spaces like this, yes?
(ES34> 5)
It’s possible that the AND and OR operators can appear both within and between parenthetical clauses. However, I haven’t seen any || inside of any parentheticals. I’m operating on that assumption, at least, and have updated the reverse-engineered “spec” in my post above. Will you please (re)read the structural rules there and confirm or refute their accuracy? (And please quote the rule you’re clarifying when you reply.)
Lastly, YOUR POST HERE still needs proper backtick fencing around the code blocks.
ALL: The admin team has enabled a CodeCOPY feature in Discord. You can now click on the ‘copy’ icon that appears when you hover over a code block. (Code blocks only, not in-line code.)
For any cases where we see the < or > signs joined to the alphanumeric substring or the digit, e.g. FF99> 10 or FF99 >10, etc.
I sort this out before I get to the splitting-up routine with .replace("<", " < ").replace(">", " > ")
So this might add some extra whitespace but I split the string on whitespace once it gets to here… 5 ES24 26
and then format it to this… ES24 <> 5:26
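The pad-then-split step described above can be sketched roughly like this. The function name `pad_and_split` is made up for illustration, and the sketch stops at the token list: the final formatting step is left out because its exact rules aren’t spelled out here.

```python
def pad_and_split(expr: str) -> list[str]:
    """Pad every < and > with spaces, then split on whitespace."""
    padded = expr.replace("<", " < ").replace(">", " > ")
    # split() with no argument collapses runs of whitespace,
    # so the extra spaces added by the padding do no harm.
    return padded.split()

for sample in ("FF99> 10", "FF99 >10", "FF99>10"):
    print(pad_and_split(sample))
```

All three spellings come out as the same three tokens, which is the point of padding first.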
Thank you. Now we can get some view of the full challenge.
[EDIT: ] These also don’t contain any GEQ or LEQ operators.
There was also this bugger that you sent me with the GT/LT on the outsides… It doesn’t “read” logically and isn’t even internally consistent (using Polish Notation for the first triplet and Reverse Polish Notation for the last).
(> 15 DD67 && DD67 18 <)
Is this syntax really in your source file? If so, then collapsing the strings by removing spaces is a no-go, since the 67 and the 18 will merge and become unparseable as two numbers.
I was thinking that we might be able to use all non-alphanumeric characters as delimiters for parsing, but the number merging kills this option.
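A quick sketch of the merging problem, using the sample string above. It pulls out the alphanumeric runs (i.e. treats everything else as a delimiter) before and after stripping the spaces; the variable names are just for illustration.

```python
import re

expr = "(> 15 DD67 && DD67 18 <)"

# With spaces intact, every operand is its own alphanumeric run.
tokens_with_spaces = re.findall(r"[A-Za-z0-9]+", expr)
print(tokens_with_spaces)

# After removing the spaces, neighbouring operands fuse together.
collapsed = expr.replace(" ", "")
tokens_collapsed = re.findall(r"[A-Za-z0-9]+", collapsed)
print(collapsed)
print(tokens_collapsed)
```

Once 15 fuses with DD67 and DD67 fuses with 18, there is no reliable way to tell where one operand ends and the next begins.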
Okay, well, this is a little on the rough side, but it cracks ‘string1’
Not sure if I’ll have any time today to move this forward, but here’s what I have so far; like I say, a little rough, but maybe it could be polished up:
I did. I’m not familiar enough with regex to do any better than you have and regex may not offer the full functionality needed.
Also, I’m still on STEP 1 of finding what “doesn’t change” in terms of defining the structures and variations. If you’ll give me more sample data strings, I can get into a deeper Zen state of consulting the character of the strings and their various forms and facets.
Regex may well be the proper tool, but it’s impossible to tell before fully consulting the Oracle: the data itself. Design comes first, building comes second.
No worries. My thoughts are that we’re not trying to ‘pattern match’ per se, that is to say we’re not looking for some random alphanumeric characters; we know what we’re looking for, so why use Regex when we can string.find() for what we know is going to be there.
I think the way forward with my approach is to do some string splitting and then work on the sub-string to extract what’s needed, building an output string as we go.
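As a rough sketch of that idea (not a finished solution): look for a known operator with `str.find()`, then slice out the operands on either side. The helper name `split_on_operator` and the sample clause are assumptions for illustration only.

```python
def split_on_operator(clause: str, operator: str):
    """Return (left, operator, right), or None if the operator is absent."""
    position = clause.find(operator)   # -1 means "not found"
    if position == -1:
        return None
    left = clause[:position].strip()
    right = clause[position + len(operator):].strip()
    return left, operator, right

print(split_on_operator("ES34> 5", ">"))
print(split_on_operator("ES34 5", ">"))
```

Because the slices are stripped, the inconsistent spacing around the operator stops mattering at this stage.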
Here’s a regex I made to match the range pattern …
I would first run the original code and delete all brackets, ands and ors
And this will exclude other patterns.
I would have to do a re.findall to write to a new list and then delete the matches from the original string removing the range pattern and leaving only the non range patterns.
Or is there another method besides re.findall followed by re.sub?
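One possible shape for the findall-then-remove step, as a sketch only. `RANGE_PATTERN` below is a hypothetical stand-in that assumes a range looks like `5 < ES24 < 26`; it is not the regex mentioned above, and the function name `extract_ranges` is made up as well.

```python
import re

# Hypothetical stand-in pattern: number < name < number.
RANGE_PATTERN = re.compile(r"\d+\s*<\s*\w+\s*<\s*\d+")

def extract_ranges(expr: str):
    """Collect the range matches, then return what is left of the string."""
    ranges = RANGE_PATTERN.findall(expr)
    remainder = RANGE_PATTERN.sub(" ", expr)
    # Normalise the whitespace left behind by the removal.
    return ranges, " ".join(remainder.split())

ranges, rest = extract_ranges("5 < ES24 < 26 FF99 > 10")
print(ranges)
print(rest)
```

Using the same compiled pattern for both calls keeps the collected matches and the removed text in step with each other.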