Alternative for variable length lookbehind

Deepak275 · February 28, 2023, 12:58pm

Hi Team

I have list of files, each file have this format:


<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>

<tag1 ip3>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>

What i am trying to achieve is to replace the text inside tag2 with specific ip of tag1.
Have tried lookbehind, but it is only able to cover the tag1 with ip:

(?<=(<tag1 ip1>)(.|\n)*)(<tag2 \*>)(.|\n)*.*</tag2>

but it is also matching the text inside tag1 with the tag2 and while replacment, it is removing those text. Since the text inside tag1 is of variable length so not able to use the lookbehind as it doesnt support variable length lookbehind.

Please suggest some alternative for this or if i am missing somthing. Thanks

rob42 · February 28, 2023, 1:40pm

One way to do this would be to read a tags file into a list object, then manipulate said object in anyway you choose.

As a POC:

The input file

<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

with open("tags", mode='r', encoding='UTF-8') as file:
    tags = file.readlines()

tags = [tag.strip() for tag in tags]  # clean it

ip = tags[0].rstrip(">").split()[1]

tags[2] = ip

# just see the results
for tag in tags:
    print(tag)

Once you have that, the list object can be used to produce an output file, or whatever.

PythonCHB · February 28, 2023, 5:51pm

This is XML, yes?

It may be possible with regex (well, probably is, they are powerful), but why not read it in as XML, make the change, write it back out as XML?

That would be more robust – a lot less likely to break the XML.

Though if you want something quick and dirty for just this one file – Rob’s idea is a good one – you can spend a lot of time getting the regex right!

Just the first hit on Google:

MRAB · February 28, 2023, 6:55pm

Does this give you the result you want?

Search: (<tag1 (\w+)>[^<]*<tag2 \*>)[^<]*(</tag2>)
Replace: \1\2\3