XML.Etree to replace a string in an .odt document - other ways welcome

Michael-Gauthier · February 27, 2023, 7:24am

I have looked for help on the web, on Reddit, on Stack, but I still haven’t been able to find an answer to this and am a bit at-a-loss…

I have an .odt file called “myfile.odt” in a specific directory (that’s not the current working directory) and I want to replace “astring” by “anewstring”, then export the modified file into a new .odt file in the current directory. How can I do that using XML.Etree, which, from what I understand, is the best method for doing it?

I’ve spent the last couple of days trying to work my head around that, but I’m not used to working with XMLs… I’ve tried to adapt countless bits of code I found around the web, but I haven’t been successful so far… I’ve also tried to use odfpy, but it does not seem to work for me due to a know problem explained on their Git. Very briefly, the new file is created but is identical to the input file… So I feel a bit lost and ask for someone familiar with XML.Etree to kindly help me find solution. I’d perfectly accept any other solution that works. I also tried simply considering the .odt file as a text file, but it didn’t work.

Thanks a lot in advance. : )

tjreedy · February 27, 2023, 4:51pm

If this is a one-off job, rather than a recurring job you need to automate, open in LibreOffice, search/replace, and save.

Michael-Gauthier · February 27, 2023, 8:59pm

Thanks, but I did know about this command, and no, it’s not a one-time job, I’d like to automate something that I frequently have to do, and I figured that I’d take this opportunity to learn some more Python in the mean time, but I didn’t think this would be so difficult… : /

EpicWink · February 27, 2023, 10:31pm

~~I’ll update this with examples when I’m at my PC (if I remember).~~ I’ve added a new post below with examples instead.

You can traverse through the element-tree using XPath or recursive and filtered child iteration. Once you’ve got the element you want, you can update it in-place (including it’s children). After modifying, save the etree to a new file.

You’ll have to look up the ODT format’s specification for the feature you want to modify.

Is there a reason you can’t use existing OpenOffice manipulation libraries such as OOoPy and odfpy?

Michael-Gauthier · February 28, 2023, 7:09am

Oh, that would be so lovely if you could indeed provide some example script that would do that! : )))

Well, the odfpy library does not seem to work due to a problem explained on their Git page.

About OOoPy now, I admit that it’s not a library I came across during my research, but I’ll look into it. There’s not much documentation, which is a bit worrying since I’m really a novice programmer, but I’ll definitely take a closer look at it this afternoon.

Thanks for the suggestion, and for any additional help you may provide! : )

EpicWink · February 28, 2023, 7:20am

import xml.etree.ElementTree as ET

with open(in_xml_path, "rb") as f:
    etree = ET.parse(f)
root = etree.getroot()

for child in root:
    if child.tag == "foo":
        for child_child in list(child):
            if child_child.tag == "bar" and not child_child.attrib:
                child.remove(child_child)
        child.attrib["baz"] = "42"

for child in root.findall(".//*[@name='Singapore']/year"):
    child.append(ET.Element("foo"))

with open(out_xml_path, "rb") as f:
    etree.write(f)

See the docs (including XPath support).

Michael-Gauthier · February 28, 2023, 7:46am

Thanks a lot, I’ll take a closer look at that this afternoon, and report on my progress! : )))

rob42 · February 28, 2023, 12:07pm

@Michael-Gauthier

You have some very good info here maybe enough to move this forward. I have no reason (to date) to work with XML files, but I do have some notes that suggest Beautiful Soup could very well be a useful resource for such.

Michael-Gauthier · February 28, 2023, 1:32pm

Thanks a lot for your suggestion of using BeautifulSoup! I have tried it, but despite my attempts, I cannot manage to get over a UnicodeDecodeError… I’ll have to look at it more closely later. : )

rob42 · February 28, 2023, 2:06pm

You’re welcome. I’d be surprised if the error that you’re seeing can’t be fixed, although I can’t say for sure, as I’ve never used BS.

Michael-Gauthier · March 1, 2023, 1:29pm

Hmmm, ok, so I have tried both methods, the XML.Etree one, and the BeautifulSoup one, and I haven’t managed to have success with any so far… : /

Regarding your example script @EpicWink, this is a bit embarrassing, but I couldn’t manage to adapt it for my needs… I have tried to make sense of what each line does, and remove/adapt only what I need by also using the documentation, but I’m a bit stuck… Here is my script so far, which is probably super far from a proper what, so apologies in advance:

import xml.etree.ElementTree as ET

with open("test.odt", "rb") as f:
    etree = ET.parse(f)
root = etree.getroot()
for element in root.iter(root):
    if element.text == 'foo':
        element.text == 'bar'
etree.write('output.odt')

I have tried to go from the root and down throughout the children in order to go over each element (i.e. word) of the .odt file in order to look for all the instances of ‘foo’, but I always get this error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

For now, the file I’m using is a test file, which contains the following:
This is a word.
This is foo .
This is foo as well.

Any further help would be appreciated, and again, sorry for not getting what’s probably a trivial thing… : /

avisser · March 1, 2023, 4:05pm

It looks like Etree complains about the .original odt not being valid XML. If I save your text in LibreOffice and look at it, it appears to be an archive containing (among others) the content in XML form. That implies you would need to extract this document first (which is what odfpy seems to do).

Michael-Gauthier · March 1, 2023, 4:20pm

Thanks for your answer! Sorry, but what do you mean by “you would need to extract this document first”? Would you have an example script that would do that? I know that odfpy seems to be more adapted to that, but again, I think it does not work for me… I may very well have missed something though… : /

avisser · March 1, 2023, 8:24pm

perhaps the term “unzip” or “untar” is more familiar? Anyway, there are filemanagers and archive managers that can do that for you, possibly one of Python’s standard library “Data Compression and Archiving” packages might help you there

dstromberg · March 1, 2023, 8:37pm

I think xmltodict is generally the easiest way to do XML in Python.

It does have the drawback of reading an entire document into virtual memory though.