How to iterate through unbalanced XML hierarchy?

Greetings,

I have an XML file which has an unbalanced hierarchy of nodes: some of the highlight nodes have a note sub-node, but many do not.

Do you know of an easy way to read this XML into an array? I've tried BeautifulSoup, but it seems to be more oriented to strictly formed XML (at least with its lxml-based parser). Example below.

In the code below, if I uncomment the item.find('note').text line I get an immediate error.

Thoughts?

XML CODE

<book>
  <title>The Byzantine Empire (Serapis Classics)</title>
  <authors>Charles Oman</authors>
  <highlights>
    <highlight isNoteOnly="false">
      <text>Byzantium.</text>
      <location url="kindle://book?action=open&amp;asin=B0779GVQNZ&amp;location=53">53</location>
      <note>TEST NOTES ONLY -- ALL NEED TO BE DELETED</note>
    </highlight>
    <highlight isNoteOnly="false">
      <text>During the fifth century Byzantium twice declared war on Athens, now the mistress of the seas, and on each occasion fell into the hands of the enemy—once by voluntary surrender in 439 b.c., once by treachery from within, in 408 b.c.But the Athenians, except in one or two disgraceful cases, did not deal hardly with their conquered enemies, and the Byzantines escaped anything harder than the payment of a heavy war indemnity. In a few years their commercial gains repaired all the losses of war, and the state was itself again.</text>
      <location url="kindle://book?action=open&amp;asin=B0779GVQNZ&amp;location=98">98</location>
    </highlight>
    <highlight isNoteOnly="false">
      <text>Though deprived of a liberty which had for long years been almost nominal, Byzantium could not be deprived of its unrivalled position for commerce.</text>
      <location url="kindle://book?action=open&amp;asin=B0779GVQNZ&amp;location=127">127</location>
      <note>TRIED TO MAKE A NOTE WITHOUT A HIGHLIGHT BUT THE SELECTION MADE IN THE NOTE TURNS INTO A HIGHLIGHT WITH A NOTE ATTACHED TO IT -- THIS IS THAT NOTE...</note>
    </highlight>
  </highlights>
</book>

PYTHON CODE

# Import the parser
from bs4 import BeautifulSoup

# Read the file contents and parse them as XML
with open("mounce.xml", "r") as file:
    contents = file.read()
soup = BeautifulSoup(contents, 'xml')

# Iterate through the highlight tags and extract their child elements
all_items = soup.find_all('highlight')
items_length = len(all_items)

for index, item in enumerate(all_items):
    hltxt = item.find('text').text
    hlloc = item.find('location').text
    # hlnote = item.find('note').text
    print(hltxt)
    print(hlloc)
    # print(hlnote)

I implemented a very inelegant workaround: do a find for the 'note' tag on the current element. If the result is NULL (None), then don't request the text of the element.

BS4 barfs when you ask it for the text of an element which doesn't exist:
item.find('note').text raises an AttributeError on every item which has no note tag

This if statement worked around the error (there have to be better ways to do this?)

for index, item in enumerate(all_items):
    hltxt = item.find('text').text
    hlloc = item.find('location').text
    hlnote = item.find('note')
    # Only read .text when the note tag is actually present
    if hlnote is not None:
        hlnote = hlnote.text
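
One tidier option is to wrap the None check in a small helper. The sketch below is only an illustration: child_text is a hypothetical name, and SAMPLE is a shortened stand-in for the real file. The helper relies only on find() returning None for a missing child and on the .text attribute, which BS4 tags and the standard library's ElementTree elements both provide, so the demo uses ElementTree to avoid third-party installs.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real file: the second highlight has no <note>.
SAMPLE = """
<book>
  <highlights>
    <highlight isNoteOnly="false">
      <text>Byzantium.</text>
      <location>53</location>
      <note>TEST NOTE</note>
    </highlight>
    <highlight isNoteOnly="false">
      <text>Second highlight.</text>
      <location>98</location>
    </highlight>
  </highlights>
</book>
"""


def child_text(parent, name, default=None):
    """Return the text of the first child called `name`, or `default` if absent."""
    child = parent.find(name)  # None when there is no such child
    return default if child is None else child.text


root = ET.fromstring(SAMPLE)
for item in root.iter("highlight"):
    hltxt = child_text(item, "text")
    hlloc = child_text(item, "location")
    hlnote = child_text(item, "note")  # None for highlights without a note
    print(hltxt, hlloc, hlnote)
```

Passing default="" instead would give an empty string for missing notes, which can be handier if the values are later joined or written to a CSV.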

Maybe something like this?

import xml.etree.ElementTree as ET

XML_FILE = "mounce.xml"  # path to the XML file
root = ET.parse(XML_FILE).getroot()
highlights = root.findall('.//highlight')

for highlight in highlights:
    note = highlight.find('note')
    if note is not None:
        print(note.text)

    text = highlight.find('text')
    if text is not None:
        print(text.text)

    location = highlight.find('location')
    if location is not None:
        print(location.text)

    print('-' * 20)

That leads to this printout:

TEST NOTES ONLY -- ALL NEED TO BE DELETED
Byzantium.
53
--------------------
During the fifth century Byzantium twice declared war on Athens, now the mistress of the seas, and on each occasion fell into the hands of the enemy—once by voluntary surrender in 439 b.c., once by treachery from within, in 408 b.c.But the Athenians, except in one or two disgraceful cases, did not deal hardly with their conquered enemies, and the Byzantines escaped anything harder than the payment of a heavy war indemnity. In a few years their commercial gains repaired all the losses of war, and the state was itself again.
98
--------------------
TRIED TO MAKE A NOTE WITHOUT A HIGHLIGHT BUT THE SELECTION MADE IN THE NOTE TURNS INTO A HIGHLIGHT WITH A NOTE ATTACHED TO IT -- THIS IS THAT NOTE...
Though deprived of a liberty which had for long years been almost nominal, Byzantium could not be deprived of its unrivalled position for commerce.
127
--------------------
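
If the goal is an array rather than a printout, the same loop can collect one record per highlight. This is a rough sketch, not a definitive implementation: read_highlights and the dict keys are made-up names, and SAMPLE is a shortened stand-in for the real file. It uses ElementTree's findtext(), which returns None by default when the child is missing.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real file: the second highlight has no <note>.
SAMPLE = """
<book>
  <highlights>
    <highlight isNoteOnly="false">
      <text>Byzantium.</text>
      <location>53</location>
      <note>TEST NOTE</note>
    </highlight>
    <highlight isNoteOnly="false">
      <text>Second highlight.</text>
      <location>98</location>
    </highlight>
  </highlights>
</book>
"""


def read_highlights(root):
    """Collect one dict per <highlight>; missing children become None."""
    rows = []
    for highlight in root.iter("highlight"):
        rows.append({
            "text": highlight.findtext("text"),
            "location": highlight.findtext("location"),
            "note": highlight.findtext("note"),  # None when there is no <note>
        })
    return rows


rows = read_highlights(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

For the real file, ET.parse(XML_FILE).getroot() would take the place of ET.fromstring(SAMPLE).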

Edit: Though this is similar to what you had. What was the error/issue you were hitting?

I think with lxml you can use an XPath query to iterate over these things directly, something like:

from lxml import etree

et = etree.parse(your_file)

# including trailing whitespace and newlines...
all_the_text = et.xpath("//text()")

You can use a more specific XPath query for better results. But XPath makes this pretty easy, once you learn it.
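
As a rough illustration of a more specific query, the sketch below selects only the highlights that carry a note. Note the swap: it uses the standard library's ElementTree, which supports only a limited subset of XPath, so that it runs without installing anything; the same two expressions should also be accepted by lxml's xpath(). SAMPLE is a shortened stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real file: only two of three highlights have a <note>.
SAMPLE = """
<book>
  <highlights>
    <highlight><text>One</text><location>53</location><note>First note</note></highlight>
    <highlight><text>Two</text><location>98</location></highlight>
    <highlight><text>Three</text><location>127</location><note>Second note</note></highlight>
  </highlights>
</book>
"""

root = ET.fromstring(SAMPLE)

# [note] is a predicate: keep only highlights that have a <note> child.
with_notes = root.findall(".//highlight[note]")

# Go straight to the note elements, skipping highlights without one.
note_texts = [note.text for note in root.findall(".//highlight/note")]

print(len(with_notes))
print(note_texts)
```

Because the query itself skips highlights without a note, no None check is needed inside the loop.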

Edit: Though this is similar to what you had. What was the error/issue you were hitting?

The error I was hitting is that:

  1. not every highlight entry has a note child node. (I called this an “unbalanced hierarchy” – is there a better term in XML parlance?)
  2. for missing note entries, if you try to assign hlnote = item.find('note').text, BS4 throws an error that there is no such element. I was rather hoping it just assigned NULL / None and moved on.
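
On point 2, find() does hand back None for a missing child; the error appears one step later, when .text is read from that None. A minimal sketch of this, using the standard library's ElementTree (an assumption made so that it runs without third-party installs; BS4's find() also returns None when nothing matches):

```python
import xml.etree.ElementTree as ET

# A single highlight with no <note> child.
highlight = ET.fromstring(
    "<highlight><text>Byzantium.</text><location>53</location></highlight>"
)

note = highlight.find("note")
print(note)  # find() returns None for the missing child

try:
    note.text  # equivalent to None.text, which is where the error comes from
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

So the lookup itself behaves as hoped; it is chaining .text onto the result in the same expression that fails.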

I’ll have to look at XML handling outside of BS4.