How to parse and group hierarchical list items from an unindented string in Python?

anni.eabe · March 28, 2024, 11:47am

@c-rob Thank you for taking the time to explore potential solutions to my problem using AI. I sincerely appreciate your efforts and the insights you’ve shared from your own experience working with PDF documents and converting them to various formats.

I have indeed attempted to leverage AI and language models to find a solution to this problem in the past. However, the code suggestions generated by these models often rely on indentation to determine the hierarchy of the lists, which is not a reliable approach for my specific use case. Additionally, many of the AI-generated solutions fail to pass the majority of the test cases I’ve provided, indicating that they are not robust enough to handle the diverse range of inputs and edge cases.

Your firsthand experience with processing OCR material and the challenges you encountered, such as inconsistencies in bullet markings, indentation, and whitespace, resonates with the difficulties I’m facing. Ensuring consistency in the input is indeed crucial for developing a reliable solution. While manual processing of bullets into Markdown worked for your application, I’m hoping to find an automated approach that can handle the variations present in my input data. Nevertheless, your insights have reinforced the importance of considering these factors when developing a solution.

jeff5 · March 28, 2024, 5:30pm

It was a bit thin, but I only had so much time before work. A lot of people have offered help. I’ll read that and see if I still have useful things to say. I think there is a simple grammar here.

jeff5 · March 28, 2024, 6:06pm

I think @FelixLeg has got you most of the way with this. I don’t understand why the program you exhibit to apply it is so complicated (and I don’t think I need to).

There is one fundamental problem, though. Your stated way of identifying top-level and lower-level items by their start is ambiguous. Bullets can start either type, so for any item after the first, the parser cannot tell whether a bulleted paragraph is top-level or begins/continues a sequence of lower-level paragraphs.

This may be one reason why it doesn’t quite work yet.
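To make the ambiguity concrete, here is a minimal sketch. The marker patterns and the `marker_style` helper are my own assumptions for illustration, not your actual specification. When every line starts with the same bullet, the marker alone gives the parser nothing to separate the levels:

```python
import re

# Hypothetical marker patterns: a number followed by "." or ")", or a
# single bullet character. The real specification may allow more.
MARKER = re.compile(r"^(?:(?P<num>\d+[.)])|(?P<bullet>[*\-•]))\s+")

def marker_style(line):
    """Return 'numeric', the bullet character, or None if unmarked."""
    m = MARKER.match(line)
    if m is None:
        return None
    if m.group("num"):
        return "numeric"
    return m.group("bullet")

# A human reads this as two headings with sub-items, but every line
# carries the same marker, so the styles are indistinguishable.
lines = ["* Fruit", "* Apple", "* Pear", "* Vegetables"]
print([marker_style(line) for line in lines])  # ['*', '*', '*', '*']
```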

Now, suppose your specification means that all the top-level paragraphs in a given document share one style (e.g. all numeric, or all the same bullet), while lower-level paragraphs are always recognisably different. I think that must be the case, and if so it is only necessary to remember the top-level style to disambiguate the grammar.
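A rough sketch of that idea, under my own assumptions: the first marked line fixes the top-level style, any other marker is a lower-level item, and an unmarked line continues the previous item. The names `MARKER`, `marker_style` and `group_items` and the exact marker patterns are hypothetical, and this handles only two levels.

```python
import re

# Hypothetical marker patterns; adjust to the real specification.
MARKER = re.compile(
    r"^\s*(?:(?P<num>\d+[.)])|(?P<bullet>[*\-•+]))\s+(?P<text>.*)$"
)

def marker_style(line):
    """Return (style, text); style is 'numeric', a bullet char, or None."""
    m = MARKER.match(line)
    if m is None:
        return None, line.strip()
    style = "numeric" if m.group("num") else m.group("bullet")
    return style, m.group("text")

def group_items(text):
    """Group lines into top-level items, each with a list of children."""
    groups = []
    top_style = None  # remembered from the first marked line
    for line in text.splitlines():
        if not line.strip():
            continue
        style, body = marker_style(line)
        if style is None:
            # Unmarked line: assume it continues the most recent item.
            # Unmarked lines before the first item are ignored.
            if groups:
                last = groups[-1]
                if last["children"]:
                    last["children"][-1] += " " + body
                else:
                    last["text"] += " " + body
            continue
        if top_style is None:
            top_style = style
        if style == top_style:
            groups.append({"text": body, "children": []})
        else:
            # A different marker means a lower-level item under the
            # current top-level item (which always exists by now).
            groups[-1]["children"].append(body)
    return groups

sample = "1. Fruit\n* Apple\n* Pear\n2. Vegetables\n- Carrot"
for group in group_items(sample):
    print(group["text"], group["children"])
```

Note that this deliberately ignores indentation, and it will mis-group a document that reuses the top-level style for lower-level items, which is exactly the ambiguous case.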