How to parse and group hierarchical list items from an unindented string in Python?

anni.eabe · March 27, 2024, 3:01am

Problem Statement

Given an unindented string as input, perform these steps:

Identify the list items at the highest level of the hierarchy within the string. These top-level items can be identified by the following criteria:
- Numbering systems (e.g., 1., 2., 3.)
- Lettering systems (e.g., A., B., C.)
- Bullets (e.g., -, *, •)
- Symbols (e.g., >, #, §)
For each top-level item identified in step 1:

a. Group it with all subsequent lower-level items until the next top-level item is encountered. Lower-level items can be identified by the following criteria:
- Prefixes (e.g., 1.1, 1.2, 1.3)
- Bullets (e.g., -, *, •)
- Alphanumeric sequences (e.g., a., b., c.)
- Roman numerals (e.g., i., ii., iii.)
b. Concatenate the top-level item with its associated lower-level items into a single string, maintaining the original formatting and delimiters. The formatting and delimiters should be preserved as they appear in the input string.
Return the resulting grouped list items as a Python list where each element represents a top-level item and its associated lower-level items. Each element in the list should be a string containing the concatenated top-level item and its lower-level items.
Exclude any text that appears before the first top-level item and after the last top-level item from the output. Only the content between the first and last top-level items should be included in the output list.

Goal

The goal is to create a Python method that takes an unindented string as input, identifies the top-level items and their associated lower-level items based on the specified criteria, concatenates them into a single string for each top-level item while maintaining the original formatting and delimiters, and returns the resulting grouped list items as a Python list. The output list should match the desired format, with each element representing a top-level item and its associated lower-level items.

Request

Please provide an explanation and guidance on how to create a Python method that can successfully achieve the goal outlined above. The explanation should include the steps involved, any necessary data structures or algorithms, and considerations for handling different scenarios and edge cases.

Additional Details

I have attempted to create a Python method to achieve the tasks outlined above, but my attempts have been unsuccessful. The methods I have tried do not produce the expected outputs for the given inputs.
To aid in testing and validating the solution, I have created and included numerous sample inputs and their corresponding expected outputs below. These test cases cover various scenarios and edge cases to ensure the robustness of the method.

Code Attempts:

Attempt 1:

def process_list_hierarchy(text):
    # Helper function to determine the indentation level
    def get_indentation_level(line):
        return len(line) - len(line.lstrip())

    # Helper function to parse the input text into a list of lines with their hierarchy levels
    def parse_hierarchy(text):
        lines = text.split('\n')
        hierarchy = []
        for line in lines:
            if line.strip():  # Ignore empty lines
                level = get_indentation_level(line)
                hierarchy.append((level, line.strip()))
        return hierarchy

    # Helper function to build a tree structure from the hierarchy levels
    def build_tree(hierarchy):
        tree = []
        stack = [(-1, tree)]  # Start with a dummy root level
        for level, content in hierarchy:
            # Find the correct parent level
            while stack and stack[-1][0] >= level:
                stack.pop()
            # Create a new node and add it to its parent's children
            node = {'content': content, 'children': []}
            stack[-1][1].append(node)
            stack.append((level, node['children']))
        return tree

    # Helper function to combine the tree into a single list
    def combine_tree(tree, combined_list=[], level=0):
        for node in tree:
            combined_list.append(('  ' * level) + node['content'])
            if node['children']:
                combine_tree(node['children'], combined_list, level + 1)
        return combined_list

    # Parse the input text into a hierarchy
    hierarchy = parse_hierarchy(text)
    # Build a tree structure from the hierarchy
    tree = build_tree(hierarchy)
    # Combine the tree into a single list while maintaining the hierarchy
    combined_list = combine_tree(tree)
    # Return the combined list as a string
    return '\n'.join(combined_list)

Attempt 2:

def organize_hierarchically(items):
    def get_level(item):
        match = re.match(r'^(\d+\.?|\-|\*)', item)
        return len(match.group()) if match else 0

    grouped_items = []
    for level, group in groupby(items, key=get_level):
        if level == 1:
            grouped_items.append('\n'.join(group))
        else:
            grouped_items[-1] += '\n' + '\n'.join(group)

    return grouped_items

Attempt 3:

from bs4 import BeautifulSoup
import nltk

def extract_sub_objectives(input_text):
    soup = BeautifulSoup(input_text, 'html.parser')
    text_content = soup.get_text()

    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text_content)

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []

    # Iterate through the sentences and extract sub-objectives
    current_sub_objective = ""
    for sentence in sentences:
        if sentence.startswith(("1.", "2.", "3.", "4.")):
            if current_sub_objective:
                sub_objectives.append(current_sub_objective)
                current_sub_objective = ""
            current_sub_objective += sentence + "\n"
        elif current_sub_objective:
            current_sub_objective += sentence + "\n"

    # Append the last sub-objective, if any
    if current_sub_objective:
        sub_objectives.append(current_sub_objective)

    return sub_objectives

Attempt 4:

def extract_sub_objectives(input_text, preserve_formatting=False):
    # Modified to strip both single and double quotes
    input_text = input_text.strip('\'"')
    messages = []
    messages.append("Debug: Starting to process the input text.")
    # Debug message to show the input text after stripping quotes
    messages.append(f"Debug: Input text after stripping quotes: '{input_text}'")

    # Define possible starting characters for new sub-objectives
    start_chars = [str(i) + '.' for i in range(1, 100)]  # Now includes up to two-digit numbering
    messages.append(f"Debug: Start characters defined: {start_chars}")

    # Define a broader range of continuation characters
    continuation_chars = ['-', '*', '+', '•', '>', '→', '—']  # Expanded list
    messages.append(f"Debug: Continuation characters defined: {continuation_chars}")

    # Replace escaped newline characters with actual newline characters
    input_text = input_text.replace('\\n', '\n')
    # Split the input text into lines
    lines = input_text.split('\n')
    messages.append(f"Debug: Input text split into lines: {lines}")

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []
    # Initialize an empty string to store the current sub-objective
    current_sub_objective = ''
    # Initialize a counter for the number of continuations in the current sub-objective
    continuation_count = 0

    # Function to determine if a line is a new sub-objective
    def is_new_sub_objective(line):
        # Strip away leading quotation marks and whitespace
        line = line.strip('\'"').strip()
        return any(line.startswith(start_char) for start_char in start_chars)

    # Function to determine if a line is a continuation
    def is_continuation(line, prev_line):
        if not prev_line:
            return False
        # Check if the line starts with an alphanumeric followed by a period or parenthesis
        if len(line) > 1 and line[0].isalnum() and (line[1] == '.' or line[1] == ')'):
            # Check if it follows the sequence of the previous line
            if line[0].isdigit() and prev_line[0].isdigit() and int(line[0]) == int(prev_line[0]) + 1:
                return False
            elif line[0].isalpha() and prev_line[0].isalpha() and ord(line[0].lower()) == ord(prev_line[0].lower()) + 1:
                return False
            else:
                return True
        # Add a condition to check for lower-case letters followed by a full stop
        if line[0].islower() and line[1] == '.':
            return True
        return any(line.startswith(continuation_char) for continuation_char in continuation_chars)

    # Iterate over each line
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ''
        # Check if the line is a new sub-objective
        if is_new_sub_objective(line):
            messages.append(f"Debug: Found a new sub-objective at line {i + 1}: '{line}'")
            # If we have a current sub-objective, check the continuation count
            if current_sub_objective:
                if continuation_count < 2:
                    messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
                    for message in messages:
                        print(message)
                    return None
                # Check the preserve_formatting parameter before adding
                sub_objectives.append(
                    current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
                messages.append(f"Debug: Added a sub-objective to the list. Current count: {len(sub_objectives)}.")
            # Reset the current sub-objective to the new one and reset the continuation count
            current_sub_objective = line
            continuation_count = 0
        # Check if the line is a continuation
        elif is_continuation(line, prev_line):
            messages.append(f"Debug: Line {i + 1} is a continuation of the previous line: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
            # Increment the continuation count
            continuation_count += 1
        # Handle lines that are part of the current sub-objective but don't start with a continuation character
        elif current_sub_objective:
            messages.append(f"Debug: Line {i + 1} is part of the current sub-objective: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()

    # If we have a current sub-objective, check the continuation count before adding it to the list
    if current_sub_objective:
        if continuation_count < 2:
            messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
            for message in messages:
                print(message)
            return None
        # Check the preserve_formatting parameter before adding
        sub_objectives.append(current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
        messages.append(f"Debug: Added the final sub-objective to the list. Final count: {len(sub_objectives)}.")

    # Print the debug messages if no sub-objectives are found
    if not sub_objectives:
        for message in messages:
            print(message)

    return sub_objectives

Sample Data (Inputs and associated Outputs):

onePythonUser · March 27, 2024, 3:05am

I don’t believe this is a forum where you post your homework and the volunteers will do the homework for you. If you have a specific question regarding the Python language that you are not understanding, and you need clarification, the volunteers on this forum can help you with understanding the stated issue.

jeff5 · March 27, 2024, 8:41am

To be fair, @anni.eabe has posted some attempts at a solution (and not screenshots :)). And the code displays some understanding of data structures and parsing by lines, if it’s not just from a GPT.

However, this is quite a large question, and a complete answer would really mean doing the homework, which isn’t helpful in the long run. And I certainly don’t have an obviously right answer up my sleeve. Or the time.

Can we give any pointers about which of these attempts is a dead end or misunderstanding? Or some other way to get started?

I didn’t understand the inputs and outputs. They would be easier to read if formatted as Python. And short examples testing one thing, with minimal text “xxx”, would be easier to work with.

A few uncoordinated thoughts of my own follow.

Attempt 1

It is right to split the text into lines. I think a line (or paragraph?) is the unit to work with here. Indenting is not one of the criteria, so I think relying on it may not satisfy the brief. The markers at the beginning of a line (bullets, list numbers) should steer the parsing.

You’re using an explicit stack, to deal with the recursion, but often actual recursion in the code your write is a better solution. Do you understand recursive descent parsing?

Attempt 2

Regular expressions are a good way to identify “what kind of line am I looking at currently”. It is not powerful enough on its own to solve problems involving recursion. But are we really asked to deal with recursive structure. Looking at the brief, I think only to one level, and after that it is flat. So I think you may be on the right path.

I think itertools.groupby may be almost but not quite what you want here. Try doing it with your own code.

Attempt 3

I don’t know BeautifulSoup, but isn’t it for processing HTML? And it isn’t HTML we’re processing. Looking at how it does it may be a good exercise.

Attempt 4

Hard to follow this, but a lot of it is to do with matching characters, and I think regexes are a better bet.

c-rob · March 27, 2024, 9:31am

Programmer here. While I’m still new to Python I’m learning as I go.

Your sample input data is not conducive to being cut and paste into a plain text file so you are unlikely to get much help here. If the input is an HTML file post the HTML file in full with HTML codes, preformatted as code/plain text that we can copy and paste.
Reread your class notes to see what you are not understanding. Research parts you do not understand by searching for web pages that discuss similar problems and look at how they solved that.
Break the problem down into smaller steps. Start with the first step where you get stuck. Be calm and approach that step in different ways. Brainstorm, and try different things. If you can’t figure it out ask classmates or on this form about a specific step.

Someone mentioned regular expressions, aka “regex”. Use online tools to test regex on your HTML file to see if the regex works for a variety of cases. Here are some examples.

Debuggex. Test different types of regex like Javascript, Python, PCRE (Perl). https://www.debuggex.com/
Regex101. (My favorite.) Build and test regex with flavors like PCRE, Python, Golang, Java 8, .NET, etc. Find mode only, no replace mode. https://regex101.com
Regexr. https://regexr.com. Supports Javascript and PCRE. But has a replace mode. Have an account and save your tests.
Regex Kit. Test various types of regex like PCRE, Javascript, Pythan, etc. Regex tester - Best online regular expressions testing tool
Regexpal. Has login. https://www.regexpal.com/
Regextester. Test different flavors of regular expressions like PCRE. Has login account. Text won’t fit to browser window. https://www.regextester.com/ This might be the same: https://regextester.github.io/

anni.eabe · March 27, 2024, 11:10am

Paul, I appreciate your concern, but I want to clarify that this is not a homework assignment. This is a real-world problem I’m working on for a personal project. As you can see from my multiple code attempts, I have invested significant time and effort into trying to solve this issue on my own.

I’ve explored various approaches, including parsing the input text into a hierarchy, building tree structures, and using regular expressions to identify patterns. However, despite my best efforts, none of these methods have been able to successfully handle the complex input format and produce the desired output.

To ensure the robustness of my solution, I’ve also created an extensive set of test cases covering different scenarios and edge cases. Unfortunately, my current implementations have failed to pass these test cases. That’s why I’m reaching out to the community for guidance and insights on how to approach this problem effectively. I’m open to any suggestions or alternative perspectives that could help me overcome the challenges I’m facing.

FelixLeg · March 27, 2024, 11:16am

If I may give some advice to You @anni.eabe, could you post your test data in an web service like pastebin? The test data is quite long and complicated, and if one wants to help you, he/she has to be able to do the same tests like you are doing.

anni.eabe · March 27, 2024, 11:33am

Thank you for taking the time to review my code attempts and provide your thoughts. I really appreciate your insights and suggestions.

To clarify, this is not a homework assignment, but rather a personal project that I have been working on extensively. I apologize if my initial post was unclear in that regard. I have invested a significant amount of time trying to solve this problem on my own, as evidenced by the multiple code attempts I shared. However, despite my efforts, I have been unable to arrive at a solution that successfully passes all the test cases I have created.

While I value your general pointers and observations about my code, what I would find most helpful at this stage is a concrete suggestions that effectively addresses the issues I’m facing. The test cases I provided cover a range of scenarios and edge cases, and I’m hoping to find an approach that can handle them all robustly. If you or anyone else in the community could share a solution that tackles this problem head-on, I would be immensely grateful.

I understand this is a complex problem and a complete solution may require some effort. However, given the amount of time and energy I have already put into this, I would greatly appreciate any direct assistance in resolving the programmatic challenges I’m encountering.

anni.eabe · March 27, 2024, 11:37am

Thank you for your thoughtful response and suggestions, Chuck. I appreciate you taking the time to review my code attempts and provide guidance. Your point about the sample input data not being easily copy-pasteable is well taken. In the future, I will make sure to provide the input data in a more accessible format to facilitate better collaboration and assistance from the community.

While I value your advice to reread my class notes, research similar problems, and break down the task into smaller steps, I must admit that I have already exhausted those avenues. I have spent considerable time trying to understand the concepts and apply them to my specific problem, but I am still struggling to arrive at a working solution. This is why I have turned to this forum for more direct assistance.

I would be immensely grateful if you, or anyone else in the community, could propose a solution that effectively addresses the programmatic issues I am facing. As mentioned, I have created numerous test cases to validate the correctness of the solution, but my own attempts have consistently failed to pass them. A working code example that handles the hierarchical list parsing as described in the problem statement would be incredibly helpful in furthering my understanding and overcoming this challenge. Thank you in advance for any additional support you can provide.

anni.eabe · March 27, 2024, 11:38am

Sure happy too.

FelixLeg · March 27, 2024, 12:20pm

OK, my first question is: Does " or ' in your Inputs are a part of a source data? I mean, whatever real source data for your program will be given to, does it also contains quotes?

Besides I have got a kind of bad news: parsing a text with hierarchy is a damn complicated task, and probably the source code of your program will be like 1/6 of the test data length O_O. However, I once made a Markdown parser so I’m at least familiar with what things may be necessary to solve your problem

anni.eabe · March 27, 2024, 12:48pm

If you are referring to the symbols surrounding each input overall, such as an apostrophe followed by a speech-mark, then that is how the inputs can be structured. Alternatively, the inputs can simply be contained within just speech-marks.

If you are instead referring to how elements are separated in the expected output, then elements are separated by a closing apostrophe (or speech-mark) followed by a comma, and the next element in that list begins with an apostrophe (or speech-mark) followed by the content. This problem is indeed very difficult to solve and has taken many hours just to get to this stage. Any help or guidance you can offer would be greatly appreciated.

hansgeunsmeyer · March 27, 2024, 12:51pm

I took a quick look at your paste bin and believe that with your current approach you will never get satisfactory outputs for all possible inputs. Unstructured input text is just too fluid and slippery. When dealing with written natural language, even if you’re looking for bits and pieces that may seem a bit structured, it’s practically impossible for a purely rule-based system to generate good-enough output.

Have you looked into using an LLM to first identify the structured elements in the text?

Other than that, given that you already put quite a bit of effort into this, how about first trying to solve a simpler problem – finding a reasonable working solution for text that is already pretty structured - for instance markdown text, as @FelixLeg suggested?

If you haven’t worked with LLMs before than it might be worthwhile to have a look as SpaCy parsers - Spacy comes with a ton of good tools to help parsing natural language. One high-level approach might be to break your original problem into two stages: natural language text —> intermediate semi-structured representation → final output. Spacy could help in the first stage. In the second stage you could perhaps use more strictly rule-based methods or simpler parsers (have a look as David Beazley’s Sly which is a parser-generator tool, probably more suitable for this problem than trying to use regexes).

FelixLeg · March 27, 2024, 1:23pm

I don’t have got such a bad feeling as Hans. If I “push” your test data through Pandoc and tell it it is a Markdown source (which is not, but close enough) it produces a quite satisfactory output (sorry for a screenshot people ):

Of course, I think we have to have got some presumptions about the source data. If we can tell that:

every item in a list begins a line of text (e.g. there are not list items “hidden” in the middle of a line)
a list doesn’t contain mixed type items (like, “2. xxx…” after “a. xxx…”)
then I can write one program that does the thing right now (just give me some time )

franklinvp · March 27, 2024, 1:27pm

Speak for yourself. Some might be willing to solve homework, specially because some of us know that seeing solutions by others is part of (actually the first stage of) learning how to do it yourself.

anni.eabe · March 27, 2024, 2:20pm

Hans,

I appreciate your input, but I respectfully disagree with your suggestion to use an LLM for this problem. While an LLM might be able to process the input and generate the desired output in some cases, it’s not a reliable or practical solution for several reasons:

Inconsistency: LLMs can produce varying outputs for the same input, so there’s no guarantee that the desired output will be generated consistently. This makes them unsuitable for a real-world solution where reliability is crucial.
Extraneous content: Even when an LLM does generate the desired output, it often includes additional alphanumeric text and symbols that are not part of the expected result. Dealing with this extra content would require additional processing and filtering, adding unnecessary complexity to the solution.
Verification overhead: Using an LLM would still require the development of the Python techniques I’m working on to verify the correctness of the LLM’s output. This defeats the purpose of using an LLM in the first place, as it would increase the workload rather than simplify the problem.
Cost: Employing an LLM for this task would incur additional costs associated with LLM usage, which is not justified given the drawbacks mentioned above.

While I understand your suggestion to tackle a simpler problem first, the issue I’ve described is the one that requires my immediate attention. Parsing structured data is a task I’m already capable of handling, so the focus of this thread is specifically on processing unindented string input, as evidenced by the numerous examples of inputs and their corresponding outputs that I’ve provided.

Regarding your recommendation to use a natural language approach like Spacy in conjunction with rule-based methods or simpler parsers, I find the idea intriguing. However, as you can see from my code attempts, I have not been successful in achieving the desired results. If you could provide a pragmatic example demonstrating how inputs like the ones I’ve shared can be processed to produce the expected output as described in the problem statement, I would greatly appreciate it.

hansgeunsmeyer · March 27, 2024, 2:33pm

I agree - the points you make about LLMs are all valid - but that’s exactly why I thought a two-stage approach might work. Like first doing a greedy, rough parse to map the input to a simpler, more structured input (or to an input annotated with certain features) and then applying the actual parse to generate the final output.

anni.eabe · March 27, 2024, 2:47pm

Hi FelixLeg,

Thank you for your response and the effort you’ve put into finding a solution. I appreciate your optimism and the approach you’ve taken with Pandoc. However, I’d like to address a couple of points in your post.

Firstly, regarding the presumption that “a list doesn’t contain mixed type items (like, ‘2. xxx…’ after ‘a. xxx…’)”, I’ve reviewed the test data provided in the original post and found that this assumption doesn’t hold true. There are instances where a numbered list item is followed by a lettered sub-list. Here are a couple of examples:

1. Enhance the AIs decision-making capabilities:
a. Develop a modular framework that integrates techniques such as alpha-beta pruning and Monte Carlo Tree Search (MCTS) to improve decision-making.
b. Manage resources and prioritize techniques based on context, problem size, and complexity, using a ranking system to evaluate algorithms, assigning weights, and calculating scores.

In this case, the numbered item “1.” is followed by lettered sub-list items “a.” and “b.”.

1. Develop a comprehensive system to model player-game interactions by defining their roles, goals, actions, and interactions within the board game domain.
2. Represent the state space with all possible game states, including setup and gameplay, while utilizing dynamic, adaptive, and constrained scenarios for assessment.

Here, the list starts with a numbered item “1.” and is followed by another numbered item “2.” without any lettered sub-list in between.

These examples demonstrate that the lists in the provided test data do contain mixed type items, where a numbered list item can be followed by a lettered sub-list or another numbered item.

Secondly, I noticed that the output generated by pushing the test data through Pandoc includes some additional content that doesn’t align with the expected output format. For instance, the top paragraph of the Pandoc output contains text like:

Opening: To achieve our vision of creating a diverse and engaging collection of board games, we will focus on the following strategic components:
Structure: Each component is designed to meet specific gameplay requirements:
Terminology: Components, strategy, gameplay requirements, design. Each sub-objective should be a more detailed and focused component of its parent objective, contributing to achieving the parent goal.

This content appears before the start of the list items and should not be present in the final output based on the examples provided.

FelixLeg · March 27, 2024, 2:58pm

OK, I think I got something:

import re

NUMBERING_M = re.compile(r"^(\d+)(\.|\))\s*")
LETTERING_M = re.compile(r"^[A-Z](\.|\))\s*")
BULLETS_M = re.compile(r"^[-*]\s*")
SYMBOLS_M = re.compile(r"^[#>]\s*")
ALPHA_M = re.compile(r"^[a-z](\.|\))\s*")
SUB_NUM_M = re.compile(r"^\d+(\.|\))(\d+(\.|\))(\d+(\.|\))(\d+(\.|\)))?)?)?\s*") #four levels deep max
ROMAN_M = re.compile(r"^m?m?m?(cm|cd|d?c?c?c?)(xc|xl|l?x?x?x?)(ix|iv|v?i?i?i?)(\.|\))\s*")

TopLevelMatchers = [NUMBERING_M, SUB_NUM_M, LETTERING_M, BULLETS_M, SYMBOLS_M]
IntraLevelMatchers = [BULLETS_M, ALPHA_M, ROMAN_M]

def parse_text(input_txt):
	#do some cleanups
	if input_txt.startswith('"') or input_txt.startswith("'"):
		input_txt = input_txt[1:]
	if input_txt.endswith('"') or input_txt.endswith("'"):
		input_txt = input_txt[:-1]
	
	itxt = input_txt.splitlines()
	result = list()
	current_top_level = None
	custom_intra_level = None
	
	for i in itxt:
		if current_top_level is None:
			for matcher in TopLevelMatchers:
				if matcher.match(i):
					current_top_level = matcher
					result.append(i)
					break
		else:
			if current_top_level.match(i):
				result.append(i) #new item at the top level
			else:
				match_found = False
				for matcher in IntraLevelMatchers:
					if matcher.match(i):
						result[-1] += "\n" + i
						match_found = True
						break
				if not match_found:
					#this is the end of itemized text, break
					break
	
	return result

My program passes all 37 test cases from Your list Annie except these special cases:

Input #2 - There are lone lines which doesn’t start with an delimiter, and I don’t know what to do with these
Input #10 - My program treads all “X.Y” items to belonging with the previous “X.” item, while in your test data these are treated like separate top-level items. I need a clarification for this case
Input #13 - There is a list at the beginning that my algorithm catches as a top-level list. To solve this case we should agree to if there are some hierarchy about which types of lists are more “equal” than the others
Input #21 - this one is too ambiguous for my algorithm as there are lone lines between items, and I’m afraid that this one is a work to an AI of sort (because my algorithm need to know which words means what)

4 out of 37 cases is not that bad I think I hope that my program will still be helpful for those rest 29 cases. Tell me what you think of it

anni.eabe · March 27, 2024, 10:53pm

Thank you for your efforts!

When I test your method against each test case, it shows that 24 out of 37 are successful (or 13 out of 37 are unsuccessful). I put together a small program (shown below) which uses your method and compares the output against each expected output. If you could let me know if I’m missing something, that would be greatly appreciated. As you correctly pointed out, missing only 4 out of 37 cases is pretty great.

Code without inputs and outputs:

from itertools import groupby, zip_longest
 
import nltk
from bs4 import BeautifulSoup
 
 
def process_list_hierarchy(text):
    # Helper function to determine the indentation level
    def get_indentation_level(line):
        return len(line) - len(line.lstrip())
 
    # Helper function to parse the input text into a list of lines with their hierarchy levels
    def parse_hierarchy(text):
        lines = text.split('\n')
        hierarchy = []
        for line in lines:
            if line.strip():  # Ignore empty lines
                level = get_indentation_level(line)
                hierarchy.append((level, line.strip()))
        return hierarchy
 
    # Helper function to build a tree structure from the hierarchy levels
    def build_tree(hierarchy):
        tree = []
        stack = [(-1, tree)]  # Start with a dummy root level
        for level, content in hierarchy:
            # Find the correct parent level
            while stack and stack[-1][0] >= level:
                stack.pop()
            # Create a new node and add it to its parent's children
            node = {'content': content, 'children': []}
            stack[-1][1].append(node)
            stack.append((level, node['children']))
        return tree
 
    # Helper function to combine the tree into a single list
    def combine_tree(tree, combined_list=[], level=0):
        for node in tree:
            combined_list.append(('  ' * level) + node['content'])
            if node['children']:
                combine_tree(node['children'], combined_list, level + 1)
        return combined_list
 
    # Parse the input text into a hierarchy
    hierarchy = parse_hierarchy(text)
    # Build a tree structure from the hierarchy
    tree = build_tree(hierarchy)
    # Combine the tree into a single list while maintaining the hierarchy
    combined_list = combine_tree(tree)
    # Return the combined list as a string
    return '\n'.join(combined_list)
 
def organize_hierarchically(items):
    def get_level(item):
        match = re.match(r'^(\d+\.?|\-|\*)', item)
        return len(match.group()) if match else 0
 
    grouped_items = []
    for level, group in groupby(items, key=get_level):
        if level == 1:
            grouped_items.append('\n'.join(group))
        else:
            grouped_items[-1] += '\n' + '\n'.join(group)
 
    return grouped_items
 
def extract_sub_objectives(input_text):
    soup = BeautifulSoup(input_text, 'html.parser')
    text_content = soup.get_text()
 
    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text_content)
 
    # Initialize an empty list to store the sub-objectives
    sub_objectives = []
 
    # Iterate through the sentences and extract sub-objectives
    current_sub_objective = ""
    for sentence in sentences:
        if sentence.startswith(("1.", "2.", "3.", "4.")):
            if current_sub_objective:
                sub_objectives.append(current_sub_objective)
                current_sub_objective = ""
            current_sub_objective += sentence + "\n"
        elif current_sub_objective:
            current_sub_objective += sentence + "\n"
 
    # Append the last sub-objective, if any
    if current_sub_objective:
        sub_objectives.append(current_sub_objective)
 
    return sub_objectives
 
def extract_sub_objectivesV2(input_text, preserve_formatting=False):
    # Modified to strip both single and double quotes
    input_text = input_text.strip('\'"')
    messages = []
    messages.append("Debug: Starting to process the input text.")
    # Debug message to show the input text after stripping quotes
    messages.append(f"Debug: Input text after stripping quotes: '{input_text}'")
 
    # Define possible starting characters for new sub-objectives
    start_chars = [str(i) + '.' for i in range(1, 100)]  # Now includes up to two-digit numbering
    messages.append(f"Debug: Start characters defined: {start_chars}")
 
    # Define a broader range of continuation characters
    continuation_chars = ['-', '*', '+', '•', '>', '→', '—']  # Expanded list
    messages.append(f"Debug: Continuation characters defined: {continuation_chars}")
 
    # Replace escaped newline characters with actual newline characters
    input_text = input_text.replace('\\n', '\n')
    # Split the input text into lines
    lines = input_text.split('\n')
    messages.append(f"Debug: Input text split into lines: {lines}")
 
    # Initialize an empty list to store the sub-objectives
    sub_objectives = []
    # Initialize an empty string to store the current sub-objective
    current_sub_objective = ''
    # Initialize a counter for the number of continuations in the current sub-objective
    continuation_count = 0
 
    # Function to determine if a line is a new sub-objective
    def is_new_sub_objective(line):
        # Strip away leading quotation marks and whitespace
        line = line.strip('\'"').strip()
        return any(line.startswith(start_char) for start_char in start_chars)
 
    # Function to determine if a line is a continuation
    def is_continuation(line, prev_line):
        if not prev_line:
            return False
        # Check if the line starts with an alphanumeric followed by a period or parenthesis
        if len(line) > 1 and line[0].isalnum() and (line[1] == '.' or line[1] == ')'):
            # Check if it follows the sequence of the previous line
            if line[0].isdigit() and prev_line[0].isdigit() and int(line[0]) == int(prev_line[0]) + 1:
                return False
            elif line[0].isalpha() and prev_line[0].isalpha() and ord(line[0].lower()) == ord(prev_line[0].lower()) + 1:
                return False
            else:
                return True
        # Add a condition to check for lower-case letters followed by a full stop
        if line[0].islower() and line[1] == '.':
            return True
        return any(line.startswith(continuation_char) for continuation_char in continuation_chars)
 
    # Iterate over each line
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ''
        # Check if the line is a new sub-objective
        if is_new_sub_objective(line):
            messages.append(f"Debug: Found a new sub-objective at line {i + 1}: '{line}'")
            # If we have a current sub-objective, check the continuation count
            if current_sub_objective:
                if continuation_count < 2:
                    messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
                    for message in messages:
                        print(message)
                    return None
                # Check the preserve_formatting parameter before adding
                sub_objectives.append(
                    current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
                messages.append(f"Debug: Added a sub-objective to the list. Current count: {len(sub_objectives)}.")
            # Reset the current sub-objective to the new one and reset the continuation count
            current_sub_objective = line
            continuation_count = 0
        # Check if the line is a continuation
        elif is_continuation(line, prev_line):
            messages.append(f"Debug: Line {i + 1} is a continuation of the previous line: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
            # Increment the continuation count
            continuation_count += 1
        # Handle lines that are part of the current sub-objective but don't start with a continuation character
        elif current_sub_objective:
            messages.append(f"Debug: Line {i + 1} is part of the current sub-objective: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
 
    # If we have a current sub-objective, check the continuation count before adding it to the list
    if current_sub_objective:
        if continuation_count < 2:
            messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
            for message in messages:
                print(message)
            return None
        # Check the preserve_formatting parameter before adding
        sub_objectives.append(current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
        messages.append(f"Debug: Added the final sub-objective to the list. Final count: {len(sub_objectives)}.")
 
    # Print the debug messages if no sub-objectives are found
    if not sub_objectives:
        for message in messages:
            print(message)
 
    return sub_objectives
 
 
 
import re
 
NUMBERING_M = re.compile(r"^(\d+)(\.|\))\s*")
LETTERING_M = re.compile(r"^[A-Z](\.|\))\s*")
BULLETS_M = re.compile(r"^[-*]\s*")
SYMBOLS_M = re.compile(r"^[#>]\s*")
ALPHA_M = re.compile(r"^[a-z](\.|\))\s*")
SUB_NUM_M = re.compile(r"^\d+(\.|\))(\d+(\.|\))(\d+(\.|\))(\d+(\.|\)))?)?)?\s*")  # four levels deep max
ROMAN_M = re.compile(r"^m?m?m?(cm|cd|d?c?c?c?)(xc|xl|l?x?x?x?)(ix|iv|v?i?i?i?)(\.|\))\s*")
 
TopLevelMatchers = [NUMBERING_M, SUB_NUM_M, LETTERING_M, BULLETS_M, SYMBOLS_M]
IntraLevelMatchers = [BULLETS_M, ALPHA_M, ROMAN_M]
 
 
def parse_text(input_txt):
    # do some cleanups
    if input_txt.startswith('"') or input_txt.startswith("'"):
        input_txt = input_txt[1:]
    if input_txt.endswith('"') or input_txt.endswith("'"):
        input_txt = input_txt[:-1]
 
    itxt = input_txt.splitlines()
    result = list()
    current_top_level = None
    custom_intra_level = None
 
    for i in itxt:
        if current_top_level is None:
            for matcher in TopLevelMatchers:
                if matcher.match(i):
                    current_top_level = matcher
                    result.append(i)
                    break
        else:
            if current_top_level.match(i):
                result.append(i)  # new item at the top level
            else:
                match_found = False
                for matcher in IntraLevelMatchers:
                    if matcher.match(i):
                        result[-1] += "\n" + i
                        match_found = True
                        break
                if not match_found:
                    # this is the end of itemized text, break
                    break
 
    return result
 
 
class ListHierarchyProcessor:
 
    def __init__(self, method='process_list_hierarchy'):
        self.method = method
        self.actual_outputs = []
        self.perfect_match_count = 0  # Instance variable to track the number of perfect matches
        self.inputs = [
            ...
        ]
 
        self.expected_outputs = [
            ...
           ]
 
    def process_inputs(self):
        for i, input_text in enumerate(self.inputs):
            print(f"========== Processing Input {i + 1} using method '{self.method}' ==========")
            print()
 
            # Existing code to process the input based on the method
            if self.method == 'process_list_hierarchy':
                output = process_list_hierarchy(input_text).split('\n')
            elif self.method == 'organize_hierarchically':
                output = organize_hierarchically(input_text.split('\n'))
            elif self.method == 'extract_sub_objectives':
                output = extract_sub_objectives(input_text)
            elif self.method == 'extract_sub_objectivesV2':
                output = extract_sub_objectivesV2(input_text)
            elif self.method == 'parse_text':
                output = parse_text(input_text)
            else:
                raise ValueError(f"Invalid method: {self.method}")
 
            self.actual_outputs.append(output)
 
            print("Output:")
            print("-------")
            print(output)
            print()
 
            print("Expected Output:")
            print("----------------")
            print(self.expected_outputs[i])
            print()
 
            # Calculate similarity percentage
            similarity_percentage = self.calculate_similarity_percentage(output, self.expected_outputs[i])
            print(f"Similarity Percentage: {similarity_percentage:.2f}%")
            print()
 
            # Increment perfect_match_count if the output matches the expected output perfectly
            if similarity_percentage == 100:
                self.perfect_match_count += 1
 
            # Check if the output matches the expected output
            if len(output) != len(self.expected_outputs[i]):
                print("Result: Output does not match the expected output.")
                print("Differences:")
                for j, (output_line, expected_line) in enumerate(zip_longest(output, self.expected_outputs[i], fillvalue="")):
                    if output_line != expected_line:
                        print(f"Line {j + 1}:")
                        print(f"Output:   {output_line or 'None'}")
                        print(f"Expected: {expected_line or 'None'}")
            else:
                print("Result: Output matches the expected output.")
 
            print("=" * 70)
            print()
 
        # Print the total count of perfect matches after processing all inputs
        print(f"Total Perfect Matches: {self.perfect_match_count}")
 
    def calculate_similarity_percentage(self, actual_output, expected_output):
        total_lines = max(len(actual_output), len(expected_output))
        if total_lines == 0:
            return 100
        matching_lines = sum(1 for actual_line, expected_line in zip_longest(actual_output, expected_output, fillvalue="") if actual_line == expected_line)
        similarity_percentage = (matching_lines / total_lines) * 100
        return similarity_percentage
 
 
processor = ListHierarchyProcessor(method='parse_text')
#processor = ListHierarchyProcessor(method='organize_hierarchically')
#processor = ListHierarchyProcessor(method='extract_sub_objectives')
#processor = ListHierarchyProcessor(method='extract_sub_objectivesV2')
processor.process_inputs()

Full Code including inputs and outputs:

Output from Testing (running method in program with all inputs and outputs):

c-rob · March 28, 2024, 10:45am

Have you tried various prompts for an AI to give you ideas?

I tried one for you. https://g.co/gemini/share/af30653df018

I used to process PDF documents, some were images that were OCRd, some were PDF documents that had actual text. I would extract the text, convert it to Markdown, then make an EPUB from Markdown. But processing bullets was really difficult due to:

Inconsistency of the OCR material that I got.
Bullet markings (like solid circle or dash) were sometimes not converted to an actual circle or dash.
Indentation of the bulleted material was random at times.
Any whitespace between the bullet mark and the text was random.

So consistency of the input would be important for your application. I ended up processing bullets into Markdown by hand.