Extract text from HTML tags, correct it, and put the new text back inside the original HTML tags

Hello Pythoners, I have been trying to correct grammar errors in an HTML string, but the grammar corrector I use removes the HTML tags, so I lose all the formatting.

I want to extract the text, correct the grammar, and finally add the HTML tags back in the same positions they were in.

What I have tried so far (posting the full code in case you want to run it):

from bs4 import BeautifulSoup
from happytransformer import HappyTextToText, TTSettings


happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")
settings = TTSettings(num_beams=5, min_length=1)


def fix_html_tags(text):
    soup = BeautifulSoup(text, "lxml")
    ignore_tags = ['html', 'body']
    result = ''
    for tag in soup.find_all():
        grammar = happy_tt.generate_text("grammar: " + tag.text, args=settings)
        fixed_grammar = grammar.text

        already_added = fixed_grammar in result
        if tag.name not in ignore_tags and not already_added:
            fixed_grammar = f'<{tag.name}>' + fixed_grammar + f'</{tag.name}>'

        if not already_added:
            result += fixed_grammar
    return result


text = """Gmail is a free email service provided by Google. ... Gmail.<table><tr><th>A screenshot of a Gmail inbox and compose box</th></tr><tr><td>Content license</td><td>Proprietary</td></tr><tr><td>Written in</td><td>Java, C++ (back-end), JavaScript (UI)</td></tr></table>"""
print('original text: ', text)
print('*'*50)
print('grammar corrected text: ', fix_html_tags(text))

Original HTML

Gmail is a free email service provided by Google. ... Gmail.<table><tr><th>A screenshot of a Gmail inbox and compose box</th></tr><tr><td>Content license</td><td>Proprietary</td></tr><tr><td>Written in</td><td>Java, C++ (back-end), JavaScript (UI)</td></tr></table>

Result HTML (after fixing grammar using the posted function)

Gmail is a free email service provided by Google. ... Gmail.A screenshot of a Gmail inbox and compose boxContent licenseProprietaryWritten in Java, C++ (back-end),<table>A screenshot of a Gmail inbox and compose boxContent licenseProprietaryWritten in Java, C++ (back-end), JavaScript (UI).</table><tr>A screenshot of a Gmail inbox and compose box.</tr><tr>Content licenseProprietary.</tr><td>Content license is granted.</td><td>Proprietary property.</td><td>Written in English.</td>

It is not working as expected: some tags are getting lost, and some text is duplicated (I am not sure why). Any suggestions or help would be appreciated :grinning:
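
A likely cause of the duplication: `soup.find_all()` returns every tag, including nested ones, and `tag.text` is the concatenated text of all of a tag's descendants. The same text is therefore corrected once for `table`, again for each `tr`, and again for each `td`, and the `in result` check then drops some of the pieces. One way around this is to leave the markup alone and send only the individual text runs to the corrector. Below is a minimal sketch of that idea using only the standard library's `html.parser`. The names `TextRewriter`, `rewrite_text`, and `fix` are hypothetical, `fix` stands for any function that takes a string and returns the corrected string, and the sketch does not special-case `<script>` or `<style>` content. With BeautifulSoup the same idea would be to loop over the text nodes (`soup.find_all(string=True)`) and call `replace_with` on each one.

```python
from html import escape
from html.parser import HTMLParser


class TextRewriter(HTMLParser):
    """Re-emit tags as written and pass only the text between them to `fix`."""

    def __init__(self, fix):
        # convert_charrefs=True delivers text already unescaped (&amp; -> &),
        # so the corrector sees plain text; it is escaped again on output.
        super().__init__(convert_charrefs=True)
        self.fix = fix      # hypothetical callable: str -> corrected str
        self.parts = []     # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        # Original source of the tag, attributes included.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            # Whitespace-only runs carry layout, not grammar; keep them as is.
            self.parts.append(data)
            return
        # Keep the surrounding whitespace in case the corrector trims it.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        fixed = self.fix(stripped)
        self.parts.append(escape(lead + fixed + trail, quote=False))


def rewrite_text(html, fix):
    """Return `html` with every text run replaced by fix(text run)."""
    parser = TextRewriter(fix)
    parser.feed(html)
    parser.close()  # flush any text that follows the last tag
    return "".join(parser.parts)
```

With the objects from the question, the call might look like `rewrite_text(text, lambda s: happy_tt.generate_text("grammar: " + s, args=settings).text)`. Each text run is corrected in isolation, so very short fragments such as table cells may still come back changed in unwanted ways (as with "Written in English." in the result above). Skipping runs below some length before calling the model could be worth trying.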