Hello Pythoners, I have been trying to correct grammar errors in an HTML string, but the grammar corrector I use removes the HTML tags and I lose all the formatting.
I want to extract the text, correct the grammar, and finally add the HTML tags back in the same positions they were in.
What I have tried so far (posting the full code in case you want to run it):
from bs4 import BeautifulSoup
from happytransformer import HappyTextToText, TTSettings

happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")
settings = TTSettings(num_beams=5, min_length=1)

def fix_html_tags(text):
    soup = BeautifulSoup(text, "lxml")
    ignore_tags = ['html', 'body']
    result = ''
    for tag in soup.find_all():
        grammar = happy_tt.generate_text("grammar: " + tag.text, args=settings)
        fixed_grammar = grammar.text
        already_added = fixed_grammar in result
        if tag.name not in ignore_tags and not already_added:
            fixed_grammar = f'<{tag.name}>' + fixed_grammar + f'</{tag.name}>'
        if not already_added:
            result += fixed_grammar
    return result

text = """Gmail is a free email service provided by Google. ... Gmail.<table><tr><th>A screenshot of a Gmail inbox and compose box</th></tr><tr><td>Content license</td><td>Proprietary</td></tr><tr><td>Written in</td><td>Java, C++ (back-end), JavaScript (UI)</td></tr></table>"""
print('original text: ', text)
print('*'*50)
print('grammar corrected text: ', fix_html_tags(text))
Original HTML
Gmail is a free email service provided by Google. ... Gmail.<table><tr><th>A screenshot of a Gmail inbox and compose box</th></tr><tr><td>Content license</td><td>Proprietary</td></tr><tr><td>Written in</td><td>Java, C++ (back-end), JavaScript (UI)</td></tr></table>
Result HTML (after fixing grammar using the posted function)
Gmail is a free email service provided by Google. ... Gmail.A screenshot of a Gmail inbox and compose boxContent licenseProprietaryWritten in Java, C++ (back-end),<table>A screenshot of a Gmail inbox and compose boxContent licenseProprietaryWritten in Java, C++ (back-end), JavaScript (UI).</table><tr>A screenshot of a Gmail inbox and compose box.</tr><tr>Content licenseProprietary.</tr><td>Content license is granted.</td><td>Proprietary property.</td><td>Written in English.</td>
It is not working as expected: some tags are getting lost, and some text is duplicated (I'm not sure why). Any suggestions or help would be appreciated.
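A likely cause of the duplication: soup.find_all() returns every tag, including nested ones, and tag.text includes the text of all descendants, so the same cell text is sent to the corrector once for each enclosing table, tr and td. One possible direction is to leave the markup untouched and correct only the individual text nodes. The sketch below does this with the standard library's html.parser instead of BeautifulSoup, so it runs without extra dependencies. The names TextNodeRewriter and fix_text_nodes are made up for illustration, and correct is an assumed callback that takes a string and returns the corrected string.

```python
from html.parser import HTMLParser


class TextNodeRewriter(HTMLParser):
    """Re-emit the HTML as parsed, passing only text nodes through `correct`."""

    def __init__(self, correct):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.correct = correct  # hypothetical callback: str -> str
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the source
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as-is
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Correct only non-blank text; whitespace between tags is kept
        self.parts.append(self.correct(data) if data.strip() else data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def fix_text_nodes(html, correct):
    """Return `html` with every text node replaced by correct(text)."""
    parser = TextNodeRewriter(correct)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    # str.upper stands in for the grammar model so the sketch runs on its own
    sample = "Hi there.<table><tr><td>Content license</td></tr></table>"
    print(fix_text_nodes(sample, str.upper))
    # HI THERE.<table><tr><td>CONTENT LICENSE</td></tr></table>
```

With the model from the question, the callback could be lambda s: happy_tt.generate_text("grammar: " + s, args=settings).text. Two caveats: the contents of script and style elements would also be passed to the callback unless they are skipped, and very short fragments such as table cells may still be rewritten by the model (the posted output turns "Content license" into "Content license is granted."), so it may be worth skipping text nodes below some length.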