Seeking Python Libraries for Removing Extraneous Characters and Spaces in Text

I am developing a project that involves processing text data. My goal is to correct errors caused by unnecessary characters and spaces in text, and I’m looking for recommendations on Python libraries and tools that could help.

Extraneous spaces:

  • Correct: “We boug ht a new car yesterday.” to “We bought a new car yesterday.”
  • Correct: “Today was a ve ry goo d da y.” to “Today was a very good day.”
  • Correct: “Hel lo! Ho w are you do ing?” to “Hello! How are you doing?”

I have explored several existing solutions, but most of them were either too basic for my needs or demanded significant computational resources. It is also crucial for my project to process data internally to ensure privacy and security. I therefore need a tool that is easy to customize, can be integrated into an existing project without substantial additional hardware, and does not rely on external API calls.

What I expect from the solution:

  • Easy to customize and integrate.
  • Does not require significant computational resources.
  • Operates locally, without external API calls for data processing.

I would appreciate any suggestions on suitable Python libraries, tools, or open-source projects that can help solve the mentioned issues with extraneous characters and spaces, in line with these requirements.

You’re basically looking for a spellchecker. I’ve been using a variant of SymSpell for a somewhat similar problem, but the original implementation is in C#. There’s a version in Python, but it’s not optimized for speed or memory usage.
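For context, the core idea behind SymSpell is to index every dictionary word under all of its delete-variants, so a lookup only has to generate deletes of the input term instead of trying inserts and substitutions. The snippet below is a rough pure-Python sketch of that idea only, not the API of any SymSpell port. The function names (`deletes`, `build_index`, `candidates`) and the toy vocabulary are made up for illustration, and it skips the edit-distance verification and frequency ranking that a real implementation performs.

```python
from collections import defaultdict


def deletes(word, max_distance=1):
    """Return `word` plus every string obtained by deleting up to
    `max_distance` characters from it."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def build_index(vocab, max_distance=1):
    """Map each delete-variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in vocab:
        for key in deletes(word, max_distance):
            index[key].add(word)
    return index


def candidates(term, index, max_distance=1):
    """Collect dictionary words sharing a delete-variant with `term`.

    Note: this is a candidate set only. A full implementation would
    verify each candidate with a real edit-distance check.
    """
    found = set()
    for key in deletes(term, max_distance):
        found |= index.get(key, set())
    return found


# Hypothetical toy vocabulary, for illustration only.
index = build_index({"bought", "car", "very"})
print(candidates("bougt", index))
print(candidates("verry", index))
```

The index trades memory for lookup speed, which is why the memory footprint of a given port matters for a resource-constrained project.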

I ended up adapting a version written in Rust and adding Python bindings with PyO3, but in the process I specialized it for my own use case and moved away from the general spellchecking problem. My library isn’t currently public, but I hope to release it in the future.
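In the meantime, for the split-word examples in the question, a plain dictionary-driven merge might already get you part of the way. Here is a minimal sketch, assuming a word list is available locally: it walks the whitespace-separated fragments and uses dynamic programming to pick the grouping with the fewest unknown words, merging adjacent fragments only when the result is a known word. `VOCAB`, `merge_split_words` and `max_parts` are hypothetical names, and the tiny vocabulary only covers the three example sentences. A real word list would need frequency information to resolve ambiguous merges.

```python
import string

# Hypothetical toy vocabulary; a real project would load a full word list.
VOCAB = {
    "we", "bought", "a", "new", "car", "yesterday",
    "today", "was", "very", "good", "day",
    "hello", "how", "are", "you", "do", "doing",
}


def _is_known(fragment, vocab):
    # Ignore leading/trailing punctuation and case when checking the word.
    return fragment.strip(string.punctuation).lower() in vocab


def merge_split_words(text, vocab=VOCAB, max_parts=3):
    """Rejoin words broken by stray spaces, merging up to `max_parts`
    adjacent fragments when their concatenation is a known word."""
    tokens = text.split()
    # best[i] = (unknown_count, merge_count, words) for tokens[:i]
    best = [(0, 0, [])]
    for i in range(1, len(tokens) + 1):
        options = []
        for j in range(max(0, i - max_parts), i):
            unknown, merges, words = best[j]
            piece = "".join(tokens[j:i])
            known = _is_known(piece, vocab)
            if i - j == 1:
                # Keep a single fragment as-is; penalize it if unknown.
                options.append((unknown + (0 if known else 1), merges, words + [piece]))
            elif known:
                # Merge several fragments only into a known word.
                options.append((unknown, merges + 1, words + [piece]))
        # Fewest unknown words first, then fewest merges.
        best.append(min(options, key=lambda o: (o[0], o[1])))
    return " ".join(best[-1][2])


print(merge_split_words("We boug ht a new car yesterday."))
print(merge_split_words("Hel lo! Ho w are you do ing?"))
```

This runs locally with no external dependencies and is linear in the number of fragments for a fixed `max_parts`, but it only handles spaces inserted inside words. Missing spaces or misspellings would still need a spellchecker-style lookup.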
