Text cleaning for data analysis

Siddhu513 · May 4, 2023, 6:53am

Hi, I'm working on a data science project for factory quality notes.
I want to create a function that takes a string as input and returns a cleaned string.

Example: Q2-1 Brake return line rear joint loose LHS V-03

Expected output: Brake return line rear joint loose LHS

The function should also strip the leftover leading and trailing spaces after removing Q2-1 and V-03. That is, the function should be able to remove any text of the form [alphabet][alphabet][0-9], [alphabet][alphabet][-][0-9], or [alphabet][alphabet][space][0-9].

This function should be applied to an Excel file, so that it runs on all 15,000 rows.

Thanks in advance.

franklinvp · May 4, 2023, 11:49am

The substrings that you want to remove constitute a regular language. So, you can define a regular expression that matches those and only those strings. The module re, in particular the function re.sub can then go over your text replacing everything that matches your regular expression with the empty string ''.

The removal of trailing spaces can be done with str.strip.
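
As a rough sketch of the re.sub plus str.strip approach: the pattern below is an assumption, not something taken from the original post. It treats a code as a standalone token made of one or two uppercase letters, optional digits, an optional hyphen or space, and then digits, which covers both Q2-1 and V-03 from the example as well as the described forms. The names CODE_PATTERN and clean_note are made up for illustration.

```python
import re

# Assumed shape of a code token: 1-2 uppercase letters, optional digits,
# an optional hyphen or space, then one or more digits.
# (?<!\S) and (?!\S) require whitespace or a string edge on both sides,
# so only whole tokens are removed, never part of a longer word.
CODE_PATTERN = re.compile(r"(?<!\S)[A-Z]{1,2}\d*[- ]?\d+(?!\S)")


def clean_note(text):
    """Remove code tokens, then tidy up the whitespace they leave behind."""
    without_codes = CODE_PATTERN.sub("", text)
    # Collapse runs of whitespace left in the middle, then strip both ends.
    return re.sub(r"\s{2,}", " ", without_codes).strip()


print(clean_note("Q2-1 Brake return line rear joint loose LHS V-03"))
```

One caveat with the space variant: an uppercase abbreviation followed by a standalone number (say "LH 2") would also be removed, so it is worth checking a sample of the real notes before trusting the pattern.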

smontanaro · May 4, 2023, 11:57am

Came here to say almost this. With a quick read of the docs, OP can easily convert their not-quite-regular expression into a full-fledged one.

As for manipulating Excel spreadsheets, there are a couple packages from the same author which have changed name and organization over the years. I have xlrd and xlwt installed on my daily Python interpreter. I thought there was an xlutils package as well, but may be misremembering.
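
For the Excel side, one possible sketch uses pandas rather than the xlrd/xlwt packages mentioned above; that is a swap on my part, not something suggested in this thread. It assumes pandas is installed along with an engine such as openpyxl for .xlsx files. The function names, the regular expression, the file names, and the column name are all hypothetical.

```python
import re

# Assumed code-token pattern: 1-2 uppercase letters, optional digits,
# optional hyphen or space, then digits, as a whitespace-delimited token.
CODE_PATTERN = re.compile(r"(?<!\S)[A-Z]{1,2}\d*[- ]?\d+(?!\S)")


def clean_note(text):
    """Replace code tokens with a space, then normalise all whitespace."""
    return " ".join(CODE_PATTERN.sub(" ", str(text)).split())


def clean_workbook(src_path, dst_path, column):
    """Read a sheet, add a cleaned copy of `column`, and write a new file."""
    import pandas as pd  # third-party; imported here so clean_note works without it

    df = pd.read_excel(src_path)
    # Empty cells come back as NaN; turn them into empty strings first.
    df[column + "_clean"] = df[column].fillna("").astype(str).map(clean_note)
    df.to_excel(dst_path, index=False)
```

A call might look like clean_workbook("quality_notes.xlsx", "quality_notes_clean.xlsx", "Note"), where both file names and the "Note" column are placeholders. Writing the result to a new column in a new file keeps the original text around for spot checks, and 15,000 rows is small enough that a plain map over the column should be fine.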