I have some .txt file into tabular format and I want to convert it into .csv. I am attaching the screenshot of the data.
Please assist me with this how it is possible.
Thank you in advance.
I have some .txt file into tabular format and I want to convert it into .csv. I am attaching the screenshot of the data.
Please assist me with this how it is possible.
Thank you in advance.
Reading the data is easy, just use pandas
with any whitespace as delimiter:
import pandas as pd
table = pd.read_table("table.txt", sep="\s+", skiprows=2)
The column names are annoying to read because they both contain whitespace and are delimited by whitespace. Assuming that no column name contains more than a single sequential space character and all column names are delimited by at least two space characters, you can do this:
with open("table.txt" ,"r") as tablefile:
column_names = [n.strip() for n in table.readline().strip().split(" ")] # That's two spaces.
readline()
table = pd.read_table(tablefile, sep="\s+", names=column_names)
Surely, if the separator is a regex, you can use sep=r'\s\s+'
or sep=r'\s{2,}'
to specify 2 or more whitespace characters?
True. But pandas
will give you a warning if you do, because its C engine only supports the literal \s+
; other regexes use the slower Python engine.
Getting error
Error tokenizing data. C error: Expected 3 fields in line 6, saw 25
Gonna need to see the code you ran, along with a full traceback.
I just wrote this only:
import pandas as pd
table = pd.read_table(“table.txt”, sep=“\s+”)
Still need the traceback, and ideally a sample of the input data in text format.
S.No. Txn. Date Txn. Part. Txn. ID Inst. No. Txn. Type Txn. Amt. Value Date Curr. Cls. Bal. Cls. Sign Init Sol Id Txn. Rmks.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
337 18-05-2017 NEFT-AJAY AHUJA O40712866 C 325000 18-05-2017 INR -3107722.26 DR -N138170296320913
338 18-05-2017 NEFT-AJAY AHUJA O40712866 C 4280 18-05-2017 INR -3103442.26 DR -N138170296331445
339 18-05-2017 NEFT-KAMLESH O40799885 C 100000 18-05-2017 INR -3003442.26 DR -SBIN417138363649
340 23-05-2017 Loan Account Payments For : 00077025004805 O46009515 D 800000 23-05-2017 INR -3803442.26 DR
341 26-05-2017 NEFT-SANGEETA POPLI O50250633 C 104631 26-05-2017 INR -3698811.26 DR -SBIN217146595204
342 26-05-2017 RTGSSIDHARTH CHAUHANKARBH17146 O50309553 C 5300000 26-05-2017 INR 1601188.74 CR RTGSKARBH17146322212
343 26-05-2017 Loan Account Payments For : 00077025004805 O50433969 D 5300000 26-05-2017 INR -3698811.26 DR
344 15-06-2017 NEFT-SURYA KANT O64797056 C 67565 15-06-2017 INR -3631246.26 DR -N166170313419277
345 17-06-2017 NEFT-RAVI SHARMA O66494750 C 301526 17-06-2017 INR -3329720.26 DR -1232885648
I see, there’s also whitespace in the data. And the individual data elements are sometimes delimited by just a single space. That’s pretty annoying.
This must be preprocessed before being fed to pandas. In particular, the Txn. Date
and Txn. Part
columns must be separated by more whitespace. The Value Date
and Curr.
columns, too.
This should do the trick:
import re
from io import StringIO
import pandas as pd
pattern = r"""
(^.*\d{2}-\d{2}-\d{4}) # Capture everything until and including the first occurrence of DD-MM-YYYY.
(.*\d{2}-\d{2}-\d{4}) # Capture everything until and including the second occurrence of DD-MM-YYYY.
(.*$) # Capture anything that remains.
"""
repl = r"\1 \2 \3" # Separate the three capture groups by a single space.
with open("sample.txt", "r") as tablefile:
raw = tablefile.read()
preprocessed = re.sub(pattern, repl, raw, flags=re.M | re.X)
buffer = StringIO(preprocessed)
table = pd.read_table(buffer, sep="\s\s+", skiprows=2)
Edit: Made the regex verbose.
ParserError: Expected 1 fields in line 6, saw 10. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Can’t reproduce that error with the data sample you provided.
Sometimes it’s easier to just do it by hand, rather than trying to teach Pandas [*] to do it for you
This is a fixed-width formatted table. Note that means there’s no guarantee that there will be two or more spaces between fields. In fact, there could be no spaces between fields, depending on how it’s written.
If you don’t need this to be universal code, you can specify the widths by hand:
Sno = line[:13]
txdate = line[14:25]
...
or be a bit more automatic:
breaks = [10, 3, 25, 88, ...]
data = [line[start, end for start, end in zip(breaks[:-1]), breaks[1:]]]
Alternatively, if the Txn. Part.
field is the only one with spaces:
parts = line.split()
first_two = parts[:2]
TxnPart = ’ '.join(parts[2:-9])
rest = parts[-9:]
Note: that may break if there are empty fields, which it looks like there are
All code untested
[*] You’d think that Pandas might have a fixed-width field reader – has anyone looked – good feature request if not.
The header reoccurs throughout the file. You need to remove it in order to parse the data.
I would use three regular expressions in Notepad++ to convent this into a proper CSV file instead of using pandas:
Inst. No.
or Type