Tabular data in .txt to .csv/xls

I have a .txt file in tabular format and I want to convert it into .csv. I am attaching a screenshot of the data.

Please assist me with how this can be done.

Thank you in advance.

Reading the data is easy; just use pandas with any whitespace as the delimiter:

import pandas as pd
# skiprows=2 skips the header and the dashed ruler line; sep splits on runs of whitespace.
table = pd.read_table("table.txt", sep=r"\s+", skiprows=2)

The column names are annoying to read because they both contain whitespace and are delimited by whitespace. Assuming that no column name contains more than a single sequential space character and all column names are delimited by at least two space characters, you can do this:

with open("table.txt" ,"r") as tablefile:
    column_names = [n.strip() for n in table.readline().strip().split("  ")]  # That's two spaces.
    readline()
    table = pd.read_table(tablefile, sep="\s+", names=column_names)
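Once the DataFrame is loaded, writing it out is a one-liner. A small follow-up, continuing from the snippet above; the output names table.csv and table.xlsx are just placeholders, and the Excel route needs openpyxl (or xlsxwriter) installed:

table.to_csv("table.csv", index=False)     # plain CSV
table.to_excel("table.xlsx", index=False)  # Excel; requires openpyxl or xlsxwriter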

Surely, if the separator is a regex, you can use sep=r'\s\s+' or sep=r'\s{2,}' to specify 2 or more whitespace characters?

True. But pandas will give you a warning if you do, because its C engine only supports the literal \s+; other regexes use the slower Python engine.
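If you do want a multi-space separator, you can opt into the Python engine explicitly and the warning goes away. A minimal sketch, mirroring the first snippet and assuming the same table.txt:

import pandas as pd

# Explicitly selecting the Python engine avoids the ParserWarning that the
# automatic fallback from the C engine would otherwise emit.
table = pd.read_table("table.txt", sep=r"\s{2,}", skiprows=2, engine="python")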

Getting this error:

Error tokenizing data. C error: Expected 3 fields in line 6, saw 25

Gonna need to see the code you ran, along with a full traceback.

I just wrote this:

import pandas as pd
table = pd.read_table("table.txt", sep="\s+")

Still need the traceback, and ideally a sample of the input data in text format.

S.No. Txn. Date  Txn. Part.                                                       Txn. ID    Inst. No.    Txn. Type  Txn. Amt.   Value Date Curr.  Cls. Bal.      Cls. Sign Init Sol Id                             Txn. Rmks.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       337   18-05-2017 NEFT-AJAY AHUJA                                                  O40712866               C          325000      18-05-2017 INR    -3107722.26    DR                                                -N138170296320913
       338   18-05-2017 NEFT-AJAY AHUJA                                                  O40712866               C          4280        18-05-2017 INR    -3103442.26    DR                                                -N138170296331445
       339   18-05-2017 NEFT-KAMLESH                                                     O40799885               C          100000      18-05-2017 INR    -3003442.26    DR                                                -SBIN417138363649
       340   23-05-2017 Loan Account Payments For : 00077025004805                       O46009515               D          800000      23-05-2017 INR    -3803442.26    DR
       341   26-05-2017 NEFT-SANGEETA POPLI                                              O50250633               C          104631      26-05-2017 INR    -3698811.26    DR                                                -SBIN217146595204
       342   26-05-2017 RTGSSIDHARTH CHAUHANKARBH17146                                   O50309553               C          5300000     26-05-2017 INR    1601188.74     CR                                                RTGSKARBH17146322212
       343   26-05-2017 Loan Account Payments For : 00077025004805                       O50433969               D          5300000     26-05-2017 INR    -3698811.26    DR
       344   15-06-2017 NEFT-SURYA KANT                                                  O64797056               C          67565       15-06-2017 INR    -3631246.26    DR                                                -N166170313419277
       345   17-06-2017 NEFT-RAVI SHARMA                                                 O66494750               C          301526      17-06-2017 INR    -3329720.26    DR                                                -1232885648

I see, there’s also whitespace in the data. And the individual data elements are sometimes delimited by just a single space. That’s pretty annoying.

This must be preprocessed before being fed to pandas. In particular, the Txn. Date and Txn. Part. columns must be separated by more whitespace, and the same goes for the Value Date and Curr. columns.

This should do the trick:

import re
from io import StringIO
import pandas as pd

pattern = r"""
            (^.*\d{2}-\d{2}-\d{4})  # Capture everything until and including the first occurrence of DD-MM-YYYY.
            (.*\d{2}-\d{2}-\d{4})  # Capture everything until and including the second occurrence of DD-MM-YYYY.
            (.*$)  # Capture anything that remains.
          """
repl = r"\1 \2 \3"  # Separate the three capture groups by a single space.

with open("sample.txt", "r") as tablefile:
    raw = tablefile.read()

preprocessed = re.sub(pattern, repl, raw, flags=re.M | re.X)
buffer = StringIO(preprocessed)
table = pd.read_table(buffer, sep=r"\s\s+", skiprows=2, engine="python")

Edit: Made the regex verbose.


ParserError: Expected 1 fields in line 6, saw 10. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

Can’t reproduce that error with the data sample you provided.

File link - https://filetransfer.io/data-package/UhrX4mjP#link

Sometimes it’s easier to just do it by hand, rather than trying to teach Pandas [*] to do it for you :slight_smile:

This is a fixed-width formatted table. Note that this means there’s no guarantee that there will be two or more spaces between fields. In fact, there could be no spaces between fields at all, depending on how it’s written.

If you don’t need this to be universal code, you can specify the widths by hand:

Sno = line[:13]
txdate = line[14:25]
...

or be a bit more automatic:

breaks = [0, 13, 25, 88, ...]  # character offsets where each field starts
data = [line[start:end] for start, end in zip(breaks[:-1], breaks[1:])]

Alternatively, if the Txn. Part. field is the only one with spaces:

parts = line.split()
first_two = parts[:2]
TxnPart = " ".join(parts[2:-9])
rest = parts[-9:]

Note: that may break if there are empty fields, which it looks like there are :frowning:

All code untested

[*] You’d think that Pandas might have a fixed-width field reader. Has anyone looked? It would be a good feature request if not.
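For what it’s worth, it does: pd.read_fwf is pandas’ fixed-width reader, and by default it tries to infer the column boundaries from the whitespace. A minimal, untested sketch, assuming the posted sample is saved as sample.txt with the header and the dashed ruler as its first two lines; the inferred boundaries may still need hand-tuning for the columns that run together:

import pandas as pd

# colspecs="infer" (the default) guesses the column boundaries from the first
# infer_nrows data rows; skiprows=[1] drops the dashed ruler line, which would
# otherwise defeat the inference because it contains no whitespace at all.
table = pd.read_fwf("sample.txt", colspecs="infer", skiprows=[1])
print(table.head())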

The header recurs throughout the file. You need to remove it before the data can be parsed.
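For example, something like this should work, assuming every repeated header line starts with "S.No." and the ruler lines are all dashes (both true of the posted sample):

from io import StringIO

with open("sample.txt", "r") as tablefile:
    header, *rest = tablefile.readlines()

# Keep the very first header, drop every later repetition and every dashed ruler line.
data_lines = [
    line for line in rest
    if not line.lstrip().startswith(("S.No.", "---"))
]
buffer = StringIO(header + "".join(data_lines))  # feed this buffer to the parser of your choice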

I would use three regular expressions in Notepad++ to convert this into a proper CSV file instead of using pandas:

  • " {3,}" to “,”
  • "(\d{4}) " to “$1,”
  • “^,” to nothing.
    Then you can deduplicate headers by hand, and remove unused columns like Inst. No. or Type
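If you would rather stay in Python than switch to Notepad++, the same three substitutions translate directly to re.sub (a sketch, assuming the input is sample.txt and writing to a hypothetical table.csv; Notepad++'s $1 becomes \1 here):

import re

with open("sample.txt", "r") as infile:
    text = infile.read()

# The same three substitutions, applied in order.
text = re.sub(r" {3,}", ",", text)          # runs of 3+ spaces become a comma
text = re.sub(r"(\d{4}) ", r"\1,", text)    # the single space after a 4-digit year becomes a comma
text = re.sub(r"^,", "", text, flags=re.M)  # drop the leading comma on each line

with open("table.csv", "w") as outfile:
    outfile.write(text)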