I’m reading in a lot of Excel files and merging them based on part of their filename. The merging itself works well, but I noticed that with a lot of files it takes ages, and the slowdown seems to come from the concatenation:
import glob
import os
from collections import defaultdict

import pandas as pd

filenames = glob.glob(os.path.join(path, "*.xls"))

# group the files by the part of the filename before '_202'
dd = defaultdict(list)
for fn in filenames:
    dd[fn.split('_202')[0]].append(fn)

# read every file in a group, then concatenate the group once
dict_df = {}
for k, v in dd.items():
    dict_df[k] = pd.concat(
        [pd.read_csv(fn,
                     parse_dates=['time'],
                     dayfirst=True,
                     skiprows=4,
                     sep="\t",
                     decimal=',').assign(Path=fn)
         for fn in v],
        ignore_index=True,
    )
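To figure out whether the time is actually going into pd.concat or into reading the files, it might help to time the two steps separately. A minimal sketch of what I mean, reusing the dd grouping from above (read_one is just a hypothetical helper wrapping the same read_csv call):

import time
import pandas as pd

def read_one(fn):
    # same read_csv call as above, wrapped so it can be timed on its own
    return pd.read_csv(fn, parse_dates=['time'], dayfirst=True,
                       skiprows=4, sep="\t", decimal=',').assign(Path=fn)

for k, v in dd.items():
    t0 = time.perf_counter()
    frames = [read_one(fn) for fn in v]            # reading the files
    t1 = time.perf_counter()
    merged = pd.concat(frames, ignore_index=True)  # one concat per group
    t2 = time.perf_counter()
    print(k, f"read: {t1 - t0:.2f}s, concat: {t2 - t1:.2f}s")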
Rewriting this to use appending instead didn’t work, or at least I don’t understand how to do it:
dict_df2 = []
for k, v in dd.items():
    dict_df2.append(pd.read_csv(fn,
                                parse_dates=['time'],
                                dayfirst=True,
                                skiprows=4,
                                sep="\t",
                                decimal=',').assign(Path=fn) for fn in v)
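(As far as I can tell, this appends a generator expression rather than DataFrames, and fn inside the loop is left over from the earlier loop.) A version of the append idea that at least runs might look like the following sketch, though it still ends with one pd.concat per group, so I’m not sure it would be any faster:

dict_df2 = {}
for k, v in dd.items():
    frames = []
    for fn in v:
        # append the DataFrame itself, not a generator
        frames.append(pd.read_csv(fn, parse_dates=['time'], dayfirst=True,
                                  skiprows=4, sep="\t",
                                  decimal=',').assign(Path=fn))
    dict_df2[k] = pd.concat(frames, ignore_index=True)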
Could you please help me out here?