How to import all files in a folder as pandas dataframe having same name as the file name?

dub2s · August 15, 2023, 8:44pm

Basically, I have a folder that has files like:
“fileA.txt”
“fileB.txt”
“fileC.txt”

These are CSV files that I am importing as pandas dataframes using pd.read_csv(). I am doing:

fileA = pd.read_csv("fileA.txt")
fileB = pd.read_csv("fileB.txt")
fileC = pd.read_csv("fileC.txt")

I have more than 10 files which I want to import like this. Currently, I am writing this manually. Is there a way to automate this process?

rob42 · August 15, 2023, 9:21pm

One idea that comes to my mind: you could use a directory tree generator so that you get a list object that contains all of the file names that a source directory contains, then use that in a loop to do the pd.read_csv().

How that would be constructed, would depend on how each read operation is to be processed; that is to say: do you need every file to be held in memory, before any processing can be done, which is possible, if (potentially) resource heavy. Or can each file be read, then processed, before the next file is to be read?
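If each file can be processed on its own, one way to avoid holding everything in memory is a generator that reads one file at a time. This is only a minimal sketch: the function name iter_frames and the default "*.txt" pattern are assumptions for illustration, not something from the thread.

```python
from pathlib import Path

import pandas as pd


def iter_frames(folder=".", pattern="*.txt"):
    """Yield (name, DataFrame) pairs, reading one file at a time.

    iter_frames is a hypothetical helper name. Only the file that is
    currently being processed is loaded by this function.
    """
    for path in sorted(Path(folder).glob(pattern)):
        # path.stem is the file name without its extension, e.g. "fileA"
        yield path.stem, pd.read_csv(path)
```

A loop such as `for name, df in iter_frames("data"):` would then process each file before the next one is read.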

cameron · August 16, 2023, 12:35am

One idea that comes to my mind: you could use a directory tree generator so that you get a list object that contains all of the file names that a source directory contains, then use that in a loop to do the pd.read_csv().

If it’s a flat directory, os.listdir is simpler than os.walk. If you import the glob module, it has a glob function, e.g.:

 >>> import glob
 >>> glob.glob("*.md")
 ['README.md']

You might want to use the pattern *.txt.
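As a rough sketch of that suggestion, the pattern can be combined with a folder path, and the extension can be stripped with os.path.splitext. The helper names list_txt_files and stem are made up for this example.

```python
import glob
import os


def list_txt_files(folder="."):
    """Return the sorted paths of all .txt files directly inside folder.

    list_txt_files is a hypothetical helper name used for illustration.
    """
    pattern = os.path.join(folder, "*.txt")
    # glob.glob returns matches in arbitrary order, so sort for stable results
    return sorted(glob.glob(pattern))


def stem(path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]
```

With the files from the question in the current directory, `[stem(p) for p in list_txt_files()]` would give the names without the .txt extension.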

Cheers,
Cameron Simpson cs@cskk.id.au

dub2s · August 16, 2023, 3:22am

Hi @rob42 and @cameron

Thank you for your responses. I get that I can use, e.g., os.listdir or glob to get a list of file names.

From there, how do I create a pandas dataframe for each of these files, where the dataframe has the same name as the file (without the extension)?

From the string element "filenameA.txt" of that list, I could get "filenameA" using .split() or regex, but how do I define filenameA = pd.read_csv("filenameA.txt")?

Is it that I need to find out how to use a string to define a variable name? Sorry if I have worded my question wrongly.

kknechtel · August 16, 2023, 4:02am

This is possible, but you don’t actually want to do it.

Instead, make a dictionary that uses the string 'filenameA' as a key, and let the corresponding DataFrame be its value: dfs["filenameA"] = pd.read_csv("filenameA.txt").
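Put together with a file listing, the dictionary approach might look like the sketch below. The function name read_folder, the default "*.txt" pattern, and the pass-through keyword arguments are assumptions for illustration.

```python
from pathlib import Path

import pandas as pd


def read_folder(folder=".", pattern="*.txt", **read_csv_kwargs):
    """Read every matching file into a dict keyed by file name without extension.

    read_folder is a hypothetical helper name. Extra keyword arguments
    (for example index_col=0) are forwarded to pd.read_csv unchanged.
    """
    dfs = {}
    for path in sorted(Path(folder).glob(pattern)):
        # "fileA.txt" is stored under the key "fileA"
        dfs[path.stem] = pd.read_csv(path, **read_csv_kwargs)
    return dfs
```

After `dfs = read_folder("data")`, the frame that would have been the variable fileA is available as `dfs["fileA"]`, and `dfs.items()` gives every name together with its frame.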

rob42 · August 16, 2023, 7:10am

You’re welcome.

In addition to my other (still unanswered) question: why are you placing such a strict naming convention on the internal objects? Unless I’m missing something, this seems to be a needless obstacle and serves only to complicate the process.

dub2s · August 16, 2023, 11:59am

Hi @kknechtel and @rob42

@kknechtel Thank you for your answer. I was so close and yet so far. After importing the way I mentioned, I was eventually making a dictionary that has the “filename” as the key and the imported dataframe as the value. Damn! I was so close. But it makes sense to import directly into a dictionary with the “filename” as the key! Thank you very much…

@rob42 I am not sure if I am getting your question right, but I can explain the nature of these files. Basically, each file corresponds to a unique biological sample (the sample name is contained in the filename). It’s very simple data with 3 columns. When I import them, one of the columns becomes the index. The other relevant column essentially consists of numbers. I apply some mathematical operations to them before combining all the dataframes into one using pd.concat with axis=1, i.e. merging based on the index. (Before merging, I make the column name for each dataframe the same as the filename, so that after merging the sample information is contained in the column names.)
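With the dataframes held in a dictionary, that combining step could be sketched roughly as follows. The column name "value" and the function name combine_samples are hypothetical, since the real column names are not given here.

```python
import pandas as pd


def combine_samples(dfs, value_column="value"):
    """Place one column from each DataFrame side by side, aligned on the index.

    combine_samples and the default column name "value" are assumptions
    for illustration. dfs maps a sample name to its DataFrame.
    """
    # Passing a dict to pd.concat uses its keys as the new column names
    columns = {name: df[value_column] for name, df in dfs.items()}
    # The default join is an outer join, so index labels missing from a
    # sample show up as NaN in that sample's column
    return pd.concat(columns, axis=1)
```

Because the dictionary keys become the column names, the separate renaming step before merging is not needed in this version.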

rob42 · August 16, 2023, 12:52pm

Thank you for that detail. From what I understand, you wanted to have the internal objects named so that you could use those object names to track which object holds the data read from which csv file. As you can now see, you don’t need to do that, because you can use a dictionary object, as suggested by @kknechtel, to construct a kind of ‘database’ for that very purpose.

Given that, you now need only one internal object, which can be reused in a loop to read every csv file. I see that you have in fact used a .txt file extension rather than .csv, but no matter.

You can also use the glob module, as suggested by @cameron, if not the os.walk that I suggested, to construct the file names for the data reading process. Doing things like that will make your application much more flexible than having ‘hard coded’ file names and indices. For example, if you decide to rename your files to use the correct file extension, your application will not have to be re-coded; it will simply read the files, just the same.

The details of the data processing are somewhat more complex, but you seem to have a handle on that. If not, then possibly that also needs to be broken down into smaller steps.