How much memory does using "from xxx import *" use

  • My eventual config will use Microsoft Azure to make a web app. The virtual machine that will run this will likely have limited memory. We are charged for both storage and something I can’t remember but which is like runtime. And we are charged more for VMs that have a higher memory/RAM config.
  • I do not control the budget for this project.

I’m new to Python and have not yet completed a tutorial on Udemy. I want to address a possible memory issue, as I will sometimes be processing large XLSX or CSV files with 500,000 lines.

My question: Which uses less memory? When I import only some parts of a module or when I do from modulename import *?

I expect to use many modules in my Python code for reading/writing Excel spreadsheets, CSV files, tab-delimited files, and other things in the future, on a VM with only 1GB of memory. So I hope to find out how this works with modules so I can keep my programs small.

Thank you!

There is no effective difference. All of the code in the module (or, for a package, in its __init__.py) will be executed either way. The only difference is how you access the resulting objects after that code has been executed.
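For example (a minimal sketch, using the standard csv module as an arbitrary stand-in), a star import still executes and caches the whole module; the only difference is which names end up bound in your own namespace:

import sys

from csv import *   # the entire csv module is executed and cached,
                    # exactly as it would be for a plain "import csv"

print("csv" in sys.modules)                  # True: the full module object exists
print(sys.modules["csv"].reader is reader)   # True: the star import just bound its names here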

First write the app in an obvious and maintainable way.
Then benchmark to see what its memory use is (see the sketch at the end of this post).
If it’s acceptable, you are done.

If not, then you can worry about shrinking the memory use.
You may find that it’s the storing of the data that you need to optimise.

For example, with a CSV file, can you process it row by row and avoid loading it all into memory?

For XLSX I think you will have to load all of it into memory, which means that big XLSX files will be a limiting factor.
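For the benchmarking step above, the standard library’s tracemalloc module can give a rough peak-memory figure for your Python code (a sketch; process_file and "data.csv" are hypothetical stand-ins for your own work):

import tracemalloc

tracemalloc.start()
process_file("data.csv")        # hypothetical: whatever work you want to measure
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak Python memory: {peak / 1024 / 1024:.1f} MiB")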

2 Likes

Correct, making it maintainable is my primary concern. My secondary concern is memory usage for very large spreadsheets, which are about 5-10GB in the compressed XLSX form.

Yes, that’s how Perl does it, because an XLSX file is a zip file with several XML files and folders in it. So in Perl the whole file has to be loaded into memory, where it is unzipped, then processed. And these large XLSX files with 500,000 rows were what caused an out-of-memory error in Perl on a machine with 16GB of memory.

You can copy the XLSX file and replace the .XLSX extension with .zip and poke around in there.

For some reason, the module I used to read the XLSX file in Perl caused it to use 10x more memory than the file size itself, and I couldn’t get any other modules to work to read it.

Depending on what kind of processing you need to do, a CSV file can be far more performant, basically using negligible memory.

This is because you can process the file line by line:

import csv

with open(csv_file, newline="") as fh:   # newline="" is recommended by the csv docs
    for row in csv.reader(fh):           # csv.reader(fh, delimiter="\t") for a TSV
        pass                             # do something with each row

This will read through the file one line at a time, and never store the whole thing in memory[1].

If you need all of the data in memory for some big computation, this might not work. But many operations can be written in this iterative way, even things like computing means and standard deviation over the whole file.
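As a sketch of that idea (assuming the csv_file from the snippet above, a header row, and a numeric first column), a running mean and standard deviation need only three accumulators, never the whole file:

import csv
import math

count = 0
total = 0.0
total_sq = 0.0

with open(csv_file, newline="") as fh:
    reader = csv.reader(fh)
    next(reader)                  # skip the header row (assumed present)
    for row in reader:
        value = float(row[0])     # assume the column of interest is the first one
        count += 1
        total += value
        total_sq += value * value

mean = total / count
std = math.sqrt(total_sq / count - mean * mean)   # population standard deviation
print(mean, std)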


  1. there will be some file buffering for speed, but this is small

1 Like

You may be able to stream-decode the file and avoid loading it all into memory.
I think that the zipfile module will allow you to read any archive member piece by piece.
If so, then you could see if you can pull out the data you need without fully loading the member.
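A rough sketch of that approach with the standard zipfile module (the file name and the member path xl/worksheets/sheet1.xml are assumptions; the exact member name can vary between files):

import zipfile

with zipfile.ZipFile("big.xlsx") as zf:                   # an .xlsx file is a zip archive
    with zf.open("xl/worksheets/sheet1.xml") as member:   # assumed member name
        while True:
            chunk = member.read(64 * 1024)                # read the member 64 KiB at a time
            if not chunk:
                break
            # feed each chunk to an incremental XML parser here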

This is not uncommon. If the CSV file contains the string “1”, that is 1 byte, but as a float it’s 8 bytes in C and more than that in Python.
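You can see the per-object overhead directly (values are typical for CPython 3.x and vary by version):

import sys

print(sys.getsizeof("1"))    # a one-character str: around 50 bytes
print(sys.getsizeof(1.0))    # a Python float: typically 24 bytes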

Thanks for this! I read there are parts of Perl that are optimized in C. Which ones, I really don’t know.

I suspect that Python/Perl/Ruby etc. will all see a similar memory footprint once you convert from text to lists of ints and strings.

Not necessarily. openpyxl has a read-only mode

Sometimes, you will need to open or write extremely large XLSX files, and the common routines in openpyxl won’t be able to handle that load. Fortunately, there are two modes that enable you to read and write unlimited amounts of data with (near) constant memory consumption.

Unlike a normal workbook, a read-only workbook will use lazy loading.

The docs have an example. I haven’t verified how well it works.
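Based on that documentation, the read-only pattern looks roughly like this (untested here; "big.xlsx" is a placeholder):

from openpyxl import load_workbook

wb = load_workbook("big.xlsx", read_only=True)   # rows are streamed lazily
ws = wb.active
for row in ws.iter_rows(values_only=True):
    pass                                         # do something with the tuple of cell values
wb.close()                                       # read-only workbooks keep the file open until closed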

2 Likes

There is no such thing as “importing only some parts of a module” - except insofar as packages are modules (that could have their “contained” modules attached as attributes). The module is an indivisible object that is created when the source code file is loaded. The from ... import ... form - whether you import specific names, or all of them with * - just does some simple variable assignment after that, so that global variables in the current code are bound to attributes of the module object.
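In other words (a small sketch with the standard math module), a from import is just an ordinary import followed by some name binding:

from math import sqrt    # roughly: import math; sqrt = math.sqrt; del math
print(sqrt(2))

from math import *       # same, but binds every public name from the module
print(cos(0.0))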

Module loading is cached; you only get one module object per source code file regardless of how many import statements run, unless you deliberately work around that system.
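A quick way to see that caching (again using csv purely as an example):

import csv as first
import csv as second      # no re-execution; the cached module object is reused

print(first is second)    # True: one module object per source file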

If you need to make it possible to “load only part”, then you should have a package instead, and load sub-modules as appropriate. If you have a package folder pkg that includes a module mod, then import pkg will not load the mod.py code or create a module object for it. But if for example you have __init__.py and mod.py, then import pkg.mod will execute the __init__.py code (if the result wasn’t already cached) to create the module object representing the pkg package, and then execute mod.py to create the pkg.mod module object as well.
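As a sketch, with a hypothetical package pkg containing a submodule mod:

# Layout (hypothetical):
#   pkg/__init__.py    executed on "import pkg"
#   pkg/mod.py         executed only when pkg.mod is first imported

import pkg               # runs pkg/__init__.py only; pkg/mod.py is untouched
import pkg.mod           # now runs pkg/mod.py as well (pkg itself is already cached)

from pkg import mod      # binds the same cached pkg.mod object to the name mod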

1 Like