Read Large Excel

I have a large Excel file with 5-6 worksheets.
I am trying to read one sheet that has 0.7 million (7 lakh+) rows.
The process takes hours to read such a huge file using pd.read_excel().

I need help making the code faster / more responsive.

  1. What’s the total RAM your computer has that is running this Python program?
  2. What is the size of the spreadsheet file in kilobytes?
  3. Do you have the same size spreadsheet with fake data somewhere so I can test reading it with Pandas?

This should not be a problem for Python or Pandas. It might take 1-2 minutes to read a spreadsheet like that, but if the used range spans thousands of mostly-empty columns on every row, it will take a while.

  1. System RAM size is 16 GB.
  2. Spreadsheet file size is around 28MB.
  3. The file contains only 10 columns. I tried specifying the column range using ‘usecols’.

When I extract the required worksheet into a separate file, the process still takes around 45 minutes to 1 hour to read the content.

My sample code:

import io
import msoffcrypto
import pandas as pd

file_password = ''
decrypted_workbook = io.BytesIO()  # decrypted copy kept in memory

with open('Downloads/file.xlsb', 'rb') as file:
    office_file = msoffcrypto.OfficeFile(file)
    office_file.load_key(password=file_password)
    office_file.decrypt(decrypted_workbook)

decrypted_workbook.seek(0)  # rewind before handing the buffer to pandas
df = pd.read_excel(decrypted_workbook, sheet_name=worksheet_name, usecols="A:N")

Unfortunately, reading xlsb files with pandas is generally very slow. Are you able to save the file as xlsx, or even better as csv? Then it would read very fast.
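For reference, once the sheet is exported, reading it back is a one-liner (the file name below is just a placeholder):

import pandas as pd

# Assumes the relevant sheet has already been saved as a CSV file.
df = pd.read_csv("samplefile.csv")
print(df.shape)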

Sarang,

As someone else pointed out, a properly saved CSV file may be a way to go.

You may also want to verify that your Excel worksheet is really the size you mention. All it takes is some cell containing an invisible space far from anything and …

But a question I have is about your attempt to tell pandas you have ten columns. If your data starts in column A, then you want to ask for "A:J" rather than "A:N".

I am also wondering whether the workbook being encrypted is a factor. Are you sure your code above has made a decrypted version somewhere and that you are working with that? Is all of that in memory, including all the other sheets?

There are many other situations where reading in a dataframe can be slow, such as when your data has flaws so that in row 8000 you have a character string or a period instead of a number. Often there is a look-ahead phase to guess the type of each column; you can have it skip some of that.

So I suggest looking at the output you get now, after the long read, and checking whether it is even correct. If a cell is of type character when it should be integer, for example, then something else needs adjustment.

You may also look into other pd.read_excel options, such as telling it what dtype to expect, whether there is an appropriate header line, how many rows to skip, or whatever else may apply.
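For example (a sketch only; the file name, sheet index, and column names/dtypes here are assumptions, so adjust them to your data):

import pandas as pd

# Spelling out the header row and dtypes lets pandas skip some type guessing.
df = pd.read_excel(
    "decrypted.xlsx",                            # placeholder for your decrypted file
    sheet_name=0,                                # or the actual sheet name
    usecols="A:J",                               # ten columns starting at A
    header=0,                                    # row containing the column names
    dtype={"Id": "Int64", "Amount": "float64"},  # hypothetical column names
)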

Do note that there are other ways to read in data; if you saved a sheet as a CSV, you can even read it in a line at a time, split at the commas, and build your own data structure, if all else fails.
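A rough sketch of that last-resort approach (the file name is a placeholder; the standard csv module does the comma splitting, including quoted fields):

import csv

rows = []
with open("samplefile.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)    # splits each line at the commas for you
    header = next(reader)     # first line as column names
    for fields in reader:
        rows.append(fields)   # build whatever structure you need here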

And you may want to look at the memory management in your program if it turns out you have lots of big items you no longer need but have not removed. Slowness can be caused by memory thrashing as garbage collection tries to keep up.
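If you want to be explicit about it, dropping big intermediates looks something like this (illustrative only):

import gc

big_intermediate = list(range(10_000_000))  # stand-in for a large object you no longer need
del big_intermediate                        # release the reference
gc.collect()                                # optionally force a collection right away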

Avi,
I tried all your suggested ways, but it is still taking a long time.

Uploading the Excel file is not possible here.

XLSB is a binary file format used to make files smaller and faster for Excel to read, but not necessarily for Python, as others mentioned above.

Can you convert this to XLSX and then read the XLSX file? It will be bigger; how much bigger I don’t know. I have no experience using XLSB files.

But an XLSX file is a zipped archive containing XML and other files and subfolders. Copy the XLSX file, rename the extension to .zip, and you can poke around inside.
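You can do the same poking around from Python with the standard library (the file name is just a placeholder):

import zipfile

# An .xlsx file is a ZIP archive of XML parts; list what is inside it.
with zipfile.ZipFile("samplefile.xlsx") as zf:
    for name in zf.namelist():
        print(name)  # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml, xl/sharedStrings.xml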

Caveats

Converting the file to a tab-delimited or .csv format can have problems, as the customer sometimes sends us files with CRLF embedded in random cells. This means I have to convert the file to tab-delimited, then write an error routine in my program looking for odd data where a specific column clearly has the wrong values. Manually scanning through 500,000 lines by eye is not an option.

In Excel we have even found embedded tabs inside cells, which also mess up the tab-delimited file, which is why my first preference is to use CSV. How the tabs get in there I have no idea, but we receive Excel files made on Macs and Windows alike.
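For what it’s worth, a sketch of that kind of error routine over a CSV export (the file name and expected column count are assumptions):

import csv

EXPECTED_COLUMNS = 10  # adjust to your file

with open("export.csv", newline="", encoding="utf-8") as f:
    # csv.reader copes with embedded newlines/tabs as long as the cells are quoted.
    for lineno, row in enumerate(csv.reader(f), start=1):
        if len(row) != EXPECTED_COLUMNS:
            print(f"Suspect row {lineno}: {row}")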

Yes, I know the file is large. Is there a URL to a large file like that which I can test with?

EDIT: I found this 304MB CSV file that people can test with: 🍄Mushroom Overload | 6.7M Rows | Kaggle
You will probably have to log in using Google or another method to get the data.

GitHub - Ssj11/Sample_File - You can download the sample file from here.

OK, I got your file. It’s only 7.5MB but has 750,000 rows, and the spreadsheet also extends through column XFB, which is part of the reason it’s getting slowed down. It has LOTS of columns according to Excel. Excel thinks there is data in those extra columns even if it’s only a null character.

I will take a look at it but this may not get done until next week.

EDIT: To find the end of the spreadsheet, I opened it in Excel and, on my laptop, hit Ctrl-Fn-RightArrow. This takes me to the maximum row and maximum column.
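The same check can be done from Python; a sketch with openpyxl, which reports the used range of a sheet (the file name is a placeholder):

from openpyxl import load_workbook

wb = load_workbook("samplefile.xlsx", read_only=True)
ws = wb.active
print(ws.max_row, ws.max_column)  # the used range Excel thinks the sheet has
wb.close()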

The XLSX file is 41MB, and I’m timing how long it takes to read it. While it reads the file, my CPU usage hovers around 20-30%.

The original file has 8 worksheets in total.
In the file I uploaded to that link, I removed the rest of the sheets and kept only the required sheet that I am using, so the file size came down to 7.5 MB.

About converting to CSV: I could download and save it as CSV, but since the workbook has multiple sheets, I can’t read it as a single tab-delimited file.

  1. I ran a test using Python 3.11.9 on Windows 10 with an SSD and 32GB RAM. I read the XLSX file 10 times before the Python program exited. Only the first tab was read.
  2. I only read the file I did not process or print any data from the spreadsheet as your concern was about the read phase.
  3. Pandas seems to read the whole spreadsheet, or at least one default tab, before we can do anything with the data.
  4. Here’s my list of read times in seconds. The 261 is from a separate run before I made the read loop in my program. [261, 45.5, 45.0, 44.4, 44.9, 47.3, 44.9, 44.0, 44.8, 44.2]
  5. Your read times may be slower on an HDD or on a free cloud service like Google Colab. I know our Azure Data Factory is REALLY slow to process just 1KB of data.

Here’s my program.

r'''
Read a large 40MB XLSX file.

Install these: pip install pandas openpyxl

Read times in seconds: 261
'''
import inspect
import os
import sys
import pandas
import time

#######################################################
def readxla(xlfn):
    r'''Read a rather large 40MB XLSX file.
    Out: Seconds it took to read the file.'''
    procname = str(inspect.stack()[0][3]) + ":"

    starttime = time.time()
    print(f"\n{procname} Reading {xlfn}")
    try:
#        df = pandas.read_excel(xlfn, sheet_name=optexceltab, usecols='A:N', skiprows=[1]) # To skip headers.
        df = pandas.read_excel(xlfn, usecols='A:N', skiprows=[1]) # To skip headers.
    except PermissionError:
        print(f"{procname} ERROR: you should close any spreadsheet you want to read first.")
        sys.exit()
    except Exception as e:
        print(f"{procname} ERROR-read_excel: {e}")
        sys.exit()
    numrows = len(df) # Number of rows we found.
    mytime = round(time.time() - starttime,1)
    print(f"{procname} Time elapsed {mytime} seconds")
    zz=0 # Stop here.
    # for index, rowdata in resultdf.iterrows():
        # rowlen = len(rowdata) # stuff = rowdata['Colname']
    return mytime
    
#######################################################
def read_csv1(options):
    procname = str(inspect.stack()[0][3]) + ":"

#######################################################
# Main program

fn = 'samplefile.xlsx'
timelist = [261]
if not os.path.exists(fn):
    print(f"ERROR: file does not exist. {fn}")
    sys.exit()

for i in range(9):
    print(f"Pass number {i}")
    sec = readxla(fn)
    timelist.append(sec)
    
print(f"List of read times in seconds: {timelist}")

If it’s necessary for you to process all the tabs in the spreadsheet, I don’t know how to loop through them; I have not done that before.
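(From the pandas documentation, sheet_name=None should return a dict of DataFrames, one per tab; an untested sketch:)

import pandas as pd

# sheet_name=None reads every tab into a dict keyed by sheet name.
all_sheets = pd.read_excel("samplefile.xlsx", sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(name, len(sheet_df))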

Tips to speed things up

  1. Convert the file to an XLSX file and read that.
  2. Stop or pause programs on your PC if they are using the disk a lot or use a lot of memory, especially if you use an HDD.

Thanks for the snippet!
However, when I saved the XLSB file as XLSX manually and ran the test, the process took about the same times as yours.

I am still failing to convert the XLSB to XLSX using Python.
This is what I am trying for the conversion:

# Read the XLSB file and convert it to XLSX
import pandas as pd
from pyxlsb import open_workbook

with open_workbook(file_path) as wb:
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        for sheetname in wb.sheets:
            with wb.get_sheet(sheetname) as sheet:
                rows = []
                for row in sheet.rows():
                    rows.append([item.v for item in row])
                df = pd.DataFrame(rows[1:], columns=rows[0])  # assuming the first row is the header
                df.to_excel(writer, sheet_name=sheetname, index=False)

Don’t you have Excel to convert an XLSB to an XLSX file?

If not, can you upload your file to this website to convert it to XLSX? https://convertio.co/

Maybe you can try using polars instead of pandas; it’s supposed to be faster: polars.read_excel — Polars documentation
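Something along these lines, per the polars documentation (I have not benchmarked it on this file; the file and sheet names are taken from the sample file shared above):

import polars as pl

# polars can read xlsb through the calamine engine.
df = pl.read_excel(
    "SampleFile.xlsb",
    sheet_name="Basesheet",
    engine="calamine",
)
print(df.shape)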

That is a third-party conversion service.
I have to convert the XLSB to XLSX using Python.

In your example file, all (most?) of the rows have a length of 16381 cells according to pyxlsb (which Pandas uses to read XLSB files) even though they mostly contain only 13–14 values, so pyxlsb wastes most of its time reading and converting empty cells.

Pandas also supports calamine as a backend, which is much faster than pyxlsb in this case. You can enable it by installing python-calamine and passing engine="calamine" to read_excel.

This finishes in 7 seconds on my machine when using the calamine backend (reading 746408 rows x 14 columns):

import pandas as pd

filename = "SampleFile.xlsb"
worksheet_name = "Basesheet"

with open(filename, 'rb') as f:
    df = pd.read_excel(f, sheet_name=worksheet_name, usecols="A:N", engine="calamine")
    print(df)

Since you have to use Python to convert the XLSB to an XLSX file, use the usecols parameter in .read_excel(). I don’t know the internals of how .read_excel works, but it might be faster that way. Otherwise, just be patient and convert the file once. These XLSB files take more time to read.

df = pandas.read_excel(xlfn, usecols='A:N', skiprows=[1])

Or use an online service to convert it to an XLSX file. Or get LibreOffice (free) or OpenOffice; they both read and write Excel files. Hopefully they will read the XLSB file and let you save it as an XLSX file.

The usecols parameter isn’t passed through to the backend, so unfortunately it doesn’t make pyxlsb any faster.

If pyxlsb is used directly (not via Pandas), you can set the dimension attribute on the sheet object to specify a smaller number of columns, which speeds it up substantially, but it raises an IndexError when it encounters row 475202 in the sample file, so it doesn’t quite solve the problem. Code snippet below.

import pyxlsb

wb = pyxlsb.open_workbook("SampleFile.xlsb")
sheet = wb.get_sheet("Basesheet")
sheet.dimension = sheet.dimension._replace(w=18)
for i, row in enumerate(sheet.rows(sparse=True)):
    print(i, len(row), [c.v for c in row])

But as I said earlier, if you switch the backend from pyxlsb to calamine, Pandas does the job in a few seconds.
