Loading Large Datasets from Snowflake to Python

Nishesh · December 2, 2024, 5:35am

Hi everyone,

I’m a recent Python convert from a 10-year SAS background. While I’ve been learning Python through Datacamp, I’m currently facing a roadblock when trying to load large datasets from Snowflake into Python.

In SAS, this is a straightforward process involving libname and data steps. However, I’m struggling to replicate this in Python. I’ve been using the following code, but it’s running into memory issues with large datasets.

I’ve tried various approaches, including chunking the data, but I’m still hitting limitations. I’d appreciate any guidance or suggestions on how to efficiently load and process large datasets from Snowflake in Python.

Thank you in advance for your help!

Regards,
Nishesh

Code below:

import pandas as pd

from sqlalchemy import create_engine, text
from snowflake.sqlalchemy import URL

SNOWFLAKE_USER = 'xxxxxxx'
SNOWFLAKE_ACCOUNT = 'yyyyyy'
SNOWFLAKE_DATABASE = 'aaaaa'
SNOWFLAKE_SCHEMA = 'bbbbb'  # Specify the schema if needed
SNOWFLAKE_ROLE = 'ccccc'  # Optional: specify your Snowflake role if required
SNOWFLAKE_WAREHOUSE = 'ddddd'  # Optional: specify your Snowflake warehouse if required

# Establish connection to Snowflake using SSO with MFA

engine_dev = create_engine(URL(
account = SNOWFLAKE_ACCOUNT,
user = SNOWFLAKE_USER ,
password = 'abc',
role=SNOWFLAKE_ROLE,
database = SNOWFLAKE_DATABASE,
warehouse = SNOWFLAKE_WAREHOUSE,
authenticator = 'externalbrowser'
))
connection_dev = engine_dev.connect()

# Define your parameters in a dictionary
params = {'period': '202409', 'business_written_ind': 'Y'}

# Define your SQL query as a TextClause
query = text("SELECT * FROM bbbbb.RISK_DATA_V WHERE PERIOD = :period AND business_written_ind = :business_written_ind")

# Use Pandas to read data in chunks
df_iter = pd.read_sql(query, connection_dev, params=params, chunksize=500000)

# Create an empty list to store the chunks
all_chunks = []

# Process data in chunks
for chunk in df_iter:
    # Process each chunk as needed
    print(chunk.head())
    all_chunks.append(chunk)

# Concatenate all chunks into a single DataFrame
final_df = pd.concat(all_chunks, ignore_index=True)

print(final_df.head())
print(final_df.tail())

connection_dev.close()
engine_dev.dispose()

barry-scott · December 2, 2024, 9:06am

I do not know what snowflake is or how to use pandas.
I am just looking at the code presented.

This means that you need enough memory to hold the total dataset in RAM.

Processing in chunks makes sense only if you do not keep the old chunks after processing.

doomsdaychicken · December 2, 2024, 6:27pm

Use fetch_pandas_all or batches: Using pandas DataFrames with the Python Connector | Snowflake Documentation