Hi everyone,
I’m a recent Python convert with a 10-year SAS background. I’ve been learning Python through DataCamp, but I’ve hit a roadblock trying to load large datasets from Snowflake into Python.
In SAS, this is a straightforward process involving libname and data steps. However, I’m struggling to replicate it in Python. I’ve been using the code pasted at the bottom of this post, but it runs into memory issues with large datasets.
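For context, what I’d ideally like is something as simple as the one-step load sketched here (this is just a rough illustration of what I mean: it assumes an already-created SQLAlchemy engine and uses the same view as my real query). The trouble is that a plain pd.read_sql like this pulls the entire result set into memory at once, which is exactly where it falls over:

import pandas as pd

# Sketch only: 'engine' is assumed to be a SQLAlchemy engine already pointed at Snowflake
df = pd.read_sql(
    "SELECT * FROM bbbbb.RISK_DATA_V WHERE PERIOD = '202409'",
    engine,
)
print(df.shape)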
I’ve tried various approaches, including chunking the data, but I’m still hitting limitations. I’d appreciate any guidance or suggestions on how to efficiently load and process large datasets from Snowflake in Python.
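To make the "limitations" concrete: even with chunking, my code at the bottom of this post concatenates every chunk back into a single DataFrame, so I suspect I’m just rebuilding the same memory problem at the end. Is per-chunk processing along these lines (aggregating each chunk as it arrives and never holding the full table) the direction I should be heading? This is only a rough sketch: query, connection_dev and params are as defined in my code below, and the group/value column names are made up for illustration.

# Sketch: keep only a running aggregate per chunk instead of the raw rows
partial = []
for chunk in pd.read_sql(query, connection_dev, params=params, chunksize=500000):
    partial.append(chunk.groupby("SOME_GROUP_COL")["SOME_VALUE_COL"].sum())
# Combine the per-chunk sums into one overall summary
summary = pd.concat(partial).groupby(level=0).sum()
print(summary.head())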
Thank you in advance for your help!
Regards,
Nishesh
Code below:
import pandas as pd
from sqlalchemy import create_engine, text
from snowflake.sqlalchemy import URL
SNOWFLAKE_USER = 'xxxxxxx'
SNOWFLAKE_ACCOUNT = 'yyyyyy'
SNOWFLAKE_DATABASE = 'aaaaa'
SNOWFLAKE_SCHEMA = 'bbbbb' # Specify the schema if needed
SNOWFLAKE_ROLE = 'ccccc' # Optional: specify your Snowflake role if required
SNOWFLAKE_WAREHOUSE = 'ddddd' # Optional: specify your Snowflake warehouse if required
# Establish connection to Snowflake using SSO with MFA
engine_dev = create_engine(URL(
    account=SNOWFLAKE_ACCOUNT,
    user=SNOWFLAKE_USER,
    role=SNOWFLAKE_ROLE,
    database=SNOWFLAKE_DATABASE,
    schema=SNOWFLAKE_SCHEMA,
    warehouse=SNOWFLAKE_WAREHOUSE,
    authenticator='externalbrowser'  # browser-based SSO, so no hard-coded password is needed
))
connection_dev = engine_dev.connect()
# Define your parameters in a dictionary
params = {'period': '202409', 'business_written_ind': 'Y'}
# Define your SQL query as a TextClause
query = text("SELECT * FROM bbbbb.RISK_DATA_V WHERE PERIOD = :period AND business_written_ind = :business_written_ind")
# Use Pandas to read data in chunks
df_iter = pd.read_sql(query, connection_dev, params=params, chunksize=500000)
# Create an empty list to store the chunks
all_chunks = []
# Process data in chunks
for chunk in df_iter:
    # Process each chunk as needed
    print(chunk.head())
    all_chunks.append(chunk)
# Concatenate all chunks into a single DataFrame
final_df = pd.concat(all_chunks, ignore_index=True)
print(final_df.head())
print(final_df.tail())
connection_dev.close()
engine_dev.dispose()