Joining a DataFrame and a SQL Server table directly, and updating the table

Hi All,
I am new to Python and looking for options for a direct join (just like a
SQL join between tables) between a DataFrame and a SQL Server table,
followed by inserting/updating the table(s). Is there a way to join and
insert/update without using a temp table or a cursor?

Reason: in the actual scenario I have millions of rows between the
DataFrame and the SQL Server tables. My understanding is that using a
temp table or a cursor is an expensive process, so I am looking for
alternatives.

This is what the DataFrame looks like:

df
  name author  count
0    a      b     10
1    c      d      5
2    e      g      2

I need to join it with a SQL Server table that looks like this:
TABLE: Books
title  author  url    Count
a      b       aaaaa  NULL
b      z       bbbbb  NULL
e      g       ccccc  NULL

I want to join df and Books on df.name = Books.title, then
insert the rows from df that are missing in Books, and
update the Count column from df:

INSERT INTO Books (title, author, url, Count)
SELECT df.name, df.author, NULL, df.count
FROM df LEFT JOIN Books ON df.name = Books.title
WHERE Books.title IS NULL

UPDATE Books
SET Count = df.count
FROM df INNER JOIN Books ON df.name = Books.title

Can anyone guide me on how to achieve this?
Thanks!

Expected result:

title  author  url    Count
a      b       aaaaa  10
b      z       bbbbb  NULL
e      g       ccccc  2
c      d       NULL   5

This isn’t necessarily so; it really depends on what you’re doing.

And the stuff below is general advice.

Somehow you need to get the SQL table data and the DF data in the same
place for comparison. You’ve got two basic ways to do that:

  • copy the DF data into the database (a temporary table) and use SQL in
    the database very much as you describe above
  • copy the SQL data into your Python programme (or at least the relevant
    bits - the keys you need to look up) and use Python to compute the new
    rows, and then send them to SQL to be inserted

What’s efficient depends on how big things are. Is the SQL table large?
(Possibly quite large later if it is to grow indefinitely over time.) Is
the DF large?

If the DF is “small” (a vague term, I know) you could query the SQL for
its keys, locate the DF rows that did not come back in the answer, and
SQL INSERT those rows; something like:

 SELECT title FROM Books WHERE title IN (the DF.name values go here...)
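
Here's a minimal sketch of that route with pyodbc; the connection string
is an assumption, and df, Books and the column names come from your
examples above:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
    )
    cursor = conn.cursor()

    # Ask the server which DF names already exist as titles.
    # (SQL Server caps a statement at roughly 2100 parameters, so
    # chunk the list if the DF is big.)
    names = df["name"].tolist()
    placeholders = ", ".join("?" * len(names))
    cursor.execute(
        f"SELECT title FROM Books WHERE title IN ({placeholders})", names
    )
    existing = {row.title for row in cursor.fetchall()}

    # The DF rows whose name did not come back are the new ones.
    new_rows = df[~df["name"].isin(existing)]
    cursor.executemany(
        "INSERT INTO Books (title, author, url, Count) VALUES (?, ?, NULL, ?)",
        list(new_rows[["name", "author", "count"]]
             .itertuples(index=False, name=None)),
    )
    conn.commit()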

If the SQL table is small (or, given that you only need the book titles,
moderately large but not insanely large - thinking personal library
versus a large public library here), you could just SELECT all the
book titles and then find your new DF rows in Python. This avoids
sending all the DF.name values to the SELECT.
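
For example, a sketch reusing the assumed connection string above
(pandas.read_sql accepts a plain DBAPI connection, though recent pandas
versions prefer a SQLAlchemy engine):

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
    )

    # Pull every title once; cheap while the Books table stays modest.
    existing = pd.read_sql("SELECT title FROM Books", conn)

    # DF rows whose name is not among the existing titles are new.
    new_rows = df[~df["name"].isin(existing["title"])]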

If the DF is going to be mostly new rows, or the DF is not insanely
large, send it to SQL as a temporary table and let SQL do the work per
your example SQL code.
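
A sketch of that staging-table route, assuming SQLAlchemy (which
pandas.to_sql needs for SQL Server) and a made-up scratch table name
df_staging. A true #temp table is awkward here because it is tied to one
session and to_sql may not keep reusing that session, so a regular table
you drop afterwards is simpler:

    import pandas as pd
    from sqlalchemy import create_engine, text

    # Assumed engine URL; adjust driver/server/database to your setup.
    engine = create_engine(
        "mssql+pyodbc://myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server"
    )

    # Push the whole DF into a scratch table server-side.
    df.to_sql("df_staging", engine, if_exists="replace", index=False)

    with engine.begin() as sql_conn:
        # Insert the DF rows with no matching title, per your INSERT above.
        sql_conn.execute(text("""
            INSERT INTO Books (title, author, url, Count)
            SELECT s.name, s.author, NULL, s.count
            FROM df_staging AS s
            LEFT JOIN Books AS b ON s.name = b.title
            WHERE b.title IS NULL
        """))
        # Update Count on the rows that did match, per your UPDATE above.
        sql_conn.execute(text("""
            UPDATE b
            SET b.Count = s.count
            FROM Books AS b
            INNER JOIN df_staging AS s ON s.name = b.title
        """))
        sql_conn.execute(text("DROP TABLE df_staging"))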

The main expense (to my naive mind) of a temporary table is that it also
needs an index (to make the JOIN efficient). If you can pull the SQL
keys into Python memory with an SQL SELECT, you can do the comparison
entirely in memory and use the various DataFrame operations to match up
rows - probably faster than using the database.
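
For instance, with the keys pulled down (same assumed connection as
above), pandas' merge with indicator=True splits the DF into new and
matching rows in one pass:

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
    )
    titles = pd.read_sql("SELECT title FROM Books", conn)

    # Left-join the DF against the table keys; _merge labels each row.
    merged = df.merge(titles, left_on="name", right_on="title",
                      how="left", indicator=True)

    new_rows = merged[merged["_merge"] == "left_only"]  # rows to INSERT
    matched = merged[merged["_merge"] == "both"]        # rows whose Count to UPDATE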