Hi,
I initially opened this on GitHub, which was not a good idea since it hadn’t been discussed here first.
Here’s a benchmark/proposal:
import sqlite3, json
import time
import subprocess


def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        # col[0] is the column name
        d[col[0]] = row[idx]
    return d


def get_data_to_json():
    conn = sqlite3.connect("/home/user/database.sqlite")
    conn.row_factory = dict_factory
    c = conn.cursor()
    c.execute("SELECT * FROM server")
    rst = c.fetchall()  # rst is a list of dicts
    return rst


start = time.time()
rst = get_data_to_json()
print(f"normal data load: {time.time() - start}")

start = time.time()
out = subprocess.run(
    ["sqlite3", "--readonly", "/home/user/database.sqlite",
     ".mode json", "select * from server"],
    check=True, capture_output=True,
).stdout
print(f"sqlite3 command: {time.time() - start}")

start = time.time()
loadit = json.loads(out)
print(f"load json from command: {time.time() - start}")

start = time.time()
json.dumps(loadit)
print(f"dump loaded json: {time.time() - start}")
Here’s the results on my machine:
$ python benchmark.py
normal data load: 0.7838420867919922
sqlite3 command: 0.19471359252929688
load json from command: 0.21944022178649902
dump loaded json: 0.16810369491577148
I recently found out that sqlite3’s CLI has a JSON output mode (.mode json, added in the 3.33 shell if I remember right). As the numbers above show, it’s considerably faster than any pure-Python option I’m aware of.
There are lots of cases where this could be useful. Say you have an API endpoint that needs to dump a decent amount of data from your database. There’s no reason to parse the JSON in Python only to dump it again before sending it: in this case it’s about 5x faster (0.78 s to fetch plus 0.17 s to dump, versus 0.19 s) to let sqlite produce the JSON and pass it along untouched.
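To make that concrete, here’s a minimal sketch of such an endpoint using only the standard library. The handler hands the CLI’s JSON bytes straight to the client without ever parsing them in Python. The database path, table, and port are just the ones from my benchmark, and the handler class name is made up; adapt as needed.

import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class ServerListHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Let the sqlite3 CLI serialize the rows; Python never touches the JSON.
        body = subprocess.run(
            ["sqlite3", "--readonly", "/home/user/database.sqlite",
             ".mode json", "SELECT * FROM server"],
            check=True, capture_output=True,
        ).stdout
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 8000), ServerListHandler).serve_forever()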
What’s crazy is that it’s even about 2x faster to call the sqlite3 CLI and parse its JSON output (0.19 s + 0.22 s) than it is to fetchall() with a dict row factory (0.78 s).
I am not sure how the sqlite interface works and whether this would require work on sqlite’s part, but to me this is a substantial improvement. It’s possible this isn’t technically feasible at the moment.
I imagine this might have an interface like:
session.exec(query, mode="json")
or session.exec(query, mode="dict")
Now, while this would be more convenient regardless of the implementation, I would hope it could match the speed I’m seeing from calling subprocess.run on the CLI. One possible avenue is sketched below.
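For what it’s worth, part of that speed might already be reachable from inside the sqlite3 module by asking SQLite itself to assemble the JSON with its JSON1 functions (json_object and json_group_array, compiled in by default since SQLite 3.38). Here’s a rough sketch; exec_json is a hypothetical helper name, and I haven’t benchmarked it against the CLI:

import sqlite3

def exec_json(conn, table):
    # Hypothetical stand-in for the proposed mode="json": the JSON array is
    # assembled inside SQLite, so Python never builds per-row dicts.
    # Note: very wide tables could hit SQLite's function-argument limit.
    cols = [d[0] for d in conn.execute(f'SELECT * FROM "{table}" LIMIT 0').description]
    pairs = ", ".join(f"'{c}', \"{c}\"" for c in cols)
    row = conn.execute(
        f'SELECT json_group_array(json_object({pairs})) FROM "{table}"'
    ).fetchone()
    return row[0]  # a JSON string, ready to pass along as-is

conn = sqlite3.connect("/home/user/database.sqlite")
print(exec_json(conn, "server")[:200])

Since everything stays in one process and one library call, this at least suggests what a built-in JSON mode could look like performance-wise.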
Curious what you all think.
Thanks!