Hi everyone,
Apache Arrow has become the de facto standard in-memory format for analytical data. It allows data to move across language and process boundaries without being copied in memory. Many successful projects are built on top of Arrow, such as Polars and DuckDB, and since version 2.0 Pandas can use Arrow as its array backend.
In modern Python data pipelines, the data never has to pass through Python objects at all. It is typically loaded directly into Arrow, Python is used only to hand the Arrow arrays to various compiled computation engines, and the results are then dumped directly into a compatible format such as Apache Parquet. Pipelines written in this style pay almost no performance penalty for using Python.
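A minimal sketch of what such an Arrow-native pipeline can look like, using only pyarrow; the file names and the filter condition are made up for illustration:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Decode the file straight into Arrow memory; no Python objects are created.
events = pq.read_table("events.parquet")

# The computation runs in compiled Arrow kernels (a compiled engine such as
# DuckDB or Polars could consume the same table zero-copy instead).
positive = events.filter(pc.greater(events["amount"], 0))

# Dump the result back into a compatible format, again without touching Python.
pq.write_table(positive, "positive_events.parquet")
```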
The story is different for relational databases however. Python defines the standard DBAPI v2 and most database drivers conform to it. This API is row-based and works with python values.
The stereotypical Python data pipeline that talks to a relational database looks like this (sketched in code right after the list):
1. Fetch data from the database (in a compiled database driver module)
2. Convert the query result into Python lists of Python tuples of Python values
3. Convert those lists into Arrow
4. Do the computation in Arrow
5. Convert the Arrow data back into Python lists of Python tuples of Python values
6. Pass the result to the database driver, which inserts it into the database
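A rough sketch of that pipeline with a generic DBAPI driver; conn is assumed to be an already-open connection using the qmark paramstyle, and the table and column names are made up:

```python
import pyarrow as pa
import pyarrow.compute as pc

cur = conn.cursor()

# Steps 1-2: the driver fetches the result set and materialises it as
# Python tuples of Python values (this is what DBAPI mandates).
cur.execute("SELECT id, amount FROM orders")
rows = cur.fetchall()

# Step 3: convert the Python objects into Arrow, guessing the types.
ids, amounts = zip(*rows) if rows else ((), ())
table = pa.table({"id": list(ids), "amount": list(amounts)})

# Step 4: the actual computation, running in compiled Arrow kernels.
table = table.set_column(1, "amount", pc.multiply(table["amount"], 2))

# Step 5: convert the Arrow data back into Python tuples of Python values.
out_rows = [tuple(row.values()) for row in table.to_pylist()]

# Step 6: hand the Python objects back to the driver.
cur.executemany("INSERT INTO orders_doubled VALUES (?, ?)", out_rows)
```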
It turns out that steps 3 and 5 are completely unnecessary, and they are very costly. This overhead becomes an obvious bottleneck in cloud computing settings, where the database side of the interaction is fast and the conversion through Python objects dominates. Moreover, detailed type information is lost when converting to Python types, so step 3 usually requires fragile type inference.
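A tiny example of how that type inference goes wrong once the values are plain Python objects (the column type in the comment is my own assumption):

```python
import pyarrow as pa
from decimal import Decimal

# The database knows the column is, say, NUMERIC(12, 2); after the round-trip
# through Python objects, Arrow can only guess from the values it happens to see:
pa.array([Decimal("1.50"), Decimal("2.25")])  # inferred as a narrow decimal type, not NUMERIC(12, 2)
pa.array([None, None])                        # inferred as the null type; the column type is lost entirely
```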
There have been numerous projects that aim to solve this problem:
- ADBC
- turbodbc
- arrow-odbc
- connector-x
- oracledb (recently added support to create Arrow arrays directly in the driver)
These implementations usually report several-fold speedups over DBAPI. But each of them comes up with its own bespoke API for doing so, and some of them do not even provide the rest of the DBAPI semantics (connections, cursors, etc.); two examples are sketched below.
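For example, here is roughly how two of them look today (method names and connection strings written from memory, so treat them as approximate; the point is only that each API is different):

```python
import adbc_driver_sqlite.dbapi as adbc_sqlite
import connectorx as cx

# ADBC keeps DBAPI-style connections and cursors, but Arrow results come
# from its own extra method:
conn = adbc_sqlite.connect("demo.db")
cur = conn.cursor()
cur.execute("SELECT * FROM orders")
table = cur.fetch_arrow_table()   # a pyarrow.Table instead of Python rows

# connector-x skips connections and cursors entirely:
table = cx.read_sql("sqlite:///demo.db", "SELECT * FROM orders", return_type="arrow")
```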
I propose that the DBAPI specification be extended, in a backward-compatible way, with optional methods that allow drivers to work with Arrow data directly, something along the lines of:
- cursor.fetch{one,many,all}_arrow will return the results as a tuple of Arrow arrays
- cursor.execute should accept parameters given as Arrow arrays
- cursor.description_arrow will return a tuple of Arrow dtypes that best matches the result set
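A hypothetical usage sketch of the proposed extension; none of this exists yet, and conn and the table names are made up:

```python
cur = conn.cursor()

# Results come back as Arrow; no Python objects are materialised in between.
cur.execute("SELECT id, amount FROM orders")
id_arr, amount_arr = cur.fetchall_arrow()   # tuple of Arrow arrays, one per column
print(cur.description_arrow)                # tuple of Arrow dtypes describing the result set

# And in the other direction, parameters are passed as Arrow arrays.
cur.execute("INSERT INTO orders_copy VALUES (?, ?)", (id_arr, amount_arr))
```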
Driver libraries would not need to depend on pyarrow. Arrow defines a simple “Arrow PyCapsule Interface” that any library can implement on its own; I believe this is how oracledb did it.
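For illustration, the producer side could look roughly like this; the _native helper is hypothetical and stands in for the driver's compiled core:

```python
class ArrowCursorResult:
    def __arrow_c_stream__(self, requested_schema=None):
        # The PyCapsule Interface only requires returning a PyCapsule named
        # "arrow_array_stream" wrapping an ArrowArrayStream struct; the driver
        # can build it in its C/Cython layer without importing pyarrow.
        return self._native.export_arrow_stream_capsule(requested_schema)

# A consumer that does use pyarrow can then ingest such an object directly,
# e.g. pyarrow.table(result) in recent pyarrow versions, which accept any
# object exposing __arrow_c_stream__.
```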
Implementing this would pave the way for, e.g., SQLAlchemy, arguably the best database toolkit in any language, to provide efficient analytical data access on top of these drivers.
From what I’ve seen of oracledb and turbodbc, conforming to an extended DBAPI like this would be a minor change for them, mostly a matter of renaming some methods.
What do you think about it?