Asynchronous DBAPI

I would like to propose a PEP that defines a standard for an asynchronous DBAPI interface using the async/await syntax. This would be almost identical to the existing standard (PEP 249) except for the following differences:

  • threadsafety would be replaced with tasksafety (updated semantics to follow)
  • functions that might cause file or network activity (executing queries, stored procedures etc.) would be defined as coroutine functions
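To make the proposal concrete, here is a hypothetical sketch of what such a driver's surface could look like. Every name here (tasksafety, connect, the class layout) is illustrative only, modelled on PEP 249 but not part of any published standard:

```python
import asyncio

tasksafety = 1  # module may be shared between tasks, connections may not


class Cursor:
    def __init__(self):
        self._rows = []

    async def execute(self, operation, parameters=()):
        # A real driver would send the query over the network here.
        await asyncio.sleep(0)  # yield to the event loop
        self._rows = [(1, "demo")]

    async def fetchall(self):
        await asyncio.sleep(0)
        return list(self._rows)


class Connection:
    def cursor(self):
        return Cursor()

    async def commit(self):
        await asyncio.sleep(0)


async def connect(dsn):
    # A real driver would open a socket here, hence the coroutine.
    await asyncio.sleep(0)
    return Connection()


async def main():
    conn = await connect("demo://")
    cur = conn.cursor()
    await cur.execute("SELECT 1")
    rows = await cur.fetchall()
    await conn.commit()
    return rows


print(asyncio.run(main()))  # [(1, 'demo')]
```

The shape of the calls mirrors PEP 249; only the methods that might touch the network become coroutine functions, so cursor() and other purely local operations stay synchronous.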

I am proposing this because there are existing async DBAPI-like drivers (aiopg, aiomysql, aioodbc) and I would like to leverage these in my effort to create an async fork of SQLAlchemy.


Why is an async DB API standard important?
I borrowed the design idea for the aiopg driver because I personally had no idea what an async API should look like. Nikolay extended the aiopg design to aiomysql and aioodbc.

But if you look at the SQLAlchemy code you’ll find that SQLAlchemy doesn’t use the DBAPI directly but builds dialect mediator classes. A dialect can be designed on top of any DB interface; it doesn’t matter whether it supports the DBAPI or not.
The same is true for any ORM wrapper in Python, e.g. Django etc.
I’m pretty sure that making a dialect for the asyncpg driver wouldn’t take longer than an aiopg version if a hypothetical async SQLAlchemy already existed.

Having drivers conform to a common standard would at least make the process easier since you would not have to learn the semantics of each driver separately. Then there’s the matter of using these drivers without SQLAlchemy. Or would you argue that the PEP 249 standard is pointless too?


PEP 249 was good at the time of creation.

I recall the standard’s problems, e.g. it doesn’t support SQL prepared statements.

Another problem is that the DBAPI specifies exception names, but there is no common dbapi module that contains these exception classes. SQLAlchemy has complex code for converting each driver’s exceptions into SQLAlchemy’s counterparts.

Cursor factories are not part of the DBAPI standard but are a very common approach. Changing a method’s return value depending on an argument is a code smell, I think.

Sure, there are other problems as well.

I don’t argue that a DBAPI standard is bad. I would say that developing a perfect standard is a super hard thing.

What you have pointed out are problems in the current standard, and I don’t disagree with you. So would it make sense to create a better standard in its place?


We could add a module level flag which identifies a DB-API compatible
module as asyncio compatible. I don’t think the API itself needs to
be changed in any way, though.

Note that most database libraries are implemented in C, with the
communication with the database happening at the C level, outside
the reach of Python and often using threads underneath, so an
asyncio driver would not be able to pass control back to an event
loop.

Some C APIs offer async support, e.g. the ODBC standard provides
support for it, so that may help, provided the underlying ODBC
drivers implement this.

The idea may work for wire protocol drivers entirely written in
Python. I guess using a name such as “aioXYZ” is already indication
enough that the API supports asyncio.

So you don’t think the driver needs to at least say something about whether connections or cursors can be shared between tasks?

Note that most database libraries are implemented in C, with the
communication with the database happening at the C level, outside
the reach of Python and often using threads underneath, so an
asyncio driver would not be able to pass control back to an event
loop.

Such drivers would obviously not be candidates to implement this hypothetical standard 🙂

Some C APIs offer async support, e.g. the ODBC standard provides
support for it, so that may help, provided the underlying ODBC
drivers implement this.

Yeah, I mentioned aioodbc in the initial post.

The idea may work for wire protocol drivers entirely written in
Python. I guess using a name such as “aioXYZ” is already indication
enough that the API supports asyncio.

So as not to limit the implementation to one specific async framework, I’m thinking of adding a module attribute that specifies the async framework(s) that it is compatible with.
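For illustration, such an attribute might look like this. The attribute names below are invented for the example; nothing here is part of PEP 249 or any other spec:

```python
# Hypothetical module-level attributes an async driver could expose.
apilevel = "2.0"
tasksafety = 1
async_frameworks = ("asyncio", "trio")  # frameworks this driver supports


def supports(framework: str) -> bool:
    """Let callers probe compatibility before wiring up framework-specific code."""
    return framework in async_frameworks


print(supports("asyncio"), supports("curio"))  # True False
```

A tuple rather than a single string keeps the door open for drivers built on framework-agnostic layers that can serve several event loop implementations.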

To be clear: aioodbc uses asyncio’s thread pool (via run_in_executor()), not ODBC’s async mode.

agronholm (Alex Grönholm, August 28):

malemburg:

We could add a module level flag which identifies a DB-API compatible
module as asyncio compatible. I don’t think the API itself needs to
be changed in any way, though.

So you don’t think the driver needs to at least say something about
whether connections or cursors can be shared between tasks?

Sharing connections or cursors between tasks or threads is usually
not a good idea. The reason is that connections (and cursors as a
result) have state in form of transactions and sharing those between
asynchronously running parts of an application can be dangerous
if not done right.

Even if a driver states being threadsafe, I would not trust it beyond
perhaps sharing connections which are explicitly managed in a connection
pool.

…

The idea may work for wire protocol drivers entirely written in
Python. I guess using a name such as “aioXYZ” is already indication
enough that the API supports asyncio.

So as not to limit the implementation to one specific async framework,
I’m thinking of adding a module attribute that specifies the async
framework(s) that it is compatible with.

That would work as well.

The DB API provides a base standard for database interaction and it’s
open to and encourages extensions. With asyncio being available in
core Python now, we can think of extensions which may be useful in
that world.

Such discussions should happen on the DB-SIG ML, though.

Note that DB API modules usually release the GIL, so threaded
database access works well. As a result, database applications
typically don’t need asyncio to scale up, but of course having the
added option is certainly nice.

Yes, I agree. I guess it would be fine to assume a threadsafety (or tasksafety, if you will) level of 1, meaning the module can be shared but connections cannot.

Scalability is not the point of async database connections. The ability to cancel a running database operation is. In an async app you just need to cancel the task and whatever was blocking there is then automatically cancelled – you don’t need to know the details. The same semantics are impossible with synchronous code.
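A minimal asyncio sketch of that point: the caller cancels the task without knowing any driver details, and the driver coroutine (a stand-in here, all names made up) gets a chance to react:

```python
import asyncio


async def run_query():
    # Stand-in for a driver coroutine blocked on the server's reply.
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # A real driver would notify the server or close the connection here.
        print("query cancelled, cleaning up")
        raise


async def main():
    task = asyncio.create_task(run_query())
    await asyncio.sleep(0.01)
    task.cancel()  # the caller needs no driver-specific knowledge
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"


print(asyncio.run(main()))  # cancelled
```

With synchronous drivers there is no equivalent single point where the blocking call can be interrupted and unwound like this.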


Cancelling a running database operation will typically require informing
the database server about the cancellation. How would that work in an
async setting?

With thread-based connections you would cancel an operation via a
second thread by letting the server know, and the server would cancel
the operation running on the other thread. The other thread’s call
would then return with an error.

When the running operation receives a cancellation exception, it has a couple of options: 1) close the connection immediately (which results in an automatic rollback of the transaction), or 2) initiate a shielded block with a short timeout which informs the server of the cancellation and waits for a response. If the timeout is reached, the connection is closed automatically.
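A sketch of option 2 with option 1 as the fallback, using asyncio.shield() plus a timeout. All driver-side names here are made up; a sleep stands in for the server round trips:

```python
import asyncio

CANCEL_TIMEOUT = 1.0  # illustrative; how long to wait for the server's ack


async def _send_cancel_request():
    # A real driver would tell the server to cancel the running query here.
    await asyncio.sleep(0)


async def execute_query(close_connection):
    try:
        await asyncio.sleep(3600)  # stand-in for awaiting the server's reply
    except asyncio.CancelledError:
        try:
            # Option 2: shield the cleanup from the cancellation,
            # bounded by a short timeout.
            await asyncio.wait_for(asyncio.shield(_send_cancel_request()),
                                   CANCEL_TIMEOUT)
        except asyncio.TimeoutError:
            # Option 1 as fallback: close the connection, which implies
            # a rollback on the server side.
            close_connection()
        raise


async def main():
    closed = []
    task = asyncio.create_task(execute_query(lambda: closed.append(True)))
    await asyncio.sleep(0.01)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    # The cancel request completed in time, so the connection stayed open.
    return closed


print(asyncio.run(main()))  # []
```

The shield matters: without it, the cancellation delivered to the task would also interrupt the cleanup coroutine before it could notify the server.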

Sharing connections or cursors between tasks or threads is usually
not a good idea. The reason is that connections (and cursors as a
result) have state in form of transactions and sharing those between
asynchronously running parts of an application can be dangerous
if not done right.

Yes. This is why async context managers were invented. Bingo: while you’re in the context of async with db.transaction() as t, you’re in a transaction. Cancelling the transaction, or indeed any exception that causes the context to end, should tell the database to abort. Using t after the context ends, or leaving it with pending commands, should obviously result in an error. And suddenly, using a transaction from multiple tasks is no problem whatsoever.

Likewise, async with t.select(…) as sel: async for row in sel: await process(row) will tell the server to abort the select when cancelled (or otherwise exited).

Yes, all of this needs work when the database doesn’t support multiple transactions on a connection (well, that’s easy: just open another connection behind the scenes, and don’t forget to re-send prepared statements) or cursors, i.e. interleaving a select with other commands (save the result on the client, and require the app to provide a limit on the number of records when that happens). Well, even MySQL supports cursors these days …
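A minimal sketch of such a transaction context manager. The names are hypothetical and a log list stands in for a real server; the point is that commit/rollback is decided by how the context exits:

```python
import asyncio


class Transaction:
    def __init__(self, conn):
        self._conn = conn
        self.finished = False

    async def command(self, sql):
        if self.finished:
            # Using the transaction after the context ends is an error.
            raise RuntimeError("transaction already ended")
        await asyncio.sleep(0)  # a real driver would execute sql here
        self._conn.log.append(sql)

    async def __aenter__(self):
        await self._conn._execute("BEGIN")
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.finished = True
        # Any exception (including cancellation) aborts the transaction.
        await self._conn._execute("ROLLBACK" if exc_type else "COMMIT")
        return False


class Connection:
    def __init__(self):
        self.log = []

    async def _execute(self, sql):
        await asyncio.sleep(0)
        self.log.append(sql)

    def transaction(self):
        return Transaction(self)


async def main():
    db = Connection()
    async with db.transaction() as t:
        await t.command("INSERT INTO foo (bar) VALUES (123)")
    try:
        async with db.transaction() as t:
            raise ValueError("boom")
    except ValueError:
        pass
    return db.log


print(asyncio.run(main()))
# ['BEGIN', 'INSERT INTO foo (bar) VALUES (123)', 'COMMIT', 'BEGIN', 'ROLLBACK']
```

Because __aexit__ runs on cancellation too, a cancelled task aborts its transaction through exactly the same path as any other exception.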


In the case of MySQL, there is no safe way to cancel a query, especially when the database is load balanced. So cancelling the query on the server side must be optional behavior, not required.

Additionally, the DB-API’s Connection object abstracts one (logical) connection. I don’t like the idea of multiple physical connections behind one logical connection. In a production-quality app, the number of physical connections must be controlled by the application.

I prefer a design that clearly separates “connection pool” and “connection”, like Go’s database/sql and database/sql/driver.


I don’t think anybody suggested this to be a requirement. If the database server doesn’t natively support the cancellation of queries, then simply closing the connection works too.

Reading through PEP 249, it is not obvious to me where it says that a Connection object can have many underlying physical connections. Would you elaborate on this please?

I hope so. But @smurfix said “should tell the database to abort”.

For people who read RFCs, terms like “should”, “must”, or “may” have special meaning.
“should tell the database to abort” looks like it’s a requirement.

I never said PEP 249 says so. I replied to @smurfix’s reply.

If I read it correctly, he is suggesting turning the Connection into a connection pool when the DB doesn’t support interleaving multiple transactions/statements on one physical connection.

“should” means that if the database supports this then the driver should use it, but it’s not a hard requirement.

That confused me as well – maybe he meant nested transactions (savepoints)?

Anyway this discussion seems to be veering off course. What is your stance – would you prefer just an async conversion of PEP 249 or a DBAPI v3 specification with async support?

I have no preference about it. I am the maintainer of mysqlclient-python, PyMySQL, and go-sql-driver/mysql.
But I use only MySQL, and I don’t use MySQL’s complex features (nested transactions, multiple cursors, etc.). I can’t design or maintain a complex API set for complex features.

My preferences are:

  1. Clear separation between the connection pool and the connection. Connections should provide enough APIs for a connection pool, but there is no need to cover connection pools in the DB-API.

  2. I think a standardized autocommit option should be added.

  3. Autocommit by default seems better for async applications. Transactions would be used only when explicitly required.

Some explanation about (2) and (3). In Go, the connection pool (sql.DB) has methods like Exec, Query, etc.
For result, err := pool.Exec("INSERT INTO ... "):

  1. The pool checks out one connection and executes the query in autocommit mode.
  2. When the query is finished, the connection is returned to the pool.

This is a very convenient API. If autocommit were not standardized, the pool would need to execute a COMMIT after every query, doubling the number of queries per operation. That’s why autocommit is important for a connection pool.
And an efficient connection pool is very important for async applications.
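Translated to Python, the Go-style flow could be sketched like this. All names are invented, and a fake connection stands in for a real driver running in autocommit mode:

```python
import asyncio


class FakeConnection:
    """Stand-in for a driver connection operating in autocommit mode."""

    def __init__(self):
        self.queries = []

    async def execute(self, sql):
        await asyncio.sleep(0)  # a real driver would hit the network here
        self.queries.append(sql)
        return "ok"


class Pool:
    """Hypothetical pool mirroring Go's sql.DB convenience methods."""

    def __init__(self, connections):
        self._free = list(connections)

    async def execute(self, sql):
        conn = self._free.pop()  # 1. check out a connection
        try:
            # Autocommit means no extra COMMIT round trip is needed here.
            return await conn.execute(sql)
        finally:
            self._free.append(conn)  # 2. return it to the pool


async def main():
    conn = FakeConnection()
    pool = Pool([conn])
    result = await pool.execute("INSERT INTO t VALUES (1)")
    return result, conn.queries

print(asyncio.run(main()))  # ('ok', ['INSERT INTO t VALUES (1)'])
```

Note the single query in the log: with standardized autocommit, one statement costs one round trip instead of two.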

I think you need to reconsider:

Punting on transactional database connections would make an
asynchronous DB API useless for any production quality application
that has to write to the database.

The main advantage of databases over other storage mechanisms is
that they try very hard to maintain integrity. When running in
auto-commit mode, you lose this important feature, which can then
easily result in data corruption.

This is the reason why the DB API defaults to transactional
connections and only offers switching to auto-commit when
explicitly requested.

BTW: The connection pool design you are mentioning is much too fine
grained. Since connections encapsulate transactions, you typically
get a connection for a whole set of queries that make up a transaction,
not just one. In web applications, this would be serving a single
request, in accounting, processing all transfers which make up a
financial transaction, etc.

Yes, well, requirement in the sense that if it’s supportable in a reasonable way then do it, but if not, well, that’s life. There’s a difference between SHOULD and MUST …

Yes, of course a Connection and a ConnectionPool are different objects. I am not arguing for conflating them.

On the other hand: how much boilerplate do you require the user of an async interface to write? If the database is unable to accept a kill command while it’s flooding you with a runaway SELECT, then of course await result.kill() (or simply falling off the end of the SELECT’s async context while there are unread records) needs to grab another connection from the pool to send the appropriate message. We’re not going to force the user to manually grab a new connection from the pool for that, especially as another database might forbid doing that from another connection.

async with pool.connect() as conn:
    async with conn.transaction() as t:
        await t.command("insert into foo (bar) values(123)")
        async with t.select("select * from bar") as curs:
            async for result in curs:
                print(result['bar'])

is all well and good but in practice this should be sufficient:

async with pool.transaction() as t:
    await t.command("insert into foo (bar) values(123)")
    async for result in t.select("select * from bar"):
        print(result['bar'])

(assuming that your code doesn’t escape the loop – while the transaction/pool-giveback cleanup can handle that case, it’s still sub-optimal).

Convenience methods like pool.command should of course be autocommitting.
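One way the pool.transaction() convenience could be put together — a sketch with made-up names built on contextlib.asynccontextmanager, not any existing driver’s API:

```python
import asyncio
from contextlib import asynccontextmanager


class FakeConnection:
    """Stand-in for a driver connection; logs SQL instead of sending it."""

    def __init__(self):
        self.log = []

    async def execute(self, sql):
        await asyncio.sleep(0)
        self.log.append(sql)


class Pool:
    def __init__(self, connections):
        self._free = list(connections)

    @asynccontextmanager
    async def transaction(self):
        conn = self._free.pop()  # check out a connection for the whole block
        try:
            await conn.execute("BEGIN")
            try:
                yield conn
            except BaseException:
                await conn.execute("ROLLBACK")
                raise
            else:
                await conn.execute("COMMIT")
        finally:
            self._free.append(conn)  # always give the connection back


async def main():
    conn = FakeConnection()
    pool = Pool([conn])
    async with pool.transaction() as t:
        await t.execute("insert into foo (bar) values (123)")
    return conn.log, len(pool._free)


print(asyncio.run(main()))
# (['BEGIN', 'insert into foo (bar) values (123)', 'COMMIT'], 1)
```

The checkout/give-back pairing lives entirely inside the context manager, so the application never handles a bare connection, matching the view that a single connection need not be exposed at all.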

A database driver that needs a separate connection in order to abort a runaway SELECT requires access to the pool.

I agree that the individual database driver should be pool-agnostic (except for acquiring another connection if required) but as to the application’s view of the DB API, I see no reason why a single database connection should even be exposed to it.

You misunderstand. Nobody is asking for that. We only want all commands to be autocommitting outside async with db.transaction() as t blocks.

This implies that a typical database-using application will not issue any explicit BEGIN, COMMIT or ROLLBACK statements. This is a feature. Also, when a connection is within a transaction, conn.command or conn.select should simply raise an error – you want to affect the data within a transaction, so use the transaction object.
