Deprecating sqlite object reinitialization

encukou · September 8, 2021, 9:25am

Hi,
The sqlite3 module allows you to call Connection.__init__ to reinitialize an existing connection – basically, open a new database using the same connection object.
This is full of corner cases that break, for example, you can currently get a segfault with:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE foo (bar)')

# Attempt to open a new database
try:
    conn.__init__('/bad-file/')
except sqlite3.OperationalError:
    pass

# Attempt to use a destroyed database connection
conn.execute('INSERT INTO foo (bar) VALUES (1), (2), (3), (4)')

If anyone has a use case for __init__ on connection objects (or on cursors, etc), rather than creating a new object, I’d love to hear it. But I doubt there is one.
Due to the possibility of bugs in edge cases (and ongoing maintenance burden of keeping the edge cases in mind), we’d like to deprecate re-initialization of objects from sqlite3 in Python 3.11 and remove it in Python 3.13.

@erlendaasland is working on fixing this (see bpo-45126) – but honestly, I’d rather document it as unpredictable/dangerous and let it go away in 2 years than spend time reviewing the fixes.

More generally, the tp_new docs say:

The tp_new function should call subtype->tp_alloc(subtype, nitems) to allocate space for the object, and then do only as much further initialization as is absolutely necessary. Initialization that can safely be ignored or repeated should be placed in the tp_init handler. A good rule of thumb is that for immutable types, all initialization should take place in tp_new , while for mutable types, most initialization should be deferred to tp_init .

I believe replacing “immutable types” → “immutable types and state managed in C” and “mutable types” → “properties mutable from Python code” would make a better rule of thumb, but I’m having trouble wording it concisely.

steve.dower · September 8, 2021, 9:21pm

I’m inclined to discourage it, and if we’ve ever suggested that this is a good use, deprecate that suggestion.

But I think the general concept of being able to re-call __init__ should work, and we should fix the segfault. We’ve just gone through a big effort to be able to reinitialise modules, so I don’t see why we’d suddenly decide that instances are now special in the opposite direction.

I still don’t think we should encourage it as a coding pattern

encukou · September 9, 2021, 7:10am

Wasn’t that an effort to be able to make additional module objects? Or are we talking about different efforts?
Reinitializing a module – tearing down the C-level state and handling existing objects that depend on it, and creating a fresh one – is still not usually possible, I hope.

storchaka · September 9, 2021, 7:32am

I fixed a lot of similar issues with builtin types. But there are more left. There are two issues:

__init__ should clear previous state to avoid leaks. Instead simple assignment you should to use Py_XSETREF. In some cases it is not enough, you should to save the previous state, then set a new state, and only after that decref previous state and release resources, to avoid the object been in inconsistent state during calling __del__ in referenced objects.
If __init__ is not called, every method and property should check whther the object was initialized. It makes the code larger and slower.

If the type is not designed for subclassing, the simplest solution is to move initialization from __init__ to __new__. If it is not subclassable it can be done without deprecation.

The Connection class is not internal, it is mentioned several times in the documentation, and its methods and attributes are documented. Its constructor is not documented, and the official way of creating the Connection object is sqlite3.connect(), but it has the factory parameter, and I I think it means that Connection can be subclassed.

erlendaasland · September 9, 2021, 7:41am

@storchaka, my first PR against bpo-45126 closed the connection and freed all objects on reinitialisation:

github.com/python/cpython

bpo-45126: Harden `sqlite3` connection initialisation

python:main ← erlend-aasland:sqlite-cleanup-init

opened 09:21PM - 07 Sep 21 UTC

erlend-aasland

+110 -62

- make sure we always start `__init__` with a known connection state - make sur…e we always leave the connection in a consistent state - group member initialisation Resolves #89289  https://bugs.python.org/issue45126

I fixed a lot of init issues in bpo-31746.

Yes, that is the intention.

encukou · September 9, 2021, 7:47am

Also,

Every code path must leave the object in a consistent state
Every object that depends on the state (e.g. Cursor using its self->connection->db) must be prepared for the state changing at any time. (sqlite3 actually does this quite well, but it’s not obvious that future maintainers/reviewers need to think about this.)

The Connection class can be subclassed, but the C-level state can’t be manipulated from Python. You can’t override the DB connection by subclassing. You can skip calling __init__, but that makes the object entirely unusable.

erlendaasland · September 9, 2021, 7:49am

We have the check_cursor() and pysqlite_check_connection() support functions for handling most (all?) cases.

steve.dower · September 9, 2021, 4:03pm

I’m thinking of the per-interpreter module state, which on one hand is clearly a different motivation, but the implication is that in two-stage initialization, the second stage should be safe to repeat. It’s weird to have a “in this kind of two-stage initializations, semantics are X, and in this two-stage initialization, semantics are Y”. (And I can tell you that nobody knows all the rationale behind the design - doesn’t even matter which particular design we’re talking about, the rationale gets lost pretty quickly )

Without a really good reason to have them behave differently, we should keep them consistent so that both are easier to understand.

All that just speaks to not changing the definition of __init__ to suit this case. If the best way to fix the issue here is to forbid calling __init__ a second time, then it can easily detect that and raise its own exception, though it sounds like we have better fixes here too.

erlendaasland · September 9, 2021, 8:02pm

I believe it is fully possible to verify that reinitialisation will work fine for sqlite3.Connection (and sqlite3.Cursor). There’s only a handful of methods in sqlite3.Cursor, and a dozen or so (from the top of my head) methods in sqlite3.Connection to audit.

IIRC, this is needed (given that connection __init__ is cleaned up a la GH-28227):

all connection and cursor methods must check that they operate on a fully initialised object
all connection and cursor methods must fail gracefully when the database pointer is NULL
all cursor methods must fail gracefully when the statement pointer is NULL

IIRC, most of these conditions are already in place.

+1

I totally agree

encukou · September 10, 2021, 8:19am

Learning partly from the mistakes in multi-stage initialization for modules, I propose this rule of thumb:

Anything that can be done from Python code is (by definition) safe* to repeat; and (assuming it can be undone from Python, which isn’t a big stretch) safe to skip. That should be done in __init__.
C-level state should be initialized in __new__, since there’s no value in the ability to repeat or skip it; and code that keeps skip/reinit in mind is tedious to write, tedious to test, and dangerous if forgotten.

* (that’s C-level “safe” – “doesn’t have UB”, not “throws exceptions or behaves weird”)

Modules are complicated because the __main__ module object exists before Python figures out what to run. To allow that, there’s the ability to initialize a pre-existing module object. In hindsight, I think I should have thought a lot more about how to remove this design constraint, rather than work with it.

encukou · September 10, 2021, 8:22am

To be clear, I’m not saying reinit isn’t possible, or that sqlite3 does things wrong. I’m thinking about general guidance for extension authors, and designing future APIs to make it easier to do the right thing.

erlendaasland · September 10, 2021, 8:49am

I’d be happy to move sqlite3 connection and cursor initialisation from __init__ to __new__, but that is a backwards incompatible change. I don’t know if connection and cursor factories are used much in the wild, though.

malemburg · September 10, 2021, 9:19am

[encukou] Petr Viktorin https://discuss.python.org/u/encukou encukou CPython
core developer
September 10

Learning partly from the mistakes in multi-stage initialization for modules, I
propose this rule of thumb:

Anything that can be done from Python code is (by definition) safe* to
repeat; and (assuming it can be undone from Python, which isnt a big
stretch) safe to skip. That should be done in |init|.

C-level state should be initialized in |new|, since theres no value in
the ability to repeat or skip it; and code that keeps skip/reinit in mind is
tedious to write, tedious to test, and dangerous if forgotten.

I don’t think that’s such a good plan.

.__new__() is meant for allocating the needed object storage, not for
initializing external resources (that’s what .__init__() is for):

(thats C-level safe doesnt have UB, not throws exceptions or behaves
weird)

On the topic itself: I don’t remember ever having written code which
calls .__init__() again to reinit an object.

When there was a need
to reinit an object, I always implemented a separate method to call
for this purpose. The reason being that such a reinit method would
not necessarily take the same arguments that .__init__() does, but
instead work based on the already configured state and only
change parts of the state, e.g. to reconnect to a database when
the connection was lost.

So instead of pushing more code into .__new__(), it’s better to
clearly state that calling .__init__() again is not a supported
or recommended feature of Python, IMO.

encukou · September 13, 2021, 9:22am

I’m aware of the status quo, but I’m looking for the reasoning behind it, because I’d like to change it. It seems there’s a reason that I’m missing.

Declaring something risky/unsupported works for Python code. But this is C, where we need to actually prevent undefined behavior if possible.

malemburg · September 13, 2021, 11:32am

[encukou] Petr Viktorin https://discuss.python.org/u/encukou encukou CPython
core developer
September 13
malemburg:

I dont think thats such a good plan.

>.__new__()| is meant for allocating the needed object storage, not for
initializing external resources (thats what |.__init__()| is for):
Im aware of the status quo, but Im looking for the reasoning behind it,
because Id like to change it. It seems theres a reason that Im missing
https://en.wikipedia.org/wiki/G._K._Chesterton#Chesterton’s_fence.

tp_new and __new__ were introduced with the new-style classes in
Python 2.2.

See Guido’s essay and PEP 253 for details:

in particular:

The main reason for having two different methods is
to be able to separate setup and allocation of the object from
initializing its state.

E.g. pickle will create an object using __new__ and then load the state
using __setstate__ (or whatever the __reduce__ protocol defines),
without calling __init__. Sub-classes may want to initialize things
differently, or based on different parameter sets, than the base class
while still using the same allocation mechanism, etc.

See pickle — Python object serialization — Python 3.12.1 documentation
for details on pickling.

malemburg:

So instead of pushing more code into |.__new__()| , its better to
clearly state that calling |.__init__()| again is not a supported
or recommended feature of Python, IMO.
Declaring something risky/unsupported works for Python code. But this is C,
where we need to actually /prevent/ undefined behavior if possible.

I don’t think that’s the case. The only parts where you need to prevent
certain things is where assumptions are made about e.g. input types,
flag values, NULL values, etc. Those need to be checked, since we want
to make sure that Python doesn’t segfault.

However, higher level cases, where the extension implementation design
comes into play, are no longer the responsibility of the Python
interpreter. Here, we apply the consenting adults principle and trust
that developers will read the docs and do the right thing - or, if not,
they get to keep the pieces

encukou · October 12, 2021, 2:02pm

But this is about low-level classes – these are written in C and currently need an “__init__ was run” check at the start of each method. Without running the default __init__ they are completely unusable.

The pickle example doesn’t sound convincing: the default __setstate__ can’t set the C-level state, so if pickling needs to be supported, there’s no downside to doing it through __getnewargs__ and __new__.

malemburg · October 12, 2021, 2:42pm

They should not need a check of the sort “__init__ was run”. Instead, they should probably check whether the connection/cursor is still open and usable. That’s a semantic check, not an API level one.

I’m not sure I follow your comment about the pickle example. You wanted to learn about the background of why we have __new__ and __init__ and I pointed you to the reasoning. In simply terms, __new__ is all about allocation, whereas __init__ is all about preparing the object to be used. pickle is an example of where this separation is being used.

From the above PEP 253:

The tp_new() slot should ensure only the most essential invariants, without which the C code that implements the objects would break. The tp_init() slot should be used for overridable user-specific initializations.

For database module connection objects, as an example, this means: tp_new needs to allocate the data structures needed to interface to the database client library (and possibly init this client library), whereas tp_init actually tries to open a connection to the database.

erlendaasland · October 12, 2021, 5:58pm

Sure, that is sufficient, and we could remove the “initialised” member in favour of such a check. However, we still need to check the database pointer at the start of each method, so I’m not sure it really matters. It would, of course, simplify the code, so I would be in favour of such a change, but isn’t this a slight digression?

steve.dower · October 18, 2021, 10:42am

Perhaps, but we all basically agreed back here on the sqlite connection object. Everything since then has been a digression for whether/how __new__/__init__ should interact, and MAL’s last example was just that - an example.

erlendaasland · October 18, 2021, 11:11am

I guess we did I’ll revisit my PR then.