tl;dr: In pandas, we’ve updated the __module__ attribute of public classes to point to the public module they can be imported from. E.g. although DataFrame is defined in pandas.core.frame, its __module__ is simply pandas. This helps users see where they should import these classes from, both in a REPL and in our docs.
However, when doing this, we didn’t realize that methods defined on those classes would still have the original __module__. This isn’t much of a concern, except that when a class’s __module__ disagrees with that of its methods, doctests for those methods do not run. We have a fix for this, but it’s an ugly hack (DOC: Run all doctests by rhshadrach, Pull Request #62988, pandas-dev/pandas on GitHub).
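A minimal sketch of the mismatch described above. The class Frame and the module name publicpkg are hypothetical stand-ins, not pandas code: reassigning the class’s __module__ leaves the __module__ of a method defined with def untouched.

```python
class Frame:
    """Hypothetical stand-in for a public class such as DataFrame."""

    def head(self):
        """Return a placeholder value."""
        return "head"


# Functions record their defining module when they are created.
original_method_module = Frame.head.__module__

# Point the class at a (hypothetical) public module name.
Frame.__module__ = "publicpkg"

class_module = Frame.__module__
method_module = Frame.head.__module__

# The class reports the new name; the method still reports the old one.
print(class_module, method_module)
```

The two values now disagree, which is the situation that tools comparing a member’s __module__ with that of its class run into.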
It seems a bit unexpected to me that setting a class’s __module__ does not carry down to the methods defined underneath (and only methods defined via def …(…):). I can’t imagine a situation where that would be desirable. In addition, this behavior also impacts dataclasses:
I think it’s reasonable to say pandas shouldn’t be modifying __module__ in the first place, but my current opinion is that the pros of doing so outweigh the (currently known) cons.
I don’t think this can work for methods in general as many different classes may share the same object as a method. Your add_one example uses a single lambda, but that could easily be a function constructed elsewhere and shared between classes.
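To illustrate the sharing problem, here is a small sketch with made-up names (add_one, Left, Right and both module strings are hypothetical). One function object is attached to two classes whose __module__ values differ, so there is no single class-derived value it could take.

```python
def add_one(self):
    """Shared helper, defined once and attached to several classes."""
    return self.value + 1


class Left:
    value = 1
    increment = add_one


class Right:
    value = 10
    increment = add_one


# Give the two classes different (hypothetical) public modules.
Left.__module__ = "left_public"
Right.__module__ = "right_public"

# Both classes hold the very same function object ...
same_object = Left.__dict__["increment"] is Right.__dict__["increment"]

# ... so it has a single __module__, which cannot match both classes.
print(same_object, Left.__module__, Right.__module__)
```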
For a concrete example, all dataclasses share the same __replace__ function (since Python 3.13) as dataclasses._replace is simply attached to this name. Changing its __module__ if a class __module__ changed would change it for all dataclasses.
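This can be checked with a short sketch, assuming Python 3.13 or later, where dataclasses define __replace__. The classes Point and Name are hypothetical, and on older versions the check is skipped.

```python
import dataclasses
import sys


@dataclasses.dataclass
class Point:
    x: int


@dataclasses.dataclass
class Name:
    text: str


if sys.version_info >= (3, 13):
    # Look in each class's own namespace to get the raw function objects.
    replace_a = Point.__dict__["__replace__"]
    replace_b = Name.__dict__["__replace__"]
    shares_replace = replace_a is replace_b
    replace_module = replace_a.__module__
else:
    # __replace__ is not defined on dataclasses before 3.13.
    shares_replace = None
    replace_module = None

print(shares_replace, replace_module)
```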
Ah, that’s too bad. That’s exactly the kind of “lie” that can break analysis tools using introspection. For example, in Griffe, we rely on the __module__ attribute to know where an object comes from, in order to know whether it was imported in the currently inspected module, or defined within it. This info in turn helps us understand whether an object should be considered public or not. It also lets us avoid scanning the same objects twice or more, and avoid infinite loops or recursion.
This is well illustrated by the issue you mention, where (I suppose) doctest uses __module__ to know whether a member of a class was actually defined within the current module, or elsewhere, in which case it shouldn’t execute tests because they will be found and executed in this other module (and we don’t want to run things twice or more). But I just glanced at the issue and code, so I might be wrong, of course.
In any case, I totally understand how updating __module__ is an improvement in the console: I do agree that it’s much better to show pandas.DataFrame rather than pandas.core.frame.DataFrame. As always, there’s something lacking in Python itself for properly declaring, managing and communicating public APIs.
As for the actual idea/question here: I don’t think members of a class should automatically carry the same __module__ as the class they are part of, exactly because a member could have been imported from elsewhere, or assigned to an object imported from elsewhere, etc., and analysis tools need this info to be correct.
By the way, will inspect.getsource(DataFrame) work now that you changed the __module__ value? EDIT: no, it now fails with OSError: source code not available.
Thanks for the replies @DavidCEllis and @pawamoy. I was intending this to be only those defined on a class via the standard way (with def …(…):), but did not put that detail in the OP. I’ve made an edit there.
To @pawamoy’s point, it seems to me there is value in having both the place where the class is defined (__module__) and where it is publicly available (__public_module__ perhaps). This way tools can use whichever is appropriate for their situation (e.g. __module__ for linking to the source, and __public_module__ for user-facing docs). I need to do some digging to see if this has been considered in the ecosystem.
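A rough sketch of how such an attribute might be used. To be clear, __public_module__ is not an existing Python attribute, and the helper names public_in and display_name, the class Frame and the module name mypkg are all made up for illustration.

```python
def public_in(module_name):
    """Hypothetical decorator: record the public import location."""
    def decorate(cls):
        # __module__ is left untouched, so it still names the defining module.
        cls.__public_module__ = module_name
        return cls
    return decorate


@public_in("mypkg")
class Frame:
    def head(self):
        return "head"


def display_name(obj):
    """User-facing name: prefer the public module when one is declared."""
    module = getattr(obj, "__public_module__", None) or obj.__module__
    return f"{module}.{obj.__qualname__}"


print(display_name(Frame))
```

Under this scheme, introspection tools would keep reading __module__ for the defining module, while docs or a REPL could opt in to the public name.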
I would be very surprised if setting __module__ to something else doesn’t break some of your existing users, and I personally wouldn’t risk it over a slightly better output in the console.
This gets assumed to be correct[1] when not None by a lot of existing tools.
[1] I do actually mean correct, and not just an opinionated statement about what the correct use of this is: __module__ has a specified meaning that actually says it is where an object was defined, not just where it is accessible from.