External typing systems or typing subsystems

Description

PEP 483 describes the theory of type systems. The germane part is subtype relations. It declares the second (strong) type criterion: “every function from first_type is also in the set of functions of second_type”. This seems to be the reason why NewType requires “a proper class (i.e., not a type construct like Union, etc.)”. What I am proposing is some way to loosen the criterion to “every function within an important subset of functions from first_type is also in the set of functions of second_type”.

Example

Would it be possible to use some other annotation information to say that an object belongs to a type in a different typing system? e.g.

import typing
from typing import NewType

from numpy.typing import NDArray
from scipy.sparse import sparray

Array = typing.ExternalType("Array", NDArray, sparray, *other_classes)

CovarianceMatrix = NewType("CovarianceMatrix", Array)

arr: typing.External[NDArray, CovarianceMatrix]

Here, typing.ExternalType declares a name to be the root of a separate type system, as well as the Python types that are allowed to take on this type. Equivalently, this declares the limited subset of functions as the intersection of legal functions on all arguments after the name. NewType, Union, etc. can then declare arbitrary additional types. The annotation typing.External takes a parameter that can restrict the Python type, and then any number of types that live in independent ExternalType graphs.

Use Case

I would like to be able to use mathematical types to prevent avoidable errors in numerical algorithms. Consider the following:

def z_score(x: float, mean: float, stdev: float) -> float:
    return (x - mean) / stdev

def hyperparam_search(data) -> float:
    """Calculate the optimal variance of the training data"""
    ...

def fit_something(data, x: float, mean: float) -> float:
    var = hyperparam_search(data)
    return z_score(x, mean, var)  # should be z_score(x, mean, sqrt(var))

This seems like a clear case of “situations where a programmer might want to avoid logical errors by creating simple classes” by using NewType. Properly annotating these functions with Variance = NewType("Variance", float) and StdDeviation = NewType("StdDeviation", float) would show the problem.
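
For concreteness, a minimal sketch of how those annotations would catch the bug (the parameter shapes here are just illustrative):

from math import sqrt
from typing import NewType

Variance = NewType("Variance", float)
StdDeviation = NewType("StdDeviation", float)

def z_score(x: float, mean: float, stdev: StdDeviation) -> float:
    return (x - mean) / stdev

def hyperparam_search(data) -> Variance:
    """Calculate the optimal variance of the training data"""
    ...

def fit_something(data, x: float, mean: float) -> float:
    var = hyperparam_search(data)
    # a static checker flags this: "Variance" is not "StdDeviation"
    return z_score(x, mean, var)
    # correct: return z_score(x, mean, StdDeviation(sqrt(var)))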

Now consider that because of duck typing, my functions work on float | np.float64. However, I can’t create Variance and StdDeviation on a union type. In fact, this was the exact problem I ran into. I had built an equation using variance because of convexity rules, but a reviewer asked to see plots in standard deviation because the units were more understandable. When I modified my code, I introduced the error.

More generally, this problem manifests a lot when working on difficult math, because testing a function requires proving the proper invariants and identifying the algorithm's condition number, which are both hard. At the same time, a lot of code can utilize sparse matrices or other types, which are faster, via duck typing without any modification. Off the top of my head, I can think of a dozen types of sparse or dense matrices that I use on any given day: inequality constraints, positive definite matrices, positive semidefinite matrices, covariance matrices, finite difference matrices, Toeplitz matrices, etc.

Alternative/Current Workaround

import numpy as np

class FakeFloat(np.float64, float):  # type: ignore # incompatible method definitions
    def __new__(cls, other):
        # never actually constructs anything; just hands back the wrapped value
        return other

class StDev(FakeFloat):
    pass

class Variance(FakeFloat):
    pass

def foo(a: Variance) -> None: pass

foo(StDev(3.0))  # error: incompatible type "StDev"; expected "Variance"

mypy will correctly point out that foo() takes a variance, not a standard deviation.

What I’d love to understand, even if this isn’t possible

I’m a bit out of my depth when it comes to this level of type theory, but

  • Does type theory (beyond Python) have an existing concept of “every function within an important subset of functions from first_type is also in the set of functions of second_type”? Sort of like how probability theory has a concept of conditional independence: X \perp\!\!\!\perp Y \mid Z
  • Where does the documentation specify how static analyzers should use things like __mro_entries__ and __supertype__?
  • How does this differ when handling Protocol and Union types?
  • Is there a way to see this interactively?

I have a better understanding of typing now, but I once posted related questions under Python Help, and once as a discussion in Pyright.

Seems like this question would be most appropriately directed to typing experts.

@moderators Could you move this thread to Typing?


Canonically, and without type: ignore hacks, this is doable with typing.cast. The caller of cast is responsible for ensuring that the types going in are appropriate, but that's also the case in your current workaround.

from typing import NewType, cast

import numpy as np

FloatVariance = NewType("FloatVariance", float)
F64Variance = NewType("F64Variance", np.float64)
Variance = FloatVariance | F64Variance

FloatStdDev = NewType("FloatStdDev", float)
F64StdDev = NewType("F64StdDev", np.float64)
StdDev = FloatStdDev | F64StdDev


def something_returning_a_variance() -> Variance:
    return cast(Variance, 1.0)


def something_using_a_std_dev(x: StdDev) -> None:
    ...

# Argument of type "Variance" cannot be assigned to parameter "x" of type "StdDev" in function "something_using_a_std_dev"
something_using_a_std_dev(something_returning_a_variance()) 

The resulting use is compatible with only the set of things that are possible on both floats and numpy float64s, but is distinct as a type in terms of checking that you are passing the right things around.
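
As a sketch of how a correct conversion between the two could then look (continuing the snippet above; std_dev_from_variance is just an illustrative name):

import math

def std_dev_from_variance(v: Variance) -> StdDev:
    # math.sqrt accepts float and np.float64 alike and returns a plain float,
    # so re-tagging the result as FloatStdDev satisfies the StdDev union.
    return FloatStdDev(math.sqrt(v))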

In order:

  1. Yes. In Python, this is expressible by checking a subtype relationship with a protocol or an abstract base class. It isn’t currently expressible in quite the same way with a “normal” concrete class. (This is structural subtyping; see the sketch after this list.)
  2. I’m not sure it does; I think it is currently just assumed that type checkers must know this to understand the language (__mro_entries__ is part of the data model).
  3. Protocols don’t use __mro_entries__ in any special way. The MRO of each member of a union needs to be considered when checking subtyping, but this also isn’t treated in a special way.
  4. typing.reveal_type maybe?
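
For example, here's a small sketch of expressing an “important subset of functions” as a Protocol and then poking at it with typing.reveal_type (SupportsSqrt is an invented name):

from typing import Protocol, reveal_type

class SupportsSqrt(Protocol):
    # the "important subset of functions" we care about here
    def __float__(self) -> float: ...

def root(x: SupportsSqrt) -> float:
    return float(x) ** 0.5

value = root(4.0)     # float satisfies the protocol structurally
reveal_type(value)    # a checker reports this as float; at runtime it prints the type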

Thank you for that detailed response! A lot I’ve got to look up and try out. I hadn’t read about cast before, and I’ll have to play with it to get a better sense. Your suggestion seems like a reasonable solution, with the drawback of needing to replicate equivalent subtype derivations for each base class.

In python, this [subtype based on just an important subset of functions] is expressable by checking a subtype relationship with a protocol or an abstract base class

Yeah, protocols/ABCs are definitely a way of explicitly describing the “important subset of functions”! Originally I had parts of this post written with Protocol! But then I hit on Union as a way of declaring the important subset of functions as an implicit structural subtyping. Faster and easier, but perhaps more dangerous in that, if used to verify the type safety of functions, it wouldn’t provide enough confidence from passing static analysis.

So I guess there are broadly two questions my proposal is asking: (1) if a user of Protocol is using it correctly, is there any additional risk in allowing NewType of a protocol? And (2) how bad is the risk of allowing implicit structural subtyping via NewType of a union?

this is currently just assumed that type checkers must know this to understand the language

That makes sense :slight_smile: I had thought __mro_entries__ was somehow the way python was preventing NewTypes of unions and protocols. I think I was conflating runtime and static concepts.

The problem I see with NewType on a protocol is a bit thorny: Protocols use structural subtyping, not nominal. NewType relies on the semantics of nominal subtyping (jargon below).

Mixing the two here is a bit hard, and might not be possible in any way that is more ergonomic than what is provided above. Given type intersections (a PEP being drafted as we speak), you could use a marker class here and retain the type information of what goes in while still marking it as a variance to check for formula errors.

import typing

import numpy as np

class Variance:
    pass

type AnyFloat = float | np.float64 | np.float128 | ...  # or a protocol :)

# "&" below is the hypothetical intersection syntax from the draft PEP
def takes_floats_gives_variance[T: AnyFloat](data: list[T]) -> T & Variance:
    variance = ...  # do the math here
    return typing.cast(T & Variance, variance)

def takes_variance[T: AnyFloat & Variance](variance: T):
    ...

sorry for the jargon, but we’re in the weeds here.

structural subtyping: something is a subtype if it satisfies a specified interface. This is strongly similar to duck typing, and would be the preferred way to type things if you want to be permissive and only state how you use things, not what things need to be.

nominal subtyping: something is a subtype because it either is that type or is a subclass of it. This is closer to traditional inheritance, but there are some additional typing semantics around it here and there, as well as some rules about the Liskov Substitution Principle that are more pragmatic than pure in most Python type checkers.
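
A tiny sketch of the difference (the class names here are invented for illustration):

from typing import Protocol

class HasArea(Protocol):
    # structural: anything with a matching "area" attribute is accepted
    @property
    def area(self) -> float: ...

class Shape:
    # nominal: only Shape and its subclasses are accepted
    pass

class Square(Shape):
    def __init__(self, side: float) -> None:
        self.side = side

    @property
    def area(self) -> float:
        return self.side * self.side

def takes_structural(x: HasArea) -> float:
    return x.area  # Square fits because of its shape, not its name

def takes_nominal(x: Shape) -> None:
    ...  # Square fits because it inherits from Shape

takes_structural(Square(2.0))
takes_nominal(Square(2.0))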


Thanks @mikeshardmind. I’ve read the PEPs a few times in the past. So while none of the above terms are 100% new to me, using them in discussing a real problem is new, and I really appreciate you taking the time to help me understand the weeds. Please do use the jargon :smile:

Type intersection seems useful, and I’m glad to hear people are working on it (just saw the proposal discussion)! The workaround using type intersections seems the best yet.

The problem I see with NewType on a protocol is a bit thorny: Protocols use structural subtyping, not nominal. NewType relies on the semantics of nominal subtyping (jargon below).

Mixing the two here is a bit hard …

Is the concern that NewType of a protocol wouldn’t be able to guarantee Liskov Substitution fully, and thus mixing structural and nominal subtyping would make it unclear what type safety guarantees a static analyzer should provide? I’ll scratch my head and see if I can come up with an example.

Not quite, but it’s directly along those lines about mixing structural and nominal subtyping. In this case, it’s a lack of specificity for the behavior of these two things interacting. mypy and pyright outright disallow creating a newtype from a protocol.

Static analysis subtyping checks follow two possibilities:

  1. If we’re checking that something is a subtype of a protocol, check that the methods and attributes defined in the protocol have matching definitions in what is being checked. This is the structural subtyping. LSP is not a consideration here.

  2. If we’re checking that something is a subtype of a non-protocol, check that it is directly of that type or has that type in its inheritance chain. This is the nominal subtyping. LSP is checked at the type’s definition and, while a consideration, it is not rechecked here.

It’s not immediately clear what NewType should do to create a drop-in replacement for a protocol that is distinct from that protocol. mypy and pyright each disallow it, but it is allowed at runtime. If it created a new type that was still a protocol, this would not result in a distinctly different allowed type; subtyping checks against both would use the same structural checks. If it was intended to synthesize a new non-protocol, then what should happen if that type is directly used at runtime? Protocol classes aren’t designed to be instantiated at runtime.

There could be some sort of persistent tagging of objects that happens purely in static analyses, but this would require more complexity in analysis closer to (and potentially exceeding that of) the level a full optimizing compiler would have because you’d need to build a complete directed graph to determine if an object ever becomes viewed as the version of the protocol that should be treated as incompatible with the original. I imagine any proposal to force that would be met with skepticism as to the value proposition being worthwhile, but there are various ways it could be done with some amount of extra work or some special casing here or there, and some might require less intensive work than the one I thought of first.


That is irrelevant. The runtime behavior here exists purely to support the static type checker feature. (This is different from most situations, where the runtime behavior preceded PEP 484 by many years and the static typing tries to model the runtime behavior – this is more like cast, a feature needed by static type checking for which runtime support had to be added.)


I noted that it’s allowed at runtime just as an observation, since not everything involving typing and protocols is allowed at runtime. The simplest example of this being:

>>> from typing import Protocol
>>>
>>> class X(Protocol): pass
...
>>> X()
Traceback (most recent call last):
  ...
TypeError: Protocols cannot be instantiated

Which will error at runtime, informing the user that this is not allowed. I figured it was worth pointing out since the discussion was about how this could be interpreted, and to be clear, I did not in any way intend to suggest that it being allowed at runtime means we should do anything else with it. (Though it might make sense for NewType to error at runtime if the type is a protocol, just as other runtime uses of protocols do, since this isn’t actually allowed in NewType by the type system.)


That makes sense; feel free to create a CPython issue (even better, also a PR).

I would assume that NewType of a Protocol would work at runtime the same as NewType currently does: as an instance whose __call__(other) just returns other. To continue your example:

from typing import NewType, Protocol

class X(Protocol):
    foo: int

class ImplementsX:
    def __init__(self): self.foo = 1

SpecialX = NewType("SpecialX", X)
SpecialX(ImplementsX()) # allowed at runtime; type checks fine
SpecialX(object()) # allowed at runtime; type check error: incompatible type "object"; expected "X"
X() # not allowed at runtime

There could be some sort of persistent tagging of objects that happens purely in static analyses, but this would require more complexity… to build a complete directed graph to determine if an object ever becomes viewed as the version of the protocol that should be treated as incompatible with the original.

In terms of “juice vs squeeze”: I suppose I’d have to get into the nitty-gritty of a static analyzer’s code to understand why all that would be necessary, or what it means for an object to be viewed as a version of something. But the juice is being able to quickly and safely refactor algebraic expressions.

E.g. In this “log-likelihood” or “loss” expression \alpha\|Ax - b\|^2_{\Sigma^{-1}}, \Sigma is a numpy array (or scipy sparse array) and \alpha is a float. But I might refactor to instead represent any of the following equivalent expressions:

\|Ax - b\|^2_{\alpha\Sigma^{-1}} = \|Ax - b\|^2_{(\frac{1}{\alpha}\Sigma)^{-1}} = \alpha\|L^T(Ax - b)\|^2 = \|\sqrt{\alpha} L^T(Ax - b)\|^2

I can of course refer to \Sigma, \alpha\Sigma, L^T, and \sqrt\alpha\Sigma with different names as I refactor. Each variant has a different mathematical meaning, after all, even if they are all the same class (or type union). But it's easy to make the mistake of having a function that accepts \Sigma instead get passed \alpha \Sigma. It would be great if I could refer to them as different types and have a static analyzer catch my mistakes. Especially as testing math functions is hard.
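
A rough sketch of what I'd want a checker to catch (the names Precision and ScaledPrecision are invented for this example):

from typing import NewType

import numpy as np

Precision = NewType("Precision", np.ndarray)              # Sigma^-1
ScaledPrecision = NewType("ScaledPrecision", np.ndarray)  # alpha * Sigma^-1

def weighted_norm_sq(residual: np.ndarray, precision: Precision) -> float:
    return float(residual @ precision @ residual)

sigma_inv = Precision(np.eye(3))
alpha_sigma_inv = ScaledPrecision(2.0 * sigma_inv)

weighted_norm_sq(np.ones(3), sigma_inv)        # fine
weighted_norm_sq(np.ones(3), alpha_sigma_inv)  # a checker should flag this mix-up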

There are a few other situations where restricting arbitrary subtypes (subsets, since at this point LSP isn't guaranteed?) could help safety, e.g. NewType("StiffODE", list[Callable]) and NewType("NonstiffODE", list[Callable]), and

def ode45(rhs_func: NonstiffODE): ...
def lsoda(rhs_func: StiffODE): ...

Here, using the wrong function may or may not be an error, but can take orders of magnitude more time.
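
A self-contained version of that sketch, with the definitions from above filled in (the solver names just echo the classic ones; the right-hand side here is invented for illustration):

from typing import Callable, NewType

StiffODE = NewType("StiffODE", list[Callable])
NonstiffODE = NewType("NonstiffODE", list[Callable])

def ode45(rhs_func: NonstiffODE): ...   # explicit solver; struggles on stiff systems
def lsoda(rhs_func: StiffODE): ...      # solver suited to stiff systems

rhs = StiffODE([lambda t, y: -1000.0 * y])

lsoda(rhs)   # fine
ode45(rhs)   # a checker should flag this: StiffODE is not NonstiffODE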

In that way, being able to declare arbitrary subtypes quickly and efficiently seems like an intermediate step between choosing good names and building fully functional classes, but I probably need to understand more about type variance and how type analyzers work in order to see how this could work without compromising things that need to be verified as obeying LSP.

A NewType of a protocol makes no sense from a type standpoint.

As others have mentioned in this thread, a protocol is a structural type not a nominal type. You can think of a protocol as a “template” that defines the “shape” of a type. All protocols with the same “shape” are equivalent regardless of the names they are given. If an object is compatible with a protocol, then it’s compatible with all equivalent protocols.

Defining a NewType of a protocol therefore would effectively just create a type alias for the original protocol which would have no utility beyond a simple type alias. If you want to provide different names for equivalent protocols to aid in making the code clear for people who are reviewing the code, a type alias is a good solution. However, a static type checker will treat all of these protocol aliases as equivalent and will not help you with type enforcement.

In this code sample, Proto1, Proto2 and Proto3 are equivalent.

from typing import Protocol, TypeAlias

class Proto1(Protocol):
    x: int

class Proto2(Protocol):
    x: int

Proto3: TypeAlias = Proto1

Appreciate all the help, guys - my misunderstanding came from too broad an idea of what a “type” is. I get that structural and nominal typing need to be thought about differently and don’t easily generalize into “the set of objects that are compatible here”.

I’ve got the workarounds, which basically revolve around creating new types from object and having them either (a) act as a wrapper class for a type union or protocol, (b) intersect a type union with cast, if and when that PEP gets accepted, or (c) rely on type: ignore.

I still think the suggestion has merit from a user perspective. It’s easy to use NewType and static analysis to enforce theorem guarantees (e.g. given a positive-definite matrix A and left-singular vectors U of rank k…). It’s not currently possible to have both that and duck typing.

I need to read more about how SCA libraries work, because I don’t yet see whether what I’ve suggested has some theoretical conflict with what NewType means at a user level, or if it’s just a conflict internal to how type analyzers process them.

If what you’re trying to achieve is a version of NewType that accepts a protocol and returns a nominal type which implements just that protocol and will correctly wrap any type that implements the source protocol, you can definitely do that, although it will be a little bit more clumsy than being able to directly use NewType to create the type constructor:

from typing import cast, reveal_type, Protocol

class XProtocol(Protocol):
    @property
    def x(self) -> int:
        return NotImplemented  # so X isn't considered abstract

class X(XProtocol):
    def __new__(cls, obj: XProtocol) -> "X":
        return cast("X", obj)

class Foo:
    x: int
    
x = X(Foo())
reveal_type(x)

In particular this portion is more or less functionally equivalent to what NewType("X", Base) returns:

class X(Base):
    def __new__(cls, obj: Base) -> "X":
        return cast("X", obj)

I’ll pass in this case. Those using NewType should see the error from static analysis, and I’d prefer not to add to the cost of imports by doing more work everywhere these exist at module level. While there is an argument for it from consistency, the target audience for this feature should already be getting consistent feedback.