`TypeMap`: Generalizing `overload` semantics

Function overloading (via typing.overload) provides the unique and powerful capability of mapping types (and even combinations of types) to other arbitrary types.

# Essentially {(str, bytes): bool, (int, bool): tuple[str, float]}
@overload
def func(x: str, y: bytes) -> bool: ...
@overload
def func(x: int, y: bool) -> tuple[str, float]: ...

Currently, this level of expressiveness is limited to function definitions.

I propose generalizing overload mechanics via a new TypeMap special type.
TypeMap maps input type expressions or tuples of type expressions (analogous to overloaded argument types) to output type expressions (analogous to overloaded return types).
Specializations of TypeMaps (i.e. MyTypeMap[SomeType]) will use the same semantics as overload, with type checkers treating the specialized type as the output type (or union of output types) as defined by the map.

Syntax

I propose allowing TypeMaps to be defined via both a functional and class-based syntax.

Functional syntax

MyTypeMap = TypeMap('MyTypeMap', {
    str: list[str],
    bytes: list[int],
})

x: MyTypeMap[str]
reveal_type(x)  # revealed type is `list[str]`

y: MyTypeMap[str | bytes]
reveal_type(x)  # revealed type is `list[str] | list[bytes]`


MultiTypeMap = TypeMap('MultiTypeMap', {
    (str, bytes): bool,
    (int, bool): tuple[str, float],
})

z: MultiTypeMap[str, bytes]
reveal_type(z)  # revealed type is `bool`

This would be roughly equivalent to:

@overload
def _map_type(a1: str, a2: bytes) -> bool: ...
@overload
def _map_type(a1: int, a2: bool) -> tuple[str, float]: ...

type MyTypeMap[T1, T2] = TypeOf[_map_type(cast(T1, ...), cast(T2, ...))]

Class-based syntax

A class-based syntax would allow using PEP 695 type variables

class CollectionTypeMap[T](TypeMap):
    __types__ = {
        (list, T): list[T],
        (set, T): set[T],
    }

# See https://github.com/microsoft/pyright/discussions/4369
class CollectionTypePreservingClass[GenericCollectionT: (list, set), T]:
    def __init__(self, coll: CollectionTypeMap[CollT, T]) -> None:
        self.coll: CollectionTypeMap[CollT, T] = coll

    def get_one(self) -> T:
        if isinstance(self.coll, set):
            return next(iter(self.coll))
        return self.coll[0]

x = CollectionTypePreservingClass[set, int]()
reveal_type(x.coll)  # revealed type is `set[int]`
reveal_type(x.get_one())  # revealed type is `int`

This would be roughly equivalent to:

@overload
def _map_type[T](a1: list, a2: T) -> list[T]: ...
@overload
def _map_type[T](a1: set, a2: T) -> set[T]: ...

type CollectionTypeMap[T1, T2] = TypeOf[_map_type(cast(T1, ...), cast(T2, ...))]

Real use cases

1. Tagged-data parsing

This is a common pattern when implementing communication protocols.

from enum import IntEnum

class DataType(IntEnum):
  UINT8 = 0
  UINT64 = 1
  STRING = 2

class Message:
  data_type: DataType
  value: int | str

  def encode(self) -> bytes:
    if self.data_type is DataType.UINT8:
      # Type checker error: Cannot access attribute "to_bytes" for class "str"
      encoded_value = self.value.to_bytes(1)
    elif self.data_type is DataType.UINT64:
      # Type checker error: Cannot access attribute "to_bytes" for class "str"
      encoded_value = self.value.to_bytes(8, 'little')
    elif self.data_type is DataType.STRING:
      # Type checker error: Cannot access attribute "encode" for class "int"
      encoded_value = self.value.encode('utf-8')
    else:
      assert_never(self.data_type)
    return self.data_type.to_bytes(1) + encoded_value

With TypeMap this becomes:

from enum import IntEnum
from typing import TypeMap

class DataType(IntEnum):
  UINT8 = 0
  UINT64 = 1
  STRING = 2

DataTypeMap = TypeMap('DataTypeMap', {
    Literal[DataType.UINT8]: int,
    Literal[DataType.UINT64]: int,
    Literal[DataType.STRING]: str,
})

class Message[DataTypeT: (
    Literal[DataType.UINT8],
    Literal[DataType.UINT64],
    Literal[DataType.STRING],
)]:
  data_type: DataTypeT
  value: DataTypeMap[DataTypeT]

  def encode(self) -> bytes:
    if self.data_type is DataType.UINT8:
      encoded_value = self.value.to_bytes(1)
    elif self.data_type is DataType.UINT64:
      encoded_value = self.value.to_bytes(8, 'little')
    elif self.data_type is DataType.STRING:
      encoded_value = self.value.encode('utf-8')
    else:
      assert_never(self.data_type)
    return self.data_type.to_bytes(1) + encoded_value

2. More compact function overloads

subprocess.Popen is a great example of the limitations of overload even for functions.

Currently, usage such as the following requires an assertion or cast:

proc = Popen(..., stdin=PIPE)
# Type checker error: "write" is not a known attribute of "None"
proc.stdin.write(...)

This could be handled by the current type system by making Popen generic over StdinT, StdoutT, and StderrT and adding more overloads. But subprocess.pyi already defines six overloads just to handle the different ways to induce text mode. To handle all of the different combinations of stdin/stdout/stderr, there would need to be 36[1] overloads. (And this is only looking at the 3.11+ version.)

With TypeMap, all of these overloads can be expressed with a single declaration:

# Make PIPE a distinct type
@global_enum
class _PopenFileSpecial(Enum):
  PIPE = -1

PIPE: Final = _PopenFileSpecial.PIPE

TextModeTypeMap = TypeMap('TextModeTypeMap', {
    (Literal[False] | None, Literal[False] | None, None, None): bytes,
    (Literal[True], Literal[True] | None, str | None, str | None): str,
    (Literal[True] | None, Literal[True], str | None, str | None): str,
    (Literal[True] | None, Literal[True] | None, str, str | None): str,
    (Literal[True] | None, Literal[True] | None, str | None, str): str,
    (bool | None, bool | None, str | None, str | None): str | bytes,
})

PipeTypeMap = TypeMap('PipeTypeMap', {
    (Literal[_PopenFileSpecial.PIPE], bytes): IO[bytes],
    (Literal[_PopenFileSpecial.PIPE], str): IO[str],
    (int | IO[Any] | None, str | bytes): None,
})

class Popen[
  AnyStr: (str, bytes),
  StdinT: (IO[bytes], IO[str], None),
  StdoutT: (IO[bytes], IO[str], None),
  StderrT: (IO[bytes], IO[str], None),
]:
  stdin: StdinT
  stdout: StdoutT
  stderr: StderrT

  def __init__[
    UniveralNewlinesT: (Literal[True], Literal[False], None),
    TextT: (Literal[True], Literal[False], None),
    EncodingT: (str, None),
    ErrorsT: (str, None),
    StdinArgT: (int | IO[Any] | None, Literal[PIPE]),
    StdoutArgT: (int | IO[Any] | None, Literal[PIPE]),
    StderrArgT: (int | IO[Any] | None, Literal[PIPE]),
  ](
    self: Popen[
        TextModeTypeMap[UniveralNewlinesT, TextT, EncodingT, ErrorsT],
        PipeTypeMap[
            StdinArgT,
            TextModeTypeMap[UniveralNewlinesT, TextT, EncodingT, ErrorsT],
        ],
        PipeTypeMap[
            StdoutArgT,
            TextModeTypeMap[UniveralNewlinesT, TextT, EncodingT, ErrorsT],
        ],
        PipeTypeMap[
            StderrArgT,
            TextModeTypeMap[UniveralNewlinesT, TextT, EncodingT, ErrorsT],
        ],
    ],
    args: _CMD,
    bufsize: int = -1,
    executable: StrOrBytesPath | None = None,
    stdin: StdinArgT = None,
    stdout: StdoutArgT = None,
    stderr: StderrArgT = None,
    preexec_fn: Callable[[], Any] | None = None,
    close_fds: bool = True,
    shell: bool = False,
    cwd: StrOrBytesPath | None = None,
    env: _ENV | None = None,
    universal_newlines: UniveralNewlinesT = None,
    startupinfo: Any | None = None,
    creationflags: int = 0,
    restore_signals: bool = True,
    start_new_session: bool = False,
    pass_fds: Collection[int] = (),
    *,
    text: TextT = None,
    encoding: EncodingT = None,
    errors: ErrorsT = None,
    user: str | int | None = None,
    group: str | int | None = None,
    extra_groups: Iterable[str | int] | None = None,
    umask: int = -1,
    pipesize: int = -1,
    process_group: int | None = None,
  ) -> None: ...

Required changes to CPython

The only required change to CPython would be adding TypeMap to typing. No special syntax is required. This change is also easily backported via typing_extensions.


  1. (7 combinations of at least one pipe) * (4 text mode overloads + 1 binary mode) + (1 overload for no pipes, text/binary mode irrelevant). ā†©ļøŽ

10 Likes

(This is a follow-up/alternative to this proposal for associated types.)

Although this isn’t the first time this has been proposed, I (still) would really like to see this become a thing. It’s no exaggeration to say that this could at least half the number of overloads in the numpy stubs and scipy-stubs.

For bookkeeping, here are some of the related discussions:

In particular, the syntax I (informally) proposed in Make `@overload` less verbose - #10 by jorenham could be worth taking into consideration.
But to be honest, I don’t care much about the choice of syntax. Fancy stuff like this case type can just as well be added at a later point as far as I’m concerned.

I think that the straightforward functional API you propose here would be quickest and least bumpy road to success. The class-based syntax seems unnecessary to me.

Also note that this generalized ā€œTypeVar with constraintsā€, since e.g. AnyStr can also be written in terms of TypeMap("StrMap", {str: str, bytes: bytes}). So with this, we’d also be able to resolve the concerns regarding the clashing PEP 695 syntax for contrained type variables and Idea: Simpler and More Expressive Type Annotations

It might be worth to also specify the behavior in case of subtyping/assignability, e.g. what’ll happen if you map bool according to {int: A, bool: B, Any: C}, and other ambiguous situations like this.

Also, why did you go with (str, bytes): bool and not str | bytes: bool? I think the latter would be more natural, and will avoid a syntax clash with Idea: Simpler and More Expressive Type Annotations.

4 Likes

I think that this would be very useful for SymPy. Most public functions in SymPy look like:

def some_public_func(arg):
    arg = sympify(arg)
    ...

where the sympify function is a global coercion function that turns all inputs into symbolic expressions of different types like:

def sympify(arg):
    if isinstance(arg, Expr):
        return Expr
    elif isinstance(arg, bool):
        return Boolean(arg)
    elif isinstance(arg, int):
        return Integer(arg)
    ...

It would be possible to specify what the sympify function does with a large number of @overload lines. Ideally though this would be done with a single large recursive TypeMap and then the annotations would be:

T = TypeVar('T', Expr)

Sympifiable = TypeMap('Sympifiable', {
    Boolean: bool | Boolean,
    Integer: int | Integer,
    ...
    Number: 'Sympifiable[int | Float | Rational] | Number',
    T: 'Sympifiable[Number] | T',
    FiniteSet[T]: 'set[Sympifiable[T]] | FiniteSet[T]',
    ...
})

def sympify(arg: Sympifiable[T]) -> T:
    ...

def some_public_func(arg: Sympifiable[Number]) -> Number:
    ...

def other_func(arg: Sympifiable[Expr]) -> Expr:
    ...

def generic_func[T: Expr](arg: Sympifiable[T]) -> T:
    ...

If it were possible to express the type Sympifiable used here then it would be used in the annotations for perhaps the majority of function parameters in the whole of SymPy’s public API.

I know that dicts are ordered now in Python but I still think of them conceptually as unordered mappings so it seems off to me that a dict is being used here if the ordering of the items matters (which it does if this is analogous to @overload).

1 Like

I’m definitely in favor for this, as I’ve actually made a TypeMapping before. However I’m unsure about the proposed class-based syntax. The attribute is not really clear, and it’s harder to directly see the key-value correlation. Imo type ars could be added with the functional syntax, as some

TM = TypeMap[T, U]("TM", {tuple[T, U]: list[U], tuple[T, T, U]: list[T]})

I’d however find the usage of other typing constructs, like ParamSpecs or TypeVarTuples more interesting, especially as those would need to be made hashable.

For bookkeeping, here are some of the related discussions:

I’ll admit to being mildly embarrassed that I didn’t find these before writing this up.

In particular, the syntax I (informally) proposed in Make @overload less verbose - #10 by jorenham] could be worth taking into consideration.
But to be honest, I don’t care much about the choice of syntax. Fancy stuff like this case type can just as well be added at a later point as far as I’m concerned.

I think when it comes to moving this forward, it’s best to start with something that doesn’t require new syntax. In principle though I do like your proposed syntax.

The class-based syntax seems unnecessary to me.

The class based syntax was definitely more of an afterthought, with the sole purpose of allowing type vars (which would be necessary to fully match the expressiveness of overload, and could provide a way to work around the lack of generic type var bounds). But I’ll admit that that’s a much more niche use case, and it probably makes sense to drop the class syntax.

It might be worth to also specify the behavior in case of subtyping/assignability

I think the simplest thing for now is to specify that the behavior follows that of overload. So unless C is compatible with both A and B, a type error should be emitted at the map’s definition site.

(I think the bigger question is how to handle variance, but here too I think we should stick to overload’a semantics, and maybe later introduce a way to specify contravariant/invariant keys.)

Also, why did you go with (str, bytes): bool and not str | bytes: bool ?

The tuple is to signify two distinct type parameters. The idea is to allow multiple ā€œinputā€ types, like in the case of Popen, where the type of the stdin attribute depends on both the stdin and text (et al) parameters.

StringTypeMap = TypeMap(ā€˜StringTypeMap’, {
  Literal[True]: str,
  Literal[False]: bytes,
})

PipeTypeMap = TypeMap(ā€˜PipeTypeMap’, {
  (Literal[True], str): IO[str],
  (Literal[True], bytes): IO[bytes],
  (Literal[False], str | bytes): None,
})

def make_io[
  DecodeT: (Literal[True], Literal[False]),
  PipeT: (Literal[True], Literal[False]),
](decode: DecodeT, pipe: PipeT) -> PipeTypeMap[PipeT, StringTypeMap[DecodeT]]:
  ...

In terms of moving forward, I’m interested in turning this into a PEP (and possibly trying to implement support in Pyright) once we iron out the general details.

The main reason I introduced the class-based syntax was to avoid needing to resort to pre-3.12 style TypeVars, but I agree that the syntax is not intuitive and provides little advantage. I think using the old TypeVar is acceptable, and a dedicated ā€œmodernā€ syntax can be discussed later.

1 Like

Agreed :ok_hand:

Overloads are very complicate, so I don’t think that would be simplest. So I’d try to avoid relating these to overloads if I were in your shoes.

There’s no state, so the types won’t vary, so no need to worry about variance. Or are you talking about whether e.g. a {int: int, str: str} map should accept bool (and map it to int), i.e. whether the TypeMap domain/keys/input is in itself covariant? Personally I think that would be the most natural behaviour, matching TypeVar(..., int, str).

Ah sorry I misunderstood, They’re separate input parameters; gotcha.

But in that case, maybe it would help to also write TypeMap with single input as {(str,): bool}, ... then, i.e. always requiring a tuple as key? That way we avoid ambiguity if we’ll someday be able to write tuple[int, str] as (int, str).

Great!
I don’t have much experience in the PEP writing process myself, but I’d be happy to help out in any way I can.

1 Like

I read through the overload rules, most of the complication comes from splitting unions, which I think is something that we’d like to have. Is there a specific part of the overload spec that you think is undesirable/unnecessary here?

In general, I think having two similar-but-subtly-different type mapping mechanics would cause more confusion. It also increases friction toward getting a PEP accepted, since the new mechanics would need to be fully evaluated. Finally, type checkers have already implemented overload, which should make getting a reference implementation for TypeMap much simpler.

That’s what I was referring to. I can’t actually come up with any use cases for alternative behavior though, so I’m with you on that.

I feel like the extra parentheses make it harder to understand, and anyway I can’t really see that alternative tuple-type syntax being accepted, since it would make type bounds and specialization ambiguous, or would require wrapping everything in a 1-tuple (in which case it would make sense to require that for TypeMap as well).

class A[T: (str, bytes)]:  # `str/bytes` or tuple[str, bytes]?
  ...

class A[T: ((str, bytes),)]:
  ...

class B[*T]:
  ...

B[(str, bytes)]  # same as B[str, bytes]
B[(str, bytes),]  # don't forget the trailing comma!

Some additional questions I came up with:

  1. Should we define result-to-input inference? Leave it up to individual type checkers for now?
    MyMap = TypeMap('MyMap', {
      bool: bytes,
      str: int,
    })
    
    # Should this be mandated/optional/forbidden?
    def f[T: (bytes, int)](x: MyMap[T]) -> T: ...
    
  2. Should we require the number of type parameters to be constant?
    MyMap = TypeMap('MyMap', {
      bool: bytes,
      str: int,
      (int, float): str,  # Any reason to allow/disallow this?
    })
    
  3. Should there be a way to specify "any valid input for this TypeMap? maybe a .inputs attribute?
    If so,
    a. Should it be a union, or some special type that can be used as bounds for a type variable?
    MyMap = TypeMap('MyMap', {
       bool: bytes,
       str: int,
    })
    
    # Is this equivalent to
    #   class A[T: bool | str]:
    # or
    #   class A[T: (bool, str)]:
    # ?
    class A[T: MyMap.inputs]:
        ...
    
    a: A[bool | str]  # Should this be allowed?
    
    b. How would this work for multiple inputs? Should subscripting be allowed? MyMap.inputs[0]?

@erictraut I know you’ve weighed in on similar proposals in the past, but I’d love to get your feedback here.

I’ll try to address a couple of the concerns you’ve raised in the past:

I believe the two primary incentives for this proposal are:

  1. Class-level overloads, something which isn’t currently possible. (Even though you can overload __init__, that doesn’t provide the same one-to-one guarantees within methods, as demonstrated in encoding/decoding example above.)
  2. Exponential overloads. Although technically possible with the current type system, it’s so unwieldy and difficult to maintain that even the official _typeshed stubs don’t bother. (See my Popen example, which itself has been a pain-point for any typed project using subprocess with pipes.)

In general, as my workplace’s local ā€œtyping evangelicalā€, it’s difficult to make the case for adding typing, when maintaining the type annotations becomes more work than maintaining the code itself. As such, I see a major benefit in making typing constructs reusable, even if the resulting semantics could theoretically be expressed using existing features.

I think the overall idea is great, but I’m not a fan of the function call based syntax. It sometimes forces people to stringify the types, both in the case of forward references and when some key types are unhashable (not sure if that can reasonably occurr with ā€œnormalā€ types, but it’s certainly not illegal). We already made PEP 649/749 to deal with the issues coming from that, I don’t think we should introduce new features with the same problems. It also doesn’t allow for type variables. Yes, you can technically still use the old T = TypeVar("T") construct but we really should not encourage people to do so.

Also, while the overload resolution process is pretty complex, I think that type maps should also use it. If it’s meant to be a replacement for overloads, it needs to also behave in the same way. The complex parts of overload resolution also aren’t historical issues or something like that, any similar spec for deciding which type map key to select would have to handle the same cases.

From what I can tell, most of the examples presented here require that to work. So if those are our goal, we kinda have to support that. But then the resolution process gets even more complicated. In general, it also depends on the type map if such a resolution even is possible. There would definitely have to a be formal spec laying out all the details.

When would that realistically be used? When you index into a type map you either have some fixed set of arguments or are unpacking a type var tuple. But you (currently) cannot constrain a type var tuple, so you can never safely unpack it into a fixed type map. For simplicity’s sake I think it’s best to just leave them with a fixed number of arguments.

I don’t think this should be part of the initial proposal of this. If there ends up being demand, there should be some kind of key of syntax/special form/whatever that also works for typed dictionaries.

I’m not in love with the syntax either, but I think the most realistic path forward is to start with a version that doesn’t require any special syntax, and only then propose syntax changes (taking into account real-world usage).

For forward references, type aliases can be used instead of stringification.
To allow for unhashable types we can allow the mapping to be specified as a list/tuple of (key, value) pairs.

In short, I agree that better syntax is needed, but I think the functional syntax is good enough for 99% of use cases, and the issue of syntax is best left for a follow-up PEP.

Is that true? I think all of the examples here have the ā€œinputsā€ in parameter locations, where there shouldn’t be a need to infer from the ā€œoutputā€.

If possible I’d prefer to leave reverse inference as TBD behavior, and iron it out once there’s some real-world usage to look at.

One more alternative syntax:

class MyTypeMap(TypeMap):
  def _(data_type: Literal[DataType.UINT8, DataType.UINT64]) -> int: ...
  def _(data_type: Literal[DataType.STRING]) -> str: ...

This one’s more verbose, but allows type vars and forward references. It also keeps the syntax close to that of overload, and the requirement to name inputs may actually be a good thing, particularly when using multiple inputs.

The part that isn’t in the spec: overlapping overloads. I rather define a TypeMap as simply being distributive over unions, and leave it at that.

Wouldn’t TypeMap still have the same overlap issue, though?

Not if it’s defined to select the most specific case, deviating from the overload spec.

Defining it like that would also enable a very common pattern in NumPy that is currently not allowed:

ToScalar = TypeMap("ToScalar", {
    (bool,): np.bool_,
    (int,): np.int_,
    (float,): np.float64,
    (complex,): np.complex128,
    # -- snip --
})

If this were to written in terms of overloads, this wouldn’t be valid because all overloads overlap but their return types are incompatible (because their intersection is Never). The reason for this restriction is a purely type-theoretical one, without any practical advantages.