Proposal: relax un-correlated constrained TypeVars

I’ll cut straight to the demo, since I think we’ll agree that this situation is silly, although we might disagree on how to solve it.

This typechecks just fine:

import urllib.parse
test_cases: list[str | bytes] = ["foo", b"bar"]
for test_case in test_cases:
    if isinstance(test_case, str):
        value = urllib.parse.parse_qsl(test_case)
    else:
        value = urllib.parse.parse_qsl(test_case)
    print(value)

This does not:

import urllib.parse
test_cases: list[str | bytes] = ["foo", b"bar"]
for test_case in test_cases:
    print(urllib.parse.parse_qsl(test_case))
    # └╴E  Argument of type "str | bytes" cannot be assigned to parameter "qs" of type "AnyStr@parse_qsl | None" in function "parse_qsl" Pyright (reportArgumentType) [5, 34]

I propose that two programs that differ only by a conditional that evaluates to the same expression in all cases should have equal type-correctness.

In particular, I’d like to propose that, in the special case of one-argument constrained type variables, these two programs be held equivalent by the Python type system.


1 Like

The full definition of parse_qsl is:


def parse_qsl(
    qs: AnyStr | None,
    keep_blank_values: bool = False,
    strict_parsing: bool = False,
    encoding: str = "utf-8",
    errors: str = "replace",
    max_num_fields: int | None = None,
    separator: str = "&",
) -> list[tuple[AnyStr, AnyStr]]: ...

As you can see, AnyStr appears more than once; in particular, the return type depends on it. If we were to allow str | bytes to be assigned to AnyStr in this case, the return type would suddenly become list[tuple[str | bytes, str | bytes]], implying a mixture of str and bytes in the result where there will be none. That is a loss of type information. It means the two branches are only equivalent as long as you don’t use the returned value.

If the signature instead was something like

def parse(text: AnyStr) -> int: ...

Then I would agree with your point. However, I am pretty sure that whenever this is the case, the signature should just use str | bytes instead, circumventing the need for this special case.

2 Likes

I’m not sure what you’re proposing. Could you go into more detail about what you mean by “the special case of one-argument constrained type variables”? Do you mean “a callable whose input signature uses a function-scoped constrained type variable only once”?

I think the underlying problem here is that the definition for the parse_qsl function in the typeshed stubs is incorrect. It should use an overload, but it instead (mis)uses a type variable with value constraints. This was probably done for brevity — or because the developer who wrote the definition didn’t understand the implications of using a constrained type variable in this case.

Currently, this function has the following definition in typeshed.

def parse_qsl(
    qs: AnyStr | None,
    ... <additional params omitted>
) -> list[tuple[AnyStr, AnyStr]]: ...

In addition to the problem caused by the (mis)use of the value constraints, this definition also has the problem that the type variable can go unsolved if the user passes None to the qs parameter. I’d need to dig into the implementation of this function to understand what it actually returns in that case, but it’s probably not what the definition currently indicates.

The definition should probably be changed to something like this:

@overload
def parse_qsl(
    qs: None,
    ... <additional params omitted>
) -> ????????: ...
@overload
def parse_qsl(
    qs: str,
    ... <additional params omitted>
) -> list[tuple[str, str]]: ...
@overload
def parse_qsl(
    qs: bytes,
    ... <additional params omitted>
) -> list[tuple[bytes, bytes]]: ...

I’ve confirmed that with this modified definition, your code sample type checks without a problem.

If that sounds like a good solution, I recommend filing a bug report and/or a PR in the typeshed project.


There’s a need for additional coverage in the typing spec about how type variables with value constraints should work. PEP 484 was very light on details here, but there are rules that type checkers follow. Making those rules explicit in the typing spec is on the to-do list. These rules are already complex, so creating special cases on top of the existing rules is probably not the right answer. Special cases inevitably lead to composability problems.

I think there’s also need for better developer guidance about when value constraints should and shouldn’t be used. I find that developers often reach for them in cases where they should not. This is a good example.

Maybe type checkers should detect and report situations where a type variable with value constraints is inappropriately used within a function definition, like it is in the case of parse_qsl. We’d need to think about whether we could establish rules that wouldn’t lead to a bunch of false positives.

7 Likes

Thanks. I believe you that this is an inappropriate use, but I don’t see why. Can you point me to the reading?

Is it not the case that all constrained typevars could be transformed to overloads? If so, it seems like brevity is the reason constrained typevars exist.

No, not all cases of constrained typevars can be transformed to overloads.

Cases with classes tie multiple definitions together, and constraints on those are required to provide a coherent interface without duplicating implementations. Constrained typevars are also properly type checked, whereas overloads generally aren’t: the implementation body is only checked for consistency with the combined signature of everything the function takes and returns, not per overload, so there is no guarantee that each code path produces the return type of the matching overload.

2 Likes

So nobody agrees that forcing users to write code like this is silly? To me it makes it obvious that the design should be adjusted somehow. It seems I haven’t chosen the right adjustment; perhaps I should have left that part out.

I agree that writing code like that should be unnecessary. You’re welcome to submit a PR to typeshed to fix parse_qsl.

1 Like

This should only need (at most) two overloads, not three.

A typevar with a bound of str | bytes (not constraints) works here. I’m not sure of the intended behavior of parse_qsl when None is passed, and the return type in typeshed is arguably ambiguous here already.

It looks like the runtime behavior is to return an empty list or dict in that case (depending on the function), so no overloads are needed at all: changing this to a typevar with bound=str | bytes | None appears to just work while also matching the runtime behavior.
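
That runtime claim is easy to verify; in current CPython, None is coerced to an empty string internally, so both functions return an empty result rather than failing (older versions may differ):

```python
from urllib.parse import parse_qs, parse_qsl

# None is accepted at runtime and yields an empty result.
print(parse_qsl(None))  # []
print(parse_qs(None))   # {}
```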

In a completely arbitrary spot check, about half of the stdlib functions that mention AnyStr have this problem:

import os, os.path

for test_case in ["/a.b", b"/c.d"]:
    if isinstance(test_case, str):
        os.path.abspath(test_case)
        os.path.expanduser(test_case)
        os.path.commonpath([test_case])
        os.path.relpath(test_case, test_case)
        os.readlink(test_case)
        os.walk(test_case)
    else:
        os.path.abspath(test_case)
        os.path.expanduser(test_case)
        os.path.commonpath([test_case])
        os.path.relpath(test_case, test_case)
        os.readlink(test_case)
        os.walk(test_case)

    # these definitions avoid the issue, for various reasons
    os.path.splitext(test_case)
    os.path.dirname(test_case)
    os.path.basename(test_case)
    os.path.join(test_case)
    os.path.realpath(test_case)
    os.path.splitroot(test_case)
    os.fspath(test_case)

Do you think the typeshed maintainers prefer 100 single-edit PRs or a single 100-edit PR?

What’s worse is that literally all uses of constrained typevars have this problem, even when valid and necessary:

from typing import TypeVar
T = TypeVar("T", int, str)


def g(x: int|str) -> None:
    assert isinstance(x, (str, int))

def f(x: T, y: T) -> T:
    g(x)
    return x + y


for test_case in [1, "one"]:
    if isinstance(test_case, str):
        x = f(test_case, test_case)
    else:
        x = f(test_case, test_case)

The type system itself needs to understand the constrained typevar equivalently to the hand-unrolled @overloads.

1 Like

As mentioned above, using a bound typevar gives a less precise return type than the overloads; you want tuple[str, str] | tuple[bytes, bytes], but the bound typevar will give you tuple[str | bytes, str | bytes].

I think @bukzor is correct that in principle type checkers could automatically handle passing a union of the constraints to a generic function that uses a constrained typevar, by considering the call with each element of the union and unioning the resulting return types, and that would give the best behavior for that scenario. I don’t see any reason why that would be undesirable in principle; it’s really just a matter of how difficult this would be for type checkers to implement, and how often it comes up (in other words, is it a priority.)

I haven’t looked at the relevant code for this in any of the type checkers, but given that overloads fix this, it seems like this must already be what they do for an overloaded function called with a union (consider the call with each element of the union and union the return types), so it doesn’t seem like it should be difficult to do the same for calls with constrained TypeVars.

To an extent I agree here, but I don’t know that this is worth any further attention, or worth exploding this into overloads. The only time the type would be less precise than with the overloads is when you don’t know the input type precisely, which is a relatively rare case to begin with, and one where, if the difference mattered, you would have to check the type before using the result anyway, whether you had the more or the less precise type.

I think handling it at the call site when there’s only one input variable that varies in an unknown way is doable, but this can quickly become less clear in cases with multiple inputs.

For instance,

from typing import TypeVar
T = TypeVar("T", int, str)

def f(x: T, y: T) -> T:
    return x + y


def fails(x: list[int | str]):
    f(x[0], x[1])  # should still error

So while I can agree that a human reader can look at this and say that this can be improved, the question of priority definitely matters here, because picking the right behavior on this may not be as obvious as it looks in toy cases.

Yeah, I think this only applies in single-input cases. The logical extension to multiple-input cases is to try all combinations, but with a constrained typevar all mixed combinations are an error, so the call will always be an error. So I don’t think there’s a lot of subtlety there, but certainly there’s limited applicability. And constrained typevars aren’t super common to begin with.
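
Spelling out that expansion for a two-argument call (a sketch with my own names):

```python
from typing import TypeVar

T = TypeVar("T", int, str)

def f(x: T, y: T) -> T:
    return x + y  # type checks for T = int and again for T = str

# Expanding f(a, b) with a: int | str and b: int | str over the constraints:
#   (int, int) -> int   OK
#   (str, str) -> str   OK
#   (int, str) and (str, int): no single T fits, so these error.
# Since some combinations error, the call as a whole must be rejected --
# unlike the single-argument case, where every expansion succeeds.
assert f(1, 2) == 3
assert f("a", "b") == "ab"
```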

OTOH, there is at least one widely used constrained typevar (AnyStr), and @bukzor has already shown that it is used in a number of single-argument cases in typeshed where this would apply.

2 Likes

This only applies to multiple-argument cases when you are actually passing the same value to both arguments, which seems like an even narrower situation to have special handling for.

If we’re only looking for an easy middle ground to improve the common case right now, I could get behind a narrowly tailored rule or suggestion for type checkers to handle that more intelligently, since the single-argument constrained form is just the union of possible outputs given the possible inputs. This would vastly improve the AnyStr use in typeshed, at least.

I’m not sure constrained typevars are as rare as I’ve seen claimed; they end up being a natural option in a lot of cases involving data pipelines and function composition, and I haven’t had issues using them, though the pipelines I deal with only have homogeneous data at the point where functions defined with constrained typevars come into play.

3 Likes