Merge typed_ast back into CPython

I meant currently in typed-ast. It most definitely stores type comments.

It’s probably worth spending a few minutes considering how this proposal relates to @ambv’s proposal to provide a fully comment-preserving parser in the stdlib (bpo-33337). I guess that like most things they’re less related than the summaries make them sound, but I’m not super familiar with the details myself.

This is an interesting idea, but it will require some work to merge it back. Another idea about arbitrary comments: we can preserve a mapping {line_number: comment_string}, something like # type: ignore comments, that are just a list of line numbers in typed_ast currently.

Anyway, I am +1 on this.

@njs The original goal/motivation of that issue was a bit different, but there is a large overlap, in particular typed_ast also tokenizes all versions down to 2.7. So if we are exposing the possibility to parse all versions by same parser, we can also expose the possibility to tokenize all versions.

As others said, this doesn’t actually fix the problem for mypy as there will always be a need for external tools to be compatible with more than a single version of Python. There are two cases of this: supporting older and newer versions than the currently used one. This is the main (but not the only) reason why Black keeps its own bundled version of lib2to3’s pytree.

The “# noqa” idea is intriguing but I wonder if it could be made to work as easily as type comments since noqa is line-based and therefore can meaningfully come literally after any token. And if it could, at this point it probably would make sense to preserve all comments because why not. But this makes the AST bigger.

More importantly, I’m not sure what @guido is asking for here. The built-in AST for Python 3.8 could support type comments but it doesn’t need them for anything, there are annotations after all. The only type comment that is not expressible as an annotation is “type: ignore” which, again, Python itself does not need.

I might be misunderstanding what is discussed here. Merging type comment support to the built-in AST just to simplify typed-ast’s maintenance seems wrong to me. Python itself doesn’t need this functionality and typed-ast will need to continue existing anyway.

This proposal is about a concrete syntax tree, and specifically about taking pytree out of lib2to3, merging the two implementations of tokenizer.py and documenting the result (which it currently is not).

True, but the maintenance burden for typed_ast is really high (basically it might take a person-month to do the port, since the original maintainer has left). And once we have this, creating a separate typed_ast would be much simpler (just copy the files).

Sorry, this reads really wrong to me. I must be misunderstanding.
A high-maintenance external project losing its maintainer is sad, but that shouldn’t be a reason to merge the project into the stdlib (i.e. ask core developers are now asked to do the maintenance instead). Who’d be the core dev responsible for maintaining it in stdlib, and why can’t this person do it in an external library?

Would the stdlib version be open to extending for other use cases?

In bpo-33337, a fully comment-preserving *ST (improved pgen2) was asked to become a third-party library. I fail to see how typed_ast is different – it even seems typed_ast could be built on top of such a comment-preserving tree, so adding typed_ast to stdlib first seems backwards.

I am confused. I must be misunderstanding something. I hope this post doesn’t read as an attack.

@encukou
Maybe you didn’t realize typed_ast is a fork of ast, with exactly two things added? It adds fields to certain nodes that hold the type comment, and it adds a bitmap indicating which lines contain # type: ignore comments (which look like type comments but really are a different thing – they can occur in places where a type comment would not be legal). Note that both type comments and # type: ignore are described in a PEP. This is not your ordinary third-party package.

Most of the maintenance to typed_ast is keeping it up to date with changes to Python’s syntax for each new Python release. Currently this must either be done by starting afresh with a new fork of ast, and then re-applying the (very stable) changes to support type comments, or by taking the entire diff between ast for two adjacent Python versions and manually applying it to typed_ast. Bot of these processes are difficult due to code reformatting and unrelated other changes to the ast module in the CPython repo.

In addition, flake8 currently uses the ast module, and it would benefit from having the type comments: an import that is only used in a type comment should not be counted as unused. There are probably other tools that would benefit from having the type comments available too (though IIRC pylint has its own parser, Astroid).

The feature could be extended to e.g. keeping track of # noqa comments (which flake8 currently parses separately). Those are more like # type: ignore than like type comments, in that they apply to arbitrary lines. They do optionally allow additional text, but that’s simple to solve.

1 Like

Ah, that cleared it up for me (and hopefully some others). Thank you!

In what way do you imagine solving it? Would the additional text be preserved?
If yes, what’s holding the parser back from also retaining other comments, like attribute/variable documentation?

@guido, AFAICT if we merge type comment support to 3.8, typed_ast will continue to live because:

  • it will be essentially a backport of AST 3.8+ to previous versions;
  • it will be a forward port of old, incompatible ASTs to 3.8+ (3.4 without type hints and async/await, 3.5 without var annotations and async gens, 3.6 without real async/await kwds, and so on).

That said, both mypy and typed_ast are now PSF projects so it makes sense to share maintenance cost, especially if this move actually decreases required maintenance. If having type comment/noqa support really makes it much easier to create 3.8, 3.9, and later variants of the AST in typed_ast, I say make a PR @guido and I will merge it. I hope you and Ivan will be around to help on the cpython end if there are any issues on the type comment front :slight_smile:

1 Like

It looks like the special features of typed_ast are:

  • A number of AST classes like FunctionDef have an extra node attached with the type comment
  • There’s a new func_type parse mode
  • There’s an option to parse to selectively disable new syntax
  • Module objects have an attached list of lines that had a # type: ignore comment
  • Str objects have a kind field that lets you distinguish between u"foo" and "foo"

I’m not sure why the Python-3-mode type-checker needs to distinguish between u"foo" and "foo", but I guess the kind field would be a harmless addition to the stdlib in any case. I don’t know what the func_type parse mode does, so I can’t comment on that. (What does it do?) I’d be interested to hear whether the disable-new-syntax feature is something you think the stdlib parser should add and maintain going forward.

The thing that got me looking into this though, is: for the parts involving comments, shouldn’t it be pretty straightforward to implement those without modifying the parser? It’s easy to use tokenize to find the locations and contents of all the comments in a source file. That’s trivially enough to let you calculate which lines have # type: ignore on them. And for the type comments, if you know which lines the type comments are on, and you know which lines the FunctionDefs and friends are on (because the AST already tracks that information), it should be pretty easy to match them up? Of course I’m fully prepared for the answer to be “yes it should be but in practice it isn’t because ____.” :slight_smile:

(Alternatively, I can also see an argument for teaching the parser to treat type comments as an alternative syntax for type annotations, and report them directly in the annotation field as if they were PEP 563 deferred annotations.)

Answering your immediate question: this is harder than it should due to some off by one errors in line/column reporting in the AST.

As a more general note, and please don’t take it personally, I sometimes see discussions mutate in this fashion:

  • hey, can we do $SIMPLE_THING so that it solves $PROBLEM?
  • how about we explore $THING_OF_UNKNOWN_SCOPE_AND_RISK instead which doesn’t directly solve $PROBLEM but avoids doing $SIMPLE_THING?
  • nothing gets done, $PROBLEM persists

There are various flavors of this but the point I’m trying to make is that this pattern derails the discussion. In our particular case my gut feeling is that it is very unlikely for mypy to spend time rewriting what typed-ast enables using the tokenizer. And it would need a non-standard parser for the multi-version support anyway. Let’s focus on how what Guido is asking for would actually help typed-ast (and mypy by extension) and how is it useful for others to have this functionality in core Python.

2 Likes

Sure, I don’t want to derail… I actually went back and forth a few times on whether to put a disclaimer about that in :-).

It is sometimes useful to get new eyes on things before merging them into the stdlib, just in case someone notices something new :-). And it’s actually pretty unclear to me right now what exactly the proposal here is. Guido says typed_ast adds “exactly 2 features”, but the docs list 5. Guido’s suggesting possibly expanding the scope further to track # noqa comments – is that the right amount of expansion? You seem to be assuming that typed_ast’s features for parsing old 3.x features won’t be ported over, but that’s not obvious from the topic title, and in that case I don’t understand how mypy is planning to maintain its old version support…

But you all certainly understand this part of the interpreter better than I do though so if my comments are distracting rather than helpful, please ignore me :slight_smile:

It would be nice to have more accurate line/column tracking in the AST module in any case. (I’d love to see start/end positions.)

To try to bring this back around then, I think the feature request is:

  1. # type: ... comments be translated to appropriate type hints on AST nodes
  2. # type: ignore somehow be represented
  3. Maybe support # noqa

It seems some have suggested to solve all three by having line comments preserved across the AST and then let mypy do the comment handling themselves. I know @ambv wondered:

I do remember there being a discussion somewhere about trying to standardize on the code quality comment structure (and I feel like @barry was in on that conversation for some reason), but I can’t find a reference to it.

To me, it seems like if we can reasonably carry at least line-level comments with the appropriate AST node that’s the most general solution. But if that turns out not to work out well then adding in support for just the type comments since they are standardized in a PEP and have direct relations to the appropriate AST nodes makes sense to me.

I’ll have to see if I can remember where that happened. I feel like it was on the code-quality mailing list, but it could have been the mypy or flake8 issue tracker. The problem we were running into was needing to add tag comments for multiple tools and getting forced too far off the right side.

It feels like this could be PEP worthy. Maybe comments aren’t the best way to specify this (maybe they are though!). If you’re going to modify the AST, would it be possible to add syntax that made this kind of thing explicit? A with statement? A statement decorator? Even if we stick with comments, some kind of standardization might be useful.

(I just found out that Astroid also uses typed_ast.)

It is used to parse type comments used in type signatures, which contain syntax that’s not a valid Python expression, e.g. # type: (int, int) -> str.

I’d like to propose to reduce the scope of the project to just the features that typed_ast currently adds (thanks Nathaniel for referencing the list). I guess I’ll have to work on a PR for this, based on typed_ast. Łukasz has said he’s merge it. While adding # type: noqa support would be nice, the linters currently all have a solution for that, and I propose to punt on that for now.

The PR I have to produce should roughly mimic the instructions in typed_ast for updating, except instead of making a free-standing fork I’d do it as a PR for CPython (master, 3.8).

My only worry would be the code to conditionally parse older versions of the syntax. For mypy this would require supporting 3.4 and up, basically all versions which are still officially supported in some way. (For 2.7 we have a separate backport of ast in typed_ast that in general does not need maintenance.)

After this has been accepted into CPython, producing a new version typed_ast would then be hugely simpler: instead of the elaborate update process mentioned above it would just require copying a bunch of files from CPython into the typed_ast repo with minimal modifications. (This wouldn’t save any effort for the upcoming typed_ast version, but it would make future ones much less work, so we can start supporting new Python versions in mypy sooner.)

2 Likes

I’d like to circle back on this. I now have a thorough understanding of what typed_ast does, and I think it would be straightforward to port it upstream. We’d need to define two new tokens to represent # type: ignore and # type: <whatever>, and tokenizer code to recognize these. Then we need a new flag to be passed to the tokenizer (via the parser) that enables this behavior. We make a small number of changes to Grammar (inserting optional TYPE_COMMENT tokens and to Python.asdl (adding fields to a few node types to hold the optional type comment), and a fair number of changes to ast.c to extract the type comments.

By default, ast.parse() does not return type comments, since this would reject some perfectly good Python code (with a type comment in a place where the grammar doesn’t allow it). But passing an new flag will cause the tokenizer to process type comments and the returned tree will contain them.

I could produce a PR with this in a few days (having just gone over most of the process for porting typed_ast from 3.6 to 3.7).

There’s one more feature I’d like to push for – a feature_version flag that modifies the grammar slightly so it resembles an older version of Python (going back to 3.4). This is used in mypy to decouple the Python version you’re running from the Python version for which you’re checking compatibility (useful when checking code that will be deployed on a system with a different Python version installed). I imagine this would be useful to other linters as well, and the implementation is mostly manipulating whether async and await are keywords. But if there’s pushback to this part I can live without it – the rest of the work is still useful.

Thoughts?

UPDATE: I created an issue: https://bugs.python.org/issue35766

3 Likes

I’d like to merge this. I have everything working (not counting the optional backwards compatibility feature I’ll submit separately). Can I get someone to review it? It might actually make 3.8.0a1!

2 Likes

(And yes this did make it in!)

7 Likes