Merge typed_ast back into CPython

guido · November 4, 2018, 5:03pm

There’s a fork of the ast module (in C) named typed_ast used by mypy, pytype and (IIRC) also by some linters. Its redeeming quality is that it preserves certain comments (currently only # type: comments; I could imagine that it might be extended to support # noqa comments too). We’ve found that it’s hard work to keep this code up to date with developments in the language’s grammar. (E.g. mypy still doesn’t support all new Python 3.7 syntax.)

I propose to merge this code back into the ast module, thereby simplifying maintenance of the typed_ast module and ensuring that it stays compatible with new syntax added.

Thoughts?

PS. I’m not sure this belongs in Ideas or Committers – if it’s in Ideas it would be simpler for other users of the ast (e.g. linter owners) to provide feedback or support for this idea. But in Committers the people who will end up maintaining it are reached.

dstufft · November 4, 2018, 5:16pm

I think the intent is that it belongs in “Ideas”, giving everyone a chance to be a part of the discussion. We don’t have a direct parallel to python-dev in Discourse, and there is some discussion here about why that is. It seems like the intent is for Ideas to mirror python-ideas, and Users to sort of mirror python-list and python-dev. I don’t really understand that and feel like we need a Development category, but I’m not a Discourse admin, maybe @ambv can weigh in more?

With regards to the proposal itself, I don’t really have an opinion, sorry!

guido · November 4, 2018, 5:36pm

OK, moved it to Ideas.

gpshead · November 5, 2018, 5:17am

I’m generally +1 on the idea of merging in and maintaining this as part of CPython along with the latest grammar.

But can we please keep maintaining a current release of this on PyPI? That way it can be used under older Python versions enabling tooling running on them to analyze the latest syntax. Otherwise we’re stuck in the heck that things like pylint find themselves in of only supporting the language version that the linter is running under.

guido · November 5, 2018, 7:24am

Yeah, we don’t want that. Though right now mypy is generally able to parse Python 3.4 and up with a single typed_ast version. Hmm, except that async and await are always keywords. So you’re right.

Anyway, we also need the PyPI version for Python 2.7. (But that version doesn’t evolve.

Still I think it would be an improvement if this was maintained as part of CPython and kept up to date with CPython.

encukou · November 5, 2018, 9:47am

How problematic would it be to keep all comments?
If only type and noqa are special, I expect that people will be tempted to use them for unrelated things (like attribute docstrings or blockly coordinates).

guido · November 5, 2018, 6:06pm

RIght now typed_ast preserves type annotations in extra fields on the AST. It only has slots for these in specific types. Preserving all comments would require a very different architecture (more like the AST nodes used by lib2to3). So this feels out of scope for a simple “move the code to upstream” proposal; I don’t want to complicate the proposal with scope creep.

storchaka · November 5, 2018, 10:33pm

Currently comments are entirely ignored. The following code is represented as a single AST node Constant(value='foobar'):

('foo'
 # type: int
 'bar')

guido · November 6, 2018, 12:42am

I meant currently in typed-ast. It most definitely stores type comments.

njs · November 6, 2018, 5:31am

It’s probably worth spending a few minutes considering how this proposal relates to @ambv’s proposal to provide a fully comment-preserving parser in the stdlib (bpo-33337). I guess that like most things they’re less related than the summaries make them sound, but I’m not super familiar with the details myself.

ilevkivskyi · November 7, 2018, 2:02pm

This is an interesting idea, but it will require some work to merge it back. Another idea about arbitrary comments: we can preserve a mapping {line_number: comment_string}, something like # type: ignore comments, that are just a list of line numbers in typed_ast currently.

Anyway, I am +1 on this.

@njs The original goal/motivation of that issue was a bit different, but there is a large overlap, in particular typed_ast also tokenizes all versions down to 2.7. So if we are exposing the possibility to parse all versions by same parser, we can also expose the possibility to tokenize all versions.

ambv · November 7, 2018, 7:43pm

As others said, this doesn’t actually fix the problem for mypy as there will always be a need for external tools to be compatible with more than a single version of Python. There are two cases of this: supporting older and newer versions than the currently used one. This is the main (but not the only) reason why Black keeps its own bundled version of lib2to3’s pytree.

The “# noqa” idea is intriguing but I wonder if it could be made to work as easily as type comments since noqa is line-based and therefore can meaningfully come literally after any token. And if it could, at this point it probably would make sense to preserve all comments because why not. But this makes the AST bigger.

More importantly, I’m not sure what @guido is asking for here. The built-in AST for Python 3.8 could support type comments but it doesn’t need them for anything, there are annotations after all. The only type comment that is not expressible as an annotation is “type: ignore” which, again, Python itself does not need.

I might be misunderstanding what is discussed here. Merging type comment support to the built-in AST just to simplify typed-ast’s maintenance seems wrong to me. Python itself doesn’t need this functionality and typed-ast will need to continue existing anyway.

This proposal is about a concrete syntax tree, and specifically about taking pytree out of lib2to3, merging the two implementations of tokenizer.py and documenting the result (which it currently is not).

guido · November 20, 2018, 12:41am

True, but the maintenance burden for typed_ast is really high (basically it might take a person-month to do the port, since the original maintainer has left). And once we have this, creating a separate typed_ast would be much simpler (just copy the files).

encukou · November 20, 2018, 10:03am

Sorry, this reads really wrong to me. I must be misunderstanding.
A high-maintenance external project losing its maintainer is sad, but that shouldn’t be a reason to merge the project into the stdlib (i.e. ask core developers are now asked to do the maintenance instead). Who’d be the core dev responsible for maintaining it in stdlib, and why can’t this person do it in an external library?

Would the stdlib version be open to extending for other use cases?

In bpo-33337, a fully comment-preserving *ST (improved pgen2) was asked to become a third-party library. I fail to see how typed_ast is different – it even seems typed_ast could be built on top of such a comment-preserving tree, so adding typed_ast to stdlib first seems backwards.

I am confused. I must be misunderstanding something. I hope this post doesn’t read as an attack.

guido · November 21, 2018, 12:34am

@encukou
Maybe you didn’t realize typed_ast is a fork of ast, with exactly two things added? It adds fields to certain nodes that hold the type comment, and it adds a bitmap indicating which lines contain # type: ignore comments (which look like type comments but really are a different thing – they can occur in places where a type comment would not be legal). Note that both type comments and # type: ignore are described in a PEP. This is not your ordinary third-party package.

Most of the maintenance to typed_ast is keeping it up to date with changes to Python’s syntax for each new Python release. Currently this must either be done by starting afresh with a new fork of ast, and then re-applying the (very stable) changes to support type comments, or by taking the entire diff between ast for two adjacent Python versions and manually applying it to typed_ast. Bot of these processes are difficult due to code reformatting and unrelated other changes to the ast module in the CPython repo.

In addition, flake8 currently uses the ast module, and it would benefit from having the type comments: an import that is only used in a type comment should not be counted as unused. There are probably other tools that would benefit from having the type comments available too (though IIRC pylint has its own parser, Astroid).

The feature could be extended to e.g. keeping track of # noqa comments (which flake8 currently parses separately). Those are more like # type: ignore than like type comments, in that they apply to arbitrary lines. They do optionally allow additional text, but that’s simple to solve.

encukou · November 21, 2018, 9:17am

Ah, that cleared it up for me (and hopefully some others). Thank you!

In what way do you imagine solving it? Would the additional text be preserved?
If yes, what’s holding the parser back from also retaining other comments, like attribute/variable documentation?

ambv · November 21, 2018, 10:39pm

@guido, AFAICT if we merge type comment support to 3.8, typed_ast will continue to live because:

it will be essentially a backport of AST 3.8+ to previous versions;
it will be a forward port of old, incompatible ASTs to 3.8+ (3.4 without type hints and async/await, 3.5 without var annotations and async gens, 3.6 without real async/await kwds, and so on).

That said, both mypy and typed_ast are now PSF projects so it makes sense to share maintenance cost, especially if this move actually decreases required maintenance. If having type comment/noqa support really makes it much easier to create 3.8, 3.9, and later variants of the AST in typed_ast, I say make a PR @guido and I will merge it. I hope you and Ivan will be around to help on the cpython end if there are any issues on the type comment front

njs · November 22, 2018, 12:03am

It looks like the special features of typed_ast are:

A number of AST classes like FunctionDef have an extra node attached with the type comment
There’s a new func_type parse mode
There’s an option to parse to selectively disable new syntax
Module objects have an attached list of lines that had a # type: ignore comment
Str objects have a kind field that lets you distinguish between u"foo" and "foo"

I’m not sure why the Python-3-mode type-checker needs to distinguish between u"foo" and "foo", but I guess the kind field would be a harmless addition to the stdlib in any case. I don’t know what the func_type parse mode does, so I can’t comment on that. (What does it do?) I’d be interested to hear whether the disable-new-syntax feature is something you think the stdlib parser should add and maintain going forward.

The thing that got me looking into this though, is: for the parts involving comments, shouldn’t it be pretty straightforward to implement those without modifying the parser? It’s easy to use tokenize to find the locations and contents of all the comments in a source file. That’s trivially enough to let you calculate which lines have # type: ignore on them. And for the type comments, if you know which lines the type comments are on, and you know which lines the FunctionDefs and friends are on (because the AST already tracks that information), it should be pretty easy to match them up? Of course I’m fully prepared for the answer to be “yes it should be but in practice it isn’t because ____.”

(Alternatively, I can also see an argument for teaching the parser to treat type comments as an alternative syntax for type annotations, and report them directly in the annotation field as if they were PEP 563 deferred annotations.)

ambv · November 22, 2018, 12:56am

Answering your immediate question: this is harder than it should due to some off by one errors in line/column reporting in the AST.

As a more general note, and please don’t take it personally, I sometimes see discussions mutate in this fashion:

hey, can we do $SIMPLE_THING so that it solves $PROBLEM?
how about we explore $THING_OF_UNKNOWN_SCOPE_AND_RISK instead which doesn’t directly solve $PROBLEM but avoids doing $SIMPLE_THING?
nothing gets done, $PROBLEM persists

There are various flavors of this but the point I’m trying to make is that this pattern derails the discussion. In our particular case my gut feeling is that it is very unlikely for mypy to spend time rewriting what typed-ast enables using the tokenizer. And it would need a non-standard parser for the multi-version support anyway. Let’s focus on how what Guido is asking for would actually help typed-ast (and mypy by extension) and how is it useful for others to have this functionality in core Python.

njs · November 22, 2018, 2:19am

Sure, I don’t want to derail… I actually went back and forth a few times on whether to put a disclaimer about that in :-).

It is sometimes useful to get new eyes on things before merging them into the stdlib, just in case someone notices something new :-). And it’s actually pretty unclear to me right now what exactly the proposal here is. Guido says typed_ast adds “exactly 2 features”, but the docs list 5. Guido’s suggesting possibly expanding the scope further to track # noqa comments – is that the right amount of expansion? You seem to be assuming that typed_ast’s features for parsing old 3.x features won’t be ported over, but that’s not obvious from the topic title, and in that case I don’t understand how mypy is planning to maintain its old version support…

But you all certainly understand this part of the interpreter better than I do though so if my comments are distracting rather than helpful, please ignore me

It would be nice to have more accurate line/column tracking in the AST module in any case. (I’d love to see start/end positions.)