Consistent token type int values across Python versions

Dear friends,

Python’s built-in `token` module assigns an integer value to each token type. However, these numeric values are not guaranteed to be stable across Python versions.
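A quick way to see this in the interpreter: the int behind a token type constant is version-specific, but `token.tok_name` always maps it back to a stable name. A minimal check (the printed number depends on the running interpreter):

```python
import token

# The int value behind a token type constant depends on the
# Python version: for example, token.OP is 54 on 3.10/3.11
# but 55 on 3.12 and later.
print("OP =", token.OP)

# token.tok_name maps the version-specific int back to a stable name.
print(token.tok_name[token.OP])  # always "OP"
```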

The following table shows the differences among the currently supported Python versions:

| token type       | Python 3.10 | Python 3.11 | Python 3.12 | Python 3.13 | Python 3.14 |
|------------------|-------------|-------------|-------------|-------------|-------------|
| ENDMARKER        | 0           | 0           | 0           | 0           | 0           |
| NAME             | 1           | 1           | 1           | 1           | 1           |
| …                | …           | …           | …           | …           | …           |
| EXCLAMATION      |             |             | 54          | 54          | 54          |
| OP               | 54          | 54          | 55          | 55          | 55          |
| AWAIT            | 55          | 55          | 56          |             |             |
| ASYNC            | 56          | 56          | 57          |             |             |
| TYPE_IGNORE      | 57          | 57          | 58          | 56          | 56          |
| TYPE_COMMENT     | 58          | 58          | 59          | 57          | 57          |
| SOFT_KEYWORD     | 59          | 59          | 60          | 58          | 58          |
| FSTRING_START    |             |             | 61          | 59          | 59          |
| FSTRING_MIDDLE   |             |             | 62          | 60          | 60          |
| FSTRING_END      |             |             | 63          | 61          | 61          |
| TSTRING_START    |             |             |             |             | 62          |
| TSTRING_MIDDLE   |             |             |             |             | 63          |
| TSTRING_END      |             |             |             |             | 64          |
| COMMENT          |             |             | 64          | 62          | 65          |
| NL               |             |             | 65          | 63          | 66          |
| ERRORTOKEN       | 60          | 60          | 66          | 64          | 67          |
| N_TOKENS         | 64          | 64          | 68          | 66          | 69          |
| NT_OFFSET        | 256         | 256         | 256         | 256         | 256         |

This can force extra work to align token type values in projects that involve tokenizing, such as linters, formatters, and code analyzers.

I propose keeping token type values consistent in future Python versions: existing values from ENDMARKER to N_TOKENS would stay the same as in Python 3.14; values for removed token types (as happened with ASYNC / AWAIT) would remain reserved; and new token types would be appended after the existing ones rather than inserted before any of them, with the exception of the special type NT_OFFSET.

I know that the token types are used as an enum internally in the CPython source. Fixed values may be trivial for CPython development itself, but they would be helpful for external projects such as grammar tools, linters, and formatters.

I have started a project that extracts the relevant code of the built-in `tokenize` module from the upcoming Python 3.14 source into a standalone tokenizer compatible with all active Python versions (3.10 ~ 3.14rc). The token types it generates are fixed to the Python 3.14 values and stay consistent across versions.

I am looking forward to hearing your opinions! :grinning_face_with_smiling_eyes:

This is why the `token` module exists: so you can use more human-readable and stable names. If you need to export tokens, do it by name, not by integer value.
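As a minimal sketch of that advice, one can record token types by their `token.tok_name` string rather than by raw int, so the output stays comparable across versions:

```python
import io
import token
import tokenize

src = "x = 1\n"

# Export tokens by name: token.tok_name turns the version-specific
# int type into a stable, human-readable string.
names = [
    token.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(src).readline)
]
print(names)
```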


I am very grateful to get this advice from a core developer!

I am currently working on a grammar-related project that uses the built-in `tokenize` module, which generates tokens whose types are specific to the running version; some token types are missing entirely (like the TSTRING_* types, only available in the upcoming 3.14). I have to do a lot of version checks and `token` module queries.

Besides, comparing an int value is more efficient than querying the current version’s `token` module each time. I think fixed values would indeed help cross-version library development involving the tokenizer.
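For the version checks mentioned above, one possible shim (a hedged sketch; the helper name is mine, not from any library) is to resolve token types with `getattr` and fall back to a sentinel when the running interpreter does not define them:

```python
import token

# Sentinel for token types the running interpreter does not define
# (e.g. FSTRING_START before 3.12, TSTRING_START before 3.14).
_MISSING = -1

def tok_type(name: str) -> int:
    """Hypothetical helper: look up a token type by its stable name."""
    return getattr(token, name, _MISSING)

OP = tok_type("OP")                        # defined on every version
TSTRING_START = tok_type("TSTRING_START")  # _MISSING before Python 3.14
```

Code can then compare against these module-level ints, keeping the fast int comparison while doing the name-based lookup only once at import time.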