GitHub code search skips 9 CPython source files because they are large

I had a confusing experience searching the CPython source code on GitHub recently. I was looking for the definition of the function PyUnicode_DecodeUTF16Stateful, and GitHub code search did not find it. But it is right there in Objects/unicodeobject.c#L5983-L6136. Why could GitHub not find it?

It turns out that GitHub’s code search has published limitations, including that “files over 350 KiB are excluded”. There are nine .c, .h, and .py source files in CPython which exceed this limit:

find . \( -iname '*.c' -or -iname '*.h' -or -iname '*.py' \) \
    -size +350k -ls | sort -k 7 -n --reverse
... 1478859 ... ./Parser/parser.c
... 1315746 ... ./Modules/unicodename_db.h
...  823119 ... ./Lib/pydoc_data/topics.py
...  601387 ... ./Python/Python-ast.c
...  572370 ... ./Modules/unicodedata_db.h
...  487848 ... ./Modules/posixmodule.c
...  471722 ... ./Objects/unicodeobject.c
...  376291 ... ./Lib/test/test_typing.py
...  365550 ... ./Modules/cjkcodecs/mappings_jp.h
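
To check a single file rather than the whole tree, something like the following might do. It is only a sketch: the helper name exceeds_search_limit is made up for illustration, and it assumes the limit means strictly more than 350 * 1024 = 358400 bytes, which is my reading of "over 350 KiB" and not something GitHub spells out in bytes.

```shell
# Assumed limit: 350 KiB with 1 KiB = 1024 bytes, i.e. 358400 bytes.
LIMIT_BYTES=$((350 * 1024))

# Hypothetical helper: succeeds (exit 0) if the file is larger than the
# assumed limit, fails (exit 1) if it is not, exit 2 if it is not a file.
exceeds_search_limit() {
    [ -f "$1" ] || return 2
    # Some wc implementations pad the count with spaces, so strip them.
    size=$(wc -c < "$1" | tr -d ' ')
    [ "$size" -gt "$LIMIT_BYTES" ]
}

# Example use from the root of a CPython checkout:
# exceeds_search_limit Objects/unicodeobject.c && echo "too large for code search"
```

By that arithmetic, the 471722 bytes of Objects/unicodeobject.c are well over the 358400-byte threshold.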

I have filed an enhancement request with GitHub about this limitation. Maybe they will lift it someday. In the meantime, if you, like me, are searching and searching CPython via GitHub and just can’t find something, consider downloading the CPython sources to your local machine and using your local tools to search.
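
For the local search, a minimal sketch could look like this. The helper name search_sources is hypothetical, and it relies on the -r and --include options, which GNU and BSD grep both provide but which are not part of POSIX grep. It also assumes you already have a checkout, for example from the python/cpython repository on GitHub.

```shell
# Hypothetical helper: search the .c and .h files under a source tree
# for a fixed string, printing file name and line number for each hit.
# $1 = path to a local source tree, $2 = string to look for.
search_sources() {
    grep -rnF --include='*.c' --include='*.h' -- "$2" "$1"
}

# Example, assuming a clone made with:
#   git clone --depth 1 https://github.com/python/cpython.git
# search_sources cpython PyUnicode_DecodeUTF16Stateful
```

Because grep reads the files directly, file size does not matter here. Inside a git checkout, git grep -n is another option.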
