Easier way to look into the CPython implementation

When I want to see how something is implemented in CPython, I have to search for it manually. For example,
multiplying a set by 2,

s = {1, 2, 3, 1}
s * 2

gives the error,

TypeError: unsupported operand type(s) for *: 'set' and 'int'

to find the implementation of this in cpython, I had to manually search and then found this line,

which probably is the reason for there being no multiply for set, as there is no set_mul.

but I had to manually search for this.
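As a rough cross-check that does not need the C source, the same gap is visible from Python: a type whose C implementation fills in a multiply or repeat slot exposes `__mul__`, and a type that leaves those slots empty does not. This is only a sketch, and the helper name `supports_mul` is made up for illustration.

```python
# Sketch: see which built-in types expose the multiplication special method.
# `supports_mul` is a hypothetical helper name, not part of any library.

def supports_mul(tp):
    """Return True if the type (or one of its bases) defines __mul__."""
    return any("__mul__" in vars(base) for base in tp.__mro__)

for tp in (set, list, tuple, str):
    print(f"{tp.__name__:>5}: defines __mul__ -> {supports_mul(tp)}")
```

This only says that the operation is missing; it does not point at the file or line in the repository.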

another example is this one,

(1.__add__)

which gives the error,

SyntaxError: invalid decimal literal

but to find where is this thing implemented in cpython I again need to manually search.

If I get an error from something like the inspect module, I can use a shortcut in my IDE to navigate to the code that raises the error, because the inspect module is written in Python in the CPython repository, so it is easier for me to find the reason for the error.
For example,

import inspect
inspect.getsource(len)

gives error,

TypeError: module, class, method, function, traceback, frame, or code object was expected, got builtin_function_or_method

which I am able to locate through the traceback in the IDE that is here,

but for things implemented in c in cpython repository, I need to manually search.
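The difference between the two cases can be sketched with inspect itself: objects written in Python have a source file that an IDE can jump to, while built-ins implemented in C do not. The helper name `where_defined` is hypothetical and only for illustration.

```python
import inspect

def where_defined(obj):
    """Return the Python source file for obj, or None when there is none
    (for example because obj is implemented in C)."""
    try:
        return inspect.getsourcefile(obj)
    except TypeError:
        # Raised for built-in functions, classes and modules.
        return None

print(where_defined(inspect.getsource))  # path to inspect.py
print(where_defined(len))                # None: no Python source to jump to
```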

Is there a repository or tool that makes it easier to search the CPython repository for the reason behind an error?

It shouldn’t be necessary to read the source code to understand the error messages. I assume you are interested in the internals for their own sake.

I spend a lot of time searching the CPython source because I am interested in how it works. This includes abusing it and then, like you, I go looking for the error messages, to decide what process it has gone through. But I could equally be following the logic of how it gets it right. I find I need to look up what C functions do a lot, so I really spend a lot of time searching C source and header files.

For this, I have local clones of the CPython source at release points of interest. My favourite tool for exploring is jEdit because it has a brilliant “hypersearch” feature that will enumerate in a side-bar all matches to a regex across all files in subdirectories filtered by a glob. I think a C IDE could be even better, but then I’d have to learn one.

You’ve exhibited three errors, two run-time and one syntax error. As you’ve found, run-time errors are quite a good hook, but syntax errors, which arise from the tokenisation or compilation, are less so because the code that does this is generated from a grammar.

You understand why 1.__add__ is an error, I hope? It’s because the tokeniser, working left to right, has decided it is processing a number (because it starts with a digit) and then a decimal (float) because it found a decimal point. The _ could be part of the number (10_0.1_23 is ok), but apparently not right after the decimal point. FWIW, I’d say we are here:

and then the message is from here:


thanks for your reply,
but I think there must be a way to get the list of C functions that are called when I run something in Python. I was hoping that there would be some Python package that works like this,

pip install cpython_c_traceback
import cpython_c_traceback
cpython_c_traceback.traceback('s = {1, 2, 3}; s * 2')

and it would give me the list of C functions that were called to execute the python code.
If there are a lot of C functions that get called, then I have the option to view the last 5 functions or something like that,

cpython_c_traceback.traceback('s = {1, 2, 3}; s * 2', 5)

this can be done in python, using exception handling, through which one could get the traceback, to get the list of functions that are called, but I am not familiar with how exception handling, or traceback works in C.
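For comparison, here is a minimal sketch of the Python-level version of that idea, using the standard traceback module. The function name `multiply_set` is made up for the example. The traceback lists only Python frames, so the C functions that actually raised the TypeError do not appear in it.

```python
import traceback

def multiply_set():
    # Hypothetical wrapper so that the traceback has a named Python frame.
    s = {1, 2, 3}
    return s * 2

try:
    multiply_set()
except TypeError as exc:
    frames = traceback.extract_tb(exc.__traceback__)
    names = [frame.name for frame in frames]
    print(names)      # Python-level frames only
    print(repr(exc))  # the error itself was raised from C code
```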

C has no such thing as exceptions or tracebacks. In fact, the compiled python interpreter that you run might not even have all of the right functions that you expect, since C compilers are allowed to lie, cheat, and steal to make strange transformations in the code, like eliminating function calls for the sake of performance, as long as the resulting behavior is “as if” the program is doing what you asked it to do.

So what you need is to be able to tell the C compiler “don’t do anything too strange” when it compiles the C code into a binary. This is the role of “debug” builds. One way to do that is to compile your own interpreter from C code, rather than downloading python from somewhere. There’s a reference at https://devguide.python.org/ for how to get started.
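To check which kind of interpreter is currently running, something like the following sketch may help. It relies on `sys.gettotalrefcount` existing only in debug builds, and on `sysconfig.get_config_var`, which may return None for some of these names, especially on Windows. The helper names are made up for illustration.

```python
import sys
import sysconfig

def is_debug_build():
    """sys.gettotalrefcount is only present in debug builds of CPython."""
    return hasattr(sys, "gettotalrefcount")

def build_flags():
    """Collect a few build-time settings; values may be None on some platforms."""
    return {name: sysconfig.get_config_var(name)
            for name in ("Py_DEBUG", "OPT", "CFLAGS")}

print("debug build:", is_debug_build())
for name, value in build_flags().items():
    print(f"{name} = {value!r}")
```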

If you’re using Windows, you can use Visual Studio to set breakpoints at the relevant points in the C code, and then when the interpreter runs and hits a breakpoint, you can use the VS C debugger to view the call stack and local variables and whatnot. If you’re running Linux, you can do a similar thing with the gdb debugger.


You don’t necessarily have to build from source. The full Windows installer can install the debug binaries and symbols. You can attach a platform debugger such as WinDbg or cdb (console) to a debug build and break in at a check point (e.g. call DebugBreak() via ctypes) or break on a common function such as _PyErr_SetObject. Then dump the call stack. For example:

>>> (1.__add__)
Breakpoint 0 hit
python310_d!_PyErr_SetObject:
00007ffc`12c744c0 4c89442418      mov     qword ptr [rsp+18h],r8 
    ss:0000003b`5a7ee180=0000019d00000001

0:000> kc 0n20
Call Site
python310_d!_PyErr_SetObject
python310_d!PyErr_SetObject
python310_d!_syntaxerror_range
python310_d!syntaxerror
python310_d!verify_end_of_number
python310_d!tok_get
python310_d!PyTokenizer_Get
python310_d!_PyPegen_fill_token
python310_d!single_subscript_attribute_target_rule
python310_d!single_target_rule
python310_d!_tmp_20_rule
python310_d!assignment_rule
python310_d!simple_stmt_rule
python310_d!simple_stmts_rule
python310_d!statement_newline_rule
python310_d!interactive_rule
python310_d!_PyPegen_parse
python310_d!_PyPegen_run_parser
python310_d!_PyPegen_run_parser_from_file_pointer
python310_d!_PyParser_ASTFromFile

You can see that the syntax error was raised from verify_end_of_number() because “_” is a potential identifier character, as was already determined by Jeff Allen.
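The tokenizer behaviour can also be probed from Python without a debugger. This sketch compiles a few variants and reports which ones are rejected. The helper name `compiles_ok` is hypothetical, and the exact error message differs between Python versions, so only the exception type is checked.

```python
def compiles_ok(source):
    """Return True if source compiles as an expression, False on SyntaxError."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True

candidates = (
    "1.__add__",    # '1.' starts a float, then '_' is rejected
    "(1).__add__",  # parentheses end the number before the dot
    "1 .__add__",   # so does a space
    "1.0.__add__",  # a complete float followed by attribute access
    "10_0.1_23",    # underscores between digits are allowed
)
for src in candidates:
    print(f"{src!r:>14}: {'ok' if compiles_ok(src) else 'SyntaxError'}")
```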


yes, I was able to get the list of functions using the technique you described.
but there are a few more problems

  1. for some reason when I set a breakpoint on the function that you mentioned that is, _PyErr_SetObject,
    it gives me an error,
    python.exe was compiled with optimization - stepping may behave oddly; variables may not be available.
    and this happens for a lot of functions, I attempted setting a breakpoint on main, Py_BytesMain and got the same error.
    I was able to get the list of functions for (1.__add__) after setting a breakpoint on verify_end_of_number.
    I think it is some issue with optimization, and I need to disable it, but I am not sure how to achieve that.

  2. Sometimes, alongside the list of functions, I would also want timing details. For example, with 1 in {1, 2, 3} vs 1 in [1, 2, 3], I can time them in IPython, but if I want to know which C functions lead to the difference in time, then alongside the list of functions I would want the number of times each function was called and the time it took.
    In Python, this can be done using tools like snakeviz and cProfile.
    I found out that in C there is gprof, which is used for time profiling, but is there a way for me to give this list of functions to gprof as input?

for example, if the traceback for,

1 in {1, 2, 3}

involves,

function_1 at some_set_file_3.c
function_2 at some_set_file_2.c
function_3 at some_set_file_1.c
...
Py_BytesMain(argc, argv) at main.c

and similarly for,

1 in [1, 2, 3]

involves,

function_3 at some_list_file_3.c
function_2 at some_list_file_2.c
function_1 at some_list_file_1.c
...
Py_BytesMain(argc, argv) at main.c

then I would want the time, number of function calls associated with them also,
like,

function_1 at some_set_file_3.c, took: t sec, called: n times

could this traceback be passed as input to gprof?
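For the Python-level half of that comparison, a minimal timing sketch with the standard timeit module could look like this. The variable names and the iteration count are arbitrary choices for the example. It measures only the total time of each expression and says nothing about which C functions account for the difference.

```python
import timeit

# Build the containers once in setup so that only the membership test is timed.
setup = "s = {1, 2, 3}; l = [1, 2, 3]"
n = 200_000  # arbitrary iteration count for this sketch

t_set = timeit.timeit("3 in s", setup=setup, number=n)
t_list = timeit.timeit("3 in l", setup=setup, number=n)

print(f"set : {t_set / n * 1e9:8.1f} ns per test")
print(f"list: {t_list / n * 1e9:8.1f} ns per test")
```

As far as I know, cProfile only records calls to functions and methods, so an operator such as `in` would not show up there as its own entry. That is part of why a C-level profiler is needed for the breakdown asked about above.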

Run the debug build, “python_d.exe”. The full installer has options to install the debug binaries and the associated symbol files (i.e. PDB files).

Note there’s an app version of WinDbg in the Microsoft Store. There’s also an app version of the Windows Performance Analyzer (WPA), which analyzes ETL traces recorded with Windows Performance Recorder (wpr.exe).

I am currently on macOS, I did something like this,

make distclean
./configure --with-pydebug --enable-profiling
make -s -j2
lldb python.exe
(lldb) b main
(lldb) run

but I get the same error again.
I found out that I need to set the optimization level to zero, that is done by specifying -O0.

with the following setting,

./configure --with-pydebug --enable-profiling CFLAGS="-O0" OPT="-O0"
make -s -j2 CFLAGS="-O0" OPT="-O0"

When I set a breakpoint on _PyErr_SetObject, that error is no longer displayed, but the debugger stops without showing me the error, and it points to

PyObject *tb = NULL;

with an underline below tb.
When I set a breakpoint on _PyObject_New, it again stops, marking this line

PyObject *op = (PyObject *) PyObject_Malloc(_PyObject_SIZE(tp));

with an underline below _PyObject_SIZE

Also, the documentation says to use --enable-profiling to profile the C functions as well. I think this might be the solution to the second point in my last post, but there is no description of how to see the profiled output.

Also, I was reading through the configure options and I think the documentation has a mistake. For
--enable-pystats, it says,
Use Tools//summarize_stats.py to read the stats.
but on my computer, this file is in Tools/scripts/summarize_stats.py

update:

I think what is happening here is that, as soon as I set a breakpoint on, let us say, main or _PyErr_SetObject, it does not allow me to run Python code. I can call functions like,

(lldb) b main
(lldb) run
(lldb) call PyLong_FromLong(1)
(lldb) call Py_Initialize()
(lldb) call PyList_New(0)

this can be done after setting a breakpoint on main (or _PyErr_SetObject)

but after setting a breakpoint on main it does not let me run python code in lldb, it says,

Target 0: (python.exe) stopped.