Exposing internal functions for development

dg-pb · June 2, 2024, 4:40am

Would appreciate help on this.

What is the best method to expose internal C function for occasional use?

E.g. a function might be needed for re-calibration in the future, but it is not exposed - it is only being called in certain cases as part of another function.

So it should be conveniently reachable when in development, but hidden from the user.

What would be the best practice? E.g. place for modules that can be easily turned on with a flag.

vstinner · June 2, 2024, 9:33pm

You can expose an internal C API in Python in the _testinternalcapi module. I’m not sure if I understood correctly your use case.

Hidden in C or hidden in Python?

dg-pb · June 2, 2024, 9:57pm

Hidden in python. Well, that is the thing, I don’t know how much it needs to be hidden.

Situation is as follows. There is a function exposed to Python user:

def find_string(...):
    if method_a_is_faster_than_method_b():
        return method_a()
    else:
        return method_b()

There is currently no way of invoking method_a or method_b from python, but calibration of method_a_is_faster_than_method_b routine requires to have access to them separately. It shouldn’t be a frequent practice, but it might be needed to calibrate it some time in the future. E.g. if performance improvements are made to method_a or method_b.

This is essentially str.find/count/in routine.

So I would like to add convenient access to them from python together with a calibration script.

I was thinking it could be useful to expose these in string module. After all method_a_is_faster_than_method_b can never be perfect and it could be convenient to have access to these separately for cases where more savy user can select one which fits his problem better.

Alternatively, some hidden module where functions with similar purpose are placed would work too.

If there is no such place and placing these in string (or any other) module is not an option, then what could be an option here?

dg-pb · June 2, 2024, 10:13pm

I think _testinternalcapi could be a good place for it.

Is there a similar place, where I can place Python code for internal use?

vstinner · June 3, 2024, 8:10am

I suggest to implement find_string() in C and call it in Python, so you don’t have to expose method_a_is_faster_than_method_b() in Python.

dg-pb · June 3, 2024, 11:50am

Apologies. I failed to explain. And provided a misleading code.

Py_LOCAL_INLINE(Py_ssize_t)
find_string(s, n, p, m, maxcount, mode, 0){
float ratio = 0.5;    // manually set constant
if (method_a_is_faster_than_method_b(n, m, ratio)){
    return horspool_find(s, n, p, m, maxcount, mode);
} else {
    return two_way_find(s, n, p, m, maxcount, mode);
}

is all in C.

To calibrate method_a_is_faster_than_method_b(n, m) I would like to expose horspool_mini_find and two_way_find functions to Python, because to calibrate I need to time their execustion for various sets of inputs.

To do that maybe I can use _testinternalcapi. Can I? E.g. expose those as _testinternalcapi._horspool_mini_find and _testinternalcapi._two_way_find.

Also, I would like to write a calibration routine in python, which uses those 2 exposed functions and returns parameters that are used by method_a_is_faster_than_method_b.

e.g.:

def calibrate_find(func_a: Callable, func_b: Callable):
    ratio = timeit(func_a) / timeit(func_b)
    return ratio    # ratio is a parameter that is set to `method_a_is_faster_than_method_b`

This is just a simplified situation. If this was all that it was, I would probably do it all in C. But calibrate_find is complex enough to be worth doing it in Python.

So where would be best places to expose 2 C functions to python and store 1 pure python function when the only purpose of these is calibration?

dg-pb · June 3, 2024, 12:26pm

find_string function that I am referring to is: cpython/Objects/stringlib/fastsearch.h at 61d3ab32da92e70bb97a544d76ef2b837501024f · python/cpython · GitHub

Under the hood it calls 3 different functions depending on a situation.

And they are not accessible individually, which is needed to conveniently perform calibration.

Currently, I need to modify C code to fix one of them to be sure that I am calling the right one.

I think it would be good to have them exposed separately for testing purposes anyway.

Currently, if something went wrong it would be extra work to isolate misbehaving code.

pf_moore · June 3, 2024, 12:38pm

Why do these need to be exposed permanently? Once you’ve calibrated the algorithm, what would change that could require recalibration? In other words, what’s wrong with just instrumenting your test build, working out the correct values to use, and then hard coding those values? If the calibration process is complicated, by all means document the process you used and why you chose the values you did - there’s a great example of such documentation (for Python’s sort algorithm) in Objects\listsort.txt.

I don’t think we expose helper functions that are used to set other algorithmic parameters, like the loading factor for dictionaries. Why would this be different?

dg-pb · June 3, 2024, 1:04pm

You are right, this is not the case where re-calibration needs to be performed frequently (as opposed to situation where algorithm needs to be continuously calibrated on a new data).

And the way it is currently done is exactly the way you described. It was calibrated once and constants are hardcoded.

However, the work that I am doing now is a proof that changes happen from time to time and if what I am referring to was done last time, my current work would have been much smoother.

Things that could change:
a) Someone decides to improve calibration, because finds proof that it can be done better. (e.g. what I am doing now).
b) New algorithm is discovered and needs to be incorporated into the basket.
c) Low level performance changes of functions that some of algorithms use are significant enough to justify recalibration.

I have done such things many times and having explicit access to such components (that perform the same thing but are mixed together for performance purposes) made my life much easier.

E.g. one python function that I sometimes use:

def insort_left_many(a, xs, lo=0, hi=None, *, key=None, method=AUTO):
    if method is AUTO:
        method = select_method(a, xs, lo, hi, key)
    if method == 0:
        return method0(...)
    elif method == 1:
        return method1(...)
    raise ValueError

Initially I had these hard coded the way you are referring.

But after several similar instances as I am currently facing with string_find I changed my practices and started ensuring possibility for individual access.

Although these are not the cases where continuous calibration is needed, but nevertheless, minimal effort to have convenient access, at least from my experience, proves to be worthwhile.

dg-pb · June 3, 2024, 3:33pm

Having that said, I am completely ok to agree to disagree and just not expose anything. I doubt there is a high chance that I will work on this particular case again.

Here, I am just being mindful of a next person who will (even if it is in 10 years time) and good practices for such cases in general.