The slowness here is that on Linux, when using the default scheduler, clock_nanosleep takes about 50 microseconds. This is an explicit implementation choice in Linux, and one that I think is basically fine: it is smaller than all the example small numbers mentioned in the thread you link, and two orders of magnitude smaller than the default sys.getswitchinterval() of 5 ms.
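For anyone who wants to check the numbers on their own machine, here is a rough sketch that times time.sleep(0) and compares it with the switch interval. The helper name `mean_sleep0_seconds` is made up for illustration, and the result will vary with kernel, scheduler policy, timer slack and Python version, so treat it as a ballpark rather than a benchmark.

```python
import sys
import time


def mean_sleep0_seconds(iterations=1000):
    """Return the average wall-clock cost of one time.sleep(0) call."""
    start = time.perf_counter()
    for _ in range(iterations):
        time.sleep(0)
    return (time.perf_counter() - start) / iterations


per_call = mean_sleep0_seconds()
switch_interval = sys.getswitchinterval()  # 0.005 s unless someone changed it

print(f"time.sleep(0): {per_call * 1e6:.1f} us per call")
print(f"switch interval: {switch_interval * 1e6:.0f} us")
print(f"ratio: {switch_interval / per_call:.0f}x")
```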
I think the users who encounter this (seemingly about one every three years for CPython) are probably better served by changing the scheduling policy (e.g. via os.sched_setscheduler) or futzing with PR_SET_TIMERSLACK. After all, the Linux man pages aren’t too bullish on sched_yield when used with the default scheduler: “Use of sched_yield [with the default scheduler] is unspecified and very likely means your application design is broken”. Though a yield API using sched_yield would put us in good company with other languages, as you point out.
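To make the PR_SET_TIMERSLACK option concrete, here is a minimal sketch that calls prctl(2) through ctypes, since as far as I know the standard library has no wrapper for it. The function name `set_timer_slack_ns` is hypothetical, the constants 29 and 30 are the values from linux/prctl.h, and I am assuming a glibc-style libc that exports prctl. The setting applies to the calling thread only.

```python
import ctypes
import sys

# Option numbers from <linux/prctl.h>.
PR_SET_TIMERSLACK = 29
PR_GET_TIMERSLACK = 30


def set_timer_slack_ns(slack_ns):
    """Set the calling thread's timer slack in nanoseconds.

    Returns the slack read back from the kernel, or None when not on Linux.
    """
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    rc = libc.prctl(PR_SET_TIMERSLACK, ctypes.c_ulong(slack_ns), zero, zero, zero)
    if rc != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_TIMERSLACK) failed")
    # PR_GET_TIMERSLACK reports the current slack as the return value.
    return libc.prctl(PR_GET_TIMERSLACK, zero, zero, zero, zero)


if __name__ == "__main__":
    # Ask for 1 microsecond of slack instead of the usual 50.
    print(set_timer_slack_ns(1000))
```

Switching policy with os.sched_setscheduler is the other route, but the real-time policies generally need elevated privileges, which is why I sketched the timer slack one.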
I kind of feel time.sleep(0) is fine for 99.9% of users, and the remaining 0.1% are better off making exactly the syscalls they need or tuning their process settings.
I don’t know much about what fairness guarantees the lock currently backing the GIL provides, or whether they would let us yield in user space. If they do, that could be an interesting thing to expose.