PEP 594 deprecated the cgi module and scheduled it for removal in Python 3.13. The PEP recommends email.message.Message().get_params() as a replacement for cgi.parse_header(), giving the following example:
>>> from cgi import parse_header
>>> from email.message import Message
>>> h = 'application/json; charset=utf8'
>>> parse_header(h)
('application/json', {'charset': 'utf8'})
>>> m = Message()
>>> m['content-type'] = h
>>> m.get_params()
[('application/json', ''), ('charset', 'utf8')]
>>> m.get_param('charset')
'utf8'
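To get the same return shape as cgi.parse_header(), the list from get_params() can be folded into a (value, dict) pair. This is only a minimal sketch: the helper name `parse_header_via_message` is made up for illustration and is not part of the PEP or the stdlib.

```python
from email.message import Message


def parse_header_via_message(line):
    """Return (main_value, params_dict), mimicking cgi.parse_header()."""
    m = Message()
    m['content-type'] = line
    # get_params() returns [(main_value, ''), (name, value), ...]
    main, *params = m.get_params()
    return main[0], dict(params)


print(parse_header_via_message('application/json; charset=utf8'))
# ('application/json', {'charset': 'utf8'})
```

Note that every call builds a fresh Message object, which is part of the extra work compared with a plain string parser.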
However, switching to this replacement has caused a measurable performance regression in Django, where the function is called on every HTTP request. The regression shows up in Django’s ASV benchmarks and in microbenchmarks comparing the original cgi implementation with the new email.message-based one.
Below are results from microbenchmarking on Python 3.11 (using the actual cgi module) through 3.14 (using the copied/ported implementation of parse_header). The script I used is this one; results are in microseconds per call:
This matters for Django because this logic runs for every request and adds measurable overhead, especially in performance-sensitive WSGI or ASGI environments.
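For anyone who wants to reproduce this kind of comparison, here is a rough sketch of such a microbenchmark. It is not the script linked above. `parse_header_shim` is an approximate port of the removed cgi.parse_header(), and the header samples and the helper `usec_per_call` are assumptions chosen for illustration.

```python
import timeit
from email.message import Message


def _parseparam(s):
    # Split on ';' while keeping semicolons inside double quotes intact.
    while s[:1] == ';':
        s = s[1:]
        end = s.find(';')
        while end > 0 and (s.count('"', 0, end) - s.count('\\"', 0, end)) % 2:
            end = s.find(';', end + 1)
        if end < 0:
            end = len(s)
        yield s[:end].strip()
        s = s[end:]


def parse_header_shim(line):
    """cgi.parse_header()-style parser (approximate port)."""
    parts = _parseparam(';' + line)
    key = next(parts)
    pdict = {}
    for p in parts:
        i = p.find('=')
        if i >= 0:
            name = p[:i].strip().lower()
            value = p[i + 1:].strip()
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1].replace('\\\\', '\\').replace('\\"', '"')
            pdict[name] = value
    return key, pdict


def parse_header_message(line):
    """Same return shape, built on email.message.Message."""
    m = Message()
    m['content-type'] = line
    main, *params = m.get_params()
    return main[0], dict(params)


# Sample headers, from simplest to more complex (illustrative only).
HEADERS = [
    'text/plain',
    'application/json; charset=utf8',
    'multipart/form-data; boundary="abc123"; charset=utf-8',
]


def usec_per_call(func, header, number=20_000):
    """Average time of func(header) in microseconds."""
    total = timeit.timeit(lambda: func(header), number=number)
    return total / number * 1e6


if __name__ == '__main__':
    for header in HEADERS:
        shim = usec_per_call(parse_header_shim, header)
        msg = usec_per_call(parse_header_message, header)
        print(f'{header!r}: shim={shim:.2f}us message={msg:.2f}us')
```

Absolute numbers will vary by machine and Python version, so only the ratio between the two columns is worth comparing.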
I would appreciate some help understanding if this should be filed as an issue, or if there is a misunderstanding about how cgi.parse_header should be replaced. Perhaps get_params() could short-circuit when there are no extra parameters? That would allow an optimization for the simplest case (like "text/plain"), which currently shows the worst performance regression.
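Until something like that exists in the stdlib, the short-circuit idea can be approximated on the caller's side. A minimal sketch, assuming that a value without any `;` carries no parameters; the function name `parse_content_type` is hypothetical.

```python
from email.message import Message


def parse_content_type(line):
    """Return (main_value, params_dict) with a fast path for bare values."""
    # Fast path: no ';' means there are no parameters to parse,
    # so skip building a Message entirely.
    if ';' not in line:
        return line.strip(), {}
    m = Message()
    m['content-type'] = line
    main, *params = m.get_params()
    return main[0], dict(params)


print(parse_content_type('text/plain'))
print(parse_content_type('application/json; charset=utf8'))
```

This only helps the parameter-less case; headers with parameters still go through the full get_params() path.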
Your code is using unicode headers, but they are bytes.
When I worked on code to parse HTTP headers we worked in bytes to save the cost of decoding to unicode.
What is the performance if you parse as bytes and not unicode?
Can you run cProfile on the code to see where the hot-spots are?
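One way to do that is sketched below. The wrapper names `parse` and `profile_parse`, the call count, and the sample header are assumptions for illustration.

```python
import cProfile
import io
import pstats
from email.message import Message


def parse(line):
    m = Message()
    m['content-type'] = line
    return m.get_params()


def profile_parse(line, calls=50_000, top=10):
    """Profile repeated parse() calls and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(calls):
        parse(line)
    profiler.disable()
    out = io.StringIO()
    # Sort by cumulative time so the most expensive call chains come first.
    pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(top)
    return out.getvalue()


if __name__ == '__main__':
    print(profile_parse('application/json; charset=utf8'))
```

The profiler adds its own per-call overhead, so the report is better read as a ranking of where time goes than as absolute timings.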
You were on point: there is a performance gain for complex headers when using bytes, although the simpler headers remained slower even when passing bytes to Message.get_params(). See the updated benchmarks below (targeting Python 3.11 with the real cgi module and Python 3.13 with a cgi-like shim for comparison).
(Needless to say, the str → bytes encoding is performed outside the benchmarked code.)
Python 3.11 (uses real `cgi`, email.message gets `bytes`)
At this stage, I see two main areas worth a deeper conversation:
First, whether it’s reasonable to pursue targeted performance improvements in Message.get_params() for simpler header constructs, which continue to exhibit a 3x-5x slowdown compared to cgi.parse_header() even when operating on bytes. Would it be OK if a ticket is created about this?
Second, how we might realistically access the original header bytes in WSGI-based environments. This matters because Django is parsing Content-Type headers from HTTP requests, and AFAIU, the WSGI spec requires all incoming headers to be decoded to str before making them available in environ. That constraint makes it tricky to work at the byte level, even if doing so could offer performance benefits.
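For reference, PEP 3333 specifies that header values in environ are native strings holding only code points representable in ISO-8859-1, so the wire bytes can be recovered by re-encoding. A minimal sketch; the helper name `header_bytes_from_environ` is hypothetical.

```python
def header_bytes_from_environ(environ, key='CONTENT_TYPE'):
    """Return the raw bytes of a WSGI header value, or None if absent."""
    value = environ.get(key)
    if value is None:
        return None
    # PEP 3333: environ strings hold only code points U+0000..U+00FF,
    # so ISO-8859-1 encoding round-trips to the bytes the server received.
    return value.encode('iso-8859-1')


environ = {'CONTENT_TYPE': 'text/plain; charset=utf-8'}
print(header_bytes_from_environ(environ))
# b'text/plain; charset=utf-8'
```

The re-encoding is itself a per-request cost, so this recovers the bytes but does not avoid the str round trip the server already performed.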
Both of these areas could help clarify the tradeoffs of moving away from cgi, particularly as Django now faces the decision of whether to revert the recent switch from cgi.parse_header() to Message.get_params(). Personally, I’d strongly prefer not to revert this change, since relying on the Python stdlib whenever possible improves maintainability, reduces potential security risks, and benefits from broader community scrutiny.
I dug into Message.get_params(), and from the call chain it seems clear that everything is built around str headers. For example, _parseparam() (the call chain is get_params → _get_params_preserve → _parseparam) starts with these two lines:
def _parseparam(s):
    # RDM This might be a Header, so for now stringify it.
    s = ';' + str(s)
It then string-splits, string-strips, and string-compares all the way down. If I assign bytes to the header, the parsing doesn’t work as expected: you basically just get the repr() of the bytes object. So while I understand the idea of avoiding decoding overhead, I don’t see a way to feed bytes into Message and get equivalent results. The API really assumes str throughout.
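A quick way to see this, sketched with a default Message (compat32 policy) and an arbitrary sample header:

```python
from email.message import Message

# str header: parsed as expected.
m_str = Message()
m_str['content-type'] = 'application/json; charset=utf8'
str_params = m_str.get_params()
print(str_params)

# bytes header: str() is applied to the value, so the b'...' repr
# leaks into the parsed result.
m_bytes = Message()
m_bytes['content-type'] = b'application/json; charset=utf8'
bytes_params = m_bytes.get_params()
print(bytes_params)
```

In the bytes case the main value comes back as a str starting with `b'`, which is the repr of the bytes object rather than the media type.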
Could you or someone else advise on next steps? This performance regression is a blocker for Django 6.0, with feature freeze on September 17th. As mentioned before, one option is to revert the migration that replaced the deprecated cgi.parse_header() with Message.get_params(), but I’m reluctant to do that: relying on the Python stdlib offers better maintainability, reduces security risks, and benefits from wider community review (even if Django may be the only project observing this regression, as James noted).
You can open an issue, but I would advise you keep your backport of cgi.parse_header() if it gives you the performance you’re after.
I know how that might look from the outside, but cgi and the other modules we deprecated were removed because of low usage and little maintenance help (i.e. no core dev ownership).
For the sake of explicitness: I don’t mean to suggest reviving cgi is what we’re after. I fully understand why it was deprecated, and I agree it’s not a good idea to keep unsupported code in the stdlib. From Django’s side, we’ll keep our backport, but I’d still be keen to hear if there’s a better stdlib path we might be overlooking.
The main driver for this post is that the PEP literally suggested Message.get_params() as the replacement for cgi.parse_header(), but it doesn’t perform as well. My goal was to understand whether that means there’s a performance issue in get_params() that should be reported/fixed, or if there’s another stdlib utility we should be using instead. We’d always prefer relying on stdlib for header parsing over maintaining our own copy, but if this is simply the tradeoff then we’ll stick with the backport.