There seems to be no reliable way to access the original (raw) header values of an email message.
With policy.compat32:
>>> # ok:
>>> email.message_from_bytes(b'X-Header: =?utf8?b?w6TDtsO8?=')['X-Header']
'=?utf8?b?w6TDtsO8?='
>>> # broken:
>>> email.message_from_bytes('X-Header: äöü'.encode())['X-Header']
<email.header.Header object at 0x7fe9a80e7e30>
>>> str(email.message_from_bytes('X-Header: äöü'.encode())['X-Header'])
'������'
>>> # raw value not reconstructible:
>>> email.header.decode_header(email.message_from_bytes(b'X-Header: =?utf8?b?w6TDtsO8?=')['X-Header'])
[(b'\xc3\xa4\xc3\xb6\xc3\xbc', 'utf8')]
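To illustrate why the decoded form is not enough, here is a small sketch. The quoted-printable variant `qp_raw` is my own example and not taken from a real message. Both spellings decode to the same pair, so the decoded result cannot tell which one was on the wire:

```python
from email.header import decode_header

# Two different wire encodings of the same text.
b64_raw = '=?utf8?b?w6TDtsO8?='           # base64, as in the example above
qp_raw = '=?utf8?q?=C3=A4=C3=B6=C3=BC?='  # quoted-printable (made-up variant)

b64_decoded = decode_header(b64_raw)
qp_decoded = decode_header(qp_raw)

# The decoded (bytes, charset) pairs are identical, so the original
# spelling cannot be recovered from them.
print(b64_decoded == qp_decoded)
```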
With policy.default:
>>> # not the raw value:
>>> email.message_from_bytes(b'X-Header: =?utf8?b?w6TDtsO8?=', policy=email.policy.default)['X-Header']
'äöü'
>>> email.header.decode_header(email.message_from_bytes(b'X-Header: =?utf8?b?w6TDtsO8?=', policy=email.policy.default)['X-Header'])
[('äöü', None)]
I suspect that policy.compat32 is hopeless, but I have an idea for policy.default:
--- a/Lib/email/headerregistry.py
+++ b/Lib/email/headerregistry.py
@@ -268,0 +269,9 @@ def parse(cls, value, kwds):
+        kwds['raw'] = value.encode("ascii", errors="surrogateescape")
+
+    def init(self, *args, **kw):
+        self._raw = kw.pop('raw')
+        super().init(*args, **kw)
+
+    @property
+    def raw(self):
+        return self._raw
The above change yields:
>>> email.message_from_bytes(b'X-Header: =?utf8?b?w6TDtsO8?=', policy=email.policy.default)['X-Header'].raw
b'=?utf8?b?w6TDtsO8?='
>>> email.message_from_bytes('X-Header: äöü'.encode(), policy=email.policy.default)['X-Header'].raw
b'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> email.message_from_bytes('X-Header: äöü'.encode('latin1'), policy=email.policy.default)['X-Header'].raw
b'\xe4\xf6\xfc'
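For anyone who wants to experiment without patching the standard library, the same idea can probably be sketched as a subclass plugged in through a custom `HeaderRegistry`. This is only a sketch under assumptions: `RawUnstructuredHeader` and `raw_policy` are hypothetical names, and it only affects headers that fall back to the registry's default class.

```python
import email
from email import policy
from email.headerregistry import HeaderRegistry, UnstructuredHeader


class RawUnstructuredHeader(UnstructuredHeader):
    """Hypothetical header class that also keeps the unparsed value."""

    @classmethod
    def parse(cls, value, kwds):
        super().parse(value, kwds)
        # Non-ASCII bytes arrive as surrogate escapes; turn them back into bytes.
        kwds['raw'] = value.encode('ascii', errors='surrogateescape')

    def init(self, *args, **kw):
        # Remove our extra keyword before the base class sees it.
        self._raw = kw.pop('raw')
        super().init(*args, **kw)

    @property
    def raw(self):
        return self._raw


# Headers without a dedicated class in the default map use default_class.
raw_policy = policy.default.clone(
    header_factory=HeaderRegistry(default_class=RawUnstructuredHeader))

msg = email.message_from_bytes(b'X-Header: =?utf8?b?w6TDtsO8?=', policy=raw_policy)
print(msg['X-Header'].raw)
```

Headers that have their own class in the default map (for example To or Date) would not gain `.raw` this way, which is one reason a change in the library itself seems preferable.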
I would appreciate feedback on whether this is worth a PR.