Add PDFDocEncoding to standard codecs

Python is commonly used to manipulate PDF files, and the file format has long been an open standard. Packages such as PyPDF2 and PDFMiner, as well as user-written Python code, have taken different approaches to parsing these files and handling the encodings involved. Although the codecs module can be extended, doing so is not trivial and carries a performance cost compared with a built-in codec, so in practice many projects fall back to Latin-1, which differs subtly from PDFDocEncoding, i.e. is wrong.
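To illustrate the divergence, here is a minimal sketch of decoding PDFDocEncoding by hand: start from Latin-1 and patch the bytes where the two encodings differ. Only the 0x18–0x1F accent range is shown; PDFDocEncoding also remaps 0x80–0x9F (typographic punctuation and ligatures) per the table in ISO 32000-1, Annex D, so a real codec needs the full mapping. The function name is mine, not an existing API:

```python
# Bytes where PDFDocEncoding differs from Latin-1 (partial table:
# the 0x18-0x1F accents only; 0x80-0x9F also differ, omitted here).
_PDFDOC_DIFFERENCES = {
    0x18: 0x02D8,  # BREVE
    0x19: 0x02C7,  # CARON
    0x1A: 0x02C6,  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x1B: 0x02D9,  # DOT ABOVE
    0x1C: 0x02DD,  # DOUBLE ACUTE ACCENT
    0x1D: 0x02DB,  # OGONEK
    0x1E: 0x02DA,  # RING ABOVE
    0x1F: 0x02DC,  # SMALL TILDE
}

def pdfdoc_decode(data: bytes) -> str:
    """Decode PDFDocEncoding bytes (partial mapping, see note above)."""
    # Latin-1 maps each byte to the same code point, so it is a safe
    # starting point; translate() then fixes the divergent bytes.
    return data.decode("latin-1").translate(
        {src: chr(dst) for src, dst in _PDFDOC_DIFFERENCES.items()}
    )
```

Decoding with plain Latin-1 would silently map byte 0x18 to a control character rather than the breve accent, which is exactly the kind of quiet corruption a built-in codec would prevent.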

Python advertises itself as “batteries included” and has built-in support for other widely used formats. This change would go some way towards supporting those who process PDFs with Python, and would provide a canonical implementation usable across packages and code: “one method to rule them all”.

There are already examples of supporting PDF artifacts in the standard library. For example, since Python 3.4 the base64 module has supported Ascii85-encoded strings, optionally surrounded by the <~ and ~> delimiters.
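For reference, this is the existing base64 API mentioned above; the `adobe=True` flag handles the <~ ~> framing used in PDF streams:

```python
import base64

# Encode with Adobe-style framing, as found in PDF Ascii85 streams.
framed = base64.a85encode(b"Hello, PDF", adobe=True)
assert framed.startswith(b"<~") and framed.endswith(b"~>")

# adobe=True on decode expects and strips the <~ ~> delimiters.
roundtrip = base64.a85decode(framed, adobe=True)
assert roundtrip == b"Hello, PDF"
```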

Such a change should be implemented in a way that is performant and compliant with the specification. This would be a major advantage compared to the current approaches.


We’re generally hesitant to add new encodings to Python, but this appears to be a case where the basic criteria are fulfilled:

  • in current use
  • in wide-spread use
  • won’t get replaced by e.g. UTF-8 anytime soon

I’ve done some research and found these resources:

I don’t think a PEP is needed for new codecs. Unlike a new stdlib module (which would require a PEP), the codec will only be used via the existing codec interface. Opening a ticket and providing a PR with the implementation should be enough.
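Concretely, the codec interface referred to here is the codecs machinery: a search function registered with codecs.register() returns a CodecInfo, after which the name works with bytes.decode() and str.encode(). In the sketch below, "pdfdoc" is a hypothetical name and the Latin-1 pass-through is a stand-in, not the real PDFDocEncoding mapping:

```python
import codecs

def _pdfdoc_search(name):
    # Search functions must return None for names they don't handle.
    if name != "pdfdoc":
        return None

    def encode(text, errors="strict"):
        # Stand-in mapping for illustration only; a real codec would
        # use the PDFDocEncoding table from ISO 32000-1, Annex D.
        return text.encode("latin-1", errors), len(text)

    def decode(data, errors="strict"):
        return bytes(data).decode("latin-1", errors), len(data)

    return codecs.CodecInfo(encode, decode, name="pdfdoc")

codecs.register(_pdfdoc_search)

# Once registered, the codec is reachable like any built-in encoding.
assert b"PDF".decode("pdfdoc") == "PDF"
assert "PDF".encode("pdfdoc") == b"PDF"
```

A built-in codec would ship the same shape of CodecInfo from within the interpreter, so user code would need no registration step at all.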


The pikepdf encoding documentation also seems useful, if somewhat lacking in specifics:
