Add PDFDocEncoding to standard codecs

Python is commonly used to manipulate PDF files, and the file format has long been an open standard. Packages such as PyPDF2 and PDFMiner, as well as user-written Python code, have taken different approaches to parsing these files and handling the encodings involved. Although the codecs module can be extended, doing so is not trivial and carries a performance cost compared with a built-in codec, so in practice many projects fall back to Latin-1, which differs subtly from PDFDocEncoding, i.e. is wrong.
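To illustrate the divergence, here is a minimal sketch of decoding PDFDocEncoding by hand: start from Latin-1 and patch the bytes where the two encodings differ. Only the 0x18–0x1F accent range is shown; PDFDocEncoding also remaps 0x80–0x9F (typographic punctuation and ligatures) per the table in ISO 32000-1, Annex D, so a real codec needs the full mapping. The function name is mine, not an existing API:

```python
# Bytes where PDFDocEncoding differs from Latin-1 (partial table:
# the 0x18-0x1F accents only; 0x80-0x9F also differ, omitted here).
_PDFDOC_DIFFERENCES = {
    0x18: 0x02D8,  # BREVE
    0x19: 0x02C7,  # CARON
    0x1A: 0x02C6,  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x1B: 0x02D9,  # DOT ABOVE
    0x1C: 0x02DD,  # DOUBLE ACUTE ACCENT
    0x1D: 0x02DB,  # OGONEK
    0x1E: 0x02DA,  # RING ABOVE
    0x1F: 0x02DC,  # SMALL TILDE
}

def pdfdoc_decode(data: bytes) -> str:
    """Decode PDFDocEncoding bytes (partial mapping, see note above)."""
    # Latin-1 maps each byte to the same code point, so it is a safe
    # starting point; translate() then fixes the divergent bytes.
    return data.decode("latin-1").translate(
        {src: chr(dst) for src, dst in _PDFDOC_DIFFERENCES.items()}
    )
```

Decoding with plain Latin-1 would silently map byte 0x18 to a control character rather than the breve accent, which is exactly the kind of quiet corruption a built-in codec would prevent.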

Python advertises itself as “batteries included” and has built-in support for other widely used formats. This change would go some way towards supporting those who process PDFs with Python, and would provide a canonical implementation usable across packages and code: “one method to rule them all”.

There are already examples of supporting PDF artifacts in the standard library. For example, since Python 3.4 the base64 module has supported Ascii85-encoded strings, optionally surrounded by the <~ and ~> delimiters.
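For reference, this is the existing base64 API mentioned above; the `adobe=True` flag handles the <~ ~> framing used in PDF streams:

```python
import base64

# Encode with Adobe-style framing, as found in PDF Ascii85 streams.
framed = base64.a85encode(b"Hello, PDF", adobe=True)
assert framed.startswith(b"<~") and framed.endswith(b"~>")

# adobe=True on decode expects and strips the <~ ~> delimiters.
roundtrip = base64.a85decode(framed, adobe=True)
assert roundtrip == b"Hello, PDF"
```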

Such a change should be implemented in a way that is performant and compliant with the specification. This would be a major advantage compared to the current approaches.


We’re generally hesitant to add new encodings to Python, but this appears to be a case where the basic criteria are fulfilled:

  • in current use
  • in wide-spread use
  • won’t get replaced by e.g. UTF-8 anytime soon

I’ve done some research and found these resources:

I don’t think a PEP is needed for new codecs. Unlike a new stdlib module (which would require a PEP), the codec will only be used via the existing codec interface. Opening a ticket and providing a PR with the implementation should be enough.
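Concretely, the codec interface referred to here is the codecs machinery: a search function registered with codecs.register() returns a CodecInfo, after which the name works with bytes.decode() and str.encode(). In the sketch below, "pdfdoc" is a hypothetical name and the Latin-1 pass-through is a stand-in, not the real PDFDocEncoding mapping:

```python
import codecs

def _pdfdoc_search(name):
    # Search functions must return None for names they don't handle.
    if name != "pdfdoc":
        return None

    def encode(text, errors="strict"):
        # Stand-in mapping for illustration only; a real codec would
        # use the PDFDocEncoding table from ISO 32000-1, Annex D.
        return text.encode("latin-1", errors), len(text)

    def decode(data, errors="strict"):
        return bytes(data).decode("latin-1", errors), len(data)

    return codecs.CodecInfo(encode, decode, name="pdfdoc")

codecs.register(_pdfdoc_search)

# Once registered, the codec is reachable like any built-in encoding.
assert b"PDF".decode("pdfdoc") == "PDF"
assert "PDF".encode("pdfdoc") == b"PDF"
```

A built-in codec would ship the same shape of CodecInfo from within the interpreter, so user code would need no registration step at all.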


The pikepdf encoding documentation also seems useful, if somewhat lacking in specifics:
