The common format used in METADATA, PKG-INFO and WHEEL files is currently not well defined. PEP 241, written in 2001, referred to RFC 822, which specifies the email message format, for a description of the format.
The complexity of the email library has led many metadata generators to implement the format themselves. However, these implementations usually forget to check for line breaks, and because different writers are used for PKG-INFO and METADATA, the contents often differ. This sometimes results in invalid metadata files being generated. Parsers, on the other hand, usually use the email package but don't check for defects or logic errors in the metadata files. This leads to invalid files being accepted and uploaded to PyPI.
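As a rough illustration of the defect check that parsers tend to skip, here is a minimal sketch using only the standard library. The function name `load_metadata_checked` is made up for this example and is not part of any existing tool; the parser records problems it notices in the `defects` list of the returned message instead of raising.

```python
from email import message_from_string


def load_metadata_checked(text):
    """Parse metadata with the email package and refuse input with defects.

    Hypothetical helper for illustration only.
    """
    msg = message_from_string(text)
    if msg.defects:
        # The parser does not raise by default; it only records defects.
        names = ", ".join(type(defect).__name__ for defect in msg.defects)
        raise ValueError(f"invalid metadata: {names}")
    return msg


if __name__ == "__main__":
    good = "Metadata-Version: 2.1\nName: demo\n"
    print(load_metadata_checked(good)["Name"])

    # A keyword on its own line, without leading whitespace, is neither a
    # "key: value" pair nor a continuation line.
    bad = "Name: demo\nKeywords: a\nb\nc\n"
    try:
        load_metadata_checked(bad)
    except ValueError as exc:
        print(exc)
```

This only catches what the email parser itself flags; logic errors, such as fields that silently end up in the payload, need separate checks.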
Because of the earlier discussion in "Core metadata email fields & Unicode" with @takluyver and @uranusjr, I had a closer look at the metadata format and tried to come up with a solution to these issues. Python packages don't make use of the complex features of email messages, so a replacement should be feasible. Some churn is inevitable if one wants to improve on the status quo, because a few published metadata files for popular packages are invalid (see below).
I've drafted a written specification of the format that is compatible with the metadata files already deployed on PyPI but does not depend on the email RFCs for the message syntax. In addition, I've implemented a parser and a serializer for the format using only the standard library. A dict-like API for accessing the metadata fields and additional message validation are still missing. To test my implementation, I collected metadata files from the top 4000 packages on PyPI. I can parse and re-serialize all files without problems, except those that contain errors and aren't parsed correctly by the email package either.
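To give an idea of how small such a parser and serializer can be, here is a minimal sketch. It is not my actual implementation or the drafted specification; the names `parse_headers` and `serialize_headers` are made up, and the sketch assumes that continuation lines start with a space or tab, that folded values are joined with a newline, and that continuation lines are written with eight spaces of indentation.

```python
def parse_headers(text):
    """Return (fields, body); fields is a list of (name, value) pairs."""
    fields = []
    body = ""
    lines = text.split("\n")
    for index, line in enumerate(lines):
        if line == "":
            # A completely blank line ends the header section.
            body = "\n".join(lines[index + 1:])
            break
        if line[0] in " \t":
            # Folded line: continue the value of the previous field.
            if not fields:
                raise ValueError(f"line {index + 1}: nothing to continue")
            name, value = fields[-1]
            fields[-1] = (name, value + "\n" + line.strip())
        else:
            name, sep, value = line.partition(":")
            if not sep or not name or " " in name:
                raise ValueError(f"line {index + 1}: expected 'key: value'")
            fields.append((name, value.strip()))
    return fields, body


def serialize_headers(fields, body=""):
    """Write fields back out, folding values that contain line breaks."""
    out = []
    for name, value in fields:
        first, *rest = value.split("\n")
        out.append(f"{name}: {first}")
        # Indent every further line so it cannot be read as a new field
        # or as the end of the header section.
        out.extend("        " + line for line in rest)
    text = "\n".join(out) + "\n"
    if body:
        text += "\n" + body
    return text


if __name__ == "__main__":
    fields = [("Name", "demo"), ("Keywords", "a\nb")]
    text = serialize_headers(fields)
    print(text)
    print(parse_headers(text))
```

The important part is that the serializer never emits a raw line break from a value, which is exactly the check the hand-written generators tend to forget.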
Examples of invalid PKG-INFO files found on PyPI:
- tendo-0.2.15: Each keyword is on its own line, without leading whitespace. This breaks the message, as each line should either be a "key: value" pair or, if line folding is used, start with a space to continue the previous value.
- rstr-2.2.6: The user put a long multi-line description in the Summary field. Same issue as above.
- vaderSentiment-3.3.2: For an unknown reason, the description contains completely blank lines. A blank line without whitespace signals the end of the message header, so the remainder of the message is erroneously considered to be the payload.
- additional errors in passlib-1.7.4 and win_inet_pton-1.1.0
The METADATA files in the wheels for these packages were produced from the broken PKG-INFO files and, while syntactically valid, contain mangled or incomplete data from the PKG-INFO.
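The blank-line case is the nastiest one because the email package does not report it at all. The following sketch reproduces it with a made-up PKG-INFO snippet (not the actual vaderSentiment file); `fields_lost_in_payload` is a hypothetical heuristic that looks for known field names at the start of payload lines.

```python
from email import message_from_string

# Made-up example: the description contains a completely blank line.
BROKEN = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Description: First paragraph.\n"
    "\n"
    "        Second paragraph.\n"
    "License: MIT\n"
)


def fields_lost_in_payload(msg, known=("License", "Classifier", "Keywords")):
    """Return payload lines that look like header fields (heuristic)."""
    payload = msg.get_payload()
    return [
        line for line in payload.splitlines()
        if line.split(":", 1)[0] in known
    ]


msg = message_from_string(BROKEN)

if __name__ == "__main__":
    print(msg.defects)                  # nothing is flagged for this input
    print(msg["License"])               # the field is missing from the headers
    print(fields_lost_in_payload(msg))  # it ended up in the payload instead
```

A converter that reads such a file and writes METADATA from the parsed headers produces a syntactically valid but incomplete result.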
- Does anyone know of METADATA/PKG-INFO files containing a Description field in the piped format from the standard?
- Is it currently possible to block the upload of invalid METADATA and PKG-INFO files to PyPI?