Create a new pickle protocol version to add skipcode

Title: Pickle protocol version 6: skipcode pickles
Author: Wes Turner (@westurner)
Sponsor:
PEP-Delegate:
Discussions-To: Create a new pickle protocol version to add skipcode
Status: Draft
Type: Standards Track
Topic: Pickles
Requires:
Created: 2024-03-18
Python-Version: 3.X
Post-History: 03-18-2024 Create a new pickle protocol version to add skipcode
Replaces:
Superseded-By:
Resolution:

Abstract

Create a new pickle protocol and/or support a skipcode=True pickle keyword argument
that prevents code from being saved in, or executed when read from, pickles,
in order to reduce the risk of unauthorized code execution, particularly in applications where pickle is already the data storage format.

Motivation

There is not yet a way to save data, but not code, to a Python pickle.

Rationale

  • Given that, as the Python Docs indicate [TODO], pickles are dangerous and you should not
    unpickle untrusted data;
    Pickles could just not save or execute untrusted code.

Specification

Backwards Compatibility

  • Pickles with pickle protocol 6 or pickle protocol 6 with e.g. skipcode=True
    would be deserializable with at least protocol 5;
    but obviously without code in the serialized pickles.

Security Implications

[How could a malicious user take advantage of this new feature?]

  • Users would need to learn that pickles are less safe
    without a new optional e.g. skipcode=True or noexec=True flag.

  • Pickles do otherwise parse non-codeobject values after parsing the string
    prefixes specified in the pickle.py protocol.

  • If users do not understand that pickle is only safe from such risk
    if protocol v6/skipcode=True is explicitly specified, users could
    inadvertently over-trust pickles which are still unsafe by default.

  • If the user does not specify protocol v6/skipcode=True,
    reading a pickle will execute code; for example:

    import pickle
    # Hand-crafted protocol 0 stream: the c opcode imports os.system, R calls it.
    pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
    
  • To limit risk of code execution with pickles (which still otherwise do use eval()),
    users would:

    import pickle
    # nocode=True is the keyword proposed here; pickle.loads() does not accept it today.
    pickle.loads(b"cos\nsystem\n(S'echo shouldfail'\ntR.",
      nocode=True,
    )
    
  • Should there be an environment variable to globally enable or disable nocode=True
    for all pickles in a process?

    • PYTHONNOCODEPICKLES?
    PYTHONNOCODEPICKLES=1 python -m pickle -t
    

How to Teach This

[How to teach users, new and experienced, how to apply the PEP to their work.]

  • pickle.dumps() saves code as strings to binary files.

  • pickle.loads() loads strings that start with \c into executable code objects.

  • As with pickle.load(), calling eval() on untrusted input is unsafe (eval(str) also parses and then executes code from a string).

  • As referenced in PEP 574 > Related Work [TODO], there are a number of
    (faster, zero-copy, portable) data serialization/deserialization formats
    that should be considered before choosing pickle for text and/or binary data storage without code execution: JSON (which is a subset of YAML), TOML, pyarrow and parquet, dask.distributed’s task serialization, lancedb/lance.
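
For teaching purposes, the opcodes in a pickle can be listed without unpickling it. This is a minimal sketch using pickletools.genops() from the standard library; the helper name find_global_opcodes and the chosen opcode set are assumptions for illustration, and this is not a security scanner:

```python
import pickle
import pickletools

# Opcodes that look up or call importable objects (an illustrative, not exhaustive, set).
SUSPECT_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ",
    "NEWOBJ", "NEWOBJ_EX", "REDUCE", "BUILD",
}


def find_global_opcodes(data):
    """Return (opcode name, argument) pairs for suspect opcodes.

    pickletools.genops() only parses the stream; it does not import
    or call anything named in it.
    """
    return [
        (opcode.name, arg)
        for opcode, arg, _pos in pickletools.genops(data)
        if opcode.name in SUSPECT_OPCODES
    ]


if __name__ == "__main__":
    print(find_global_opcodes(pickle.dumps({"a": [1, 2.5, "x"]})))
    print(find_global_opcodes(b"cos\nsystem\n(S'echo hello world'\ntR."))
```

A pickle made only of dicts, lists, numbers and strings yields no suspect opcodes, while the hand-crafted stream shows the c (GLOBAL) opcode followed by R (REDUCE).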

Reference Implementation

[Link to any existing implementation and details about its state, e.g. proof-of-concept.]

Rejected Ideas

[Why certain ideas that were brought while discussing this PEP were not ultimately pursued.]

Open Issues

  • Security Implications
    [Any points that are still being decided/discussed.]

References

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

A better name than skipcode or nocode would be cool.

Thoughts?

What kinds of objects could you pickle in this way? Have you tested whether a different encoding form would be more appropriate? Please provide a comparison to other popular serialization forms, showing something that can’t be (say) JSON-encoded but can be transmitted using the form you’re suggesting.

What are you talking about? eval() is not used in pickle, and code objects are not pickleable.

eval() is not used in pickle,

Good point. It’s not quite eval, but it does run code. I’ll remove where it mentions eval().

code objects are not pickleable.

Here pickle serializes code as a string that is run as code upon deserialization with load() or loads():

I’ll add JSON to the maybe not out of place list of pickle alternatives. Maybe also YAML, and TOML.

I have no immediate use case for this; but I know from other experience that there are various applications where pickle is used but there is no need to run code from anything but (signed and) packaged .py files.

Then the obvious question is: Why? Why use pickle? The most likely answer is “because <X> can’t represent what I need to transmit”, but for that to be at all useful to your proposal, you need to show examples that won’t work in well-known safe serializers.

TBH I don’t think a pickle protocol is the right choice here. But I would need to see use-cases to be sure.

Only when one gets to design the app does one get to choose pickle or not.

For other times, a flag to disable code execution in pickle would mitigate risk for applications that already use pickle to store data-only attributes.

Only when one gets to design the app does one get to choose the pickle protocol, too, though.

To make sure I understand the proposal correctly: the idea is to completely prevent any and all execution of custom code/functions, right? Meaning that nothing that isn’t built into the pickle format itself can be deserialized?

This would make the module completely pointless to use. The benefit of pickle is that it can easily serialize and deserialize basically any Python object. This suggestion prevents it from creating an object of an arbitrary type, let alone correctly populating it. Not even trivial objects like dataclasses or types.SimpleNamespace would work.

So yeah, what use case are you considering here? What real-world examples exist where this is an acceptable limitation of pickle?

A safer approach is to map classes onto instances according to a type= attribute without executing any code from a pickle. (edit: FWIW JSON-LD supports an @type attribute for this, which would be safer with pickle with skipcode, too)

It’s not necessary to pickle code objects to persist a tree of nested object.__dict__s, and so pickle could optionally function without saving or executing code.
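
A rough sketch of that idea outside of pickle, assuming a hypothetical REGISTRY allow-list and hypothetical helpers to_plain() and from_plain(); the Point class is only an example. Instances are rebuilt from a type name and their __dict__ without importing or calling anything named in the data:

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


# Hypothetical allow-list: only these classes can be rebuilt from a type name.
REGISTRY = {"Point": Point}


def to_plain(obj):
    """Convert a tree of registered instances into dicts, lists and atoms."""
    if type(obj) in REGISTRY.values():
        fields = {k: to_plain(v) for k, v in vars(obj).items()}
        return {"@type": type(obj).__name__, **fields}
    if isinstance(obj, dict):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_plain(v) for v in obj]
    return obj


def from_plain(value):
    """Rebuild instances by looking the type name up in REGISTRY only."""
    if isinstance(value, dict):
        fields = {k: from_plain(v) for k, v in value.items() if k != "@type"}
        if "@type" in value:
            cls = REGISTRY[value["@type"]]  # KeyError for unregistered names
            instance = cls.__new__(cls)     # does not run __init__
            instance.__dict__.update(fields)
            return instance
        return fields
    if isinstance(value, list):
        return [from_plain(v) for v in value]
    return value
```

An unregistered name such as "os.system" fails with KeyError instead of being imported.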

“Code object” has a specific meaning within Python. Please make sure you are using the correct words, otherwise others are going to be confused.

Code objects already cannot be pickled. So, what do you mean?

Have you considered designing a completely unrelated serialization protocol that uses the same load/loads/dump/dumps protocol that everything else does? People could import that as a drop-in replacement when it’s appropriate to do so.

Moving this to Ideas category, since this is not yet a PEP. It requires a sponsor and submission/number assignment to be discussed as a PEP, after collecting feedback as an idea. PEP 1 – PEP Purpose and Guidelines | peps.python.org

Here they’re called save_global(obj) and they’re callable(): cpython/Lib/pickle.py at main · python/cpython · GitHub

What is the correct terminology for code in objects that pickle persists?

Pickle already serializes trees of objects with attrs in __dict__ and slots.

Why does there need to be an additional unrelated serialization protocol to serialize nested object attrs without unsigned code execution risk?

  • Pickle is already in stdlib.
  • Pickle does not lose type information like JSON.
  • This requires more work in JSON:
import datetime
import pickle

def test_skipcode_pickle_datetime():
    # skipcode= and pickle.SKIPPED_CODE are the proposed API; neither exists in pickle today.
    dt = datetime.datetime.now()
    obj = dict(a=dt, b=lambda: print('!!'))
    output = pickle.loads(pickle.dumps(obj, skipcode=True), skipcode=True)
    assert output['a'] == dt
    assert output['b'] == pickle.SKIPPED_CODE
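
For comparison, here is a sketch of the extra work a datetime round trip needs in JSON. The json.dumps(default=) and json.loads(object_hook=) hooks are standard library; the "__datetime__" tag and the helper names encode_extra and decode_extra are assumptions for illustration:

```python
import datetime
import json


def encode_extra(obj):
    """Called by json.dumps() for objects it cannot serialize natively."""
    if isinstance(obj, datetime.datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode_extra(mapping):
    """Called by json.loads() for every decoded JSON object."""
    if "__datetime__" in mapping:
        return datetime.datetime.fromisoformat(mapping["__datetime__"])
    return mapping


dt = datetime.datetime.now()
text = json.dumps({"a": dt}, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)
assert restored["a"] == dt
```

Every additional type needs its own tag and its own branch in both hooks, which is the type information pickle records by itself.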

No, it does not serialize code as a string. It serializes classes, functions, and several other objects (mostly singletons) by name. This is a different thing. It does not run any code serialized as string upon deserialization. It resolves objects (which must exist in your program) by name.

Both directions are customizable: reducer_override and dispatch upon serialization and find_class (it is not only for classes) upon deserialization. By default there are no restrictions, but the user can create Pickler and Unpickler classes with flexible restrictions. This does not require adding a new pickle protocol.
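
On the serialization side, a minimal sketch of such a restriction using Pickler.reducer_override() (available since Python 3.8). The names DataOnlyPickler and dumps_data_only are hypothetical, and refusing every class and function is only one possible policy:

```python
import io
import pickle
import types


class DataOnlyPickler(pickle.Pickler):
    """Hypothetical Pickler that refuses to pickle classes and functions by name."""

    def reducer_override(self, obj):
        if isinstance(obj, (type, types.FunctionType, types.BuiltinFunctionType)):
            raise pickle.PicklingError(f"refusing to pickle global {obj!r}")
        # Anything else falls back to the default pickling behaviour.
        return NotImplemented


def dumps_data_only(obj):
    buffer = io.BytesIO()
    DataOnlyPickler(buffer, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
    return buffer.getvalue()
```

Because instances are pickled with a reference to their class, this strict policy also rejects instances of classes, including datetime.datetime; a real policy would more likely consult an allow-list.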

Unfortunately, there are other vulnerabilities in the pickle protocol (and many other serialization protocols), without involving any custom classes. I am working on fixing this.

classes, functions, and several other objects (mostly singletons) by name.

I did call them code objects, which, perhaps confusingly (I hadn’t remembered, even though I’ve worked with the inspect module in the past), are ironically not pickleable (and neither are dict values containing lambda funcs, though classes are).
From 3. Data model — Python 3.12.2 documentation ::

3.2.13.1. Code objects
Code objects represent byte-compiled executable Python code, or bytecode. The difference between a code object and a function object is that the function object contains an explicit reference to the function’s globals (the module in which it was defined), while a code object contains no context; also the default argument values are stored in the function object, not in the code object (because they represent values calculated at run-time). Unlike function objects, code objects are immutable and contain no references (directly or indirectly) to mutable objects.

Objects and object instances are only callable(obj) if they have a __call__() method, so that doesn’t help us name the category of pickleable and unpickleable values (see “3. Data model” in the Python 3.12.2 documentation).

And then inspect. Should or could a skipcode pickle skip an attr if inspect.isroutine(obj) or inspect.isclass(obj) or inspect.ismodule(obj)?: inspect — Inspect live objects — Python 3.12.2 documentation

I should have read the most recent docs; from pickle — Python object serialization — Python 3.12.2 documentation :

Restricting Globals

By default, unpickling will import any class or function that it finds in the pickle data. For many applications, this behaviour is unacceptable as it permits the unpickler to import and invoke arbitrary code. Just consider what this hand-crafted pickle data stream does when loaded:

>>> import pickle
>>> pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
hello world
0

In this example, the unpickler imports the os.system() function and then apply the string argument “echo hello world”. Although this example is inoffensive, it is not difficult to imagine one that could damage your system.

For this reason, you may want to control what gets unpickled by customizing Unpickler.find_class(). Unlike its name suggests, Unpickler.find_class() is called whenever a global (i.e., a class or a function) is requested. Thus it is possible to either completely forbid globals or restrict them to a safe subset.

And then there’s this long cookbook example, but not a kwarg= for the reasonable dev who just wants to load data back in later with whatever fork of those classes the instances within the pickles reference as a dotted path, IIRC.
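
A short sketch of the "completely forbid globals" option from the quoted docs, wrapped in a loads()-style helper. The names NoGlobalsUnpickler and loads_data_only are hypothetical, and this does not address the other vulnerabilities mentioned earlier in the thread:

```python
import io
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Hypothetical Unpickler that rejects every class or function lookup."""

    def find_class(self, module, name):
        # Called whenever the stream requests a global (class or function).
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def loads_data_only(data):
    """Like pickle.loads(), but only built-in containers and atoms can load."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()
```

Plain dicts, lists, strings and numbers still load, while the hand-crafted os.system stream above raises pickle.UnpicklingError before anything is imported.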

Include msgpack in the alternatives.

I do not think it wise to promote YAML for anything; it also has APIs that suffer from arbitrary code execution problems (i.e. yaml.load).
