Create a new pickle protocol version to add skipcode

westurner · March 18, 2024, 6:23pm

Title: Pickle protocol version 6: skipcode pickles
Author: Wes Turner (@westurner)
Sponsor:
PEP-Delegate:
Discussions-To: Create a new pickle protocol version to add skipcode
Status: Draft
Type: Standards Track
Topic: Pickles
Requires:
Created: 2024-03-18
Python-Version: 3.X
Post-History: 03-18-2024 Create a new pickle protocol version to add skipcode
Replaces:
Superseded-By:
Resolution:

Abstract

Create a new Pickle protocol and/or support a skipcode=True pickle keyword argument
that prevents code from being saved in or executed when read from pickles,
in order to reduce the risk of unauthorized code execution, particularly in applications where pickle is already the data storage format.

Motivation

There is not yet a way to save data, but not code, to a Python pickle.

Rationale

The Python docs indicate [TODO] that pickles are dangerous and that you should not
unpickle untrusted data.
Pickles could instead simply not save or execute untrusted code.

Specification

Other than a protocol version bump and NOP’ing out the code-serializing and code-deserializing
parts of pickle.py, no changes to the pickle specification should be necessary.
A data-only pickle serialization protocol implementation would need to skip
calls to self.save_global() in pickle._Pickler.save() if condition(pickle_protocol) here, and also in the save_type() dispatch table at #L1123.

Backwards Compatibility

Pickles with pickle protocol 6 or pickle protocol 6 with e.g. skipcode=True
would be deserializable with at least protocol 5;
but obviously without code in the serialized pickles.

Security Implications

[How could a malicious user take advantage of this new feature?]

Users would need to learn that pickles are less safe
without a new optional e.g. skipcode=True or noexec=True flag.
Pickles do otherwise parse non-codeobject values after parsing the string
prefixes specified in the pickle.py protocol.
If users do not understand that pickle is only safe from such risk
if protocol v6/skipcode=True is explicitly specified, users could
inadvertently over-trust pickles which are still unsafe by default.
If the user does not specify protocol v6/skipcode=True,
reading a pickle will execute code; for example:
```
import pickle
# Hand-crafted stream from the pickle docs: loading it imports os.system and calls it.
pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
```
To limit risk of code execution with pickles (which still otherwise do use eval()),
users would:
```
import pickle
# nocode=True is the proposed keyword argument; it does not exist in pickle today.
pickle.loads(b"cos\nsystem\n(S'echo shouldfail'\ntR.",
  nocode=True,
)
```
Should there be an environment variable to globally enable or disable nocode=True
for all pickles in a process?
- PYTHONNOCODEPICKLES?
```
PYTHONNOCODEPICKLES=1 python -m pickle -t
```

How to Teach This

[How to teach users, new and experienced, how to apply the PEP to their work.]

pickle.dumps() saves code as strings to binary files.
pickle.loads() loads strings that start with \c into executable code objects.
Similar to pickle.load(), eval() of untrusted code is unsafe. (eval(str) also parses and then executes code from a string)
As referenced in PEP 574 > Related Work [TODO], there are a number of (faster, zero-copy, portable)
data serialization/deserialization formats
that should be considered before choosing pickle for text and/or binary data storage without code execution: JSON (which is a subset of YAML), TOML, pyarrow and parquet, dask.distributed’s task serialization, and lancedb/lance.

Reference Implementation

[Link to any existing implementation and details about its state, e.g. proof-of-concept.]

Rejected Ideas

[Why certain ideas that were brought while discussing this PEP were not ultimately pursued.]

Open Issues

Security Implications
[Any points that are still being decided/discussed.]

References

“PEP 574 – Pickle protocol 5 with out-of-band data” (2018; Python 3.8), peps.python.org
“PEP 3154 – Pickle protocol version 4” (2011), peps.python.org
“PEP 307 – Extensions to the pickle protocol” (2003), peps.python.org
CPython Docs > Pickle module: “pickle — Python object serialization” (Python 3.12.2 documentation)

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

westurner · March 18, 2024, 6:25pm

A better name than skipcode or nocode would be cool.

Thoughts?

Rosuav · March 18, 2024, 6:46pm

What kinds of objects could you pickle in this way? Have you tested whether a different encoding form would be more appropriate? Please provide a comparison to other popular serialization forms, showing something that can’t be (say) JSON-encoded but can be transmitted using the form you’re suggesting.

storchaka · March 18, 2024, 6:55pm

What are you talking about? eval() is not used in pickle, and code objects are not pickleable.

westurner · March 18, 2024, 7:14pm

eval() is not used in pickle,

Good point. It’s not quite eval, but it does run code. I’ll remove where it mentions eval().

code objects are not pickleable.

Here pickle serializes code as a string that is run as code upon deserialization with load() or loads():

westurner · March 18, 2024, 7:16pm

I’ll add JSON to the maybe not out of place list of pickle alternatives. Maybe also YAML, and TOML.

I have no immediate use case for this; but I know from other experience that there are various applications where pickle is used but there is no need to run code from anything but (signed and) packaged .py files.

Rosuav · March 18, 2024, 7:22pm

Then the obvious question is: Why? Why use pickle? The most likely answer is “because <X> can’t represent what I need to transmit”, but for that to be at all useful to your proposal, you need to show examples that won’t work in well-known safe serializers.
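As one concrete instance of the kind of example being asked for here, a minimal sketch using only the stdlib json and pickle modules. The record contents are made up for illustration.

```python
import datetime
import json
import pickle

record = {"when": datetime.datetime(2024, 3, 18, 18, 23), "pair": (1, 2)}

# json.dumps has no encoding for datetime objects and raises TypeError.
try:
    json.dumps(record)
    json_error = None
except TypeError as exc:
    json_error = exc

# json also has no tuple type, so a tuple comes back as a list.
pair_via_json = json.loads(json.dumps({"pair": record["pair"]}))["pair"]

# pickle preserves both types, but only by resolving datetime.datetime
# by name when the data is loaded.
via_pickle = pickle.loads(pickle.dumps(record))
```

The round trip through pickle keeps the types, and that is exactly the by-name lookup that a data-only mode would have to either allow or give up.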

TBH I don’t think a pickle protocol is the right choice here. But I would need to see use-cases to be sure.

westurner · March 18, 2024, 7:29pm

Only when one gets to design the app does one get to choose pickle or not.

For other times, a flag to disable code execution in pickle would mitigate risk for applications that already use pickle to store data-only attributes.

Rosuav · March 18, 2024, 7:30pm

Only when one gets to design the app does one get to choose the pickle protocol, too, though.

MegaIng · March 18, 2024, 7:30pm

To make sure I understand the proposal correctly. The idea is to completely prevent any and all execution of custom code/functions, right? Meaning that nothing that isn’t built into the pickle format itself can be deserialized?

This would make the module completely pointless to use. The benefit of pickle is that it can easily serialize and deserialize basically any Python object. This suggestion prevents it from creating an object of an arbitrary type, let alone correctly populating it. Not even trivial objects like dataclasses or types.SimpleNamespace could then work.

So yeah, what use case are you considering here? What real-world examples exist where this is an acceptable limitation to pickle?

westurner · March 18, 2024, 7:35pm

A safer approach is to map classes onto instances according to a type= attribute without executing any code from a pickle. (edit: FWIW JSONLD supports an @type attribute for this, which would be safer with Pickle with skipcode, too)

It’s not necessary to pickle code objects to persist a tree of nested object.__dict__s, and so pickle could optionally function without saving or executing code.
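A rough sketch of that idea, written against JSON rather than pickle so that it runs today. The Point and Line classes, the REGISTRY mapping and the "attrs" key are all hypothetical names chosen for illustration; only the "@type" tag comes from the text above.

```python
import json
from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Line:
    start: Point
    end: Point


# Hypothetical registry: the only classes that an "@type" tag may name.
REGISTRY = {cls.__name__: cls for cls in (Point, Line)}


def to_tree(obj):
    """Turn registered instances into plain dicts tagged with "@type"."""
    if type(obj) in REGISTRY.values():
        attrs = {key: to_tree(value) for key, value in vars(obj).items()}
        return {"@type": type(obj).__name__, "attrs": attrs}
    if isinstance(obj, dict):
        return {key: to_tree(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [to_tree(item) for item in obj]
    return obj


def from_tree(node):
    """Rebuild instances from tagged dicts; unknown tags raise KeyError."""
    if isinstance(node, list):
        return [from_tree(item) for item in node]
    if not isinstance(node, dict):
        return node
    if "@type" not in node:
        return {key: from_tree(value) for key, value in node.items()}
    cls = REGISTRY[node["@type"]]
    instance = cls.__new__(cls)  # allocate without calling __init__
    instance.__dict__.update(from_tree(node["attrs"]))
    return instance


line = Line(Point(0, 0), Point(3, 4))
restored = from_tree(json.loads(json.dumps(to_tree(line))))
```

The class is chosen by the loading program's registry, never by a name the data supplies on its own, so a tag such as "os.system" simply fails the lookup.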

MegaIng · March 18, 2024, 7:38pm

“Code objects” has a specific meaning within Python. Please make sure you are using the correct words, otherwise others are going to be confused.

code objects cannot be pickled already. So, what do you mean?

Rosuav · March 18, 2024, 7:39pm

Have you considered designing a completely unrelated serialization protocol that uses the same load/loads/dump/dumps protocol that everything else does? People could import that as a drop-in replacement when it’s appropriate to do so.

davidism · March 18, 2024, 7:40pm

Moving this to Ideas category, since this is not yet a PEP. It requires a sponsor and submission/number assignment to be discussed as a PEP, after collecting feedback as an idea. PEP 1 – PEP Purpose and Guidelines | peps.python.org

westurner · March 18, 2024, 7:41pm

Here they’re called save_global(obj) and they’re callable(): cpython/Lib/pickle.py at main · python/cpython · GitHub

What is the correct terminology for code in objects that pickle persists?

westurner · March 18, 2024, 7:50pm

Pickle already serializes trees of objects with attrs in __dict__ and slots.

Why does there need to be an additional unrelated serialization protocol to serialize nested object attrs without unsigned code execution risk?

Pickle is already in stdlib.
Pickle does not lose type information like JSON.
This requires more work in JSON:

import datetime
import pickle

def test_skipcode_pickle_datetime():
    # skipcode=True and pickle.SKIPPED_CODE are the proposed API; neither exists today.
    dt = datetime.datetime.now()
    obj = dict(a=dt, b=lambda: print('!!'))
    output = pickle.loads(pickle.dumps(obj, skipcode=True), skipcode=True)
    assert output['a'] == dt
    assert output['b'] == pickle.SKIPPED_CODE
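The test above depends on the proposed skipcode= argument. As a rough approximation with today's pickle module, the sketch below swaps in the existing persistent_id / persistent_load hooks plus a find_class allowlist. That is a different mechanism from the one proposed. SKIPPED_CODE, ALLOWED_GLOBALS and the class and function names are hypothetical, and the allowlist assumes the data only needs datetime.datetime.

```python
import datetime
import io
import pickle
import types

SKIPPED_CODE = "SKIPPED_CODE"  # hypothetical stand-in for pickle.SKIPPED_CODE
ALLOWED_GLOBALS = {("datetime", "datetime")}  # assumption: all this data needs

_CODE_TYPES = (
    types.FunctionType,
    types.BuiltinFunctionType,
    types.MethodType,
    types.ModuleType,
)


class SkipCodePickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Store a marker instead of functions, methods and modules.
        if isinstance(obj, _CODE_TYPES):
            return SKIPPED_CODE
        return None  # everything else is pickled as usual


class SkipCodeUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        if pid == SKIPPED_CODE:
            return SKIPPED_CODE
        raise pickle.UnpicklingError(f"unexpected persistent id: {pid!r}")

    def find_class(self, module, name):
        # Resolve only allowlisted names; refuse every other global.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is not allowed")


def dumps(obj):
    buffer = io.BytesIO()
    SkipCodePickler(buffer).dump(obj)
    return buffer.getvalue()


def loads(data):
    return SkipCodeUnpickler(io.BytesIO(data)).load()


dt = datetime.datetime.now()
output = loads(dumps({"a": dt, "b": lambda: print("!!")}))
assert output["a"] == dt
assert output["b"] == SKIPPED_CODE
```

Note that the datetime value only survives because its class is explicitly allowlisted. Objects whose reducer is a plain function would not round-trip under this sketch.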

storchaka · March 18, 2024, 7:58pm

No, it does not serialize code as a string. It serializes classes, functions, and several other objects (mostly singletons) by name. This is a different thing. It does not run any code serialized as string upon deserialization. It resolves objects (which must exist in your program) by name.

Both directions are customizable: reducer_override and dispatch upon serialization and find_class (it is not only for classes) upon deserialization. By default there are no restrictions, but the user can create Pickler and Unpickler classes with flexible restrictions. This does not require adding a new pickle protocol.
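A minimal sketch of the strictest version of that customization, assuming Python 3.8+ (where reducer_override is available): refuse classes, functions and modules when dumping, and refuse every global when loading. The class and helper names are hypothetical.

```python
import io
import pickle
import types

_REFUSED = (type, types.FunctionType, types.BuiltinFunctionType, types.ModuleType)


class DataOnlyPickler(pickle.Pickler):
    def reducer_override(self, obj):
        # Refuse anything that would be stored as a by-name reference.
        if isinstance(obj, _REFUSED):
            raise pickle.PicklingError(f"refusing to pickle {obj!r} by name")
        return NotImplemented  # fall back to the normal pickling path


class DataOnlyUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Forbid all globals, as the docs describe for find_class().
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def dumps_data_only(obj):
    buffer = io.BytesIO()
    DataOnlyPickler(buffer).dump(obj)
    return buffer.getvalue()


def loads_data_only(data):
    return DataOnlyUnpickler(io.BytesIO(data)).load()


data = {"a": [1, 2.5, "x", (3, None)], "b": {1, 2}}
roundtrip = loads_data_only(dumps_data_only(data))
```

With these restrictions only the types that have their own pickle opcodes (numbers, strings, bytes, lists, tuples, dicts, sets) get through, which is the limitation raised earlier in the thread: instances of ordinary classes are rejected too.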

Unfortunately, there are other vulnerabilities in the pickle protocol (and many other serialization protocols), without involving any custom classes. I am working on fixing this.

westurner · March 18, 2024, 8:38pm

classes, functions, and several other objects (mostly singletons) by name.

I did call them code objects, which (perhaps confusingly, as I hadn’t remembered, though I’ve worked with the inspect module in the past) ironically are not pickleable (and neither are dict values containing lambda funcs, though classes are).
From 3. Data model — Python 3.12.2 documentation ::

3.2.13.1. Code objects
Code objects represent byte-compiled executable Python code, or bytecode. The difference between a code object and a function object is that the function object contains an explicit reference to the function’s globals (the module in which it was defined), while a code object contains no context; also the default argument values are stored in the function object, not in the code object (because they represent values calculated at run-time). Unlike function objects, code objects are immutable and contain no references (directly or indirectly) to mutable objects.

Objects and object instances are only callable(obj) if they have a __call__() method, so that doesn’t help us name the category of pickleable and unpickleable values (3. Data model — Python 3.12.2 documentation).

And then inspect. Should or could a skipcode pickle skip an attr if inspect.isroutine(obj) or inspect.isclass(obj) or inspect.ismodule(obj)?: inspect — Inspect live objects — Python 3.12.2 documentation

I should have read the most recent docs; from pickle — Python object serialization — Python 3.12.2 documentation :

Restricting Globals

By default, unpickling will import any class or function that it finds in the pickle data. For many applications, this behaviour is unacceptable as it permits the unpickler to import and invoke arbitrary code. Just consider what this hand-crafted pickle data stream does when loaded:
>>> import pickle
>>> pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
hello world
0
In this example, the unpickler imports the os.system() function and then apply the string argument “echo hello world”. Although this example is inoffensive, it is not difficult to imagine one that could damage your system.

For this reason, you may want to control what gets unpickled by customizing Unpickler.find_class(). Unlike its name suggests, Unpickler.find_class() is called whenever a global (i.e., a class or a function) is requested. Thus it is possible to either completely forbid globals or restrict them to a safe subset.
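To see what that hand-crafted stream asks the unpickler to do without loading it, the stdlib pickletools module can walk the opcodes. This is a sketch: the RISKY set is my own assumed classification of opcodes that resolve a global or call something, not an official list, and risky_opcodes is a hypothetical helper.

```python
import pickle
import pickletools

# Assumed classification: opcodes that look up a global or call/build an object.
RISKY = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
    "REDUCE", "BUILD", "EXT1", "EXT2", "EXT4",
}


def risky_opcodes(data):
    """Return the names of risky opcodes in a pickle stream, without loading it."""
    return {opcode.name for opcode, arg, pos in pickletools.genops(data)} & RISKY


payload = b"cos\nsystem\n(S'echo hello world'\ntR."
found_in_payload = risky_opcodes(payload)
found_in_plain_data = risky_opcodes(pickle.dumps({"a": [1, 2, 3]}))
```

pickletools.dis(payload) prints the full disassembly for a closer look. A scan like this is only a diagnostic aid; it does not replace restricting find_class().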

And then there’s this long cookbook example, but not a kwarg= for the reasonable dev who just wants to load data back in later with whatever fork of those classes the instances within the pickles reference as a dotted path, IIRC.

gpshead · March 18, 2024, 8:55pm

Include msgpack in the alternatives.

I do not think it wise to promote YAML for anything; it also has APIs that suffer from arbitrary code execution problems (i.e. yaml.load).