What does the pickle module do in layman’s terms?

Hello all,

Just a quick one. What does the pickle module do in everyman speak? From what I can gather by reading the documentation, it takes a Python object and ‘encodes’ it as one continuous chunk of binary data, which can then be written to disk, etc., and later reconstructed.

If I have understood this correctly (big if) then the implication is that objects stored in memory are not held as one continuous chunk of data, presumably because some of their associated data, e.g. attribute values and methods, may be stored elsewhere in memory, for instance when they are defined in a base class?

This is all conjecture and I’m sure it’s much more complex, so I’d appreciate any input!

The short answer is that it allows you to save a copy of just about
any Python data structure so that it can be loaded later (by the
same program or another program). Let’s say you read in a bunch of
data and transformed it in memory with a program: you can “pickle”
it and save that to disk. Then later you can unpickle it in a
program without redoing all those earlier transformations.
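
A minimal sketch of that save-then-reload workflow. The `records` dictionary is made up for illustration, and a temporary directory stands in for wherever you would really keep the file:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical "already transformed" data that we don't want to recompute.
records = {"readings": [1.5, 2.5, 4.0], "labels": ("a", "b"), "count": 3}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "records.pickle"

    # Pickle files are binary, so open them in "wb" / "rb" mode.
    with open(path, "wb") as f:
        pickle.dump(records, f)

    # Later (possibly in another program): load the structure back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == records)  # equal contents
print(restored is records)  # but a separate, newly built object
```

Note that the tuple comes back as a tuple and the floats as floats; you get Python objects back, not just text.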

This is, of course, not without its risks. You will see warnings all
over the place about it, but just to add my own: pickled data can
contain pretty much anything, including executable routines. You
should never trust user-supplied pickled data. The pickle module may
seem convenient, but it’s not really a safe alternative to proper
data serialization formats, which you can run validation and
sanitizing routines over.
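
To make that risk concrete, here is a deliberately harmless sketch. The class name `Sneaky` is invented for this example; the point is only that `__reduce__` lets a pickle name any callable to be run at load time. Here it is `print`, but an attacker could just as easily pick something destructive.

```python
import pickle


class Sneaky:
    """Hypothetical class whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call print(...)".
        return (print, ("this line was printed by pickle.loads()",))


payload = pickle.dumps(Sneaky())

# Simply loading the bytes runs the call; no method is invoked by our code.
result = pickle.loads(payload)

# The "reconstructed object" is just whatever the callable returned.
print(result)
```

The receiving side never asked for anything to be printed. That is why the data itself has to be trusted, not just its apparent contents.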

And to answer your other question, because data structures in Python
can contain other structures, there’s no guarantee that the various
parts are adjacent in memory. Memory allocations will end up all
over the place as your program adds and removes objects. The same
goes for just about any language runtime except the lowest-level
ones and those which completely virtualize their memory management:
memory addressing isn’t usually under the runtime’s control, and
those decisions are delegated to the underlying system.


Yes, that’s broadly correct. When you look at a Python object literal in your source code, that’s very similar to the object itself (it has all the important information), but it’s not actually the object. Pickling is similar, but more powerful - it can save things that don’t make sense in source code - by recording a set of rules for reconstructing the object.

Pickling is basically like a sci-fi teleporter. You can’t ACTUALLY send people through this weird beam thing, but instead, you write down all the information needed to perfectly recreate the person at the other end. Then you send that information down to the planet (that’s the pickle file), and some old-school sound effects happen, and you construct a brand new person right where they wanted to be.


You have the gist of it, yes. Please keep in mind that you must never unpickle data from an untrusted source, because this can cause arbitrary code to run (as noted at the top of the pickle documentation).

Correct, but there are much easier ways to find this out :wink: In particular, lists can contain other objects, and “see” the change if that object (for example, another list) is modified:

>>> a = [1]
>>> b = [a]
>>> a.append(2)
>>> a
[1, 2]
>>> b
[[1, 2]]

This would be impossible if b had to embed a copy of a, directly in-line in memory, at the time of creating [a].

I should pause here to note that even in much lower level languages like C, it is not possible in general to just write out data structures from memory directly to disk and expect to be able to use that later. The problem is that pointers are specific to where in memory (and on modern computers, memory locations are a very complicated thing) the other object is, and if you just try to load the entire binary dump the next time the program runs, nothing will necessarily be in the same places as before, so the pointers would be corrupted. (Often, operating systems deliberately randomize the “base” address for the chunk of memory a process is using, because that makes it harder for malware to interfere with other processes.)

So, the pickled data won’t actually contain the exact memory contents: first off because the pointers wouldn’t make sense, and second because it needs a way to tell what’s what and make sense of the structure when it’s re-creating the object. It needs metadata, the same way that XML needs tags or JSON needs [] and {}. But within those limitations, it tries to use a more or less “raw” format.
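
If you want to look at that metadata, the standard library’s `pickletools` module can list the instructions inside a pickle. A small sketch, with protocol 4 chosen here only so the output is predictable across Python versions:

```python
import pickle
import pickletools

data = pickle.dumps([1, 2], protocol=4)

# The raw bytes: a short program, not a dump of the list's memory.
print(data)

# Collect the name of each opcode in the pickle stream.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(opcode_names)

# Human-readable disassembly of the same bytes.
pickletools.dis(data)
```

The stream reads roughly as "make an empty list, push these integers, append them, stop", which is the structural information the unpickler needs to rebuild the object.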

It has nothing to do with inheritance or classes. Every attribute, every element of a list or tuple, every key and value in a dictionary, etc. are “indirected” in this way.

This is a huge part of how Python implements dynamic typing. In languages where the type of every value has to be known up front (static typing), one of the major benefits is that the size of everything is calculated up front, and pointers or references (in the broad sense of “memory where the purpose of the value is to indicate the memory location of something else”) are only used where necessary for the desired structure of the data. But the downside is that the sizes have to be knowable up front, which either requires rules for inferring them, or declarations from the programmer. Languages like Python are much more flexible; the types and sizes aren’t known up front. The implementation just needs some way to represent “pointer to any kind of object Python understands” (the type will be figured out when the code runs, by following the pointer to the other memory and looking at what’s stored there). Then the implementation for e.g. a list can just reserve space for the pointers. This also allows a list to easily store mixed data, which ranges from difficult to basically impossible (it gets to the point where you are basically re-implementing the dynamic typing feature yourself) in many other languages.
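
A quick illustration of that last point, using made-up values: the list itself only holds references, and each referenced object knows its own type.

```python
inner = [4, 5]
mixed = [1, "two", 3.0, inner, None]

# Each element carries its own type; the list doesn't need to know it up front.
type_names = [type(item).__name__ for item in mixed]
print(type_names)

# The fourth slot refers to the very same list object as `inner`, not a copy,
# so a later change to `inner` is visible through `mixed`.
print(mixed[3] is inner)
inner.append(6)
print(mixed[3])
```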


Thanks, it’s a pretty apt analogy. I also didn’t realise that’s how a teleporter is supposed to work, so I’ve learned something new there too. I’m not convinced an actual teleporter (working as you described) would be able to reconstruct a person’s consciousness, so I suppose the analogy is especially appropriate if the un-pickled object is not a carbon copy of the original.


Yeah, most sci-fi works kinda handwave that and pretend that consciousness doesn’t matter. Although Stellaris does actually examine the distinction between “materialist” and “spiritualist” philosophies, with the materialist empires being willing to do things like “upload our brains into robots”, and the spiritualist ones believing that there is something fundamental about a person that cannot be transferred in that way. (Which means: Uploading your brain into a robot is really just constructing a replica of yourself and then dying.)

A pickle file is a tightly-encoded set of instructions for reconstructing an object, but it can never truly be the object. A lot of the time, that doesn’t matter (if you order “Tea, Earl Grey, hot”, you don’t care whether the steaming cup is the same object that was referenced or a duplicate); but one of the limitations of pickling is that you will never be able to refer back to the original object. Which makes a lot of sense when you think about saving objects for later, but it’s also important in multiprocessing contexts - for example, you won’t be able to pickle an empty list, send it to a subprocess, and have the subprocess append to that list in order to make it appear in the parent. That sort of thing DOES work with genuine shared objects (e.g. with threads, where everything’s all in a single interpreter), but not when you pickle-send-unpickle.
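
Here is a rough sketch of that difference without actually starting a subprocess. The function `worker_appends` is a hypothetical stand-in for the receiving end: it only gets pickled bytes, much as a separate process would.

```python
import pickle

shared = []


def worker_appends(pickled_list):
    # Stand-in for the subprocess: it rebuilds its own list from the bytes.
    local = pickle.loads(pickled_list)
    local.append("from worker")
    return local


worker_copy = worker_appends(pickle.dumps(shared))
print(shared)       # the original never saw the worker's append
print(worker_copy)

# A genuinely shared object (same interpreter, same object) behaves differently.
same_object = shared
same_object.append("from thread")
print(shared)
```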

> Thanks, it’s a pretty apt analogy. I also didn’t realise that’s how a
> teleporter is supposed to work, so I’ve learned something new there
> too.

It’s how Star Trek teleporters work. In some of Charlie Stross’ books
there are both A-gates and T-gates. A-gates (assembler) disassemble you
and reassemble a copy at the far end. T-gates (translate) are
space-time portals and you just step through them. By contrast, you
can duplicate things with an A-gate.

You can instantiate (unpickle) a pickled object many times, so pickling
is like an A-gate or a Star Trek teleporter.

> I’m not convinced an actual teleporter (working as you described) would
> be able to reconstruct a person’s consciousness,

That depends on where consciousness comes from and what your definition
of the same person is. Plenty of people are of the view that a
disassemble/reassemble kills the original and makes a new person. I’m
less convinced, but my definition of personhood is looser, presumably.

Pickling doesn’t inherently destroy the original though - it’s still
there.

(The disassemble phase in SF is to my mind born of two things: the idea
that there’s just one person who disappears here and reappears there,
and also/alternatively because there isn’t a nondestructive way to
analyse a person in sufficient detail to make a true duplicate, thus the
disassembly.)

Pickle gets to walk the object tree nondestructively.

> so I suppose the analogy is especially appropriate if the un-pickled
> object is not a carbon copy of the original.

I think it’s meant to be functionally equivalent. Obviously, if it
references other objects they won’t be the original referenced objects
from the source (pickling) environment.

Disclaimer: I don’t use pickle myself.

Cheers,
Cameron Simpson cs@cskk.id.au

Yeah, “functionally equivalent” covers everything up to, but not including, object identity. If, in the future, 3D scanning and printing technology becomes molecule-accurate, I could pickle the Mona Lisa, take home a USB stick with all the information, and print out a duplicate; but that’s still not THE Mona Lisa, it’s still a copy. Do you care? It all depends how important identity is. As long as your entire pickling task is done at once, it’s probably not a huge problem, as it’s quite capable of handling recursive structures; but if you have two objects that reference each other, you can’t pickle them independently and get a useful result at the other end.
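
A small sketch of the "all at once" versus "independently" distinction, using two plain lists that refer to each other:

```python
import pickle

a = []
b = [a]
a.append(b)  # now a refers to b and b refers to a

# Pickled together: the mutual references survive the round trip.
a2, b2 = pickle.loads(pickle.dumps((a, b)))
print(a2[0] is b2, b2[0] is a2)

# Pickled independently: each round trip builds its own private pair.
a3 = pickle.loads(pickle.dumps(a))
b3 = pickle.loads(pickle.dumps(b))
print(a3[0] is b3)     # a3's partner is not b3
print(a3[0][0] is a3)  # though each copy is still internally consistent
```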

Note, though, that there ARE some strategies for getting around this. When I designed a MUD server some years ago (okay, a lot of years ago now), I designed it so that all rooms would be identified by a simple text string like “/home/polly”, and instead of actually encoding the room data into any sort of saved data, it would write that out. (That can be done with __getstate__.) Then, on load, it would look at the room identifier, see if any such room existed, and if so, would reuse it - otherwise it would construct it fresh. (I’m not sure if that can be done with __setstate__ but in my case I kinda had a “room proxy” object so it worked.) That doesn’t really fit into the teleporter analogy, but it’s worth keeping in mind for when you run into pickle’s limitations.
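
For what it’s worth, the pickle module has a hook aimed at this kind of thing: persistent IDs, via `persistent_id` on a `Pickler` subclass and `persistent_load` on an `Unpickler` subclass. The sketch below is not the MUD server’s actual code; `Room`, `ROOMS`, `get_room`, `save` and `load` are all names invented here to show the idea of writing out only an identifier and reusing an existing room on load.

```python
import io
import pickle


class Room:
    def __init__(self, room_id, description):
        self.room_id = room_id
        self.description = description


ROOMS = {}  # hypothetical registry of live rooms, keyed by identifier


def get_room(room_id):
    # Reuse the room if it exists, otherwise construct it fresh.
    if room_id not in ROOMS:
        ROOMS[room_id] = Room(room_id, "freshly constructed")
    return ROOMS[room_id]


class RoomPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # For rooms, store only an identifier instead of the room's data.
        if isinstance(obj, Room):
            return ("Room", obj.room_id)
        return None  # everything else is pickled normally


class RoomUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        kind, room_id = pid
        if kind == "Room":
            return get_room(room_id)
        raise pickle.UnpicklingError(f"unsupported persistent id: {pid!r}")


def save(obj):
    buffer = io.BytesIO()
    RoomPickler(buffer).dump(obj)
    return buffer.getvalue()


def load(data):
    return RoomUnpickler(io.BytesIO(data)).load()


home = get_room("/home/polly")
player = {"name": "polly", "location": home}
data = save(player)

restored = load(data)
print(restored["location"] is home)  # the existing room was reused

ROOMS.clear()                        # simulate a restart with no rooms loaded
rebuilt = load(data)
print(rebuilt["location"].room_id)   # a fresh room with the same identifier
```

In this setup the room’s contents never go into the pickle at all, so identity is decided by the registry at load time rather than by the saved bytes.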