Storing references in variables

chris1109873 · December 26, 2020, 6:05am

Hi, I am reading the book automate the boring stuff, 2nd edition and I came across the following example which describes how values and references work in python:

>>> spam = 42
>>> cheese = spam
>>> spam = 100
>>> spam
100
>>> cheese
42

" When you assign 42 to the spam variable, you are actually creating the 42 value in the computer’s memory and storing a reference to it in the spam variable. When you copy the value in spam and assign it to the variable cheese, you are actually copying the reference. Both the spam and cheese variables refer to the 42 value in the computer’s memory.

When you later change the value in spam to 100, you’re creating a new 100 value and storing a reference to it in spam. This doesn’t affect the value in cheese. Integers are immutable values that don’t change; changing the spam variable is actually making it refer to a completely different value in memory."

So my question is, instead of creating a new value 100 in memory, assigning the address of this value to variable spam, and then finally have the garbage collector delete the value 42 later, why not just overwrite the value 42 with the new value 100 and leave the address in spam variable unchanged?

steven.daprano · December 26, 2020, 11:53am

Hi Chris,

The short answer to your question:

"So my question is, instead of creating a new value 100 in memory,

assigning the address of this value to variable spam, and then finally

have the garbage collector delete the value 42 later, why not just

overwrite the value 42 with the new value 100 and leave the address in

spam variable unchanged?"

is that your plan works fine for values that don’t take much memory,

like an integer 42, but not if your value is a ten megabye string, or a

big complex object with dozens of fields, some of which are themselves

big complex objects with dozens of fields. In those cases, your plan

would involving copying huge amounts of data on every operation, slowing

down every program. Copying an individual byte is fast, but when you

have to copy millions of them bytes, it all adds up and becomes slow.

So Python’s tactic avoids unnecessary copying, speeding things up a lot.

That’s the short answer. The long answer requires a detour to the

history of programming, and a comparison of two major models for storing

values in programming languages.

The first, oldest model comes from the way computer memory works in

hardware: the “variables are a box” model.

In this model, we can imagine every variable is a box sitting at a

particular location. The location of the box is fixed and cannot change,

and the box always contains something. It cannot be empty, but it can

contain “garbage” that doesn’t mean anything.

In this model, a line of code like:

x = 123

can be understand like this:

choose a box, let’s say the box at location 936782, and call it “x”
stuff the value 123 into that box

The compiler or interpeter keeps track of which box is called what and

the programmer doesn’t have to.

The oldest programming languages, and those that operate closest to the

hardware, operate like this. So for example, if you program in Fortran,

C, assembly language etc your language operates with a model like that.

One consequence of this model is that any time you make an assignment,

you have to make a copy of the value:

y = x

has to copy the contents of box “x” (the memory location 936782) and

stuff it into box “y”. That’s fine if the box is tiny and contains

something small. Copying one, two or four bytes of memory is close

enough to instantaneous. But what if “x” contains ten megabytes of data,

or a hundred?

In classical programming languages like Fortran, C etc, the way around

this is to introduce a level of indirection, the “pointer”. Instead of

talking about x and y directly, you have new variables that point to x

and y and you work with those most of the time. The programmer now

needs to keep track of which pointer variable refers to which actual

value, the compiler can’t help with that. Given some values:

x = ten megabyes of data

y = pointer to x

both x and y are “boxes” at a numbered address, and the compiler can

track that for you, but it can’t help you remember that y refers to x.

As far as it is concerned, y is just a box with the value 936782 (the

address of x).

As a programmer, what you do is write all your code to operate on y

rather than x directly. That means the compiler doesn’t have to copy the

full 10MB of x every access, which would be slow, it just has to copy

the pointer, which is only a handful of bytes, and that’s fast. Success!

But that means that all of your code has to be written to deal

with indirect references to the actual data you care about. You

actually care about the data in the box “x” but you can’t ever talk

about x, you have to always refer to that data through y.

# Never this:

add one to x



# Always this:

add one to the value that y points to

and heaven help you if you mess up:

add one to y

because that’s usually (but not always!) a disaster. Most programmers

find using pointers difficult to reason about and hard to get right.

It’s tedious to use, hard to teach, hard to learn, some people never

get it, and bugs tend to lead to uncontrollable software crashes which

villains can use to execute code and take over your computer.

Computers are good at doing tedious jobs, they don’t get bored or

confused. Using pointers from the “numbered boxes” model of variable

access is hard but its the sort of thing that computers excel at, so

language designers automated all that pointer stuff.

And that automated control of pointers is what we mean when we talk

about Python variables being “references to values”.

(It’s not just Python: most modern or “high level” languages use this

model, such as Javascript and Ruby. Even Java uses a mixed model, with

numbered boxes for native machine data types and “references” for

objects and “boxed values”.)

So in Python, under the hood the old “numbered box” model still applies,

because that’s how machine code and assembly language operates. But all

the tedious and hard stuff working with pointers, or “references”, is

handled by the interpreter. As a Python programmer, you almost never

need to care that

x = 123

actually means that x is a reference to a numbered box containing 123,

you just talk about x and the interpreter does the rest.

ncoghlan · December 29, 2020, 12:49am

If your specific concern is “Isn’t creating and destroying all those extra objects slow?”, the answer is “Yes, it can be”, but compiler and interpreter authors attempt to reduce those costs without needing to worry users with the details.

For example, small integers are super common and don’t need much memory, so Python interpreters typically pre-create many of them and hand out references to those shared copies rather than making multiple instances.

And when constants are used in arithmetic, a compiler will often try to work out the result at compile time (e.g. “1+1” is compiled as “2”) to reduce the work needed at runtime.

chris1109873 · December 29, 2020, 5:19pm

Got it. Thanks.

cben · December 31, 2020, 1:01pm

An implicit point left unsaid is that we’d not want cheese to become 100 as well, right?!
That would expose the underlying pointer (aka reference) behavior, making you worry every time you mutate a number whether anything other shares the same address and would change inadvertently.

But those concerns do exist in Python! Lists, dictionaries, and many other objects are mutable:

>>> toast = ["tomato", "olive"]
>>> salad = toast  # point to same address
>>> toast.append("cheese")
>>> salad
['tomato', 'olive', 'cheese']

Oh, no, I intended only vegetables in my salad .

The trouble with mutable objects is that every time you pass them around — put one in a variable, in a list, pass one to a function, take one from a list, receive one from a function, … — you need to make a decision: do you want a reference to the same thing (shared fate in the future) or a copy that go a separate way.
In the above example, I made the wrong decision, I should have made a copy:

>>> toast = ["tomato", "olive"]
>>> # list() is what we call a constructor, creates a new object.
>>> salad = list(toast)
>>> toast.append("cheese")
>>> salad
['tomato', 'olive']

Immutable value behavior is simpler; it doesn’t matter whether you made a copy (cheese = int(spam)) or didn’t, because neither object will ever change.
(This idea is sometimes called “mutable objects have identity”. Python has is operator to check that.)

So, why did Python give lists, dicts, etc. mutable behavior?

Performance: lists can be huge. So Python made you consciously decide when to make a copy…
(Hint: don’t be afraid, computers can copy ten megabytes in a blink! Write simple, correct code, don’t optimize until you’ve proven a particular part is too slow.)
It’s convenient to build up lists and dictionaries by changing them.
Sometimes you want helper functions to automate a bunch of changes, and that works because you pass the reference to the list to the function: random.shuffle(salad).

A related problem is when reading code, how do you tell where a copy is made?
In mixed-model languages like C or Java this is nasty, because sometimes a = b meant copy the value, sometimes share the reference .
Here is where Python really shines IMHO: the language always deals with references, copying is always an explicit action.
You’ll encounter many ways copying is spelled, may depend on the type of the object. list(a_list), copy.copy(a_list), a_list[:], a_dictionary.copy()… But it’s always visible !

So all that business about spam = 100 allocating a new box and garbage-collecting the old one ? It’s not really about numbers.

As Steven explained, different languages took 2 approaches to implementing value behavior — each variable is a box vs references — and if properly hidden you won’t know the difference.
But when you need mutable behavior, you will know the difference, and there Python made the important choice that all copying is explicit.

Allocating a new number box is just what it took to be consistent with that choice. And as Nick mentioned Python sometimes reuses number boxes, because the behavior doesn’t change (unless you check too closely with is operator. For values, is doesn’t matter, stay with ==).

AProudCat · December 31, 2020, 1:13pm

I am new to Python but I would say it’s like pointers in C.

AProudCat · December 31, 2020, 1:23pm

As a newbie to Python, but an experienced programmer, it’s pretty easy to figure it out.

spam = 42
A new variable called spam is created. The value is 42
cheese = spam
A new variable called cheese is copied with the value 42
spam = 100
The variable spam is erased and is created anew with different value.
spam
100
Bingo
cheese
42
Bingo

encukou · December 31, 2020, 1:59pm

That might be right or wrong, depending on what the term “variable” means to you.

I recommend Ned Batchelder’s presentation on the topic: Python Names and Values | Ned Batchelder
(You’ll notice he uses the term “name” rather than “variable”, which makes it much easier to talk with people who know “variables” in various languages.)

AProudCat · December 31, 2020, 2:14pm

Perhaps. Like I said I am new to Python. But as I have stated, I am an expert programmer.

The results explain themselves.

100
42

AProudCat · December 31, 2020, 2:36pm

How do I delete posts? I made a drunken dumbass posts. I wanna remove them.

encukou · December 31, 2020, 3:22pm

Click the ⋯ below the post, then the garbage can ()

chris1109873 · January 1, 2021, 5:29pm

@ Beni Cherniavsky-Paskin

“An implicit point left unsaid is that we’d not want cheese to become 100 as well, right?! That would expose the underlying pointer (aka reference) behavior, making you worry every time you mutate a number whether anything other shares the same address and would change inadvertently.”

Ah, yes. This right here is the answer I was looking for. I am wondering how I missed something so obvious though. Thank you and Happy New Year.