Hi Chris,
The short answer to your question:
"So my question is, instead of creating a new value 100 in memory,
assigning the address of this value to variable spam, and then finally
have the garbage collector delete the value 42 later, why not just
overwrite the value 42 with the new value 100 and leave the address in
spam variable unchanged?"
is that your plan works fine for values that don’t take much memory,
like the integer 42, but not if your value is a ten megabyte string, or
a big complex object with dozens of fields, some of which are themselves
big complex objects with dozens of fields. In those cases, your plan
would involve copying huge amounts of data on every operation, slowing
down every program. Copying an individual byte is fast, but when you
have to copy millions of bytes, it all adds up and becomes slow.
So Python’s tactic avoids unnecessary copying, speeding things up a lot.
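To make that concrete, here’s a quick sketch you can run. The `is` operator tests whether two names refer to the very same object:

```python
big = "x" * 10_000_000   # roughly a ten megabyte string
spam = big               # effectively instantaneous: copies a reference, not 10 MB

print(spam is big)       # True: both names refer to one and the same object
```

No matter how big the string gets, the assignment `spam = big` takes the same tiny amount of time, because only the reference is copied.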
That’s the short answer. The long answer requires a detour to the
history of programming, and a comparison of two major models for storing
values in programming languages.
The first, oldest model comes from the way computer memory works in
hardware: the “variables are a box” model.
In this model, we can imagine every variable is a box sitting at a
particular location. The location of the box is fixed and cannot change,
and the box always contains something. It cannot be empty, but it can
contain “garbage” that doesn’t mean anything.
In this model, a line of code like:
x = 123
can be understood like this:
- choose a box, let’s say the box at location 936782, and call it “x”
- stuff the value 123 into that box
The compiler or interpreter keeps track of which box is called what and
the programmer doesn’t have to.
The oldest programming languages, and those that operate closest to the
hardware, work like this. So for example, if you program in Fortran,
C, assembly language, etc., your language operates with a model like that.
One consequence of this model is that any time you make an assignment,
you have to make a copy of the value:
y = x
has to copy the contents of box “x” (the memory location 936782) and
stuff it into box “y”. That’s fine if the box is tiny and contains
something small. Copying one, two or four bytes of memory is close
enough to instantaneous. But what if “x” contains ten megabytes of data,
or a hundred?
In classical programming languages like Fortran, C, etc., the way around
this is to introduce a level of indirection: the “pointer”. Instead of
talking about x and y directly, you have new variables that point to x
and y and you work with those most of the time. The programmer now
needs to keep track of which pointer variable refers to which actual
value; the compiler can’t help with that. Given some values:
x = ten megabytes of data
y = pointer to x
both x and y are “boxes” at a numbered address, and the compiler can
track that for you, but it can’t help you remember that y refers to x.
As far as it is concerned, y is just a box with the value 936782 (the
address of x).
As a programmer, what you do is write all your code to operate on y
rather than x directly. That means the compiler doesn’t have to copy the
full 10MB of x on every access, which would be slow; it just has to copy
the pointer, which is only a handful of bytes, and that’s fast. Success!
But that means that all of your code has to be written to deal
with indirect references to the actual data you care about. You
actually care about the data in the box “x” but you can’t ever talk
about x, you have to always refer to that data through y.
# Never this:
add one to x
# Always this:
add one to the value that y points to
and heaven help you if you mess up:
add one to y
because that’s usually (but not always!) a disaster. Most programmers
find using pointers difficult to reason about and hard to get right.
It’s tedious to use, hard to teach, hard to learn, some people never
get it, and bugs tend to lead to uncontrollable software crashes which
villains can use to execute code and take over your computer.
Computers are good at doing tedious jobs, they don’t get bored or
confused. Using pointers from the “numbered boxes” model of variable
access is hard, but it’s the sort of thing that computers excel at, so
language designers automated all that pointer stuff.
And that automated control of pointers is what we mean when we talk
about Python variables being “references to values”.
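One visible consequence of that reference model: two names can refer to the same object, so a change made through one name shows up through the other. For example:

```python
a = [1, 2, 3]
b = a              # copies the reference only; no second list is created
b.append(4)        # modify the list through the name b

print(a)           # [1, 2, 3, 4] -- a sees the change, it's the same list
print(a is b)      # True
```

This surprises newcomers coming from the “variables are a box” model, but it falls straight out of names being references to objects rather than boxes holding copies.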
(It’s not just Python: most modern or “high level” languages use this
model, such as JavaScript and Ruby. Even Java uses a mixed model, with
numbered boxes for native machine data types and “references” for
objects and “boxed values”.)
So in Python, under the hood the old “numbered box” model still applies,
because that’s how machine code and assembly language operate. But all
the tedious and hard work of dealing with pointers, or “references”, is
handled by the interpreter. As a Python programmer, you almost never
need to care that
x = 123
actually means that x is a reference to a numbered box containing 123,
you just talk about x and the interpreter does the rest.
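If you ever do want to peek under the hood, the built-in id() gives you each object’s identity; in CPython (though this is an implementation detail, not a language guarantee) it happens to be the object’s memory address. A quick sketch:

```python
x = 123
box_before = id(x)   # identity of the object 123
x = 124              # rebinds the name x to a different object

box_after = id(x)    # the 123 object was not overwritten in place
print(box_before != box_after)  # True: x now refers to a new object
```

Which is exactly the behaviour from the original question: reassignment rebinds the name to a new object, and the old one is left for the garbage collector.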