Why does Python have variable hoisting like Javascript?

maggyero · August 11, 2020, 9:32am

The following Python program A outputs 1, as expected, while the following Python program B raises an unbound local variable x error, counterintuitively.

Program A:

def f(): print(x)
x = 1
f()

Program B:

def f(): print(x); x = 2
x = 1
f()

Javascript has the exact same behaviour.

Program A:

function f() { console.log(x); }
let x = 1;
f();

Program B:

function f() { console.log(x); let x = 2; }
let x = 1;
f();

However, C++ outputs 1 in both cases, as expected.

Program A:

#include <iostream>
int x;
void f() { std::cout << x; }
int main() { x = 1; f(); return 0; }

Program B:

#include <iostream>
int x;
void f() { std::cout << x; int x = 2; }
int main() { x = 1; f(); return 0; }

So all programs A output 1. The differences in programs B between Python and Javascript on the one hand, and C++ on the other hand, result from their different scoping rules: in C++, the scope of a variable starts at its declaration, while in Python and Javascript, it starts at the beginning of the block where the variable is declared. Consequently, in C++ printing variable x in function f resolves to the value 1 of global variable x since it is the only variable in context at this point of execution. In Python and Javascript printing variable x in function f resolves to nothing and raises an unbound local variable x error since local variable x is already in context at this point of execution and therefore it masks global variable x without being bound yet to the value 2. This counterintuitive behaviour of Python and Javascript is also known as variable hoisting since it ‘hoists’ variable declarations (but not definitions) at the beginning of their blocks.

What was the rationale of the Python language designers for choosing variable hoisting?

uranusjr · August 11, 2020, 10:38am

What Python are you using? I cannot replicate this with any version of Python.

>>> def f(): print(x); x = 2
...
>>> x = 1
>>> f()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in f
UnboundLocalError: local variable 'x' referenced before assignment

$ cat script.py
def f(): print(x); x = 2
x = 1
f()

$ py script.py
Traceback (most recent call last):
  File "script.py", line 3, in <module>
    f()
  File "script.py", line 1, in f
    def f(): print(x); x = 2
UnboundLocalError: local variable 'x' referenced before assignment

steven.daprano · August 11, 2020, 10:55am

Hi Géry,

“However, C++ outputs 1 in both cases, as expected.”

Expected by who? Not me. Why does C++ behave so strangely?

"What was the rationale of the Python language designers for choosing variable hoisting?"

I have never heard of “variable hoisting” before, but if this is it:

"Variables declared using var are created before any code is executed in a process known as hoisting. Their initial value is undefined."

then Python doesn’t have it. It’s certainly not a term commonly used in Python. Python is not like Javascript, it doesn’t create variables before the code runs.

(Although some Python interpreters, not all, may sometimes allocate space for local variables at runtime, when the function object is created, before it is called.)

Your Python code:


def f():
    print(x)

x = 1
f()

Here x is a global variable. Inside the function f, the scope of the names “print” and “x” are both global, so when you call the function, the builtin print function and the global variable x are both found.


def f():
    print(x)
    x = 2

x = 1
f()

Here x is a local variable, and you try to print its value before x exists, so you get an UnboundLocalError exception.

When the compiler is compiling a function, read-only access to a name like “print” or “x” uses the global and builtin scopes. But if you assign a value to a name, then the compiler treats it as a local variable unless you declare it global.

So in your example above, the name “print” is read but not written to, so it is looked for in the global scope and the builtin print function is located and called.

The name “x” is written to (with the “x = 2” assignment) so the compiler treats it as a local variable. At lookup time, the variable doesn’t yet exist and so you get an exception.
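A minimal sketch of that compile-time decision, assuming CPython (the function names here are made up for illustration). A code object's co_varnames holds the names the compiler classified as local, and co_names holds the names it will look up in the global and builtin scopes:

```python
def reads_only():
    print(x)          # x is only read: looked up as a global/builtin name


def reads_then_assigns():
    print(x)          # x is assigned below, so this read targets the local x
    x = 2


# Neither function has been called; the classification happened at compile time.
local_names_a = reads_only.__code__.co_varnames
local_names_b = reads_then_assigns.__code__.co_varnames
global_names_a = reads_only.__code__.co_names
global_names_b = reads_then_assigns.__code__.co_names

print(local_names_a, global_names_a)
print(local_names_b, global_names_b)
```

In the second function x shows up among the locals and disappears from the global lookups, even though the assignment comes after the print.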

The global statement is a compiler directive: it tells the compiler to treat the name as a global variable, even if it otherwise would have been treated as a local.


def f():
    global x
    print(x)
    x = 2

x = 1
f()

will print 1 and then assign 2 to the global variable x.

You can google for “Python scoping rule LEGB” for more information:

https://duckduckgo.com/?q=python+scoping+LEGB+rule

Why does Python work this way? I don’t know, why does any language choose the scoping rules they choose?

Why does Lua default to having variables be global unless declared local? Why does Javascript have a separate global and module scope? Why does BASIC have only global variables? (1970s BASIC, not modern Visual Basic.)

People design their languages to work the way they want them to work. I imagine the same applied to Python: Guido chose the scoping rules because they were easy to implement, or similar to what ABC used, or because they solved a problem, or because he liked that rule and disliked more complicated rules, or something like that.

“This counterintuitive behaviour of Python”

Counter-intuitive to who? It is perfectly intuitive to me, and the C++ behaviour seems strange even after you explained it.

"and Javascript is also known as variable hoisting since it ‘hoists’ variable declarations (but not definitions) at the beginning of their blocks."

Python doesn’t have declarations, with the possible exception of the global and nonlocal statements. (I personally consider them to be more like compiler directives than a declaration.) But in any case, in both of your examples of Python code, there are no variable declarations, so there is nothing to be hoisted.

maggyero · August 11, 2020, 1:46pm

Hi @uranusjr! You actually perfectly replicated this since I said that I got an unbound local variable error.

uranusjr · August 11, 2020, 1:58pm

Then you’re misunderstanding hoisting. UnboundLocalError indicates a variable is not declared, which is the exact opposite of what hoisting does (move declarations to the beginning of a scope).

maggyero · August 11, 2020, 2:24pm

Hi @steven.daprano! Thanks for answering. It is true that Python name declarations are tied to variable definitions, they cannot be separated like in other languages. Except in this case, since you don’t get a NameError: name 'x' is not defined (so the name does exist!), but an UnboundLocalError: local variable 'x' referenced before assignment (the existing name is not bound yet to any object!). In other words, like Javascript, the name is already in the environment before its definition, so the name is effectively forward declared (“hoisted”). But I do not want to argue about terminology (the bottom line is that both Python and Javascript behave the same: they raise an unbound name error).

I have absolutely no opinion on variable hoisting. So I wanted to know its benefits and drawbacks compared to the more traditional scoping rule of C++ where a scope starts at the point of declaration. In other words, what is the rationale that pushed @guido to adopt it? Maybe variable hoisting is better for programmers because it prevents some subtle bugs. Maybe it is worse but there are technical constraints that make it easier to implement. Or maybe there is no special reason. I don’t know, but I am curious.

guido · August 11, 2020, 2:58pm

It’s the first time I’ve heard the term “variable hoisting” in this context, I have no opinion on whether that’s the right theoretical term, but I’ve never used it for Python.

There are many connected reasons here. We want to use special opcodes for locals that don’t use dict lookups. But also, consider this example:

def f(a):
    for i in a:
        if isprime(i): break
    return i

There’s a bug here if a is empty. We don’t want to return the value of an unrelated global variable i in this case. There are many other scenarios, some much simpler (just conditionally set a variable and then unconditionally use it).

Long and short, the set of local variables is defined by anything that may be assigned in a function, and for those, all references in that function’s scope return either the value of the local variable, or raise UnboundLocalError if it has no value.
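A runnable sketch of the example above, under a few assumptions: is_prime is a naive stand-in for the isprime helper, first_prime is a hypothetical name for f, and the module-level i is added only to show what the function must not pick up.

```python
def is_prime(n):
    """Naive primality test, good enough for a demonstration."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


i = "unrelated global"   # a module-level i that the function must not pick up


def first_prime(a):
    for i in a:
        if is_prime(i):
            break
    return i             # i is local to first_prime, even if the loop never ran


print(first_prime([4, 6, 7, 8]))   # the loop binds i, so this returns normally

try:
    first_prime([])
except UnboundLocalError as exc:
    outcome = f"raised: {exc}"
else:
    outcome = "returned the global i"
print(outcome)
```

With an empty argument the loop never binds i, and the call fails loudly instead of silently returning the module-level value.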

maggyero · August 11, 2020, 3:49pm

Thanks for answering @guido! We now have the rationale, right from the Architect =). The provided example is very interesting, that is indeed a subtle bug that would be hard to detect if Python did not raise an UnboundLocalError. That seems like a convincing argument in favor of “variable hoisting”.

brettcannon · August 11, 2020, 5:40pm

I will also say it simplifies the compiler. Name resolution is much easier if you don’t have to track what names have been exposed per line versus per-scope (technically Python doesn’t even use block scoping; it’s known as LNGB: local, non-local/closure, global, built-in and was actually LGB for a long time). So you can do a single pass on a chunk of code and know before you start emitting bytecode what variable names are assumed to come from what scope instead of having to look up per-line what variables are or are not known.
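One way to look at that per-scope classification without running the code is the standard-library symtable module. This is only a sketch: the source string is the thread's Program B, and the variable names are chosen for illustration.

```python
import symtable

source = """
def f():
    print(x)
    x = 2
"""

# Build the symbol tables for the module; nothing in `source` is executed.
module_table = symtable.symtable(source, "<sketch>", "exec")
f_table = module_table.get_children()[0]   # the symbol table for f's body

x_is_local = f_table.lookup("x").is_local()
print_is_global = f_table.lookup("print").is_global()

print("x local in f:", x_is_local)
print("print global in f:", print_is_global)
```

Each name gets one classification for the whole function body, which is the per-scope (not per-line) bookkeeping described above.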

guido · August 11, 2020, 7:24pm

I didn’t know those terms either.

Anyway, on my bike ride today I realized that the key here is that Python uses function scopes – unlike C++, any variable defined anywhere in a function has that whole function as its scope.

It’s one of those things that make Python simpler than C++.

maggyero · August 11, 2020, 7:29pm

So in addition to the benefits for Python users, this scoping rule is also beneficial for Python implementors. That makes it really compelling.

maggyero · August 11, 2020, 7:51pm

Exactly. The Wikipedia article Scope (computer science) has a well written paragraph on this topic (bold emphasis mine):

Scope can vary from as little as a single expression to as much as the entire program, with many possible gradations in between. The simplest scoping rule is global scope—all entities are visible throughout the entire program. The most basic modular scoping rule is two-level scoping, with a global scope anywhere in the program, and local scope within a function. More sophisticated modular programming allows a separate module scope, where names are visible within the module (private to the module) but not visible outside it. Within a function, some languages, such as C, allow block scope to restrict scope to a subset of a function; others, notably functional languages, allow expression scope, to restrict scope to a single expression. Other scopes include file scope (notably in C) which behaves similarly to module scope, and block scope outside of functions (notably in Perl).

A subtle issue is exactly when a scope begins and ends. In some languages, such as C, a name’s scope begins at its declaration, and thus different names declared within a given block can have different scopes. This requires declaring functions before use, though not necessarily defining them, and requires forward declaration in some cases, notably for mutual recursion. In other languages, such as JavaScript or Python, a name’s scope begins at the start of the relevant block (such as the start of a function), regardless of where it is defined, and all names within a given block have the same scope; in JavaScript this is known as variable hoisting. However, when the name is bound to a value varies, and behavior of in-context names that have undefined value differs: in Python use of undefined names yields a runtime error, while in JavaScript undefined names declared with var (but not names declared with let nor const) are usable throughout the function because they are bound to the value undefined.

steven.daprano · August 12, 2020, 1:17am

Géry:

"Except in this case, since you don’t get a NameError: name 'x' is not defined (so the name does exist!), but an UnboundLocalError: local variable 'x' referenced before assignment (the existing name is not bound yet to any object!)."

UnboundLocalError is a subclass of NameError. In older versions of Python, such as 1.5, NameError would be raised where today we raise UnboundLocalError.

The difference in description is for the benefit of the programmer, it does not reflect an essential difference between the two categories of error. In both cases, the local variable has no value bound to the name.
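A short sketch of what the subclass relationship means in practice, reusing Program B from the start of the thread (the variable caught is just for illustration): a handler written for NameError also catches the unbound-local case.

```python
def f():
    print(x)   # x is local to f because of the assignment below
    x = 2


x = 1

assert issubclass(UnboundLocalError, NameError)

try:
    f()
except NameError as exc:          # also catches UnboundLocalError
    caught = type(exc).__name__

print(caught)
```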

In the CPython interpreter, it happens to be that global variables live inside a dict as key-value pairs, and local variables live in boxed “slots” with a fixed address. But that’s not a language feature, it is an implementation detail, and I am confident that IronPython and Jython do not treat local variables this way. (I believe that they have, or at least had, locals live in a namespace like globals do.)

So even though CPython actually implements local variables as pre-allocated slots, the semantics of Python the language is that variables are names bound to values in a namespace (usually a dict). In the namespace model of variables, there are usually two states:

the variable doesn’t exist if there is no key:value pair;
the variable exists if the key:value pair exists.

Python the language mandates the namespace semantics for variables.

Unlike Javascript, there is no concept of a variable which is “undefined”:

> typeof(x)
undefined

whereas in Python, if you call type(x) you will get a NameError.

The CPython interpreter simulates namespace semantics for locals using boxed pre-allocated slots. (Globals are stored in a dict.) Since memory addresses cannot be literally empty, the slots have to use a special sentinel value that represents “this name is unbound”, equivalent to the key being missing in a dict. (Possibly a nil pointer?)

But that’s an implementation detail: a compliant Python interpreter could use a dict for locals, as CPython used to do, and as (I think) Jython and IronPython continue to do.

P.S. Am I the only one seeing large numbers of ^M characters at the end of, and between, every line in my replies? I don’t see them in other Discuss groups, is there somewhere I can report this as a bug?

maggyero · August 12, 2020, 8:44am

You are right. But I was actually referring to the error message, not the error type: NameError: name 'x' is not defined states that the name is not in context, while UnboundLocalError: local variable 'x' referenced before assignment states that the name is in context (but not bound to any value yet) and therefore shadows the global variable since the term “local variable” is mentioned.

If I understood correctly, you are saying that the local name x would not be in context yet at the printing statement and therefore the global name x (which is always in context) would have been printed like in C++ if CPython implemented local variables with a live dictionary instead of pre-allocated slots. But that would violate the whole function scope of assigned names (that I called “variable hoisting”) intended by @guido:

So it seems to me that if CPython implemented local variables with a dictionary, a similar sentinel value would have been set for the local name x (meaning “the name is in context but not bound to any value yet”) in order to respect whole function scope.

No, I see them too in your replies.

encukou · August 12, 2020, 9:57am

The kind of lookup (local, “enclosed”, global/built-in) is (and was, AFAIK) determined at compile time. For a local variable, Python only looks in the locals; if the name is not found, it doesn’t go on to check globals.
So, no sentinel was needed; it’s enough that the entry was not in the locals dict.

eryksun · August 12, 2020, 6:48pm

That depends. Loading a local variable in non-optimized code (e.g. module, class body, or exec) uses the LOAD_NAME opcode. For example:

>>> dis.dis(compile('x = 0; x', '', 'exec'))
  1           0 LOAD_CONST               0 (0)
              2 STORE_NAME               0 (x)
              4 LOAD_NAME                0 (x)
              6 POP_TOP
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE

LOAD_NAME falls back on globals and builtins, which makes it easy to temporarily shadow a global or builtin name. For example:

>>> exec(r'''
... abs = 'temporary shadow'
... print(abs)
... del abs
... print(abs)
... ''')
temporary shadow
<built-in function abs>

This won’t work in optimized code (e.g. a function created by the def statement), which uses the LOAD_FAST opcode to load a local variable that’s in the fast locals array, which does not fall back on globals and builtins. For example:

>>> def f():
...     abs = 'temp'
...     del abs
...     abs
... 
>>> dis.dis(f)
  2           0 LOAD_CONST               1 ('temp')
              2 STORE_FAST               0 (abs)

  3           4 DELETE_FAST              0 (abs)

  4           6 LOAD_FAST                0 (abs)
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
>>> f()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in f
UnboundLocalError: local variable 'abs' referenced before assignment

In optimized code, one has to explicitly reset back to a known global or builtin reference, such as abs = builtins.abs.
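A minimal sketch of that explicit reset, reusing the function from the example above (the return value is added only so the result can be checked):

```python
import builtins


def f():
    abs = 'temp'          # shadows the builtin inside f
    del abs               # the local is unbound again; no fallback to builtins
    abs = builtins.abs    # rebind the local explicitly
    return abs(-3)


result = f()
print(result)
```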

steven.daprano · August 12, 2020, 11:26pm

Guido van Rossum:

def f(a):
    for i in a:
        if isprime(i): break
    return i

“There’s a bug here if a is empty. We don’t want to return the value
of an unrelated global variable i in this case. There are many other
scenarios, some much simpler (just conditionally set a variable and then
unconditionally use it).”

That’s not the only bug: I assume the idea is to return the first
prime number found in the list, but if none of the items are prime it
returns the last composite found. But that’s beside the point.

I see that this behaviour would be undesirable in this case, but we
manage to cope with that behaviour for globals and builtins. If we move
the loop to the global level, we can easily see that sort of shadowing:

for id in customer_ids:
    process(id)
    if condition: break

print("last id processed", id)

As Eryk Sun pointed out in another comment, that sort of temporary
shadowing might even be considered a useful feature.

The language specification for name resolution is, in my opinion, a
little inaccurate:

https://docs.python.org/3/reference/executionmodel.html#naming-and-binding

For example, it fails to distinguish between the global and built-in
scopes properly (at one point, it implies that there are only two
scopes, local and global; at another it states that module-level names
and builtin names are in a single scope). So I think the docs could be
improved.

But if we take CPython as the reference implementation, and the
behaviour here as normative, I think the intended behaviour is clear:

if a name is bound within a function, it is a local;
name lookups for locals only search the local (function) scope;
otherwise, name lookups search the global (module) and builtin scopes.

(I’m not touching nonlocal and class scope names for brevity.)

So I think that the CPython behaviour which allows temporary shadowing
at the module level, but not in the function level, is intentional. Is
that correct?

Guido:

“Long and short, the set of local variables is defined by anything that
may be assigned in a function, and for those, all references in that
function’s scope return either the value of the local variable, or raise
UnboundLocalError if it has no value.”

I remember the introduction of UnboundLocalError, but not why a
subclass was used instead of just using NameError with a better error
message. Do you recall the reason for using a subclass for missing
locals but not for missing nonlocals?

steven.daprano · August 13, 2020, 12:38am

Géry Ogam, referring to UnboundLocalError being a subclass of NameError:

“You are right. But I was actually referring to the error message, not the error type: NameError: name 'x' is not defined states that the name is not in context, while UnboundLocalError: local variable 'x' referenced before assignment states that the name is in context (but not bound to any value yet) and therefore shadows the global variable since the term “local variable” is mentioned.”

I think you are reading too much into minor differences of error messages.

Error messages are not part of the language definition and can reflect implementation idiosyncrasies. They can also be changed at any time. In this case, we could easily change the messages to be:

NameError: name 'x' referenced before assignment

UnboundLocalError: local name 'x' is not defined

and the error messages would be still correct. If a name is referenced before assignment, it is not defined, and vice versa.

Géry:

“If I understood correctly, you are saying that the local name x would not be in context yet at the printing statement and therefore the global name x (which is always in context) would have been printed like in C++ if CPython implemented local variables with a live dictionary instead of pre-allocated slots.”

No, that is not what I am saying.

Using a dict as a namespace, names are keys. If the key is missing, you know that the name is not present in the namespace, which means it is not defined in that namespace. “Undefined name” is synonymous with “missing key”.

In the case of function locals, you can immediately raise an exception: the language specification mandates that for function local variables, only the local namespace is searched.

At module level, the rule is different: if the name is not found in the module namespace, go on to look up in the builtin namespace as well, and only raise if the name is missing there as well.

In CPython’s case, the compiler doesn’t use a dict for local variables. (I believe that in 1.x and 2.x, it would use a dict for locals if you used an exec or wildcard import inside the function.) Instead, CPython allocates a fixed slot for each local variable. That fixed slot exists whether the variable is defined or not.

Other languages which use fixed memory locations for variables include Pascal and C. In Pascal, you can declare a variable but fail to give it a value, in which case the Pascal compiler will allocate a memory location for the variable and happily use whatever random bits happen to be at that memory location. I think C may be similar.

CPython doesn’t do that: it initialises the function local slots with a special sentinel value (possibly a nil pointer?) that tells the interpreter to treat it as undefined. So for function locals using slots, “undefined name” is synonymous with “slot exists but the contents are the special sentinel” rather than a missing key, but otherwise the two cases are effectively the same: the variable isn’t defined.
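A small sketch of the two views, assuming CPython (the names before, after and slot_names are made up for illustration). The code object already lists a slot for x, yet the namespace reported by locals() has no x entry until the assignment has run:

```python
def f():
    before = dict(locals())   # x has a slot already, but no value yet
    x = 2
    after = dict(locals())
    return before, after


before, after = f()
slot_names = f.__code__.co_varnames   # names with a pre-allocated slot

print(slot_names)
print("x" in before, "x" in after)
```

Seen through locals(), the unbound slot looks just like a missing key, which matches the point that the two cases are effectively the same.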

holdenweb · August 17, 2020, 8:04pm

One might wish the Javascript designers had chosen a better name, since for many programmers the word hoisting is inextricably bound up with loop optimizations. I know, I’m old …

EvanCarroll · May 13, 2021, 6:51am

It’s clear to me that, given the above example, it should not compile, or should otherwise not be allowed to execute. This is how Perl and JavaScript work.

"use strict";
function f(a) { for ( let i in a ) {}  return i }
f([1,2,3])

And for Perl,

use strict;
sub f { for my $i (@_) {} $i };
f(1,2,3)

Note that in the case of Perl, it’s caught by the compiler. In the case of JavaScript, the function can’t be executed. The more subtle bug caused by not scoping to control flow is this one,

a = [1,2,3]
b = []

def fn(i):
  return i

for i in a:
  err = fn(i)
  break

print( f'Got error {err}' )

# 1000 lines later

for i in b:
  err = fn(i)
  break

print( f'Got error {err}' )

if err:
  exit()

That will run totally fine in Python, creating a potentially subtle bug. I’ve always wondered if you regretted this design decision. You’ve also always been pretty open about your regrets (I remember you doing a presentation on this). JavaScript has moved away from var. It’s now considered an anti-pattern that’s been replaced by let. Modern languages (like Rust and Go) also scope lexically to the block where control flow structures create a block.

It seems conceptually easier, cleaner, and less error-prone, with the one caveat that such a scheme requires a keyword to declare the variable.