Python performance myths and fairy tales [LWN.net]


Antonio Cuni, who is a longtime Python performance engineer and PyPy
developer, gave a presentation at EuroPython 2025 about “Myths and fairy
tales around Python performance” on the first day of the conference in
Prague. As might be guessed from the title, he thinks that much of the
conventional wisdom about Python performance is misleading at best. With
lots of examples, he showed where the real problems that he sees lie. He
has come to the conclusion that memory management will ultimately limit
what can be done about Python performance, but he has an early-stage
project called SPy that might be a way toward a super-fast Python.

He started by asking the audience to raise their hands if they thought
“Python is slow or not fast enough”; lots of hands went up, which
was rather different from when he gave the presentation at PyCon Italy,
where almost no one raised their hand. “Very different audience”, he
said with a smile. He has been working on Python performance for many
years, has talked with many Python developers, and has heard some persistent
myths, which he would like to try to dispel.

Myths

The first is that “Python is not slow”; based on the raised hands,
though, he thought that most attendees already knew that was a myth. These
days, he hears developers say that Python speed doesn’t really matter,
because it is a glue language; “nowadays only the GPU matters”, so
Python is fast enough. Python is fast enough for some tasks, he
said, which is why there are so many people using it and attending
conferences like EuroPython.

There is a set of programs where Python is fast enough, but that set does
not hold all of the Python programs in use; it is only a subset. The
programs that need more Python performance are what is driving all of the
different efforts to optimize the interpreter, but they are also causing
developers to constantly work to improve the performance of their
programs, often by using Cython, Numba, and the like.


In his slides, he represented the two sets as circles, with “programs
where Python is fast enough” fully inside “Python programs”; he then added
the set of “all
possible programs” fully encompassing the other two. In his ideal world,
all possible programs would be able to be written with Python; currently,
programs that need all of the performance of the processor cannot use
Python. He would like to see the inner circles grow so that Python can
be used in more programs.

The corollary of the “it’s just a glue language” statement is that you
“just need to rewrite the hot parts in C/C++”, though that is a little out
of date; “nowadays they say that we should rewrite it in Rust”.
That is “not completely false”; it is a good technique to speed up
your code, but soon enough it will “hit a wall”. The Pareto principle
(described with a slide created by ChatGPT for unclear reasons) says that
80% of the time will be spent in 20% of the code. So optimizing that 20%
will help.

But the program will then run into Amdahl’s law, which says that the
overall improvement from optimizing one part of the code is limited by the
fraction of the run time that was spent in that part; “what was the hot
part now is very very fast and then you need to optimize everything
else”. He showed a diagram where some
inner() function was taking 80% of the time; if that gets reduced
to, say, 10% of what it was, the rest of the program now dominates the run
time.
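
The arithmetic behind that diagram can be sketched in a few lines. This is only an illustration of Amdahl's law using the 80% and 10x figures from the example above; the function name overall_speedup() is made up here and did not appear in the talk.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the run
    time is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# inner() takes 80% of the run time and is made 10x faster.
speedup = overall_speedup(0.80, 10.0)

# Even an infinitely fast inner() is capped by the untouched 20%.
ceiling = overall_speedup(0.80, float("inf"))

# Share of the new run time spent outside inner():
# 0.20 units of the original time out of 0.20 + 0.08 remaining.
rest_share = 0.20 / (0.20 + 0.80 / 10.0)

print(f"speedup: {speedup:.2f}x, ceiling: {ceiling:.2f}x, "
      f"rest of program: {rest_share:.0%} of run time")
```

Making the hot function ten times faster yields roughly a 3.6x gain for the whole program, and the code that used to account for 20% of the run time now accounts for about 71% of it.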

Another “myth” is that Python is slow because it is interpreted; again,
there is some truth to that, but interpretation is only a small part of
what makes Python slow. He gave the example of a simple Python
expression:

    p.x * 2

A compiler for C/C++/Rust could turn that kind of expression into three
operations: load the value of x, multiply it by two, and then
store the result. In Python, however, there is a long list of operations
that have to be performed, starting with finding the type of p,
calling its __getattribute__() method, through unboxing
p.x and 2, to finally boxing the result, which requires
memory allocation. None of that is dependent on whether Python is
interpreted or not; those steps are required based on the language
semantics.
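
Some of those steps can be spelled out by hand in Python itself. The following is a simplified sketch, not what CPython literally executes; the class P is a made-up stand-in, and the unboxing and boxing happen inside int.__mul__() in C, so they appear here only as comments.

```python
class P:
    def __init__(self, x):
        self.x = x


p = P(21)

# Roughly what `p.x * 2` means under the language semantics:
tp = type(p)                       # 1. find the type of p
x = tp.__getattribute__(p, "x")    # 2. attribute lookup goes through the type
result = type(x).__mul__(x, 2)     # 3. dispatch "*" on the type of x;
                                   #    int.__mul__ unboxes both operands,
                                   #    multiplies, and allocates a new int
if result is NotImplemented:       # 4. otherwise try the reflected operation
    result = type(2).__rmul__(2, x)

assert result == p.x * 2
```

Every one of those lookups can be redirected by user code, which is why a compiler cannot simply skip them.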

Static types

Now people are using static types in Python, so he hears people say that
compilers for the language can now skip past all of those steps and
simply do the operation directly. He put up an example:

    def add(x: int, y: int) -> int:
        return x + y

    print(add(2, 3))

But static typing is not enforced at run time, so there are various ways to
call add() with non-integers, for example:

    print(add('hello ', 'world')) # type: ignore

That is perfectly valid code and the type-checker is happy because of the
comment, but string addition is not the same as integer addition. The
static types “are completely useless from the point of view of
optimization and performance”. Beyond that, the following is legal
Python too:

    class MyClass:
        def __add__(self, other):
            ...

    def foo(x: MyClass, y: MyClass) -> MyClass:
        return x + y

    del MyClass.__add__

“Static compilation of Python is problematic because everything can
change”, he said.
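
Both points can be checked with a short script. It is a sketch assembled from the snippets above; the "added" return value and the variable names are placeholders chosen here for illustration.

```python
def add(x: int, y: int) -> int:
    return x + y


# Annotations are just metadata stored on the function object;
# nothing checks them when the function is called.
assert add.__annotations__["x"] is int
joined = add("hello ", "world")   # str + str
merged = add([1], [2])            # list + list


class MyClass:
    def __add__(self, other):
        return "added"            # placeholder body for the demonstration


def foo(x: MyClass, y: MyClass) -> MyClass:
    return x + y


a, b = MyClass(), MyClass()
before = foo(a, b)                # works: MyClass.__add__ exists

del MyClass.__add__               # the class changes after foo() was defined
try:
    foo(a, b)
    after = "still works"
except TypeError:
    after = "TypeError"

print(joined, merged, before, after)
```

The same foo() with the same annotated arguments succeeds on one call and fails on the next, so the annotations alone tell a compiler nothing it can rely on.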

So, maybe, “a JIT compiler can solve all of your problems”; they can
go a long way toward making Python, or any dynamic language, faster, Cuni
said. But that leads to “a more subtle problem”. He put up a slide
with a trilemma triangle: a dynamic language, speed, or a simple
implementation. You can have two of those, but not all three.

Python has historically favored a dynamic, simply implemented language, but
it is moving toward a dynamic, fast language with projects like the CPython
JIT compiler. That loses the simple implementation, but he does not have to
care “because there are people in the front row doing it for me”, he said
with a grin.

In practice, though, it becomes hard to predict performance with a JIT.
Based on his experience with PyPy, and as a consultant improving Python
performance for customers, it is necessary to think about what the JIT will
do in order to get the best performance. That is a complex and error-prone
process; he found situations where he was “unable to trigger
optimizations in PyPy’s compiler because the code was too complicated”.

All of this leads to what he calls “optimization chasing”. It
starts with a slow program that gets its fast path optimized, which results
in a faster program and everyone is happy. Then they start to rely on that
extra speed, which can suddenly disappear with a seemingly unrelated change
somewhere in the program. His favorite example is a program that was
running on PyPy (using Python 2) and suddenly got 10x slower; it turned out
that a Unicode key was being used in a dictionary of strings, which
led the JIT to de-optimize the code so that everything got much
slower.

Dynamic

He put up some code that did not really do anything exciting or useful, he
said, but did demonstrate some of the problems that Python compilers
encounter:

    import numpy as np

    N = 10

    def calc(v: np.ndarray[float], k: float) -> float:
        return (v * k).sum() + N

The compiler really can assume nothing from that code. Seemingly, it
imports NumPy in the usual way, the
calc() function multiplies each element of the v array by
k, adds them all up with sum() and then adds the constant
N to that. First off, the import may not bring in NumPy
at all; there could be some import hook somewhere that does something
completely unexpected. N cannot be assumed to be ten, because
that could be changed elsewhere in the code; as with the earlier
add() function, the type declarations on calc() are not
ironclad either.
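
The point about N is easy to demonstrate. The sketch below substitutes a plain list and the built-in sum() for NumPy so that it stays self-contained; that substitution is an assumption made here, not part of the talk.

```python
N = 10


def calc(values, k):
    # N is looked up in the module's globals on every call,
    # not captured when the function is defined.
    return sum(v * k for v in values) + N


first = calc([1.0, 2.0], 2.0)    # (2.0 + 4.0) + 10

N = 100                          # any code with access to the module can rebind it
second = calc([1.0, 2.0], 2.0)   # (2.0 + 4.0) + 100

print(first, second)
```

A compiler that folded N into the function as the constant 10 would return the wrong answer for the second call.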

But, in almost all cases, that code would do exactly what it looks like it
does. Developers rarely do these kinds of things that the language would
allow, but the
gap between the way programmers normally write Python and the definition of
the language is what “makes life complicated for the interpreter”.
In practice, a lot of what Python allows does not actually happen.

“It is the extremely dynamic nature of the language that makes it slow,
but at the same time it’s what makes Python very nice”. The
dynamic features are not needed 99% of the time, Cuni said, but “in that
1% are what you need to make Python awesome”. Libraries often use
patterns that rely on the dynamic nature of the language in order to make
APIs “that end users can use nicely”, so those features cannot simply
be removed.
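
One common pattern of that kind is attribute access that is resolved at run time. The Namespace class below is a hypothetical helper written for this illustration, not taken from the talk or from any particular library.

```python
class Namespace:
    """Hypothetical helper: expose dictionary keys as attributes."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so the set of
        # "attributes" depends entirely on run-time data.
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


cfg = Namespace({"host": "localhost", "port": 8080})
print(cfg.host, cfg.port)
```

The result is pleasant to use, but a compiler looking at cfg.port has no static way to know where the value comes from.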

Game

The “compiler game” was up next; he progressively showed some code snippets
to point out how little a compiler can actually “know” about the code.
This code might seem like it should give an error of some sort:

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    def foo(p: Point):
        assert isinstance(p, Point)
        print(p.name) # ???

Inside foo(), the compiler knows that p is a
Point, which has no name attribute. But, of course,
Python is a dynamic language:

    def bar():
        p = Point(1, 2)
        p.name = "P0"
        foo(p)

Meanwhile, here is an example where the compiler cannot even assume that
the method exists:

    import random

    class Evil:
        if random.random() > 0.5:
            def hello(self):
                print('hello world')

    Evil().hello() # 🤷🏻‍♂️

Legal Python, but “this is not something to define in production, I
hope”, he said with a laugh. “Half of the time it still works, half
of the time it raises an exception. Good luck compiling it.”

In another example, he showed a function:

    def foo():
        p = Person('Alice', 16)
        print(p.name, p.age)
        assert isinstance(p, Person) # <<<

The Person class was not shown (yet), but there was an empty class
(just “pass”) called Student. In this case, the
assert will fail, because of the definition of Person:

    class Person:
        def __new__(cls, name, age):
            if age < 18:
                p = object.__new__(Student)
            else:
                p = object.__new__(Person)
            p.name = name
            p.age = age
            return p

“You can have a class with a dunder-new [i.e. __new__()], which
returns something which is unrelated and is not an instance of the class.
Good luck optimizing that.”

The final entrant in the game was the following:

    N = 10

    @magic
    def foo():
       return N

He “de-sugared” the @magic decorator and added some assertions:

    def foo():
       return N

    bar = magic(foo)

    assert foo.__code__ == bar.__code__
    assert bar.__module__ == '__main__'
    assert bar.__closure__ is None

    assert foo() == 10
    assert bar() == 20 # 🤯😱

The code objects for foo() and bar() are the same, but
they give different results. As might be guessed, the value of N
has been changed by magic(); the code is as follows:

    import types

    def rebind_globals(func, newglobals):
        newfunc = types.FunctionType(
            func.__code__,
            newglobals,
            func.__name__,
            func.__defaults__,
            func.__closure__)
        newfunc.__module__ = func.__module__
        return newfunc

    def magic(fn):
        return rebind_globals(fn, {'N': 20})

That returns a version of the function (foo() was passed) that has
a different view of the values of the global variables. That may seem like
a far-fetched example, but he wrote code much like that for the pdb++
Python debugger many years ago. “I claim I had good reason to do
that”, he said with a chuckle.

Abstraction

So there are parts of the language that need to be accounted for, as he
showed in the game, but there is a more fundamental problem: “in Python,
abstractions are not free”. When code is written, developers want
performance, but they also want the code to be understandable and
maintainable. That comes at a cost. He started with a simple function:

    def algo(points: list[tuple[float, float]]):
        res = 0
        for x, y in points:
            res += x**2 * y + 10
        return res

It takes a list of points, each represented as a tuple of floating-point
numbers, and performs a calculation using them.
Then he factored out the calculation into its own function:

    def fn(x, y):
        return x**2 * y + 10

That is already slower than the original, because there
is overhead for calling a function: the function has to be looked up, a
frame object has to be created, and so on. A JIT compiler can help, but it
will still have more overhead. He took things one step further by
switching to a Point data class:

    from dataclasses import dataclass

    @dataclass
    class Point:
        x: float
        y: float

    def fn(p):
        return p.x**2 * p.y + 10

    def algo(items: list[Point]):
        res = 0
        for p in items:
            res += fn(p)
        return res

That, of course, slows it down even further. This is a contrived example,
Cuni said, but the idea is that every abstraction has a cost, “and then
you end up with a program that is very slow”. It was an example of
what he calls “Python to Python” abstraction, where the code is being
refactored strictly within the language.
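
Readers who want to measure the effect themselves could use something like the following timeit sketch. The function names, data size, and repeat count are arbitrary choices made here; the absolute numbers depend on the machine and the interpreter, so none are quoted.

```python
import timeit
from dataclasses import dataclass


def algo_inline(points):
    res = 0
    for x, y in points:
        res += x**2 * y + 10
    return res


def fn_xy(x, y):
    return x**2 * y + 10


def algo_call(points):
    res = 0
    for x, y in points:
        res += fn_xy(x, y)        # extra function call per element
    return res


@dataclass
class Point:
    x: float
    y: float


def fn_point(p):
    return p.x**2 * p.y + 10      # extra attribute lookups per element


def algo_dataclass(items):
    res = 0
    for p in items:
        res += fn_point(p)
    return res


tuples = [(float(i), float(i + 1)) for i in range(10_000)]
objects = [Point(x, y) for x, y in tuples]

variants = [
    ("inline", algo_inline, tuples),
    ("function call", algo_call, tuples),
    ("dataclass", algo_dataclass, objects),
]
for name, func, data in variants:
    seconds = timeit.timeit(lambda: func(data), number=20)
    print(f"{name:15s} {seconds:.4f} s")
```

All three variants compute the same result; only the amount of indirection per element differs.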

A “Python to C” abstraction, where the hot parts of the code are factored
out into C or some other compiled language, also suffers from added costs.
One could imagine that Python implementations get more and more
optimizations such that the list of Point objects is represented
in a simple linear array of floating-point numbers, without boxing, but if
fn() is written for Python’s C API, those numbers will need to be
boxed and unboxed (in both directions), which is completely wasted work.
It is “unavoidable with the current C API”. One of the ways to
speed up programs that were running under PyPy was to remove the C code and
perform the calculations directly in Python, which PyPy could optimize
well.

An elephant

There is an elephant in the room, however, with regard to Python
performance, though it is one he rarely hears about: memory management. In
today’s hardware, “computation is very cheap”, but memory is the
bottleneck. If the data is in a cache at any level, accessing it is
inexpensive, but RAM accesses are quite slow. “Generally speaking, if
you want to have very very good performance, we should avoid cache misses
as much as possible.”

But Python is prone to having a memory layout that is cache-unfriendly. He
showed a simple example:

    class Person:
        def __init__(self, name, age):
            self.name = name
            self.age = age

    p = [Person('Alice', 16), Person('Bob', 21)]

Each Person has two fields, which ideally would be placed together
in memory, and the two objects in the list would also be placed together,
for a cache-friendly layout. In practice, though, those objects are all
scattered throughout memory; he showed a visualization from
Python Tutor. Each arrow represented a pointer that needed to be
followed, thus a potential cache miss; there were nearly a dozen arrows for
this simple data structure.
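
The indirection can be glimpsed from within Python. The sketch below relies on a CPython implementation detail (id() returns the object's memory address) and uses the standard array module as a contrast; neither was part of the talk.

```python
import sys
from array import array


class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age


people = [Person("Alice", 16), Person("Bob", 21)]

# The list holds pointers to Person objects, which in turn hold pointers
# to separately allocated str and int objects.
for p in people:
    print(hex(id(p)), hex(id(p.name)), hex(id(p.age)))

ages_boxed = [float(p.age) for p in people]   # list of pointers to float objects
ages_packed = array("d", ages_boxed)          # one contiguous buffer of C doubles

print("bytes per boxed float object:", sys.getsizeof(ages_boxed[0]))
print("bytes per packed element:   ", ages_packed.itemsize)
```

The packed array is the kind of layout that a cache likes, but it only works because every element is known to be a C double, which is exactly the guarantee that ordinary Python objects do not give.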

“This is something you cannot just solve with a JIT compiler; it’s
impossible to solve it without changing semantics.” Python is
inherently cache-unfriendly, he said, “and I honestly don’t know how to
solve this problem”. His “sad truth” conclusion is that “Python
cannot be super-fast” without breaking compatibility. Some of the
dynamic features (“let’s call it craziness”) he had described in the talk
will eventually hamper performance improvements. “If we want to keep this
craziness, well, we have to leave some performance on the table.”

His next slide was “The end”, complete with emojis of sadness (“😢💔🥹”),
which is where he ended the talk when he gave it at PyCon Italy a year
earlier. This time, though, he wanted to “give a little hope” so he
added a question mark, then reiterated that without breaking
compatibility Python could not get super-fast.

He has a proposal for the community if it decides that Python should try to
reach top-level performance, which he hopes the community does, but
“it’s fine to say ‘no’”. He suggests tweaking the language
semantics by keeping the dynamic features where they are actually useful,
perhaps by limiting the kinds of dynamic changes that can be made to
specific points in time, so that compilers can depend on certain behavior
and structure. “Not to allow the world to change at any point in
time as it is now.”

Meanwhile, the type system should be revamped with an eye on performance.
Currently, the types are optional and not enforced, so they cannot be
used for optimizations. The intent would be that performance-oriented code
could be written in Python, not in some other language called from Python.
But, for cases where calling another language is still desirable, the
extra cost (e.g. boxing) of doing so should be removed. “Most
importantly, we want something which stays Pythonic, because we like this
language or we wouldn’t be here.”

Cuni said that he has a potential solution, “which is not to make Python
faster”, because he claims that is not possible. SPy, which stands for
“Static Python”, is a project he started a few years ago to address the
performance problems. All of the standard disclaimers apply to SPy: it is
“a work in progress, research and development, [and] we don’t know where it
will go”. The best information can be found on the GitHub page linked
above or in his talk on SPy at PyCon Italy in late May.

He showed a quick demo of doing realtime edge detection from a camera; it
ran in the browser using PyScript. The demo shows the raw camera feed on
the left and, at first, edge detection being run in NumPy on the right;
NumPy achieves fewer than two frames per second (fps). Switching to a
SPy-based edge-detection algorithm makes the right-hand image keep up with
the camera, running at around 60fps. The code for the demo is available on
GitHub as well.

He recommended the SPy repository and its issue tracker in particular for
interested attendees; some issues have been tagged as “good first issue”
and “help wanted”. There is also a Discord server for
chatting about the project. Before too long, a video of the talk should
appear on the EuroPython YouTube channel.

[I would like to thank the Linux Foundation, LWN’s travel sponsor, for
travel assistance to Prague for EuroPython.]






