Python 3.13: Four times faster or three times slower?
Introduction
Python 3.13 will be available in an optional build without the Global Interpreter Lock (GIL). As a heavy user of the Python ecosystem, I’m looking forward to experimenting with it. In theory, this change opens up significant opportunities: faster execution, a smaller memory footprint, lower latency when switching between threads, and better communication between threads. It offers true parallelism and improved performance for multi-threaded, multi-core applications. Are we there yet?
When I saw Simon Willison’s post on this topic reporting a 4x improvement, I decided to try it myself.
Setup:
I’m currently using Ubuntu 24.04 (6.8.0-38-generic) along with pyenv. If you’re on the same setup, replicating this should be straightforward. First, navigate to the directory where you’ll be working. Then, install the pyenv plugin that allows us to install two versions of Python with the same name:
git clone https://github.com/pyenv/pyenv.git
cd pyenv/plugins/python-build
./install.sh
Then install Python with the GIL:
python-build 3.13-dev ~/.pyenv/versions/3.13-dev_gil
And Python without the GIL:
PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13-dev
Now we have everything, let’s do some tests!
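Before running any benchmark, it can help to confirm which of the two interpreters is actually active. The sketch below is one way to do that. The helper name describe_gil_state is made up for illustration, and the fallback for interpreters older than 3.13 is an assumption. A free-threaded build can still run with the GIL re-enabled at runtime, so the build flag and the runtime state are reported separately.

```python
import sys
import sysconfig


def describe_gil_state():
    """Report whether this interpreter was built without the GIL and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    built_without_gil = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists from Python 3.13 on; assume the GIL is on for older versions.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_active = check() if check is not None else True
    return {"built_without_gil": built_without_gil, "gil_active": gil_active}


if __name__ == "__main__":
    print(describe_gil_state())
```

Running this with each interpreter in turn should make it obvious if pyenv picked the wrong one.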
Test 1:
I decided to start by replicating Simon’s code from here:
import argparse
import time
from concurrent.futures import ThreadPoolExecutor
import sysconfig

print("Py_GIL_DISABLED", sysconfig.get_config_var("Py_GIL_DISABLED"))


def cpu_bound_task(n):
    """A CPU-bound task that computes the sum of squares up to n."""
    return sum(i * i for i in range(n))


def main():
    parser = argparse.ArgumentParser(description="Run a CPU-bound task with threads")
    parser.add_argument("--threads", type=int, default=4, help="Number of threads")
    parser.add_argument("--tasks", type=int, default=10, help="Number of tasks")
    parser.add_argument(
        "--size", type=int, default=5000000, help="Task size (n for sum of squares)"
    )
    args = parser.parse_args()
    print(f"Running {args.tasks} tasks of size {args.size} with {args.threads} threads")
    start_time = time.time()
    with ThreadPoolExecutor(max_workers=args.threads) as executor:
        list(executor.map(cpu_bound_task, [args.size] * args.tasks))
    end_time = time.time()
    duration = end_time - start_time
    print(f"Time with threads: {duration:.2f} seconds")


if __name__ == "__main__":
    main()
As we expected:
Test 2:
Next, I moved on to something that uses threading directly.
import threading
import time
import sysconfig

print("Py_GIL_DISABLED", sysconfig.get_config_var("Py_GIL_DISABLED"))


# Function to multiply a submatrix
# written by LLM, we don't care if it is wrong
def matrix_multiply(A, B, C, start_row, end_row):
    num_cols_B = len(B[0])
    for i in range(start_row, end_row):
        for j in range(num_cols_B):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(len(A[0])))


# Generate a rows x cols matrix filled with ones (the values are not random)
def generate_matrix(rows, cols):
    return [[1 for _ in range(cols)] for _ in range(rows)]


# Matrix dimensions
N = 500
A = generate_matrix(N, N)
B = generate_matrix(N, N)
C = [[0 for _ in range(N)] for _ in range(N)]

# Number of threads
num_threads = 10
threads = []

# Calculate range for each thread; the last thread takes any leftover rows
rows_per_thread = N // num_threads
start_time = time.time()
for i in range(num_threads):
    start_row = i * rows_per_thread
    end_row = (i + 1) * rows_per_thread if i != num_threads - 1 else N
    thread = threading.Thread(target=matrix_multiply, args=(A, B, C, start_row, end_row))
    threads.append(thread)
    thread.start()

# Wait for all threads to finish before stopping the clock
for thread in threads:
    thread.join()
end_time = time.time()
print(f"Execution time: {end_time - start_time} seconds")
The result is:
As you can see, the no-GIL build is three times slower. The problem is in the line sum(A[i][k] * B[k][j] for k in range(len(A[0]))), where every thread reads the shared global lists A and B. If you replace it with sum(1 for k in range(len(A[0]))), the no-GIL build will be six times faster!
So, there are no free improvements if you just turn off the GIL. You need to know how to work with no-GIL Python properly.
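To reproduce this comparison on your own machine, a small harness along the following lines may help. It is only a sketch: the names kernel_shared, kernel_no_reads, and time_kernel are invented for illustration, the default matrix size and thread count are arbitrary, and time.perf_counter is used here instead of time.time. Both kernels give the same result only because the matrices are filled with ones.

```python
import threading
import time


def kernel_shared(A, B, C, start_row, end_row):
    # Original inner loop: every thread reads elements of the shared lists A and B.
    for i in range(start_row, end_row):
        for j in range(len(B[0])):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(len(A[0])))


def kernel_no_reads(A, B, C, start_row, end_row):
    # Same loop shape, but the per-element reads A[i][k] and B[k][j] are gone.
    for i in range(start_row, end_row):
        for j in range(len(B[0])):
            C[i][j] = sum(1 for k in range(len(A[0])))


def time_kernel(kernel, n=200, num_threads=4):
    """Run `kernel` on n x n matrices of ones split across threads; return (seconds, C)."""
    A = [[1] * n for _ in range(n)]
    B = [[1] * n for _ in range(n)]
    C = [[0] * n for _ in range(n)]
    rows = n // num_threads
    threads = []
    start = time.perf_counter()
    for t in range(num_threads):
        lo = t * rows
        hi = (t + 1) * rows if t != num_threads - 1 else n  # last thread takes the remainder
        th = threading.Thread(target=kernel, args=(A, B, C, lo, hi))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - start, C


if __name__ == "__main__":
    for kernel in (kernel_shared, kernel_no_reads):
        seconds, _ = time_kernel(kernel)
        print(f"{kernel.__name__}: {seconds:.2f} s")
```

Run the script once with each interpreter and compare the two timings per build. The absolute numbers will depend on your hardware.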
UPD 14.07.2024:
Although the point of the post was to show that simply recompiling Python won’t give you the benefits of the missing lock right away, people started to suggest ways to fix my (LLM’s) code. Here are the results of running Joe Yearsley’s suggestions:
As expected, it is possible to optimize memory access, and this gives us roughly a 1.5x boost.
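For illustration only, here is one way memory access could be reduced. It is not necessarily what was suggested or measured above. The sketch assumes that giving each thread its own copies of the inputs cuts down on access to shared objects. The names matrix_multiply_local and multiply_threaded are hypothetical.

```python
import threading


def matrix_multiply_local(A, B, C, start_row, end_row):
    # Assumption: thread-private copies reduce contention on the shared lists.
    cols_B = [list(col) for col in zip(*B)]  # private, transposed copy of B
    for i in range(start_row, end_row):
        row_A = list(A[i])  # private copy of the current row of A
        # Build the whole output row locally, then store it with a single write.
        C[i] = [sum(a * b for a, b in zip(row_A, col)) for col in cols_B]


def multiply_threaded(A, B, num_threads=4):
    """Multiply A by B, splitting the rows of A across threads."""
    n = len(A)
    C = [None] * n
    rows = n // num_threads
    threads = []
    for t in range(num_threads):
        lo = t * rows
        hi = (t + 1) * rows if t != num_threads - 1 else n  # last thread takes the remainder
        th = threading.Thread(target=matrix_multiply_local, args=(A, B, C, lo, hi))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()
    return C


if __name__ == "__main__":
    print(multiply_threaded([[1, 2], [3, 4]], [[5, 6], [7, 8]], num_threads=2))
```

Each thread pays for its own transposed copy of B, so whether this wins overall depends on the matrix size and the number of threads. It is worth timing on both builds before drawing conclusions.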
Conclusions:
Python without the GIL is still in its early days and shows great promise, but it’s essential to be cautious and thoroughly test everything for your specific setup and problem.