Community Article

The GIL: What It Actually Blocks

The GIL blocks parallel Python bytecode and nothing else. I/O, well-behaved C extensions, and multiprocessing all sidestep it. Here is how I tell GIL-bound code from code that just feels slow.

py-gil
concurrency
performance
py-multiprocessing
fundamentals
freyadiallo

By @freyadiallo

March 14, 2026

Updated May 20, 2026

I once spent two days convinced a Python service was GIL-bound. The service was processing image uploads, ran on a 16-core box, and CPU sat at about 100% (one core's worth), never higher. I added threads. Throughput barely moved. I swapped them for concurrent.futures.ThreadPoolExecutor. Throughput barely moved. The actual cause was that the underlying image library did all its work in a single C call that held the GIL the entire time. My Python code was not the problem; the C extension was. By the time I figured this out, I had a much sharper understanding of what the GIL actually does and, more importantly, what it does not.

The argument I want to make is that the GIL has a precise job description, and most internet writing about it gets this wrong. It is not a global slowdown. It is not a sign of bad design. It is a serialisation point on a specific operation: executing Python bytecode. Everything else, including blocking I/O, native code that releases the lock, and work in other processes, sidesteps it. Once you internalise that, you can tell GIL-bound code from "feels slow" code in about thirty seconds, and you stop reaching for multiprocessing reflexively.

What the GIL is, in one paragraph

The Global Interpreter Lock is a mutex inside CPython (the reference implementation) that ensures only one OS thread executes Python bytecode at a time. The lock exists primarily to make CPython's reference counting safe under multi-threading without requiring atomic operations on every Python object. Threads still run; they just take turns holding the lock. The interpreter hands the lock over periodically (after about 5 ms by default, when another thread is waiting for it) and during certain operations (most notably, blocking I/O calls and well-behaved C extensions).

GIL state machine
  thread holds GIL  -->  runs Python bytecode
  releases GIL on  -->  - blocking I/O syscall
                        - well-behaved C extension call
                        - voluntary release every ~5ms (sys.setswitchinterval)
  another thread acquires GIL --> resumes from where it stopped

That is the entire mechanism. Every confusion about the GIL traces back to one of two things: (a) what counts as "running Python bytecode", or (b) the assumption that a slow Python program is automatically GIL-bound. Both are answerable with a profiler.
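
The switch interval in the diagram can be read and changed at runtime. Here is a minimal sketch using only sys.getswitchinterval and sys.setswitchinterval from the standard library; the 1 ms value is an arbitrary illustration, not a recommendation.

```python
import sys

# How long a thread may keep the GIL once another thread is waiting for it.
default_interval = sys.getswitchinterval()
print(f"default switch interval: {default_interval * 1000:.1f} ms")

# A shorter interval makes threads hand the GIL over more often (more
# responsive, more switching overhead). It does not add any parallelism.
sys.setswitchinterval(0.001)
shortened_interval = sys.getswitchinterval()
print(f"shortened switch interval: {shortened_interval * 1000:.1f} ms")

# Restore the original value so the rest of the program is unaffected.
sys.setswitchinterval(default_interval)
```

Tuning this number changes how the turns are shared out, not how many threads run bytecode at once, so it is rarely the fix for a throughput problem.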

What the GIL actually blocks

The GIL blocks parallel execution of Python bytecode across threads. If you have two threads, both running pure-Python code (loops, attribute access, arithmetic on Python ints, list comprehensions), they will not both make progress simultaneously. The OS scheduler will switch between them, and one of them will hold the GIL while the other waits.

import threading

def cpu_bound_python():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

# two threads, but they take turns under the GIL
t1 = threading.Thread(target=cpu_bound_python)
t2 = threading.Thread(target=cpu_bound_python)
t1.start(); t2.start()
t1.join(); t2.join()
# wall time is roughly 2x single-threaded; no parallelism gain

This is the textbook GIL case. Two threads doing pure Python compute. The GIL serialises them. Wall-clock time is roughly twice the single-threaded time, not half.

What the GIL does NOT block

This is the part most articles get wrong. A thread that is not executing Python bytecode, and has released the GIL, runs in parallel with everything else.

Blocking I/O. When a thread calls socket.recv(), requests.get(), time.sleep(), or anything else that issues a blocking syscall, it releases the GIL. Other threads acquire the GIL and run while it waits. For an I/O-bound workload, threads scale almost linearly until you saturate the I/O.

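from concurrent.futures import ThreadPoolExecutor
import requests  # third-party: pip install requests

urls = ["https://example.com/"] * 20  # placeholder URLs; substitute your own

# this scales: each thread is in I/O, GIL is released
def fetch(url):
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=20) as ex:
    results = list(ex.map(fetch, urls))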
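
The same effect can be seen without a network. In this small sketch, fake_io is a made-up stand-in for a blocking call; it uses time.sleep, which releases the GIL while it waits, so twenty 0.1-second waits overlap instead of queueing.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for a blocking call; time.sleep releases the GIL while waiting."""
    time.sleep(delay)
    return delay

delays = [0.1] * 20

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as ex:
    results = list(ex.map(fake_io, delays))
elapsed = time.perf_counter() - start

# Run one after another, these waits would take about 2 seconds in total.
print(f"{len(results)} waits finished in {elapsed:.2f}s")
```

On an idle machine the elapsed time should land close to a single wait rather than the sum of all twenty.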

Well-behaved C extensions. A C extension that does serious work without touching Python objects can release the GIL with Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS. NumPy, pandas, scipy, most image libraries, ML frameworks, and database drivers do this for many of their hot paths, though not every library or every call does, so it is worth checking rather than assuming. Two threads running NumPy matrix multiplications can saturate two cores; the work is in C, the GIL is released for the duration.

import numpy as np
# this scales: numpy releases the GIL during the matmul
def compute(a, b):
    return a @ b  # large matrix mul, ~all time in C
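
If NumPy is not installed, the standard library offers a comparable case: hashlib releases the GIL while hashing large inputs. The sketch below is an illustration under that assumption, with arbitrary buffer sizes; it hashes four buffers sequentially and then on four threads so the two timings can be compared.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

# Four 16 MiB buffers; hashlib releases the GIL while hashing large inputs.
buffers = [bytes([i]) * (16 * 1024 * 1024) for i in range(4)]

def digest(buf):
    return hashlib.sha256(buf).hexdigest()

start = time.perf_counter()
sequential = [digest(buf) for buf in buffers]
sequential_s = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    threaded = list(ex.map(digest, buffers))
threaded_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.3f}s  threaded: {threaded_s:.3f}s")
```

On a machine with several free cores, the threaded run is usually noticeably shorter, which is the signature of native code that lets go of the lock.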

Multiprocessing. Each process has its own GIL. multiprocessing.Pool and concurrent.futures.ProcessPoolExecutor give you actual parallelism for CPU-bound Python by spawning subprocesses. The cost is process startup overhead, IPC for moving arguments and results, and the inability to share Python objects directly. For genuinely CPU-bound Python work, this is the standard answer (until PEP 703).
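
A minimal sketch of that pattern follows. The helper names (sum_squares, split_range, parallel_sum_squares) and the executor_cls parameter are my own illustration, not part of any library; executor_cls exists only so the same code can be run on threads for comparison.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def sum_squares(bounds):
    """CPU-bound pure Python: sum of i*i for i in [start, stop)."""
    start, stop = bounds
    total = 0
    for i in range(start, stop):
        total += i * i
    return total

def split_range(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) chunks."""
    step = max(1, -(-n // parts))  # ceiling division, never zero
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]

def parallel_sum_squares(n, workers=4, executor_cls=ProcessPoolExecutor):
    """Fan the chunks out to workers and add up the partial sums."""
    with executor_cls(max_workers=workers) as ex:
        return sum(ex.map(sum_squares, split_range(n, workers)))
```

Call parallel_sum_squares(10_000_000) from under an if __name__ == "__main__": guard in a script, because worker processes may re-import the main module. Passing executor_cls=ThreadPoolExecutor runs the identical work on threads, which on a GIL build should show little or no gain for this pure-Python loop.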

How to tell what your code actually is

The single test I run before reaching for any concurrency tool: profile the code single-threaded and ask, "is the bottleneck Python bytecode, native code, or I/O?" Three different bottlenecks, three different fixes.

import cProfile
import pstats

with cProfile.Profile() as p:
    run_workload()
stats = pstats.Stats(p).sort_stats('cumulative')
stats.print_stats(30)

Read the top of the profile. If most time is in your Python functions and Python builtins, it is CPU-bound Python. If most time is in requests/socket/open/anything ending in read or write, it is I/O-bound. If most time is in a single C extension call (a NumPy op, a pandas op, a database driver), it is native-bound.
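
Reading the top of the profile can be roughed out in code. This is a sketch, assuming that cProfile's convention of recording built-in and C functions under the filename "~" is a good enough proxy for "not Python bytecode"; split_profile and busy_python are hypothetical names. Note that blocking I/O calls implemented in C land in the same bucket as native compute, so the split is a first cut rather than a verdict.

```python
import cProfile
import pstats

def split_profile(profiler):
    """Return (python_seconds, native_seconds) of own time from a finished profile."""
    stats = pstats.Stats(profiler)
    python_time = native_time = 0.0
    # stats.stats maps (filename, lineno, name) -> (cc, nc, tottime, cumtime, callers)
    for (filename, _lineno, _name), (_cc, _nc, tottime, _ct, _callers) in stats.stats.items():
        if filename == "~":  # cProfile's marker for built-in / C functions
            native_time += tottime
        else:
            python_time += tottime
    return python_time, native_time

def busy_python():
    total = 0
    for i in range(200_000):
        total += i * i
    return total

with cProfile.Profile() as profiler:
    busy_python()                             # pure-Python loop
    sorted(range(200_000), key=lambda x: -x)  # built-in sort plus a Python key function

python_s, native_s = split_profile(profiler)
print(f"own time in Python functions: {python_s:.3f}s, in built-in/C functions: {native_s:.3f}s")
```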

Bottleneck                                   | Right tool
I/O-bound                                    | ThreadPoolExecutor, asyncio
Native-bound (well-behaved extension)        | ThreadPoolExecutor, threads
CPU-bound Python                             | ProcessPoolExecutor, multiprocessing, or PEP 703 free-threaded build
Native-bound (bad extension that holds GIL)  | Switch the extension or move the call to a subprocess

The last row is what I hit with the image library. The C extension was doing serious work but never released the GIL, so threading bought me nothing. multiprocessing worked but was overkill; the better fix was switching to a different image library that released the GIL during compute.

When cProfile is too heavy for the situation (a running production process I cannot restart, a workload that misbehaves only under real traffic), I reach for py-spy instead. It is a sampling profiler that attaches to a live PID and reads its stacks from outside, without modifying or restarting the target process. For this kind of diagnosis I use two commands: py-spy dump --pid <pid>, which prints every thread's current stack and marks the thread that holds the GIL, and py-spy record --pid <pid> --gil -o profile.svg, which writes a flame graph built only from samples taken while a thread held the GIL. If the GIL holder is almost always the same Python frame, that frame is your contention point. If the holder rotates rapidly across threads doing the same work, the GIL is just being shared normally and your bottleneck is elsewhere.

A quick way to measure GIL contention

There is a simple diagnostic when you suspect threading is being serialised by the GIL. Run the workload once on a single thread and time it, then run N copies of it on N threads and time that. If the N copies finish in roughly the time one copy took (an N-fold throughput gain), the workload was I/O-bound or native-bound and the GIL was being released. If they take roughly N times as long, or longer still, the GIL is serialising the work. The slower-than-serial case is informative: thread context switches and lock handoffs cost something even with no parallelism gain, so contended GIL workloads sometimes run worse than running the copies back to back.

import time
import threading

def workload(): ...  # the thing you suspect

# baseline
start = time.perf_counter()
workload()
base = time.perf_counter() - start

# threaded
threads = [threading.Thread(target=workload) for _ in range(4)]
start = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
threaded = time.perf_counter() - start

print(f"speedup: {(base * 4) / threaded:.2f}x")

A speedup near 4 means the GIL is not in the way (this work parallelises well on threads). A speedup near 1 means the GIL is serialising every thread. Anything in between is partial: some of the time is in C code releasing the GIL, some is in pure Python serialising on it. The number tells you which fix to reach for next.

PEP 703 and the free-threaded interpreter

This is the part of the story that is changing. PEP 703 specifies an experimental build of CPython without the GIL (called "free-threaded" or python3.13t). It uses biased reference counting, deferred reference counting on the call stack, and a per-object lock for objects that need it. The GIL goes away; threads execute Python bytecode in parallel.

Two things to know about it. First, it is opt-in: you have to build (or download) a special interpreter. The default python3.13 and python3.14 still have the GIL. Second, there is a real single-threaded performance cost: the CPython documentation put it at about 40% on the pyperformance suite for 3.13, falling to single-digit percentages in 3.14 as optimisations landed. Library compatibility is the bigger story: any C extension that relied on the GIL for thread-safety has to be updated. NumPy, pandas, and the major scientific libraries are all working through this.

If you are experimenting with this, the runtime check I run first is sys._is_gil_enabled(), which Python 3.13+ exposes so code can see whether the GIL is actually active in the running process. The free-threaded build is shipped under the binary name python3.13t (the t stands for free-threaded), and sys.version includes a "free-threading build" marker so you can confirm from inside the process. There is also an environment variable, PYTHON_GIL. The free-threaded build runs with the GIL off by default but turns it back on if you import a C extension that has not declared free-threading support; PYTHON_GIL=0 (or -X gil=0) keeps it off regardless. If you are profiling, set it explicitly rather than relying on whatever your imports happen to trigger.

import sys
print(sys.version)              # free-threaded builds mention "free-threading build"
print(sys._is_gil_enabled())    # False means the GIL is off, True means on (3.13+)

# shell, not Python: keep the GIL off even if an imported extension would re-enable it
#   PYTHON_GIL=0 python3.13t my_workload.py
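
For code that has to run on older interpreters too, a more defensive version of the same check may be useful. This is a sketch; gil_status is a hypothetical helper name. It relies on sysconfig.get_config_var("Py_GIL_DISABLED") to detect a free-threaded build and treats a missing sys._is_gil_enabled as "GIL on", which is the case for every release before 3.13.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Set to 1 on free-threaded builds; 0 or None on default builds.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always have the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if check is None else bool(check())
    return free_threaded_build, gil_enabled

free_threaded, gil_on = gil_status()
print(f"free-threaded build: {free_threaded}, GIL currently enabled: {gil_on}")
```

The two values can differ: a free-threaded build with the GIL re-enabled reports (True, True), which is exactly the situation worth catching before trusting a benchmark.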

The practical advice I give anyone trying free-threaded Python on an existing codebase: run your test suite under it before you trust any benchmark numbers. Most thread-safety bugs in third-party packages surface as flaky tests, not crashes, so I run pytest -p no:cacheprovider --forked (the --forked flag comes from the pytest-forked plugin) to isolate each test in its own process. That way a crash or corrupted state in one test cannot contaminate the others while I hunt for dependency bugs that the GIL was previously hiding. A clean test run on the free-threaded build is the bar I set before benchmarking; otherwise the numbers are measuring a workload that has data races, not one that is correctly parallel.
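
Alongside the existing suite, a small stress helper can make latent races more likely to show up. The following is a sketch of one way to do it, not an established recipe: hammer and Counter are hypothetical names, and the thread and iteration counts are arbitrary.

```python
import threading

def hammer(fn, n_threads=8, iterations=1_000):
    """Call fn() from many threads at once; return any exceptions raised."""
    barrier = threading.Barrier(n_threads)
    errors = []

    def worker():
        barrier.wait()  # line the threads up so they start together
        try:
            for _ in range(iterations):
                fn()
        except Exception as exc:  # collected for the caller to inspect
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

class Counter:
    """Hypothetical shared state, protected by an explicit lock."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

counter = Counter()
errors = hammer(counter.increment)
print(counter.value, errors)
```

Pointing hammer at a function from a dependency and then checking an invariant afterwards (here, that the count equals threads times iterations) gives a repeatable test that does not rely on the GIL for its correctness.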

My take: free-threaded Python is the future, but I am not in a hurry to adopt it for production. I will wait for the libraries I depend on to ship free-threaded wheels, for the single-threaded performance gap to close, and for the long tail of bugs to surface. For a CPU-bound Python service today, multiprocessing is still my answer. For a service in 2027, free-threaded threads are likely to be it.

The myth I would like to bury

The internet has decided that the GIL is the reason Python is slow. Every benchmark thread, every "why isn't Python more like Go" post, every Hacker News comment about CPython performance circles back to the GIL. It is mostly wrong. Python is slow because it is dynamically typed, has reference counting, has high-level data structures with overhead, and has an interpreter that has historically been the polar opposite of a JIT (this last point is changing with the 3.13+ JIT). The GIL is a small part of the story, mostly contained, and irrelevant for most production Python services because most production Python services are I/O-bound.

If your service is talking to a database, a queue, an API, a cache, and a search index, your service is I/O-bound. The GIL is not your problem. Threads work. asyncio works. You do not need multiprocessing, you do not need to wait for free-threaded Python, and you do not need to rewrite anything in Go. Profile first, identify the actual bottleneck, and reach for the matching tool. The GIL is precise about what it serialises, and engineers who know that precision pick the right concurrency primitive on the first try; engineers who do not waste two days reaching for the wrong one.
