gh-138122: Implement frame caching in RemoteUnwinder to reduce memory reads #142137
Conversation
Force-pushed from 180daaa to 8d4a838
Python/ceval.c (outdated)
// Update last_profiled_frame for remote profiler frame caching.
// By this point, tstate->current_frame is already set to the parent frame.
// The guarded check avoids writes when profiling is not active (predictable branch).
if (tstate->last_profiled_frame != NULL) {
TL;DR: This should be obvious, but in case anyone is worried: I did my homework, and this guard check has zero measurable cost; pyperformance shows 1.00x (no change). It's literally free.
This is the only modification outside the profiler, and before anyone wonders whether it has any effect: the answer is that it has literally zero effect. Here is how I measured it.
The guard check added to _PyEval_FrameClearAndPop compiles down to just two instructions:
278cab: cmpq $0x0,0x50(%rdi)   ; Compare last_profiled_frame with NULL
278cb0: je   278cba            ; Jump if equal (skip the write)

During normal operation when no profiler is attached, last_profiled_frame is always NULL, which means this branch always takes the exact same path. Modern CPUs predict this perfectly after just a few iterations and never mispredict it again. A perfectly predicted branch executes speculatively with zero pipeline stalls, making it effectively free.
I confirmed this using Linux perf with hardware performance counters. I ran the Python test suite and some selected tests (test_list test_tokenize test_gc test_dict test_ast test_compile) at maximum sample rate (99,999 Hz) collecting separate data files for CPU cycles, branch mispredictions, and cache misses:
perf record -F 99999 -g -o ./perf_cycles.data -- ./python -m test
perf record -e branch-misses -F 99999 -g -o ./perf_branch_misses.data -- ./python -m test

First, I checked how much the entire function contributes to total CPU time:
$ perf report -i ./perf_cycles.data --stdio --sort=symbol | grep FrameClearAndPop

# Samples: 20K of event 'cpu_atom/cycles/P'
0.10% 0.09% [.] _PyEval_FrameClearAndPop
# Samples: 422K of event 'cpu_core/cycles/P'
0.12% 0.10% [.] _PyEval_FrameClearAndPop
The whole function accounts for only 0.10% of cycles on the E-cores and 0.12% on the P-cores, so we're already in negligible territory. But the real question is whether the guard check causes any branch mispredictions.
I checked the function's contribution to total branch misses:
$ perf report -i ./perf_branch_misses.data --stdio --sort=symbol | grep FrameClearAndPop

# Samples: 12K of event 'cpu_atom/branch-misses/'
0.07% 0.06% [.] _PyEval_FrameClearAndPop
# Samples: 162K of event 'cpu_core/branch-misses/'
0.11% 0.11% [.] _PyEval_FrameClearAndPop
The entire function is only 0.11% of total branch misses. But within that, how much does our guard check contribute? I used perf annotate to see exactly which instructions caused branch misses within the function:
$ perf annotate -i ./perf_branch_misses.data _PyEval_FrameClearAndPop --stdio

This command reads the branch misprediction samples and maps them to specific instructions, showing what percentage of the function's branch misses occurred at each location. The result for our guard check:
; The guard check - if (tstate->last_profiled_frame != NULL)
0.00 : 278cab: cmpq $0x0,0x50(%rdi) ; ← 0.00% branch misses
0.00 : 278cb0: je 278cba ; ← 0.00% branch misses (PERFECTLY PREDICTED)
Zero. Not a single branch misprediction was sampled at that instruction across hundreds of thousands of samples. The CPU's branch predictor correctly predicts this branch every single time because it always takes the same path.
For comparison, here's more of the annotated output showing other branches in the same function:
0.00 : 278cb0: je 278cba ; Guard check - 0.00% misses
32.25 : 278cbe: jne 278d00 ; frame->owner check - 32.25% misses
50.13 : 278cc8: call 2c39e0 ; Function call
21.62 : 278ce5: je 278d60 ; Refcount check - 21.62% misses
The frame ownership check (frame->owner == FRAME_OWNED_BY_THREAD) accounts for 32.25% of the function's branch misses, and the refcount check (--op->ob_refcnt == 0) accounts for 21.62%. These are data-dependent branches that the CPU cannot predict perfectly. Our guard check contributes exactly 0.00% because it is perfectly predictable, unlike these other branches that depend on runtime data.
The overall Python branch miss rate is already very low (0.03% of all branches), and the guard check contributes nothing to this.
Finally, I ran pyperformance comparing main (ea51e745c713) against this PR (8d4a83894398). The geometric mean across all benchmarks is 1.00x, confirming no measurable regression in real-world workloads:
Pyperformance run:
All benchmarks:
| Benchmark | main-ea51e745c713 | PR-8d4a83894398 |
|---|---|---|
| subparsers | 100 ms | 97.8 ms: 1.02x faster |
| async_generators | 646 ms | 664 ms: 1.03x slower |
| bpe_tokeniser | 6.87 sec | 7.06 sec: 1.03x slower |
| comprehensions | 27.3 us | 26.6 us: 1.03x faster |
| coverage | 134 ms | 131 ms: 1.02x faster |
| crypto_pyaes | 121 ms | 124 ms: 1.02x slower |
| deepcopy_reduce | 6.94 us | 6.76 us: 1.03x faster |
| fannkuch | 656 ms | 640 ms: 1.02x faster |
| generators | 45.4 ms | 44.4 ms: 1.02x faster |
| logging_format | 17.4 us | 17.0 us: 1.02x faster |
| logging_simple | 15.2 us | 14.8 us: 1.03x faster |
| mdp | 2.09 sec | 2.13 sec: 1.02x slower |
| nbody | 176 ms | 180 ms: 1.02x slower |
| pickle_pure_python | 647 us | 630 us: 1.03x faster |
| regex_compile | 227 ms | 223 ms: 1.02x faster |
| regex_dna | 255 ms | 262 ms: 1.03x slower |
| regex_effbot | 4.23 ms | 4.34 ms: 1.03x slower |
| scimark_monte_carlo | 108 ms | 110 ms: 1.02x slower |
| scimark_sor | 188 ms | 184 ms: 1.02x faster |
| spectral_norm | 154 ms | 150 ms: 1.03x faster |
| unpack_sequence | 69.2 ns | 71.1 ns: 1.03x slower |
| xdsl_constant_fold | 97.6 ms | 95.4 ms: 1.02x faster |
| xml_etree_generate | 148 ms | 144 ms: 1.03x faster |
| xml_etree_process | 107 ms | 105 ms: 1.02x faster |
| Geometric mean | (ref) | 1.00x faster |
Benchmark hidden because not significant (85): 2to3, many_optionals, async_tree_none, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_eager_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, chaos, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, deltablue, django_template, docutils, dulwich_log, float, create_gc_cycles, gc_traversal, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_silent, mako, meteor_contest, nqueens, pathlib, pickle, pickle_dict, pickle_list, pidigits, pprint_safe_repr, pprint_pformat, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, richards_super, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sphinx, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_v2_normalize, sqlglot_v2_optimize, sqlglot_v2_parse, sqlglot_v2_transpile, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tomli_loads, tornado_http, typing_runtime_protocols, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse
So the conclusion is that the branch is perfectly predicted, adds no memory traffic beyond reading a value already in L1 cache (tstate is hot), and avoids cache line dirtying when the profiler is not attached. Zero cost.
Woah, does that mean it's 15M megahertz now?
Haha, no, it means it gets closer to 1 MHz in more cases (with deeper stacks). 1 MHz is the spherical-cow-in-a-vacuum, no-friction case ;)
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7af5161 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
Force-pushed from 7fa34b2 to d47dc4b
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit d47dc4b 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
Force-pushed from d47dc4b to 06f27e2
Remote profilers that sample call stacks from external processes need to read the entire frame chain on every sample. For deep stacks, this is expensive since most of the stack is typically unchanged between samples. This adds a `last_profiled_frame` pointer that remote profilers can use to implement a caching optimization. When sampling, a profiler writes the current frame address here. The eval loop then keeps this pointer valid by updating it to the parent frame in _PyEval_FrameClearAndPop. This creates a "high-water mark" that always points to a frame still on the stack, allowing profilers to skip reading unchanged portions of the stack. The check in ceval.c is guarded so there's zero overhead when profiling isn't active (the field starts NULL and the branch is predictable).
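To make the invariant concrete, here is a small Python model of the contract. This is purely illustrative: the real marker lives in the C thread state and is updated in `_PyEval_FrameClearAndPop`, so the class and method names below are made up and the update may differ in detail.

```python
# Toy model of the last_profiled_frame contract (illustrative only).

class ToyThread:
    def __init__(self):
        self.frames = []      # frame "addresses", bottom ... top
        self.marker = None    # plays the role of last_profiled_frame

    def push(self, frame):
        self.frames.append(frame)

    def pop(self):
        frame = self.frames.pop()
        # One way to maintain the stated invariant (the real update in
        # ceval.c may differ in detail): once a profiler has set the marker,
        # popping a frame moves it to the parent, so the marker always
        # points at a frame that is still on the stack (a high-water mark).
        if self.marker is not None:
            self.marker = self.frames[-1] if self.frames else None
        return frame

    def sample(self):
        # The profiler records the whole stack and marks the top frame;
        # everything at or below the marker is unchanged until it moves.
        self.marker = self.frames[-1] if self.frames else None
        return list(self.frames)


t = ToyThread()
for name in ("main", "handler", "helper", "leaf"):
    t.push(name)
t.sample()                    # marker -> "leaf"
t.pop()                       # "leaf" returns; marker moves to "helper"
assert t.marker == "helper"   # still a live frame: the invariant holds
```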
The _Py_RemoteDebug_WriteRemoteMemory function was duplicated across modules. Moving it to the shared header (alongside the existing read functions) allows both the sys.remote_exec machinery and the _remote_debugging module to use the same implementation. This is needed for the upcoming frame cache which writes to remote process memory from _remote_debugging.
When sampling deep call stacks repeatedly, most of the stack is typically unchanged between samples. This implements a per-thread frame cache that stores (address, frame_info) pairs. On each sample, if the current_frame has reached the last_profiled_frame pointer (indicating unchanged stack below), we stop walking and append cached frames instead of re-reading them from the remote process. The cache is keyed by thread ID and stores the full frame list from the previous sample. When process_frame_chain hits the cached boundary, frame_cache_lookup_and_extend appends the cached continuation. After each sample we update last_profiled_frame in the target process so the eval loop can maintain the invariant. This is opt-in via cache_frames=True on RemoteUnwinder to avoid changing behavior for existing users.
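In rough Python terms, the merge for one thread looks something like the sketch below. This is an illustration of the idea, not the C code in `_remote_debugging`; `read_frame`, the cache layout, and the miss fallback (added later in this PR, see below) are all stand-ins.

```python
# Illustrative sketch of the per-thread cache merge; the real implementation
# is C inside the _remote_debugging module and these names are stand-ins.

def sample_thread(tid, top_addr, marker_addr, cache, read_frame):
    """Return this sample's frames for one thread, reusing cached parents.

    cache maps tid -> [(address, frame_info), ...] from the previous sample,
    top-most frame first; read_frame(addr) reads a single frame from the
    target process and returns (frame_info, parent_address).
    """
    frames = []                       # [(address, frame_info), ...]
    addr = top_addr
    while addr:
        if addr == marker_addr:
            cached = cache.get(tid, [])
            hit = next((i for i, (a, _) in enumerate(cached) if a == addr), None)
            if hit is not None:
                # Everything at and below the marker is unchanged since the
                # previous sample: reuse the cached tail instead of reading
                # those frames from the remote process again.
                frames.extend(cached[hit:])
                break
            # Cache miss (e.g. a stale marker from a previous unwinder):
            # keep walking so the returned stack is still complete.
        info, parent = read_frame(addr)
        frames.append((addr, info))
        addr = parent
    cache[tid] = frames               # becomes the cache for the next sample
    return [info for _, info in frames]
```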
Turn on cache_frames=True in the sampling profiler and the external inspection benchmark. This significantly reduces the cost of sampling deep stacks since unchanged parent frames don't need to be re-read from the target process.
Tests cover the key frame caching scenarios: cache hits when the stack is stable, cache invalidation when functions return or new ones are called, partial stack reuse when only the top changes, and recursive frames. Also verifies that cache_frames=True produces equivalent results to cache_frames=False.
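The equivalence check in particular has roughly this shape. A hedged sketch: `cache_frames` is the new keyword from this PR, `get_stack_trace()` is the existing RemoteUnwinder sampling call, and it assumes the target thread is parked at a known blocking call so consecutive samples see the same stack.

```python
from _remote_debugging import RemoteUnwinder

def check_cache_equivalence(pid, rounds=5):
    # One unwinder with the cache, one without; both sample the same process.
    cached = RemoteUnwinder(pid, cache_frames=True)
    uncached = RemoteUnwinder(pid, cache_frames=False)
    for _ in range(rounds):           # repeated samples exercise cache hits
        assert cached.get_stack_trace() == uncached.get_stack_trace()
```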
When a new RemoteUnwinder is created to profile a process that was previously profiled by another unwinder, the last_profiled_frame values in thread states may be stale. This caused the new unwinder to stop frame walking early (when hitting the stale marker) but fail to find cached frames (since the cache is empty in the new unwinder), resulting in empty frame lists. Fix by clearing last_profiled_frame for all threads when a new RemoteUnwinder with cache_frames=True is initialized.
…acks When the frame walker stopped at a cached frame boundary but the cache lookup failed to find the continuation (e.g., when a new unwinder instance encounters stale last_profiled_frame values from a previous profiler), the code would silently return an incomplete stack. Now when a cache lookup misses, the walker continues from that point to collect the remaining frames, ensuring complete stack traces are always returned.
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 4680abf 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
Force-pushed from 4680abf to c5307fb
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 5ba13c5 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
lkollar left a comment:
I'll need more time to properly review this, just a couple of comments/questions now.
…bers The frame cache was returning stale line numbers when execution moved within the same function between samples. By always reading the current frame from memory and only caching parent frames, we get correct line numbers while still benefiting from cache hits on the unchanged call stack. This also skips the expensive stack chunk copying in cache mode since direct memory reads are cheaper when most frames come from cache anyway.
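Relative to the `sample_thread` sketch above, the change amounts to always fetching the executing frame fresh and only letting its parents come from the cache. Again just an illustration, not the C code:

```python
def sample_thread_fresh_top(tid, top_addr, marker_addr, cache, read_frame):
    # Always re-read the currently executing frame so its line number
    # reflects where execution is right now; only the parent chain may be
    # served from (and stored in) the cache via sample_thread above.
    top_info, parent_addr = read_frame(top_addr)
    parent_infos = sample_thread(tid, parent_addr, marker_addr, cache, read_frame)
    return [top_info] + parent_infos
```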
@lkollar addressed everything in
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 8df2106 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 1944a02 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
This PR makes the profiler up to 15x faster (when we have lots of repeated frames in the stack) 🎉
This PR implements frame caching in the `RemoteUnwinder` class to significantly reduce memory reads when profiling remote processes with deep call stacks.

When `cache_frames=True`, the unwinder stores the frame chain from each sample and reuses unchanged portions in subsequent samples. Since most profiling samples capture similar call stacks (especially the parent frames), this optimization avoids repeatedly reading the same frame data from the target process.

The implementation adds a `last_profiled_frame` field to the thread state that tracks where the previous sample stopped. On the next sample, if the current frame chain reaches this marker, the cached frames from that point onward are reused instead of being re-read from remote memory.

The sampling profiler now enables frame caching by default.
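For reference, opting in from Python looks roughly like this. A hedged sketch: `cache_frames` is the new keyword added by this PR, `get_stack_trace()` is the existing sampling entry point, and the loop and interval are illustrative.

```python
import time
from _remote_debugging import RemoteUnwinder

def sample_process(pid, duration=1.0, interval=0.0001):
    # cache_frames=True reuses unchanged parent frames between samples, so
    # deep, mostly stable stacks cost far fewer remote memory reads.
    unwinder = RemoteUnwinder(pid, cache_frames=True)
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append(unwinder.get_stack_trace())
        time.sleep(interval)
    return samples
```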