
Conversation

@pablogsal (Member) commented Dec 1, 2025

This PR makes the profiler up to 15x faster (when we have lots of repeated frames in the stack) 🎉

This PR implements frame caching in the RemoteUnwinder class to significantly reduce memory reads when profiling remote processes with deep call stacks.

When cache_frames=True, the unwinder stores the frame chain from each sample and reuses unchanged portions in subsequent samples. Since most profiling samples capture similar call stacks (especially the parent frames), this optimization avoids repeatedly reading the same frame data from the target process.

The implementation adds a last_profiled_frame field to the thread state that tracks where the previous sample stopped. On the next sample, if the current frame chain reaches this marker, the cached frames from that point onward are reused instead of being re-read from remote memory.

The sampling profiler now enables frame caching by default.
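
For intuition, here is a minimal pure-Python sketch of the idea, with illustrative names (the real implementation is C inside RemoteUnwinder, and the final version also re-reads the topmost frame unconditionally to keep line numbers fresh; see the later commits). read_frame stands in for one read from the target process:

def sample_stack(current_frame, last_profiled_frame, cached_frames, read_frame):
    """Walk a frame chain, reusing the cached continuation once we reach
    the address where the previous sample stopped."""
    frames = []
    addr = current_frame
    while addr:
        if addr == last_profiled_frame and addr in cached_frames:
            # Everything from this frame down was captured by the previous
            # sample: reuse it instead of re-reading remote memory.
            frames.extend(cached_frames[addr])
            break
        info, parent = read_frame(addr)  # one remote memory read
        frames.append(info)
        addr = parent
    return frames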

Python/ceval.c Outdated
// Update last_profiled_frame for remote profiler frame caching.
// By this point, tstate->current_frame is already set to the parent frame.
// The guarded check avoids writes when profiling is not active (predictable branch).
if (tstate->last_profiled_frame != NULL) {
@pablogsal (Member, Author) commented Dec 1, 2025

TL;DR: This should be obvious, but in case anyone is worried: I did my homework. This guard check has zero measurable cost, and pyperformance shows 1.00x (no change). It's literally free.


This is the only modification outside the profiler, and before anyone wonders whether it has any effect: the answer is that it has literally zero effect. Here is how I measured it.

The guard check added to _PyEval_FrameClearAndPop compiles down to just two instructions:

278cab: cmpq   $0x0,0x50(%rdi)    ; Compare last_profiled_frame with NULL
278cb0: je     278cba             ; Jump if equal (skip the write)

During normal operation when no profiler is attached, last_profiled_frame is always NULL, which means this branch always takes the exact same path. Modern CPUs predict this perfectly after just a few iterations and never mispredict it again. A perfectly predicted branch executes speculatively with zero pipeline stalls, making it effectively free.

I confirmed this using Linux perf with hardware performance counters. I ran the Python test suite and some selected tests (test_list, test_tokenize, test_gc, test_dict, test_ast, test_compile) at maximum sample rate (99,999 Hz), collecting separate data files for CPU cycles, branch mispredictions, and cache misses:

perf record -F 99999 -g -o ./perf_cycles.data -- ./python -m test
perf record -e branch-misses -F 99999 -g -o ./perf_branch_misses.data -- ./python -m test
perf record -e cache-misses -F 99999 -g -o ./perf_cache_misses.data -- ./python -m test

First, I checked how much the entire function contributes to total CPU time:

$ perf report -i ./perf_cycles.data --stdio --sort=symbol | grep FrameClearAndPop
# Samples: 20K of event 'cpu_atom/cycles/P'
     0.10%     0.09%  [.] _PyEval_FrameClearAndPop
# Samples: 422K of event 'cpu_core/cycles/P'
     0.12%     0.10%  [.] _PyEval_FrameClearAndPop

The whole function is only 0.12% of cycles on the P-cores (cpu_core) and 0.10% on the E-cores (cpu_atom), so we're already in negligible territory. But the real question is whether the guard check causes any branch mispredictions.

I checked the function's contribution to total branch misses:

$ perf report -i ./perf_branch_misses.data --stdio --sort=symbol | grep FrameClearAndPop
# Samples: 12K of event 'cpu_atom/branch-misses/'
     0.07%     0.06%  [.] _PyEval_FrameClearAndPop
# Samples: 162K of event 'cpu_core/branch-misses/'
     0.11%     0.11%  [.] _PyEval_FrameClearAndPop

The entire function is only 0.11% of total branch misses. But within that, how much does our guard check contribute? I used perf annotate to see exactly which instructions caused branch misses within the function:

$ perf annotate -i ./perf_branch_misses.data _PyEval_FrameClearAndPop --stdio

This command reads the branch misprediction samples and maps them to specific instructions, showing what percentage of the function's branch misses occurred at each location. The result for our guard check:

; The guard check - if (tstate->last_profiled_frame != NULL)
    0.00 :   278cab: cmpq   $0x0,0x50(%rdi)   ; ← 0.00% branch misses
    0.00 :   278cb0: je     278cba            ; ← 0.00% branch misses (PERFECTLY PREDICTED)

Zero. Not a single branch misprediction was sampled at that instruction across hundreds of thousands of samples. The CPU's branch predictor correctly predicts this branch every single time because it always takes the same path.

For comparison, here's more of the annotated output showing other branches in the same function:

    0.00 :   278cb0: je     278cba    ; Guard check - 0.00% misses
   32.25 :   278cbe: jne    278d00    ; frame->owner check - 32.25% misses
   50.13 :   278cc8: call   2c39e0    ; Function call
   21.62 :   278ce5: je     278d60    ; Refcount check - 21.62% misses

The frame ownership check (frame->owner == FRAME_OWNED_BY_THREAD) accounts for 32.25% of the function's branch misses, and the refcount check (--op->ob_refcnt == 0) accounts for 21.62%. These are data-dependent branches that the CPU cannot predict perfectly. Our guard check contributes exactly 0.00% because it is perfectly predictable, unlike these other branches that depend on runtime data.

The overall Python branch miss rate is already very low (0.03% of all branches), and the guard check contributes nothing to this.

Finally, I ran pyperformance comparing main (ea51e745c713) against this PR (8d4a83894398). The geometric mean across all benchmarks is 1.00x, confirming no measurable regression in real-world workloads:

Pyperformance run:

All benchmarks:

| Benchmark | main-ea51e745c713 | PR-8d4a83894398 |
|---|---|---|
| subparsers | 100 ms | 97.8 ms: 1.02x faster |
| async_generators | 646 ms | 664 ms: 1.03x slower |
| bpe_tokeniser | 6.87 sec | 7.06 sec: 1.03x slower |
| comprehensions | 27.3 us | 26.6 us: 1.03x faster |
| coverage | 134 ms | 131 ms: 1.02x faster |
| crypto_pyaes | 121 ms | 124 ms: 1.02x slower |
| deepcopy_reduce | 6.94 us | 6.76 us: 1.03x faster |
| fannkuch | 656 ms | 640 ms: 1.02x faster |
| generators | 45.4 ms | 44.4 ms: 1.02x faster |
| logging_format | 17.4 us | 17.0 us: 1.02x faster |
| logging_simple | 15.2 us | 14.8 us: 1.03x faster |
| mdp | 2.09 sec | 2.13 sec: 1.02x slower |
| nbody | 176 ms | 180 ms: 1.02x slower |
| pickle_pure_python | 647 us | 630 us: 1.03x faster |
| regex_compile | 227 ms | 223 ms: 1.02x faster |
| regex_dna | 255 ms | 262 ms: 1.03x slower |
| regex_effbot | 4.23 ms | 4.34 ms: 1.03x slower |
| scimark_monte_carlo | 108 ms | 110 ms: 1.02x slower |
| scimark_sor | 188 ms | 184 ms: 1.02x faster |
| spectral_norm | 154 ms | 150 ms: 1.03x faster |
| unpack_sequence | 69.2 ns | 71.1 ns: 1.03x slower |
| xdsl_constant_fold | 97.6 ms | 95.4 ms: 1.02x faster |
| xml_etree_generate | 148 ms | 144 ms: 1.03x faster |
| xml_etree_process | 107 ms | 105 ms: 1.02x faster |
| Geometric mean | (ref) | 1.00x faster |

Benchmark hidden because not significant (85): 2to3, many_optionals, async_tree_none, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_eager_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, chaos, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, deltablue, django_template, docutils, dulwich_log, float, create_gc_cycles, gc_traversal, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_silent, mako, meteor_contest, nqueens, pathlib, pickle, pickle_dict, pickle_list, pidigits, pprint_safe_repr, pprint_pformat, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, richards_super, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sphinx, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_v2_normalize, sqlglot_v2_optimize, sqlglot_v2_parse, sqlglot_v2_transpile, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tomli_loads, tornado_http, typing_runtime_protocols, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse

So the conclusion is that the branch is perfectly predicted, adds no memory traffic beyond reading a value already in the L1 cache (tstate is hot), and avoids dirtying a cache line when the profiler is not attached. Zero cost.

@pablogsal (Member, Author)

CC @ivonastojanovic @lkollar

@ZeroIntensity (Member)

This PR makes the profiler up to 15x faster 🎉

Woah, does that mean it's 15M megahertz now?

@pablogsal (Member, Author) commented Dec 1, 2025

Woah, does that mean it's 15M megahertz now?

Haha, no, it means it gets closer to 1 MHz more of the time (with deeper stacks). 1 MHz is the spherical-cow-in-a-vacuum-with-no-friction case ;)

@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 1, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7af5161 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 1, 2025
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 1, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit d47dc4b 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 1, 2025
Remote profilers that sample call stacks from external processes need to read the entire frame chain on every sample. For deep stacks, this is expensive since most of the stack is typically unchanged between samples.

This adds a `last_profiled_frame` pointer that remote profilers can use to implement a caching optimization. When sampling, a profiler writes the current frame address here. The eval loop then keeps this pointer valid by updating it to the parent frame in _PyEval_FrameClearAndPop. This creates a "high-water mark" that always points to a frame still on the stack, allowing profilers to skip reading unchanged portions of the stack.

The check in ceval.c is guarded so there's zero overhead when profiling isn't active (the field starts NULL and the branch is predictable).
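
In Python-flavored pseudocode, here is one plausible reading of that maintenance step (a sketch, not the actual C code; in particular, the condition that the popped frame must match the marker is my assumption about how the high-water mark moves):

def on_frame_pop(tstate, popped_frame):
    # last_profiled_frame is NULL unless a remote profiler is attached,
    # so in normal execution this branch is perfectly predicted and free.
    if tstate.last_profiled_frame is not None:
        # Assumed detail: when the marked frame itself is popped, move the
        # marker to its parent (tstate.current_frame already points there),
        # so the marker always names a frame that is still on the stack.
        if tstate.last_profiled_frame is popped_frame:
            tstate.last_profiled_frame = tstate.current_frame
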
The _Py_RemoteDebug_WriteRemoteMemory function was duplicated across modules. Moving it to the shared header (alongside the existing read functions) allows both the sys.remote_exec machinery and the _remote_debugging module to use the same implementation. This is needed for the upcoming frame cache which writes to remote process memory from _remote_debugging.
When sampling deep call stacks repeatedly, most of the stack is typically unchanged between samples. This implements a per-thread frame cache that stores (address, frame_info) pairs. On each sample, if the current_frame has reached the last_profiled_frame pointer (indicating unchanged stack below), we stop walking and append cached frames instead of re-reading them from the remote process.

The cache is keyed by thread ID and stores the full frame list from the previous sample. When process_frame_chain hits the cached boundary, frame_cache_lookup_and_extend appends the cached continuation. After each sample we update last_profiled_frame in the target process so the eval loop can maintain the invariant.

This is opt-in via cache_frames=True on RemoteUnwinder to avoid changing behavior for existing users.
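
A Python-flavored sketch of that lookup (illustrative only; the real code is C in _remote_debugging, and frame_cache_lookup_and_extend is the one name taken from this commit):

frame_cache = {}  # thread_id -> [(frame_address, frame_info), ...] from the last sample

def frame_cache_lookup_and_extend(thread_id, boundary_addr, frames):
    """Append the cached continuation starting at boundary_addr.
    Returns True on a cache hit, False on a miss."""
    cached = frame_cache.get(thread_id, [])
    for i, (addr, _info) in enumerate(cached):
        if addr == boundary_addr:
            frames.extend(info for _addr, info in cached[i:])
            return True
    return False  # miss: the caller must keep walking remote memory
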
Turn on cache_frames=True in the sampling profiler and the external inspection benchmark. This significantly reduces the cost of sampling deep stacks since unchanged parent frames don't need to be re-read from the target process.
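
From the user's side this is a single keyword argument. A hypothetical usage sketch (cache_frames is the keyword added by this PR; the other constructor arguments and the sampling loop are assumptions for illustration):

from _remote_debugging import RemoteUnwinder

pid = 12345  # PID of the target Python process
unwinder = RemoteUnwinder(pid, all_threads=True, cache_frames=True)

for _ in range(10_000):                  # one iteration per sample
    stacks = unwinder.get_stack_trace()  # unchanged parent frames come from
                                         # the cache instead of remote reads
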
Tests cover the key frame caching scenarios: cache hits when the stack is stable, cache invalidation when functions return or new ones are called, partial stack reuse when only the top changes, and recursive frames. Also verifies that cache_frames=True produces equivalent results to cache_frames=False.
When a new RemoteUnwinder is created to profile a process that was
previously profiled by another unwinder, the last_profiled_frame values
in thread states may be stale. This caused the new unwinder to stop
frame walking early (when hitting the stale marker) but fail to find
cached frames (since the cache is empty in the new unwinder), resulting
in empty frame lists.

Fix by clearing last_profiled_frame for all threads when a new
RemoteUnwinder with cache_frames=True is initialized.
…acks

When the frame walker stopped at a cached frame boundary but the cache lookup failed to find the continuation (e.g., when a new unwinder instance encounters stale last_profiled_frame values from a previous profiler), the code would silently return an incomplete stack. Now when a cache lookup misses, the walker continues from that point to collect the remaining frames, ensuring complete stack traces are always returned.
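
In terms of the earlier frame_cache_lookup_and_extend sketch, the fallback looks like this (Python-flavored and illustrative; the real walker is C):

def finish_stack(thread_id, boundary_addr, frames, read_frame):
    """Called when the walker stops at the cached boundary."""
    if frame_cache_lookup_and_extend(thread_id, boundary_addr, frames):
        return frames  # hit: the cached continuation was appended
    # Miss (e.g. a stale last_profiled_frame left by a previous unwinder):
    # resume walking from the boundary so the stack is never truncated.
    addr = boundary_addr
    while addr:
        info, parent = read_frame(addr)
        frames.append(info)
        addr = parent
    return frames
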
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 1, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 4680abf 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 1, 2025
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 3, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 5ba13c5 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 3, 2025
@lkollar (Contributor) left a comment

I'll need more time to properly review this; just a couple of comments/questions for now.

…bers

The frame cache was returning stale line numbers when execution moved
within the same function between samples. By always reading the current
frame from memory and only caching parent frames, we get correct line
numbers while still benefiting from cache hits on the unchanged call
stack. This also skips the expensive stack chunk copying in cache mode
since direct memory reads are cheaper when most frames come from cache
anyway.
@pablogsal (Member, Author)

@lkollar addressed everything in 8df2106 (#142137)

@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 5, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 8df2106 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 5, 2025
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 6, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 1944a02 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Dec 6, 2025