gh-138122: Implement frame caching in RemoteUnwinder to reduce memory reads #142137
Conversation
Force-pushed from 180daaa to 8d4a838
Python/ceval.c (outdated)
// Update last_profiled_frame for remote profiler frame caching.
// By this point, tstate->current_frame is already set to the parent frame.
// The guarded check avoids writes when profiling is not active (predictable branch).
if (tstate->last_profiled_frame != NULL) {
TL;DR: This should be obvious, but in case anyone is worried: I did my homework, and this guard check has zero measurable cost; pyperformance shows 1.00x (no change). It's literally free.
This is the only modification outside the profiler, and before anyone wonders whether it has any effect: the answer is that it has literally zero effect. Here is how I measured it.
The guard check added to _PyEval_FrameClearAndPop compiles down to just two instructions:
278cab: cmpq $0x0,0x50(%rdi)   ; Compare last_profiled_frame with NULL
278cb0: je   278cba            ; Jump if equal (skip the write)

During normal operation when no profiler is attached, last_profiled_frame is always NULL, which means this branch always takes the exact same path. Modern CPUs predict this perfectly after just a few iterations and never mispredict it again. A perfectly predicted branch executes speculatively with zero pipeline stalls, making it effectively free.
I confirmed this using Linux perf with hardware performance counters. I ran the Python test suite and some selected tests (test_list test_tokenize test_gc test_dict test_ast test_compile) at maximum sample rate (99,999 Hz) collecting separate data files for CPU cycles, branch mispredictions, and cache misses:
perf record -F 99999 -g -o ./perf_cycles.data -- ./python -m test
perf record -e branch-misses -F 99999 -g -o ./perf_branch_misses.data -- ./python -m test

First, I checked how much the entire function contributes to total CPU time:
$ perf report -i ./perf_cycles.data --stdio --sort=symbol | grep FrameClearAndPop

# Samples: 20K of event 'cpu_atom/cycles/P'
0.10% 0.09% [.] _PyEval_FrameClearAndPop
# Samples: 422K of event 'cpu_core/cycles/P'
0.12% 0.10% [.] _PyEval_FrameClearAndPop
The whole function accounts for only 0.10% of cycles on the E-cores and 0.12% on the P-cores, so we're already in negligible territory. But the real question is whether the guard check causes any branch mispredictions.
I checked the function's contribution to total branch misses:
$ perf report -i ./perf_branch_misses.data --stdio --sort=symbol | grep FrameClearAndPop

# Samples: 12K of event 'cpu_atom/branch-misses/'
0.07% 0.06% [.] _PyEval_FrameClearAndPop
# Samples: 162K of event 'cpu_core/branch-misses/'
0.11% 0.11% [.] _PyEval_FrameClearAndPop
The entire function is only 0.11% of total branch misses. But within that, how much does our guard check contribute? I used perf annotate to see exactly which instructions caused branch misses within the function:
$ perf annotate -i ./perf_branch_misses.data _PyEval_FrameClearAndPop --stdio

This command reads the branch misprediction samples and maps them to specific instructions, showing what percentage of the function's branch misses occurred at each location. The result for our guard check:
; The guard check - if (tstate->last_profiled_frame != NULL)
0.00 : 278cab: cmpq $0x0,0x50(%rdi) ; ← 0.00% branch misses
0.00 : 278cb0: je 278cba ; ← 0.00% branch misses (PERFECTLY PREDICTED)
Zero. Not a single branch misprediction was sampled at that instruction across hundreds of thousands of samples. The CPU's branch predictor correctly predicts this branch every single time because it always takes the same path.
For comparison, here's more of the annotated output showing other branches in the same function:
0.00 : 278cb0: je 278cba ; Guard check - 0.00% misses
32.25 : 278cbe: jne 278d00 ; frame->owner check - 32.25% misses
50.13 : 278cc8: call 2c39e0 ; Function call
21.62 : 278ce5: je 278d60 ; Refcount check - 21.62% misses
The frame ownership check (frame->owner == FRAME_OWNED_BY_THREAD) accounts for 32.25% of the function's branch misses, and the refcount check (--op->ob_refcnt == 0) accounts for 21.62%. These are data-dependent branches that the CPU cannot predict perfectly. Our guard check contributes exactly 0.00% because it is perfectly predictable, unlike these other branches that depend on runtime data.
The overall Python branch miss rate is already very low (0.03% of all branches), and the guard check contributes nothing to this.
Finally, I ran pyperformance comparing main (ea51e745c713) against this PR (8d4a83894398). The geometric mean across all benchmarks is 1.00x, confirming no measurable regression in real-world workloads:
Pyperformance run:
All benchmarks:
| Benchmark | main-ea51e745c713 | PR-8d4a83894398 |
|---|---|---|
| subparsers | 100 ms | 97.8 ms: 1.02x faster |
| async_generators | 646 ms | 664 ms: 1.03x slower |
| bpe_tokeniser | 6.87 sec | 7.06 sec: 1.03x slower |
| comprehensions | 27.3 us | 26.6 us: 1.03x faster |
| coverage | 134 ms | 131 ms: 1.02x faster |
| crypto_pyaes | 121 ms | 124 ms: 1.02x slower |
| deepcopy_reduce | 6.94 us | 6.76 us: 1.03x faster |
| fannkuch | 656 ms | 640 ms: 1.02x faster |
| generators | 45.4 ms | 44.4 ms: 1.02x faster |
| logging_format | 17.4 us | 17.0 us: 1.02x faster |
| logging_simple | 15.2 us | 14.8 us: 1.03x faster |
| mdp | 2.09 sec | 2.13 sec: 1.02x slower |
| nbody | 176 ms | 180 ms: 1.02x slower |
| pickle_pure_python | 647 us | 630 us: 1.03x faster |
| regex_compile | 227 ms | 223 ms: 1.02x faster |
| regex_dna | 255 ms | 262 ms: 1.03x slower |
| regex_effbot | 4.23 ms | 4.34 ms: 1.03x slower |
| scimark_monte_carlo | 108 ms | 110 ms: 1.02x slower |
| scimark_sor | 188 ms | 184 ms: 1.02x faster |
| spectral_norm | 154 ms | 150 ms: 1.03x faster |
| unpack_sequence | 69.2 ns | 71.1 ns: 1.03x slower |
| xdsl_constant_fold | 97.6 ms | 95.4 ms: 1.02x faster |
| xml_etree_generate | 148 ms | 144 ms: 1.03x faster |
| xml_etree_process | 107 ms | 105 ms: 1.02x faster |
| Geometric mean | (ref) | 1.00x faster |
Benchmark hidden because not significant (85): 2to3, many_optionals, async_tree_none, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_eager_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, chaos, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, deltablue, django_template, docutils, dulwich_log, float, create_gc_cycles, gc_traversal, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_silent, mako, meteor_contest, nqueens, pathlib, pickle, pickle_dict, pickle_list, pidigits, pprint_safe_repr, pprint_pformat, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, richards_super, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sphinx, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_v2_normalize, sqlglot_v2_optimize, sqlglot_v2_parse, sqlglot_v2_transpile, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tomli_loads, tornado_http, typing_runtime_protocols, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse
So the conclusion is that the branch is perfectly predicted, adds no memory traffic beyond reading a value already in L1 cache (tstate is hot), and avoids cache line dirtying when the profiler is not attached. Zero cost.
Woah, does that mean it's 15M megahertz now?
Haha, no, it means it gets closer to 1 MHz in more cases (with deeper stacks). 1 MHz is the spherical-cow-in-a-vacuum, no-friction case ;)
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7af5161 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
Force-pushed from 7fa34b2 to d47dc4b
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit d47dc4b 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
Force-pushed from d47dc4b to 06f27e2
Remote profilers that sample call stacks from external processes need to read the entire frame chain on every sample. For deep stacks, this is expensive since most of the stack is typically unchanged between samples. This adds a `last_profiled_frame` pointer that remote profilers can use to implement a caching optimization. When sampling, a profiler writes the current frame address here. The eval loop then keeps this pointer valid by updating it to the parent frame in _PyEval_FrameClearAndPop. This creates a "high-water mark" that always points to a frame still on the stack, allowing profilers to skip reading unchanged portions of the stack. The check in ceval.c is guarded so there's zero overhead when profiling isn't active (the field starts NULL and the branch is predictable).
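To make the invariant concrete, here is a small Python model of the contract. This is purely illustrative: the real marker lives in the C thread state and is updated in `_PyEval_FrameClearAndPop`, so the class and method names below are made up and the update may differ in detail.

```python
# Toy model of the last_profiled_frame contract (illustrative only).

class ToyThread:
    def __init__(self):
        self.frames = []      # frame "addresses", bottom ... top
        self.marker = None    # plays the role of last_profiled_frame

    def push(self, frame):
        self.frames.append(frame)

    def pop(self):
        frame = self.frames.pop()
        # One way to maintain the stated invariant (the real update in
        # ceval.c may differ in detail): once a profiler has set the marker,
        # popping a frame moves it to the parent, so the marker always
        # points at a frame that is still on the stack (a high-water mark).
        if self.marker is not None:
            self.marker = self.frames[-1] if self.frames else None
        return frame

    def sample(self):
        # The profiler records the whole stack and marks the top frame;
        # everything at or below the marker is unchanged until it moves.
        self.marker = self.frames[-1] if self.frames else None
        return list(self.frames)


t = ToyThread()
for name in ("main", "handler", "helper", "leaf"):
    t.push(name)
t.sample()                    # marker -> "leaf"
t.pop()                       # "leaf" returns; marker moves to "helper"
assert t.marker == "helper"   # still a live frame: the invariant holds
```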
The _Py_RemoteDebug_WriteRemoteMemory function was duplicated across modules. Moving it to the shared header (alongside the existing read functions) allows both the sys.remote_exec machinery and the _remote_debugging module to use the same implementation. This is needed for the upcoming frame cache which writes to remote process memory from _remote_debugging.
When sampling deep call stacks repeatedly, most of the stack is typically unchanged between samples. This implements a per-thread frame cache that stores (address, frame_info) pairs. On each sample, if the current_frame has reached the last_profiled_frame pointer (indicating unchanged stack below), we stop walking and append cached frames instead of re-reading them from the remote process. The cache is keyed by thread ID and stores the full frame list from the previous sample. When process_frame_chain hits the cached boundary, frame_cache_lookup_and_extend appends the cached continuation. After each sample we update last_profiled_frame in the target process so the eval loop can maintain the invariant. This is opt-in via cache_frames=True on RemoteUnwinder to avoid changing behavior for existing users.
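In rough Python terms, the merge for one thread looks something like the sketch below. This is an illustration of the idea, not the C code in `_remote_debugging`; `read_frame`, the cache layout, and the miss fallback (added later in this PR, see below) are all stand-ins.

```python
# Illustrative sketch of the per-thread cache merge; the real implementation
# is C inside the _remote_debugging module and these names are stand-ins.

def sample_thread(tid, top_addr, marker_addr, cache, read_frame):
    """Return this sample's frames for one thread, reusing cached parents.

    cache maps tid -> [(address, frame_info), ...] from the previous sample,
    top-most frame first; read_frame(addr) reads a single frame from the
    target process and returns (frame_info, parent_address).
    """
    frames = []                       # [(address, frame_info), ...]
    addr = top_addr
    while addr:
        if addr == marker_addr:
            cached = cache.get(tid, [])
            hit = next((i for i, (a, _) in enumerate(cached) if a == addr), None)
            if hit is not None:
                # Everything at and below the marker is unchanged since the
                # previous sample: reuse the cached tail instead of reading
                # those frames from the remote process again.
                frames.extend(cached[hit:])
                break
            # Cache miss (e.g. a stale marker from a previous unwinder):
            # keep walking so the returned stack is still complete.
        info, parent = read_frame(addr)
        frames.append((addr, info))
        addr = parent
    cache[tid] = frames               # becomes the cache for the next sample
    return [info for _, info in frames]
```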
Turn on cache_frames=True in the sampling profiler and the external inspection benchmark. This significantly reduces the cost of sampling deep stacks since unchanged parent frames don't need to be re-read from the target process.
Tests cover the key frame caching scenarios: cache hits when the stack is stable, cache invalidation when functions return or new ones are called, partial stack reuse when only the top changes, and recursive frames. Also verifies that cache_frames=True produces equivalent results to cache_frames=False.
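The equivalence check in particular has roughly this shape. A hedged sketch: `cache_frames` is the new keyword from this PR, `get_stack_trace()` is the existing RemoteUnwinder sampling call, and it assumes the target thread is parked at a known blocking call so consecutive samples see the same stack.

```python
from _remote_debugging import RemoteUnwinder

def check_cache_equivalence(pid, rounds=5):
    # One unwinder with the cache, one without; both sample the same process.
    cached = RemoteUnwinder(pid, cache_frames=True)
    uncached = RemoteUnwinder(pid, cache_frames=False)
    for _ in range(rounds):           # repeated samples exercise cache hits
        assert cached.get_stack_trace() == uncached.get_stack_trace()
```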
When a new RemoteUnwinder is created to profile a process that was previously profiled by another unwinder, the last_profiled_frame values in thread states may be stale. This caused the new unwinder to stop frame walking early (when hitting the stale marker) but fail to find cached frames (since the cache is empty in the new unwinder), resulting in empty frame lists. Fix by clearing last_profiled_frame for all threads when a new RemoteUnwinder with cache_frames=True is initialized.
…acks When the frame walker stopped at a cached frame boundary but the cache lookup failed to find the continuation (e.g., when a new unwinder instance encounters stale last_profiled_frame values from a previous profiler), the code would silently return an incomplete stack. Now when a cache lookup misses, the walker continues from that point to collect the remaining frames, ensuring complete stack traces are always returned.
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 4680abf 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
Force-pushed from 4680abf to c5307fb
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 5ba13c5 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
lkollar left a comment:
I'll need more time to properly review this, just a couple of comments/questions now.
…bers The frame cache was returning stale line numbers when execution moved within the same function between samples. By always reading the current frame from memory and only caching parent frames, we get correct line numbers while still benefiting from cache hits on the unchanged call stack. This also skips the expensive stack chunk copying in cache mode since direct memory reads are cheaper when most frames come from cache anyway.
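Relative to the `sample_thread` sketch above, the change amounts to always fetching the executing frame fresh and only letting its parents come from the cache. Again just an illustration, not the C code:

```python
def sample_thread_fresh_top(tid, top_addr, marker_addr, cache, read_frame):
    # Always re-read the currently executing frame so its line number
    # reflects where execution is right now; only the parent chain may be
    # served from (and stored in) the cache via sample_thread above.
    top_info, parent_addr = read_frame(top_addr)
    parent_infos = sample_thread(tid, parent_addr, marker_addr, cache, read_frame)
    return [top_info] + parent_infos
```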
@lkollar addressed everything in
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 8df2106 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 1944a02 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F142137%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
This PR makes the profiler up to 15x faster (when we have lots of repeated frames in the stack) 🎉
This PR implements frame caching in the `RemoteUnwinder` class to significantly reduce memory reads when profiling remote processes with deep call stacks.

When `cache_frames=True`, the unwinder stores the frame chain from each sample and reuses unchanged portions in subsequent samples. Since most profiling samples capture similar call stacks (especially the parent frames), this optimization avoids repeatedly reading the same frame data from the target process.

The implementation adds a `last_profiled_frame` field to the thread state that tracks where the previous sample stopped. On the next sample, if the current frame chain reaches this marker, the cached frames from that point onward are reused instead of being re-read from remote memory.

The sampling profiler now enables frame caching by default.
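For reference, opting in from Python looks roughly like this. A hedged sketch: `cache_frames` is the new keyword added by this PR, `get_stack_trace()` is the existing sampling entry point, and the loop and interval are illustrative.

```python
import time
from _remote_debugging import RemoteUnwinder

def sample_process(pid, duration=1.0, interval=0.0001):
    # cache_frames=True reuses unchanged parent frames between samples, so
    # deep, mostly stable stacks cost far fewer remote memory reads.
    unwinder = RemoteUnwinder(pid, cache_frames=True)
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append(unwinder.get_stack_trace())
        time.sleep(interval)
    return samples
```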