-
Notifications
You must be signed in to change notification settings - Fork 757
Fix redundant cancel tracebacks on ctrl+c (issue #343) #370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix redundant cancel tracebacks on ctrl+c (issue #343) #370
Conversation
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes issue #343 where pressing ctrl+c during benchmark runs caused redundant asyncio.CancelledError tracebacks from every runner/algorithm. The fix introduces a new _run_with_sigint() method that distinguishes between SIGINT-triggered cancellations (which should be treated as graceful shutdowns) and internal asyncio.CancelledError exceptions (which should be treated as failures).
Key changes:
- Added
_run_with_sigint()method that sets up a SIGINT handler and uses a global flagSIGINT_SEENto track whether cancellation was caused by SIGINT - Modified all
asyncio.run()calls to use_run_with_sigint()instead, providing consistent SIGINT handling across algorithm and runner processes - Added exception handlers for
asyncio.CancelledErrorin_execute_algorithmand_execute_runnerto ensure stop events are set before propagating the error
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| agentlightning/execution/client_server.py | Adds global SIGINT_SEEN flag, implements _run_with_sigint() method with SIGINT handler, adds CancelledError handlers to executor methods, and replaces all asyncio.run() calls with _run_with_sigint() |
| tests/execution/test_client_server.py | Adds _cancel_in_runner() helper function and two new tests to verify SIGINT handling (graceful shutdown) and non-SIGINT CancelledError propagation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| SIGINT_SEEN: bool = False | ||
|
|
||
|
|
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The global flag SIGINT_SEEN is not reset after use, which can cause incorrect behavior if _run_with_sigint is called multiple times in the same process. Once set to True, it will remain True for subsequent calls, causing non-SIGINT CancelledError exceptions to be incorrectly treated as SIGINT-triggered cancellations.
Consider using a context-local flag (e.g., passing a flag as a parameter or using a thread-local/task-local variable) or explicitly resetting the flag at the start of _run_with_sigint.
| def _run_with_sigint(...): | |
| global SIGINT_SEEN | |
| SIGINT_SEEN = False | |
| # ... rest of the function ... |
| try: | ||
| loop.run_until_complete(main_task) | ||
| except asyncio.CancelledError: | ||
| global SIGINT_SEEN |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The global keyword is unnecessary here since you're only reading SIGINT_SEEN, not modifying it. The global declaration is only needed when assigning to a global variable within a function.
Remove the global SIGINT_SEEN declaration on line 256.
| global SIGINT_SEEN |
| make_coro = lambda: self._execute_runner(runner, worker_id, store, stop_evt) | ||
| else: | ||
| assert algorithm is not None | ||
| make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt) | ||
|
|
||
| loop = asyncio.new_event_loop() | ||
| asyncio.set_event_loop(loop) | ||
|
|
||
| # Explicitly create the main task so the SIGINT handler can do cancel() | ||
| main_task: asyncio.Task[None] = loop.create_task(make_coro()) |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The lambda functions used here capture variables from the outer scope. Consider using a more explicit approach for better readability and to avoid potential closure-related issues:
if kind == "runner":
assert runner is not None
coro = self._execute_runner(runner, worker_id, store, stop_evt)
else:
assert algorithm is not None
coro = self._execute_algorithm(algorithm, store, stop_evt)
main_task: asyncio.Task[None] = loop.create_task(coro)| make_coro = lambda: self._execute_runner(runner, worker_id, store, stop_evt) | |
| else: | |
| assert algorithm is not None | |
| make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt) | |
| loop = asyncio.new_event_loop() | |
| asyncio.set_event_loop(loop) | |
| # Explicitly create the main task so the SIGINT handler can do cancel() | |
| main_task: asyncio.Task[None] = loop.create_task(make_coro()) | |
| coro = self._execute_runner(runner, worker_id, store, stop_evt) | |
| else: | |
| assert algorithm is not None | |
| coro = self._execute_algorithm(algorithm, store, stop_evt) | |
| loop = asyncio.new_event_loop() | |
| asyncio.set_event_loop(loop) | |
| # Explicitly create the main task so the SIGINT handler can do cancel() | |
| main_task: asyncio.Task[None] = loop.create_task(coro) |
| assert p.pid is not None | ||
| os.kill(p.pid, signal.SIGINT) | ||
| p.join(timeout=5.0) | ||
| assert not p.is_alive() | ||
| assert p.exitcode == 0 |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test may have a race condition. After sending SIGINT with os.kill(), there's no guarantee that the signal handler has completed setting the global flag before the process exits. Consider adding a small delay after sending the signal or checking that the process actually received and handled the signal.
Additionally, this test doesn't verify that SIGINT_SEEN was actually set in the child process, it only checks the exit code. If the global flag mechanism fails in a child process (e.g., in spawn mode), this test might still pass if the process exits cleanly for other reasons.
| # the runner/algorithm receives asyncio.CancelledError. | ||
| main_task.cancel() | ||
|
|
||
| signal.signal(signal.SIGINT, _sigint_handler) |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signal handler doesn't restore the previous SIGINT handler after the function completes. If _run_with_sigint is called multiple times or if the caller expects to have their own SIGINT handler, this could cause issues.
Consider saving and restoring the previous handler:
previous_handler = signal.signal(signal.SIGINT, _sigint_handler)
try:
# ... existing code ...
finally:
signal.signal(signal.SIGINT, previous_handler)
# ... existing cleanup code ...|
|
||
| import pytest | ||
|
|
||
| from agentlightning.execution import client_server as cs_mod |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'cs_mod' is not used.
| from agentlightning.execution import client_server as cs_mod |
| assert algorithm is not None | ||
| make_coro = lambda: self._execute_algorithm(algorithm, store, stop_evt) | ||
|
|
||
| loop = asyncio.new_event_loop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to create a new event loop here? what will happen if we already have a event loop?
| # the runner/algorithm receives asyncio.CancelledError. | ||
| main_task.cancel() | ||
|
|
||
| signal.signal(signal.SIGINT, _sigint_handler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this registration is process wide. Let's be extra careful on that.
try: except KeyboardInterrupt might be a better idea if that works.
|
I removed the previous logic and shifted exception handling to the outer When a By raising Also added tests that verify if |
|
@Kwanghoon-Choi have you tested what will happen if the algorithm crashes (in which case stop_evt is set)? will the runners emit a lot of logs and tracebacks? |
Problem (#343)
When running
tests/benchmark/benchmark_store.pyand pressing ctrl+c, every runner/algorithm logs an traceback forasyncio.CancelledError.SIGINT(ctrl+c) ends up asasyncio.CancelledError, which is difficult to distinguish from internal cancellations.Goal
Treat ctrl+c as a graceful shutdown (no traceback), while treating non-SIGINT
asyncio.CancelledErroras a failure with a traceback.Fix
_run_with_sigintas an entry point for runner/algorithm coroutines.SIGINT_SEEN) from the SIGINT handler.main_task.cancel()to convert the signal intoasyncio.CancelledError.asyncio.CancelledErrorbased onSIGINT_SEEN.KeyboardInterrupt.CancelledErroris propagated as before.Tests
test_run_with_sigint_child_runner_exits_cleanly_on_siginttest_run_with_sigint_propagates_non_sigint_cancelled_errorThe tests verify that
_run_with_sigint()distinguishes between SIGINT-triggered cancellation and internalasyncio.CancelledError.Reproduce