-
Notifications
You must be signed in to change notification settings - Fork 664
[Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports #5415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…factor post-init and usage for all ports
|
Thanks for your contribution! |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #5415 +/- ##
==========================================
Coverage ? 58.04%
==========================================
Files ? 329
Lines ? 41002
Branches ? 6206
==========================================
Hits ? 23798
Misses ? 15379
Partials ? 1825
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors port configuration for PD-disaggregated deployments by introducing automatic port allocation and consolidating port management logic. The changes aim to simplify configuration by removing the need for users to manually specify multiple ports per service.
Key Changes
- Introduced automatic port discovery via
find_free_ports()function with validation inpost_init_all_ports() - Centralized port extraction logic in
FDConfig.postprocess()to set per-DP ports (local_engine_worker_queue_port, etc.) - Added RDMA environment auto-detection via new
get_rdma_nics.shscript
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| fastdeploy/utils.py | Added parse_ports(), find_free_ports(), and modified is_port_available() for port management |
| fastdeploy/engine/args_utils.py | Added post_init_all_ports() to automatically allocate ports; changed default cache_transfer_protocol to "ipc,rdma" |
| fastdeploy/config.py | Moved port extraction logic to postprocess() method; added local_engine_worker_queue_port attribute |
| fastdeploy/worker/worker_process.py | Moved port assignment from pre-config to post-config initialization |
| fastdeploy/engine/engine.py | Updated to use local_engine_worker_queue_port for DP queue services |
| fastdeploy/engine/common_engine.py | Simplified port handling by using local_engine_worker_queue_port directly |
| fastdeploy/engine/expert_service.py | Removed local port/device ID extraction logic (now handled in config) |
| fastdeploy/splitwise/splitwise_connector.py | Changed to use scalar pd_comm_port and local_device_ids |
| fastdeploy/cache_manager/prefix_cache_manager.py | Renamed pid_suffix to ipc_suffix; updated RDMA port usage |
| fastdeploy/cache_manager/cache_transfer_manager.py | Renamed engine_pid to ipc_suffix for clarity |
| fastdeploy/cache_manager/cache_messager.py | Renamed engine_pid to ipc_suffix for consistency |
| fastdeploy/cache_manager/transfer_factory/rdma_cache_transfer.py | Added automatic RDMA environment setup and NIC detection |
| fastdeploy/cache_manager/transfer_factory/get_rdma_nics.sh | New script for detecting RDMA-capable network interfaces |
| examples/splitwise/utils.sh | Improved health check display with UDP port checking |
| examples/splitwise/start_v1_tp1.sh | Simplified example by removing manual port specifications |
| + f" --engine_pid {pid_suffix}" | ||
| + f" --rdma_port {cache_config.rdma_comm_ports[i] if cache_config.rdma_comm_ports is not None else '0'}" | ||
| + f" --ipc_suffix {ipc_suffix}" | ||
| + f" --rdma_port {cache_config.rdma_comm_ports[0] if cache_config.rdma_comm_ports is not None else '0'}" |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to line 280, using only rdma_comm_ports[0] for all tensor parallel ranks is incorrect. Each rank needs its corresponding RDMA port:
+ f" --rdma_port {cache_config.rdma_comm_ports[i] if cache_config.rdma_comm_ports is not None else '0'}"| + f" --rdma_port {cache_config.rdma_comm_ports[0] if cache_config.rdma_comm_ports is not None else '0'}" | |
| + f" --rdma_port {cache_config.rdma_comm_ports[i] if cache_config.rdma_comm_ports is not None else '0'}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
已修复
| return [int(p) for p in ports.split(",")] | ||
| elif isinstance(ports, list): | ||
| return [int(p) for p in ports] |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parse_ports function doesn't handle invalid input gracefully. If the string contains non-numeric values or the list contains non-convertible elements, it will raise a ValueError without a clear error message. Consider adding try-except with a more descriptive error:
elif isinstance(ports, str):
try:
return [int(p.strip()) for p in ports.split(",")]
except ValueError as e:
raise TypeError(f"Cannot parse ports string '{ports}': contains non-numeric values") from e| return [int(p) for p in ports.split(",")] | |
| elif isinstance(ports, list): | |
| return [int(p) for p in ports] | |
| try: | |
| return [int(p.strip()) for p in ports.split(",")] | |
| except (ValueError, TypeError) as e: | |
| raise TypeError(f"Cannot parse ports string '{ports}': contains non-numeric values") from e | |
| elif isinstance(ports, list): | |
| try: | |
| return [int(p) for p in ports] | |
| except (ValueError, TypeError) as e: | |
| raise TypeError(f"Cannot parse ports list '{ports}': contains non-numeric values") from e |
|
|
||
| try: | ||
| import rdma_comm | ||
| except: |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a bare except: clause catches all exceptions including SystemExit and KeyboardInterrupt, which can make the program difficult to debug or terminate properly. Use except Exception: or a more specific exception type instead:
except ImportError:
raise RuntimeError(
"The installation of the RDMA library failed. Confirm whether your network card supports RDMA transmission."
)| except: | |
| except ImportError: |
| """ | ||
|
|
||
| cache_transfer_protocol: str = "ipc" | ||
| cache_transfer_protocol: str = "ipc,rdma" |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value for cache_transfer_protocol changed from "ipc" to "ipc,rdma". This is a breaking change because it now requires RDMA to be configured even if users only want IPC. Consider keeping the default as "ipc" for backward compatibility, or document this breaking change clearly.
| cache_transfer_protocol: str = "ipc,rdma" | |
| cache_transfer_protocol: str = "ipc" |
fastdeploy/engine/args_utils.py
Outdated
| assert ( | ||
| len(ports) == num_total_ports | ||
| ), f"Parameter `{name}` should have {num_total_ports} ports, got {len(ports)}." | ||
| ports = parse_ports(ports) |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic error in port validation. At line 552, parse_ports(ports) is called after the assertion at line 549-551, but the assertion checks len(ports) before parsing. If ports is an int or str (valid inputs), the len() call will fail or give incorrect results.
The parse_ports() call should happen before the length assertion:
ports = parse_ports(ports)
if ports is not None:
assert (
len(ports) == num_total_ports
), f"Parameter `{name}` should have {num_total_ports} ports, got {len(ports)}."| assert ( | |
| len(ports) == num_total_ports | |
| ), f"Parameter `{name}` should have {num_total_ports} ports, got {len(ports)}." | |
| ports = parse_ports(ports) | |
| ports = parse_ports(ports) | |
| assert ( | |
| len(ports) == num_total_ports | |
| ), f"Parameter `{name}` should have {num_total_ports} ports, got {len(ports)}." |
| if is_port_available(host, port): | ||
| free_ports.append(port) |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The find_free_ports function relies on is_port_available, which has a critical bug (always returns True). This means find_free_ports will incorrectly consider all ports as available and may attempt to use ports that are already in use, leading to runtime failures when the ports are actually bound.
Motivation
该 PR 旨在实现两个目标:
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.