
paddlepaddle-gpu 2.6.2: error when running the PS example #76762

@oxrou

Description


Describe the Bug

Environment:
paddlepaddle-gpu 2.6.2
In PS mode, with either 1 pserver + 1 worker or 1 pserver + 2 workers, the workers fail right after starting; this happens on both CPU and GPU. The same example runs through on the CPU-only build of paddlepaddle 2.6.2.
Launch command (identical in every case):
fleetrun --master=10.250.1.253:8090 --servers=10.250.1.253:6070 --trainers=10.250.0.108:6071,10.250.1.172:6071 /workspace/train.py --lr 0.01
Training code:
PaddleFleetX/tree/old_develop/examples/wide_and_deep_datase
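
For reference, the example follows the standard static-graph parameter-server workflow. The snippet below is only a minimal sketch of that workflow, not the actual PaddleFleetX script: the toy fc network stands in for the wide&deep model and the worker training loop is elided.

```python
import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init(is_collective=False)  # role/endpoints come from the fleetrun environment

# Toy network standing in for the wide&deep model in the example.
x = paddle.static.data(name="x", shape=[None, 13], dtype="float32")
y = paddle.static.data(name="y", shape=[None, 1], dtype="float32")
pred = paddle.static.nn.fc(x, size=1)
loss = paddle.mean(paddle.nn.functional.square_error_cost(pred, y))

strategy = fleet.DistributedStrategy()
strategy.a_sync = True  # asynchronous parameter-server training

optimizer = fleet.distributed_optimizer(
    paddle.optimizer.Adam(learning_rate=0.01), strategy
)
optimizer.minimize(loss)

if fleet.is_server():
    fleet.init_server()
    fleet.run_server()
else:
    exe = paddle.static.Executor(paddle.CPUPlace())
    exe.run(paddle.static.default_startup_program())
    fleet.init_worker()
    # ... feed mini-batches and run the main program here ...
    fleet.stop_worker()
```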
Error output:
/opt/paddle-env/lib/python3.10/site-packages/paddle/base/framework.py:688: UserWarning: You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default.
warnings.warn(
[2025-12-03 11:53:27,783] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
fl-ps > coordinator address is null!
Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.250.1.253', 'http.port': '6767', 'store.prefix': '', 'start_http_server': False, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7f16e86598d0>}
I1203 11:53:32.963233 142 gloo_wrapper.cc:355] gloo initialized done, rank=0, size=2, store_type=1
[2025-12-03 11:53:33,058] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
valid_optimizer_list= [<paddle.distributed.fleet.meta_optimizers.ps_optimizer.ParameterServerOptimizer object at 0x7f16e78c4be0>]
meta_optimizer= <paddle.distributed.fleet.meta_optimizers.ps_optimizer.ParameterServerOptimizer object at 0x7f16e78c4be0>
graph_optimizer= None
is_heter_ps_mode in distributed_ops_pass False?
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
ShowClickEntry not configured, will not use
debug zcb slots: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
remote_optimize_vars: ['linear_1.w_0', 'linear_2.w_0@GRAD', 'linear_3.b_0', 'linear_0.w_0@GRAD', 'embedding', 'linear_3.w_0', 'linear_4.b_0', 'linear_2.w_0', 'linear_3.b_0@GRAD', 'linear_4.b_0@GRAD', 'linear_0.b_0', 'linear_3.w_0@GRAD', 'linear_0.b_0@GRAD', 'linear_0.w_0', 'linear_1.w_0@GRAD', 'linear_4.w_0', 'learning_rate_0', 'linear_2.b_0@GRAD', 'linear_1.b_0@GRAD', 'linear_2.b_0', 'linear_1.b_0', 'linear_4.w_0@GRAD', 'embedding@GRAD'], remote_optimize_op_role_vars: ['linear_1.w_0', 'linear_2.w_0@GRAD', 'linear_3.b_0', 'linear_0.w_0@GRAD', 'embedding', 'linear_3.w_0', 'linear_4.b_0', 'linear_2.w_0', 'linear_3.b_0@GRAD', 'linear_4.b_0@GRAD', 'linear_0.b_0', 'linear_3.w_0@GRAD', 'linear_0.b_0@GRAD', 'linear_0.w_0', 'linear_1.w_0@GRAD', 'linear_4.w_0', 'linear_2.b_0@GRAD', 'linear_1.b_0@GRAD', 'linear_2.b_0', 'linear_1.b_0', 'linear_4.w_0@GRAD', 'embedding@GRAD'], local_optimize_vars: []
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
paddle.static.default_startup_program: <function default_startup_program at 0x7f16f0c38310>
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
fl-ps > local_sparse: [], remote_sparse: []
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
idx, name, ctx: 0 embedding@GRAD varname: embedding@GRAD trainer_id: 0 table_id: 0slice varname: embedding.block0 ep: 127.0.0.1:6071 section: 0 origin varnames: embedding@GRAD aggregation->add: 1 is_sparse: 1 is_distributed: 1
table_id: 0
program_id: 139736505293296
is_tensor_table: 0
is_datanorm_table: 0

idx, name, ctx: 1 Dense@GRAD_1 varname: Dense@GRAD_1 trainer_id: 0 table_id: 1slice varname: Dense@GRAD_1 ep: 127.0.0.1:6071 section: 430815 origin varnames: linear_0.b_0@GRAD linear_0.w_0@GRAD linear_1.b_0@GRAD linear_1.w_0@GRAD linear_2.b_0@GRAD linear_2.w_0@GRAD linear_3.b_0@GRAD linear_3.w_0@GRAD linear_4.b_0@GRAD linear_4.w_0@GRAD aggregation->add: 1 is_sparse: 0 is_distributed: 0
table_id: 1
program_id: 139736505293296
is_tensor_table: 0
is_datanorm_table: 0

Wait 10s for PS to start...
I1203 11:53:43.298707 142 program_interpreter.cc:212] New Executor is Running.
adam_d2sum: False
new table_name: embedding
/opt/paddle-env/lib/python3.10/site-packages/paddle/distributed/ps/the_one_ps.py:739: UserWarning: The PS mode must use MemorySparseTable.
warnings.warn("The PS mode must use MemorySparseTable.")
/opt/paddle-env/lib/python3.10/site-packages/paddle/distributed/ps/the_one_ps.py:750: UserWarning: The shard_num of sparse table is not set, use default value 1000 in cpups.
warnings.warn(
/opt/paddle-env/lib/python3.10/site-packages/paddle/distributed/ps/the_one_ps.py:772: UserWarning: The accessor of sparse table is not set, use default value.
warnings.warn(
new var: persist trainable param embedding : SELECTED_ROWS.shape(1024, 10).dtype(float32).stop_gradient(False), 10, 12
adam_d2sum: False
new table_name: embedding
new var: persist trainable param embedding : SELECTED_ROWS.shape(1024, 10).dtype(float32).stop_gradient(False), 10, 12
adam_d2sum: False
adam_d2sum: False
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
communicator config: {'communicator_max_merge_var_num': '1', 'communicator_send_queue_size': '1', 'communicator_independent_recv_thread': '1', 'communicator_min_send_grad_num_before_recv': '1', 'communicator_thread_pool_size': '5', 'communicator_send_wait_times': '5', 'communicator_is_sgd_optimizer': '1'}
I1203 11:53:43.522832 142 server.cpp:1107] Server[paddle::distributed::DownpourPsClientService] is serving on port=8500.
I1203 11:53:43.522850 142 server.cpp:1110] Check out http://paddle-ps-5-1764756425-worker-0:8500 in web browser.
I1203 11:53:43.522902 142 brpc_ps_client.cc:131] BrpcPsClient Service addr: 10.250.0.108, 8500, 0
fl-ps > trainer_endpoint: 10.250.0.108:6071
fl-ps > with_coordinator? False
fl-ps > coordinator addr: []
I1203 11:53:43.833642 142 brpc_ps_client.cc:200] Client connect success:10.250.0.108:8500,10.250.1.172:8500,
create c2c connection done
entering self._init_all_params()
create DownpourLiteWorker
device worker program id: 139736351155072
device worker program_configs: {'139736351155072': {'pull_dense': [1], 'push_dense': [1], 'pull_sparse': [], 'push_sparse': []}}
device worker 139736351155072 139736351155072
device worker pull dense: [1]
device worker dense_table_config: {1: ['linear_0.b_0', 'linear_0.w_0', 'linear_1.b_0', 'linear_1.w_0', 'linear_2.b_0', 'linear_2.w_0', 'linear_3.b_0', 'linear_3.w_0', 'linear_4.b_0', 'linear_4.w_0']}
I1203 11:53:43.913957 142 fleet.cc:39] RegisterHeterCallback support later
I1203 11:53:43.915089 200 hogwild_worker.cc:1103] device id=0, total param count=295, persist count=20, param=19, fp16=0, share=19, reset=0, pinned=0, resize_var=0, need copy param count=0, delete vars count=0
I1203 11:53:43.917484 200 hogwild_worker.cc:921] device id=0, total op count=141, create op count=141, skip vars count=1, unused vars op count=118, offload op count=0, offload input count=0, cast count=0


C++ Traceback (most recent call last):

0 phi::ThreadPool::TaskLoop()
1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)
2 paddle::framework::HogwildWorker::CreateDeviceResource(paddle::framework::ProgramDesc const&)
3 phi::GPUContext::stream() const


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: *** Aborted at 1764762823 (unix time) try "date -d @1764762823" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x18) received by PID 142 (TID 0x7f165ffff640) from PID 24 ***]

/opt/paddle-env/lib/python3.10/site-packages/paddle/base/framework.py:688: UserWarning: You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default.
warnings.warn(
[2025-12-03 11:53:51,057] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
fl-ps > coordinator address is null!
Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.250.1.253', 'http.port': '6767', 'store.prefix': '', 'start_http_server': False, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7f8b9e1658a0>}
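
Possibly related: the first warning in each worker's log says the CUDA device is not set, yet the crashing frame is phi::GPUContext::stream() inside HogwildWorker::CreateDeviceResource, i.e. the worker still reaches for a GPU context. A hedged sketch of one thing that might be worth trying (an assumption, not a confirmed fix): pick the executor place explicitly in the worker branch instead of relying on the default.

```python
import os
import paddle

# Assumption, not a confirmed fix: make the worker's device explicit so the
# GPU build never ends up on an unset CUDA device.
if os.environ.get("CUDA_VISIBLE_DEVICES"):
    place = paddle.CUDAPlace(0)  # first visible GPU
else:
    place = paddle.CPUPlace()    # explicit CPU fallback

exe = paddle.static.Executor(place)
```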

Additional Supplementary Information

No response
