Description
Describe the Bug
Environment:
paddlepaddle-gpu 2.6.2
In PS mode with a 1p1w or 1p2w layout (one pserver plus one or two workers), the worker fails with an error right after startup, on both CPU and GPU. I verified that the same example runs successfully on the CPU build of PaddlePaddle 2.6.2.
The same launch command is used in all cases:
fleetrun --master=10.250.1.253:8090 --servers=10.250.1.253:6070 --trainers=10.250.0.108:6071,10.250.1.172:6071 /workspace/train.py --lr 0.01
Training code:
PaddleFleetX/tree/old_develop/examples/wide_and_deep_datase
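For context, the PS-mode entry point that this example follows looks roughly like the sketch below. This is a minimal reconstruction of the standard fleet parameter-server workflow with a stand-in model, not a copy of the example's actual code:

```python
import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init()  # roles are read from the env vars that fleetrun sets

# Tiny stand-in model (the real example builds a wide&deep net here).
x = paddle.static.data(name="x", shape=[None, 13], dtype="float32")
y = paddle.static.data(name="y", shape=[None, 1], dtype="float32")
pred = paddle.static.nn.fc(x, size=1)
loss = paddle.mean(paddle.nn.functional.square_error_cost(pred, y))

strategy = fleet.DistributedStrategy()
strategy.a_sync = True  # asynchronous parameter-server training

optimizer = paddle.optimizer.Adam(learning_rate=0.01)
optimizer = fleet.distributed_optimizer(optimizer, strategy)
optimizer.minimize(loss)

if fleet.is_server():
    fleet.init_server()
    fleet.run_server()  # blocks, serving parameters
elif fleet.is_worker():
    exe = paddle.static.Executor(paddle.CPUPlace())
    exe.run(paddle.static.default_startup_program())
    fleet.init_worker()
    # ... feed data and run the main program in a training loop ...
    fleet.stop_worker()
```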
Error message:
/opt/paddle-env/lib/python3.10/site-packages/paddle/base/framework.py:688: UserWarning: You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default.
warnings.warn(
[2025-12-03 11:53:27,783] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
fl-ps > coordinator address is null!
Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.250.1.253', 'http.port': '6767', 'store.prefix': '', 'start_http_server': False, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7f16e86598d0>}
I1203 11:53:32.963233 142 gloo_wrapper.cc:355] gloo initialized done, rank=0, size=2, store_type=1
[2025-12-03 11:53:33,058] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
valid_optimizer_list= [<paddle.distributed.fleet.meta_optimizers.ps_optimizer.ParameterServerOptimizer object at 0x7f16e78c4be0>]
meta_optimizer= <paddle.distributed.fleet.meta_optimizers.ps_optimizer.ParameterServerOptimizer object at 0x7f16e78c4be0>
graph_optimizer= None
is_heter_ps_mode in distributed_ops_pass False?
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
ShowClickEntry not configured, will not use
debug zcb slots: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
remote_optimize_vars: ['linear_1.w_0', 'linear_2.w_0@GRAD', 'linear_3.b_0', 'linear_0.w_0@GRAD', 'embedding', 'linear_3.w_0', 'linear_4.b_0', 'linear_2.w_0', 'linear_3.b_0@GRAD', 'linear_4.b_0@GRAD', 'linear_0.b_0', 'linear_3.w_0@GRAD', 'linear_0.b_0@GRAD', 'linear_0.w_0', 'linear_1.w_0@GRAD', 'linear_4.w_0', 'learning_rate_0', 'linear_2.b_0@GRAD', 'linear_1.b_0@GRAD', 'linear_2.b_0', 'linear_1.b_0', 'linear_4.w_0@GRAD', 'embedding@GRAD'], remote_optimize_op_role_vars: ['linear_1.w_0', 'linear_2.w_0@GRAD', 'linear_3.b_0', 'linear_0.w_0@GRAD', 'embedding', 'linear_3.w_0', 'linear_4.b_0', 'linear_2.w_0', 'linear_3.b_0@GRAD', 'linear_4.b_0@GRAD', 'linear_0.b_0', 'linear_3.w_0@GRAD', 'linear_0.b_0@GRAD', 'linear_0.w_0', 'linear_1.w_0@GRAD', 'linear_4.w_0', 'linear_2.b_0@GRAD', 'linear_1.b_0@GRAD', 'linear_2.b_0', 'linear_1.b_0', 'linear_4.w_0@GRAD', 'embedding@GRAD'], local_optimize_vars: []
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
paddle.static.default_startup_program: <function default_startup_program at 0x7f16f0c38310>
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
fl-ps > local_sparse: [], remote_sparse: []
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
idx, name, ctx: 0 embedding@GRAD varname: embedding@GRAD trainer_id: 0 table_id: 0slice varname: embedding.block0 ep: 127.0.0.1:6071 section: 0 origin varnames: embedding@GRAD aggregation->add: 1 is_sparse: 1 is_distributed: 1
table_id: 0
program_id: 139736505293296
is_tensor_table: 0
is_datanorm_table: 0
idx, name, ctx: 1 Dense@GRAD_1 varname: Dense@GRAD_1 trainer_id: 0 table_id: 1slice varname: Dense@GRAD_1 ep: 127.0.0.1:6071 section: 430815 origin varnames: linear_0.b_0@GRAD linear_0.w_0@GRAD linear_1.b_0@GRAD linear_1.w_0@GRAD linear_2.b_0@GRAD linear_2.w_0@GRAD linear_3.b_0@GRAD linear_3.w_0@GRAD linear_4.b_0@GRAD linear_4.w_0@GRAD aggregation->add: 1 is_sparse: 0 is_distributed: 0
table_id: 1
program_id: 139736505293296
is_tensor_table: 0
is_datanorm_table: 0
Wait 10s for PS to start...
I1203 11:53:43.298707 142 program_interpreter.cc:212] New Executor is Running.
adam_d2sum: False
new table_name: embedding
/opt/paddle-env/lib/python3.10/site-packages/paddle/distributed/ps/the_one_ps.py:739: UserWarning: The PS mode must use MemorySparseTable.
warnings.warn("The PS mode must use MemorySparseTable.")
/opt/paddle-env/lib/python3.10/site-packages/paddle/distributed/ps/the_one_ps.py:750: UserWarning: The shard_num of sparse table is not set, use default value 1000 in cpups.
warnings.warn(
/opt/paddle-env/lib/python3.10/site-packages/paddle/distributed/ps/the_one_ps.py:772: UserWarning: The accessor of sparse table is not set, use default value.
warnings.warn(
new var: persist trainable param embedding : SELECTED_ROWS.shape(1024, 10).dtype(float32).stop_gradient(False), 10, 12
adam_d2sum: False
new table_name: embedding
new var: persist trainable param embedding : SELECTED_ROWS.shape(1024, 10).dtype(float32).stop_gradient(False), 10, 12
adam_d2sum: False
adam_d2sum: False
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
is_heter_ps_mode? False
public get_the_one_send_context sparse: embedding@GRAD ['embedding.block0'] [0, 10]
communicator config: {'communicator_max_merge_var_num': '1', 'communicator_send_queue_size': '1', 'communicator_independent_recv_thread': '1', 'communicator_min_send_grad_num_before_recv': '1', 'communicator_thread_pool_size': '5', 'communicator_send_wait_times': '5', 'communicator_is_sgd_optimizer': '1'}
I1203 11:53:43.522832 142 server.cpp:1107] Server[paddle::distributed::DownpourPsClientService] is serving on port=8500.
I1203 11:53:43.522850 142 server.cpp:1110] Check out http://paddle-ps-5-1764756425-worker-0:8500 in web browser.
I1203 11:53:43.522902 142 brpc_ps_client.cc:131] BrpcPsClient Service addr: 10.250.0.108, 8500, 0
fl-ps > trainer_endpoint: 10.250.0.108:6071
fl-ps > with_coordinator? False
fl-ps > coordinator addr: []
I1203 11:53:43.833642 142 brpc_ps_client.cc:200] Client connect success:10.250.0.108:8500,10.250.1.172:8500,
create c2c connection done
entering self._init_all_params()
create DownpourLiteWorker
device worker program id: 139736351155072
device worker program_configs: {'139736351155072': {'pull_dense': [1], 'push_dense': [1], 'pull_sparse': [], 'push_sparse': []}}
device worker 139736351155072 139736351155072
device worker pull dense: [1]
device worker dense_table_config: {1: ['linear_0.b_0', 'linear_0.w_0', 'linear_1.b_0', 'linear_1.w_0', 'linear_2.b_0', 'linear_2.w_0', 'linear_3.b_0', 'linear_3.w_0', 'linear_4.b_0', 'linear_4.w_0']}
I1203 11:53:43.913957 142 fleet.cc:39] RegisterHeterCallback support later
I1203 11:53:43.915089 200 hogwild_worker.cc:1103] device id=0, total param count=295, persist count=20, param=19, fp16=0, share=19, reset=0, pinned=0, resize_var=0, need copy param count=0, delete vars count=0
I1203 11:53:43.917484 200 hogwild_worker.cc:921] device id=0, total op count=141, create op count=141, skip vars count=1, unused vars op count=118, offload op count=0, offload input count=0, cast count=0
C++ Traceback (most recent call last):
0 phi::ThreadPool::TaskLoop()
1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)
2 paddle::framework::HogwildWorker::CreateDeviceResource(paddle::framework::ProgramDesc const&)
3 phi::GPUContext::stream() const
Error Message Summary:
FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: *** Aborted at 1764762823 (unix time) try "date -d @1764762823" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x18) received by PID 142 (TID 0x7f165ffff640) from PID 24 ***]
/opt/paddle-env/lib/python3.10/site-packages/paddle/base/framework.py:688: UserWarning: You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default.
warnings.warn(
[2025-12-03 11:53:51,057] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
fl-ps > coordinator address is null!
Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.250.1.253', 'http.port': '6767', 'store.prefix': '', 'start_http_server': False, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7f8b9e1658a0>}
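The C++ traceback points at `phi::GPUContext::stream()` being hit inside `HogwildWorker::CreateDeviceResource`, even though the earlier warning says the CUDA device is not set and the CPU device will be used. A quick way to confirm the device state the worker actually sees (a minimal diagnostic sketch I added, using standard Paddle device APIs, not part of the example):

```python
import paddle

# Report the installed wheel and the device Paddle will actually use.
print("paddle version:", paddle.version.full_version)
print("compiled with CUDA:", paddle.device.is_compiled_with_cuda())
if paddle.device.is_compiled_with_cuda():
    # 0 here would explain the "CUDA device is not set properly" warning.
    print("visible CUDA devices:", paddle.device.cuda.device_count())
print("current device:", paddle.device.get_device())
```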
Additional Supplementary Information
No response