
Problems Encountered During Distributed Fine-Tuning

Error: Duplicate GPU detected : rank 0 and rank 2 both on CUDA device 73010

INFO     | 2025-03-20 02:38:27 | finetune.tokenizer:get_tokenizer:99 - using chat template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\n'}}{% endif %}
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1545: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[rank0]:[W320 02:38:27.572954504 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
INFO | 2025-03-20 02:38:27 | top.http:post:28 - request: {"JobID": "job-id-cva3188ao6fipn61caug", "JobExecutionID": "job-exec-id-cvdnuo7q2j7s71lugp10", "Status": 9, "Reason": "finetune training error: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5\nncclInvalidUsage: This usually reflects invalid usage of NCCL library.\nLast error:\nDuplicate GPU detected : rank 0 and rank 2 both on CUDA device 73010", "Top": {"AccountId": 1000000000}}
[rank0]: Traceback (most recent call last):
[rank0]:   File "/app/main.py", line 205, in <module>
[rank0]:     main()
[rank0]:   File "/app/main.py", line 199, in main
[rank0]:     raise e
[rank0]:   File "/app/main.py", line 165, in main
[rank0]:     trainer.train(params)
[rank0]:   File "/app/finetune/sft.py", line 86, in train
[rank0]:     with TrainingArguments(**training_args).main_process_first(local=False):
[rank0]:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
[rank0]:     next(self.gen)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 2442, in main_process_first
[rank0]:     dist.barrier()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank0]:     work = group.barrier(opts=opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Duplicate GPU detected : rank 0 and rank 2 both on CUDA device 73010

Cause: the Master and Worker nodes were configured with different numbers of GPUs per node. After changing both to the same GPU configuration, the job ran successfully. The two specifications are listed below, followed by a sketch of explicit rank-to-GPU binding.

Specification that triggered the error (mismatched GPU counts)
Master
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 2
Worker
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 4; replicas - 1

Specification after the change (identical GPU counts)
Master
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 2
Worker
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 2; replicas - 1
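The "Duplicate GPU detected" error means two ranks were mapped to the same CUDA device, which can happen when the launcher assumes a uniform GPU count per node but the nodes differ, as they did here. Besides keeping the Master and Worker GPU counts identical, the ProcessGroupNCCL warning in the log above also suggests binding each rank to its device explicitly. A minimal sketch, assuming a torchrun-style launcher that exports LOCAL_RANK (the variable name and the timeout value are assumptions, not taken from this job):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Assumption: the launcher (e.g. torchrun) sets LOCAL_RANK for every process.
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

# Pin this rank to exactly one local GPU before any NCCL collective runs,
# so two ranks on the same node cannot end up on the same device.
torch.cuda.set_device(device)

# Passing device_id removes the "using GPU 0 to perform barrier" ambiguity
# called out in the ProcessGroupNCCL warning above.
dist.init_process_group(backend="nccl", device_id=device, timeout=timedelta(minutes=30))

# The barrier now runs on this rank's bound device rather than a guessed one.
dist.barrier(device_ids=[local_rank])
```

With identical per-node GPU counts and explicit device binding, rank 0 and rank 2 can no longer collide on one device the way the traceback shows.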

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.

[rank0]:     raise e
[rank0]:   File "/app/main.py", line 165, in main
[rank0]:     trainer.train(params)
[rank0]:   File "/app/finetune/sft.py", line 173, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 440, in train
[rank0]:     output = super().train(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2052, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2388, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3518, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2188, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank0]:     self.engine.backward(loss, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank0]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2247, in backward
[rank0]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]:     scaled_loss.backward(retain_graph=retain_graph)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 626, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 307, in apply
[rank0]:     return user_fn(self, *args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 549, in decorate_bwd
[rank0]:     return bwd(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/linear.py", line 80, in backward
[rank0]:     input, weight, bias = ctx.saved_tensors
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 1129, in unpack_hook
[rank0]:     frame.check_recomputed_tensors_match(gid)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 903, in check_recomputed_tensors_match
[rank0]:     raise CheckpointError(
[rank0]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
[rank0]: tensor at position 4:
[rank0]: saved metadata: {'shape': torch.Size([5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 6:
[rank0]: saved metadata: {'shape': torch.Size([5120, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 7:
[rank0]: saved metadata: {'shape': torch.Size([5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 14:
[rank0]: saved metadata: {'shape': torch.Size([1024, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 15:
[rank0]: saved metadata: {'shape': torch.Size([1024]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 22:
[rank0]: saved metadata: {'shape': torch.Size([1024, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 23:
[rank0]: saved metadata: {'shape': torch.Size([1024]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 41:
[rank0]: saved metadata: {'shape': torch.Size([5120, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 51:
[rank0]: saved metadata: {'shape': torch.Size([5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 53:
[rank0]: saved metadata: {'shape': torch.Size([27648, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 61:
[rank0]: saved metadata: {'shape': torch.Size([27648, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 70:
[rank0]: saved metadata: {'shape': torch.Size([5120, 27648]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}

0%| | 0/43 [00:15<?, ?it/s]
[HAMI-core Msg(105:139861200876544:multiprocess_memory_limit.c:468)]: Calling exit handler 105
[HAMI-core Msg(104:140546317530112:multiprocess_memory_limit.c:468)]: Calling exit handler 104
[HAMI-core Msg(103:140031096132608:multiprocess_memory_limit.c:468)]: Calling exit handler 103
[HAMI-core Msg(104:140633396208640:multiprocess_memory_limit.c:468)]: Calling exit handler 104
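For the CheckpointError above, the traceback shows the failure while gradient checkpointing recomputes activations inside DeepSpeed ZeRO-3's backward pass: every recomputed tensor reports shape torch.Size([0]), consistent with ZeRO-3's partitioned parameter placeholders, while the forward pass saved the fully gathered shapes (5120, 27648, 1024, ...). A hedged sketch of how the relevant Trainer options can be made explicit is below; the use_reentrant choice and the file paths are assumptions for illustration, not settings confirmed by this log:

```python
from transformers import TrainingArguments

# Assumption: Hugging Face Trainer + DeepSpeed ZeRO-3, as the traceback above indicates.
training_args = TrainingArguments(
    output_dir="./output",                 # hypothetical path
    per_device_train_batch_size=1,
    bf16=True,                             # matches the bfloat16 tensors in the log
    eval_strategy="no",                    # per the FutureWarning above: eval_strategy replaces evaluation_strategy
    gradient_checkpointing=True,
    # Assumption: pinning the checkpoint implementation explicitly, rather than
    # relying on the library default (which has changed across versions), is a
    # common way to rule out a reentrant/non-reentrant mismatch behind this
    # class of recomputation error.
    gradient_checkpointing_kwargs={"use_reentrant": True},
    deepspeed="ds_zero3_config.json",      # hypothetical DeepSpeed ZeRO-3 config path
)
```

If the error persists with either checkpointing variant, temporarily disabling gradient_checkpointing (or moving to a lower ZeRO stage) is a quick way to confirm whether the checkpointing/ZeRO-3 interaction is the trigger.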