
Problems Encountered During Distributed Fine-Tuning

Error: Duplicate GPU detected : rank 0 and rank 2 both on CUDA device 73010

INFO     | 2025-03-20 02:38:27 | finetune.tokenizer:get_tokenizer:99 - using chat template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|><think>\n'}}{% endif %}
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1545: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[rank0]:[W320 02:38:27.572954504 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
INFO | 2025-03-20 02:38:27 | top.http:post:28 - request: {"JobID": "job-id-cva3188ao6fipn61caug", "JobExecutionID": "job-exec-id-cvdnuo7q2j7s71lugp10", "Status": 9, "Reason": "finetune training error: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5\nncclInvalidUsage: This usually reflects invalid usage of NCCL library.\nLast error:\nDuplicate GPU detected : rank 0 and rank 2 both on CUDA device 73010", "Top": {"AccountId": 1000000000}}
[rank0]: Traceback (most recent call last):
[rank0]:   File "/app/main.py", line 205, in <module>
[rank0]:     main()
[rank0]:   File "/app/main.py", line 199, in main
[rank0]:     raise e
[rank0]:   File "/app/main.py", line 165, in main
[rank0]:     trainer.train(params)
[rank0]:   File "/app/finetune/sft.py", line 86, in train
[rank0]:     with TrainingArguments(**training_args).main_process_first(local=False):
[rank0]:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
[rank0]:     next(self.gen)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 2442, in main_process_first
[rank0]:     dist.barrier()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank0]:     work = group.barrier(opts=opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Duplicate GPU detected : rank 0 and rank 2 both on CUDA device 73010

Cause: the Master and Worker nodes were configured with different numbers of GPUs per node. After changing both to the same GPU configuration, the job ran successfully. The two specifications are listed below, followed by a sketch of explicit rank-to-GPU binding.

Specification that triggered the error (mismatched GPU counts)
Master
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 2
Worker
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 4; replicas - 1

Specification after the change (identical GPU counts)
Master
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 2
Worker
General compute: CPU - 10 cores, memory - 100 GiB; GPU: model - A800-SXM4-80GB, count - 2; replicas - 1
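The "Duplicate GPU detected" error means two ranks were mapped to the same CUDA device, which can happen when the launcher assumes a uniform GPU count per node but the nodes differ, as they did here. Besides keeping the Master and Worker GPU counts identical, the ProcessGroupNCCL warning in the log above also suggests binding each rank to its device explicitly. A minimal sketch, assuming a torchrun-style launcher that exports LOCAL_RANK (the variable name and the timeout value are assumptions, not taken from this job):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Assumption: the launcher (e.g. torchrun) sets LOCAL_RANK for every process.
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

# Pin this rank to exactly one local GPU before any NCCL collective runs,
# so two ranks on the same node cannot end up on the same device.
torch.cuda.set_device(device)

# Passing device_id removes the "using GPU 0 to perform barrier" ambiguity
# called out in the ProcessGroupNCCL warning above.
dist.init_process_group(backend="nccl", device_id=device, timeout=timedelta(minutes=30))

# The barrier now runs on this rank's bound device rather than a guessed one.
dist.barrier(device_ids=[local_rank])
```

With identical per-node GPU counts and explicit device binding, rank 0 and rank 2 can no longer collide on one device the way the traceback shows.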

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.

[rank0]:     raise e
[rank0]:   File "/app/main.py", line 165, in main
[rank0]:     trainer.train(params)
[rank0]:   File "/app/finetune/sft.py", line 173, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 440, in train
[rank0]:     output = super().train(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2052, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2388, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3518, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2188, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank0]:     self.engine.backward(loss, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank0]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2247, in backward
[rank0]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]:     scaled_loss.backward(retain_graph=retain_graph)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 626, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 307, in apply
[rank0]:     return user_fn(self, *args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 549, in decorate_bwd
[rank0]:     return bwd(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/linear.py", line 80, in backward
[rank0]:     input, weight, bias = ctx.saved_tensors
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 1129, in unpack_hook
[rank0]:     frame.check_recomputed_tensors_match(gid)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 903, in check_recomputed_tensors_match
[rank0]:     raise CheckpointError(
[rank0]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
[rank0]: tensor at position 4:
[rank0]: saved metadata: {'shape': torch.Size([5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 6:
[rank0]: saved metadata: {'shape': torch.Size([5120, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 7:
[rank0]: saved metadata: {'shape': torch.Size([5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 14:
[rank0]: saved metadata: {'shape': torch.Size([1024, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 15:
[rank0]: saved metadata: {'shape': torch.Size([1024]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 22:
[rank0]: saved metadata: {'shape': torch.Size([1024, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 23:
[rank0]: saved metadata: {'shape': torch.Size([1024]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 41:
[rank0]: saved metadata: {'shape': torch.Size([5120, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 51:
[rank0]: saved metadata: {'shape': torch.Size([5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 53:
[rank0]: saved metadata: {'shape': torch.Size([27648, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 61:
[rank0]: saved metadata: {'shape': torch.Size([27648, 5120]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 70:
[rank0]: saved metadata: {'shape': torch.Size([5120, 27648]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}

0%| | 0/43 [00:15<?, ?it/s]
[HAMI-core Msg(105:139861200876544:multiprocess_memory_limit.c:468)]: Calling exit handler 105
[HAMI-core Msg(104:140546317530112:multiprocess_memory_limit.c:468)]: Calling exit handler 104
[HAMI-core Msg(103:140031096132608:multiprocess_memory_limit.c:468)]: Calling exit handler 103
[HAMI-core Msg(104:140633396208640:multiprocess_memory_limit.c:468)]: Calling exit handler 104
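For the CheckpointError above, the traceback shows the failure while gradient checkpointing recomputes activations inside DeepSpeed ZeRO-3's backward pass: every recomputed tensor reports shape torch.Size([0]), consistent with ZeRO-3's partitioned parameter placeholders, while the forward pass saved the fully gathered shapes (5120, 27648, 1024, ...). A hedged sketch of how the relevant Trainer options can be made explicit is below; the use_reentrant choice and the file paths are assumptions for illustration, not settings confirmed by this log:

```python
from transformers import TrainingArguments

# Assumption: Hugging Face Trainer + DeepSpeed ZeRO-3, as the traceback above indicates.
training_args = TrainingArguments(
    output_dir="./output",                 # hypothetical path
    per_device_train_batch_size=1,
    bf16=True,                             # matches the bfloat16 tensors in the log
    eval_strategy="no",                    # per the FutureWarning above: eval_strategy replaces evaluation_strategy
    gradient_checkpointing=True,
    # Assumption: pinning the checkpoint implementation explicitly, rather than
    # relying on the library default (which has changed across versions), is a
    # common way to rule out a reentrant/non-reentrant mismatch behind this
    # class of recomputation error.
    gradient_checkpointing_kwargs={"use_reentrant": True},
    deepspeed="ds_zero3_config.json",      # hypothetical DeepSpeed ZeRO-3 config path
)
```

If the error persists with either checkpointing variant, temporarily disabling gradient_checkpointing (or moving to a lower ZeRO stage) is a quick way to confirm whether the checkpointing/ZeRO-3 interaction is the trigger.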