AssertionError: llama-2-7b/tokenizer.model torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/rong/python/ai_project/llama/./example_text_completion.py", line 69, in <module>
    fire.Fire(main)
  File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/rong/python/ai_project/llama/./example_text_completion.py", line 32, in main
    generator = Llama.build(
  File "/home/rong/python/ai_project/llama/llama/generation.py", line 116, in build
    tokenizer = Tokenizer(model_path=tokenizer_path)
  File "/home/rong/python/ai_project/llama/llama/tokenizer.py", line 24, in __init__
    assert os.path.isfile(model_path), model_path
AssertionError: llama-2-7b/tokenizer.model
[2024-03-28 15:45:33,315] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/rong/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./example_text_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-28_15:45:33
  host      : pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4799)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

This error occurred while running Python code, specifically while trying to build a model called "Llama". It touches the model-parallel, distributed data parallel (DDP), and pipeline-parallel initialization, as well as the construction of the model and the initialization of its tokenizer.
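From the traceback, the failure happens inside Tokenizer.__init__ in llama/tokenizer.py. A condensed sketch of that code path, paraphrased from Meta's llama repo (details may differ between versions):

import os
from sentencepiece import SentencePieceProcessor

class Tokenizer:
    def __init__(self, model_path: str):
        # The failing line: if model_path is not an existing file, the
        # assertion fires and uses the path itself as the error message.
        assert os.path.isfile(model_path), model_path
        # Otherwise the SentencePiece model is loaded from that file.
        self.sp_model = SentencePieceProcessor(model_file=model_path)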

Based on the error messages, the problems are in the following places:

  1. AssertionError: llama-2-7b/tokenizer.model: this line tells you that when the code checked whether the file llama-2-7b/tokenizer.model exists, it found that it does not. The line assert os.path.isfile(model_path), model_path means: if model_path (here llama-2-7b/tokenizer.model) is not a file, raise an AssertionError whose message is the path itself.

Solution: check whether llama-2-7b/tokenizer.model really exists on your file system and whether its path is specified correctly. If the file exists but the path is wrong, change the tokenizer_path argument so that it points at the correct location; if the file does not exist, you may need to download or generate it. A quick pre-flight check is sketched below.
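A minimal sketch of such a check, using the paths from the failing run; note that relative paths are resolved against the directory you launch torchrun from:

import os

# Paths from the failing run; adjust to wherever you downloaded the weights.
ckpt_dir = "llama-2-7b"
tokenizer_path = "llama-2-7b/tokenizer.model"

print("cwd:", os.getcwd())
print("checkpoint dir exists:", os.path.isdir(ckpt_dir))
print("tokenizer exists:", os.path.isfile(tokenizer_path))

Note that in Meta's llama repo, download.sh typically places tokenizer.model at the repository root rather than inside llama-2-7b/, so passing --tokenizer_path tokenizer.model on the torchrun command line is often the entire fix.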

  2. torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3: this error means that the worker process launched through PyTorch's elastic multiprocessing API failed. It is most likely a consequence of the AssertionError above, because building and initializing the model failed inside that process.

Solution: first make sure the llama-2-7b/tokenizer.model file exists and its path is correct. Then check that PyTorch and torch.distributed.elastic are installed correctly and are version-compatible; you may need to upgrade or downgrade PyTorch. Finally, make sure the distributed settings in your code (number of devices, each device's local rank, and so on) are correct. To get the real traceback out of the failing worker, see the sketch after this paragraph.
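The last line of the launcher output ("traceback : To enable traceback see: ...") refers to PyTorch's elastic error handling. A minimal sketch, assuming you can edit example_text_completion.py, that makes the worker record its real traceback in an error file instead of reporting only the bare exitcode:

import fire
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main(ckpt_dir: str, tokenizer_path: str, **kwargs):
    ...  # build the generator and run the completions as before

if __name__ == "__main__":
    fire.Fire(main)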

In short, check the model path, the file's existence, and your PyTorch version and distributed settings to resolve this error.
