AssertionError: llama-2-7b/tokenizer.model torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/rong/python/ai_project/llama/./example_text_completion.py", line 69, in <module>
fire.Fire(main)
File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/rong/python/ai_project/llama/./example_text_completion.py", line 32, in main
generator = Llama.build(
File "/home/rong/python/ai_project/llama/llama/generation.py", line 116, in build
tokenizer = Tokenizer(model_path=tokenizer_path)
File "/home/rong/python/ai_project/llama/llama/tokenizer.py", line 24, in __init__
assert os.path.isfile(model_path), model_path
AssertionError: llama-2-7b/tokenizer.model
[2024-03-28 15:45:33,315] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/rong/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./example_text_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-28_15:45:33
host : pc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4799)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This error occurred while running Python code, specifically while trying to build a model called "Llama". The log covers model-parallel initialization, distributed data parallel (DDP) initialization, and pipeline-parallel initialization, followed by the model build and the tokenizer initialization, which is where the failure actually happens.
Based on the error message, the problem occurs in the following places:
AssertionError: llama-2-7b/tokenizer.model: this line means the code checked whether the file llama-2-7b/tokenizer.model exists and found that it does not. The line assert os.path.isfile(model_path), model_path raises an AssertionError, with the offending path as its message, whenever model_path (here llama-2-7b/tokenizer.model) is not a regular file.
Solution: check whether llama-2-7b/tokenizer.model actually exists on your filesystem and whether the path is specified correctly. If the file exists but the path is wrong, change the tokenizer_path argument in your command or code so it points at the correct location. If the file does not exist, you need to download or generate it first.
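Before launching under torchrun again, you can reproduce the same existence check the Tokenizer constructor performs, but with a more useful diagnostic. This is a minimal sketch (the helper name check_tokenizer is mine, not part of the llama codebase); printing the absolute path makes a wrong working directory immediately visible:

```python
import os

def check_tokenizer(tokenizer_path: str) -> str:
    """Return a human-readable diagnosis instead of an opaque AssertionError."""
    if os.path.isfile(tokenizer_path):
        return f"ok: {os.path.getsize(tokenizer_path)} bytes"
    # Absolute path makes a wrong working directory immediately obvious.
    return f"missing: {os.path.abspath(tokenizer_path)}"

# The exact path from the traceback above:
print(check_tokenizer("llama-2-7b/tokenizer.model"))
```

If this prints "missing", the problem is purely a path issue and torchrun, PyTorch versions, and so on are not to blame.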
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3: this error means the worker process launched through PyTorch's elastic multiprocessing API exited with a failure. Here it is a direct consequence of the AssertionError above: the child process died during model construction, so torchrun reports the child's non-zero exit code.
Solution: first make sure llama-2-7b/tokenizer.model exists and its path is correct. Then verify that your PyTorch installation is intact and that torch.distributed.elastic is available in a compatible version; you may need to upgrade or downgrade PyTorch. Finally, confirm that the distributed settings in your launch command (number of processes per node, local ranks, and so on) are correct.
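For reference, Meta's llama repository README launches this example roughly as follows. Note that the repo's download.sh places tokenizer.model in the download root, next to the llama-2-7b/ directory rather than inside it, which is a common cause of exactly this assertion; the flag values below are assumptions based on that default layout, so adjust them to wherever your files actually live:

```shell
# Launch-command sketch (paths assume download.sh's default layout):
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```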
In short, check the model path, the file's existence, and your PyTorch version and distributed settings to resolve this problem.