AssertionError: llama-2-7b/tokenizer.model torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/rong/python/ai_project/llama/./example_text_completion.py", line 69, in <module>
    fire.Fire(main)
  File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/rong/.local/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/rong/python/ai_project/llama/./example_text_completion.py", line 32, in main
    generator = Llama.build(
  File "/home/rong/python/ai_project/llama/llama/generation.py", line 116, in build
    tokenizer = Tokenizer(model_path=tokenizer_path)
  File "/home/rong/python/ai_project/llama/llama/tokenizer.py", line 24, in __init__
    assert os.path.isfile(model_path), model_path
AssertionError: llama-2-7b/tokenizer.model
[2024-03-28 15:45:33,315] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/rong/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/rong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./example_text_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-28_15:45:33
  host      : pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4799)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This error occurred while running Python code, specifically while trying to initialize a model called "Llama". It touches the model-parallel, distributed-data-parallel (DDP), and pipeline-parallel initialization, as well as building the model and initializing its tokenizer.
According to the error output, the problem appears in the following places:

AssertionError: llama-2-7b/tokenizer.model

This error means the code checked whether the file llama-2-7b/tokenizer.model exists and found that it does not. The failing line, assert os.path.isfile(model_path), model_path, raises an AssertionError whenever model_path (here, llama-2-7b/tokenizer.model) is not an existing file.

Solution: check that llama-2-7b/tokenizer.model actually exists on your file system and that the path is specified correctly. If the file exists but the path is wrong, change the tokenizer_path argument so it points to the correct file. If the file does not exist, you may need to download or generate it.
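Before launching, you can reproduce the same check yourself. A minimal sketch (the helper name below is mine, not part of the llama code) resolves the tokenizer path to an absolute path before asserting, so the failure message shows exactly where the file was expected:

```python
import os

def check_tokenizer_path(model_path):
    """Mirror the assertion in llama/tokenizer.py, but report the
    absolute path that was actually checked (helper name is hypothetical)."""
    resolved = os.path.abspath(model_path)
    if not os.path.isfile(resolved):
        raise FileNotFoundError(
            f"tokenizer model not found at {resolved}; "
            "pass a tokenizer_path that points at the downloaded tokenizer.model"
        )
    return resolved
```

Note that a relative path like llama-2-7b/tokenizer.model is resolved against the current working directory, so running the script from a different directory than the one containing llama-2-7b/ is a common cause of this assertion.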
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4799) of binary: /usr/bin/python3

This error indicates that a worker process launched through PyTorch's elastic multiprocessing API (torchrun) failed. Here it is most likely a consequence of the AssertionError above: the child process died while building and initializing the model, so the launcher reported exit code 1.

Solution: first make sure the llama-2-7b/tokenizer.model file exists and its path is correct. Then verify that your PyTorch installation includes torch.distributed.elastic and that the versions are compatible; you may need to upgrade or downgrade PyTorch. Finally, confirm that the distributed settings in your code (number of devices, each device's local rank, and so on) are correct.
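To inspect the per-worker distributed settings without importing torch at all, a small sketch (the function name is mine, not PyTorch's) can read the standard environment variables that torchrun exports to each worker, RANK, LOCAL_RANK, and WORLD_SIZE:

```python
import os

def read_torchrun_env(env=None):
    """Read the per-worker variables that torchrun exports.
    Falls back to single-process defaults when they are unset."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }

# With `torchrun --nproc_per_node 1`, the single worker should see
# rank 0, local_rank 0, world_size 1 - matching the log above.
print(read_torchrun_env({"RANK": "0", "LOCAL_RANK": "0", "WORLD_SIZE": "1"}))
```

If these values do not match what your launch command requests, the mismatch is in how torchrun was invoked, not in the model code.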
In short, to resolve this problem you need to check the model path, the existence of the tokenizer file, and your PyTorch version and distributed settings.