I really appreciate your work. While reproducing it, I ran into a dimension mismatch during evaluation: a tokenized prompt of 513 tokens appears where the code expects at most 512, and I am not sure how to resolve it. The full log follows.
```
[04:07:39.871757] Beginning evaluation of val_seen
Traceback (most recent call last):
File "/data1_8t/user/zr/codes/NavCoT/finetune_src/r2r/main.py", line 316, in
main()
File "/data1_8t/user/zr/codes/NavCoT/finetune_src/r2r/main.py", line 312, in main
valid(args, train_env, val_envs, rank=rank)
File "/data1_8t/user/zr/codes/NavCoT/finetune_src/r2r/main.py", line 273, in valid
agent.test(use_dropout=False, feedback='argmax', iters=iters)
File "/data1_8t/user/zr/codes/NavCoT/finetune_src/r2r/agent_cmt.py", line 1114, in test
super().test(iters=iters, rollout_function=self.rollout_llm)
File "/data1_8t/user/zr/codes/NavCoT/finetune_src/r2r/agent_base.py", line 42, in test
for traj in rollout_function(**kwargs):
File "/data1_8t/user/zr/codes/NavCoT/finetune_src/r2r/agent_cmt.py", line 1042, in rollout_llm
nav_output = self.llm.generate(nav_input["prompts"],images=None,max_gen_len=64,temperature=self.args.temperature)
File "/data1_8t/user/zr/miniconda3/envs/accessory/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data1_8t/user/zr/codes/NavCoT/LLaMA2-Accessory/accessory/model/meta.py", line 127, in generate
tokens[k, : len(t)] = torch.tensor(t).long()
File "/data1_8t/user/zr/miniconda3/envs/accessory/lib/python3.9/site-packages/torch/utils/_device.py", line 62, in torch_function
return func(*args, **kwargs)
RuntimeError: The expanded size of the tensor (512) must match the existing size (513) at non-singleton dimension 0. Target sizes: [512]. Tensor sizes: [513]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1516850) of binary: /data1_8t/user/zr/miniconda3/envs/accessory/bin/python3.9
Traceback (most recent call last):
File "/data1_8t/user/zr/miniconda3/envs/accessory/bin/torchrun", line 8, in
sys.exit(main())
File "/data1_8t/user/zr/miniconda3/envs/accessory/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/data1_8t/user/zr/miniconda3/envs/accessory/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/data1_8t/user/zr/miniconda3/envs/accessory/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/data1_8t/user/zr/miniconda3/envs/accessory/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/data1_8t/user/zr/miniconda3/envs/accessory/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
r2r/main.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-06-12_04:13:29
host : ubuntu
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1516850)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
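For context, here is a minimal sketch (outside the repo) that reproduces the same error. It looks like `meta.py` pre-allocates a generation buffer of `max_seq_len = 512` tokens, so any prompt that tokenizes to 513 or more tokens fails at the copy into that buffer. The variable names below are hypothetical; only the buffer size and the failing assignment pattern are taken from the traceback.

```python
import torch

max_seq_len = 512                 # buffer length apparently used in meta.py
max_gen_len = 64                  # as passed to self.llm.generate(...)

t = list(range(513))              # stand-in for a 513-token prompt
tokens = torch.full((1, max_seq_len), 0, dtype=torch.long)

# The failing line from meta.py (line 127) raises:
# RuntimeError: The expanded size of the tensor (512) must match
# the existing size (513) at non-singleton dimension 0.
tokens[0, : len(t)] = torch.tensor(t).long()
```

As a possible workaround (my assumption, not a confirmed fix), truncating the prompt so that the prompt length plus `max_gen_len` fits inside the buffer avoids the crash, though it may clip part of the instruction:

```python
budget = max_seq_len - max_gen_len      # room left for the prompt
if len(t) > budget:
    t = t[:budget]                      # or t[-budget:], depending on which end matters
tokens[0, : len(t)] = torch.tensor(t).long()   # now fits
```

The cleaner alternative is presumably to load the model with a larger `max_seq_len`, if the checkpoint supports it.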
Could you kindly spare some time to look into this? Thank you very much.