🐛 Bug
Trying to update the libtpu nightly to anything after 04/25 hangs the TPU tests.
To Reproduce
1. Set the libtpu nightly version to one dated after 04/25 in the setup.py file.
2. Run python test/test_operations.py.
3. The tests hang.
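For concreteness, here is a minimal sketch of the kind of pin change in step 1. The variable names and wheel URL below are illustrative assumptions, not the actual contents of torch_xla's setup.py:

```python
# Hypothetical sketch of the libtpu nightly pin; variable names and the
# storage URL are assumptions for illustration, not the real setup.py contents.
_libtpu_version = '0.1.dev20230426'  # any nightly dated after 04/25 reproduces the hang
_libtpu_storage_path = (
    'https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/wheels/libtpu-nightly/'
    f'libtpu_nightly-{_libtpu_version}-py3-none-any.whl'
)
```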
Expected behavior
Tests should pass.
Environment
Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
torch_xla version: 04/25 nightly
Additional context
Libtpu flipped the flag --xla_tpu_use_enhanced_launch_barrier to default to true. This flag ensures that every device the pjrt_executable is compiled for is executing the same code by doing an allreduce on the run_id.
I think that when running Compile we use all the available PjRt devices to compile (see xla/torch_xla/csrc/runtime/pjrt_computation_client.cc, line 618 at 02c0ed9).
When executing the computation, the barrier probably expects all of those devices to be running the same computation because of that device assignment. Filing this issue to verify and fix it.
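To make the suspected failure mode concrete, here is a conceptual sketch in plain Python (threading, not the PjRt or libtpu API) of why an allreduce-style launch barrier hangs when the executable's device assignment covers more devices than actually call Execute. The device counts and names are assumptions for illustration only:

```python
import threading

# Conceptual sketch of the suspected hang: the enhanced launch barrier acts
# like a rendezvous across every device in the executable's device assignment,
# but only the devices that actually execute ever arrive at it.
NUM_ASSIGNED_DEVICES = 4    # assumption: executable compiled for 4 devices
NUM_EXECUTING_DEVICES = 1   # assumption: only 1 device runs this computation

barrier = threading.Barrier(NUM_ASSIGNED_DEVICES)

def execute_on_device(device_id: int, run_id: int) -> None:
    # Each executing device waits until all assigned devices report the same run_id.
    print(f"device {device_id}: waiting at launch barrier for run_id={run_id}")
    barrier.wait()  # never returns: the other 3 assigned devices never arrive
    print(f"device {device_id}: passed barrier, running computation")

threads = [
    threading.Thread(target=execute_on_device, args=(d, 0), daemon=True)
    for d in range(NUM_EXECUTING_DEVICES)
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)  # times out, mirroring the hang seen in test_operations.py

print(f"{barrier.n_waiting} device(s) still stuck at the barrier -> hang")
```

Running this prints the waiting message and then the stuck-at-barrier line after the timeout, which matches the symptom of test_operations.py hanging indefinitely.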