🐛 Bug
Trying to update the libtpu nightly to anything after 04/25 hangs the TPU tests.
To Reproduce
1. Set the libtpu nightly version to one dated after 04/25 in the setup.py file.
2. Run python test/test_operations.py.
3. The tests hang.
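For concreteness, here is a minimal sketch of the kind of pin change in step 1. The variable names and wheel URL below are illustrative assumptions, not the actual contents of torch_xla's setup.py:

```python
# Hypothetical sketch of the libtpu nightly pin; variable names and the
# storage URL are assumptions for illustration, not the real setup.py contents.
_libtpu_version = '0.1.dev20230426'  # any nightly dated after 04/25 reproduces the hang
_libtpu_storage_path = (
    'https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/wheels/libtpu-nightly/'
    f'libtpu_nightly-{_libtpu_version}-py3-none-any.whl'
)
```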
Expected behavior
Tests should pass.
Environment
Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
torch_xla version: 04/25 nightly
Additional context
Libtpu flipped the flag --xla_tpu_use_enhanced_launch_barrier to default to true. This flag ensures that every device the pjrt_executable is compiled for is executing the same code by doing an allreduce on the run_id.
I think that when running Compile we use all the available PjRt devices to compile (see xla/torch_xla/csrc/runtime/pjrt_computation_client.cc, line 618 at 02c0ed9).
When executing the computation, the barrier probably expects all of those devices to be running the same computation because of that device assignment. Filing this issue to verify and fix it.
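To make the suspected failure mode concrete, here is a conceptual sketch in plain Python (threading, not the PjRt or libtpu API) of why an allreduce-style launch barrier hangs when the executable's device assignment covers more devices than actually call Execute. The device counts and names are assumptions for illustration only:

```python
import threading

# Conceptual sketch of the suspected hang: the enhanced launch barrier acts
# like a rendezvous across every device in the executable's device assignment,
# but only the devices that actually execute ever arrive at it.
NUM_ASSIGNED_DEVICES = 4    # assumption: executable compiled for 4 devices
NUM_EXECUTING_DEVICES = 1   # assumption: only 1 device runs this computation

barrier = threading.Barrier(NUM_ASSIGNED_DEVICES)

def execute_on_device(device_id: int, run_id: int) -> None:
    # Each executing device waits until all assigned devices report the same run_id.
    print(f"device {device_id}: waiting at launch barrier for run_id={run_id}")
    barrier.wait()  # never returns: the other 3 assigned devices never arrive
    print(f"device {device_id}: passed barrier, running computation")

threads = [
    threading.Thread(target=execute_on_device, args=(d, 0), daemon=True)
    for d in range(NUM_EXECUTING_DEVICES)
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)  # times out, mirroring the hang seen in test_operations.py

print(f"{barrier.n_waiting} device(s) still stuck at the barrier -> hang")
```

Running this prints the waiting message and then the stuck-at-barrier line after the timeout, which matches the symptom of test_operations.py hanging indefinitely.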