Fix TestLoRAFinalCheckpoints test class #498

ebsmothers · 2024-03-14T03:55:56Z

This test class has a number of problems. In no particular order:

The enable_fsdp=True case is not working properly: since we are on a single device, we are actually testing the NO_SHARD sharding strategy and not FULL_SHARD.
The full_bf16=True case is probably adding more complexity than value.
The way format strings are built and modified for the different tune commands is unclear and poorly documented.
(Part of a larger issue) We should split the test_lora_finetune.py test file into single device and distributed files to align with the split in the recipe files.
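
To illustrate the first point, here is a minimal sketch of why a single-device run cannot cover FULL_SHARD. It assumes, as described above, that a FULL_SHARD request is effectively treated as NO_SHARD when the world size is 1. The helper names (effective_sharding_strategy, covers_full_shard) are hypothetical and are not part of torchtune or PyTorch.

```python
# Hypothetical helpers, for illustration only; not torchtune or PyTorch APIs.
FULL_SHARD = "FULL_SHARD"
NO_SHARD = "NO_SHARD"


def effective_sharding_strategy(requested: str, world_size: int) -> str:
    """Return the strategy a run actually exercises.

    Assumption: with a single rank there is nothing to shard across,
    so any requested strategy behaves like NO_SHARD.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if world_size == 1:
        return NO_SHARD
    return requested


def covers_full_shard(enable_fsdp: bool, world_size: int) -> bool:
    """True only if the test configuration really exercises FULL_SHARD."""
    if not enable_fsdp:
        return False
    return effective_sharding_strategy(FULL_SHARD, world_size) == FULL_SHARD
```

Under this assumption, covers_full_shard(True, 1) is False, so the enable_fsdp=True case only becomes meaningful once the test runs on at least two devices.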

rohan-varma · 2024-03-14T12:44:43Z

We should also definitely enable multi-GPU CI now that we're in the pytorch repo and these runners are available to us: #500

kartikayk · 2024-03-14T14:12:41Z

BTW @ebsmothers I'm also getting "duplicate distributed initialized" error for this test. I thought that got fixed?

ebsmothers · 2024-03-14T21:12:52Z

> BTW @ebsmothers I'm also getting "duplicate distributed initialized" error for this test. I thought that got fixed?

@kartikayk it should be fixed. Can you share the command you're running? And is it the same error as before, the one about trying to initialize a process group that has already been initialized?

ebsmothers · 2024-03-29T20:49:29Z

Fixed in #537

ebsmothers self-assigned this Mar 14, 2024

ebsmothers mentioned this issue Mar 14, 2024

Update Checkpointing to support Adapter Weights #494

Merged

ebsmothers closed this as completed Mar 29, 2024
