
Add support for standalone mode when default port is occupied on single node #3576


Merged
SunMarc merged 3 commits into huggingface:main from laitifranz:standalone-mode on May 20, 2025

Conversation

laitifranz
Contributor

What does this PR do?

This PR adds support for the --standalone mode in Accelerate, addressing issue #3175 and building upon the work in PR #3501 by @hellobiondi.
The documentation notes that setting the port to 0 will automatically select an available port. This PR instead leverages the built-in --standalone functionality of torch.distributed.run: when the default port is occupied, the argument is propagated to the underlying torch.distributed.run launcher rather than raising a connection error, reducing the likelihood of port conflicts when running multiple distributed jobs on the same machine.
The user is informed of the port conflict and the fallback to standalone mode, with guidance for future runs.
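
As a rough sketch only (the helper and argument names below are hypothetical and do not match accelerate's internals), the behavior described above amounts to checking whether the requested port is free and, on a single node, warning and switching to standalone mode instead of raising:

import socket
import warnings

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # True if something is already listening on (host, port).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

def should_fall_back_to_standalone(main_process_port: int, machine_rank: int, num_machines: int) -> bool:
    # Returns True when the launcher should fall back to standalone mode.
    if num_machines == 1 and machine_rank == 0 and port_in_use(main_process_port):
        warnings.warn(
            f"Port {main_process_port} is in use; falling back to standalone mode. "
            "Pass a free port (or 0) via --main_process_port to avoid this warning."
        )
        return True
    return False

if __name__ == "__main__":
    print(should_fall_back_to_standalone(main_process_port=29500, machine_rank=0, num_machines=1))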

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc or @zach-huggingface — this relates to the Command Line Interface and distributed training/inference functionality on a single node.

add standalone mode and replace ConnectionError with a warning when the main process port is in use, allowing for automatic port selection
@S1ro1 S1ro1 (Member) left a comment

LGTM. Let's wait for Marc's review.

@SunMarc SunMarc (Member) left a comment

Thanks! Just a nit

f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
"Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file)"
" and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`."
args.standalone = True
@SunMarc SunMarc (Member) commented on May 19, 2025

This should only be used in a single-node setup (we have the additional int(args.machine_rank) == 0 check in need_port_check). If there are multiple nodes, let's raise the error as usual.

f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
"Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file)"
" and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`."
args.standalone = True
Member

same
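
A minimal sketch of the single-node-only fallback requested in the comments above, assuming illustrative argument names (num_machines is an assumption; machine_rank and the need_port_check helper are referenced in the review, but the code below is not accelerate's actual implementation):

from argparse import Namespace

def handle_port_conflict(args, main_process_port):
    # Warn and fall back to standalone mode on a single node;
    # keep raising ConnectionError when several nodes are involved.
    message = (
        f"Tried to launch distributed communication on port `{main_process_port}`, "
        "but another process is utilizing it. Please specify a different port "
        "(e.g. via `--main_process_port`) and rerun, or set it to `0` to use the next open port."
    )
    if int(args.num_machines) == 1 and int(args.machine_rank) == 0:
        print("WARNING: " + message + " Falling back to standalone mode for this run.")
        args.standalone = True
    else:
        raise ConnectionError(message)
    return args

args = handle_port_conflict(Namespace(num_machines=1, machine_rank=0, standalone=False), 29500)
print(args.standalone)  # True on a single node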

@SunMarc SunMarc (Member) left a comment

Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc (Member) commented on May 20, 2025

@bot /style

Contributor

Style fixes have been applied. View the workflow run here.

@SunMarc SunMarc merged commit 33967d4 into huggingface:main May 20, 2025
@laitifranz laitifranz deleted the standalone-mode branch May 20, 2025 16:22
S1ro1 added a commit that referenced this pull request Jun 10, 2025
commit 2f8fd72
Author: Simon <80467011+sorgfresser@users.noreply.github.com>
Date:   Tue Jun 10 13:50:34 2025 +0100

    Remove device_count (#3587)

commit d2e6b03
Author: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Date:   Tue Jun 10 05:26:48 2025 -0700

    [FSDP2] Refactor + FP8 (#3585)

    * Fix double wrap

    * Clocking off, ~equal to torch baseline

    * works?

    * Working version

    * Partial rewrite

    * FSDP2 path works

    * Fix back prepare

    * Almost done, proper AC left

    * Feat: should work, cleanup + test more benchmarks left

    * Style+quality

    * Feat: fp8 example

    * Feat: better example

    * Feat: add readme

    * Docs + should be done

    * Fix: typos

    * Fix: protect imports

    * Feat: address comments

    * Feat: add flops image

commit b9fee48
Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Date:   Tue Jun 10 13:24:43 2025 +0100

    better handle FP8 with and without deepspeed (#3611)

    * use the state mixed precision which has undergone all preprocessing

    * Update src/accelerate/accelerator.py

    Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

    * Update src/accelerate/accelerator.py

    * accelerator state sets the mixed precision for deepspeed and fp8_enabled

    * fix

    * fix

    ---------

    Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

commit 3a82b05
Author: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Date:   Tue Jun 10 11:29:59 2025 +0200

    Fix bf16 training with TP  (#3610)

    * fix

    * Apply style fixes

    ---------

    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 6b61a37
Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Date:   Fri Jun 6 13:48:43 2025 +0100

    fix deepspeed regional compilation (#3609)

commit 682691d
Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Date:   Tue Jun 3 12:36:56 2025 +0200

    Update Gaudi Runners (#3593)

    * test

    * fix

    * push

    * in the morning

    * fix backend

    * run first

    * set habana modules

    * dynamo backend

    * trigger

    * remove on pr

    * remove on file change

commit 791055b
Author: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Date:   Tue Jun 3 12:24:20 2025 +0200

    Fix: list object has no attribute keys (#3603)

commit 16bf1d8
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Fri May 30 23:36:34 2025 +0800

    enable torchao and pippy test cases on XPU (#3599)

    * enable torchao and pippy test cases on XPU

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * fix style

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    ---------

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit ab3c604
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Fri May 30 23:23:26 2025 +0800

    enable big_model_inference on xpu (#3595)

    * enable big_model_inference on XPU

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * fix style

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * fix quality

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    ---------

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit 273799c
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 27 20:08:59 2025 +0800

    enable fsdp2 benchmark on XPU (#3590)

    * enable fsdp2 benchmark on XPU

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * add deterministic

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    ---------

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit 43526c5
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 27 17:44:50 2025 +0800

    add device-agnostic GradScaler (#3588)

    * add device-agnostic GradScaler

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * fix bug

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * fix review comments

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * fix

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * format

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * Apply style fixes

    ---------

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 07f2392
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 27 17:17:18 2025 +0800

    change to use torch.device (#3594)

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit ee2f48c
Author: Fanli Lin <fanli.lin@intel.com>
Date:   Tue May 27 17:16:42 2025 +0800

    [docs] no hard-coded cuda in the ddp documentation (#3589)

    * make device-agnostic

    * refactor

commit 4f3abb7
Author: jiqing-feng <jiqing.feng@intel.com>
Date:   Mon May 26 21:55:10 2025 +0800

    Set ccl and KMP param in simple launch (#3575)

    * Even 1 CPU mechine can also run multi process

    Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

    * fix ccl and kml param setting

    Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

    * set master addr only when processes > 1

    Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

    * fix num process check

    Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

    * fix ccl args check

    Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

    ---------

    Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

commit db536cb
Author: Yuanzhou Cai <80858000+yuanjua@users.noreply.github.com>
Date:   Mon May 26 21:08:13 2025 +0800

    Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup (#3581)

    * Fix tracker initialize distributed before InitProcessGroupKwargs

    * Fix tracker initialize distributed before InitProcessGroupKwargs

    * Add test for bug #3550

    * Improve test for #3550

    * Remove redundant code

    Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

    * fix style

    ---------

    Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

commit 4e9d0de
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Mon May 26 21:05:42 2025 +0800

    enable regional_compilation benchmark on xpu (#3592)

    * enable regional_compilation benchmark on xpu

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

    * Apply style fixes

    ---------

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 8cb3ace
Author: Luiz F. G. dos Santos <luiz.fernando0992@gmail.com>
Date:   Thu May 22 10:21:54 2025 -0500

    Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` (#3540)

    * Added artifacts and figure tracking at MLFlow tracker

    * Added `log_artifact` to the MLFlowTracker

    * Remove changes

    * Added kwargs when loading state.

    * added doc string

    * Adjusted correct default types of kwargs

    * Changed the load kwargs to a single one

    * removed None value from kwargs

    * fix kwargs for loading the model

    * removed load_kwargs from optimizer state dict

    * make load_kwargs a dictionary

    * revert last changes

    * reverted load_kwargs

    * fix docstring

    * added dict initiation

    * Fix quality error during PR

commit b6d97cb
Author: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Date:   Thu May 22 17:26:31 2025 +0300

    Resolve logger warnings (#3582)

    Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

commit 33967d4
Author: Francesco Laiti <25352428+laitifranz@users.noreply.github.com>
Date:   Tue May 20 12:29:53 2025 +0200

    Add support for standalone mode when default port is occupied on single node (#3576)

    * add standalone mode and replace ConnectionError with a warning when the main process port is in use, allowing for automatic port selection

    * address review feedback: warn on port conflict only for single-node; raise error for multi-node

    * Apply style fixes

    ---------

    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 5b1fcda
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 20 18:04:24 2025 +0800

    enable test_cli & test_example cases on XPU (#3578)

    * enable test_cli & test_example cases on XPU

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

    * fix style

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

    * fix style

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

    * remove print

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

    * fix ci issue

    Signed-off-by: YAO Matrix <matrix.yao@intel.com>

    ---------

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    Signed-off-by: YAO Matrix <matrix.yao@intel.com>

commit f55f053
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 20 18:02:14 2025 +0800

    goodbye torch_ccl (#3580)

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

commit 1ec99f0
Author: Yao Matrix <yaoweifeng0301@126.com>
Date:   Mon May 19 17:27:40 2025 +0800

    enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU (#3579)

    * enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

    * fix style

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

    * Update test_load_checkpoint_and_dispatch_with_broadcast.py

    ---------

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>
S1ro1 added a commit that referenced this pull request Jun 10, 2025
S1ro1 added a commit that referenced this pull request Jul 9, 2025
S1ro1 added a commit that referenced this pull request Jul 9, 2025