(Part 1) fix: make TP training compatible with new transformers by kmehant · Pull Request #3457 · huggingface/accelerate

(Part 1) fix: make TP training compatible with new transformers #3457


Merged
7 commits merged into huggingface:main on Apr 11, 2025

Conversation

kmehant
Contributor
@kmehant kmehant commented Mar 25, 2025

What does this PR do?

Fixes #3456

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Thanks to @SunMarc for the valuable discussion on #3456

@muellerzr or @SunMarc

Member
@SunMarc SunMarc left a comment


Thanks! Left a comment

@kmehant kmehant requested a review from SunMarc March 25, 2025 14:38
Contributor
@muellerzr muellerzr left a comment


Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kmehant kmehant changed the title fix: make TP training compatible with new transformers (Part 1) fix: make TP training compatible with new transformers Mar 27, 2025
Member
@SunMarc SunMarc left a comment


I'll review that after merging the transformers PR! From a quick look, though, it looks nice.

Member
@S1ro1 S1ro1 left a comment


Have tested locally and ran a few things; seems to work! LGTM

@kmehant kmehant requested a review from SunMarc April 10, 2025 16:20
@kmehant
Contributor Author
kmehant commented Apr 10, 2025

Failing test is unrelated. Thanks

Member
@SunMarc SunMarc left a comment


Thanks! Now that we've merged the PR about tp_size in transformers, maybe we can use that to automatically infer the tp_size so that we create the plugin accordingly.
I'm not sure how well this will integrate with the current code, as we don't have access to the model when creating the accelerator.
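
For illustration only, a minimal sketch of the idea floated here, assuming the model exposes a tp_size attribute (as in newer transformers) and that TorchTensorParallelPlugin accepts a tp_size argument; this is not the code in this PR:

from accelerate.utils import TorchTensorParallelPlugin

def maybe_build_tp_plugin(model):
    # If the model was sharded with tensor parallelism, newer transformers
    # record the degree on the model itself; otherwise create no plugin.
    tp_size = getattr(model, "tp_size", None)
    if tp_size is None or tp_size <= 1:
        return None
    return TorchTensorParallelPlugin(tp_size=tp_size)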

@@ -49,14 +49,15 @@ def setUp(self):
     def test_working_of_tp(self):
         self.test_file_path = self.test_scripts_folder / "test_performance.py"
         cmd = get_launch_command(
-            num_processes=self.test_tp_size, num_machines=1, machine_rank=0, use_tp=True, tp_size=self.test_tp_size
+            num_processes=self.test_tp_size, num_machines=1, machine_rank=0, tp_size=self.test_tp_size
Member


Where is tp_size used?

Comment on lines +397 to +398
if torch_tp_plugin is not None and not isinstance(torch_tp_plugin, TorchTensorParallelPlugin):
    raise TypeError("`torch_tp_plugin` must be a TorchTensorParallelPlugin object.")
Member


Shouldn't we create torch_tp_plugin automatically if the model is sharded with TP? It's fine with me that the user has to specify this for 4.51, but in 4.52 we have tp_size now.

Contributor Author
@kmehant kmehant Apr 10, 2025


@SunMarc
Approach 1: Should we allow passing the model while creating the accelerator, and note in the docstring that it would only be used for TP? This approach would also come in handy for any parallelism that modifies the model, such as context parallel. It would let users skip creating the plugin themselves and simply pass the model when creating the accelerator.

Approach 2: Modify TorchTensorParallelPlugin to take the model as input rather than tp_size. This makes it explicit to users that, to enable TP, they must pass an already TP-sharded model; we then extract tp_size from the model to create the device mesh used by the data loader (a rough sketch of this follows below).

Approach 3: Use tp_size only to validate that it matches what the model has been sharded to, plus to create the device mesh for the data loader (which we already have). However, it would still feel redundant from the user's point of view, since they would need to pass the same tp_size both while sharding in transformers and while using the plugin.

Let me know which one sounds good for this PR. Thanks

cc: @S1ro1
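
For what it's worth, a rough sketch of what Approach 2 could look like; the class and attribute names are illustrative, not an accelerate API, and it assumes the sharded model carries a tp_size attribute:

from torch.distributed.device_mesh import init_device_mesh

class ModelAwareTPPlugin:
    """Illustrative only: derive tp_size from an already TP-sharded model."""

    def __init__(self, model, device_type: str = "cuda"):
        tp_size = getattr(model, "tp_size", None)
        if tp_size is None:
            raise ValueError("Model must be TP-sharded before constructing the plugin.")
        self.tp_size = tp_size
        # 1D mesh that the dataloader sharding logic would later consume
        self.torch_device_mesh = init_device_mesh(device_type, (tp_size,), mesh_dim_names=("tp",))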

Member


I'm up for approach 3, though we shouldn't use it to validate but as a fallback when the model has no tp_size attribute; this way there's no redundancy. Also, you'll notice in #3498 I do the same for the DP/FSDP sizes, as those will be needed as well to recreate the same device mesh that was created for TP (the GPU order differs between the 1D and 2D case), so my proposition is:

an extra dataclass, i.e. ParallelismConfig, which would hold all parallelism sizes and would only be used as a fallback.
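
A rough sketch of what such a dataclass could hold; the field names are illustrative and not the actual #3498 implementation:

from dataclasses import dataclass

@dataclass
class ParallelismConfig:
    # Fallback sizes, consulted only when they cannot be read off the prepared model.
    dp_size: int = 1    # data-parallel replicas
    fsdp_size: int = 1  # FSDP shard groups
    tp_size: int = 1    # tensor-parallel ranks

    @property
    def world_size(self) -> int:
        # total ranks implied by the configured mesh
        return self.dp_size * self.fsdp_size * self.tp_size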

Contributor Author


Thanks @S1ro1

@SunMarc your thoughts?

Member


Hmmm, I think approach 3 will be better. I don't want to force users to create the model before initializing the accelerator or TorchTensorParallelPlugin. Maybe in the future accelerate will also take care of sharding any model for TP.
We can just check, when preparing the model, that if the model is TP-sharded, its tp size matches the one passed in TorchTensorParallelPlugin. If the model is not TP-sharded but the user passed a TorchTensorParallelPlugin, we return an error. WDYT?
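
A hedged sketch of the check described above, as it might run when the model is prepared; the helper name and error messages are illustrative, not the merged code:

def check_tp_consistency(model, torch_tp_plugin):
    # transformers >= 4.52 sets `tp_size` on models sharded with tensor parallelism
    model_tp_size = getattr(model, "tp_size", None)
    if torch_tp_plugin is None:
        return
    if model_tp_size is None:
        raise ValueError("A TorchTensorParallelPlugin was passed but the model is not TP-sharded.")
    if model_tp_size != torch_tp_plugin.tp_size:
        raise ValueError(
            f"Model is sharded with tp_size={model_tp_size}, but the plugin was "
            f"created with tp_size={torch_tp_plugin.tp_size}."
        )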

Contributor Author


Thanks @SunMarc, I have gone with this approach and updated the PR. The performance test for TP passes as well. Can we merge this PR as a self-contained piece for now and revisit the discussion in #3498 for the nd-parallel optimizations? Thanks

[Screenshot attached: 2025-04-11 at 4:03 PM]

cc: @S1ro1

@S1ro1
Member
S1ro1 commented Apr 10, 2025

> Thanks! Now that we've merged the PR about tp_size in transformers, maybe we can use that to automatically infer the tp_size so that we create the plugin accordingly. I'm not sure how well this will integrate with the current code, as we don't have access to the model when creating the accelerator.

I think we can also cover this in #3498, as for example Trainer already creates the TensorParallelPlugin for us if the model was sharded. I'm all for removing it and leaving it to accelerate to create the plugins based on the provided tp_size.

@SunMarc
Member
SunMarc commented Apr 11, 2025

> I think we can also cover this in #3498, as for example Trainer already creates the TensorParallelPlugin for us if the model was sharded. I'm all for removing it and leaving it to accelerate to create the plugins based on the provided tp_size.

I don't mind, if you think it can make things simpler for you. What are the issues currently with TensorParallelPlugin?

kmehant added a commit to kmehant/accelerate that referenced this pull request Apr 11, 2025
see huggingface#3457 (comment) for more details

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
@S1ro1
Member
S1ro1 commented Apr 11, 2025

> I don't mind, if you think it can make things simpler for you. What are the issues currently with TensorParallelPlugin?

The biggest issue is TensorParallelPlugin creating its own device mesh; we would like to create the device mesh based on all the parallelisms provided, not just one (TensorParallelPlugin only has access to tp_size).
Two options that solve this come to mind:

  1. Remove device_mesh from TensorParallelPlugin completely and make it a first-class citizen of Accelerator / any of its states.
  2. Add another method that resets the device mesh on the plugin.

I'm heavily leaning towards number 1; #3498 already does so (it also exposes the method on the plugin, but that's to be removed). This will allow us to receive any number of ?p_size arguments on Accelerator and construct the device mesh so that it is available to all the plugins (see the sketch after the next list).

With this comes a question of how to handle errors:

  1. We use these arguments only as a fact-check against what is saved on the model; they would be required.
  2. We use these arguments as a fallback: if we have no info on how the model was sharded, we use these.
  3. We infer what we can (i.e. in 2D sharding only one size needs to be specified) and throw an error if we can't. I haven't really thought this approach through, but I'm not a fan of it as it's pretty messy.
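
To make the central-mesh idea concrete, a small sketch under assumed dimension names (this is not the #3498 code):

from torch.distributed.device_mesh import init_device_mesh

def build_device_mesh(dp_size: int, tp_size: int, device_type: str = "cuda"):
    # One shared 2D mesh; each plugin reads the sub-mesh for its own dimension.
    return init_device_mesh(device_type, (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

# Example on 8 GPUs: 4-way DP composed with 2-way TP
# mesh = build_device_mesh(dp_size=4, tp_size=2)
# tp_mesh = mesh["tp"]   # handed to the TP machinery
# dp_mesh = mesh["dp"]   # handed to DDP/FSDP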

@SunMarc
Member
SunMarc commented Apr 11, 2025

Sounds good for 1. I'm all for removing TensorParallelPlugin and using tp_size instead.
Note that right now, for deepspeed, we also create a separate device_mesh. Check the _prepare_device_mesh method.

> We use these arguments only as a fact-check against what is saved on the model; they would be required.

I think it makes more sense to require this rather than use it only as a fallback. Think of a situation where the user prepares the dataloader before the model.

Comment on lines +1593 to +1595
if not compare_versions("transformers", ">=", BETA_TP_AVAILABLE_TRANSFORMERS_VERSION):
    raise ValueError(f"TP requires transformers >= {BETA_TP_AVAILABLE_TRANSFORMERS_VERSION}")
if not hasattr(model, "tp_size"):
Member


This was added only recently, so we have to update BETA_TP_AVAILABLE_TRANSFORMERS_VERSION to 4.52.0 or the dev version.
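
For reference, a hedged sketch of the bumped gate; the constant value follows this thread, while the import path and placement are assumptions:

from accelerate.utils import compare_versions

BETA_TP_AVAILABLE_TRANSFORMERS_VERSION = "4.52.0"  # first release exposing model.tp_size

def require_tp_capable_transformers():
    if not compare_versions("transformers", ">=", BETA_TP_AVAILABLE_TRANSFORMERS_VERSION):
        raise ValueError(f"TP requires transformers >= {BETA_TP_AVAILABLE_TRANSFORMERS_VERSION}")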

Contributor Author


Apologies for missing that; I've updated it to 4.52.0. Thanks.

Member
@SunMarc SunMarc left a comment


Thanks, minor nit on the version check, but other than that LGTM! You can merge it @S1ro1 if you are fine with that.

kmehant added 7 commits April 11, 2025 20:32 (all signed off by Mehant Kammakomati <mehant.kammakomati2@ibm.com>)
@S1ro1
Member
S1ro1 commented Apr 11, 2025

> Sounds good for 1. I'm all for removing TensorParallelPlugin and using tp_size instead. Note that right now, for deepspeed, we also create a separate device_mesh. Check the _prepare_device_mesh method.

As for the device mesh, that is fine with DeepSpeedPlugin, as it encapsulates all levels of parallelism (dp, fsdp, tp). In our case, though, we want TP to be composable with DDP, FSDP, or both, where we don't have a clear "leader" responsible for the device mesh. Therefore the best option is to create it centrally and make it accessible from all FSDP/DDP plugins, and possibly others such as PP if we decide to support that. Later we can move to a central device mesh for DeepSpeed as well, but that's potentially breaking and a bigger refactor.

However, this is to be discussed later; merging now, and let's move the discussion to #3498.

@S1ro1 S1ro1 merged commit 67adb47 into huggingface:main Apr 11, 2025
25 checks passed
Development

Successfully merging this pull request may close these issues.

bug: broken TP training since tensor_parallel public API is removed