(Part 1) fix: make TP training compatible with new transformers #3457
Conversation
Thanks! Left a comment.
Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I'll review that after merging the transformers PR! But from a quick look, it looks nice.
src/accelerate/test_utils/scripts/external_deps/test_performance.py
Have tested locally and ran some things; seems to work! LGTM
Failing test is unrelated. Thanks
Thanks! Now that we've merged the PR about `tp_size` in transformers, maybe we can use that to infer the `tp_size` automatically, so that we create the plugin accordingly.
I'm not sure how well this will integrate with the current code, since we don't have access to the model when creating the accelerator.
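For illustration, a minimal sketch of that idea, assuming the model exposes a `tp_size` attribute once it has been TP-sharded by transformers and that `TorchTensorParallelPlugin` accepts a `tp_size` argument (helper name and placement are assumptions, not the final accelerate API):

```python
from accelerate.utils import TorchTensorParallelPlugin


def infer_tp_plugin(model):
    """Hypothetical helper: build a TP plugin from the model's own tp_size, if any."""
    tp_size = getattr(model, "tp_size", None)  # assumed to be set when the model is TP-sharded
    if tp_size is not None and tp_size > 1:
        return TorchTensorParallelPlugin(tp_size=tp_size)
    return None
```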
tests/tp/test_tp.py

```diff
@@ -49,14 +49,15 @@ def setUp(self):
     def test_working_of_tp(self):
         self.test_file_path = self.test_scripts_folder / "test_performance.py"
         cmd = get_launch_command(
-            num_processes=self.test_tp_size, num_machines=1, machine_rank=0, use_tp=True, tp_size=self.test_tp_size
+            num_processes=self.test_tp_size, num_machines=1, machine_rank=0, tp_size=self.test_tp_size
```
Where is `tp_size` used?
```python
if torch_tp_plugin is not None and not isinstance(torch_tp_plugin, TorchTensorParallelPlugin):
    raise TypeError("`torch_tp_plugin` must be a TorchTensorParallelPlugin object.")
```
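For context, a hedged usage sketch of how the plugin is expected to be passed to the accelerator, assuming the `torch_tp_plugin` keyword checked in the snippet above and that `TorchTensorParallelPlugin` is importable from `accelerate.utils`; it would run under a distributed launch (e.g. `torchrun`):

```python
from accelerate import Accelerator
from accelerate.utils import TorchTensorParallelPlugin

# tp_size is assumed to match the number of ranks in the TP group.
tp_plugin = TorchTensorParallelPlugin(tp_size=4)
accelerator = Accelerator(torch_tp_plugin=tp_plugin)
```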
Shouldn't we create `torch_tp_plugin` automatically if the model is sharded with TP? I'm fine with the user having to specify this explicitly for 4.51 - 4.52, but in 4.52 we have `tp_size` now.
@SunMarc
Approach 1: Should we allow passing the model while creating the accelerator, and document that it would only be used for TP? This approach would also come in handy for any parallelism that modifies the model, like context parallel. It would let users skip creating the plugin and simply pass the model when creating the accelerator.
Approach 2: Modify `TorchTensorParallelPlugin` to take the model as input rather than `tp_size`. This way we enforce that, to enable TP, users must pass an already TP-sharded model; we then extract `tp_size` from the model to create the device mesh used by the data loader.
Approach 3: Use `tp_size` only to validate that it matches what the model has been sharded to, plus to create the device mesh for the data loader (which we already have). However, it would still feel redundant from the user's PoV, since they would need to pass the same `tp_size` when sharding in transformers and when using the plugin. (A sketch of this option follows below.)
Let me know which one sounds good for this PR. Thanks
cc: @S1ro1
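As a rough illustration of Approach 3, a sketch of how `tp_size` could be used to build the device mesh consumed by the data loader. Names are assumptions, and it relies on `torch.distributed.device_mesh.init_device_mesh` under a distributed launch such as `torchrun`:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh


def build_tp_mesh(tp_size: int):
    """Hypothetical helper: 1-D TP mesh; real code would also account for DP/FSDP dims."""
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    return init_device_mesh(device_type, (tp_size,), mesh_dim_names=("tp",))
```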
I'm up for approach 3, though we shouldn't use it to validate but rather as a fallback if the model has no `tp_size` attribute; this way there's no redundancy. You'll also notice that in #3498 I do the same for the DP/FSDP sizes, as those will be needed as well to recreate the same device mesh that was created for TP (the GPU order is different in the 1D/2D case). So my proposition is an extra dataclass, i.e. `ParallelismConfig`, which would hold all parallelism sizes but would only be used as a fallback.
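A hedged sketch of what such a dataclass could look like (field names are assumptions, not the final API in #3498):

```python
from dataclasses import dataclass


@dataclass
class ParallelismConfig:
    # Fallback sizes, used only when the model does not expose them itself.
    tp_size: int = 1
    dp_size: int = 1
    fsdp_size: int = 1
```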
Hmmm, I think approach 3 will be better. I don't want to force users to create the model before initializing the accelerator or `TorchTensorParallelPlugin`. Maybe in the future accelerate will also take care of sharding any model for TP.
When preparing the model, we can check: if the model is TP-sharded, we compare its TP size against the one passed in `TorchTensorParallelPlugin`; if the model is not TP-sharded but the user passed a `TorchTensorParallelPlugin`, we return an error. WDYT?
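A minimal sketch of the check described above, run when the model is prepared (the helper name and error messages are assumptions, not the actual accelerate code):

```python
def validate_tp_plugin(model, torch_tp_plugin):
    """Hypothetical check: the model's TP sharding must be consistent with the plugin."""
    model_tp_size = getattr(model, "tp_size", None)
    if torch_tp_plugin is None:
        return
    if model_tp_size is None:
        raise ValueError("A TorchTensorParallelPlugin was passed, but the model is not TP-sharded.")
    if model_tp_size != torch_tp_plugin.tp_size:
        raise ValueError(
            f"Model was sharded with tp_size={model_tp_size}, but the plugin was created with "
            f"tp_size={torch_tp_plugin.tp_size}."
        )
```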
I think we can also cover this in #3498, as for example Trainer does create the …
I don't mind, if you think it can make things simpler for you. What are the issues currently with …
The biggest issue is …
I'm heavily leaning towards number 1; #3498 does so already (it also exposes the method on the plugin, but that's to be removed). This will allow us to receive any number of … With this comes a question of how to handle errors:
Sounds good for 1. I'm all for removing …
I think it makes more sense to require this rather than use it only as a fallback. Think of a situation where the user prepares the dataloader before the model.
```python
if not compare_versions("transformers", ">=", BETA_TP_AVAILABLE_TRANSFORMERS_VERSION):
    raise ValueError(f"TP requires transformers >= {BETA_TP_AVAILABLE_TRANSFORMERS_VERSION}")
if not hasattr(model, "tp_size"):
```
This was added only recently, so we have to update `BETA_TP_AVAILABLE_TRANSFORMERS_VERSION` to `4.52.0` or the dev version.
Apologies for missing that; I've updated it to `4.52.0`. Thanks.
Thanks, minor nit on the version check, but other than that LGTM! You can merge it @S1ro1 if you are fine with that.
As for the device mesh, that is fine with … However, this is to be discussed later; merging, and let's move the discussion to #3498.
What does this PR do?
Fixes #3456 (bug: broken TP training since the `tensor_parallel` public API is removed).
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Thanks to @SunMarc for the valuable discussion on #3456.
@muellerzr or @SunMarc