Add support for Quantization-Aware Low-Rank Adaptation (QALoRA) #2571
Conversation
…raLayer and GPTQLoraLinear
Thanks a lot for picking up this old request. Really well done to use the newly added LoRA variant abstraction to implement this.
I checked the PR but haven't done an in-depth review yet. The reason for that is that LoRA variant support has only been added to vanilla LoRA layers (i.e. the layers defined in `lora/layers.py`). The quantized layers, including GPTQ, don't have any code that would take LoRA variants into account. Therefore, as is, the GPTQ layer would still use the normal `forward` call and not `QALoraLinearVariant.forward`. Even worse, the GPTQ layer does not support merging and unmerging, so all of that code in `QALoraLinearVariant` is dead code. So unless I'm missing something, there is still some work required:
1. Update `GPTQLoraLinear.forward` to account for LoRA variants (should be easy).
2. Implement merging and unmerging for `GPTQLoraLinear` (could be difficult, it depends), or scrap it for now.
To avoid 2., QA LoRA could be implemented for another quantization method that already supports merging and unmerging, like bitsandbytes, but even there, LoRA variant support has yet to be added. Also, I'm not sure how specific your code is to GPTQ.
Anyway, it's a really nice PR and I'd be happy to see it merged. LMK what you think.
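For illustration, here is a minimal sketch of what point 1 above (making the quantized layer's forward variant-aware) could look like. The attribute names (`lora_variant`, `quant_linear_module`) and the variant's `forward(module, adapter, x, result)` signature are assumptions for this sketch, not necessarily PEFT's actual internals.

```python
import torch

# Hedged sketch only: assumes the layer stores variants in a dict `self.lora_variant`
# keyed by adapter name, and that a variant exposes forward(module, adapter, x, result).
def forward(self, x: torch.Tensor) -> torch.Tensor:
    result = self.quant_linear_module(x)  # output of the quantized base layer
    if self.disable_adapters:
        return result
    for active_adapter in self.active_adapters:
        if active_adapter not in self.lora_A.keys():
            continue
        if active_adapter in getattr(self, "lora_variant", {}):
            # delegate to the variant, e.g. QALoraLinearVariant
            result = self.lora_variant[active_adapter].forward(self, active_adapter, x, result)
        else:
            lora_A = self.lora_A[active_adapter]
            lora_B = self.lora_B[active_adapter]
            dropout = self.lora_dropout[active_adapter]
            scaling = self.scaling[active_adapter]
            result = result + lora_B(lora_A(dropout(x))) * scaling
    return result
```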
Hi @BenjaminBossan, thank you for your review and the helpful feedback! I've addressed the main points you raised.
Ready for another look when you have a moment!
Thanks a lot for the updates, I did another review, where I focused on the PEFT integration itself (not checking the example or the details of the QALoRA paper). There are still a few issues but they're not big, please take a look.
Once these issues are resolved, I'll review the example. We should also think about adding a test before merging.
src/peft/tuners/lora/config.py
Outdated
default=False,
metadata={
    "help": (
        "Enable <a href='https://huggingface.co/papers/2309.14717'>Quantization-Aware Low-Rank Adaptation (QALoRA)</a>. This technique combines quantization-aware training "
Let's mention that it is only implemented for GPTQ for now. Also, please update the docstring (can use the same text).
src/peft/tuners/lora/config.py
Outdated
        "Enable <a href='https://huggingface.co/papers/2309.14717'>Quantization-Aware Low-Rank Adaptation (QALoRA)</a>. This technique combines quantization-aware training "
        "with LoRA to improve performance for quantized models. This can improve the performance of LoRA, "
        "especially at low ranks. Right now, QALoRA only supports linear layers. QALoRA introduces a bigger "
        "overhead than pure LoRA, so it is recommended to merge weights for inference."
This recommendation is a bit moot as merging is not supported for GPTQ. Let's remove this sentence.
src/peft/tuners/lora/gptq.py
Outdated
if use_qalora:
    from .variants import QALoraLinearVariant

    return QALoraLinearVariant()
if not use_dora:
    return None

from .variants import DoraLinearVariant

return DoraLinearVariant()
Let's change the check a bit for completeness to basically:
if use_dora and use_qalora:
    NotImplementedError
elif use_dora:
    variant = ...
elif use_qalora:
    variant = ...
else:
    variant = None
return variant
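Spelled out as a runnable method (a sketch only, assuming it lives in `src/peft/tuners/lora/gptq.py` next to the existing relative imports, and with a proper `raise` added), the suggestion could look like this:

```python
from typing import Optional

def resolve_lora_variant(self, *, use_dora: bool, use_qalora: bool, **kwargs) -> Optional["LoraVariant"]:
    # covers all four combinations of use_dora/use_qalora explicitly
    if use_dora and use_qalora:
        raise NotImplementedError("Using DoRA and QALoRA on the same layer is not supported.")
    elif use_dora:
        from .variants import DoraLinearVariant
        variant = DoraLinearVariant()
    elif use_qalora:
        from .variants import QALoraLinearVariant
        variant = QALoraLinearVariant()
    else:
        variant = None
    return variant
```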
src/peft/tuners/lora/gptq.py
Outdated
@@ -64,29 +80,33 @@ def forward(self, x: torch.Tensor):
return result

lora_A_keys = self.lora_A.keys()
torch_result_dtype = result.dtype
This is not needed, right?
src/peft/tuners/lora/gptq.py
Outdated
if requires_conversion:
    output = output.to(expected_dtype)
# requires_conversion = not torch.is_autocast_enabled()
remove the comment?
src/peft/tuners/lora/variants.py
Outdated
# Create and store pooling factor for scaling
if not hasattr(module, "qalora_scaling_factor"):
    module.qalora_scaling_factor = {}
Same comment as above regarding `other_param_names`. But do we even need `qalora_scaling_factor`, as it can be calculated on the fly based on `module.in_features` and `qalora_group_size` anyway?
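To make the on-the-fly alternative concrete, here is a self-contained sketch of the pooling plus scaling step; the helper name and the exact scaling convention are assumptions based on the discussion above, not the merged code.

```python
import torch
import torch.nn.functional as F

def qalora_pool(x: torch.Tensor, in_features: int, group_size: int) -> torch.Tensor:
    # Average-pool groups of `group_size` input features (works for 2-D and 3-D inputs
    # whose last dimension is `in_features`), then rescale by in_features / group_size,
    # i.e. the factor that would otherwise be cached as `qalora_scaling_factor`.
    scaling_factor = in_features / group_size
    pooled = F.avg_pool1d(x, kernel_size=group_size)  # last dim becomes in_features // group_size
    return pooled * scaling_factor
```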
src/peft/tuners/lora/variants.py
Outdated
    module.qalora_scaling_factor[adapter_name] = module.in_features / qalora_group_size
else:
    # No special scaling if dimensions don't align
    module.qalora_scaling_factor[adapter_name] = 1.0
Would this not lead to very different results than `qalora_scaling_factor = module.in_features / qalora_group_size`?
I wonder if perhaps it makes more sense to raise an error here and require that `module.in_features % qalora_group_size == 0`? I think this would simplify the code; otherwise, when `module.in_features % qalora_group_size != 0`, we would just get standard LoRA?
src/peft/tuners/lora/variants.py
Outdated
    torch.Tensor: The calculated delta weight.
"""
if (
    not hasattr(module, "qalora_group_size")
Is it even possible to hit this condition? Same question about `not hasattr(module, "qalora_scaling_factor")`.
You are right. We will never reach this code.
…gement in variants
@BenjaminBossan I have integrated all your feedback into the code, except for the tests.
Thanks a lot for all the updates. I still found a couple of issues, please check my comments. I also checked and ran the script, where I also found some areas for improvement.
As for testing, I think it would be enough to have something very similar to this one with QALoRA enabled.
Also, before pushing, don't forget to run `make style` to make the linter happy.
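As a rough idea of the kind of test meant here (a sketch only: the test class, helper methods, and dataset setup are hypothetical placeholders, not the actual fixtures in `tests/test_gpu_examples.py`):

```python
import pytest
from peft import LoraConfig, get_peft_model

@pytest.mark.single_gpu_tests
def test_causal_lm_training_gptq_qalora(self):
    # hypothetical helpers standing in for the existing GPTQ test fixtures
    model = self._get_gptq_model()
    config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
        use_qalora=True,
        qalora_group_size=32,
    )
    model = get_peft_model(model, config)
    trainer = self._get_trainer(model)  # short training run on a tiny dataset
    trainer.train()
    assert trainer.state.log_history[-1]["train_loss"] is not None
```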
examples/qalora_finetuning/README.md
Outdated
--qalora_group_size 8
```

QALoRA also works with different quantization methods (GPTQ, EETQ, AWQ, etc.):
Currently, that's not true, is it?
Question still open
examples/qalora_finetuning/README.md
Outdated
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## QALoRA vs. LoRA vs. DoRA
I don't think there is a specific reason to compare QALoRA to DoRA, is there? I think the more interesting question for users would be: What's the difference between QALoRA and QLoRA? When should I use which, and what trade-offs are there?
examples/qalora_finetuning/README.md
Outdated
2. The QALoRA adapter weights are then merged with the dequantized model
3. The merged model must be re-quantized if quantization is still desired

This implementation choice was made because **it yielded better performance in practice**, despite being less memory-efficient than the direct modification approach described in the paper.
Out of curiosity: Is this your personal experience or is there a reference for this?
Question still open
elif optimizer_name.lower() == "sgd":
    opt_size = param_size  # SGD with momentum keeps 1 extra state

print(
Again, I think printing this for each layer is a bit of information overkill.
examples/qalora_finetuning/README.md
Outdated
Run the finetuning script with a GPTQ quantized model:
```bash
python examples/qalora_finetuning/qalora_gptq_finetuning.py \
When I run this locally, I get a train loss of 0.0, can you replicate?
Question still open
task_type="CAUSAL_LM", | ||
use_dora=use_dora, | ||
use_qalora=use_qalora, | ||
qalora_group_size=8, # Explicitly set group size for QALoRA |
Would it make sense to expose this argument to the CLI?
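A minimal way to do that, assuming the example script uses argparse (the flag name and default below are illustrative, not the final interface):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--qalora_group_size",
    type=int,
    default=32,
    help="Pooling group size for QALoRA; in_features of the adapted layers must be divisible by it.",
)
args = parser.parse_args()

# later, when building the config:
# lora_config = LoraConfig(..., use_qalora=True, qalora_group_size=args.qalora_group_size)
```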
src/peft/tuners/lora/gptq.py
Outdated
)

def resolve_lora_variant(self, *, use_dora: bool, use_qalora: bool, **kwargs) -> Optional[LoraVariant]:
    if use_dora and use_qalora:
        NotImplementedError
raise NotImplementedError(...)
src/peft/tuners/lora/gptq.py
Outdated
    from .variants import DoraLinearVariant
    variant = DoraLinearVariant()
elif use_qalora:
    if self.in_features % kwargs["qalora_group_size"] == 0:
Let's not perform the check here. Instead, let's move it inside `QALoraLinearVariant.init`.
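For instance, a sketch of moving the check into the variant's init hook (the exact `init` signature of `LoraVariant` is an assumption here, and the error messages mirror the wording suggested later in this review):

```python
class QALoraLinearVariant(LoraVariant):  # LoraVariant assumed to be defined in variants.py
    @staticmethod
    def init(module, adapter_name: str, **kwargs) -> None:
        # validate the QALoRA-specific kwargs once, at adapter creation time
        if "qalora_group_size" not in kwargs:
            raise ValueError("`use_qalora=True` requires 'qalora_group_size' to be provided in kwargs.")
        group_size = kwargs["qalora_group_size"]
        if module.in_features % group_size != 0:
            raise ValueError(
                f"`use_qalora=True` requires `module.in_features` ({module.in_features}) to be "
                f"divisible by 'qalora_group_size' ({group_size})"
            )
        module.qalora_group_size = group_size
```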
Still open
This check can now be removed, right?
…and refactor training tests
- Added detailed error message for unsupported simultaneous use of Dora and QA_LoRA in GPTQLoraLinear.
- Refactored QALoraLinearVariant to streamline pooling and scaling operations, improving clarity and performance.
- Consolidated multiple training test cases for VeRA and RandLora into a more organized structure, ensuring consistency across single and multi-GPU tests.
- Updated training configurations to include new parameters for QALoRA and improved handling of token embeddings.
- Ensured that model checkpoints are correctly validated and that training loss assertions are in place for all tests.
More tests are still needed for comprehensive coverage, and some of the existing comments are a work in progress and will be refined.
Okay, I'll wait until these are finished before reviewing again ;)
…ction, enhancing gradient checks in tests, and improving adapter handling in model training.
@BenjaminBossan you can take another look. I have reworked the initialization again. lora_A is now replaced right at init with a new instance that has the pooled input dimension. Previously, lora_A still had the old shape and the pooling was only applied in the forward pass, as it was originally done in the paper. With the new variant, lora_A already takes care of the pooling at init, so the dimensions, and with them the number of trainable parameters, actually become smaller for QALoRA than for QLoRA. Previously the "trainable params" were identical for QLoRA and QALoRA; now QALoRA is leaner. One interesting observation: the memory consumption is nevertheless very similar to QLoRA, even though there are fewer trainable parameters. That seems to be caused by the additional pooling tensors in the forward pass. Could you confirm whether that is correct and to be expected?
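A rough sketch of the init-time change described above (illustrative names only; assumes `lora_A[adapter_name]` is a plain `nn.Linear`, which may differ from the actual implementation):

```python
import math
import torch.nn as nn

def shrink_lora_A(module, adapter_name: str, group_size: int) -> None:
    # Replace lora_A with a new instance that already expects the pooled input
    # dimension, so the pooling is baked in and fewer parameters are trained.
    old_A = module.lora_A[adapter_name]
    r = old_A.out_features                      # LoRA rank
    pooled_in = module.in_features // group_size
    new_A = nn.Linear(pooled_in, r, bias=False)
    nn.init.kaiming_uniform_(new_A.weight, a=math.sqrt(5))  # standard LoRA-A style init
    module.lora_A[adapter_name] = new_A.to(old_A.weight.device, dtype=old_A.weight.dtype)
```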
Let's switch back to English.
Thank you for the updates, they look good overall. I noticed that some of my comments still seem to be unaddressed, could you please double-check? Also, something seems to have gone wrong when adding the GPU tests; let's mostly revert those changes and only add the new test.
I have reworked the initialization again. lora_A is now replaced right at init with a new instance that has the pooled input dimension. Previously, lora_A still had the old shape and the pooling was only applied in the forward pass, as it was originally done in the paper.
With the new variant, lora_A already takes care of the pooling at init, so the dimensions, and with them the number of trainable parameters, actually become smaller for QALoRA than for QLoRA. Previously the "trainable params" were identical for QLoRA and QALoRA; now QALoRA is leaner.
Nice optimization.
One interesting observation: the memory consumption is nevertheless very similar to QLoRA, even though there are fewer trainable parameters. That seems to be caused by the additional pooling tensors in the forward pass. Could you confirm whether that is correct and to be expected?
Hmm, this is very hard to say without further details. How did you test this and how large was the difference? In general, for big base models, the majority of memory will be consumed by the base weights and the amount of memory used by LoRA is relatively small. Quantization helps of course but the general pattern still holds. Therefore, if LoRA is made a little bit more memory efficient, the total effect may still be negligible. And as you mentioned, depending on the dataset (mainly the sequence length), activations/hidden states can take up a large portion of memory and those are also mostly unaffected by the LoRA parameter count.
examples/qalora_finetuning/README.md
Outdated
Run the finetuning script with a GPTQ quantized model:
```bash
python examples/qalora_finetuning/qalora_gptq_finetuning.py \
Question still open
examples/qalora_finetuning/README.md
Outdated
--qalora_group_size 8
```

QALoRA also works with different quantization methods (GPTQ, EETQ, AWQ, etc.):
Question still open
examples/qalora_finetuning/README.md
Outdated
2. The QALoRA adapter weights are then merged with the dequantized model
3. The merged model must be re-quantized if quantization is still desired

This implementation choice was made because **it yielded better performance in practice**, despite being less memory-efficient than the direct modification approach described in the paper.
Question still open
src/peft/tuners/lora/config.py
Outdated
default=False,
metadata={
    "help": (
        "It is only implemented in GPTQ for now. Enable <a href='https://huggingface.co/papers/2309.14717'>Quantization-Aware Low-Rank Adaptation (QALoRA)</a>. This technique combines quantization-aware training "
Let's ensure that each line is 120 chars max.
src/peft/tuners/lora/gptq.py
Outdated
    from .variants import DoraLinearVariant
    variant = DoraLinearVariant()
elif use_qalora:
    if self.in_features % kwargs["qalora_group_size"] == 0:
Still open
tests/test_gpu_examples.py
Outdated
# assert loss is not None
assert trainer.state.log_history[-1]["train_loss"] is not None
# Add to the PeftGPTQGPUTests class:
Remove
tests/test_gpu_examples.py
Outdated
@@ -1168,201 +1168,32 @@ def test_initialize_dora_with_bnb_on_cpu(self, kbit):
weights_not_cpu = [name for name, p in peft_model.named_parameters() if p.device != torch.device("cpu")]
assert not weights_not_cpu

@pytest.mark.single_gpu_tests
def test_causal_lm_training_vera(self):
Hmm, it looks like a bunch of unrelated tests were moved around. I don't think that should be necessary and it also makes reviewing a lot harder. Could you please ensure that the existing tests remain in place and only new tests for GPTQ-QALoRA are added?
…ion and structure for better readability and maintainability.
@gapsong Please ping me once the PR is ready for the next review.
…y breaking long lines, update dataset loading to use provided data path, and remove unused arguments.
…cessary parameters for clarity and conciseness.
…ter, update dataset loading, and remove unused parameters for improved clarity and functionality.
@BenjaminBossan I have adjusted the code except for the …
Thanks for the updates. I tried the example script again for 5 steps, using
For, mostly for these reasons:
…2 in README and add validation for divisibility by group size in QALoraLinearVariant.
@BenjaminBossan I’m done and I found the “bug”: I also adjusted the resolve_lora_variant function: I hope it works out this time :)
Thanks for the updates, not much is missing for the PR.
The qalora_group_size was set to 8, which was too small. I’ve changed the default value to 32.
Good find. Do you know why that resulted in a loss of 0? I wonder if there is some kind of check (say, some hidden states not containing nan) that could be performed to prevent this in the future. Depending on how train logging is set up, users may otherwise waste hours of training time.
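One possible safeguard along these lines, sketched as a `transformers` Trainer callback for the example script (a hypothetical helper, not part of PEFT or this PR):

```python
import math
from transformers import TrainerCallback

class LossSanityCallback(TrainerCallback):
    # Aborts early if the logged loss is exactly 0.0 or NaN, so users don't
    # waste hours of training on a silently broken configuration.
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or math.isnan(loss)):
            raise RuntimeError(
                f"Suspicious training loss ({loss}) at step {state.global_step}; "
                "check the quantization/QALoRA settings (e.g. qalora_group_size)."
            )
```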
--learning_rate 3e-4 \
--cutoff_len 512 \
--use_qalora \
--qalora_group_size 32 \
Thanks for updating to a better default. Do you think that `qalora_group_size=16` is still a good choice for the default value in `LoraConfig`?
src/peft/tuners/lora/gptq.py
Outdated
    from .variants import DoraLinearVariant
    variant = DoraLinearVariant()
elif use_qalora:
    if self.in_features % kwargs["qalora_group_size"] == 0:
This check can now be removed, right?
src/peft/tuners/lora/variants.py
Outdated
"""
if "qalora_group_size" not in kwargs:
    raise ValueError(
        "QALoraLinearVariant.init expects 'qalora_group_size' to be provided in kwargs."
Suggested change:
- "QALoraLinearVariant.init expects 'qalora_group_size' to be provided in kwargs."
+ "`use_qalora=True` requires 'qalora_group_size' to be provided in kwargs."
src/peft/tuners/lora/variants.py
Outdated
if module.in_features is not None and module.in_features % kwargs["qalora_group_size"] != 0:
    raise ValueError(
        f"QALoraLinearVariant.init expects module.in_features ({module.in_features}) to be divisible by 'qalora_group_size' ({kwargs['qalora_group_size']})"
f"QALoraLinearVariant.init expects module.in_features ({module.in_features}) to be divisible by 'qalora_group_size' ({kwargs['qalora_group_size']})" | |
f"`use_qalora=True` requires `module.in_features` ({module.in_features}) to be divisible by 'qalora_group_size' ({kwargs['qalora_group_size']})" |
…cy; streamline argument handling in training script.
@BenjaminBossan I included your suggestions!
@BenjaminBossan something seems off with the GPU memory when using QALoRA. I am investigating the problem at the moment.
…ess, improve pooling logic, and enhance LoRA computation for clarity and efficiency. Memory peaks were really high
@BenjaminBossan The GPU problem was resolved. I removed …
Thanks for the latest update. I just have a small comment, otherwise the PR LGTM.
I also ran the example locally with and without QALoRA and the difference in memory usage was negligible, which I think is what we hoped to see.
tests/test_custom_models.py
Outdated
@@ -2820,6 +2820,51 @@ def test_requires_grad_lora_different_targets(self):
"base_model.model.lin1.lora_B.adapter1.weight",
)

def test_requires_grad_qalora_same_targets(self):
Hmm, for some reason I hadn't noticed this test before. IMO it is not necessary and can be removed. This is because QALoRA does not change the handling of multiple adapters, so there is not really a reason to believe there could be anything wrong there. We also don't check other quantization methods here. Unless you had a specific reason to add this test, I'd suggest just removing it.
…proved clarity and maintainability.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I did a final pass and discovered two small things that need fixing, otherwise LGTM. Thanks for your patience.
src/peft/tuners/lora/variants.py
Outdated
if module.in_features is not None and module.in_features % kwargs["qalora_group_size"] != 0:
    raise ValueError(
        f"`use_qalora=True` requires `module.in_features` ({module.in_features}) to be divisible by 'qalora_group_size' ({kwargs['qalora_group_size']})"
Could you please add line breaks to this long line?
src/peft/tuners/lora/variants.py
Outdated
@@ -130,6 +130,101 @@ def forward(module: Linear, active_adapter: str, x: torch.Tensor, result: torch.
return result


class QALoraLinearVariant(LoraVariant):
Please move this class to the bottom of the file, as it currently sits between different DoRA classes.
@BenjaminBossan what do I have to do here to resolve the errors, or is it something out of my scope?
You mean the macOS errors? Yeah, they're unrelated and should be fixed with the next transformers release.
@BenjaminBossan shall we merge it then?
Did you see my last comment?
I had missed it. It should be ready now @BenjaminBossan
Thanks for contributing QALoRA support with GPTQ to PEFT. The PR LGTM, nicely done using LoRA variants for this. Failing tests are unrelated.
The function signature was missing **kwargs, which results in a failure after merging #2571.
This pull request introduces QALoRA (Quantization-Aware Low-Rank Adaptation), a new fine-tuning technique for quantized large language models, along with its implementation in the PEFT library. The changes include updates to documentation, configuration, and core logic to support QALoRA's memory-efficient and performance-preserving features.

**Documentation Updates**
- `examples/qalora_finetuning/README.md`: Added detailed documentation for QALoRA, including its introduction, implementation details, usage examples, command-line instructions, and comparison with other techniques like LoRA and DoRA.

**Configuration Enhancements**
- `src/peft/tuners/lora/config.py`: Introduced two new configuration parameters: `use_qalora` to enable QALoRA and `qalora_group_size` to control the pooling group size for memory-performance tradeoffs.

**Core Logic for QALoRA**
- `src/peft/tuners/lora/gptq.py`: Updated the GPTQ LoRA implementation to support QALoRA, including logic for resolving QALoRA variants and passing group size parameters.
- `src/peft/tuners/lora/layer.py`: Enhanced the layer update logic to initialize QALoRA-specific parameters and handle adapter-specific configurations.
- `src/peft/tuners/lora/model.py`: Incorporated QALoRA-specific parameters into the model creation and replacement process.

**QALoRA Variant Implementation**
- `src/peft/tuners/lora/variants.py`: Added the `QALoraLinearVariant` class, implementing QALoRA-specific logic for initialization, delta weight computation, merging, unmerging, and forward propagation. This includes pooling input features and scaling them for efficient adaptation.