Allow arbitrary training args to be overridden by derekhiggins · Pull Request #1008 · instructlab/instructlab · GitHub

Closed · wants to merge 1 commit

Conversation

@derekhiggins (Contributor) commented Apr 25, 2024

Adding as a hidden argument to allow experimentation on various devices. Eventually, once we know what's needed, we can add something more permanent.

Fixes #1007

With this PR and #1012, running the ilab e2e (including training) works on Colab:

!ilab train --device cuda --override-training-args '{"bf16":false, "gradient_checkpointing":true, "gradient_accumulation_steps":8}'
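For readers unfamiliar with the pattern, the override mechanism can be sketched roughly as follows. This is a hypothetical helper, not the PR's actual code: a JSON string is parsed and merged over the default training arguments, so user-supplied keys win.

```python
import json

# Hedged sketch (hypothetical names, not the PR's implementation):
# parse the --override-training-args JSON string and merge it over
# the defaults, so any key the user supplies replaces the default.
def apply_overrides(training_args, override_json):
    overrides = json.loads(override_json)  # raises json.JSONDecodeError on bad input
    training_args.update(overrides)
    return training_args

defaults = {
    "bf16": True,
    "gradient_checkpointing": False,
    "gradient_accumulation_steps": 1,
}
merged = apply_overrides(
    defaults,
    '{"bf16": false, "gradient_checkpointing": true, "gradient_accumulation_steps": 8}',
)
```

Because the merge is a plain dict update, any HuggingFace-style training argument can be overridden without code changes, which is the point of the PR.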

Comment on lines 243 to 244
training_args["fp16"] = use_fp16
training_args["bf16"] = not use_fp16
Contributor

The bf16 issue is addressed in #993

Contributor Author

Sounds good, but this PR isn't really intended to deal with any specific training option; the point is to allow an advanced user to override any of them without needing to change the code.

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label Apr 29, 2024
mergify bot commented Apr 29, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @derekhiggins please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-actions github-actions bot removed the testing Relates to testing label Apr 29, 2024
@mergify mergify bot removed the needs-rebase This Pull Request needs to be rebased label Apr 29, 2024
@maxamillion (Contributor)

I tested this patch, and the following worked for me on my RTX A4000 GPU with 16 GB of VRAM:

$ ilab train --device cuda --override-training-args '{"bf16":false, "gradient_checkpointing":true, "gradient_accumulation_steps":8}'

TY!

@mergify mergify bot added the testing Relates to testing label Apr 30, 2024
@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label May 6, 2024
mergify bot commented May 6, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @derekhiggins please rebase it.

@mergify mergify bot added needs-rebase This Pull Request needs to be rebased and removed needs-rebase This Pull Request needs to be rebased labels May 6, 2024
mergify bot commented May 7, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @derekhiggins please rebase it.

@tyll commented May 8, 2024

Due to the complexity of the data, this seems better suited to config.yaml than to passing it on the command line.

@leseb (Contributor) left a comment

Can we have functional test coverage for this?

@mergify mergify bot added ci-failure PR has at least one CI failure and removed needs-rebase This Pull Request needs to be rebased labels May 23, 2024
@derekhiggins (Contributor, Author) commented May 23, 2024

Due to the complexity of the data, it seems this is better suited to be added to config.yaml instead of passing it on the command line.

I've added an example of how to use this from a JSON file, e.g. --override-training-args "$(< override_train_args.json)".
Would that be enough? There is currently no training section in config.yaml, and I'm not sure this is a good reason to add one.

Can we have functional test coverage for this?

If this is merged, I'll update the e2e tests, which should cover it (e.g. #1111).

Adding as a hidden argument to allow experimentation
on various devices. Eventually once we know what's needed
we can add something more permanent.

Fixes instructlab#1007

Signed-off-by: Derek Higgins <derekh@redhat.com>
@mergify mergify bot removed the ci-failure PR has at least one CI failure label May 23, 2024
@leseb leseb requested a review from tiran May 30, 2024 14:15
@mergify mergify bot added the one-approval PR has one approval from a maintainer label May 30, 2024
@nathan-weinberg nathan-weinberg requested a review from a team June 4, 2024 14:23
@mergify mergify bot removed the e2e-trigger label Jun 4, 2024
Copy link
Contributor
mergify bot commented Jun 5, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @derekhiggins please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label Jun 5, 2024
try:
    override_training_args_dict = json.loads(override_training_args)
except json.decoder.JSONDecodeError as e:
    ctx.fail("Parsing override trainign args: " + str(e))
Contributor

"trainign" nit on spelling.

I think the command fail (CLI exits) if the input is malformed, too, rather than proceeding and making the user ctl-c and reload.
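The fail-fast behavior the reviewer asks for can be sketched as follows. These are illustrative names, not the PR's actual click context: the idea is simply to exit with a clear message on malformed JSON instead of proceeding with partial state.

```python
import json
import sys

# Illustrative sketch (parse_override_training_args is a hypothetical
# name): validate the override JSON up front and exit immediately with
# a clear error message if it is malformed.
def parse_override_training_args(raw):
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        sys.exit(f"Error parsing --override-training-args: {e}")

args = parse_override_training_args('{"bf16": false}')
```

In a click-based CLI, calling ctx.fail() achieves the same effect as sys.exit() here: it prints the message and aborts with a non-zero exit code before any training starts.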

@JamesKunstle JamesKunstle self-requested a review June 21, 2024 15:35
@JamesKunstle (Contributor) left a comment

This functionality is super desirable. In the churn of designing the CLI in the context of the other pillars of the project (SDG, evaluation, publishing), it's become clear that we need more widespread, type-checked configuration for everything. @cdoern's inbound "profiles" PR accounts for this, taking a first step toward application-wide default and override configuration support.

@derekhiggins your PR is very much appreciated; we'd love your input on @cdoern's work as well.

@russellb (Member)


@JamesKunstle can you provide a link (or links) to the work you're referring to and requesting feedback on?

@derekhiggins (Contributor, Author)

Closing this; a lot has changed since it was created and it's probably no longer relevant.

Labels
needs-rebase This Pull Request needs to be rebased · one-approval PR has one approval from a maintainer · testing Relates to testing
Development
Successfully merging this pull request may close these issues:
Training options only allow for well known/tested HW
7 participants