Description
The exact settings needed to train successfully with a Linux GPU can vary quite a lot by system: AMD vs. Nvidia cards, memory available on the card, age of the card, and so on.
Here is a non-exhaustive list of the settings I've found so far that a user may want to tweak, either to get training working at all or to trade off overall training speed against resources used (see the sketch after this list):
- device(s) to use
- fp16 vs bf16 precision
- quantization, specifically 4-bit BitsAndBytes vs. not
- number of gradient accumulation steps, or disabling accumulation entirely
- gradient checkpointing enabled/disabled
- per-device training batch size
- distributed training across multiple GPUs/CPUs (may be out of scope for just config, as that's more work to set up)
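
To make the list above concrete, here is a minimal sketch of how these knobs typically map onto training parameters, assuming a Hugging Face transformers + bitsandbytes stack; the model id and specific values are placeholders, and this may not match how our training code is actually wired up:

```python
# Illustrative sketch only -- assumes a transformers + bitsandbytes training
# stack, which may not match the actual lab implementation.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# Quantization: 4-bit BitsAndBytes vs. loading full/half precision weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # set False to skip quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # or torch.float16 on cards without bf16
)

# Device(s) to use: a device_map pins the model to a specific GPU;
# setting CUDA_VISIBLE_DEVICES is the other common approach.
model = AutoModelForCausalLM.from_pretrained(
    "some-model-id",                        # placeholder model id
    quantization_config=bnb_config,
    device_map={"": 0},
)

training_args = TrainingArguments(
    output_dir="./training_output",
    per_device_train_batch_size=1,          # per-device training batch size
    gradient_accumulation_steps=4,          # 1 effectively disables accumulation
    gradient_checkpointing=True,            # trades extra compute for lower memory
    bf16=True,                              # or fp16=True, depending on the card
)
```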
Some of these are configurable via CLI flags today. Can we expose all the needed parameters via CLI flags? Do we need configuration files? Which of these also apply to other lab commands, such as serve, generate, test, and convert?
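
For the sake of discussion, a CLI surface for these knobs might look roughly like the following. The flag names here are hypothetical, not existing lab flags:

```python
# Hypothetical flag names, only to show the shape of a possible CLI surface.
import argparse

parser = argparse.ArgumentParser(prog="lab train")
parser.add_argument("--device", default="cuda:0", help="device to train on")
parser.add_argument("--precision", choices=["fp16", "bf16"], default="bf16")
parser.add_argument("--quantize-4bit", action="store_true",
                    help="load the model with 4-bit BitsAndBytes quantization")
parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
                    help="1 effectively disables accumulation")
parser.add_argument("--gradient-checkpointing", action="store_true")
parser.add_argument("--per-device-train-batch-size", type=int, default=1)
args = parser.parse_args()
```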
However we end up exposing the configuration, a list of example configurations/flags for different setups would be nice: show people the knobs to turn to lower memory usage at the expense of speed, and perhaps give some guidance on the options needed to keep GPU memory under popular thresholds like 8GB, 16GB, and 24GB (rough sketch below).
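
As a strawman for what such guidance could look like, memory-tier presets might be expressed something like this. The specific values are placeholders rather than tested recommendations; real limits depend heavily on model size, sequence length, and driver/stack versions:

```python
# Placeholder values only -- not measured numbers.
MEMORY_PRESETS = {
    "8GB": {
        "quantize_4bit": True,
        "gradient_checkpointing": True,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 8,
    },
    "16GB": {
        "quantize_4bit": True,
        "gradient_checkpointing": True,
        "per_device_train_batch_size": 2,
        "gradient_accumulation_steps": 4,
    },
    "24GB": {
        "quantize_4bit": False,
        "gradient_checkpointing": False,
        "per_device_train_batch_size": 4,
        "gradient_accumulation_steps": 1,
    },
}
```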
My assumption here is that the goal is to give people enough knobs to turn that they can get training going on their machine without having to change the actual Python code. Perhaps others disagree with that assumption? All opinions are welcome!