Description
The exact settings needed to train successfully with a Linux GPU can vary quite a lot by system: AMD vs. Nvidia cards, memory available on the card, age of the card, and so on.
Here is a non-exhaustive list of the settings I've found so far that a user may want to tweak, either to get training working at all or to trade off overall training speed against resources used (see the sketch after this list):
- device(s) to use
- fp16 vs bf16 precision
- quantization, specifically 4-bit BitsAndBytes vs. not
- number of gradient accumulation steps, or disabling accumulation entirely
- gradient checkpointing enabled/disabled
- per-device training batch size
- distributed training across multiple GPUs/CPUs (may be out of scope for just config, as that's more work to set up)
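
To make the list above concrete, here is a minimal sketch of how these knobs typically map onto training parameters, assuming a Hugging Face transformers + bitsandbytes stack; the model id and specific values are placeholders, and this may not match how our training code is actually wired up:

```python
# Illustrative sketch only -- assumes a transformers + bitsandbytes training
# stack, which may not match the actual lab implementation.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# Quantization: 4-bit BitsAndBytes vs. loading full/half precision weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # set False to skip quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # or torch.float16 on cards without bf16
)

# Device(s) to use: a device_map pins the model to a specific GPU;
# setting CUDA_VISIBLE_DEVICES is the other common approach.
model = AutoModelForCausalLM.from_pretrained(
    "some-model-id",                        # placeholder model id
    quantization_config=bnb_config,
    device_map={"": 0},
)

training_args = TrainingArguments(
    output_dir="./training_output",
    per_device_train_batch_size=1,          # per-device training batch size
    gradient_accumulation_steps=4,          # 1 effectively disables accumulation
    gradient_checkpointing=True,            # trades extra compute for lower memory
    bf16=True,                              # or fp16=True, depending on the card
)
```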
Some of these are configurable via CLI flags today. Can we expose all the needed parameters via CLI flags? Do we need configuration files? Which of these also apply to other lab commands, such as serve, generate, test, and convert?
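
For the sake of discussion, a CLI surface for these knobs might look roughly like the following. The flag names here are hypothetical, not existing lab flags:

```python
# Hypothetical flag names, only to show the shape of a possible CLI surface.
import argparse

parser = argparse.ArgumentParser(prog="lab train")
parser.add_argument("--device", default="cuda:0", help="device to train on")
parser.add_argument("--precision", choices=["fp16", "bf16"], default="bf16")
parser.add_argument("--quantize-4bit", action="store_true",
                    help="load the model with 4-bit BitsAndBytes quantization")
parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
                    help="1 effectively disables accumulation")
parser.add_argument("--gradient-checkpointing", action="store_true")
parser.add_argument("--per-device-train-batch-size", type=int, default=1)
args = parser.parse_args()
```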
However we end up exposing the configuration, a list of example configurations/flags for different setups would be nice: show people the knobs to turn to lower memory usage at the expense of speed, and perhaps give some guidance on the options needed to keep GPU memory under popular thresholds like 8GB, 16GB, and 24GB (rough sketch below).
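
As a strawman for what such guidance could look like, memory-tier presets might be expressed something like this. The specific values are placeholders rather than tested recommendations; real limits depend heavily on model size, sequence length, and driver/stack versions:

```python
# Placeholder values only -- not measured numbers.
MEMORY_PRESETS = {
    "8GB": {
        "quantize_4bit": True,
        "gradient_checkpointing": True,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 8,
    },
    "16GB": {
        "quantize_4bit": True,
        "gradient_checkpointing": True,
        "per_device_train_batch_size": 2,
        "gradient_accumulation_steps": 4,
    },
    "24GB": {
        "quantize_4bit": False,
        "gradient_checkpointing": False,
        "per_device_train_batch_size": 4,
        "gradient_accumulation_steps": 1,
    },
}
```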
My assumption here is that the goal is to give people enough knobs to turn that they can get training going on their machine without having to change the actual Python code. Perhaps others disagree with that assumption? All opinions are welcome!