Auto-detect bf16 support for CUDA by tiran · Pull Request #993 · instructlab/instructlab · GitHub

Auto-detect bf16 support for CUDA #993


Draft · tiran wants to merge 4 commits into main
1 change: 1 addition & 0 deletions requirements.txt
@@ -21,6 +21,7 @@
numpy>=1.26.4,<2.0.0 ; python_version != '3.10'
openai>=1.13.3,<2.0.0
peft>=0.9.0,<0.10.0
prompt-toolkit>=3.0.38,<4.0.0
psutil>=5.9.8,<6.0.0
pydantic>=2.6.0,<3.0.0
pydantic_yaml>=1.2.0,<2.0.0
PyYAML>=6.0.0,<7.0.0
2 changes: 2 additions & 0 deletions src/instructlab/lab.py
@@ -872,6 +872,8 @@ def convert(self, value, param, ctx) -> "torch.device":
            device = torch.device(value)
        except RuntimeError as e:
            self.fail(str(e), param, ctx)
        else:
            device = value

        if device.type not in self.supported_devices:
            supported = ", ".join(repr(s) for s in sorted(self.supported_devices))
2 changes: 1 addition & 1 deletion src/instructlab/llamacpp/llamacpp_convert_to_gguf.py
@@ -1627,7 +1627,7 @@ def convert_llama_to_gguf(
    big_endian: bool = False,
    pad_vocab: bool = False,
    skip_unknown: bool = False,
):
) -> str:
    """Convert a LLaMA model to a GGML compatible file"""
    # TODO validate vocab_type as was done in click.option declaration:
    # type=click.Choice(
52 changes: 49 additions & 3 deletions src/instructlab/train/linux_train.py
@@ -20,6 +20,7 @@
)
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer
import click
import psutil
import torch

# Local
@@ -94,6 +95,7 @@ def report_cuda_device(args_device: torch.device, min_vram: int = 0) -> None:
"""Report CUDA/ROCm device properties"""
print(f" NVidia CUDA version: {torch.version.cuda or 'n/a'}")
print(f" AMD ROCm HIP version: {torch.version.hip or 'n/a'}")
print(f" Supports bf16: {torch.cuda.is_bf16_supported()}")

    def _gib(size: int) -> str:
        return "{:.1f} GiB".format(size / 1024**3)
@@ -173,6 +175,50 @@ def linux_train(
        hpu.init()
        report_hpu_device(device)

    # each device type registers a module, e.g. torch.cpu or torch.cuda
    device_module = getattr(torch, device.type, None)
    # bfloat16 is not supported on older CUDA versions and devices
    # with CUDA support level < 8.0.
    if hasattr(device_module, "is_bf16_supported"):
        use_bf16 = device_module.is_bf16_supported()
        use_fp16 = not use_bf16
    elif device.type == "cpu":
        # TODO: check if Torch and CPU support AVX2, F16C, AVX512
        use_bf16 = False
        use_fp16 = False
    else:
        # assume bf16 supported unless device says otherwise
        use_bf16 = True
        use_fp16 = False

    torch_dtype = "auto" if device.type == "cuda" else None
Contributor:

I think it would be super useful to document why torch_dtype=None is faster on CPUs

So on CPUs, dtype=float32 will be faster than dtype=bfloat16. Why? A link to a nice explanation of that would be great

I guess torch_dtype=None gives dtype=float32 on CPUs? Why? I can't quickly find any docs that explain that. Does None just mean we use whatever torch.get_default_dtype() gives, which is float32 by default?

What dtype does torch_dtype=auto give on CPUs? Sounds like dtype=bfloat16? Why? It's detecting that this particular model was saved with that dtype?

Can't seem to find the docs on huggingface.co/docs, so see: https://github.com/huggingface/transformers/blob/f3f640dce14bee3b3930a774c3dfac92977eee7f/src/transformers/modeling_utils.py#L2878-L2898

We don't have a dtype in the model's config.json so that doesn't seem to be a factor, but it could be

If the torch_dtype=auto behavior is model-specific, but we know we want float32 except on low-memory systems ... maybe for CPUs, we should just explicitly set torch_dtype to either float32 or bfloat16?
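
A minimal sketch contrasting the two settings being discussed, assuming transformers resolves the dtype the way the modeling_utils.py code linked above suggests; the model name is a placeholder, not the project's actual default model:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

name = "some-org/some-causal-lm"  # placeholder model id

# torch_dtype=None: weights end up in torch.get_default_dtype(),
# which is float32 unless something changed it globally.
print(torch.get_default_dtype())  # torch.float32
model_fp32 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=None)
print(next(model_fp32.parameters()).dtype)  # expected: torch.float32

# torch_dtype="auto": transformers first checks config.torch_dtype and,
# if that is unset (as it reportedly is for our model), falls back to the
# dtype the checkpoint weights were saved in (often bfloat16 for recent models).
print(getattr(AutoConfig.from_pretrained(name), "torch_dtype", None))
model_auto = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto")
print(next(model_auto.parameters()).dtype)  # depends on the saved checkpoint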

Contributor Author @tiran (May 31, 2024):

It very much depends on the hardware and compiler. In general, x86_64 CPUs have support for standard precision and double precision floats (fp32, fp64). Half precision instructions (fp16) were added in ISA level x86_64-v3, and brain float (bf16) SIMD instructions were added in x86_64-v4. The document https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float16 explains autocasting.

AFAIK we want to use (see the sketch after this list):

  • bf16 when the CPU or hardware accelerator supports it
  • fall back to fp16 on Nvidia hardware with CUDA compute level < 8.0
  • fall back to fp32 on CPUs
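
A minimal sketch of that fallback order as a standalone helper (my paraphrase, not the exact code in this PR; the CPU bf16 case is left out because, as noted above, it still needs an ISA-level probe):

import torch

def pick_dtype(device: torch.device) -> torch.dtype:
    """Illustrative precision fallback: bf16 > fp16 > fp32."""
    if device.type == "cuda":
        # bf16 needs CUDA compute capability >= 8.0 (Ampere or newer);
        # torch exposes this check directly.
        if torch.cuda.is_bf16_supported():
            return torch.bfloat16
        # older NVIDIA GPUs: fall back to half precision
        return torch.float16
    # plain CPUs: stay on float32 until we can detect usable bf16 SIMD
    return torch.float32

# pick_dtype(torch.device("cpu")) -> torch.float32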

Member:

To quote from another comment I just made ...

My laptop with 32 GB of RAM performs similarly to the much larger server -- it finished in about 13 minutes compared to 12. This is with the change I added to fall back to torch_dtype="auto" with less than 64 GB of RAM. With torch_dtype=None it will run out of memory and get killed.

Meanwhile, on that powerful server (64 vCPUs and 128 GB RAM), using torch_dtype="auto" absolutely kills performance. I know it would have run for many, many hours.

There's definitely more going on here. Maybe it's a difference in instructions available on the different CPUs in these environments?

I don't understand what's going on here well enough to explain this. At least in my two test environments the current code seems to be a nice improvement.

It would definitely be better if we had a clearer explanation. I imagine there's a bit of luck involved right now; instead of checking something else, we need to figure out how to set the ideal configuration.

Member:

Based on @tiran's comment, it's possible that my laptop has this support while my server does not:

  • bf16 when the CPU or hardware accelerator supports it

which could explain why "auto" gets the best performance on my laptop.

If "auto" isn't doing an adequate job of checking if that is supported and is choosing it even if it's not actually supported on my server, maybe that's killing the performance. I'll have to keep digging here ...

Contributor Author @tiran (May 31, 2024):

What is the output of torch.backends.cpu.get_cpu_capability() on your server and your laptop?

I get AVX512 on a server with an Intel Xeon Platinum (which has the AVX-512 instruction set) and AVX2 on an Intel Core i7-8650U (which has AVX2 but not AVX-512).

Member:

On my laptop -- 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

>>> torch.backends.cpu.get_cpu_capability()
'AVX512'

On the server -- AMD EPYC 7R32

>>> torch.backends.cpu.get_cpu_capability()
'AVX2'

if device.type == "cpu":
total_memory = psutil.virtual_memory().total / (1024**3)
if total_memory < 60:
Contributor:

Suggested change:
-        if total_memory < 60:
+        if total_memory < 62:

A system with 64 GB of RAM will report:

>>> import psutil
>>> mem = psutil.virtual_memory()
>>> mem
svmem(total=67228049408, available=31099351040, percent=53.7, used=35383861248, free=468701184, active=27983499264, inactive=37159084032, buffers=1079336960, cached=30296150016, shared=2109440, slab=1340628992)

Converting to GiB: 67228049408 bytes / 1024**3 ≈ 62.6 GiB.
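
As a quick sanity check, a small snippet (an illustration, not part of the PR) showing what this threshold sees on a given machine, using the suggested 62 GiB cutoff:

import psutil

total_gib = psutil.virtual_memory().total / (1024**3)
print(f"total RAM: {total_gib:.1f} GiB")
# mirrors the check in the diff above, with the suggested 62 GiB cutoff
print("would fall back to torch_dtype='auto':", total_gib < 62)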

            # Using our default model, a system with 32 GB of RAM
            # will get OOM killed using torch_dtype=None, though we
            # seem to get much better performance with this setting
            # where there's enough memory. Using `None` makes it
            # use float32 as opposed to float16 or bf16.
            #
            # Anecdotally, 64 GB seems to be enough, but this calculation
Contributor:

A system with 64 GB of RAM will report ~62.6 GiB, so we base our calculation on 62.

Member:

Since it's such a rough guess, 60 still seems fine? We need to actually do some math at some point ...

Contributor:

I'll share my math in a few :) stay tuned!

Contributor:

[Screenshot from 2024-06-07 14:23:04]

Some more numbers:

  • The training part takes ~30 GB of RAM. There is a very small chance this could work on a very minimal Linux installation; by minimal I mean only system-critical services running and nothing else.
  • The inference part takes ~35 GB of RAM.

Essentially, a system with 48 GB of RAM should be able to run both training and inference, although 48 GB of RAM is not a very common configuration.
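
These figures presumably come from external monitoring (the screenshot above). Purely as an illustration of how such peaks could be sampled, and not how these numbers were actually gathered, a small psutil-based poller might look like this:

import time
import psutil

def peak_rss_gib(pid: int, interval: float = 1.0, duration: float = 600.0) -> float:
    """Poll a process's resident set size and return the observed peak in GiB."""
    proc = psutil.Process(pid)
    peak = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline and proc.is_running():
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval)
    return peak / 1024**3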

            # may come out to be slightly less than 64 GB, so we just check
            # for 60 GB. It would be better to do a smarter calculation on
            # the actual memory requirement here.
            torch_dtype = "auto"

    # torch compile fails to build, see PyTorch #124707
    # scaled_dot_product_attention(): argument 'is_causal' must be bool, not SymBool
    use_torch_compile = False
    # if device.type == "cuda" and torch.version.cuda is not None:
Contributor:

leftover?

    #     # check for NVIDIA V100, A100, or H100
    #     cap = torch.cuda.get_device_capability(device)
    #     use_torch_compile = cap in {(7, 0), (8, 0), (9, 0)}

    print(
        f"LINUX_TRAIN.PY: {use_bf16=}, {use_fp16=}, {torch_dtype=}, {use_torch_compile=}"
    )

print("LINUX_TRAIN.PY: LOADING DATASETS")
# Get the file name
train_dataset = load_dataset("json", data_files=train_file, split="train")
@@ -194,6 +240,7 @@

    if four_bit_quant:
        print("LINUX_TRAIN.PY: USING 4-bit quantization with BitsAndBytes")
        use_bf16 = False
Contributor:

yeah, I was thinking this should go here. I noticed we were doing this already by setting it to !fp16 below (I think?)

        use_fp16 = True
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
@@ -203,7 +250,6 @@
        )
    else:
        print("LINUX_TRAIN.PY: NOT USING 4-bit quantization")
        use_fp16 = False
        bnb_config = None

    # Loading the model
@@ -214,7 +260,7 @@

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        torch_dtype=torch_dtype,
        quantization_config=bnb_config,
        config=config,
        trust_remote_code=True,
@@ -340,7 +386,7 @@ def model_generate(user, **kwargs):
        num_train_epochs=num_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        fp16=use_fp16,
        bf16=not use_fp16,
        bf16=use_bf16,
        # use_ipex=True, # TODO CPU test this possible optimization
        use_cpu=model.device.type == "cpu",
        save_strategy="epoch",