Errors when training with mixed precision · Issue #16 · fal-ai/f-lite · GitHub

Errors when training with mixed precision #16


Open
megatomik opened this issue May 18, 2025 · 1 comment

@megatomik

If I train with fp16 I get this:

(main) root@C.20177420:/workspace$ python -m f_lite.train  --pretrained_model_path /model --train_data_path /train.csv --base_image_dir /images --output_dir  ./flite7b_lora_ckpts  --resolution  256  --use_8bit_adam  --seed 1 --gradient_checkpointing  --mixed_precision fp16 --train_batch_size 4  --gradient_accumulation_steps 1  --sample_prompts_file /captions.txt  --learning_rate 1e-5 --num_epochs    1 --lr_scheduler  linear --sample_every  100  --use_resolution_buckets  
/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py:498: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/18/2025 13:37:11 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

05/18/2025 13:37:11 - INFO - __main__ - Using random seed: 1
05/18/2025 13:37:11 - INFO - __main__ - Loading model from /model
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.05s/it]
Loading pipeline components...:  50%|█████████████████████████████████████████████████████████████████████████████▌                                                                             | 2/4 [00:02<00:02,  1.00s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.63s/it]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
05/18/2025 13:37:28 - INFO - __main__ - Number of parameters: 6878.23 million
Loaded dataset with 1000 entries
Created 865 resolution buckets
05/18/2025 13:37:28 - INFO - __main__ - Using 8-bit AdamW optimizer from bitsandbytes
05/18/2025 13:37:28 - INFO - __main__ - No checkpoint specified, starting from scratch
Training:   0%|                                                                                                                                                                                        | 0/21 [00:00<?, ?it/s]Dataset size: 1000 images
Dataloader batches: 21
Calculated max steps: 21
05/18/2025 13:37:28 - INFO - __main__ - Starting epoch 1/1
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1254, in <module>
    train(args)
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1072, in train
    accelerator.clip_grad_norm_(dit_model.parameters(), args.max_grad_norm)
  File "/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py", line 2628, in clip_grad_norm_
    self.unscale_gradients()
  File "/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py", line 2567, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/venv/main/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 342, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 283, in _unscale_grads_
    torch._amp_foreach_non_finite_check_and_unscale_(
RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
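The error suggests the model's parameters (and hence their gradients) are in bf16, while the fp16 path uses a `GradScaler`, whose CUDA unscale kernel has no bf16 implementation. A minimal sketch of a workaround, assuming the checkpoint was saved in bf16 and that casting trainable parameters back to fp32 is acceptable (the helper name is illustrative, not part of f-lite):

```python
import torch
import torch.nn as nn

def prepare_for_fp16_training(model: nn.Module) -> nn.Module:
    """Cast bf16 parameters to fp32 so GradScaler.unscale_ can run.

    With --mixed_precision fp16, accelerate's GradScaler calls
    torch._amp_foreach_non_finite_check_and_unscale_, which is not
    implemented for bf16 gradients.
    """
    for param in model.parameters():
        if param.dtype == torch.bfloat16:
            param.data = param.data.float()
    return model

# Mimic a model loaded from a bf16 checkpoint.
model = nn.Linear(4, 4).to(torch.bfloat16)
prepare_for_fp16_training(model)
assert all(p.dtype == torch.float32 for p in model.parameters())
```

With fp32 master weights, autocast still runs the forward pass in fp16 while the scaler operates on fp32 gradients.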

And if I train with bf16 I get this:

Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

05/18/2025 13:38:15 - INFO - __main__ - Using random seed: 1
05/18/2025 13:38:15 - INFO - __main__ - Loading model from /model
Loading pipeline components...:   0%|                                                                                                                                                                   | 0/4 [00:00<?, ?it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 78.77it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 127.74it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.50it/s]
05/18/2025 13:38:18 - INFO - __main__ - Number of parameters: 6878.23 million
Loaded dataset with 1000 entries
Created 865 resolution buckets
05/18/2025 13:38:19 - INFO - __main__ - Using 8-bit AdamW optimizer from bitsandbytes
05/18/2025 13:38:19 - INFO - __main__ - No checkpoint specified, starting from scratch
Training:   0%|                                                                                                                                                                                        | 0/21 [00:00<?, ?it/s]Dataset size: 1000 images
Dataloader batches: 21
Calculated max steps: 21
05/18/2025 13:38:19 - INFO - __main__ - Starting epoch 1/1
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1254, in <module>
    train(args)
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1051, in train
    total_loss, diffusion_loss = forward(
                                 ^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 466, in forward
    vae_latent = vae_model.encode(images_vae).latent_dist.sample()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 278, in encode
    h = self._encode(x)
        ^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 252, in _encode
    enc = self.encoder(x)
          ^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/vae.py", line 156, in forward
    sample = self.conv_in(sample)
             ^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
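Here the mismatch is the reverse: under bf16 the VAE's weights are bf16, but the dataloader still yields fp32 image tensors, so the first `conv2d` rejects the input. A small sketch of the usual fix, casting the batch to the encoder's dtype before `vae_model.encode(...)` (the conv below is a stand-in for diffusers' `AutoencoderKL`, not the real module):

```python
import torch
import torch.nn as nn

# Stand-in for the VAE encoder's first convolution, loaded in bf16.
vae_conv_in = nn.Conv2d(3, 8, kernel_size=3).to(torch.bfloat16)

# Dataloaders typically yield fp32 tensors.
images = torch.randn(1, 3, 32, 32)

# Without the cast, F.conv2d raises:
#   RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
out = vae_conv_in(images.to(vae_conv_in.weight.dtype))
assert out.dtype == torch.bfloat16
```

In `f_lite/train.py` that would mean casting `images_vae` to the VAE's parameter dtype before the `encode` call, or keeping the VAE in fp32 and relying on autocast for the DiT only.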
@megatomik
Author

Without mixed precision it works for a while, but eventually I get this:


05/18/2025 13:39:02 - INFO - __main__ - Using random seed: 1
05/18/2025 13:39:02 - INFO - __main__ - Loading model from /model
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.04s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.71s/it]
Loading pipeline components...:  75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                      | 3/4 [00:06<00:02,  2.47s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:06<00:00,  1.62s/it]
05/18/2025 13:39:21 - INFO - __main__ - Number of parameters: 6878.23 million
Loaded dataset with 1000 entries
Created 865 resolution buckets
05/18/2025 13:39:21 - INFO - __main__ - Using 8-bit AdamW optimizer from bitsandbytes
05/18/2025 13:39:21 - INFO - __main__ - No checkpoint specified, starting from scratch
Training:   0%|                                                                                                                                                                                        | 0/21 [00:00<?, ?it/s]Dataset size: 1000 images
Dataloader batches: 21
Calculated max steps: 21
05/18/2025 13:39:21 - INFO - __main__ - Starting epoch 1/1
Training:  95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊      | 20/21 [00:11<00:00,  2.27it/s, loss=0.5116, diff_loss=0.5116, lr=0.0000005]Traceback (most recent call last):
  File "/venv/main/lib/python3.12/site-packages/einops/einops.py", line 532, in reduce
    return _apply_recipe(
           ^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/einops/einops.py", line 235, in _apply_recipe
    init_shapes, axes_reordering, reduced_axes, added_axes, final_shapes, n_axes_w_added = _reconstruct_from_shape(
                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/einops/einops.py", line 188, in _reconstruct_from_shape_uncached
    raise EinopsError(f"Shape mismatch, can't divide axis of length {length} in chunks of {known_product}")
einops.EinopsError: Shape mismatch, can't divide axis of length 41 in chunks of 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1254, in <module>
    train(args)
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1051, in train
    total_loss, diffusion_loss = forward(
                                 ^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 539, in forward
    targ = rearrange(v_objective, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=2, p2=2)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/einops/einops.py", line 600, in rearrange
    return reduce(tensor, pattern, reduction="rearrange", **axes_lengths)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/einops/einops.py", line 542, in reduce
    raise EinopsError(message + "\n {}".format(e))
einops.EinopsError:  Error while processing rearrange-reduction pattern "b c (h p1) (w p2) -> b (h w) (p1 p2 c)".
 Input tensor shape: torch.Size([4, 16, 41, 32]). Additional info: {'p1': 2, 'p2': 2}.
 Shape mismatch, can't divide axis of length 41 in chunks of 2
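The latent tensor is `[4, 16, 41, 32]`: with the VAE's 8× downsampling, a height of 41 corresponds to a 328-px image, and 41 is not divisible by the DiT patch size `p1=2`. This points at the resolution bucketing producing bucket heights that aren't multiples of 16 (8× VAE downsample times 2× patch). A hypothetical rounding helper to keep buckets valid (the function name is illustrative):

```python
def snap_to_multiple(size: int, multiple: int = 16) -> int:
    """Round an image dimension down to the nearest valid multiple.

    With an 8x VAE downsample and a 2x2 patchify, latent dims must be
    even, so image dims must be multiples of 16.
    """
    return max(multiple, (size // multiple) * multiple)

# 328 px -> latent height 328 // 8 == 41, which is odd and breaks
# rearrange(..., p1=2); snapping to 320 gives latent height 40.
assert snap_to_multiple(328) == 320
assert snap_to_multiple(320) == 320
assert (snap_to_multiple(328) // 8) % 2 == 0
```

Applying such a snap when building the resolution buckets (or center-cropping each batch to snapped dimensions) should prevent the einops failure on the last, odd-sized bucket.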
