Errors when training with mixed precision #16

@megatomik

Description

If I train with fp16 I get this:

(main) [email protected]:/workspace$ python -m f_lite.train  --pretrained_model_path /model --train_data_path /train.csv --base_image_dir /images --output_dir  ./flite7b_lora_ckpts  --resolution  256  --use_8bit_adam  --seed 1 --gradient_checkpointing  --mixed_precision fp16 --train_batch_size 4  --gradient_accumulation_steps 1  --sample_prompts_file /captions.txt  --learning_rate 1e-5 --num_epochs    1 --lr_scheduler  linear --sample_every  100  --use_resolution_buckets  
/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py:498: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/18/2025 13:37:11 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

05/18/2025 13:37:11 - INFO - __main__ - Using random seed: 1
05/18/2025 13:37:11 - INFO - __main__ - Loading model from /model
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.05s/it]
Loading pipeline components...:  50%|█████████████████████████████████████████████████████████████████████████████▌                                                                             | 2/4 [00:02<00:02,  1.00s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.63s/it]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
05/18/2025 13:37:28 - INFO - __main__ - Number of parameters: 6878.23 million
Loaded dataset with 1000 entries
Created 865 resolution buckets
05/18/2025 13:37:28 - INFO - __main__ - Using 8-bit AdamW optimizer from bitsandbytes
05/18/2025 13:37:28 - INFO - __main__ - No checkpoint specified, starting from scratch
Training:   0%|                                                                                                                                                                                        | 0/21 [00:00<?, ?it/s]Dataset size: 1000 images
Dataloader batches: 21
Calculated max steps: 21
05/18/2025 13:37:28 - INFO - __main__ - Starting epoch 1/1
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1254, in <module>
    train(args)
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1072, in train
    accelerator.clip_grad_norm_(dit_model.parameters(), args.max_grad_norm)
  File "/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py", line 2628, in clip_grad_norm_
    self.unscale_gradients()
  File "/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py", line 2567, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/venv/main/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 342, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 283, in _unscale_grads_
    torch._amp_foreach_non_finite_check_and_unscale_(
RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
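For context on the fp16 failure: `GradScaler.unscale_` only works on fp32 gradients, and the "not implemented for 'BFloat16'" message suggests the DiT weights (and therefore their gradients) were loaded in bf16. This is an assumption about how f_lite loads the model, but the usual fp16-AMP contract can be sketched like this: keep trainable parameters in fp32 and let autocast handle the low-precision compute.

```python
import torch

# Sketch of the standard fp16/bf16 AMP contract (assumption: the trainer
# received the model weights in bf16, as the error above suggests).
# GradScaler can only unscale fp32 gradients, so trainable parameters must
# stay fp32; autocast downcasts activations for the forward pass.

model = torch.nn.Linear(4, 4).to(torch.bfloat16)  # how the weights may arrive
model = model.to(torch.float32)                   # workaround: upcast before training

x = torch.randn(2, 4)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)          # compute runs in bf16 under autocast
y.float().sum().backward()

# Gradients land on the fp32 parameters as fp32, which GradScaler accepts.
print(model.weight.grad.dtype)  # torch.float32
```

With bf16 weights left in place, the gradients would be bf16 and `scaler.unscale_` would raise exactly the `_amp_foreach_non_finite_check_and_unscale_cuda` error shown above.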

And if I train with bf16 I get this:

Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

05/18/2025 13:38:15 - INFO - __main__ - Using random seed: 1
05/18/2025 13:38:15 - INFO - __main__ - Loading model from /model
Loading pipeline components...:   0%|                                                                                                                                                                   | 0/4 [00:00<?, ?it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 78.77it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 127.74it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.50it/s]
05/18/2025 13:38:18 - INFO - __main__ - Number of parameters: 6878.23 million
Loaded dataset with 1000 entries
Created 865 resolution buckets
05/18/2025 13:38:19 - INFO - __main__ - Using 8-bit AdamW optimizer from bitsandbytes
05/18/2025 13:38:19 - INFO - __main__ - No checkpoint specified, starting from scratch
Training:   0%|                                                                                                                                                                                        | 0/21 [00:00<?, ?it/s]Dataset size: 1000 images
Dataloader batches: 21
Calculated max steps: 21
05/18/2025 13:38:19 - INFO - __main__ - Starting epoch 1/1
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1254, in <module>
    train(args)
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1051, in train
    total_loss, diffusion_loss = forward(
                                 ^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 466, in forward
    vae_latent = vae_model.encode(images_vae).latent_dist.sample()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 278, in encode
    h = self._encode(x)
        ^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 252, in _encode
    enc = self.encoder(x)
          ^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/vae.py", line 156, in forward
    sample = self.conv_in(sample)
             ^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
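The bf16 failure is a plain dtype mismatch: the pixel batch arrives as fp32 while the VAE weights are bf16. A common workaround (hypothetical here, since I haven't patched `f_lite/train.py` myself) is to cast the input to the encoder's parameter dtype before calling `vae_model.encode(...)`. A minimal stand-in with a single conv layer shows the pattern:

```python
import torch

# Stand-in for the VAE's conv_in layer that raises the error above when fed
# fp32 input while its weights are bf16.
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(torch.bfloat16)

images = torch.randn(1, 3, 16, 16)        # fp32, like a typical dataloader batch

# Casting the input to the module's weight dtype avoids the
# "Input type (float) and bias type (c10::BFloat16)" RuntimeError.
out = conv(images.to(conv.weight.dtype))
print(out.dtype)  # torch.bfloat16
```

Applied to the trainer, the analogous change would be something like `vae_model.encode(images_vae.to(next(vae_model.parameters()).dtype))` — again an assumption about the surrounding code, not a tested patch.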
