If I train with fp16, I get this:
(main) [email protected]:/workspace$ python -m f_lite.train --pretrained_model_path /model --train_data_path /train.csv --base_image_dir /images --output_dir ./flite7b_lora_ckpts --resolution 256 --use_8bit_adam --seed 1 --gradient_checkpointing --mixed_precision fp16 --train_batch_size 4 --gradient_accumulation_steps 1 --sample_prompts_file /captions.txt --learning_rate 1e-5 --num_epochs 1 --lr_scheduler linear --sample_every 100 --use_resolution_buckets
/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py:498: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/18/2025 13:37:11 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
05/18/2025 13:37:11 - INFO - __main__ - Using random seed: 1
05/18/2025 13:37:11 - INFO - __main__ - Loading model from /model
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.05s/it]
Loading pipeline components...: 50%|█████████████████████████████████████████████████████████████████████████████▌ | 2/4 [00:02<00:02, 1.00s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.63s/it]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.48s/it]
05/18/2025 13:37:28 - INFO - __main__ - Number of parameters: 6878.23 million
Loaded dataset with 1000 entries
Created 865 resolution buckets
05/18/2025 13:37:28 - INFO - __main__ - Using 8-bit AdamW optimizer from bitsandbytes
05/18/2025 13:37:28 - INFO - __main__ - No checkpoint specified, starting from scratch
Training: 0%| | 0/21 [00:00<?, ?it/s]Dataset size: 1000 images
Dataloader batches: 21
Calculated max steps: 21
05/18/2025 13:37:28 - INFO - __main__ - Starting epoch 1/1
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1254, in <module>
train(args)
File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1072, in train
accelerator.clip_grad_norm_(dit_model.parameters(), args.max_grad_norm)
File "/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py", line 2628, in clip_grad_norm_
self.unscale_gradients()
File "/venv/main/lib/python3.12/site-packages/accelerate/accelerator.py", line 2567, in unscale_gradients
self.scaler.unscale_(opt)
File "/venv/main/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 342, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 283, in _unscale_grads_
torch._amp_foreach_non_finite_check_and_unscale_(
RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
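The fp16 failure comes from `GradScaler.unscale_`, which only supports float32/float16 gradients; it suggests the trainable parameters (and hence their grads) are stored in bfloat16, likely because the checkpoint was loaded with `torch_dtype=torch.bfloat16`. A minimal workaround sketch, assuming the trainable model is the `dit_model` passed to `accelerator.prepare()` in `train.py` (the helper name `ensure_fp32_params` is hypothetical, not part of f_lite):

```python
import torch

def ensure_fp32_params(model: torch.nn.Module) -> torch.nn.Module:
    # Upcast any bfloat16 parameters to float32 so that GradScaler.unscale_
    # (used by --mixed_precision fp16) can check/unscale their gradients.
    for p in model.parameters():
        if p.dtype == torch.bfloat16:
            p.data = p.data.float()
    return model

# Sketch: call this before accelerator.prepare(dit_model, ...)
# dit_model = ensure_fp32_params(dit_model)
```

With fp32 master weights, autocast still runs the forward pass in fp16, but the grads land in fp32, which the scaler accepts.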
And if I train with bf16, I get this:
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: bf16
05/18/2025 13:38:15 - INFO - __main__ - Using random seed: 1
05/18/2025 13:38:15 - INFO - __main__ - Loading model from /model
Loading pipeline components...: 0%| | 0/4 [00:00<?, ?it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 78.77it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 127.74it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 8.50it/s]
05/18/2025 13:38:18 - INFO - __main__ - Number of parameters: 6878.23 million
Loaded dataset with 1000 entries
Created 865 resolution buckets
05/18/2025 13:38:19 - INFO - __main__ - Using 8-bit AdamW optimizer from bitsandbytes
05/18/2025 13:38:19 - INFO - __main__ - No checkpoint specified, starting from scratch
Training: 0%| | 0/21 [00:00<?, ?it/s]Dataset size: 1000 images
Dataloader batches: 21
Calculated max steps: 21
05/18/2025 13:38:19 - INFO - __main__ - Starting epoch 1/1
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1254, in <module>
train(args)
File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 1051, in train
total_loss, diffusion_loss = forward(
^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/f_lite/train.py", line 466, in forward
vae_latent = vae_model.encode(images_vae).latent_dist.sample()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 278, in encode
h = self._encode(x)
^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 252, in _encode
enc = self.encoder(x)
^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/diffusers/models/autoencoders/vae.py", line 156, in forward
sample = self.conv_in(sample)
^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
return F.conv2d(
^^^^^^^^^
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
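The bf16 failure is a plain dtype mismatch at the VAE's first conv: the VAE weights are bfloat16 while the image batch is float32. A hedged sketch of the fix at the call site in `forward()` (`vae_model` and `images_vae` are the names from `f_lite/train.py`; the helper name is hypothetical) would be to cast the batch to the VAE's parameter dtype before encoding:

```python
import torch

def encode_with_matching_dtype(vae_model, images_vae: torch.Tensor):
    # Cast the image batch to whatever dtype the VAE weights use (bfloat16
    # here) so F.conv2d sees matching input and bias dtypes.
    vae_dtype = next(vae_model.parameters()).dtype
    return vae_model.encode(images_vae.to(vae_dtype)).latent_dist.sample()
```

Alternatively, keeping the VAE in float32 (it is frozen and only used for encoding) sidesteps the mismatch entirely.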