Qwen Image Models Training - 0 to Hero Level Tutorial - LoRA & Fine Tuning - Base & Edit Model #344
FurkanGozukara announced in Tutorials
Full tutorial: https://www.youtube.com/watch?v=DPX3eBTuO_Y
This is a full comprehensive step-by-step tutorial for how to train Qwen Image models. This tutorial covers how to do LoRA training and full Fine-Tuning / DreamBooth training on Qwen Image models. It covers both the Qwen Image base model and the Qwen Image Edit Plus 2509 model. This tutorial is the product of 21 days of full R&D, costing over $800 in cloud services to find the best configurations for training. Furthermore, we have developed an amazing, ultra-easy-to-use Gradio app to use the legendary Kohya Musubi Tuner trainer with ease. You will be able to train locally on your Windows computer with GPUs with as little as 6 GB of VRAM for both LoRA and Fine-Tuning.
The post used in the tutorial to download the zip file: https://www.patreon.com/posts/qwen-trainer-app-137551634
Requirements tutorial: https://youtu.be/DrhUHnYfwC0
SwarmUI tutorial: https://youtu.be/c3gEoAyL2IE
Video Chapters
00:00:00 Introduction & Tutorial Goals
00:00:59 Showcase: Realistic vs. Style Training (GTA 5 Example)
00:01:26 Showcase: High-Quality Product Training
00:01:40 Showcase: Qwen Image Edit Model Capabilities
00:01:57 Effort & Cost Behind The Tutorial
00:02:19 Introducing The Custom Training Application & Presets
00:03:09 Power of Qwen Models: High-Quality Results from a Small Dataset
00:03:58 Detailed Tutorial Outline & Chapter Flow
00:04:36 Part 4: Dataset Preparation (Critical Section)
00:05:05 Part 5: Monitoring Training & Performance
00:05:23 Part 6: Generating High-Quality Images with Presets
00:05:44 Part 7: Specialized Training Scenarios
00:06:07 Why You Should Watch The Entire Tutorial
00:07:15 Part 1 Begins: Finding Resources & Downloading The Zip File
00:07:50 Mandatory Prerequisites (Python, CUDA, FFmpeg)
00:08:30 Core Application Installation on Windows
00:09:47 Part 2: Downloading The Qwen Training Models
00:10:28 Features of The Custom Downloader (Fast & Resumable)
00:11:24 Verifying Model Downloads & Hash Check
00:12:41 Part 3 Begins: Starting The Application & UI Overview
00:13:16 Crucial First Step: Selecting & Loading a Training Preset
00:13:43 Understanding The Preset Structure (LoRA/Fine-Tune, Epochs, Tiers)
00:15:01 System & VRAM Preparation: Checking Your Free VRAM
00:16:07 How to Minimize VRAM Usage Before Training
00:17:06 Setting Checkpoint Save Path & Frequency
00:19:05 Saving Your Custom Configuration File
00:19:52 Part 4 Begins: Dataset Preparation Introduction
00:20:10 Using The Ultimate Batch Image Processing Tool
00:20:53 Stage 1: Auto-Cropping & Subject Focusing
00:23:37 Stage 2: Resizing Images to Final Training Resolution
00:25:49 Critical: Dataset Quality Guidelines & Best Practices
00:27:19 The Importance of Variety (Clothing, Backgrounds, Angles)
00:29:10 New Tool: Internal Image Pre-Processing Preview
00:31:21 Using The Debug Mode to See Each Processed Image
00:32:21 How to Structure The Dataset Folder For Training
00:34:31 Pointing The Trainer to Your Dataset Folder
00:35:19 Captioning Strategy: Why a Single Trigger Word is Best
00:36:30 Optional: Using The Built-in Detailed Image Captioner
00:39:56 Finalizing Model Paths & Settings
00:40:34 Setting The Base Model, VAE, and Text Encoder Paths
00:41:59 Training Settings: How Many Epochs Should You Use?
00:43:45 Part 5 Begins: Starting & Monitoring The Training
00:46:41 Performance Optimization: How to Improve Training Speed
00:48:35 Tip: Overclocking with MSI Afterburner
00:49:25 Part 6 Begins: Testing & Finding The Best Checkpoint
00:51:35 Using The Grid Generator to Compare Checkpoints
00:55:33 Analyzing The Comparison Grid to Find The Best Checkpoint
00:57:21 How to Resume an Incomplete LoRA Training
00:59:02 Generating Images with Your Best LoRA
01:00:21 Workflow: Generate Low-Res Previews First, Then Upscale
01:01:26 The Power of Upscaling: Before and After
01:02:08 Fixing Faces with Automatic Segmentation Inpainting
01:04:28 Manual Inpainting for Maximum Control
01:06:31 Batch Generating Images with Wildcards
01:08:49 How to Write Excellent Prompts with Google AI Studio (Gemini)
01:10:04 Quality Comparison: Tier 1 (BF16) vs Tier 2 (FP8 Scaled)
01:12:10 Part 7 Begins: Fine-Tuning (DreamBooth) Explained
01:13:36 Converting 40GB Fine-Tuned Models to FP8 Scaled
01:15:15 Testing Fine-Tuned Checkpoints
01:16:27 Training on The Qwen Image Edit Model
01:17:39 Using The Trained Edit Model for Prompt-Based Editing
01:24:22 Advanced: Teaching The Edit Model New Commands (Control Images)
01:27:01 Performance Impact of Training with Control Images
01:31:41 How to Resume an Incomplete Fine-Tuning Training
01:33:08 Recap: How to Use Your Trained Models
01:35:36 Using Fine-Tuned Models in SwarmUI
01:37:16 Specialized Scenario: Style Training
01:38:20 Style Dataset Guidelines: Consistency & No Repeating Elements
01:40:25 Generating Prompts for Your Trained Style with Gemini
01:44:45 Generating Images with Your Trained Style Model
01:46:41 Specialized Scenario: Product Training
01:47:34 Product Dataset Guidelines: Proportions & Detail Shots
01:48:56 Generating Prompts for Your Trained Product with Gemini
01:50:52 Conclusion & Community Links (Discord, GitHub, Reddit)
Video Transcription
00:00:00 Greetings everyone, welcome to the most comprehensive yet easy-to-follow Qwen
00:00:06 models training tutorial. In this tutorial, I am going to show you from scratch to the grandmaster
00:00:14 level how to train Qwen models on your local Windows computer. After watching this tutorial,
00:00:22 you will be able to train your models locally on your Windows computer and generate amazing
00:00:30 images. I am going to show both LoRA training and also fine-tuning training. Furthermore,
00:00:36 I will show Qwen base model training and Qwen Edit Plus model training. This tutorial is
00:00:45 extremely comprehensive, so therefore, check out the tutorial description to see the chapters.
00:00:52 Moreover, in a moment, I will show you the layout of the tutorial, so keep watching.
00:00:59 In this tutorial, I am not going to show only realistic images,
00:01:03 but I am going to show you style training as well. For example, I have trained a GTA 5 style and
00:01:11 shared it on CivitAI along with the style dataset, so I will explain how to train your own style and
00:01:20 generate excellent images with your trained style. Furthermore, I will show you how to
00:01:26 train a product like this one and generate amazing product images with the highest quality,
00:01:33 with small text or logos, while keeping the consistency and accuracy of the products.
00:01:40 Moreover, after you have trained the Qwen image edit model, even without control images,
00:01:46 you will be able to do prompt-based editing. For example, I say "replace the
00:01:52 head of this man" and it generates this image. I will show all of that. You will
00:01:57 see it. For preparing this tutorial, I have worked over 20 days and spent over
00:02:04 $600 on research and development. You can see that on a single day I spent $110
00:02:13 on RunPod. When we also include MassedCompute, I have spent over $700 or $800 on research.
00:02:19 Moreover, I have prepared an application that is very easy to use, with pre-made configurations: LoRA training
00:02:27 configurations, as you can see, all split by GPU tier, and fine-tuning
00:02:35 configurations, also split by GPU tier. This application is fully developed by me,
00:02:42 it is using the famous Kohya SS GUI tuner, and it is so easy to use. You just load the configuration,
00:02:49 set up a few things, and you are ready to go. I will explain everything. Furthermore, we
00:02:54 have one-click installers for this application for Windows, RunPod, and MassedCompute, including the
00:03:02 base models download. This application supports Wan 2.1 and Wan 2.2 models training as well. Also,
00:03:09 this model is extremely powerful. If you paid attention to the images that I have shown,
00:03:14 you will see that it is able to do a lot of emotions very accurately. It is able to do
00:03:20 very hard prompts, very hard complex prompts very accurately. And I didn't even use a very
00:03:27 powerful training images dataset. I just used 28 medium quality images. However, with only small
00:03:36 and medium quality dataset, I am able to get amazing, mind-blowing quality images like these
00:03:43 ones. You see all of them are highest quality, really good, both realistic, and it can do style
00:03:50 images already very well. So this Qwen model is extremely powerful and my new favorite model.
00:03:58 So let me also show you the flow of the tutorial as well. So the rest of the tutorial
00:04:04 flow will be like this. Part 1, initial setup and installation, introduction and finding resources,
00:04:11 mandatory prerequisites, the requirement tutorial, core application installation. Then Part 2 will
00:04:17 be the downloading training models. Part 3 will be starting and navigating the user interface,
00:04:23 the Gradio application that I have developed, loading and training configuration presets,
00:04:28 system and VRAM preparation, detailed training parameters setup. Part 4 will be dataset
00:04:36 preparation. This is super critical: if you are training for the first time, this part will be super
00:04:42 useful for you. Using the ultimate batch image processing tool, this is another tool that I have
00:04:47 developed, dataset quality and guidelines, this is super important. New tool that I have added
00:04:53 using the internal image pre-processing, dataset structuring for the trainer, this is important,
00:04:59 captioning your dataset and the impact of it, finalizing model paths and settings.
00:05:05 In the Part 5, we are going to see monitoring training and performance optimizations,
00:05:11 testing and finding best checkpoint, resuming incomplete trainings, either it is a LoRA or
00:05:18 fine-tuning. Then in the Part 6, I will show generating high-quality images. I
00:05:23 have prepared amazing presets so that with one click you will be able to generate
00:05:28 highest quality images with your trained Qwen models, but we are supporting so many models,
00:05:33 not just Qwen. Image generation workflow in SwarmUI, fixing some of the images, inpainting,
00:05:39 this is also extremely useful, you will love it. Part 7, specialized training scenarios,
00:05:44 the difference between fine-tuning and LoRA, and training on the Qwen image edit model, if you are interested in
00:05:50 Qwen image edit model training: teaching the model new commands like replace clothing,
00:05:56 replace hair color, or colorize this sketch or line art; then style training and
00:06:02 how it differs, product training and how it differs, and Part 8 is the conclusion.
00:06:07 So I really recommend that you watch this tutorial from beginning to end without skipping any
00:06:12 part. This tutorial will also help you significantly in your future trainings,
00:06:19 whether it is Qwen or Wan 2.2. Hopefully after this tutorial, I will work on Wan 2.2
00:06:24 training. Therefore, this tutorial will help you significantly in the future as well. And
00:06:30 I am saying that this is a tutorial, however, this is literally a full course. So therefore,
00:06:36 try to learn everything I have explained in this tutorial and improve your skills,
00:06:41 your knowledge, and utilize this knowledge in your professional life. This tutorial,
00:06:47 I can say, is really like a full course. I have spent a huge amount of time on it, and you
00:06:54 will love this tutorial, you will enjoy it, and you will learn so much
00:06:59 information from it. This tutorial is the product of two years of experience working on
00:07:07 these generative AI models, training them, doing research, doing experimentation. So let's begin.
00:07:15 So as usual, I have prepared an amazing post where you will find all of the necessary information,
00:07:22 the zip file, instructions. Slowly scroll down. Download the latest zip file from here. Also,
00:07:30 it is in the attachments section. Do not start installation right away. Keep scrolling down.
00:07:36 I recommend that you read everything. Find the Qwen image tutorial video instructions, this section,
00:07:43 and we will follow from here. The very first thing that you need to do is follow the requirements
00:07:50 tutorial. This is mandatory and super important. When you open this tutorial, you will get to this
00:07:56 video. This video shows you everything about the requirements. What are they? Python, CUDA, FFmpeg,
00:08:04 and other stuff. So please follow this tutorial with its updated instructions. You see the link
00:08:10 is here. This is a fully public tutorial. All of the links are updated. You see it was
00:08:17 fully updated on 3 September 2025. After you watch this tutorial and apply the steps here, you will
00:08:24 be ready to run all of the AI applications that I develop or that other developers develop.
00:08:30 After you have followed the requirements tutorial, return back to our main post and
00:08:35 now we will start installation. So move the downloaded zip file into the disk where you
00:08:41 want to install. I am going to install into my Q drive. I will right-click and I will extract
00:08:47 here. You can use your Windows extractor. After extraction, enter inside the extracted folder,
00:08:53 do not forget that. Then all you need to do is just double-click
00:08:57 windows_install_and_update.bat file and run. Do not run anything as administrator,
00:09:04 it will break it. So run everything with double-click or select and hit enter. You
00:09:09 see that it will generate a virtual environment and install all the libraries inside it. So this
00:09:15 will not affect anything else on your computer. All of my applications are installed into secure
00:09:23 and isolated virtual environment folders. Just wait for the installation to be completed. Okay,
00:09:28 so the installation has been completed. You can scroll up and see if there are any errors. If
00:09:33 there are any errors, select everything like this, control C, save into a text file and message me
00:09:41 the text file. You can message me from email, from Patreon, from Discord, anywhere. Then close this.
00:09:47 Now we need to download Qwen training models. To download the models, double-click
00:09:52 windows_download_training_models, run. It will install necessary requirements, then it will ask
00:09:58 you which model you want to download. So you can download Qwen base model or you can download Qwen
00:10:04 image edit plus model. I will download both of them because I will show you both of them.
00:10:09 So let's download the option one. The option one will download these following models. The option
00:10:14 two will download the newer model. It will not download twice. They will be downloaded into
00:10:20 training/models/Qwen folder. You will see here. As you have noticed, there are 16 parts because
00:10:28 this downloader is extremely robust. It downloads with 16 different simultaneous connections, so it
00:10:37 utilizes your entire internet speed. Moreover, it is fully resumable and it is fully robust.
00:10:44 For example, I can close this, run the downloader again. Okay, let's run it. Then I will select the
00:10:49 option one again, and it will fully resume wherever it is left. You see it is resuming
00:10:55 back from wherever it is left. As you can see, it is downloading with 1 gigabit per second on my
00:11:01 personal computer. This is an amazing speed. This is maximum speed that my internet connection has.
00:11:08 Once the model is fully downloaded, it will merge the split parts into a single file, then it will
00:11:16 verify its hash value to ensure that it has been downloaded accurately. We will see that in a moment.
00:11:24 Yes, it is merged, then it is verifying the hash value so that your downloaded models will never
00:11:31 be corrupted or have any issues. Then it will move to the next download like this. And it is moving
00:11:38 to the next download. When you next time resume or start the downloader, it will just skip the
00:11:43 already downloaded files and start with the next file. This is a downloader that I have developed,
00:11:49 and I am using it in all my applications. So it is always very fast,
00:11:55 robust, and accurate. This downloader works with slow internet connections and also with
00:12:00 very high internet connections. This is the best downloader you will ever find.
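For readers who want to picture what a resumable multi-part download with hash verification looks like, here is a minimal Python sketch of the same idea. This is not the app's actual downloader; the URL, part count, and expected hash are placeholders.

```python
import hashlib, os, requests
from concurrent.futures import ThreadPoolExecutor

URL = "https://example.com/model.safetensors"   # placeholder URL
EXPECTED_SHA256 = "PUT_EXPECTED_HASH_HERE"      # placeholder hash
PARTS = 16                                      # simultaneous connections

def download_part(index, start, end, out):
    part_path = f"{out}.part{index}"
    have = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    if have >= end - start + 1:
        return part_path                        # part already complete -> resume skips it
    headers = {"Range": f"bytes={start + have}-{end}"}
    with requests.get(URL, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(part_path, "ab") as f:        # append so partial parts resume
            for chunk in r.iter_content(1 << 20):
                f.write(chunk)
    return part_path

def download(out="model.safetensors"):
    size = int(requests.head(URL, allow_redirects=True).headers["Content-Length"])
    step = size // PARTS
    ranges = [(i, i * step, size - 1 if i == PARTS - 1 else (i + 1) * step - 1)
              for i in range(PARTS)]
    with ThreadPoolExecutor(PARTS) as pool:     # 16 ranged requests in parallel
        parts = list(pool.map(lambda a: download_part(*a, out), ranges))
    sha = hashlib.sha256()
    with open(out, "wb") as merged:             # merge parts and hash while merging
        for p in parts:
            with open(p, "rb") as f:
                while chunk := f.read(1 << 20):
                    merged.write(chunk)
                    sha.update(chunk)
    assert sha.hexdigest() == EXPECTED_SHA256, "hash mismatch - corrupted download"
```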
00:12:05 Once the first downloads have been completed, start the windows downloader again and download
00:12:10 the Qwen image edit model as well if you want it. And at the end of the downloads, you will
00:12:16 see that all the files have been downloaded like this. If you already have the files, you can also
00:12:22 move them or you can also use them. However, I recommend using the windows downloader to
00:12:28 download the accurate versions of the models. You see, these are the models that we are going to
00:12:33 use. The BF16 versions of the models are mandatory. FP8 or GGUF versions will not work.
00:12:41 Then we are ready to start the application. Moreover, if you want to update the application
00:12:46 before starting, let's say you are going to use it afterwards, just double-click and start the
00:12:52 windows_install_and_update file again and it will update it to latest version. So let's start the
00:12:57 application, windows_start_app.bat file, run. It will automatically open the interface like
00:13:02 this. Always watch the CMD window as well, to see whether there are any errors and what is happening.
00:13:09 So this is our application interface. I will explain everything, don't worry. First of all,
00:13:16 begin with selecting your preset. This is super important. So make sure that you are in the Qwen
00:13:21 image training tab. We also support Wan model training, and hopefully that will be the next
00:13:26 tutorial after this. I am going to work on that as well. So make sure you are at this tab. Also,
00:13:32 whenever you are going to load a new config, refresh the page and then load it. So for loading
00:13:37 the config, click this folder icon, go back to your installation folder. This is where I have
00:13:43 installed. Enter inside Qwen-Training-Configs and from here you are going to choose whatever
00:13:49 you want to train. I am going to show LoRA training first, then I will show DreamBooth,
00:13:54 but both of them are exactly the same. So let's enter inside LoRA training, and based on your
00:14:00 GPU or how long you can wait, select the epochs. So 200 epochs is the best quality, 100 epochs is
00:14:10 a little bit lower quality, and 50 epochs is lower quality still. Why? Because with more epochs,
00:14:16 we are actually using a lower learning rate and we are doing more steps. Therefore, we are able
00:14:22 to train more details. So more epochs with a lower learning rate is better. And now you will see
00:14:28 the tier 1 and tier 2 and tier 3, 4, 5, 6 configs. You may be wondering what are the differences. To
00:14:37 learn the differences, enter inside the folder and you will see LoRA_Configs_Explanation.jpg files,
00:14:43 and when you open one, it will tell you what each configuration is and what
00:14:48 their differences are. So based on your GPU, you are going to select the configuration. Therefore,
00:14:54 I am going to use 200 epoch, and I'm going to use tier 2 30,000 megabyte toml file. Double-click on
00:15:01 the toml file, it will open the file from here, then click this icon to load it, and you see it is
00:15:08 saying configuration loaded successfully. Why did I pick this configuration file? Type CMD and open
00:15:16 a CMD window, then type nvidia-smi. This will show you your GPU list like this. So I have RTX 5090,
00:15:25 it has 32 GB of VRAM, but how much free VRAM I have matters. So to learn that, open a CMD window,
00:15:34 type pip install nvitop like this, it will install the nvitop very quickly, then type nvitop. And it
00:15:43 will show your GPUs' VRAM usages. Currently, I am using 3.5 GB of VRAM on my GPU. But I need
00:15:52 30 GB of free VRAM for this configuration. Don't worry, I will show what you can do. Therefore,
00:15:59 I should restart my PC and minimize my VRAM usage. Moreover, you can open Task Manager,
00:16:07 go to Startup apps, and in here you can disable all the starting apps except the necessary ones,
00:16:14 and after that restart, it will minimize your VRAM usage as well. So I should get
00:16:19 this VRAM usage under 2 GB before I start training. Okay, let's continue.
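If you would rather check free VRAM from a script than keep nvitop open, a small sketch using nvidia-smi's query flags (not part of the trainer app) looks like this:

```python
import subprocess

def gpu_memory_mb(gpu_index=0):
    # Query used/free memory per GPU via nvidia-smi (values are in MiB).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.free",
         "--format=csv,noheader,nounits"], text=True)
    used, free = map(int, out.splitlines()[gpu_index].split(","))
    return used, free

used, free = gpu_memory_mb(0)
print(f"GPU 0: {used} MiB used, {free} MiB free")
if used > 2048:
    print("Close background apps / restart before starting a 30 GB-tier training.")
```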
00:16:25 You can click this open all panels and it will open all of the panels or you can just
00:16:30 hide all the panels. So let's begin with first option, accelerate launch settings. This option
00:16:35 is extremely useful when you do multiple GPU training, but if you don't have multiple GPUs,
00:16:41 you don't need to set anything here. Multiple GPU training is not working very well on Windows.
00:16:47 Hopefully, I will show that on cloud tutorial on MassedCompute and RunPod. But if you have
00:16:52 multiple GPUs like me, you see I have two GPUs, I can set my GPU ID to 1 and the training will run
00:16:59 on my second GPU. However, I'm not going to use my second GPU, I'm going to use my first GPU. Okay,
00:17:06 the second tab, click it. Now, this is super important. Where you are going to save your
00:17:12 checkpoints. So click this folder icon or you can directly copy paste the folder path here.
00:17:17 I will show directly copy paste. I am going to save my models inside my SwarmUI installation,
00:17:24 inside models, inside diffusion_models, inside lora, because this is going to be a LoRA
00:17:31 training. So copy this path and paste it. Or now I will show with select folder. Click this icon,
00:17:38 find wherever you want to save. Okay, let's go to SwarmUI installation, inside models,
00:17:44 inside lora folder, then click select folder. And it will select the folder. Both ways work. Then, how
00:17:50 frequently do you want to save? Each saved LoRA checkpoint will be 2.3 GB. Currently, this setup is
00:17:58 saving eight different checkpoints. How? Because, as you see, it is going to save every N epochs. So
00:18:05 after every 25 epochs, it will save a checkpoint. And you may be wondering what an epoch is. One epoch
00:18:12 means that all of your images are trained one time. I will explain that as we progress. So you
00:18:19 can keep this as a 25 epoch or you can reduce this number to get more frequent checkpoints,
00:18:25 or you can make it higher to get less frequent checkpoints. 25 is decent because after the
00:18:31 training, we will compare the checkpoints and see which one is the best.
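As a quick sanity check, the checkpoint math behind those numbers is simply:

```python
max_epochs = 200      # from the 200-epoch preset
save_every = 25       # "save every N epochs"
lora_size_gb = 2.3    # approximate size of one saved LoRA checkpoint

checkpoints = max_epochs // save_every
print(checkpoints, "checkpoints,", round(checkpoints * lora_size_gb, 1), "GB of disk space")
# -> 8 checkpoints, 18.4 GB of disk space
```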
00:18:37 A checkpoint is a snapshot of the model at that moment. The output name
00:18:45 is the name with which your LoRA files will be saved. So I am going to name my LoRAs
00:18:53 like this: Qwen-Image-Lora-Tutorial. Okay, you don't need to change anything else in here. These
00:18:59 are all set. Then you can move to the next part, but before moving that, I recommend you to save
00:19:05 your configuration to be able to load it later. Where should we save? You can save this right
00:19:12 away from here. It will overwrite the base config, or the better way is, for example, let's save it
00:19:18 inside here, like this. So I am going to save it into this folder, and click save. Actually,
00:19:25 let's save it into our new folder to not have any issues. Inside here, and save. Okay, then
00:19:33 click save. Yes, it's saved. Don't forget to click save to save. You see it shows that configuration
00:19:39 saved. It is inside my new installation folder, and I can see that tier 2 30,000 megabyte toml.
00:19:45 Let's move with Qwen image training data set. Now, the dataset part is extremely
00:19:52 important. Pay attention to this part. If you are first time going to make a training,
00:19:57 preparation of the dataset matters hugely. You need to have your images accurately prepared. To
00:20:04 automatically prepare your images, I recommend to use Ultimate Batch Image Processing app. You see
00:20:10 it is under accelerate tools section. So let's go to this link. I recommend you to check out these
00:20:17 screenshots, read this post. Let's scroll down and let's download the latest version. Then let's
00:20:23 move it into our Q drive, right-click, extract here, enter inside it. First of all, we need to
00:20:29 install. This is a pretty fast installation. This application is very lightweight, but it
00:20:35 has so many features. Okay, the installation has been completed. Scroll up to see if there are any
00:20:41 errors or not, then close this. Then let's start the application. windows_start_application,
00:20:46 run. Why is this application important? Because it will allow you to batch preprocess your training
00:20:53 images. You can of course manually preprocess your images, but this makes it much easier and
00:21:00 more accurate. So I have some sample images to demonstrate the power of this tool. I
00:21:06 will copy this path and enter as an input folder. Then as an output folder, let's output them into
00:21:14 my other folder as Pre-process Stage 1. Then the aspect ratio. If you are going to generate images
00:21:23 with 16x9 always, you can make your aspect ratio accordingly. However, if you are not sure which
00:21:31 aspect ratio you are going to use, I recommend you to use square aspect ratio with 1328 to 1328
00:21:39 pixels. This is the base resolution of the Qwen image model or Qwen image edit model. This works
00:21:45 best and with this aspect ratio and resolution, you can still generate any aspect ratio. All the
00:21:51 images I have shown you in the beginning of the tutorial were trained with 1328 to 1328.
00:21:58 Then there are several options. You can select the classes from here to zoom them in. This is
00:22:04 extremely useful when you are training a person, because you want to zoom in on the person. What
00:22:10 do I mean by that? You see, in these images there is a lot of extra space that can be zoomed in.
00:22:18 For example, in this image, I can zoom in myself a lot. So you can choose this or there is a better
00:22:25 one which is based on SAM2. This takes anything as a prompt. Let's say person. You can set your
00:22:32 batch size, GPU IDs, these are all advanced stuff if you are going to process a lot of images. So
00:22:39 default is good. Let's start processing. What this is going to do is it is going to zoom in the class
00:22:46 I have given without cropping any part of the class. So this will not make these images exactly
00:22:52 as this resolution or this aspect ratio. It will try to match this aspect ratio without cropping
00:22:59 any part of the subject. So let's see what kind of images we are getting. We are saving them inside
00:23:04 here. You see it has generated this subfolder. This is important because in the second stage,
00:23:11 we are going to use this to make them exactly same resolution. When I enter inside this folder,
00:23:19 you can see that it has zoomed in the person. So this is how it works. And when it is zooming in,
00:23:25 it will not crop any parts of the image. And also when zooming in, it will try to match the aspect
00:23:32 ratio that you have given like this. Okay, the first stage has been completed. Now the second
00:23:37 stage is resizing them into the exact resolution. This will crop the subject if it is necessary,
00:23:44 like cropping the body parts to match the exact resolution. So this takes the parent folder,
00:23:50 not this folder. This is not the folder, but this is the folder that I need to give. And I need to
00:23:56 change the resolution that I want. So it will look for a subfolder named exactly like this.
00:24:02 You can have multiple resolutions actually. For example, in the image cropper, I can add here
00:24:07 another resolution. Let's say 16:9. So this is the resolution of 16:9 for Qwen image model. Let's add
00:24:14 it like 1744 to 992. Let's start processing. It will process this new resolution as well.
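To make the two stages concrete, here is an illustrative Python/PIL sketch of the same logic, assuming you already have a subject bounding box from a detector such as YOLO or SAM2. It is not the Ultimate Batch Image Processing tool's actual code, and the box coordinates below are placeholders.

```python
from PIL import Image

def expand_box_to_aspect(box, img_w, img_h, target_w=1328, target_h=1328):
    # Stage 1 idea: grow the subject box toward the target aspect ratio
    # without ever cutting into the subject, clamped to the image borders
    # (so some images cannot reach the target ratio exactly).
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0, y1 - y0
    target_ratio = target_w / target_h
    if bw / bh < target_ratio:              # box too narrow -> widen it
        bw = min(img_w, bh * target_ratio)
    else:                                   # box too flat -> make it taller
        bh = min(img_h, bw / target_ratio)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    nx0 = max(0, min(cx - bw / 2, img_w - bw))
    ny0 = max(0, min(cy - bh / 2, img_h - bh))
    return int(nx0), int(ny0), int(nx0 + bw), int(ny0 + bh)

def resize_exact(img, target_w=1328, target_h=1328):
    # Stage 2 idea: scale so the short side fits, then center-crop to the
    # exact training resolution (this is where body parts may get cropped).
    scale = max(target_w / img.width, target_h / img.height)
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    left = (img.width - target_w) // 2
    top = (img.height - target_h) // 2
    return img.crop((left, top, left + target_w, top + target_h))

img = Image.open("photo.jpg")
subject_box = (400, 120, 900, 1100)         # placeholder detector output (x0, y0, x1, y1)
stage1 = img.crop(expand_box_to_aspect(subject_box, img.width, img.height))
stage2 = resize_exact(stage1)
stage2.save("final_1328x1328.jpg")
```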
00:24:23 And I am going to see a folder generated here in a minute when it is processed. Okay, it is started
00:24:30 processing. Now it will try to match this aspect ratio. It may not match it exactly. Why? Because
00:24:36 it is not going to crop any body parts. So you see this image cannot match that aspect ratio. This is
00:24:43 not a suitable image for that. This is almost still square. However, in the second tab, when
00:24:48 I go to image resizer, when I type it, you see I have given the parent folder. Let's wait for this
00:24:55 one to finish. Okay, it is almost finished. By the way, if you use this YOLO, it is faster than SAM2.
00:25:02 So just delete this and select your class from here. It supports so many classes to focus on
00:25:08 them. Okay, it is done. Now, I am going to make the output folder as final images, like this,
00:25:15 and I will click resize images. You can also resize without cropping, so it will pad and expand
00:25:21 instead. So let's resize the images. I recommend cropping, it is better. Then let's go back to
00:25:28 our folder, final images. Okay. In here, you will see that it has cropped the body parts, resized
00:25:35 it into the exact resolution like this. And these are the square images. They are much more accurate
00:25:42 than the other ones. Now I have my images ready. However, this is not a very good collection of
00:25:49 images. It is another thing that you need to be careful of. I have used these images to train
00:25:55 the models that I have shown you in the beginning of the tutorial. So when we analyze these images,
00:26:01 what do you see? I have full body pose like this. I have half body pose. I have very close shot.
00:26:08 And when you have images, what matters is that they should have good lighting and good focus. These two
00:26:16 are extremely important. It should be very clear. All of these images are captured with my cheap
00:26:22 phone, so they are not taken with a professional camera. For example, when we look at this image,
00:26:28 you see it is not even a very good quality. Also, these are some old images. I didn't update my
00:26:33 dataset yet, but using medium quality images, and I am showing you how much you can obtain with a
00:26:40 medium quality. If you use a higher quality, then you will get even better results than I did get.
00:26:46 Why are these images medium quality? I mean, let me show you this image. You see, this image is not
00:26:52 even very high quality. This is how it looks. And this is a real image, a raw image.
00:26:58 And when we look at the AI generated image, as you can see, it is even higher quality than my
00:27:04 raw image. And therefore, you should add highest possible quality images into your training dataset
00:27:12 to get the maximum quality images. What else is important? You should try to have different
00:27:19 clothing, so it will not memorize your clothing. This is super important. Try to have different
00:27:24 clothing, different times, different backgrounds; all of these will help. Whatever you repeat in
00:27:30 your training dataset, the model will memorize them. You don't want that. You want only yourself
00:27:37 or the subject if you are training a style, the style or an object, the object to be repeated,
00:27:43 nothing else. I will explain them in the style and the item training, the product training part.
00:27:49 And one another thing is that you should add the emotions that you want. If you want smiling, you
00:27:55 should add it. If you want laughing, you should add it. So whatever the emotion you have will make
00:28:02 100% quality difference in your outputs. Try to have all the emotions you want. But this is not
00:28:10 all. Also, try to have all the angles you want. If you want to generate images that looks down,
00:28:17 you should have an image that has a look down like this, or from this angle, this angle,
00:28:23 whatever angle. So do not add the angles and poses that you don't want to see after training, and
00:28:30 add the poses and the angles you want to generate after training. So if we summarize again, have the
00:28:38 emotions, have the poses, have the angles, have different backgrounds, have different clothing,
00:28:45 have the highest possible quality, lighting, and focus. Do not have blurry backgrounds,
00:28:52 do not have fuzzy backgrounds, they will impact your output quality. So in the AI world, whatever
00:28:58 you give, you get it. And with this medium quality dataset, I am able to generate amazing images.
00:29:04 If I increase the number of images, the variety in these images, I can get even better quality.
00:29:10 Another extremely useful tab we have is Image Pre-processing. The aim of this tab is to let
00:29:17 you see the exact version of your training image dataset that is used during training. This tab is extremely
00:29:25 useful especially if you want to do training with bucketing, with multiple aspect ratio resolutions.
00:29:31 So let's say I have a dataset like this and I want to do training with multiple aspect
00:29:36 ratios. Remember, for multiple aspect ratios in the Qwen image training dataset, you have
00:29:41 to enable bucketing. If you want to find the parameter fast, open all panels, control F, type
00:29:48 the name like bucket, and you can find it very easily. So let's say you have enabled bucketing,
00:29:54 and you are going to process your images to see the final version that the Kohya SS
00:30:00 GUI tuner produces when it processes them. So put your input images folder here, define an output like this one, sub,
00:30:07 and enable bucketing, then from the architecture, select the architecture. This matters because
00:30:13 based on this, the Kohya does bucketing. So I'm going to select Qwen image. You can
00:30:19 also make fix exif orientation. Currently, it is broken. If your image has an orientation problem,
00:30:24 the Kohya won't fix it. So let's process images, and it is processed, it shows how many processed,
00:30:30 the resolutions, the buckets. Now when I open this subfolder where I have processed them,
00:30:36 this is how Kohya is going to use my images. You see, these images have inaccurate orientation, so
00:30:43 the training won't be proper. Furthermore, some of the images have padding. Let me show
00:30:49 you one of them. Okay, I couldn't find any example, but in some images, you may see them,
00:30:55 they have padding like this to fit into the correct bucket. This is how you can preprocess your images
00:31:03 and see the bucket distribution. This is using the Kohya implementation itself, so this is 100%
00:31:10 accurate. This is extremely useful. You can also change your target resolution to see how they are
00:31:15 processed during the training and you can see the actual images. Another feature we have
00:31:21 is in the caching. In the caching section, you can enable debug mode. If you enable debug mode,
00:31:28 it will show you each image. However, it won't work. This is just for debugging to see. So
00:31:34 you can also enable debug mode image, and when you run the training this way, it will show you
00:31:40 every image one by one. Let me demonstrate with this one. So it will pop up the image and
00:31:45 you will see each processed image in your training dataset. We had only one, so we have seen only one
00:31:52 image from here. So you can also use this debug mode. It has console, video, image to see how
00:31:58 they are actually used during the training. This can be extremely useful to understand how they
00:32:03 were actually trained. I really recommend you to use this image pre-processing. You can also
00:32:09 fix exif orientation and use the pre-processed dataset as your final dataset. So this screen
00:32:16 is extremely important for understanding how your image dataset is composed.
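For intuition, bucketing roughly means grouping images into a small set of resolutions that keep about the same pixel area as the target while matching each image's aspect ratio. A generic sketch of that idea (not Kohya's exact bucket list or rounding rules):

```python
def make_buckets(target=1328, step=32, max_ratio=2.0):
    # Candidate (width, height) pairs near the target pixel area,
    # with both sides a multiple of `step`.
    area = target * target
    buckets = []
    w = step
    while w <= target * max_ratio:
        h = int(area / w) // step * step
        if h >= step and max(w / h, h / w) <= max_ratio:
            buckets.append((w, h))
        w += step
    return buckets

def assign_bucket(width, height, buckets):
    # Pick the bucket whose aspect ratio is closest to the image's.
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))

buckets = make_buckets()
print(assign_bucket(4032, 3024, buckets))   # a landscape phone photo
print(assign_bucket(1080, 1920, buckets))   # a portrait screenshot
```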
00:32:21 Okay, now we have our images ready. How are we going to structure them? I am going to
00:32:26 create a folder here and I will call it training_images_dataset. And I am not going to
00:32:33 put all the images directly inside it. I am going to make a subfolder, this is mandatory, named 1,
00:32:39 and I am going to use ohwx. Then I will paste all the images inside it. This 1 is the
00:32:47 number of repeats. Repeats means how many times these images will be repeated in every
00:32:52 epoch. You don't need to try to understand this. The repeat count is important when you
00:32:58 have different subsets of images, and when you are training a single concept, a single subject,
00:33:04 you don't need different subsets of images. It is used to balance unbalanced datasets. And with
00:33:12 Qwen or with Flux or Wan, we are only able to train a single subject at a time at the moment.
00:33:19 So currently, we set all repeats to 1. However, in the future, if we are able to train multiple concepts,
00:33:26 multiple persons, subjects, or styles at the same time, then to balance between different datasets
00:33:33 we can use different repeat counts. What do I mean by that? Let me show you. For example,
00:33:38 the other folder is BBK. And this folder has only half amount of images. So let's delete this,
00:33:46 delete this. Yes. So you see this folder has 14 images, the other folder has 28 images. So in
00:33:55 every epoch, this folder's images will be repeated two times. So each image will be trained twice,
00:34:02 and each image in this folder will be trained once. This is the logic used to balance
00:34:08 unbalanced datasets during training, but we don't need it right now. Just set it to 1. And you see
00:34:16 this is ohwx. Why? Because I am going to generate captions with just ohwx. I'm not going to write
00:34:24 detailed captions, and I will explain why. So copy this path or from here, click this icon and select
00:34:31 the training_images_dataset folder and select folder. So make sure to select the parent folder,
00:34:39 not the subfolder, because the trainer will look for the subfolder like this. Then set your resolution
00:34:45 width and height. It trains best with this one, but if you want to train with a different resolution,
00:34:51 with a different aspect ratio, you can set it. The batch size is 1, this is the best quality. I don't
00:34:57 recommend higher batch sizes. It is only necessary when you need speed or when you are going to do a
00:35:03 massive training, but when you are training a person or a subject, go with batch size 1,
00:35:09 it is the best quality. Also, learning rates are set for batch size 1. When you increase the
00:35:14 batch size, you need to set a new learning rate. Create missing captions. Currently,
00:35:19 I don't have any captions in my folder, so they will be created. It is going to use the folder
00:35:24 name as a captioning strategy. Then there is control directory, I will explain that in the
00:35:30 Qwen image edit model training part. You don't need to set anything else in here. All you need to
00:35:36 do is generate dataset configuration, and it will generate the dataset configuration automatically.
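For reference, the generated dataset configuration is a small TOML file. The sketch below shows roughly what such a file captures (resolution, caption extension, repeats, image directory); the key names and paths here are only illustrative of a Musubi Tuner-style dataset config, so always check the file the app actually writes.

```python
# Illustrative only: key names and paths are approximations, not the app's exact output.
dataset_config = """\
[general]
resolution = [1328, 1328]
caption_extension = ".txt"
batch_size = 1
enable_bucket = false

[[datasets]]
image_directory = "Q:/training_images_dataset/1_ohwx"
num_repeats = 1
"""

with open("dataset_config_example.toml", "w", encoding="utf-8") as f:
    f.write(dataset_config)
```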
00:35:43 This is formatted for the Kohya. You can open this file and see what kind of dataset it has
00:35:51 generated. This is the config we are going to give to Kohya automatically. And when I return
00:35:56 back inside my folder, you will see that it has generated caption files with the same name as my
00:36:04 images. I recommend training with only ohwx as the trigger word and not using detailed captions,
00:36:11 because they reduce the accuracy of the training. You need detailed captions when you are doing a
00:36:16 very big training, like thousands of images, or when you are training multiple concepts, which doesn't
00:36:23 work right now; they bleed into each other. But if you insist on using captions, we have image captioning
00:36:30 here. This is using the Qwen 2.5 VL, which is the text encoder used by the model itself. So how does
00:36:39 it work? First, you need to select the model path. Click this icon, go back to downloaded models,
00:36:45 which is here, select this one, okay. You can use FP8 precision if you have a GPU lower than 24 GB,
00:36:55 but I have more than that. Then you can drag and drop any image file here. For example, let's see what
00:37:02 kind of captions it generates for this. By the way, don't forget to close your Ultimate Image
00:37:07 Processing CMD window after it is done. Okay, you see it has generated this caption. So I can use
00:37:13 this, I can modify this. Let's try another one with our training images. For example, let's use
00:37:19 this image and generate caption. Okay, so this is another caption. You can give custom command to
00:37:26 it. For example, this is a default prompt it takes, you can modify this. Or you can batch
00:37:31 process with caption prefix or caption suffix. It supports everything. You can also replace words
00:37:38 like it generates with a individual. You can make this as a cheerful ohwx, or it may generate with
00:37:45 a man word. So you can replace man with ohwx man, person with ohwx person. This supports everything
00:37:53 as a captioning. This is a really powerful captioner. Alternatively, you can use Joy Caption
00:37:59 application we have as well. It is here, you see this link. So you can install Joy Caption and use
00:38:04 it to generate captions as well. This is also one of the most famous image
00:38:09 captioning models. It is also amazing. So this is captioning. Let me also demonstrate batch
00:38:15 captioning. So let's delete the existing captions, like this. Select this folder. I'm not going to
00:38:22 give output folder so they will be automatically saved there. We can also replace words like man,
00:38:27 ohwx man, it will replace the man word with it. You can also add caption prefix like ohwx,
00:38:34 it supports everything. You can also auto-unload, this is important, so it won't take your VRAM
00:38:40 space. And then we just need to click start batch captioning. It supports copy images, scan folders,
00:38:47 overwrite existing captions, or output format as a JSON. Also, there are some other parameters you
00:38:53 can play here to see which one is working best for your captioning. It supports everything.
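The captioning strategy described here, one trigger-word caption per image, or optionally cleaning up detailed captions with a prefix and word replacements, is easy to reproduce yourself. A small sketch with placeholder paths and replacement pairs:

```python
import re
from pathlib import Path

DATASET = Path("Q:/training_images_dataset/1_ohwx")     # placeholder path
TRIGGER = "ohwx"
REPLACEMENTS = {"man": "ohwx man", "person": "ohwx person"}

for img in list(DATASET.glob("*.jpg")) + list(DATASET.glob("*.png")):
    txt = img.with_suffix(".txt")
    if not txt.exists():
        # Strategy recommended in the video: just the trigger word.
        txt.write_text(TRIGGER, encoding="utf-8")
        continue
    # Optional: clean up an existing detailed caption instead.
    caption = txt.read_text(encoding="utf-8")
    for old, new in REPLACEMENTS.items():
        caption = re.sub(rf"\b{re.escape(old)}\b", new, caption)  # whole words only
    if TRIGGER not in caption:
        caption = f"{TRIGGER}, {caption}"                         # trigger-word prefix
    txt.write_text(caption, encoding="utf-8")
```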
00:38:58 You can follow the progress from the CMD window. So it is currently generating captions, 10 of 28. It
00:39:06 is pretty fast. And we can see the captions are getting generated here. When we open the caption,
00:39:11 you see it added this, also replaced man with ohwx man. So it supports everything. However,
00:39:17 I recommend to have only ohwx as a caption. I compared it with different captioning strategies,
00:39:26 detailed caption or ultra detailed caption, and just the trigger word, ohwx is working best. You
00:39:33 can use any trigger word. And the logic of the trigger word is a very random keyword. So it
00:39:39 should be random. It should be a rare word, and it should be a single word. Use something like
00:39:44 that as a trigger word, and that's it. Okay, so the captioning has been completed, but I
00:39:49 will return back to my dataset preparation and I will delete all these generated captions, and I
00:39:56 will click the generate dataset configuration and I will save my config and I will proceed.
00:40:03 And the next section is Qwen Image Model Settings. Do not change LoRA to DreamBooth or DreamBooth to
00:40:09 LoRA because the configurations are automatically set properly. Always use the base configuration
00:40:15 from the configs folder. So here, I'm not going to make any changes. However, if you want to use Qwen
00:40:23 image edit model, which I will show after training started as a next step, you can enable this, but
00:40:28 currently we don't need it. You can train on Qwen image base model. Okay, the next thing that you
00:40:34 need to set is the base model checkpoint. So click this, go back to your training models downloaded
00:40:40 folder, select the model. So this is the base model, you see. Then you need to set the VAE.
00:40:46 Click this, select the VAE, this one. Then select the text encoder, and it is this one. So we did
00:40:54 set the paths accurately. Don't change anything else. Don't change any of these unless you run out
00:41:03 of VRAM, which can happen if you are using too much VRAM. So, since I am already using like 6 GB
00:41:10 of VRAM, I can make this like 25. I recommend you to try to reduce this maybe like 1 or maybe like
00:41:18 2 and see your speed. If you are getting very slow speeds, try to increase it slowly. So this depends
00:41:26 on your computer. I am trying to set these as accurately as possible. Probably you shouldn't change this
00:41:32 at all, but if you get extremely slow speeds, that means that it is using shared VRAM. Therefore,
00:41:39 increase the block swap. Block swap means that it is going to use your RAM memory for swapping and
00:41:46 try to fit the trained part of the model into your GPU. Since I'm using more VRAM than recommended,
00:41:54 let's make this like 30. My training speed will get slower, or maybe like 25,
00:41:59 we can see. Don't change any other settings. And the next thing that you need to change is inside
00:42:06 training settings. What can you change here? You can change the maximum number of epochs. People
00:42:12 are asking me how many epochs they should do. If you have below 50 training images, or even 100,
00:42:19 but it depends on how long you can wait for training to finish, use 200 epochs. Then compare each
00:42:27 checkpoint and see which one is generating the best results. But let's say you have 100 images;
00:42:32 then you can reduce this to around 150. Let's say you have 200 images; then you can reduce
00:42:38 this to around 100. However, 200 epochs is really good below 50 training images. And as you have
00:42:46 more training images with the highest quality and with variety, like different backgrounds, clothing,
00:42:52 angles, and poses, the quality gets better. So try to increase the number of training images you have
00:42:58 while keeping the quality, and then you can reduce the training epochs.
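The epoch guidance above can be summarized as a tiny helper; the numbers are just the rule of thumb from this video, not hard limits:

```python
def suggested_epochs(num_images):
    # Rule of thumb from the video, not a hard rule: ~200 epochs below 50
    # images, ~150 around 100 images, ~100 around 200 images.
    if num_images < 50:
        return 200
    if num_images <= 100:
        return 150
    return 100

for n in (28, 100, 200):
    print(f"{n} images -> about {suggested_epochs(n)} epochs")
```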
00:43:04 As I said, it depends on your GPU, how much you can wait, what is your computer, your GPU speed,
00:43:10 but 200 epoch is recommended if you have below 50 images. So I will leave it as a 200 epoch. Don't
00:43:18 change anything else in here. You can generate samples during the training, but I don't recommend
00:43:24 it. It will slow down your training significantly. Generate samples, the comparison after training,
00:43:30 which I will show. And in the advanced settings, you can provide the extra parameters that you
00:43:37 might have. Currently, we don't need any extra parameters, and we are all set. Now I will save my
00:43:45 configuration and I will click start training. First, it will generate cache files for my
00:43:53 training images, so it will first load the Qwen VL model, the text encoder, it will generate encoded
00:44:00 caches, you can see the progress here, then it will deload model and start the training. Okay,
00:44:05 it is going to load the model. I'm using a lot of VRAM right now. You should restart your PC,
00:44:12 minimize your VRAM usage. And this loading speed totally depends on your hard drive speed and also
00:44:19 your CPU speed, because we are currently converting the model into FP8
00:44:26 scaled on the fly while loading it. Why are we doing that? Because currently on Windows, as you use more block swapping,
00:44:33 it is way slower compared to Linux. Kohya is aware of this and he is working on it. Let me
00:44:41 show you. So you see he's trying to eliminate the speed difference between Linux and the Windows
00:44:49 based on this issue. Let me also show you the issue that I have generated after doing a lot
00:44:54 of tests and experimentation. Currently, because of the Windows system, it takes three times
00:45:03 longer to swap between RAM and GPU. And as we use more block swapping, it becomes slower
00:45:11 than Linux. And if we don't use FP8 scaled, it becomes even slower, because the model takes twice the
00:45:18 amount of RAM or VRAM. So the model takes twice the space on our system. And you will see
00:45:26 that the training has started. You should try to get maximum amount of watt usage. Currently,
00:45:32 it is lower than what I expect, so I might be using some shared VRAM. So I may reduce block
00:45:40 swap and compare again. Furthermore, you should wait more steps because as you do more steps,
00:45:45 it will get faster. So wait until like 100 steps to see the duration that is going to take. If you
00:45:53 say that it is too long for you, what you need to do is select a faster configuration from
00:46:00 the configs. What do I mean by that? Select the 100 epoch or 50 epoch config. These use higher learning
00:46:08 rates and do fewer steps. Therefore, for example, if I use 50 epochs, it will take one quarter of the
00:46:16 time. So it will be four times faster, and the quality is very similar to the 200 epochs,
00:46:22 but 200 epochs is the best quality. It is up to you whether you want faster training or not;
00:46:28 choose your configuration accordingly. Make sure that you are using a minimal amount of VRAM, do
00:46:35 not do other things while training, and wait for the training to be finished.
00:46:41 So can we improve the speed? Yes, as you can see, I am able to push speed further. How? First of
00:46:50 all, if you have dual GPUs, connect your monitors to your weaker GPU. This will have a huge impact
00:46:57 on the idle GPU usage, and that way you can push your block swapping lower. For example,
00:47:04 currently I am just doing seven block swaps on RTX 5090 and I am training highest quality FP8
00:47:12 scaled LoRA model. Furthermore, there is a newer feature we have added. This has been added while
00:47:19 I was editing the tutorial. You will find it as use pinned memory for block swapping. This
00:47:25 is a new feature. It is not merged into the main repository yet. However, when you are watching,
00:47:31 hopefully it will be already merged. You can see the pull request here. I am back and forth
00:47:37 communicating with Kohya to improve the speed on Windows devices. We are figuring out new stuff,
00:47:44 we are trying to make it perfect. Hopefully when you are watching this tutorial,
00:47:48 when you are following this tutorial, it will be implemented and it will be working better
00:47:52 than right now. You should enable this. This will increase the RAM usage, so if you get out of RAM,
00:47:58 out of VRAM errors, then you can disable it. This is using more system RAM, not the GPU RAM,
00:48:06 not the GPU memory. So when I say RAM, it is the system RAM. When I say VRAM,
00:48:11 it is the GPU memory. For this feature to fully work, open graphics, you see graphics settings,
00:48:17 then in here, go to advanced graphics settings, and in here, uncheck this hardware-accelerated
00:48:24 GPU scheduling and restart your PC. This should help you to improve your training speed even
00:48:29 further. And there is one another thing that you can even push your speed further. You can
00:48:35 use MSI Afterburner to increase your GPU clock speed. This should work fairly well because we
00:48:42 are still not using the GPU fully because we are spending a lot of time with the block swapping. So
00:48:47 how much can I increase it? It depends on your GPU, but on an RTX 5090, I can increase
00:48:53 the core speed by 320 and the memory speed by around 1000, and it should work
00:49:00 fairly well. I can just apply. You can see the actual speeds of the core and the memory here
00:49:07 and this should increase your training speed even further. So these are the tricks that we have
00:49:12 right now to improve. And hopefully when this new feature becomes more mature and fully implemented,
00:49:19 it will work way faster on Windows and get close to Linux speeds.
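To picture what "block swapping with pinned memory" means at the PyTorch level: pinned (page-locked) host memory allows asynchronous CPU-to-GPU copies, which is exactly the RAM/VRAM transfer this feature tries to speed up. The sketch below is a generic illustration, not the tuner's actual implementation, and it assumes a CUDA GPU is available.

```python
import torch

# Generic illustration of block swapping: keep the transformer "blocks" in
# system RAM (optionally pinned) and move only the block currently being
# used to the GPU, then move it back out to free VRAM.
blocks = [torch.nn.Linear(4096, 4096) for _ in range(8)]   # stand-in blocks

def pin_block(block):
    # Page-locked host memory enables faster, asynchronous host->GPU copies.
    for p in block.parameters():
        p.data = p.data.pin_memory()

x = torch.randn(1, 4096, device="cuda")
for block in blocks:
    pin_block(block)
    block.to("cuda", non_blocking=True)   # async copy thanks to pinned memory
    x = block(x)                          # run this block on the GPU
    block.to("cpu")                       # swap it back out to free VRAM
print(x.shape)
```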
00:49:25 So I have previously trained with exactly these settings, so let's see how to test them and
00:49:32 then we will proceed. Once the training has finished, you will get exactly this if
00:49:38 you did the setup like me: the checkpoints, the LoRA checkpoints. Now we are ready to use them. So I
00:49:44 am going to use SwarmUI with the ComfyUI backend. If you don't know how to install and use SwarmUI
00:49:51 with the ComfyUI backend, we have an excellent tutorial. You see it is right under the Qwen image
00:49:56 tutorial video instructions. The link is here. You need to watch this to learn how to use it.
00:50:02 Let's open it. So this is a very recent tutorial that I have made like a few days ago. It is like
00:50:08 26 minutes, not very long. Watch it to learn how to install ComfyUI and SwarmUI. You need to
00:50:17 set it up to be able to use like me. So this is a fresh install SwarmUI. First of all, I'm going to
00:50:23 update my SwarmUI. I recommend that and start the SwarmUI after it. Okay, it is going to start. Yes,
00:50:29 it has started. I recommend to get the latest zip file and set the presets. So let's install
00:50:35 the presets. These are all shown in the tutorial. Then let's refresh the presets. Okay, our presets
00:50:41 arrived. The presets are extremely important because I did update presets and I have made
00:50:48 them with the best quality for either stylized generation or realistic generation. So let's sort
00:50:54 by name. Then for realistic generation, I am going to use Qwen-Image-Realism-Tier-2. This is a very
00:51:03 fast one. Direct apply. When you direct apply, you should see that it has selected this LoRA,
00:51:09 this base model. When you watch the tutorial, you will learn how all of these are downloaded,
00:51:16 installed, and set up. I recommend to follow that first. Okay. So then let's actually reset
00:51:22 params to default and then direct apply. Okay, we are all set. The first thing that
00:51:27 you need to do is compare your checkpoints to find out which checkpoint is performing best.
00:51:35 And how did I do that? Go to tools, select grid generator, select prompt. Then in this prompt,
00:51:43 you need to use some prompts. I have pre-made prompts, but you can write your own prompts as
00:51:48 well for comparing. So the prompts are inside Qwen-Training-Tutorial-Prompts,
00:51:54 and you will see all the prompts that I used. I'm going to use the prompts for grid find best
00:52:01 checkpoint prompts myself. Copy it entirely, paste it into here. Now with these prompts,
00:52:08 there is one significant difference. You see that I have written the LoRA name,
00:52:14 the fast LoRA name at the end of each prompt. And each prompt is separated with this character. This
00:52:21 is the format of the SwarmUI. Why do I need to define it here? Because I'm going to compare LoRA
00:52:28 checkpoints and I need this fast LoRA, you see it is also set here, to be able to accurately get my
00:52:36 images with low number of steps. Otherwise, you won't get quality outputs. The next step is I
00:52:43 am going to select LoRAs from here. If your LoRAs don't appear here, go to the LoRAs tab and refresh
00:52:50 so it can see them, or restart. Then, depending on how many epochs you did, you should start from the
00:52:57 half epoch, like 100, and it will be selected, like 125, click and select, like 150, like 175,
00:53:06 the final one is this one. So I'm going to compare these checkpoints and decide which checkpoint I'm
00:53:14 going to use. You see as a base model, I am using Qwen image FP8 scaled model because it uses half
00:53:21 VRAM. This model is huge. If you use BF16, it uses too much RAM memory and VRAM memory. Therefore,
00:53:29 I recommend to use this on your Windows computer. Then set a grid name to your testing, testing
00:53:36 grid, and click generate grid. Then the SwarmUI will use the ComfyUI backend and start generating.
00:53:44 Let's see the first generated image. First of all, it will load the model. You can see from the logs,
00:53:50 debug menu, what is happening. You can also follow the CMD window. This web API is not important or
00:53:57 this error is also not important. You can ignore both of them. Okay, I can see the logs. Yes, it is
00:54:04 starting. We should see the preview around here. You see it says that there are 61 generations,
00:54:11 they are queued. Okay, it is loading. You can also watch what is
00:54:17 happening in the nvitop window. It is loading the model, and it will move the model into VRAM. Okay. So you see the first
00:54:23 thumbnail has started to appear. This will also upscale images by 2x. This adds a huge amount
00:54:32 of quality. However, it will take much more time. If you don't want to wait that long, you can just
00:54:37 disable this and generate your grid that way. It will be way faster. However, if you want the
00:54:44 highest quality comparison, you shouldn't disable it. With this preset, it will do four steps for
00:54:51 the base image generation, then four steps of upscaling. Into which resolution? Into 2536 by
00:55:00 2536, because we are doubling the resolution that we set here. We can see the speed here. These are
00:55:07 the speeds. The upscaling will take about 4x the time. You can see it is about 8 seconds per iteration, but we
00:55:14 are doing only eight steps in total. And this brings the highest quality. Currently, it is probably
00:55:21 testing the first LoRA, which is the 100-epoch one. This will probably be under-trained. Okay, let's see.
00:55:27 Yes, the first image has been generated. I can say that it is under-trained, not there yet.
00:55:33 Then to see the entire grid, I will click this and it will load the entire grid like this. So I
00:55:40 have done this previously. Let me show you that. I will close this running SwarmUI and go back to
00:55:46 my previous installation. Let's start the SwarmUI. Okay, let's go to tools and grid generator. Let's
00:55:53 load the grid config and I have the grid somewhere around here. Yes, LoRA checkpoint test, improved,
00:56:00 load grid config. Then let's open the grid. Okay. So this shows all the tests. I am going to change
00:56:06 how I view it from LoRAs to prompt. So now, you see the first tested LoRA is here, 75 epoch,
00:56:15 and the quality is not great. As I scroll to the right, you see this is 125 epochs. As I scroll further to
00:56:23 the right, this is 175 epochs. It is much better. This is really good quality. This is exactly the
00:56:32 config I used just a moment ago. And this is the final epoch. This is the best one in my opinion.
00:56:38 As I scroll down, I can see the other images. So scroll between each image and decide which
00:56:47 checkpoint is working best for your case. So this is totally subjective. You need to decide which
00:56:53 checkpoint looks best. However, I can see that 75, 100, 125, and even 150 are not very good. They
00:57:02 are under-trained. And I can see that it gets better as I do more training. So what if you decide to do
00:57:10 more training? Let's say the final epoch is still not well trained. It is still under-trained. It
00:57:15 is not yet your character or style or whatever you are training. How can you resume training? How can you
00:57:21 continue training? With LoRA training, to resume your training, go to the LoRA settings and you will see
00:57:27 there is a network weights (LoRA weights) field. You need to give the path of your final LoRA checkpoint
00:57:34 here. What do I mean by that? Currently my LoRA is here, so this is the folder of my LoRA. Let's say
00:57:40 I will continue from this LoRA. Copy this folder path and paste it, then add a backslash and copy
00:57:48 the entire file name. So this is the full path to my LoRA. Now when I start training, it will start
00:57:57 from this LoRA and it will continue training from this checkpoint. However, there is one thing that
00:58:04 you need to fix. The trainer will still think it is starting from the first epoch. Therefore, let's say I want
00:58:11 to do 250 epochs in total, and my last checkpoint is at 200 epochs; then I type 50 here. So it will
00:58:20 do 50 more epochs, and the newly saved files will actually be at 250 epochs. I recommend you change
00:58:29 the output folder, otherwise it will overwrite your older LoRAs, because it will save them with
00:58:37 the same names as before. It really doesn't know that it is starting from 200 epochs; it
00:58:45 thinks it is starting from the first epoch. So make sure to change your output directory if you
00:58:51 are going to resume training, if you are going to do more epochs with your training.
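To make the epoch bookkeeping concrete, here is a minimal sketch of the arithmetic. This is my own illustration, not code from the trainer or the Gradio app; it just shows why you enter only the remaining epochs and why the output folder should change:

```python
def epochs_to_run(target_total_epochs: int, epochs_already_trained: int) -> int:
    """How many epochs to enter when resuming from a checkpoint.

    The trainer restarts its epoch counter when you resume from network weights,
    so you enter only the remaining epochs and mentally treat each newly saved
    checkpoint as (epochs_already_trained + its saved epoch number).
    """
    remaining = target_total_epochs - epochs_already_trained
    if remaining <= 0:
        raise ValueError("Target must be higher than what is already trained.")
    return remaining

# Example from the video: a LoRA trained for 200 epochs, resumed to reach 250 in total.
print(epochs_to_run(250, 200))   # -> 50
# Save to a new output folder: the resumed run reuses the same epoch numbering,
# so with the old folder it would overwrite the first run's checkpoints.
```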
00:58:56 And after analyzing this grid, you pick your best checkpoint and generate images with it. How can you do that?
00:59:02 Let's refresh. Okay, then let's reset params to default, go to presets, select our preset,
00:59:10 direct apply. Then select your checkpoint, the checkpoint that you decided is best. Let's say
00:59:17 I decided the last checkpoint is best, so I click it. You see now the lightning LoRA and my trained
00:59:23 LoRA are selected. You can change the impact, the weight of your LoRA from here. Let's say if it is
00:59:31 too much overfit, you can reduce your LoRA weight or if it is underfit, you can also increase your
00:59:37 weight from here. I don't recommend changing the other LoRA's weight; it is already set appropriately. Then
00:59:43 type your prompt and generate. So I have some demo prompts for example here. I can use any of them
00:59:51 or I can use all of them. So let's make several examples. For example, let's use this one. Paste
00:59:58 it here. If you paid attention to my prompts, you will see that they are constructed for realism.
01:00:06 They include elements that push the model toward realism, like a Canon 15-35 mm
01:00:15 lens and such. And I will show how I made them. So then I will click generate, but I want to show you
01:00:21 one thing. I will first disable the upscale and I will generate four random images. Okay, let's
01:00:29 generate. This should be fairly fast when there is no upscale, it is really fast. And I'm also
01:00:34 going to change the resolution. So let's cancel it. Let's make the aspect ratio as 16:9. Okay,
01:00:42 let's generate. Okay, for example, this image takes only about 14-15 seconds. Why? Because
01:00:49 I'm recording a tutorial right now, and I also set it to reserve VRAM, so this is not the best speed,
01:00:56 but it is decent. Okay, then let's say I like this image. I will click reuse parameters. Then I will
01:01:04 apply the upscale. So this is a specific upscale. Direct apply. Then the upscale is applied. So pay
01:01:13 attention to these values. And if your base model gets changed, if you do fine-tuning, it
01:01:19 will get changed, repick your base model. However, currently it is same. Then I will click generate.
01:01:26 Actually, I need to make this one. Yes. So we will see the difference between the base generation
01:01:32 and the upscaled generation. And I am not doing any face inpainting. If necessary, you can do
01:01:39 face inpainting as well. I will show an example of that. You can always go to the server logs, debug, and
01:01:46 watch where SwarmUI currently is and what it is doing. Okay, now let's compare the
01:01:53 difference. This is the base image and this is the upscaled one. You can see how much detail and
01:01:59 realism it adds. This image may not be perfect, so we may need face inpainting, which I will show, but
01:02:08 this is it. You see, like this to this. Let's also apply a face inpainting. To apply automatic face
01:02:14 inpainting, at the end of the prompt, I will type segment:face and I will type my face prompt, which
01:02:22 is photograph of ohwx man. Then go to segment refiner and you see there is segment steps. This
01:02:30 is important. I am going to make this seven. Why? Because when I make this seven with 60% image
01:02:40 inpainting creativity (I think the default is 60%, let's see; yeah, as far as I know it is 60%), it will do four
01:02:47 steps. And this is necessary because we are using the lightning LoRA. So I have set the segment steps to
01:02:53 seven and the rest is default. Let's generate. This is one option of doing that, or you can edit
01:03:00 the image and inpaint face. I will show that too after this. Okay, you see first it is inpainting
01:03:06 the face. I think after that it will upscale. Oh wait, it used the last generated image then it
01:03:13 just did the face inpainting. Nice. Okay, I can see that this is a perfect face. I can play with
01:03:20 the parameters. The default parameters are 0.6 and 0.5; I don't remember exactly what they
01:03:29 were. So to check, let's go to the SwarmUI GitHub. In here there is documentation. Then in
01:03:34 the documentation, let's search for segment. Okay, you see there is documentation, features, prompt
01:03:41 syntax. I go into features and I go to prompt syntax MD file, then search here segment and
01:03:51 let's see if it does tell us the variables. Okay, it explains the variables here. It says that the
01:03:59 first parameter is the creativity, the other one is the threshold. So I'm going to increase
01:04:03 the creativity to like 70% like this. Let's see. And you can also increase the number of steps it
01:04:12 does. It can also increase your quality. Okay, let's see what happens. And yes, this is it. So
01:04:19 you can inpaint the face to make it perfect. How about inpainting this first image? To do it, select
01:04:28 that image, click edit image, and in here, you see it set the resolution like this and the init image,
01:04:35 and the upscale, yeah, is not enabled. Okay, I need to turn off the refine upscale, then I need
01:04:43 to mask the face. Okay, here. Let's change the mask radius. Yeah, this needs a complete remake,
01:04:51 but let's mask the face. Okay, like this. Yes. Then I am going to use this prompt, and it will still
01:05:00 use the steps from here, I'm not sure. Let's generate and see what happens. This should only
01:05:08 inpaint the face. We can see how many steps it does. Okay. By the way, the resolution is
01:05:14 massive, so I don't know how it will do. Okay, it did only two steps. That is wrong; we need
01:05:20 at least four steps. Therefore, I'm going to increase my step count to seven. Okay, let's
01:05:26 try again. Let's see how many steps it does. Why did it do only two steps? The reason is that the
01:05:33 init image creativity is 60%. So 60% multiplied by four steps gives about two steps; 60% multiplied by
01:05:42 seven steps is going to give four steps.
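To make that step math concrete, here is a tiny sketch, purely my own illustration. The exact rounding SwarmUI applies is an assumption here; the point is just that init image creativity scales the effective step count, which is why seven steps are needed to land on four real steps with the 4-step lightning LoRA:

```python
def effective_steps(total_steps: int, creativity: float) -> int:
    """Approximate number of denoising steps actually run on an init image.

    Assumption: the refiner runs roughly creativity * total_steps steps,
    truncated to an integer. Check SwarmUI's behavior for the exact rule.
    """
    return max(1, int(total_steps * creativity))

print(effective_steps(4, 0.60))  # -> 2, too few for the 4-step lightning LoRA
print(effective_steps(7, 0.60))  # -> 4, enough for the lightning LoRA
```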
01:05:48 Yes, I can see it is doing four steps. Okay, it is taking about the same amount of time as the upscaling. The advantage of this approach is that I can change the
01:05:54 seed now and generate multiple times until I get the very best one. Yes. Now it looks like this.
01:06:02 If you are not satisfied with it, what you can do is you can play with the parameters here. You
01:06:08 can make this 65%, you can make the mask blur like eight, generate, and decide which one is
01:06:16 best. This is the way of doing that. You can change the seed, make it random. So this way,
01:06:22 you can mask the face or fix any part of the image as many times as needed until you get satisfying results.
01:06:31 But usually the generated images are already of the highest quality, so you don't need this. You just need to
01:06:37 write good prompts, which I am going to show in a minute. It's upcoming. You can increase the number
01:06:43 of generations so it will do the image generation and the face inpainting multiple times, and you can pick
01:06:50 the best one. For example, let's generate four times with a random seed and pick the best one.
01:06:55 Okay, now it is going to queue. Yes, four images queued. So I can see which will be the best one.
01:07:02 Okay, so with different seeds, we have different results and you can pick the best one with this
01:07:09 strategy. So to continue, I will reset params to default, then I will refresh, then from the
01:07:15 presets, let's reselect our preset and re-select our best LoRA checkpoint like this. And let's say you
01:07:24 want to generate hundreds of images with different prompts. Select your resolution, decide whether
01:07:29 you want to upscale or not. You can upscale later. So let's turn it off. Go to wildcards. In here,
01:07:36 create a wildcard, name it whatever you want, and type each prompt here on its own
01:07:42 line. So I have some demo prompts that I generated here. Let's copy all of them,
01:07:50 paste, and save. Then click it; on each generation it will pick one of the prompts randomly
01:07:58 and insert it here. Let's generate 10 images. Okay, generate. This way,
01:08:03 you can generate hundreds of images with different prompts, then pick the best one and upscale it,
01:08:11 inpaint it, and work on it. This is a really good way of batch generating images and
01:08:17 picking the best image. As you are seeing live right now, it is really fast to generate
01:08:23 if you don't upscale, because these presets use only four steps for the base generation.
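Conceptually, the wildcard is just a text file with one prompt per line, and each generation pulls a random line. Here is a minimal sketch of that idea, my own illustration rather than SwarmUI's implementation, and the file name is hypothetical:

```python
import random
from pathlib import Path

# Hypothetical wildcard file: one full prompt per line.
wildcard_file = Path("my_character_prompts.txt")

prompts = [line.strip()
           for line in wildcard_file.read_text(encoding="utf-8").splitlines()
           if line.strip()]

# Simulate a batch of 10 generations: each one gets a randomly chosen prompt.
for i in range(10):
    prompt = random.choice(prompts)
    print(f"generation {i + 1}: {prompt}")
```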
01:08:29 I did a huge amount of research to arrive at these presets, and you can see that even without upscaling,
01:08:36 the quality is decent. But when we latent-upscale it, it goes to the next level. So this is the
01:08:42 way of finding good images. And how do you write these prompts? For writing these prompts,
01:08:49 I am using Google AI Studio. Let's go to Google AI Studio, Google AI Studio from here. Then in this
01:08:55 screen, select the Gemini 2.5. Hopefully Gemini 3 is coming. Then in our example prompts, you
01:09:03 will see that there is "Gemini generate realistic character". Open it, modify it to your needs,
01:09:11 then copy and paste it into Gemini. Then I lower the temperature so it will follow my
01:09:20 prompt more closely, and generate. This way, I have generated the realistic prompts. So read it,
01:09:27 modify it as you wish, and you can generate random prompts with this pre-made approach. You can
01:09:35 test them and pick the good prompts. Then you can pick the good image and upscale it. This is the
01:09:41 way of generating amazing quality images. I have prepared Gemini prompts for stylized character or
01:09:48 for trained product item or for trained style. All of them exist to generate random prompts.
01:09:54 Okay, as the next question, you may be asking what the difference is between a tier 1 LoRA and a tier 2
01:10:04 LoRA, what the actual difference is between tier 1 and tier 2. As I have
01:10:10 explained, tier 2 uses FP8 scaled, while tier 1 uses BF16, not FP8 scaled. And the other tiers use a
01:10:20 lower network rank or lower resolution to reduce the VRAM usage. So in my tests, let me open it,
01:10:28 I also tested the quality difference. You see there is an FP8 scaled versus BF16 quality
01:10:34 comparison. Let's open the grid and group it by prompt. Okay, here. So the first one is FP8
01:10:44 scaled, the second one is BF16. And the third one is a LoRA trained on the Qwen image edit plus
01:10:53 model. I need to apply it to the Qwen image edit model as the base model, so this is its actual output. You
01:11:00 can use a LoRA trained on the Qwen image base model with the Qwen image edit plus model or vice versa. However,
01:11:07 the maximum quality is obtained when you use it on the same model it was trained on. And the quality difference
01:11:13 is minimal. I think these are just seed differences. However, the actual change appears
01:11:21 when you apply the Qwen image trained LoRA to the Qwen image edit model like this, but all of
01:11:27 them work. So we lose very little, if any, quality between FP8 and BF16, between
01:11:36 tier 2 and tier 1. You see this is tier 2, this is tier 1, or this is tier 2, this is tier
01:11:42 1. Almost the same quality; these are just random noise differences, tier 2, tier 1. So you can use
01:11:50 either of them, tier 2 or tier 1. Almost the same, you see. There is no big quality difference. Moreover,
01:11:57 you can train on the Qwen image edit model as well. It works, as you can see, and there is an advantage of
01:12:04 the Qwen image edit model which I will show you in a moment as we progress in the tutorial.
01:12:10 So now, as the next step, how do you do fine-tuning? Is there any difference? The only difference with
01:12:18 fine-tuning is that you select a fine-tuning configuration. The rest is exactly the same as
01:12:26 the LoRA training. So from the training configs, select the fine-tuning one. Again,
01:12:32 select the number of epochs. By the way, fine-tuning is currently slower than LoRA, especially on Windows.
01:12:38 On Linux, they are almost the same speed. And select the tier. If you have paid attention,
01:12:44 all fine-tuning presets are tier 1. Because fine-tuning is more optimized, we don't
01:12:51 sacrifice any quality. But the speed is slower, especially on Windows; it is really slow compared
01:12:58 to Linux. So select the VRAM according to your GPU and load with this icon, and that's
01:13:05 it. The rest is exactly the same, absolutely nothing different. It just sets the appropriate
01:13:12 training parameters for DreamBooth. However, there is one important thing: these
01:13:20 model checkpoints will be 40 GB each. Therefore, by default, I am only saving a checkpoint once every
01:13:29 40 epochs. That gives five checkpoints, 200 GB in total. And after training,
01:13:36 what you need to do is convert them into FP8 scaled. How does it work? Let me show
01:13:43 you. So let's say I have a full checkpoint in this folder. Copy this folder path and enter it as the input
01:13:51 folder. You can set an output folder, but it is not mandatory. We are going to use tensor-wise. This is scaled;
01:13:58 it is not the default FP8 conversion. This is the tensor-wise scaled format used by ComfyUI and Musubi.
01:14:07 Musubi also has block-wise, which is higher quality, but ComfyUI does not support it yet.
01:14:13 I opened an issue thread and the ComfyUI developer replied that
01:14:21 it is hopefully coming with Torch version 2.10. Currently, we are going to use tensor-wise. You can also delete the original
01:14:27 files after conversion, but don't do that the first time. So click start conversion. It will
01:14:34 convert the checkpoint into FP8 scaled with the tensor-wise method. This is really high quality, almost the
01:14:42 same quality as the original. After it finishes, you will see the result. So you see it is saving the converted
01:14:49 model. Yes. And it will take half the space, 20 GB, and it will run on your GPU much more easily.
01:14:58 This is almost the same quality as BF16; I have tested it, because this is a scaled conversion.
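For the curious, the general idea behind a tensor-wise scaled FP8 conversion looks roughly like the sketch below. This is my own simplified illustration, not the Musubi Tuner or ComfyUI implementation: each weight tensor is divided by one per-tensor scale so it fits the FP8 range, stored as float8, and the scale is kept so an approximation of the original values can be recovered at load time. Block-wise quantization follows the same idea but keeps one scale per block of the tensor, which is why it is a bit more accurate.

```python
import torch

FP8 = torch.float8_e4m3fn          # FP8 format commonly used for weights
FP8_MAX = torch.finfo(FP8).max     # ~448 for e4m3fn

def quantize_tensorwise(weight: torch.Tensor):
    """Per-tensor ("tensor-wise") scaled FP8 quantization sketch."""
    scale = weight.abs().max().clamp(min=1e-12) / FP8_MAX
    w_fp8 = (weight / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8)
    return w_fp8, scale             # both would be stored in the converted checkpoint

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original BF16/FP32 weight."""
    return w_fp8.to(torch.float32) * scale

# Quick sanity check on a random weight matrix.
w = torch.randn(4, 4)
w_fp8, s = quantize_tensorwise(w)
print((dequantize(w_fp8, s) - w).abs().max())   # small quantization error
```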
01:15:06 So what is different when you are testing the grid of the fine-tuned models? This time, we don't need
01:15:15 to select LoRA. So reset params to default and let's refresh the models here, and let's go to
01:15:22 preset, apply our preset, direct apply, go to tools, grid, let's select the prompt. This is
01:15:30 for finding the best checkpoint. Tutorial prompts are here. So the grid test prompt is here. So copy
01:15:39 them, paste them into the prompt field. As the next parameter, we select model, with the same strategy. Let's refresh the
01:15:45 models, go back to tools, and type your epochs like 100, 125, 150, 175. Okay, that is not the correct
01:15:56 one. 175 and the last checkpoint. So that's it. It will generate the grid and, exactly the same
01:16:04 as with LoRA, you will compare it, and then all you need to do is select your best checkpoint. For example,
01:16:12 it is 200, but make sure that you have converted the checkpoints into FP8 scaled. Otherwise, it will use
01:16:20 a lot of RAM, it will do a lot of block swapping, and it will be slower on consumer GPUs.
01:16:27 Okay, as the next step, the Qwen image edit model. This is also exactly the same as
01:16:34 the LoRA and fine-tuning workflows. First of all, decide whether you want to do LoRA or fine-tuning;
01:16:40 it doesn't matter. Let's give an example with LoRA since it is lighter weight. So let's load our
01:16:46 config. Then what is different? The difference comes from the training dataset. Currently,
01:16:53 we can generate images with the Qwen image edit model from just text. Therefore, you don't need to use
01:17:00 edit images. You can train on just your base images, and it will train and use the same
01:17:07 amount of VRAM, the same amount of RAM, and run at the same speed. So what is different? This time,
01:17:14 you enable this Qwen image edit model checkbox and you select a different checkpoint. Which
01:17:23 checkpoint? You select the Qwen image edit plus checkpoint, and that's it. Now you will be training
01:17:31 on the Qwen image edit plus model. What advantage does it have? It supports command-based actions. For
01:17:39 example, let me demonstrate with this one. I can upload an image here. Let's upload a
01:17:47 prompt image. I am going to use this image as the upload. Then to get an accurate size, I have shown
01:17:54 all of these in the other tutorials. Let's upload it here and let's say use closest aspect ratio. So
01:18:01 it will be set according to your input image; then uncheck this. I also recommend still upscaling,
01:18:09 and type your command prompt. This is what the Qwen image edit plus model is for. So you see this
01:18:16 command is "replace his face with ohwx man", and hit generate. By the way, you see that this base model
01:18:25 is BF16, not FP8 scaled. Therefore, it will be slower than FP8 scaled. However, it will
01:18:33 still work. Why? Because since this is using ComfyUI as the backend, it will do automatic block
01:18:40 swapping, and it will work, just more slowly. Both the model loading and the inference are slower because
01:18:46 of the block swapping. And one more thing: okay, I just noticed that I don't have
01:18:52 the correct model selected right now. Okay, the model is here: the Qwen image edit model trained without
01:18:58 control images, so the same as training the Qwen base model. I will first convert it into FP8 so it will
01:19:07 be faster. Copy the folder and batch process. This conversion tool also skips models that are already FP8.
01:19:16 So it is converting the new model. It is also properly applying the metadata. Currently,
01:19:24 it supports the Qwen base and Qwen image edit models. Okay, you see it is converted. Let's put it into
01:19:31 diffusion models. This is a full fine-tune. Then let's go back to our model list. Okay, here. Now
01:19:36 the correct model is selected; hit generate. So now we are going to, by the way, ignore this
01:19:42 image, this is from the previous generation. It is going to take this input image and convert
01:19:49 it into a new image with this prompt. Actually, let me make another one so you can see. For example,
01:19:57 this one, and it has a different aspect ratio. So to get the correct aspect ratio, I will use
01:20:04 the same strategy: closest aspect ratio. Okay. So let's cancel the current one. Let's generate a few
01:20:10 images and pick the best one. Then we can upscale. Okay, image prompting is automatically selected.
01:20:16 Let's generate four images. Okay. The upscale helps here as well. And you can of course do the
01:20:25 face inpainting as well. This is a Qwen image edit model trained without control images. Don't worry,
01:20:32 I am also going to show you how to train Qwen image edit model with control images
01:20:40 and prompts like this, like replace his face. So you will be able to teach the model new prompts,
01:20:48 new instructions. It is actually so easy. Okay, we are getting some results. For example, this one,
01:20:55 this one, this one. Whichever one you like, we are going to upscale it. The
01:21:02 upscaling will improve the quality significantly. And remember, this model was trained without
01:21:10 control images. Okay, for example, let's say this one. So I will click reuse parameters,
01:21:17 so it will set the seed correctly. Then I will enable the upscale. So I will do 60%,
01:21:25 2x. We are using the 4x Real Web Photo upscaler, and I will make the step count 7. Okay, and generate.
01:21:34 Let's see what we get after upscaling. By the way, some of the images are horrible,
01:21:39 but after upscaling, we should get pretty good quality. And remember,
01:21:48 this is a prompt that the model knows. Furthermore, you may need to generate more seeds to get a
01:21:54 more accurate one. For example, in the history, I can show you that this was another generation
01:22:01 that I made, and you see it worked perfectly. Moreover, since we upscale, we add more details
01:22:08 compared to the original image. Let me show you the original image. So this is the original image.
01:22:13 You can see the original image's details, and this is the regenerated image. We added more details
01:22:21 to the original image as well. When we compare them, you can see that our generated image has
01:22:27 some more details. And yes, this is the result. I mean, not every upscale will be perfect, nor will
01:22:34 every seed be perfect. You just need to... oh, I just noticed something. Currently, we are
01:22:40 not using the correct LoRA. That is why we got these results. So always, always apply the preset
01:22:48 so you don't make a mistake like me. So I will just use direct apply, and let's turn off the refine, and
01:22:54 let's generate five images. Okay. Now I will pick a better one. So the lightning LoRA is super important
01:23:02 because we are doing just four steps, and without the lightning LoRA, it will not work. Oh, by the way,
01:23:08 the base model changed when I applied the preset. So you can also edit the preset and set your
01:23:14 base model. You can duplicate it; I will also show you the duplicate. So I will click duplicate preset.
01:23:21 I will edit the preset. Then at the bottom, enable display advanced and display normally hidden,
01:23:28 and I will change the base model to my model, my trained model, which is here. Then save. Then
01:23:36 when I apply the preset, it will correctly select my model. This is the way of duplicating presets and
01:23:42 editing them. Then let's generate five images. And let's remove this from batch view. Okay,
01:23:48 let's delete. You will see how much better it works now. I'm not going to delete this part of
01:23:54 the video so that you can learn why it happened; these are some errors I made. Yes, you see it is much
01:24:02 better. Now that we apply the correct LoRA, it works much better. And this is the logic. Now when
01:24:09 I upscale it, it will become perfect. Okay, every image is accurate. So without the LoRA, you get noise,
01:24:15 you get horrible images, but with the correct preset, you get accurate images.
01:24:22 So how can you do real control-image training, like teaching a new command, action, and
01:24:33 result to the Qwen image edit model? It is so, so easy. Let's open our last configuration, this one.
01:24:42 Let's open all panels. Then let's go to the Qwen image training dataset section. So this was my dataset.
01:24:51 Now I am also going to auto-generate black control images; set your control image width and height like
01:24:59 this with your resolution, and generate the dataset. Then what you need to do is properly replace the
01:25:09 control images. So let's go back to our training images dataset folder. This was our folder. Okay.
01:25:16 So these images wouldn't work for this task. What kind of images do you need? I will show you. When you
01:25:23 extract the zip file and enter the Qwen training configs folder, you will see that we
01:25:30 have a Qwen image edit model example dataset. And this is the example dataset. Let's copy and paste
01:25:38 it in here and analyze it. So the edit images are provided like this. You see dataset_image_0,
01:25:46 dataset_image_1. Why have I named them like this? Because my input image, actually the final image
01:25:54 that I expect, is named dataset_image. And this is the caption. So in this caption, you
01:26:02 give the command, "make him wear the headphones". This way, you have to prepare your final image,
01:26:09 input images, and the prompt. Let's say the final image is final_image_A. Okay. Then you need to name
01:26:16 the prompt file final_image_A as well. Then you rename the input images like final_image_A0, final_image_A1. You can
01:26:25 provide up to three images as control images, so you could have another image named with a
01:26:31 two at the end. So you can provide up to three images. Then you can train it.
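As a sanity check on that naming scheme, here is a small sketch that pairs each target image with its caption and its up-to-three control images. It is my own helper, not part of the app; the .txt caption extension and the exact suffix convention are assumptions based on the example dataset shown in the video:

```python
from pathlib import Path

def collect_edit_pairs(folder: str):
    """Pair target images with their caption and control images.

    Assumed layout, mirroring the example dataset from the video:
      final_image_A.png   -> target image (the expected result)
      final_image_A.txt   -> caption, e.g. "make him wear the headphones"
      final_image_A0.png  -> control image 1 (A1, A2 for up to three controls)
    """
    folder = Path(folder)
    pairs = {}
    for target in sorted(folder.glob("*.png")):
        if target.stem[-1].isdigit():        # skip the control images themselves
            continue
        caption_file = target.with_suffix(".txt")
        pairs[target.stem] = {
            "target": target,
            "caption": caption_file.read_text().strip() if caption_file.exists() else "",
            "controls": sorted(folder.glob(f"{target.stem}[0-2].png")),
        }
    return pairs

for name, info in collect_edit_pairs("my_edit_dataset").items():
    print(name, "->", len(info["controls"]), "control image(s), caption:", info["caption"])
```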
01:26:38 When you train this way, it will learn this command: to generate this final image when you provide these input images. However,
01:26:47 there is one tricky issue. When you train the Qwen image edit model with control images like this,
01:26:54 training becomes slower and uses more VRAM. Therefore, this is
01:27:01 super important to keep in mind: you need to increase the block swap count. For example, let's
01:27:10 do a demonstration. I will close my SwarmUI and save. Then what I need to do is
01:27:18 enable the Qwen image edit model option. Then I need to increase the block swap. Let's make it 35.
01:27:27 I'm not sure how much will be sufficient, because I have two control images and they are not even
01:27:33 the correct sizes; they are not all 1328. We can see the generated dataset toml file that it is
01:27:41 going to use. You see it says that it is going to use a Qwen image edit control resolution of 1328 by
01:27:48 1328, the general resolution, and the directory of the edit images. So it is all automatically set
01:27:56 for you. What I need to do is make these images all 1328 by 1328. Actually, let's do it as
01:28:02 a demo. So I will resize these to 1328. Okay. Then I will resize this one to 1328 as well. How am I going
01:28:11 to do that? First resize this to 1328 on one side, then 1328 on the other, and we can add padding like this. And then,
01:28:21 yes, that's it. So all my control images and my output image are now at the correct resolution.
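If you prefer to script that resize-and-pad step instead of doing it by hand, a minimal Pillow sketch looks like this. It is my own illustration; the file names, the white padding color, and centering are just example choices:

```python
from PIL import Image

def resize_and_pad(src: str, dst: str, size: int = 1328, fill=(255, 255, 255)) -> None:
    """Fit an image inside size x size keeping its aspect ratio, then pad to exactly size x size."""
    img = Image.open(src).convert("RGB")
    img.thumbnail((size, size), Image.LANCZOS)                  # shrink to fit, keep aspect ratio
    canvas = Image.new("RGB", (size, size), fill)               # padded square canvas
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    canvas.save(dst)

# Example: bring a control image and the target image both to 1328 x 1328.
resize_and_pad("final_image_A0.png", "resized_final_image_A0.png")
resize_and_pad("final_image_A.png", "resized_final_image_A.png")
```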
01:28:27 Then when I click start training, let's watch what happens. Okay, it says that you don't have the...
01:28:35 okay, I got an error. Why? Because I didn't click load. So I need to click load. Then, okay, I
01:28:43 have overwritten the previous files because I had forgotten to click load and I hit save. Therefore,
01:28:51 I need to reset the parameters. Okay, this one is correct. This one is also correct. Okay, now I need to
01:28:59 select the model file from here. Okay, edit plus, select it. I will enable this. Okay, these are all
01:29:08 correct. Let's also verify this toml is a valid one. Yes. Okay, now I need to click save. I also need
01:29:16 to set the swap count to 35. I'm not sure which value is best, because it changes depending on your number
01:29:24 of control images. Now it will recache because I changed the dataset. Therefore,
01:29:31 it needs a recache. So it is doing the recaching like this. When it caches, it combines the
01:29:39 two control images and the one target image into a single cache safetensors file. So it still generates one cache entry, but it
01:29:48 contains all three images. And you see it is doing the text encoder caching as well. Now
01:29:55 we will start the training. However, I'm not sure how much VRAM it will use. Okay, you see it has
01:30:01 filled my VRAM. So let's stop, go to swap, make it 40, and click start training
01:30:09 again. You should also save your configuration like this to be sure. Okay, let's see what
01:30:16 happens now. You can also read the logs in the CMD window. It shows that it found matching control images
01:30:23 for the images; one image has two control images. You should verify your logs from here too.
01:30:30 Okay, this time it is not using the full VRAM. Therefore, this many block swaps was sufficient.
01:30:37 Now I could reduce the block swap count and check the speed. However, as you use more control images, it will
01:30:45 become slower. But this is mostly a professional use case, so you can rent a cloud machine and do the
01:30:51 training there with a more powerful GPU like an RTX 6000 Pro. Hopefully, I will make a cloud tutorial
01:30:59 as well after this, so you will see how easy it is to train there. Still, this tutorial is mandatory.
01:31:06 Okay, you see the first step has passed. It is really, really slow, and I need to wait longer to
01:31:14 see its actual speed, but currently I'm not at max performance; I am recording video, and I would need to
01:31:20 restart and close all the running applications and such. But this is the way of training an actual
01:31:28 Qwen image edit model with a specific task, with a specific command you want, like replacing clothing,
01:31:35 changing hair, or whatever you want to do as a command; you can teach it to the model.
01:31:41 So how can you resume your fine-tuning training? Let's refresh our configuration. Normally,
01:31:47 we give the base model, either the Qwen image base model or the Qwen image edit plus base model. So
01:31:54 to continue your fine-tuning training, we are going to give our checkpoint instead. For example,
01:32:00 you see my checkpoints: this is 125 epochs, this is 175 epochs. Let's say my
01:32:08 last checkpoint was at 100 epochs. I select that model, and when I start training now,
01:32:15 it will continue from this checkpoint. My configuration, my workflow, is made in such a way
01:32:23 that this is equal to training from the start to 200, i.e., doing 100 more epochs to reach the 200 epochs.
01:32:32 So it will be exactly the same whether you continue from your last checkpoint or go from 0 to 200
01:32:39 epochs at once. This is the logic of continuing the fine-tuning. Now I need to reduce my training
01:32:47 epoch count from 200 to 100, because whether you use a LoRA or a fine-tune checkpoint,
01:32:55 it will not know where the training left off. So you need to calculate the difference
01:33:02 and do that many more epochs. This is the way of continuing your fine-tuning training.
01:33:08 Before I show you the style training and also the product training, let's make a recap of how to
01:33:15 use our trained LoRAs and fine-tuned models. For LoRAs, you put your LoRAs into SwarmUI, into
01:33:23 models, into the LoRA folder, like this. For fine-tuned models, first convert them into FP8 scaled; I
01:33:31 recommend that. It is not mandatory, but converting makes them run faster. Then put them
01:33:38 into SwarmUI/models/diffusion_models folder like this. You see my files are here. Then let's start
01:33:45 our SwarmUI as usual, windows_start_swarmui. Then Quick Tools, reset params to default,
01:33:52 presets, apply our preset. This is the preset that we use. You see, Qwen-Image-UHD-Tier-2,
01:34:00 direct apply. If you are going to use a LoRA, you just need to go to your LoRA tab, select
01:34:07 your LoRA, whichever one you want to use. For example, this LoRA. Make sure that no unnecessary
01:34:13 LoRAs are selected and that the Lightning 4-step LoRA is selected. The preset may get updated,
01:34:19 so the selected LoRA may have changed by the time you are watching this tutorial, because there are
01:34:24 always newer LoRAs and newer methods that get faster. So just additionally select your
01:34:31 LoRA. Then type your prompt. For example, let's use this prompt and hit generate. You see that the
01:34:38 preset selects the Qwen image FP8 scaled model as the base model, because when you are using a LoRA,
01:34:45 you need to use the base model that you trained it on. You can use it with other base models as well,
01:34:50 as long as they are Qwen models. However, it will work best with the base model that it was
01:34:56 trained on. This is the logic of LoRAs. And we are getting our image generated. To test faster,
01:35:02 I recommend turning off upscale, generating images, and then, on the ones you like, applying the
01:35:10 upscale, so that you won't be waiting unnecessarily for the upscale part to finish.
01:35:17 If you don't like the preview image, you can always cancel and try with a new different
01:35:23 seed. As long as the seed is -1, it will generate a different image. And we got our image generated.
01:35:31 So how do I use my fine-tuned model? And you may be wondering why you should do a full
01:35:36 fine-tune: because fine-tuned models are higher quality than LoRAs. That is the reason. They
01:35:42 are able to generalize better; they can do more poses and more emotions better. It is not hugely different,
01:35:49 very close to the LoRA, but still better. So let's refresh this page, reset params to default,
01:35:56 presets, let's apply our preset, direct apply, type our prompt. And now you need to select
01:36:02 your fine-tuned model instead of the base selected model. So I'm going to select my fine-tuned model,
01:36:09 which is here. You see my Qwen fine-tuned model FP8 converted by me. And that's it. Then you need
01:36:16 to select your aspect ratio, the resolution whichever you want. For example, this one and
01:36:21 generate. We have also already seen how to do face inpainting, how to fix the face. The logic is the
01:36:27 same. You can also fix other parts, either with inpainting or with segmentation. It should work;
01:36:34 the logic never changes, but how you apply it changes, and that comes with experience, using
01:36:40 the program, and doing more generations. And this is the generation from the fine-tuned model. If you ask my
01:36:46 opinion, of course the fine-tuned model is better, but with a LoRA you can generate more images and
01:36:51 get the perfect image, or you can do inpainting, face inpainting, and fix things manually. It depends on your
01:36:59 case. If you are using this professionally, then I recommend either waiting for fine-tuning to
01:37:05 finish or using cloud services like MassedCompute or RunPod. We already have the installer scripts,
01:37:11 and hopefully I will make another tutorial to show that, but you can already train on them as well.
01:37:16 So now let's talk about style training. What changes? With style training, everything is
01:37:23 exactly the same. So what is changing? What changes is the dataset. I have already attached the GTA 5
01:37:31 style dataset in our post. You see, remember, the Qwen image tutorial video instructions. Let's download
01:37:37 the style dataset. I also shared the resulting model in this CivitAI link, so you can download
01:37:44 and use it already. The FP8 scaled version is shared; you see it is a 19 GB file. So far,
01:37:53 the comments are good, and you can use this model to generate images yourself. Okay, let's look at the
01:37:59 used style dataset. So let's move this into our folder. You can move anywhere. Let's extract it,
01:38:06 and let's analyze it. So the style dataset, again, is trained with only a trigger word,
01:38:13 not detailed captions, just ohwx. I didn't use anything else. And this was the dataset.
01:38:20 When you analyze this dataset, you will see some key things. The first thing is that it is
01:38:26 extremely consistent. This is mandatory for training a style: consistency of the style.
01:38:32 The second thing is that no character, scene, or object repeats. This
01:38:41 is super important. So you should try to avoid repetition. For example, repeating a person
01:38:48 will cause the model to memorize them, or repeating an item like this helicopter; you shouldn't repeat items,
01:38:55 you shouldn't repeat objects, you shouldn't repeat persons, places, or buildings; nothing should repeat.
01:39:01 But you may be saying, okay, these two scenes are very similar. That is true, because there weren't
01:39:08 enough images in the dataset to train on. Therefore, I cropped some
01:39:15 of the images and made them into multiple images. So this image is actually, let me open it,
01:39:21 this image is actually cropped from this big image. But you see, I tried not to repeat the same
01:39:29 objects as much as possible; I tried to avoid it. So this is the way of preparing a style dataset:
01:39:36 consistency, and not repeating objects, items, persons, characters, or whatever else you can think
01:39:43 of. Only the style should repeat. Only the style should be consistent. Everything else should be different
01:39:51 in every image. With style training, the more images you have, the better results you will get. This
01:39:58 is really, really important. Try to collect more images for style training. And when you train,
01:40:05 you will see how high the quality is. I don't recommend detailed captions. Just use
01:40:11 ohwx. This works best for Qwen, and recently for Flux I am using the same strategy,
01:40:18 and for the one which is coming it will probably be the same. I haven't tested it yet, but probably. So
01:40:25 how am I able to generate amazing quality images while using just ohwx during the training? I mean,
01:40:34 let's look at some of the images again, like this one or this one. The answer is detailed
01:40:41 prompting. So for very detailed prompting, I am using this strategy. Let's open Google AI
01:40:48 Studio. As usual, Google AI Studio from here, and then upload your style images. This is the
01:40:57 lazy way of doing it. You can of course also do it manually, but I prefer this lazy way because
01:41:03 it makes things easier. So the dataset images are here. Just select all of them, or around 20 of
01:41:10 them, it is up to you, and drag and drop them into this section. Then, and this is super important, set
01:41:18 the media resolution as high as possible. Currently medium is the highest possible. This will make the
01:41:24 model process these images with higher quality and higher accuracy. Then set the temperature to around
01:41:31 50%. And what prompt, what command, do you need to use to get proper captions and proper prompts? It is all
01:41:40 shared inside the Qwen training tutorial prompts. To generate example prompts, I'm going to use
01:41:48 "Gemini generate trained style prompts". You can read it and change it according to your needs,
01:41:55 then copy and paste it here. With this prompt, it is going to give me 100 unique prompts to generate
01:42:03 in SwarmUI or in ComfyUI, whichever one you are using. This will ensure that the generated
01:42:10 prompts include elements that will make the model generate images according to my trained
01:42:17 style. It will improve its consistency and its accuracy. Even though I trained with just ohwx,
01:42:24 this will work. Why? Because these models, Flux or Qwen, encode your training images. So whether
01:42:31 you caption them or not, they are still effectively captioned internally during the training. It
01:42:38 is a very technical thing, but you can still say that the model knows your image content.
01:42:44 So information still flows into those captions, whether you use detailed captions or not. Then hit the
01:42:52 generate icon. So now it will generate example prompts for me. Analyze the generated prompts and you
01:42:58 will understand the logic. It will give you an idea of how you should prompt your style after training.
01:43:05 This will significantly improve the accuracy of your generated images with your style. And this
01:43:13 applies to all style trainings. Believe me, you will be able to generate amazing stylized images,
01:43:20 amazing images in your style after you do this. Another use case of style training could be that
01:43:26 you have a line art image as the input and your styled image as the final image. So you can
01:43:33 train the Qwen edit model with this strategy and have a model that converts your line art
01:43:40 images into images painted and colored in your style. We have already seen the logic
01:43:48 of Qwen image edit model training, so check that part again if you are unsure, but this is
01:43:54 the way of training a style, the logic of training a style. You can see that these are all amazing;
01:44:00 they are all extremely consistent with the dataset, and it is an extremely versatile model,
01:44:06 not overfit. It can still generate pretty much anything, and this is exactly the
01:44:13 way that I trained it. I am still using the same configuration. The configuration doesn't change
01:44:18 for a style or a product or a person; it doesn't change. What changes is the dataset, how the dataset
01:44:25 is prepared, and how many epochs you do. If you have more images, you can do fewer epochs,
01:44:32 but with a style, I recommend doing more epochs because the style takes more training to learn. And you can just
01:44:38 download this model from CivitAI and generate images right away yourself if you wish as well.
01:44:45 So how are you going to generate images with your trained style? Let's refresh.
01:44:49 Let's reset params to default. Go to presets, and for style generation,
01:44:55 we have two presets: Qwen-Images-Stylized-UHD or Qwen-Images-Stylized-UHD-Tier-1. Tier
01:45:02 1 is better, but it takes more time and more steps. So let's do an example with
01:45:09 tier 2. This will be a quick example. I have selected it. Then I need to select my trained
01:45:14 model. Currently I have a fully trained model, not a LoRA, for the style. It is here. I have selected
01:45:20 it. Let's change the aspect ratio. Then let's use one of the generated prompts. For example,
01:45:27 let's use this one and turn off refine upscale. Let's generate eight images. Then we can pick
01:45:33 the best one and upscale it. Okay, I have got two images generated. For example, let's upscale this
01:45:39 particular one. The seed is here. I will set the seed and I will just enable refine upscale
01:45:46 and generate. So this was the base generated image without any upscale and let's see the result after
01:45:52 upscaling. So it is upscaling right now. If you instead use tier 1, it will do more steps during
01:46:00 the upscale and it improves the quality. So if you are looking for maximum quality, you can use
01:46:06 Qwen-Image-Stylized-UHD-Tier-1 configuration. These configs may get updated over time,
01:46:13 so make sure to read the Patreon post changes and the newest preset descriptions. Okay,
01:46:18 the upscale completed. I had forgotten the image count was set to eight, so it was generating another one. So yes,
01:46:24 this is the upscaled version. Let's compare it with the base version. So this was the
01:46:28 base version and this is the upscaled version. And this upscale was very, very fast because
01:46:34 it was only four steps. However, you can do more steps to get even better, higher quality details.
01:46:41 Okay, what about product training? The product training dataset preparation is different from
01:46:48 both style and character. Let me explain the logic of product training. So I have
01:46:55 prepared a product dataset like this one. I probably used this one, which is not the ideal one.
01:47:02 And because I used this dataset, what happened is that in some cases the perfume's
01:47:10 size is not very accurate, because you see all of these images are extremely close shots. So the
01:47:17 AI didn't learn its proportions properly. I also had another dataset which I was planning to use,
01:47:26 this one, which included shots where a person was holding it like this. So you should have a mix of
01:47:34 product images. Some of them should be very close, so it learns the details. Some of them should be far
01:47:41 and distant, so it will learn the proportions. This is important. Think about the way you want
01:47:47 to generate the product images after training, and include such images so that it can
01:47:54 learn the proportions. You see there is a glass behind the perfume bottle. So the model will
01:48:01 understand the product's proportions relative to a glass. Moreover,
01:48:08 you can see how powerful this training is. You see this icon was perfectly learned by the AI,
01:48:17 like this one. So Qwen is extremely powerful when it comes to learning details or learning
01:48:24 detailed small text. Unlike Flux, this model is much more powerful for text learning, for learning
01:48:32 the text on small products, and it can generate amazing quality images like this one. It is up
01:48:40 to your imagination after training. And again, I just used ohwx as the caption. I didn't use
01:48:48 detailed captions, and there is another strategy for generating the product prompts for inference.
01:48:56 So again, we upload our product images into Gemini. Select a few of them, like these ones.
01:49:03 You can select more, of course. Selecting more will help Gemini understand better. Then
01:49:09 in the Qwen training tutorial prompts, you will see "Gemini generate trained product item prompts".
01:49:15 You can modify it however you want, then paste it here and hit enter. This way,
01:49:23 it will generate example prompts for me, and it will also describe the text on the product. So you see,
01:49:31 during inference, we describe whatever we want in detail to improve its accuracy and improve
01:49:38 its consistency. During training, we just used a single activation token, a rare word, a rare token
01:49:46 like ohwx, but during inference, we give a very detailed description, a very detailed prompt,
01:49:54 to match perfectly with whatever we have trained. Especially if the product is a very rare product,
01:50:01 this will help even more significantly. When you train a character, this is not as necessary, because
01:50:07 the character knowledge, the person knowledge, of these models is massive compared to your
01:50:13 specific products or your specific styles. And you will see that it has generated some example
01:50:20 prompts. You see it describes the double C logo and the text on the product. These two will
01:50:26 help significantly in generating product images accurately after training. And then you will
01:50:32 be able to generate amazing quality images like these ones that you can use for advertisement,
01:50:38 for demos. I mean, you can even see that it got this pipe right as well. This is a very,
01:50:46 very small detail. However, it is able to do that. So this is the way of training a product.
01:50:52 Thank you so much for watching. I recommend you join our Discord channel. You can always message
01:50:58 me there. You will see the Discord channel link at the top. I recommend you go to our
01:51:03 GitHub. You will see a lot of information here; fork it, star it, watch it. You can
01:51:09 also sponsor me from here. When you go to our wiki, you will see all of our tutorials. You
01:51:15 see we have hundreds of tutorials. You can search the tutorials from here with Ctrl+F. Also on
01:51:20 the main page, you will see the tutorials sorted. Let me show you. As you scroll down,
01:51:27 you will see them starting from the first one to the latest ones, going this way. Moreover,
01:51:32 we have a Reddit. I recommend joining our Reddit. We are getting bigger and bigger, with more visitors,
01:51:39 more people. Let's see some of the stats. You see we have 300k visits. We have members,
01:51:45 and they are increasing. And you can follow me on my LinkedIn. This is my real LinkedIn profile. You
01:51:52 can follow me here. Furthermore, don't forget to subscribe to our channel and turn on the bell
01:51:57 notifications. You can see our videos from here. You can search our videos. We are getting
01:52:03 bigger and bigger, hopefully. I am also giving private lectures. Let's say you want to learn
01:52:09 one-on-one, you can message me. I am giving private lectures to both individuals and
01:52:14 companies. Moreover, I am giving consultation to companies as well. So you can always message
01:52:20 me by replying to the video, or from Discord or LinkedIn; all of them should work. So
01:52:26 thank you so much for watching. Hopefully, see you in another amazing tutorial video.