First of all, I spent hours drafting this guide, and when I clicked "save draft", Civitai took me to a 404 page and all my work was lost; I had to write this guide from scratch again. Recent changes to the site seem to only be enhancing the censorship at the cost of a degraded notification system, worse site reliability, and an increasingly confusing user experience (one of the lora trainers I follow complained that he can now only see his own work when he logs out of his account). So this will be the first and also the last guide I publish on this site. I am not mad at anyone, as I work in the tech industry myself and understand how challenging it is to maintain a large site like this; I just don’t like my time being wasted.
On 4/14, Bilibili hosted an Alibaba public live stream (https://www.bilibili.com/video/BV17VojYtE47/) focusing on the various video technologies made possible by WAN2.1, and about 20 minutes were spent on lora training by @shichen, which I found quite helpful. As the stream is in Chinese only, here I am to translate it for you guys. At the end of this doc I’ll share my own training settings.
Apart from the official content, I’ll also add my own commentary, because my interest is NSFW action/concept training, like my best work “Rip her clothes” (which unfortunately was taken down), while the official stream focuses on art style training. Take my recommendations with a grain of salt.
DATASET
Official guideline
- The more data you have, the better generalization you get
- 30~50 static pictures
- 10~20 high quality video clips, 16 fps, 81 frames per clip
- Change res and frame based on your VRAM
- High res training data can be used after initial lora training with lower res
- Style/Theme/Composition need to be consistent
- Include different scenes for generalization
Comments
The speaker mentioned a lora can be trained with as few as 33 frames, i.e. about 2 seconds of video. I have had the same experience, but for NSFW concepts/actions, the longer the clip the better, and the trade-off should be biased towards clip length rather than resolution, because WAN is extremely censored and you need a lot of action data.
The speaker also mentioned that for I2V, high quality video clips are extremely important. I agree 100%: most of my loras are trained without any static pictures, just more video clips. But for a T2V art style lora, I can imagine that static pictures are important for generalization.
On the last two points, Style/Theme/Composition can roughly be understood as: all the training data is oil painting (style) by Vincent van Gogh (theme), presented as if it were a perfect scanned copy (composition), not a photograph of the art or some other format. Diversity in scenes means you include people, street views, the night sky, sunflowers, etc., so the lora doesn’t always produce the same scene.
I am lucky enough to afford a 4090 running all the time, so I always use 400*300, 53-frame training clips and have never tried low-res training.
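To make the clip preparation concrete, here is a minimal sketch of how source footage could be cut into short, fixed-fps training clips. It assumes ffmpeg is installed and on PATH; the file paths, timestamps, clip length, and resolution are placeholders to adjust to your own dataset and VRAM budget, not part of the official guideline.

```python
# Minimal sketch: cut source footage into short 16 fps training clips.
# Assumes ffmpeg is on PATH; paths, timestamps, clip length, and resolution
# are placeholders -- adjust to your own dataset and VRAM.
import subprocess
from pathlib import Path

SRC = Path("raw_footage.mp4")      # hypothetical source file
OUT_DIR = Path("dataset/clips")
OUT_DIR.mkdir(parents=True, exist_ok=True)

FPS = 16                 # WAN 2.1's native frame rate
SECONDS = 3              # ~49 frames; the guide above uses 53- or 81-frame clips
WIDTH, HEIGHT = 400, 300 # low-res first pass; crop the source first if its
                         # aspect ratio differs, scale alone will distort it

def cut_clip(start_sec: float, index: int) -> None:
    """Extract one clip starting at start_sec, resampled and resized."""
    out = OUT_DIR / f"clip_{index:03d}.mp4"
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start_sec),
        "-i", str(SRC),
        "-t", str(SECONDS),
        "-vf", f"fps={FPS},scale={WIDTH}:{HEIGHT}",
        "-an",             # drop audio, the trainer does not need it
        str(out),
    ], check=True)

# Example: pull three clips from hand-picked timestamps.
for i, start in enumerate([12.0, 47.5, 83.0]):
    cut_clip(start, i)
```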
DATA CLEANING
Official guideline
- Remove blurry and irrelevant samples
- Have diversity in lighting, angles, concepts. Ensure detailed description
- Caption the key feature (like a word for a certain art style, or a special token)
- Manually chunk your video clips
- Use Gemini 2 for captioning
- Use a consistent caption style (like special token + character + action + other elements)
Comments
The speaker mentioned that because WAN2.1 uses Google’s UMT5 text encoder, having a trigger word can be helpful. I have incorporated this idea in my recent training.
The speaker uses Gemini to caption images. I use Gemma 27b uncensored. Gemini is probably better, as it tops the visual tasks on LMArena, but for NSFW work I have to use a local LLM, and Gemma/Qwen are the top choices. Joy & Florence may work, but they are not real LLMs that can follow instructions.
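To show what a consistent caption style looks like in practice, here is a minimal sketch of driving a local vision LLM through an OpenAI-compatible endpoint. The base URL, model name, trigger word, and folder layout are all assumptions/placeholders; any local server that exposes the OpenAI chat API with vision input should work. The system prompt enforces the "special token + character + action + other elements" structure from the guideline.

```python
# Minimal captioning sketch against a local OpenAI-compatible server.
# base_url, api_key, model name, trigger word, and paths are placeholders;
# point them at whatever local vision LLM you actually run.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "gemma-27b-uncensored"   # placeholder model name
TRIGGER = "my_trigger_word"      # the special token you will train on

SYSTEM_PROMPT = (
    "You caption video frames for lora training. Always answer as one "
    f"paragraph in this order: '{TRIGGER}', then the characters, then the "
    "action, then foreground/background and lighting. No opinions, no lists."
)

def caption_frame(image_path: Path) -> str:
    """Send one representative frame of a clip and return its caption."""
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Caption this frame."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content.strip()

# Example: write a .txt caption next to each extracted frame.
for frame in Path("dataset/frames").glob("*.jpg"):
    frame.with_suffix(".txt").write_text(caption_frame(frame))
```

You would still go through the outputs by hand afterwards, since even a good local model gets actions wrong often enough to matter.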
Training Script
Official Guideline
- Use Diffusion-pipe
- You need WSL on Windows for this tool
- 24GB or more VRAM is best
- 4090/L20/L40s (Just a few options with >= 24GB VRAM)
- Low-VRAM setups can use the blocks_to_swap option at the cost of training speed
The speaker recommends an alternative GUI trainer based on Kohya’s Musubi Tuner, but I’m not going to recommend it because the developer releases the tool in Chinese and it will be very hard to get support. I recommend using Musubi Tuner directly. I bet there are other English GUI solutions as well, so if you know one, comment below to help fellow lora trainers.
Training Tips
Official Guideline
- Training clip’s frame rate will affect the speed of the video the lora produces
- Training on static pics only, or on static pics + video clips, may affect the speed of the video
- Too much training on static pics will make your videos static
Comment
I think the takeaway here is:
- Use 16 fps, normal-speed (1x) videos ONLY for training, or change it up if you are training a “speed up”/“slowmo” concept, because the AI will learn the speed of your video (a quick sanity-check script follows this list).
- Static pics will give you more coverage and generalization if used properly, but at the end of the day you are training a video lora, so they are supplementary to the videos. Never skip video clips.
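Here is a minimal sketch of the sanity check mentioned above: flag any clip whose frame rate drifts from 16 fps, since the lora will learn the playback speed of whatever you feed it. It assumes ffprobe (ships with ffmpeg) is on PATH, and the dataset path is a placeholder.

```python
# Minimal dataset audit sketch: flag clips that are not 16 fps.
# Assumes ffprobe is on PATH; the dataset path is a placeholder.
import json
import subprocess
from pathlib import Path

CLIP_DIR = Path("dataset/clips")
TARGET_FPS = 16.0

def probe(path: Path) -> dict:
    """Return the first video stream's metadata via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_streams", "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["streams"][0]

for clip in sorted(CLIP_DIR.glob("*.mp4")):
    s = probe(clip)
    num, den = map(int, s["avg_frame_rate"].split("/"))
    fps = num / den if den else 0.0
    frames = s.get("nb_frames", "?")   # some containers do not report this
    flag = "" if abs(fps - TARGET_FPS) < 0.5 else "  <-- not 16 fps, will change motion speed"
    print(f"{clip.name}: {fps:.2f} fps, {frames} frames{flag}")
```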
Verification
Official Guideline
- Use completely unseen prompts to see if generalization is achieved (for art style lora)
- Check if WAN2.1’s intrinsic motion and concept capabilities are adversely affected
- Check that if you output 1 frame, you get an image similar to the training data
- Check that at 33 frames / 77 frames, the motion is still good with a training prompt
Comment
I think this is most applicable to T2V. For I2V I would only worry about points #2 and #4: don’t mess up WAN’s intrinsic capabilities, as this is also an indicator of whether your lora can be stacked with other loras.
Common problems
- Shaking and flashing edges in the video – is SLG enabled? Disable it and try again
- (T2V) Generated content concentrates on the bottom-right of the video – check whether the aspect ratio of the output video matches the training clips, and lower the lora weight
- Bad performance on unseen prompts – train on the 14b model, and add more training clips at the desired resolution
Comment
Having high quality training clips and diversity across a few aspect ratios is pretty important. This has been discussed many times, even for other AI models, so I’m not going to repeat it here. A small script to audit your dataset’s aspect ratios follows.
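This sketch tallies the resolutions and aspect ratios present in a training set, so you can spot when everything was cut to a single ratio (which ties into the “content drifts to the bottom-right” problem above). As before, ffprobe on PATH is assumed and the dataset path is a placeholder.

```python
# Minimal sketch: tally training-clip resolutions / aspect ratios.
# Assumes ffprobe is on PATH; the dataset path is a placeholder.
import json
import subprocess
from collections import Counter
from pathlib import Path

CLIP_DIR = Path("dataset/clips")

def resolution(path: Path) -> tuple[int, int]:
    """Return (width, height) of the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    s = json.loads(out)["streams"][0]
    return s["width"], s["height"]

counts = Counter(resolution(p) for p in sorted(CLIP_DIR.glob("*.mp4")))
for (w, h), n in counts.most_common():
    print(f"{w}x{h} (ratio {w / h:.2f}): {n} clips")
```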
Interesting Findings
- WAN2.1 can be used as an image generation model as well; as little as 20 minutes of training can make it work
- WAN can generalize physics rules and forms of expression (the speaker uses a lamp’s light as an example – WAN learns that circular strips of brushwork revolving around a light source can be used to depict lighting). So when building a training dataset you can consider the impact of time and cause-and-effect, because WAN will learn it
- 1.3b runs and trains faster, so use it to validate ideas first
- You can use English to caption the training dataset and Chinese to prompt. Just make sure you include the same trigger word.
My training setup for WAN I2V action lora on Kohya Musubi Tuner
This is neither my final setup nor the best setup in the community, but I’m sharing it here in case people are curious. A rough sketch of how these settings map onto a Musubi Tuner command follows the list.
- Use Gemma 27b uncensored for captioning. Use a system prompt to force a structured description of the scene (scene + actors + foreground + background + action + lighting). Manually fix errors and flesh out the actions.
- 15~25 video clips at 16 fps, 3 seconds each, in 400*300 and 300*400 resolutions
- No static pictures for I2V
- Network Dim 32, network alpha 1 (you need a high initial learning rate to support this config)
- adamw8bit optimizer, default options
- 20 epochs, 4~6 repeats per video clip
- 1e-4 or 2e-4 initial learning rate. If it’s a hard concept I use 2e-4.
- 14b 720p fp8 as base model
- timestep_sampling set to shift, discrete_flow_shift set to 5
- Save every epoch for the initial 20, pick the 3~5 epochs right around where the loss value plateaus, manually generate 3 videos for each epoch, and pick the best one. Loss for my models so far has been around 0.04 ~ 0.025
- Based on whether the lora is already good enough, I’ll either publish it or do further training at a much lower learning rate, e.g. 10 epochs, 5 repeats, 4e-5, and rinse and repeat
- Based on whether the generalization is good enough, I’ll add more training data and restart training from a lower epoch. For example, if epoch 18 is the best, but out of the three test prompts there’s one angle that doesn’t work very well, I’ll add maybe 5 more well-captioned training clips at that angle and restart training from epoch 13.
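For anyone who wants to see how the settings above might translate into an actual Musubi Tuner run, here is a rough sketch. The flag names are written from memory of the WAN training script (wan_train_network.py) and may differ between versions, and every file path is a placeholder, so treat this as a starting point to check against the repo’s documentation rather than a copy-paste command.

```python
# Rough sketch of a Musubi Tuner launch mirroring the settings listed above.
# Flag names are from memory of wan_train_network.py and may not match your
# version exactly; every path is a placeholder. Verify against the Musubi
# Tuner README before running.
import subprocess

cmd = [
    "accelerate", "launch", "wan_train_network.py",
    "--task", "i2v-14B",                                      # I2V 14b base
    "--dit", "models/wan2.1_i2v_720p_14B_fp8.safetensors",    # placeholder path
    "--vae", "models/wan_2.1_vae.safetensors",                # placeholder path
    "--t5", "models/umt5-xxl-enc.safetensors",                # placeholder path
    "--clip", "models/clip_vision.safetensors",               # placeholder path (I2V only)
    "--dataset_config", "dataset.toml",    # clips, captions, resolution, repeats
    "--network_module", "networks.lora_wan",
    "--network_dim", "32",
    "--network_alpha", "1",
    "--optimizer_type", "adamw8bit",
    "--learning_rate", "2e-4",             # 1e-4 for easier concepts
    "--timestep_sampling", "shift",
    "--discrete_flow_shift", "5",
    "--max_train_epochs", "20",
    "--save_every_n_epochs", "1",          # keep every epoch for picking later
    "--mixed_precision", "bf16",
    "--fp8_base",                          # fp8 base model
    "--gradient_checkpointing",
    "--sdpa",
    # "--blocks_to_swap", "20",            # uncomment on low-VRAM cards
    "--output_dir", "output",
    "--output_name", "my_action_lora",
]
subprocess.run(cmd, check=True)
```

The dataset.toml referenced above is where the clips, captions, resolution, and the 4~6 repeats per clip live; its exact keys are documented in the Musubi Tuner repo, so check there rather than guessing.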