First of all, I spent hours drafting this guide, and when I clicked "save draft", Civitai took me to a 404 page and all my work was lost; I had to write this guide from scratch again. Recent changes to the site seem to only be enhancing the censorship at the cost of a degraded notification system, worse site reliability, and an increasingly confusing user experience (one of the lora trainers I follow complained that he can now only see his own work when he logs out of his account). So this will be the first and also the last guide I publish on this site. I am not mad at anyone, as I work in the tech industry myself and understand how challenging it is to maintain a large site like this; I just don’t like my time being wasted.
On 4/14, Bilibili hosted an Alibaba public live stream (https://www.bilibili.com/video/BV17VojYtE47/) focusing on the various video technologies made possible by WAN2.1, and about 20 minutes were spent on lora training by @shichen, which I found quite helpful. As the stream is in Chinese only, here I am to translate it for you guys. At the end of this doc I’ll share my own training settings.
Apart from the official content, I’ll also add my own commentary, because my interest is NSFW action/concept training, like my best work “Rip her clothes” (which unfortunately was taken down), while the official stream focuses on art style training. Take my recommendations with a grain of salt.
DATASET
Official guideline
- The more data you have, the better generalization you get
- 30~50 static pictures
- 10~20 high quality video clips, 16 fps, 81 frames per clip
- Change res and frame based on your VRAM
- High res training data can be used after initial lora training with lower res
- Style/Theme/Composition need to be consistent
- Include different scenes for generalization
Comments
The speaker mentioned a lora can be trained with as few as 33 frames, i.e. about 2 seconds of video. I have had the same experience, but for NSFW concepts/actions, the longer the clip the better, and the trade-off should be biased towards clip length rather than resolution, because WAN is extremely censored and you need a lot of action data.
The speaker also mentioned that for I2V, high quality video clips are extremely important. I agree 100%: most of my loras are trained without any static pictures, just more video clips. But for a T2V art style lora, I can imagine that static pictures are important for generalization.
On the last two points, Style/Theme/Composition can roughly be understood as: all the training data is oil painting (style) by Vincent van Gogh (theme), presented as if it were a perfect scanned copy (composition), not a photograph of the art or some other format. Diversity in scenes means you include people, street views, the night sky, sunflowers, etc., so the lora doesn’t always produce the same scene.
I am lucky enough to afford a 4090 running all the time, so I always use 400*300, 53-frame training clips and have never tried low-res training.
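To make the clip preparation concrete, here is a minimal sketch of how source footage could be cut into short, fixed-fps training clips. It assumes ffmpeg is installed and on PATH; the file paths, timestamps, clip length, and resolution are placeholders to adjust to your own dataset and VRAM budget, not part of the official guideline.

```python
# Minimal sketch: cut source footage into short 16 fps training clips.
# Assumes ffmpeg is on PATH; paths, timestamps, clip length, and resolution
# are placeholders -- adjust to your own dataset and VRAM.
import subprocess
from pathlib import Path

SRC = Path("raw_footage.mp4")      # hypothetical source file
OUT_DIR = Path("dataset/clips")
OUT_DIR.mkdir(parents=True, exist_ok=True)

FPS = 16                 # WAN 2.1's native frame rate
SECONDS = 3              # ~49 frames; the guide above uses 53- or 81-frame clips
WIDTH, HEIGHT = 400, 300 # low-res first pass; crop the source first if its
                         # aspect ratio differs, scale alone will distort it

def cut_clip(start_sec: float, index: int) -> None:
    """Extract one clip starting at start_sec, resampled and resized."""
    out = OUT_DIR / f"clip_{index:03d}.mp4"
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", str(start_sec),
        "-i", str(SRC),
        "-t", str(SECONDS),
        "-vf", f"fps={FPS},scale={WIDTH}:{HEIGHT}",
        "-an",             # drop audio, the trainer does not need it
        str(out),
    ], check=True)

# Example: pull three clips from hand-picked timestamps.
for i, start in enumerate([12.0, 47.5, 83.0]):
    cut_clip(start, i)
```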
DATA CLEANING
Official guideline
- Remove blurry and irrelevant samples
- Have diversity in lighting, angles, concepts. Ensure detailed description
- Caption the key feature (like a word for a certain art style, or a special token)
- Manually chunk your video clips
- Use Gemini 2 for captioning
- Use a consistent caption style (like special token + character + action + other elements)
Comments
The speaker mentioned that because WAN2.1 uses Google’s UMT5 text encoder, having a trigger word can be helpful. I have incorporated this idea in my recent training.
The speaker uses Gemini to caption images. I use Gemma 27b uncensored. Gemini is probably better, as it tops the visual tasks on LMArena, but for NSFW work I have to use a local LLM, and Gemma/Qwen are the top choices. Joy & Florence may work, but they are not real LLMs that can follow instructions.
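To show what a consistent caption style looks like in practice, here is a minimal sketch of driving a local vision LLM through an OpenAI-compatible endpoint. The base URL, model name, trigger word, and folder layout are all assumptions/placeholders; any local server that exposes the OpenAI chat API with vision input should work. The system prompt enforces the "special token + character + action + other elements" structure from the guideline.

```python
# Minimal captioning sketch against a local OpenAI-compatible server.
# base_url, api_key, model name, trigger word, and paths are placeholders;
# point them at whatever local vision LLM you actually run.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "gemma-27b-uncensored"   # placeholder model name
TRIGGER = "my_trigger_word"      # the special token you will train on

SYSTEM_PROMPT = (
    "You caption video frames for lora training. Always answer as one "
    f"paragraph in this order: '{TRIGGER}', then the characters, then the "
    "action, then foreground/background and lighting. No opinions, no lists."
)

def caption_frame(image_path: Path) -> str:
    """Send one representative frame of a clip and return its caption."""
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Caption this frame."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content.strip()

# Example: write a .txt caption next to each extracted frame.
for frame in Path("dataset/frames").glob("*.jpg"):
    frame.with_suffix(".txt").write_text(caption_frame(frame))
```

You would still go through the outputs by hand afterwards, since even a good local model gets actions wrong often enough to matter.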
Training Script
Official Guideline
- Use Diffusion-pipe
- You need WSL on Windows for this tool
- 24GB or more VRAM is best
- 4090/L20/L40s (Just a few options with >= 24GB VRAM)
- Low-VRAM setups can use the blocks_to_swap option at the cost of training speed
The speaker recommends an alternative GUI trainer based on Kohya’s Musubi Tuner, but I’m not going to recommend it because the developer releases the tool in Chinese and it will be very hard to get support. I recommend using Musubi Tuner directly. I bet there are other English GUI solutions as well, so if you know one, comment below to help fellow lora trainers.
Training Tips
Official Guideline
- Training clip’s frame rate will affect the speed of the video the lora produces
- Training on static pics only, or on static pics + video clips, may affect the speed of the video
- Too much training on static pics will make your videos static
Comment
I think the takeaway here is:
- Use 16 fps, normal-speed (1x) videos ONLY for training, or change it up if you are training a “speed up”/“slowmo” concept, because the AI will learn the speed of your video (a quick sanity-check script follows this list).
- Static pics will give you more coverage and generalization if used properly, but at the end of the day you are training a video lora, so they are supplementary to the videos. Never skip video clips.
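Here is a minimal sketch of the sanity check mentioned above: flag any clip whose frame rate drifts from 16 fps, since the lora will learn the playback speed of whatever you feed it. It assumes ffprobe (ships with ffmpeg) is on PATH, and the dataset path is a placeholder.

```python
# Minimal dataset audit sketch: flag clips that are not 16 fps.
# Assumes ffprobe is on PATH; the dataset path is a placeholder.
import json
import subprocess
from pathlib import Path

CLIP_DIR = Path("dataset/clips")
TARGET_FPS = 16.0

def probe(path: Path) -> dict:
    """Return the first video stream's metadata via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_streams", "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["streams"][0]

for clip in sorted(CLIP_DIR.glob("*.mp4")):
    s = probe(clip)
    num, den = map(int, s["avg_frame_rate"].split("/"))
    fps = num / den if den else 0.0
    frames = s.get("nb_frames", "?")   # some containers do not report this
    flag = "" if abs(fps - TARGET_FPS) < 0.5 else "  <-- not 16 fps, will change motion speed"
    print(f"{clip.name}: {fps:.2f} fps, {frames} frames{flag}")
```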
Verification
Official Guideline
- Use completely unseen prompts to see if generalization is achieved (for art style lora)
- Check if WAN2.1’s intrinsic motion and concept capabilities are adversely affected
- Check that if you output 1 frame, you get an image similar to the training data
- Check that at 33 frames / 77 frames, the motion is still good with a training prompt
Comment
I think this is most applicable to T2V. For I2V I would only worry about points #2 and #4: don’t mess up WAN’s intrinsic capabilities, as this is also an indicator of whether your lora can be stacked with other loras.
Common problems
- Shaking and flashing edges in the video – is SLG enabled? Disable it and try again
- (T2V) Generated content concentrates on the bottom-right of the video – check whether the aspect ratio of the output video matches the training clips, and lower the lora weight
- Bad performance on unseen prompts – train on the 14b model, and add more training clips at the desired resolution
Comment
Having high quality training clips and diversity across a few aspect ratios is pretty important. This has been discussed many times, even for other AI models, so I’m not going to repeat it here. A small script to audit your dataset’s aspect ratios follows.
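This sketch tallies the resolutions and aspect ratios present in a training set, so you can spot when everything was cut to a single ratio (which ties into the “content drifts to the bottom-right” problem above). As before, ffprobe on PATH is assumed and the dataset path is a placeholder.

```python
# Minimal sketch: tally training-clip resolutions / aspect ratios.
# Assumes ffprobe is on PATH; the dataset path is a placeholder.
import json
import subprocess
from collections import Counter
from pathlib import Path

CLIP_DIR = Path("dataset/clips")

def resolution(path: Path) -> tuple[int, int]:
    """Return (width, height) of the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    s = json.loads(out)["streams"][0]
    return s["width"], s["height"]

counts = Counter(resolution(p) for p in sorted(CLIP_DIR.glob("*.mp4")))
for (w, h), n in counts.most_common():
    print(f"{w}x{h} (ratio {w / h:.2f}): {n} clips")
```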
Interesting Findings
- WAN2.1 can be used as an image generation model as well; as little as 20 minutes of training can make it work
- WAN can generalize physics rules and forms of expression (the speaker uses a lamp’s light as an example – WAN learns that circular strips of brushwork revolving around a light source can be used to depict lighting). So when building a training dataset you can consider the impact of time and cause-and-effect, because WAN will learn it
- 1.3b runs and trains faster, so use it to validate ideas first
- You can use English to caption the training dataset and Chinese to prompt. Just make sure you include the same trigger word.
My training setup for WAN I2V action lora on Kohya Musubi Tuner
This is neither my final setup nor the best setup in the community, but I’m sharing it here in case people are curious. A rough sketch of how these settings map onto a Musubi Tuner command follows the list.
- Use Gemma 27b uncensored for captioning. Use a system prompt to force a structured description of the scene (scene + actors + foreground + background + action + lighting). Manually fix errors and flesh out the actions.
- 15~25 video clips at 16 fps, 3 seconds each, in 400*300 and 300*400 resolutions
- No static pictures for I2V
- Network Dim 32, network alpha 1 (you need a high initial learning rate to support this config)
- adamw8bit optimizer, default options
- 20 epochs, 4~6 repeats per video clip
- 1e-4 or 2e-4 initial learning rate. If it’s a hard concept I use 2e-4.
- 14b 720p fp8 as base model
- timestep_sampling set to shift, discrete_flow_shift set to 5
- Save every epoch for the initial 20, pick the 3~5 epochs right around where the loss value plateaus, manually generate 3 videos for each epoch, and pick the best one. Loss for my models so far has been around 0.04 ~ 0.025
- Based on whether the lora is already good enough, I’ll either publish it or do further training at a much lower learning rate, e.g. 10 epochs, 5 repeats, 4e-5, and rinse and repeat
- Based on whether the generalization is good enough, I’ll add more training data and restart training from a lower epoch. For example, if epoch 18 is the best, but out of the three test prompts there’s one angle that doesn’t work very well, I’ll add maybe 5 more well-captioned training clips at that angle and restart training from epoch 13.
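For anyone who wants to see how the settings above might translate into an actual Musubi Tuner run, here is a rough sketch. The flag names are written from memory of the WAN training script (wan_train_network.py) and may differ between versions, and every file path is a placeholder, so treat this as a starting point to check against the repo’s documentation rather than a copy-paste command.

```python
# Rough sketch of a Musubi Tuner launch mirroring the settings listed above.
# Flag names are from memory of wan_train_network.py and may not match your
# version exactly; every path is a placeholder. Verify against the Musubi
# Tuner README before running.
import subprocess

cmd = [
    "accelerate", "launch", "wan_train_network.py",
    "--task", "i2v-14B",                                      # I2V 14b base
    "--dit", "models/wan2.1_i2v_720p_14B_fp8.safetensors",    # placeholder path
    "--vae", "models/wan_2.1_vae.safetensors",                # placeholder path
    "--t5", "models/umt5-xxl-enc.safetensors",                # placeholder path
    "--clip", "models/clip_vision.safetensors",               # placeholder path (I2V only)
    "--dataset_config", "dataset.toml",    # clips, captions, resolution, repeats
    "--network_module", "networks.lora_wan",
    "--network_dim", "32",
    "--network_alpha", "1",
    "--optimizer_type", "adamw8bit",
    "--learning_rate", "2e-4",             # 1e-4 for easier concepts
    "--timestep_sampling", "shift",
    "--discrete_flow_shift", "5",
    "--max_train_epochs", "20",
    "--save_every_n_epochs", "1",          # keep every epoch for picking later
    "--mixed_precision", "bf16",
    "--fp8_base",                          # fp8 base model
    "--gradient_checkpointing",
    "--sdpa",
    # "--blocks_to_swap", "20",            # uncomment on low-VRAM cards
    "--output_dir", "output",
    "--output_name", "my_action_lora",
]
subprocess.run(cmd, check=True)
```

The dataset.toml referenced above is where the clips, captions, resolution, and the 4~6 repeats per clip live; its exact keys are documented in the Musubi Tuner repo, so check there rather than guessing.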