Wan 2.2 I2V (with or without Acceleration) Basics: Accelerators & Optimizing Generation

Nov 6, 2025

Introduction

Welcome to my article series on setting up and understanding an efficient workflow for generating videos from still images using Wan 2.2. In the last article, we built a minimally viable workflow (one that takes way too long to generate videos) and learned the basics of ComfyUI usage. In this article, we will discuss ways to optimize generation time, which is the most important improvement you can make. Whether you're looking to churn out AI slop by the truckload or you're obsessed with making a single video project look good, you'll need to generate a ton of videos.

The most important advice I can give is that you shouldn't take this article as some kind of gospel or stone tablet of truth. This field moves very fast and this information, aside from the conceptual stuff from my last article, will become outdated quickly. Newer accelerators, base models, and finetunes will appear. Keep up to date, check out what's trending in the Models section of this website, search Google or even use Huggingface directly.

Disclaimer

I express my opinions on some of the resources in this article. My perspective is that of someone who makes mostly anime-style content. I do make realistic content sometimes, but I'm much more familiar with anime art and Danbooru tags than I am with photography and what makes realistic generations look good, so I sometimes get frustrated trying to produce decent realistic images. Some of these opinions might be completely invalid when applied to realistic-style video generation.

Community Workflows

At this point you should hopefully understand the basics and have enough knowledge to experiment with new workflows. There are many different options for optimizing your Wan workflow, and many of the developers of these tools will provide their own workflow for you to try. I will provide links to some of their pages in this article. You should definitely try them or at least use them as a reference; it will help you learn and develop your own workflow that you feel comfortable with in the long term. Always listen to the developers of a model or tool, especially over some random guy like me. This article isn't going to change to keep up with their development.

Your most important tools for improving your workflow and learning better techniques will be: 

  1. referencing developer READMEs and workflows

  2. downloading other users' videos from sites like CivitAI and then dragging the file into ComfyUI's workspace to see their workflow. 

By default, ComfyUI embeds the workflow used to generate every artifact into the file itself, whether it's a video, an image, or something else. There are people who either get their images/videos from a web API service that strips the metadata, or who have some irrational idea that they shouldn't share their "trade secrets" (as if anything they're doing is more intelligent or innovative than the research teams' work). But most files you find should have a workflow embedded.
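
For example, here's a minimal sketch of pulling the embedded workflow out of a ComfyUI-generated PNG with Python and Pillow. The filename is a placeholder, and video files keep this metadata in the container instead, where a tool like ffprobe is easier:

```python
# Minimal sketch: read the workflow JSON that ComfyUI embeds in a PNG's text chunks.
# Assumes Pillow is installed; "output_00001_.png" is a placeholder filename.
# This will raise a KeyError if the metadata was stripped by a web service.
import json
from PIL import Image

img = Image.open("output_00001_.png")
meta = img.info                          # PNG text chunks land here as plain strings
workflow = json.loads(meta["workflow"])  # the same node graph that loads when you drag the file in
print(len(workflow["nodes"]), "nodes in the embedded workflow")
```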

One more thing to keep in mind: you can Google the filenames of any resources used in a workflow to find them, unless the uploader renamed the files for some reason, which generally doesn't happen. That way, if someone used a LoRA or a base model that you liked, you can track it down with no information other than the workflow.

Kijai's Nodes

Many videos and recommended workflows will use "Kijai's nodes," aka ComfyUI-WanVideoWrapper. Note that you shouldn't need to follow the instructions to install manually; you can do it through the ComfyUI Manager.

Kijai is a very well known developer in the ComfyUI community. He created his own set of nodes for Wan video generation that are self-contained, and you can't really mix them with native nodes (the ones we used in the last article). Everything, including the sampler node, is different. They offer much more control, which also means they're much more likely to cause errors for an inexperienced user. I encourage you to research workflow settings based on your hardware, and experiment. There are countless Kijai workflows out there for people with different levels of hardware, so you should be able to find one that works. I will present some videos generated with a Kijai workflow, but keep in mind that unless you have a 5090 and the SageAttention dependency mentioned below, you will likely have problems.

Base Models

If you want to make decent-looking videos without reading the rest of the article, this is the section for you. There are Wan 2.2 finetunes available here on CivitAI right now. The two most popular at the time of writing are SmoothMix and DaSiWa. To simplify a lot, consider these checkpoints to be several Wan 2.2 LoRAs, including accelerators, baked into the base model. More LoRAs can still be used with them to emphasize movement further. They will handle most prompts that you can give them, including ones of the NSFW variety. However, you will still need concept LoRAs to "refine" certain movements and certain body parts if they were not in the initial frame. And you will have to look elsewhere for quantized versions if you do not have enough VRAM to use the model as-is. These checkpoints can be quite biased towards certain types of movement. SmoothMix in particular seems biased towards fast, bouncy movement, but that can also be useful depending on the video.

There are other, lesser-known base models created by research teams. Some of them have their own UIs, or encourage you to use their own UIs or run scripts, which is frankly a hassle. I haven't bothered with them yet. LightX2V, for example, has its own base model and an execution script that they claim is 2x faster than ComfyUI, but you have to run everything via the command line, which even for me (a programmer with 10+ years of experience) sounds like a nightmare for this particular use case. If this were my full-time job, sure, I could see the value in investing the effort to set up a command line generation workflow, but it isn't. I just make cartoon porn.

Accelerators

The second most straightforward and effective way to speed up your generation, accelerators are LoRAs that let Wan do in about 4 or 5 steps what took 28 steps in the last article's video. It sounds too good to be true, but it really does work. There is a small cost; I've found that unaccelerated videos can portray more fine detail in specific cases, so you may wish to keep that in mind. But those cases are rare and usually involve a lot of subtle, slow movement (like fluid slowly pouring out of an orifice, for example).

There are actually a couple of different accelerators available for Wan 2.2. You might have heard of lightx2v. This is a "distilled" LoRA that has been around since Wan 2.1 at least. More recently, there is Wan 2.2 Lightning, developed by the same team. There are many different ways to use accelerators and I am continuously surprised by creators who have permutations I've never seen before. A few examples:

  • Using lightx2v on the high noise model, but not the low noise, and using Lightning on both models

  • Using just Lightning or just lightx2v on both models

  • Virtually every permutation of LoRA weight values you can imagine, including seemingly insane values like 5.00 for lightx2v.

  • Personally, I've found use in reducing the strength below 1 slightly.

There's a lot to experiment with, and personally I feel like I've only scratched the surface. This, along with shift values, is another good place to fiddle with numbers if you want subtle variations in your video because you feel like it's close to what you want.
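
If it helps to see one of those permutations written down before you wire it up as LoRA loader nodes, here's a purely illustrative sketch. The filenames and strengths are placeholders, not recommendations:

```python
# Illustrative only: one accelerator permutation, written out as data before being
# wired up as LoRA loader nodes on the high and low noise models.
# Filenames and strength values are placeholders.
accelerators = {
    "high_noise": [
        ("lightx2v_i2v.safetensors", 2.0),           # hypothetical strength above 1
        ("wan2.2_lightning_high.safetensors", 0.9),  # slightly under 1, as mentioned above
    ],
    "low_noise": [
        ("wan2.2_lightning_low.safetensors", 0.9),
    ],
}

for stage, loras in accelerators.items():
    for name, strength in loras:
        print(f"{stage}: load {name} at strength {strength}")
```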

SageAttention

SageAttention is an alternative attention implementation that replaces the attention used by default when running Wan and performs significantly faster, with nodes to support it. The problem is, it can be a pain to install because of the very strict dependency chain you have to follow. It's also completely unavailable for anyone not using an NVIDIA GPU. Your GPU model, CUDA toolkit version, PyTorch version, and Python version all need to be in agreement. I think it might be the case that, as far as consumer-grade GPUs go, you need at least a 4000 series NVIDIA GPU to get SageAttention to work due to PyTorch's dependencies. Could be wrong, but looking at the CUDA compatibility chart seems to indicate that might be the case.
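
If you want to check where your setup stands before attempting an install, a quick version dump like the sketch below covers the pieces that have to agree. The compatibility rules themselves still come from SageAttention's README:

```python
# Quick sanity check of the dependency chain before attempting a SageAttention install.
# These are standard Python/PyTorch calls; compare the output against the
# compatibility notes in SageAttention's README.
import sys
import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA (PyTorch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))  # e.g. (8, 9) on a 4090
```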

Needless to say, this is a complicated process. The gains are quite significant, but accounts vary on how much SageAttention actually speeds up generation; in my tests for this article, it's about 3x. I won't post a full guide here because it deserves its own article. There are quite a few guides out there nowadays if you Google a phrase like "sageattention install guide comfyui" that can help you with this.

Example Generations

So that there are as few variables as possible, I'll be using the exact same prompt and image from the last article to generate videos using the tools mentioned in each section of this article. That way there can be some kind of meaningful comparison. The quality of the video is not the top concern for this comparison; generation time is, and these videos are here to inform you, not to be "the best video." So long as the quality of the video is at least as good as the final video of the first article, it is acceptable as a point of reference.

Most notably, however, every video I will generate here is in 720p, not 480p. Specifically, 720 x 960 dimensions. This will demonstrate how much faster using these accelerators is, even with higher resolution. Remember, you can download and drag any of these videos into your ComfyUI workspace to see the workflow used to generate it.

For reference, the video I generated in my first article took about 7 minutes to generate, or around 400 seconds.
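
To put the numbers below in perspective, here's the rough arithmetic, assuming the first article's video was 480 x 640 (matching the 3:4 aspect of 720 x 960):

```python
# Rough arithmetic behind the comparison. 480 x 640 is an assumed frame size
# for the first article's video; 720 x 960 is the resolution used for every video below.
baseline_seconds = 400
baseline_pixels = 480 * 640
new_pixels = 720 * 960

print(new_pixels / baseline_pixels)  # 2.25 -> each frame has ~2.25x more pixels
print(baseline_seconds / 57)         # ~7x faster for the quickest SageAttention runs below
```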

lightx2v, native nodes

Time taken: 156s

Steps: 4, 2 on high noise

Sampler: euler

I had to fiddle with the shift value on this one so it didn't look terrible. I set it to 4, whereas in some of the videos below, namely LightX2V's official workflow, the shift value is 1.

lightx2v, Kijai's nodes

Time taken: 179s

Steps: 4, 2 on high noise

Sampler: uni_pc

This one uses LightX2V's officially recommended ComfyUI workflow, virtually unchanged except that I bypassed the Torch Compile node and changed the attention model in the Model Loader node to sdpa attention, which I believe is the default. To use it, click the download button, save it to a folder, and then open the .json file in ComfyUI. Or, of course, you can just download this video and drag it into ComfyUI. Notably, it includes a custom sigma schedule. Sigmas essentially tell Wan whether to focus on composition and dynamic movement (high values) or fine details (low values) at each step. I have not really experimented with sigmas yet, but they look like yet another good way to create variation in your video output. As you might imagine, I'm learning as much from this tutorial as you are from reading it.
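
As a purely made-up illustration of what a custom sigma schedule looks like (these are not the values from LightX2V's workflow), the shape is just a list that starts high and falls to zero over the steps:

```python
# Made-up sigma schedule, only to show the shape: 4 steps need 5 boundary values,
# starting high (composition, large movement) and falling to zero (fine detail).
custom_sigmas = [1.0, 0.94, 0.85, 0.60, 0.0]

for step, (start, end) in enumerate(zip(custom_sigmas, custom_sigmas[1:])):
    print(f"step {step}: sigma {start:.2f} -> {end:.2f}")
```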

lightx2v, Kijai's nodes, SageAttention

Time taken: 59s

Steps: 4, 2 on high noise

Sampler: uni_pc

As you can see from the time taken, SageAttention does make a huge difference. It's a shame that it can be such a pain to get running if you don't have a 5000 series card.

Lightning, native nodes

Time taken: 173s

Steps: 4, 2 on high noise

Sampler: euler

One problem I often find with the Lightning LoRA is that it tends to change the style and often adds a little "too much" movement. You can see that the original style from the initial frame isn't respected as well here.

Lightning, native nodes, SageAttention

Time taken: 57s

Steps: 4, 2 on high noise

Sampler: euler

Not much to comment on here. It's similar to the above video, except it took a third of the time to generate. She does slide along the wall slowly, but that's something that can easily be tweaked out through either prompt changes or messing with other variables.

lightx2v + Lightning, Kijai's nodes, SageAttention

Time taken: 144s

Steps: 4, 2 on high noise

Sampler: euler

For this and the below videos, I'll be using the custom workflow that I normally use for my own projects. This one surprised me because of how long it took to generate compared to lightx2v's official workflow that uses the same LoRAs. It tells me I still have a lot to learn, but we know for a fact that SageAttention still made a big difference thanks to the other examples. I used shift 8 for this one.

SmoothMix + Lightning, Kijai's nodes, SageAttention

Time taken: 67s

Steps: 5, 2 on high noise

Sampler: euler

Here, I'm able to increase the steps above 4, even though LightX2V is specifically made for 4 steps and it is baked into this checkpoint (from what I understand, anyway). Apparently, the process of training these checkpoints makes them more flexible and better at prompt adherence with greater numbers of steps. In the above workflows, the video would have degraded above 4 steps; there would be too many brass casings or something else weird going on.

DaSiWa + Lightning, Kijai's nodes, SageAttention

Time taken: 67s

Steps: 5, 2 on high noise

Sampler: euler

I hate to show bias in this article, but I think it's pretty obvious which of these trials turned out the best. DaSiWa has become my go-to. SmoothMix and even the basic Wan model still have their uses, but DaSiWa just produces the content I expect based on my prompt more reliably.

Conclusion

From the above generation trials, we can take away a few key points:

  • SageAttention should be used if possible because it is massively more efficient while sacrificing little, if any, quality.

  • Unless my configuration is bad, it's hard to say that Kijai's nodes are faster. However, they do offer more control for advanced users, such as the ability to supply your own sigma schedule to the sampler, among many other things.

  • Lightning can be useful but it does noticeably change/degrade the art style of the original image. This probably matters less if you use a realistic image as the start frame.

  • Checkpoints with acceleration baked in offer more versatility and better prompt adherence than using basic Wan and accelerator LoRAs. You can increase the steps more than you normally could while still maintaining coherence.

  • lightx2v + Lightning together can be useful for scenes with a lot of camera changes, pose changes or dynamic movement. Otherwise, it will probably generate too much motion. Going over 4 steps with this configuration also tends to be a bad idea.

Now that we understand how to speed up our video generation workflow without degrading quality, we can focus on making the video as good and as consistent as possible. I don't claim to be a master, but my videos have only gotten better over time, and frankly, keeping the output consistent over a minute or more of stitched-together clips can be pretty difficult. There are a few major techniques I'll go over in the next article.
