I finally got around to testing Wan 2.1, the new video model from Alibaba, and it’s way more accessible than I expected. The best part? It runs locally in ComfyUI without needing crazy hardware. Here’s how I set it up and what worked (and didn’t) on my mid-range GPU.
Picking the Right Model
The naming scheme threw me off at first, but it turns out to be simpler than it looks. For example, `wan2.1-i2v-14b-480p-q2_k.gguf` breaks down like this (there's a quick parsing sketch after the list):
- i2v: Image-to-video (there's also a text-to-video version, t2v).
- 14b: 14B parameters, bigger than most open models but still manageable.
- 480p: Output resolution (a 720p variant exists too).
- q2_k: Quantization level. Lower numbers mean smaller files but slightly worse quality.
Not part of the filename scheme, but worth grabbing at the same time: `4x_foolhardy_Remacri.pth` as an upscale model for the output.
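To keep the variants straight, I ended up thinking of the filename as a handful of fields. Here's a rough Python sketch of that mental model; the field labels are mine, not anything official:

```python
# Rough sketch: split a Wan 2.1 GGUF filename into its parts.
# Field names are my own labels, not official terminology.
def parse_wan_filename(name: str) -> dict:
    stem = name.removesuffix(".gguf")          # "wan2.1-i2v-14b-480p-q2_k"
    family, task, size, res, quant = stem.split("-", 4)
    return {
        "family": family,    # "wan2.1"
        "task": task,        # "i2v" (image-to-video) or "t2v" (text-to-video)
        "size": size,        # "14b" or "1.3b" parameter count
        "resolution": res,   # "480p" or "720p"
        "quant": quant,      # "q2_k", "q4_0", "q5_1", ... quantization level
    }

print(parse_wan_filename("wan2.1-i2v-14b-480p-q2_k.gguf"))
```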
I tried a few quantized versions on my RTX 3060 (12GB VRAM). The `q4_0` variant at 480p worked without offloading, but 720p needed `q5_1` and some patience. If you're on a 16GB+ card, skip the quantized versions and grab the full fp16 model from Hugging Face.
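If you're not sure which quant to start with, it helps to check how much VRAM you actually have first. A minimal check, assuming the CUDA build of PyTorch that ComfyUI already uses:

```python
# Minimal VRAM check before picking a quantization level.
# Assumes a CUDA build of PyTorch (ComfyUI already ships one).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB VRAM")
    # Rule of thumb from my own runs, not an official guide:
    # ~12 GB -> q4_0 at 480p; 16 GB+ -> full fp16 weights.
else:
    print("No CUDA GPU detected")
```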
Setting It Up
No fancy install steps: just download the GGUF file and dump it in `ComfyUI/models/diffusion_models`. The text encoder (`umt5_xxl`) and VAE go in their usual folders. I missed the CLIP Vision file at first, which caused a weird error until I dropped it into `clip_vision`.
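Since that "weird error" turned out to be nothing more than a file in the wrong folder, a tiny sanity-check script saves some head-scratching. The filenames below are placeholders; swap in whatever you actually downloaded:

```python
# Quick check that the model files landed in the folders ComfyUI expects.
# Filenames are placeholders for whatever you actually downloaded.
from pathlib import Path

COMFY = Path("ComfyUI")
expected = {
    "models/diffusion_models": "wan2.1-i2v-14b-480p-q4_0.gguf",
    "models/text_encoders":    "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "models/vae":              "wan_2.1_vae.safetensors",
    "models/clip_vision":      "clip_vision_h.safetensors",
}

for folder, filename in expected.items():
    path = COMFY / folder / filename
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:8} {path}")
```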
The workflow was straightforward. I dragged in one of the example graphs from the ComfyUI examples page, and it auto-prompted me to download missing nodes. Pro tip: if you're low on VRAM, use the `fp8` text encoder instead of `fp16`. It's slightly slower but saves a few gigs.
First Results
I fed it a 512×768 input image and set the resolution to 848×480 (the 480p model’s native size). The output was… decent? Motion was smooth, but fine details got muddy. Upping the CFG to 10 helped, though it introduced some flickering. For shorter clips (under 5 seconds), the 1.3B model actually held up surprisingly well—way faster and barely any quality drop.
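One thing that tripped me up was getting width, height, and frame count to line up. As far as I can tell, Wan-style models want dimensions divisible by 16, and clip length is just frames over fps. A back-of-the-envelope helper, assuming the 16 fps output rate I've seen quoted for Wan 2.1 (double-check against your workflow's defaults):

```python
# Back-of-the-envelope helpers for picking Wan 2.1 settings.
# Assumes 16 fps output, which is what I've seen quoted for Wan 2.1.
def snap_to_multiple(value: int, multiple: int = 16) -> int:
    """Round a dimension down to the nearest multiple (models want /16 sizes)."""
    return (value // multiple) * multiple

def frames_for_seconds(seconds: float, fps: int = 16) -> int:
    return int(seconds * fps) + 1   # +1 for the initial frame

print(snap_to_multiple(848), snap_to_multiple(480))  # 848 480
print(frames_for_seconds(5))                         # 81
```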
Biggest surprise: The VAE handles 1080p without breaking a sweat, even on my card. I’m still tweaking sampler settings (UniPC for speed, Euler A for balance), but so far, it’s a solid alternative to Hunyuan for local video gen.