ComfyUI Text-to-Video Workflow: Create Videos With Low VRAM

Let’s be real—CogVideoX-5B doesn’t play nice with 12GB VRAM. I hit the same wall, so I tweaked a workflow that actually works for lower VRAM setups, even down to 8GB. Here’s how it went.

The Model

CogVideoX is an open-source text-to-video model from Tsinghua University and Zhipu AI. There are two versions floating around:

CogVideoX-2B: Lighter, Apache 2.0 licensed, and runs on tighter hardware.

CogVideoX-5B: Better quality, but heavier. Sometimes throws errors mid-run—just rerun it, and it usually pushes through.

The Workaround

I skipped the 5B for a 2B quantized version and paired it with a few optimizations. Here’s what worked:

Grab the model: Download the fp8 quantized checkpoint from ComfyUI-CogVideoXWrapper (it’s in the repo’s docs).

Drop it in place: Save it to ComfyUI/models/diffusion_models/.

Use the wrapper node: The custom node handles memory better by slicing the workload.

If you hit errors, try reducing the resolution or frames. It’s not perfect, but it’s the only way I got it stable on 12GB.
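The “drop it in place” step can be scripted. Here’s a minimal Python sketch; the checkpoint filename below is a placeholder, so check the ComfyUI-CogVideoXWrapper repo docs for the real fp8 file name and download link:

```python
from pathlib import Path
import shutil

# Placeholder name -- the actual fp8 checkpoint filename is listed
# in the ComfyUI-CogVideoXWrapper repo docs.
CHECKPOINT = "cogvideox_2b_fp8.safetensors"

# Make sure the target folder exists inside your ComfyUI install.
target_dir = Path("ComfyUI") / "models" / "diffusion_models"
target_dir.mkdir(parents=True, exist_ok=True)

# Move the downloaded file into place (assumes it landed in ~/Downloads).
downloaded = Path.home() / "Downloads" / CHECKPOINT
if downloaded.exists():
    shutil.move(str(downloaded), str(target_dir / CHECKPOINT))
    print(f"Moved checkpoint to {target_dir / CHECKPOINT}")
else:
    print(f"Download {CHECKPOINT} first, then rerun this script.")
```

Restart ComfyUI after moving the file so the wrapper node can see it in the model list.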

Why This Works

The 2B model trades some quality for speed, but the fp8 quantization keeps it usable. For context, the 5B version needs at least 16GB VRAM to run smoothly—otherwise, it’ll crash halfway.
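A quick back-of-envelope calculation makes the gap concrete. This only counts the model weights (it ignores activations, the VAE, and the text encoder, so real VRAM use is higher), but it shows why fp8 on the 2B model fits where fp16 on the 5B doesn’t:

```python
# Rough weight-memory estimate: parameter count * bytes per parameter.
# Ignores activations, VAE, and text encoder, so actual VRAM use is higher.
def weight_gb(params_billions: float, bytes_per_param: int) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

fp16_5b = weight_gb(5, 2)  # 5B params at 2 bytes each (fp16)
fp8_2b = weight_gb(2, 1)   # 2B params at 1 byte each (fp8)

print(f"5B fp16 weights: ~{fp16_5b:.1f} GB")  # ~9.3 GB before overhead
print(f"2B fp8  weights: ~{fp8_2b:.1f} GB")   # ~1.9 GB before overhead
```

That ~7 GB difference in weights alone, before any activation memory, is why the 5B model keeps crashing on a 12GB card while the quantized 2B leaves headroom even on 8GB.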

For more details, check the original CogVideoX paper. And if you’re stuck, the ComfyUI Discord usually has fixes floating around.

Download Workflows