NVIDIA just dropped their Cosmos series, and if you’re into AI video generation, this one’s worth checking out. I spent the last few days testing it in ComfyUI, and here’s how it went.
First things first: you’ll need a few files. The text encoder and VAE are over on Hugging Face (grab them here). Save `oldt5_xxl_fp8_e4m3fn_scaled.safetensors` in `ComfyUI/models/text_encoders` and `cosmos_cv8x8x8_1.0.safetensors` in `ComfyUI/models/vae`. Fair warning: the text encoder is v1.0, not the newer 1.1 version you might’ve seen in models like Flux.
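If you’d rather script the downloads, here’s a minimal sketch using `huggingface_hub`. The repo ID below is a placeholder (the real one is behind the link above), so swap it in before running:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo ID -- substitute the actual Hugging Face repo from the link above.
REPO_ID = "some-org/cosmos-text-encoder-and-vae"

# Download each file straight into the matching ComfyUI model folder.
hf_hub_download(
    repo_id=REPO_ID,
    filename="oldt5_xxl_fp8_e4m3fn_scaled.safetensors",
    local_dir="ComfyUI/models/text_encoders",
)
hf_hub_download(
    repo_id=REPO_ID,
    filename="cosmos_cv8x8x8_1.0.safetensors",
    local_dir="ComfyUI/models/vae",
)
```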
For the diffusion models, you’ve got options. The repackaged safetensors are easier to work with; just drop them into `ComfyUI/models/diffusion_models`. If you want the original `.pt` files, NVIDIA’s got the 7B and 14B versions for both text-to-video and image-to-video (Model Link).
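Once everything’s in place, your models folder should look roughly like this (the diffusion model filename is just an example; yours will match whichever checkpoint you downloaded):

```
ComfyUI/models/
├── text_encoders/
│   └── oldt5_xxl_fp8_e4m3fn_scaled.safetensors
├── vae/
│   └── cosmos_cv8x8x8_1.0.safetensors
└── diffusion_models/
    └── cosmos-1.0-diffusion-7b-text2world.safetensors
```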
Setting It Up
I just updated ComfyUI to the latest version (always do this first—skipping it causes half the issues people complain about). Then I dragged in the workflow JSON files. No fancy steps, just load and go.
For text-to-video, the key node is the diffusion model loader. Pick either the 7B or 14B safetensor, depending on your VRAM. The 7B runs fine on my 24GB GPU, but the 14B needs a bit more breathing room.
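If you’re not sure which way to go, a quick VRAM check settles it. The thresholds here are my own rule of thumb from testing, not official requirements:

```python
import torch

# Rule-of-thumb checkpoint picker; the 32 GB cutoff is my guess from testing,
# not an official NVIDIA requirement.
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"VRAM: {total_gb:.1f} GB")
print("Start with the", "14B" if total_gb >= 32 else "7B", "checkpoint")
```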
Here’s the thing: Cosmos works best with the new `res_multistep` sampler. It’s the one NVIDIA used in their paper, and yeah, it makes a difference. The default Karras scheduler works, but don’t be afraid to tweak it.
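For reference, here’s how those sampler settings look in an API-format workflow export, expressed as a Python dict. The node IDs it references and my seed/steps/cfg values are illustrative; lift the real ones from your own export:

```python
# KSampler node as it appears in ComfyUI's API-format JSON. The [node_id, output]
# references and the seed/steps/cfg values are illustrative, not canonical.
ksampler = {
    "class_type": "KSampler",
    "inputs": {
        "model": ["1", 0],         # diffusion model loader node
        "positive": ["3", 0],      # CLIPTextEncode for the positive prompt
        "negative": ["4", 0],      # CLIPTextEncode for the negative prompt
        "latent_image": ["5", 0],  # empty latent video node
        "seed": 42,
        "steps": 20,
        "cfg": 7.0,
        "sampler_name": "res_multistep",  # the sampler from NVIDIA's paper
        "scheduler": "karras",
        "denoise": 1.0,
    },
}
```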
Image-to-video is where things get interesting. Load your image, resize it to match Cosmos’s defaults (no guesswork; just use the Resize Image node), and pipe it through the VAE encoder. The output’s a 5-second clip by default, but you can adjust the frame count if you’re feeling experimental.
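If you’d rather pre-process the image outside ComfyUI, the resize is trivial. The 1280x704 target below is what the example workflow I used defaulted to; double-check it against your own workflow:

```python
from PIL import Image

# Target resolution from the example workflow I tested -- verify against the
# defaults in your own Cosmos workflow before relying on it.
WIDTH, HEIGHT = 1280, 704

img = Image.open("input.png").convert("RGB")
img = img.resize((WIDTH, HEIGHT), Image.LANCZOS)
img.save("input_resized.png")
```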
Quick Notes
- Negative prompts actually work here, unlike some other models. Use them.
- The 7B model’s a safer bet if your GPU’s not top-tier.
- Torch 2.5.1 helps if you’re using `torch.compile()` (see the sketch below).
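That last point deserves a quick illustration. This is the generic `torch.compile()` pattern on a toy module, not ComfyUI-specific wiring (inside ComfyUI you’d reach for a compile node instead), but the idea is the same:

```python
import torch
import torch.nn as nn

# Toy stand-in for a diffusion model's denoiser -- the point is the pattern.
# torch.compile wraps the module; the first forward pass triggers compilation,
# and subsequent calls reuse the compiled graph.
denoiser = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64))
compiled = torch.compile(denoiser, mode="reduce-overhead")

x = torch.randn(1, 64)
out = compiled(x)  # slow first call (compilation), fast afterwards
```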