Welcome to 2025! NVIDIA kicked off the year with a big announcement: the Cosmos series of diffusion models. If you’re into AI, this is an exciting moment! Today, I’ll guide you through testing NVIDIA Cosmos models in ComfyUI in the simplest way possible. Let’s jump in.
What You’ll Need
1. Text Encoder and VAE
Download the files here:
Files and where they go:
- oldt5_xxl_fp8_e4m3fn_scaled.safetensors -> ComfyUI/models/text_encoders
- cosmos_cv8x8x8_1.0.safetensors -> ComfyUI/models/vae
Note: The oldt5_xxl encoder is T5-XXL version 1.0, different from the version 1.1 used in other models like Flux.
2. Diffusion Models
Get them here:
Place them in: ComfyUI/models/diffusion_models
Want the original .pt files? Official links:
Key Terms:
- Text to World = Text to Video
- Video to World = Image/Video to Video
How to Set It Up
1. Download Files
Make sure you download all the required files:
- Text encoder and VAE files for Cosmos.
- Diffusion model safetensors (7B and/or 14B versions).
2. Save Files in the Right Folders
Put each file in its specific folder as outlined above. This step is crucial for ComfyUI to recognize the models.
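Before launching ComfyUI, you can sanity-check the layout with a short script. This is just a convenience sketch, assuming ComfyUI is installed in a folder named ComfyUI next to where you run it; adjust the root path and add your diffusion model file names as needed.

```python
# Sanity check: confirm the Cosmos support files sit where ComfyUI expects them.
# COMFYUI_ROOT is an assumption; point it at your actual install directory.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")

expected_files = {
    "text encoder": "models/text_encoders/oldt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "VAE": "models/vae/cosmos_cv8x8x8_1.0.safetensors",
    # Add the diffusion model(s) you downloaded, e.g.:
    # "diffusion model": "models/diffusion_models/<your_cosmos_7b_or_14b_file>.safetensors",
}

for role, rel_path in expected_files.items():
    path = COMFYUI_ROOT / rel_path
    print(f"[{'OK' if path.is_file() else 'MISSING'}] {role}: {path}")
```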
3. Update ComfyUI
Before running the workflows, update ComfyUI to the latest version. Many issues happen because people forget this step!
4. Load the Workflows
You’ll need two workflows:
- Text-to-Video
- Image-to-Video
Download the JSON workflow files and save them locally. Open ComfyUI and load the workflows by dragging the JSON files into the interface.
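Dragging the JSON into the interface is all you need for this guide, but if you’d rather trigger runs from a script, ComfyUI also accepts workflows over HTTP. Here is a minimal sketch, assuming ComfyUI is running locally on the default port 8188 and that you exported the workflow with “Save (API Format)” (the file name below is hypothetical):

```python
# Queue an exported (API-format) Cosmos workflow against a local ComfyUI server.
# Assumes ComfyUI is running at 127.0.0.1:8188; the workflow file name is just an example.
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188/prompt"
WORKFLOW_FILE = "cosmos_text_to_video_api.json"  # hypothetical name

with open(WORKFLOW_FILE, "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
request = urllib.request.Request(
    COMFYUI_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # returns a prompt_id on success
```

The response includes a prompt_id you can use to check progress via ComfyUI’s /history endpoint.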
Running Text-to-Video
- Load the Model: Start by loading the Cosmos 7B or 14B Text to World safetensors model in the diffusion model loader node.
- Input Text: Enter your prompt in the text encoder. Cosmos supports both positive and negative prompts for more precise control.
- Sampler Settings: Use the new “res_multistep” sampler for the best results. The default scheduler is set to Karras, but feel free to experiment with others (see the snippet after this list for how these settings appear in the workflow JSON).
- Output Settings: Set the output to “Video Combine” and save as an MP4. You can also adjust the frame rate and resolution as needed.
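If you open the exported API-format JSON, the sampler settings from this list map onto a KSampler entry. The dict below is only an illustration: the node IDs, seed, steps, and CFG values are placeholders, and only sampler_name and scheduler reflect the settings discussed above.

```python
# Rough shape of the KSampler entry in an API-format workflow (shown as a Python dict).
# Node IDs ("3", "4", ...) and seed/steps/cfg values are illustrative placeholders.
sampler_node = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["4", 0],         # diffusion model loader node
            "positive": ["6", 0],      # positive prompt (text encoder)
            "negative": ["7", 0],      # negative prompt (text encoder)
            "latent_image": ["5", 0],  # latent/video latent node
            "sampler_name": "res_multistep",
            "scheduler": "karras",
            "seed": 0,
            "steps": 20,
            "cfg": 7.0,
            "denoise": 1.0,
        },
    }
}
```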
Running Image-to-Video
- Load Your Image: Use the “Load Image” node to input an image. Resize or crop it with the “Resize Image” node to match Cosmos’s default dimensions (an offline alternative is sketched after this list).
- Configure Nodes: Pass the image through the VAE encoder and connect it to the “Image to Video Latent” node. Adjust parameters like frame count and batch size as needed.
- Generate Video: Run the workflow and save the output as an MP4. You’ll get a short video (default: 5 seconds) based on your input image.
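If you’d rather prepare the image outside ComfyUI instead of using the “Resize Image” node, a simple Pillow crop-and-resize works too. The 1280x704 target below is an assumption based on common Cosmos example settings; match it to whatever resolution your workflow actually uses.

```python
# Center-crop and resize an input image to the resolution used by the
# "Image to Video Latent" step. 1280x704 is an assumed working resolution.
from PIL import Image, ImageOps

TARGET_WIDTH, TARGET_HEIGHT = 1280, 704  # adjust to your workflow's settings

img = Image.open("input.png").convert("RGB")
# ImageOps.fit center-crops to the target aspect ratio, then resizes.
img = ImageOps.fit(img, (TARGET_WIDTH, TARGET_HEIGHT), method=Image.LANCZOS)
img.save("input_resized.png")
```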
Quick Tips
- Hardware Requirements: The 7B model is easier on your PC’s VRAM. The 14B model requires more resources.
- Negative Prompts: Unlike some other models, Cosmos supports negative prompts. Use them for finer control over outputs.
- Tiling: Cosmos uses a default tile size of 240 for VAE decoding, so there’s no need to configure this manually.
- Torch Compatibility: Install PyTorch 2.5.1 for better performance when using “Torch Compile” settings.
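To see what your environment currently has before fiddling with Torch Compile, a quick check:

```python
# Report the installed PyTorch version and basic capability flags.
import torch

print("PyTorch version:", torch.__version__)  # ideally 2.5.1 per the tip above
print("torch.compile available:", hasattr(torch, "compile"))
print("CUDA available:", torch.cuda.is_available())
```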