Hunyuan Video: First Try with ComfyUI’s Native Workflow
I finally got around to testing Hunyuan’s text-to-video model in ComfyUI, and honestly, it’s way simpler than I expected. No complicated API keys or external tools—just drag, load, and run. Here’s how it went.
I started with the pre-built workflow from the ComfyUI examples page. No setup needed: I downloaded the example image and dragged it onto my canvas (ComfyUI embeds the workflow in the image's metadata). The workflow auto-prompted me to download the required models, which was a nice touch. If you're doing this manually, you'll need four files (a download sketch follows the list):
- hunyuan_video_vae_bf16.safetensors (for the VAE loader)
- hunyuan_video_t2v_720p_bf16.safetensors (diffusion model)
- clip_l.safetensors and llava_llama3_fp8_scaled.safetensors (text encoders)
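If you'd rather script the downloads, here's a minimal sketch using huggingface_hub. The repo id and sub-paths are my assumptions based on the repackaged Comfy-Org uploads linked from the examples page, so verify them before running:

```python
import pathlib
import shutil

from huggingface_hub import hf_hub_download

COMFY = pathlib.Path("ComfyUI")  # adjust to your install path
REPO = "Comfy-Org/HunyuanVideo_repackaged"  # assumed repo id; verify on Hugging Face

# remote path in the repo -> ComfyUI model subfolder (paths are assumptions)
FILES = {
    "split_files/vae/hunyuan_video_vae_bf16.safetensors": "models/vae",
    "split_files/diffusion_models/hunyuan_video_t2v_720p_bf16.safetensors": "models/diffusion_models",
    "split_files/text_encoders/clip_l.safetensors": "models/text_encoders",
    "split_files/text_encoders/llava_llama3_fp8_scaled.safetensors": "models/text_encoders",
}

for remote, subdir in FILES.items():
    cached = hf_hub_download(repo_id=REPO, filename=remote)  # lands in the HF cache
    dest = COMFY / subdir / pathlib.Path(remote).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest)
    print("placed", dest)
```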
Loading the Models
The workflow uses a DualCLIPLoader node for the text encoders. I dropped clip_l.safetensors and llava_llama3_fp8_scaled.safetensors into the ComfyUI/models/text_encoders folder, and ComfyUI detected them right away. The VAE lives in models/vae and the diffusion model in models/diffusion_models; once everything was in place, I just selected the files in their respective loader nodes. The quick check below confirms the layout.
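A tiny script is enough to verify the files ended up where the loader nodes look (note: older ComfyUI installs used models/unet instead of models/diffusion_models):

```python
import pathlib

COMFY = pathlib.Path("ComfyUI")  # adjust to your install path
expected = [
    "models/vae/hunyuan_video_vae_bf16.safetensors",
    "models/diffusion_models/hunyuan_video_t2v_720p_bf16.safetensors",
    "models/text_encoders/clip_l.safetensors",
    "models/text_encoders/llava_llama3_fp8_scaled.safetensors",
]
for rel in expected:
    status = "ok" if (COMFY / rel).exists() else "MISSING"
    print(f"{status:8} {rel}")
```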
Running the Workflow
I kept most settings at their defaults but tweaked two things (both are easy to script; see the sketch after the list):
- Resolution: The EmptyHunyuanLatentVideo node defaults to 720p. You can lower it to 480p if you're tight on VRAM.
- Sampling steps: I tried 6 steps for a quick test (took ~4 minutes on my 4090), but bumping it to 20 gave noticeably smoother motion.
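Both tweaks can be applied programmatically if you export the workflow via Save (API Format) and POST it to the local server. A sketch, assuming a hypothetical export named hunyuan_t2v_api.json and the default server port; depending on the workflow, the steps setting may live on a KSampler or a BasicScheduler node:

```python
import json
import urllib.request

# "hunyuan_t2v_api.json" is a hypothetical filename for the API-format export
with open("hunyuan_t2v_api.json") as f:
    wf = json.load(f)

for node in wf.values():
    ctype = node.get("class_type")
    if ctype == "EmptyHunyuanLatentVideo":
        # roughly 480p; dimensions should stay multiples of 16
        node["inputs"].update(width=848, height=480)
    elif ctype in ("KSampler", "BasicScheduler"):
        node["inputs"]["steps"] = 20  # 6 renders fast but motion looks rough

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default ComfyUI address/port
    data=json.dumps({"prompt": wf}).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```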
Unexpected Wins
- The model handles bilingual prompts (Chinese/English) without extra config.
- Setting the length to "1" in EmptyHunyuanLatentVideo generates a static image, which is handy for testing compositions (sketched below).
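The single-frame trick is scriptable the same way. A sketch against the same hypothetical hunyuan_t2v_api.json export, assuming the prompt goes through a standard CLIPTextEncode node (this overwrites every text-encode node, which is fine for the single-prompt Hunyuan graph):

```python
import json

with open("hunyuan_t2v_api.json") as f:  # same hypothetical export as above
    wf = json.load(f)

for node in wf.values():
    ctype = node.get("class_type")
    if ctype == "EmptyHunyuanLatentVideo":
        node["inputs"]["length"] = 1  # one frame -> static image
    elif ctype == "CLIPTextEncode":
        # mixed Chinese/English prompt, no extra config needed
        node["inputs"]["text"] = "一只猫在花园里散步, a cat strolling through a garden"

# queue it with the same POST to /prompt as in the previous sketch
```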
Annoyances
- The VAE is heavy. Even at 720p, I hit VRAM limits until I switched the diffusion model's weight dtype to FP8 in its loader node (sketched below).
- Outputs sometimes ignore the prompt for the first few frames. Adding "--v 2" to the prompt helped, but it's hit-or-miss.
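For reference, here's roughly what the FP8 switch looks like against the exported workflow. The weight_dtype input sits on the diffusion model loader (class_type UNETLoader in the API-format JSON); the exact option name is my assumption, so check the node's dropdown in your build:

```python
import json

with open("hunyuan_t2v_api.json") as f:  # same hypothetical export as above
    wf = json.load(f)

for node in wf.values():
    if node.get("class_type") == "UNETLoader":
        # assumed option name; check the weight_dtype dropdown in your build
        node["inputs"]["weight_dtype"] = "fp8_e4m3fn"
```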
For more details, the Hunyuan Video GitHub repo breaks down the architecture. Or just grab the workflow and tweak it—it’s surprisingly flexible.
Download Files
- VAE: hunyuan_video_vae_bf16.safetensors
- Diffusion Model: hunyuan_video_t2v_720p_bf16.safetensors
- Text Encoder: llava_llama3_fp8_scaled.safetensors
- CLIP L: clip_l.safetensors