Hi there! Imagine having a video in English and effortlessly syncing its lip movements to audio in another language. With Latent Sync, that's exactly what you can do.
In this blog post, we'll explore how to use the Latent Sync workflow to achieve seamless lip-syncing for your videos.
What is Latent Sync?
Latent Sync is an advanced AI-based framework developed by researchers at ByteDance and Beijing Jiaotong University. It’s designed to map phonemes (the smallest units of sound in speech) to accurate lip movements, ensuring flawless synchronization.
Key Features:
- Unmatched Accuracy: Incorporates TREPA for superior temporal consistency.
- Virtual Avatars: Create human-like speech patterns for digital avatars.
- Flexibility: Works with various video lengths and audio files.
Setting Up the Workflow
Requirements:
- Python 3.8 to 3.11 (avoid 3.12, as Mediapipe isn't compatible with it yet).
- FFmpeg: download it and add it to your system PATH. A common location to save it is:
C:\ffmpeg
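Before going further, it can save time to confirm both requirements at once. A minimal sketch (the `check_environment` helper is hypothetical, not part of Latent Sync; C:\ffmpeg is just the conventional location above, and the check only cares that ffmpeg is resolvable from PATH):

```python
import shutil
import sys

def check_environment():
    """Return a list of setup problems; an empty list means the basics look OK."""
    problems = []
    # Mediapipe (a Latent Sync dependency) supports Python 3.8-3.11.
    if not (3, 8) <= sys.version_info[:2] <= (3, 11):
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} "
            "is outside the supported 3.8-3.11 range"
        )
    # ffmpeg must be reachable from PATH, wherever it was saved.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (e.g. add C:\\ffmpeg\\bin)")
    return problems
```

Running `check_environment()` and printing the returned list tells you exactly what to fix before installing anything else.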
Install Dependencies
- Clone the Latent Sync Wrapper repository into your ComfyUI custom nodes directory and install its requirements via the command line:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-LatentSyncWrapper.git
cd ComfyUI-LatentSyncWrapper
pip install -r requirements.txt
Add the Models
- Download latentsync_unet.pt and place it in the checkpoints folder of the Latent Sync Wrapper.
- Create a whisper folder within checkpoints and save the tiny.pt file there.
Place them in the following structure:
ComfyUI/custom_nodes/ComfyUI-LatentSyncWrapper/checkpoints/
├── latentsync_unet.pt
└── whisper/
└── tiny.pt
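If you want to confirm the layout before launching ComfyUI, a small helper like this (hypothetical, not part of the wrapper) can report anything missing:

```python
from pathlib import Path

# Files the wrapper expects, relative to its checkpoints directory.
REQUIRED = ["latentsync_unet.pt", "whisper/tiny.pt"]

def missing_checkpoints(checkpoints_dir):
    """Return the required files that are not present under checkpoints_dir."""
    base = Path(checkpoints_dir)
    return [f for f in REQUIRED if not (base / f).is_file()]
```

Point it at ComfyUI/custom_nodes/ComfyUI-LatentSyncWrapper/checkpoints/; an empty list means both model files are in place.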
Run ComfyUI as Administrator
If you encounter PYTHONPATH errors, running ComfyUI with admin privileges should resolve the issue.
Known Limitations
- Works best with clear, frontal face videos.
- Doesn’t support anime/cartoon faces yet.
- Input video must be 25 FPS (automatically converted if needed).
- Ensure the face is visible throughout the video.
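On the 25 FPS point: the wrapper converts mismatched input automatically, but you can also re-encode up front. A minimal sketch assuming ffmpeg is on your PATH (`build_25fps_cmd` and `convert_to_25fps` are hypothetical helpers, not part of the wrapper):

```python
import subprocess

def build_25fps_cmd(src, dst):
    """ffmpeg arguments to re-encode src to 25 FPS, copying the audio stream."""
    return ["ffmpeg", "-y", "-i", src, "-vf", "fps=25", "-c:a", "copy", dst]

def convert_to_25fps(src, dst):
    # Raises CalledProcessError if ffmpeg fails (e.g. unreadable input file).
    subprocess.run(build_25fps_cmd(src, dst), check=True)
```

Re-encoding yourself means the frame-rate conversion happens once, with settings you control, instead of on every run.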
Results That Speak for Themselves
Once the process is complete, you’ll notice how naturally the video’s lip movements align with the new audio. The tool analyzes the audio, breaks it into phonemes, and ensures each phoneme matches the correct lip shape.
For example:
- The “p” in “perfect” has a precise visual representation.
- Complex sounds and subtle expressions are seamlessly matched.
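To make the phoneme idea concrete, here is a toy lookup table; the real model learns this mapping in latent space, and these pairings are purely illustrative:

```python
# Illustrative phoneme-to-mouth-shape pairs (NOT the model's actual mapping).
VISEMES = {
    "p": "closed lips",          # bilabial stop, as in "perfect"
    "b": "closed lips",
    "f": "lower lip to teeth",
    "a": "open jaw",
}

def mouth_shapes(phonemes):
    """Map each phoneme to a mouth shape, defaulting to a neutral pose."""
    return [VISEMES.get(p, "neutral") for p in phonemes]
```

Latent Sync does the same job end to end: audio in, phoneme-aligned lip shapes out, with TREPA keeping the sequence temporally consistent.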
Thanks for sharing 🙏
I followed every step but once running in ComfyUI it gets an error:
D_LatentSyncNode
Failed to execute module: No module named ‘decord’
I guess installing requirement didn’t work as it should, I did it twice (just in case) but I get the same error.
I’ve tried to run as ADMIN as well, it didn’t help.
I’ve tried to install the missing module manually via CMD:
"pip install decord"
But there is another error:
WARNING: Error parsing dependencies of torchsde: .* suffix can only be used with `==` or `!=` operators
numpy (>=1.19.*) ; python_version >= "3.7"
~~~~~~~^
I’m not a programmer, I just followed your instructions.
Can you please explain how to fix it?
Thanks ahead! 💙