Perfect Lip-Sync Movements with Latent Sync
Let’s be real—getting lip movements to match new audio in a different language is usually a pain. But Latent Sync? It’s one of those tools that just works better than I expected.
Here’s the thing: Latent Sync isn’t just another lip-syncing tool. It’s an AI framework from ByteDance and Beijing Jiaotong University that maps phonemes (those tiny sound units in speech) to exact lip movements. The result? Scary-good sync, even across languages.
Setting It Up
I’ll be honest, the setup isn’t plug-and-play, but it’s not too bad either. First, make sure you’re running Python 3.8 to 3.11; skip 3.12 for now, since Mediapipe doesn’t play nice with it yet. You’ll also need FFmpeg. If you run into PATH issues on Windows, grab a build and drop it into C:\ffmpeg.
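Before going any further, it’s worth a quick sanity check that the right Python and FFmpeg are actually the ones on your PATH. Nothing Latent Sync-specific here, just generic version checks:
# Confirm you're on Python 3.8–3.11 and that FFmpeg resolves on PATH
python --version
ffmpeg -version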
Next, clone the Latent Sync Wrapper into your ComfyUI custom nodes:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-LatentSyncWrapper.git
cd ComfyUI-LatentSyncWrapper
pip install -r requirements.txt
You’ll need two model files:
- latentsync_unet.pt (goes in ComfyUI-LatentSyncWrapper/checkpoints/)
- tiny.pt (save it in checkpoints/whisper/)
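If you want to set up the folders from the command line first, here’s a minimal sketch (Unix-style shell such as Git Bash assumed; the last line just confirms both downloaded files ended up where the node expects them):
# Run from inside ComfyUI/custom_nodes/ComfyUI-LatentSyncWrapper
# Create the expected checkpoint folders, then drop the downloaded weights into them
mkdir -p checkpoints/whisper
# After downloading, confirm both files are in place
ls checkpoints/latentsync_unet.pt checkpoints/whisper/tiny.pt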
If you hit PYTHONPATH errors, try running ComfyUI as admin. Annoying, but it usually fixes things.
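If running as admin isn’t an option, another thing worth trying (no guarantees, this is just a hedged workaround) is pointing PYTHONPATH at your ComfyUI root before launching. The /c/ComfyUI path below is a placeholder for wherever your install actually lives (Unix-style shell shown):
# Add the ComfyUI root to Python's module search path for this session, then launch as usual
export PYTHONPATH=/c/ComfyUI
python main.py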
What Works (And What Doesn’t)
Latent Sync nails it with clear, frontal-face videos. The phoneme-to-lip mapping is eerily accurate—like the “p” in “perfect” actually looks like a “p.” But there are limits:
- No anime or cartoon faces (yet).
- Videos get auto-converted to 25 FPS (see the FFmpeg note right after this list).
- The face needs to stay visible the whole time.
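On that 25 FPS point: if you’d rather control the frame-rate conversion yourself instead of letting the node do it, a standard FFmpeg pass works fine. input.mp4 and input_25fps.mp4 are placeholder filenames:
# Resample the source clip to 25 FPS up front; audio is copied through untouched
ffmpeg -i input.mp4 -r 25 -c:v libx264 -crf 18 -c:a copy input_25fps.mp4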
As a sanity check, once everything from the setup section is in place, your checkpoints folder should look like this:
ComfyUI/custom_nodes/ComfyUI-LatentSyncWrapper/checkpoints/
├── latentsync_unet.pt
└── whisper/
    └── tiny.pt
I didn’t expect the temporal consistency to be this solid. Even with longer audio, the lip movements stay locked in. For more details, check out the original paper or the ComfyUI examples page.
Anyway, if you’ve been wrestling with lip sync, this is worth a shot. Just don’t expect miracles with non-human faces. Yet.