How to achieve great lipsync videos.
How to achieve great lipsync - What is CRITICAL for Lipsync
Your lipsync quality depends more on the input than on the model! Check out the following power-user tips for making lip-perfect videos with our custom word-by-word workflow.
The tips below follow the order in which each step appears in our workflow.
1. Project & Video Settings
Aspect Ratio
Talking avatars work best with standard aspect ratios such as 16:9, 1:1, or 4:5. Portrait ratios like 4:5 often produce better lip sync results because the mouth occupies more pixels in the frame. Even if your final target is 9:16, you can crop the video to that ratio afterwards, because the lipsynced face stays centered in the frame.
Shot Type: affects prompts and reference images
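As a rough illustration of the crop step, here is a minimal sketch that computes the largest centered 9:16 window inside a frame. The function name and the example resolution are assumptions for illustration only; any video editor can apply the resulting box.

```python
def centered_crop_box(width, height, target_w=9, target_h=16):
    """Return (left, top, right, bottom) of the largest centered crop
    with aspect ratio target_w:target_h inside a width x height frame."""
    # Compare the two ratios by cross-multiplication to stay in integers.
    if width * target_h > height * target_w:
        # Frame is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * target_w // target_h
    else:
        # Frame is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * target_h // target_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# A 4:5 render at 1080x1350 (hypothetical size) cropped to 9:16.
print(centered_crop_box(1080, 1350))  # (160, 0, 919, 1350)
```

Because the crop is centered, the mouth stays inside the window as long as the avatar was framed in the middle of the original shot.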
Medium shots (head and shoulders visible) provide the most reliable lip synchronization and avoid model guessing.
- Use head-and-full-shoulders framing whenever possible.
- Avoid extreme close-ups where the face fills the entire frame.
- Avoid distant shots where the mouth becomes too small.
Camera Style
Stable, fixed camera angles work best for talking avatars.
- Avoid prompts suggesting strong camera motion.
- Keep the character facing the viewer.
Background
- Simple backgrounds or slightly blurred environments work best.
- Studio backgrounds or office settings are ideal.
Avoid:
- Busy backgrounds
- Complex patterns
- Highly detailed scenes
2. Character Setup (Avatar Image)
Lip sync quality depends heavily on the avatar reference image. Much more heavily than people think.
Best Face Pose
- Face angle: 0–10° off center
- Head tilt: neutral
- Eyes looking directly at the camera
- Mouth closed and relaxed
- Expression neutral or slight smile
Avoid
- Open mouth reference images
- Large smiles with visible teeth
- Strong side profile (greater than 25°)
Framing
The best composition is a bust shot (head and full shoulders visible). This gives the model stable jaw geometry, neck motion and breathing motion.
- Top of hair visible
- Eyes around the upper third of the frame
- Mouth near the center
- Shoulders visible
Avoid:
- Extreme close-ups
- Full-body images where the face is very small
3. Image Generation or Reference Matching
Users may upload their own image or generate a new avatar image. If you upload an image, you have three options:
- Use it as-is for the scene (it is copied to your first scene).
- Use match-face-only mode, in which only the facial features are used for the video reference image.
- Generate the image without match-face-only mode. In this case the generated image follows the best practices and aims to be a good reference image for video generation.
If you are generating a new avatar image using another AI model with the intention to upload:
Upscale your image to 2048px (you can do that with AI upscalers). If you hit size limits for your uploads, let us know or upgrade your plan. Upscaling improves:
- lip edge detail
- jaw movement
- teeth rendering
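To work out the target size to request from an upscaler, a small sketch like the following can help. It assumes that "2048px" refers to the longer side of the image, which the guide does not specify; the function name is hypothetical.

```python
def upscale_target(width, height, long_side=2048):
    """Return (new_width, new_height) with the longer side at long_side px,
    preserving the aspect ratio. Images already large enough are unchanged."""
    longest = max(width, height)
    if longest >= long_side:
        return width, height
    # Scale both sides by the same factor and round to whole pixels.
    return (round(width * long_side / longest),
            round(height * long_side / longest))


print(upscale_target(1024, 1024))  # (2048, 2048)
```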
The best current avatar model is FLUX.1 Dev. Reasons:
- much stronger facial structure
- better lips
- higher realism
Use prompts that encourage symmetrical faces and clear mouth geometry. A lipsync model tracks shadow deformation around the lips, so lighting matters: shadows on the face can cause flicker in the mouth deformation.
Helpful prompt additions
- soft studio lighting
- soft front light
- even skin illumination
- no harsh shadows
- high facial detail
- sharp focus
- natural skin texture
- centered face
- clear lips
- symmetrical face
- high resolution
- high realism
- high facial geometry detail
Photography style additions
- professional studio portrait
- looking directly at camera
- neutral expression
- mouth closed, natural mouth shape
- upper body portrait
- shoulders visible
- 85mm portrait lens style
- f/2.8 depth of field
Recommended negative prompts
- open mouth
- sunglasses
- beard covering lips
- microphone in front of mouth
- face partially cropped
- teeth visible
- extreme smile
- side profile
- strong side lighting
- dramatic film lighting
- harsh shadows
- face half in shadow
- low resolution
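The prompt additions and negative prompts above can be kept in lists and joined into strings, as in this minimal sketch. The function name and the subject text are placeholders, only a subset of the hints is shown, and whether your image tool accepts a separate negative prompt is an assumption to check against that tool.

```python
# A subset of the hints listed above; extend as needed.
POSITIVE_HINTS = [
    "soft studio lighting",
    "even skin illumination",
    "no harsh shadows",
    "sharp focus",
    "centered face",
    "professional studio portrait",
    "looking directly at camera",
    "mouth closed, natural mouth shape",
    "shoulders visible",
]
NEGATIVE_HINTS = [
    "open mouth",
    "sunglasses",
    "teeth visible",
    "side profile",
    "harsh shadows",
    "face half in shadow",
    "low resolution",
]


def build_prompts(subject, positive=POSITIVE_HINTS, negative=NEGATIVE_HINTS):
    """Return (prompt, negative_prompt) as comma-separated strings."""
    prompt = ", ".join([subject, *positive])
    negative_prompt = ", ".join(negative)
    return prompt, negative_prompt


prompt, negative_prompt = build_prompts("portrait of a presenter")
print(prompt)
print(negative_prompt)
```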
If using reference matching
- Choose images where the face is clearly visible.
- The mouth should not be obstructed.
Avoid images where:
- Sunglasses cover the eyes
- Hair covers the mouth
- Microphones or objects block the lips
- The face is partially cropped
Ideal Composition
- Face angle: 0–10° off center
- Head tilt: 0° (neutral)
- Eye direction: looking straight at camera
- Mouth: relaxed, closed
- Expression: neutral / slight smile
Avoid:
- Open mouth in reference
- Big smile (teeth showing)
- Side profile > 25°
These distort lip shapes when animated.
4. Lighting
Lighting strongly affects mouth tracking and facial motion quality.
Best lighting setup
- Soft front lighting
- Slight side fill
- Even facial illumination
Recommended prompt hints
- soft studio lighting
- even skin illumination
- cinematic portrait lighting
- sharp facial detail
Avoid
- Strong side lighting
- Dramatic film lighting
- Faces partially in shadow
These conditions can cause lip deformation flicker.
5. Voice and Speech Generation
Speech characteristics strongly influence lip animation quality. Our word-by-word workflow lets you adjust speed, pitch, emotion, and pauses until the result is ideal. You can also listen to the audio to make sure it sounds right before proceeding to video generation.
Recommended voice styles
- Clear narrator voice
- Confident presenter tone
- Stable speaking cadence
Best picks
- Deep_Voice_Man
- Imposing_Manner
- Casual_Guy
- Lively_Girl
Avoid
- Whisper style voices
- Cartoon or exaggerated voices
- Extremely fast speaking voices
Choosing a voice that suits your image also makes the result more convincing to viewers. Avoid changing the voice or its parameters for the same character: a higher pitch on the same voice is perceived as a different voice.
The ideal speaking speed is:
0.9 – 1.05
Faster speech can reduce lip accuracy.
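If you set speed programmatically, a small guard can keep it inside the recommended range. This is only a sketch; the function and constant names are hypothetical and not part of any platform API.

```python
RECOMMENDED_SPEED = (0.9, 1.05)  # range recommended above


def clamp_speed(speed, bounds=RECOMMENDED_SPEED):
    """Return speed limited to the recommended speaking-speed range."""
    low, high = bounds
    return min(max(speed, low), high)


print(clamp_speed(1.3))  # 1.05
```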
Short sentences usually produce better mouth animation than long complex sentences.
6. Script Formatting for Better Lip Sync
Adding pauses in the script improves realism and prevents the avatar from appearing to continue speaking after the line ends. A common issue with pretty much all lipsync models is that they try to connect to the next scene, so at the end of a line the person appears to keep speaking instead of stopping when the text is over. The solution is to use the pause markers supported by the model.
Example
Hello everyone.<#0.4#>
Today I want to show you something interesting.<#0.4#>
Let’s begin.<#0.8#>
If possible, add 1–1.2s of silence at the end of the last scene.
Example:
Thank you for watching.<#1.0#>
This prevents the continuous talking effect.
Guidelines
- Add short pauses between sentences
- Add a longer pause at the end of the final sentence
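The two guidelines above can be applied mechanically. The sketch below appends a pause marker in the `<#seconds#>` form shown in the examples to each sentence, with a longer one at the end; the function name and default durations are illustrative assumptions.

```python
def add_pauses(sentences, between=0.4, final=1.0):
    """Return a script with one sentence per line, each followed by a
    <#seconds#> pause marker; the last sentence gets the longer pause."""
    lines = []
    for i, sentence in enumerate(sentences):
        is_last = i == len(sentences) - 1
        pause = final if is_last else between
        lines.append(f"{sentence.strip()}<#{pause}#>")
    return "\n".join(lines)


print(add_pauses(["Hello everyone.", "Thank you for watching."]))
```

Listen to the generated audio afterwards and adjust the durations by ear; the defaults here simply mirror the examples in this section.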
The duration of each line/scene is also critical.
Normally, the longer the duration, the larger the model ‘drift’ (decreased quality). Leverage our workflow to make longer videos by breaking them into more lines, each no more than 15 seconds.
If you use premium lipsync models, the duration of each line should be close to 5, 10, or 15 seconds: never more than 15, and not in between (e.g. not 7.5 seconds). Use pauses to achieve that, with some trial and error, leveraging our platform’s ability to produce the audio before the video.
Within each scene, also use shorter sentences if possible. The model performs better with short phrases. Instead of:
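Once you know the length of the generated audio for a line, a quick calculation shows how much pause time is missing to reach the next target. This is a sketch under the assumption that you measure the audio length yourself (for example from the audio preview); the names are hypothetical.

```python
TARGET_SECONDS = (5.0, 10.0, 15.0)  # durations recommended above


def padding_to_target(audio_seconds, targets=TARGET_SECONDS):
    """Return (target, extra_pause_seconds) for the smallest target that is
    at least as long as the audio. Raise ValueError above the largest target."""
    for target in targets:
        if audio_seconds <= target:
            return target, round(target - audio_seconds, 2)
    raise ValueError("line is longer than 15 seconds; split it into more lines")


print(padding_to_target(4.6))  # (5.0, 0.4)
```

A small gap can be closed with pause markers. A large gap, such as a 7.5-second line that needs 2.5 seconds to reach 10, is usually better handled by rewording or by splitting the line.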
Today I will explain something very interesting about artificial intelligence systems and how they are changing the world.
Use:
Today I will explain something interesting.<#0.3#> Artificial intelligence is changing the world.<#0.4#> Let me show you how.
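To spot sentences worth splitting, a simple word-count check is often enough. The sketch below strips `<#seconds#>` pause markers, splits on sentence-ending punctuation, and returns sentences above a threshold. The 10-word limit is an arbitrary assumption, not a platform rule.

```python
import re

PAUSE_MARKER = re.compile(r"<#\d+(?:\.\d+)?#>")


def long_sentences(script, max_words=10):
    """Return the sentences in script that contain more than max_words words."""
    # Replace pause markers with spaces so they are not counted as words.
    text = PAUSE_MARKER.sub(" ", script)
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if len(s.split()) > max_words]


script = ("Today I will explain something interesting.<#0.3#> "
          "Artificial intelligence is changing the world.<#0.4#> "
          "Let me show you how.")
print(long_sentences(script))  # []
```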
7. Scene Actions
Scene actions influence how the avatar appears in the generated video.
Best actions
- Directly talking to camera
- Standing and addressing the viewer
- Subtle head movement
Avoid
- Actions where the character is far from the camera
- Scenes where the mouth becomes too small
- Scenes where the face turns away from the viewer
Our workflow accepts a motion guidance prompt through Scene Action.
Good Example:
person speaking naturally, subtle head movement, realistic lip motion
Avoid prompts like:
talking excitedly, laughing, shouting
They break lip sync.
Use premium models only if you need more elaborate scenes. The lip sync itself will be good enough with the basic model too, provided you follow the best practices outlined here.