How to achieve great lipsync videos.

What is critical for lipsync

Your lipsync quality depends more on the input than on the model! The following power-user tips will help you make lip-perfect videos with our custom word-by-word workflow.

The tips are listed in the order the steps appear in our workflow.

1. Project & Video Settings

Aspect Ratio

Talking avatars work best with standard aspect ratios such as 16:9, 1:1, or 4:5. Portrait ratios like 4:5 often produce better lip sync results because the mouth occupies more pixels in the frame. Even if your final target is 9:16, you can crop the video to that ratio afterwards, because the speaking face stays centered in the frame.
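
The crop to 9:16 comes down to simple arithmetic. The following is a minimal sketch in Python, assuming the face sits in the center of the frame. The function name centered_crop_box is hypothetical and is not part of our platform.

```python
def centered_crop_box(width, height, target_w=9, target_h=16):
    """Return (left, top, right, bottom) of the largest centered crop
    that has the target aspect ratio (9:16 by default)."""
    # Compare the two ratios by cross-multiplication to avoid float rounding.
    if width * target_h > height * target_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * target_w // target_h
    else:
        # Source is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * target_h // target_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


# A 4:5 video at 1080x1350 keeps its full height and loses the sides.
print(centered_crop_box(1080, 1350))  # (160, 0, 919, 1350)
```

The returned tuple uses the common (left, top, right, bottom) convention, so it can be passed to most image or video cropping tools with little or no adjustment.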

Shot Type: affects prompts and reference images

Medium shots (head and shoulders visible) provide the most reliable lip synchronization and leave the model less to guess.

Use head-and-full-shoulders framing whenever possible.
Avoid extreme close-ups where the face fills the entire frame.
Avoid distant shots where the mouth becomes too small.

Camera Style

Stable, fixed camera angles work best for talking avatars.

Avoid prompts suggesting strong camera motion.
Keep the character facing the viewer.

Background

Simple backgrounds or slightly blurred environments work best.
Studio backgrounds or office settings are ideal.

Avoid:

Busy backgrounds
Complex patterns
Highly detailed scenes

2. Character Setup (Avatar Image)

Lip sync quality depends heavily on the avatar reference image, much more heavily than most people expect.

Best Face Pose

Face angle: 0–10° off center
Head tilt: neutral
Eyes looking directly at the camera
Mouth closed and relaxed
Expression neutral or slight smile

Avoid

Open mouth reference images
Large smiles with visible teeth
Strong side profile (greater than 25°)

Framing

The best composition is a bust shot (head and full shoulders visible). This gives the model stable jaw geometry, neck motion and breathing motion.

Top of hair visible
Eyes around the upper third of the frame
Mouth near the center
Shoulders visible

Avoid:

Extreme close-ups
Full-body images where the face is very small

3. Image Generation or Reference Matching

You can upload your own image or generate a new avatar image. An uploaded image can be handled in one of three ways: it can be used as-is for the scene (copied to your first scene); it can be used in match-face-only mode, in which only the facial features are used for the video reference image; or you can leave match-face-only off and have the image generated. In the last case, the generated image follows the best practices and aims to be a good reference image for video generation.

If you are generating a new avatar image with another AI model and intend to upload it:

Upscale your image to 2048px (you can do that with AI upscalers). If you hit size limits for your uploads, let us know or upgrade your plan. Upscaling improves:

lip edge detail
jaw movement
teeth rendering

This improves lipsync a lot.

The best current avatar model is FLUX.1 Dev. Reasons:

much stronger facial structure
better lips
higher realism

Use prompts that encourage symmetrical faces and clear mouth geometry. Lipsync models track how shadows deform around the lips, so lighting matters: harsh shadows cause the mouth to flicker and deform.

Helpful prompt additions

soft studio lighting
soft front light
even skin illumination
no harsh shadows
high facial detail
sharp focus
natural skin texture
centered face
clear lips
symmetrical face
high resolution
high realism
high facial geometry detail

Photography style additions

professional studio portrait
looking directly at camera
neutral expression
mouth closed, natural mouth shape
upper body portrait
shoulders visible
85mm portrait lens style
f/2.8 depth of field

Recommended negative prompts

open mouth
sunglasses
beard covering lips
microphone in front of mouth
face partially cropped
teeth visible
extreme smile
side profile
strong side lighting
dramatic film lighting
harsh shadows
face half in shadow
low resolution
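
If you assemble prompts in a script before pasting them into an image model, the lists above can be combined as in the sketch below. This is only an illustration: build_prompts and the two constant names are hypothetical, the hints are a subset of the lists above, and whether your image model accepts a separate negative prompt is an assumption you should check.

```python
# A subset of the hints listed above; extend or trim to taste.
POSITIVE_HINTS = [
    "professional studio portrait",
    "soft studio lighting",
    "looking directly at camera",
    "neutral expression",
    "mouth closed, natural mouth shape",
    "shoulders visible",
    "symmetrical face",
    "sharp focus",
]
NEGATIVE_HINTS = [
    "open mouth",
    "teeth visible",
    "side profile",
    "harsh shadows",
    "sunglasses",
    "low resolution",
]


def build_prompts(subject, positive=POSITIVE_HINTS, negative=NEGATIVE_HINTS):
    """Return (prompt, negative_prompt) as comma-separated strings."""

    def dedupe(items):
        # Drop empty entries and case-insensitive repeats, keeping order.
        seen, kept = set(), []
        for item in items:
            key = item.strip().lower()
            if key and key not in seen:
                seen.add(key)
                kept.append(item.strip())
        return kept

    prompt = ", ".join(dedupe([subject, *positive]))
    negative_prompt = ", ".join(dedupe(negative))
    return prompt, negative_prompt


prompt, negative_prompt = build_prompts("a woman in a navy blazer")
print(prompt)
print(negative_prompt)
```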

If using reference matching

Choose images where the face is clearly visible.
The mouth should not be obstructed.

Avoid images where:

Sunglasses cover the eyes
Hair covers the mouth
Microphones or objects block the lips
The face is partially cropped

Ideal Composition

Face angle: 0–10° off center
Head tilt: 0° (neutral)
Eye direction: looking straight at camera
Mouth: relaxed, closed
Expression: neutral / slight smile

Avoid:

Open mouth in reference
Big smile (teeth showing)
Side profile > 25°

These distort lip shapes when animated.

4. Lighting

Lighting strongly affects mouth tracking and facial motion quality.

Best lighting setup

Soft front lighting
Slight side fill
Even facial illumination

Recommended prompt hints

soft studio lighting
even skin illumination
cinematic portrait lighting
sharp facial detail

Avoid

Strong side lighting
Dramatic film lighting
Faces partially in shadow

These conditions can cause lip deformation flicker.

5. Voice and Speech Generation

Speech characteristics strongly influence lip animation quality. Our word-by-word workflow lets you adjust speed, pitch, emotion, and pauses until the result sounds right. You can also listen to the audio and confirm it has good characteristics before proceeding to video generation.

Recommended voice styles

Clear narrator voice
Confident presenter tone
Stable speaking cadence

Best picks

Deep_Voice_Man
Imposing_Manner
Casual_Guy
Lively_Girl

Avoid

Whisper style voices
Cartoon or exaggerated voices
Extremely fast speaking voices

Choosing a voice that suits your image also makes the result more convincing to viewers. Avoid changing the voice or its parameters for the same character: the same voice at a higher pitch is perceived as a different voice.

The ideal speaking speed is:

0.9 – 1.05

Faster speech can reduce lip accuracy.

Short sentences usually produce better mouth animation than long complex sentences.

6. Script Formatting for Better Lip Sync

Adding pauses in the script improves realism and prevents the avatar from appearing to continue speaking after the line ends. A common issue with nearly all lipsync models is that they try to connect to the next scene, so at the end of a line the person appears to keep speaking instead of stopping when the text is over. The solution is to use the pause markers supported by the model.

Example

Hello everyone.<#0.4#>
Today I want to show you something interesting.<#0.4#>
Let’s begin.<#0.8#>

If possible, add 1–1.2 s of silence at the end of the last scene.

Example:

Thank you for watching.<#1.0#>

This prevents the continuous talking effect.

Guidelines

Add short pauses between sentences
Add a longer pause at the end of the final sentence
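
For longer scripts, the markers can be appended automatically. The sketch below assumes the <#seconds#> marker format shown in the examples above; the function name add_pauses and its default values are illustrative choices, not platform settings.

```python
def add_pauses(sentences, between=0.4, final=1.0):
    """Append a <#seconds#> pause marker to every sentence.

    `between` is used after each sentence except the last one,
    which gets the longer `final` pause.
    """
    lines = []
    last_index = len(sentences) - 1
    for i, sentence in enumerate(sentences):
        pause = final if i == last_index else between
        lines.append(f"{sentence.strip()}<#{pause}#>")
    return "\n".join(lines)


print(add_pauses(["Hello everyone.", "Thank you for watching."]))
# Hello everyone.<#0.4#>
# Thank you for watching.<#1.0#>
```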

The duration of each line/scene is also critical.

Normally, the longer the duration, the larger the model ‘drift’ (decreased quality). Use our workflow to make longer videos by breaking the script into more lines, each no longer than 15 seconds.

If you use premium lipsync models, the duration of each line should be close to 5, 10, or 15 seconds: never more than 15, and not in between (for example, not 7.5 seconds). Use pauses to get there, with some trial and error, taking advantage of our platform’s ability to produce the audio before the video.

Within each scene, also use shorter sentences if possible. The model performs better with short phrases. Instead of:
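
Once you have generated the audio for a line and know its length, the amount of pause still needed is a simple subtraction. The sketch below is a hypothetical helper (padding_to_target is not a platform function); it assumes you pad up to the next target rather than trimming the text.

```python
TARGET_DURATIONS = (5.0, 10.0, 15.0)  # seconds, from the guideline above


def padding_to_target(audio_seconds, targets=TARGET_DURATIONS):
    """Return (target, extra_pause) for the smallest target that fits.

    Raises ValueError when the audio is longer than the largest target,
    in which case the line should be split into two.
    """
    for target in sorted(targets):
        if audio_seconds <= target:
            return target, round(target - audio_seconds, 2)
    raise ValueError("line is too long; split it into shorter lines")


print(padding_to_target(7.5))  # (10.0, 2.5)
```

If the extra pause comes out large, as in the 7.5-second case, rewording or splitting the line is usually a better fix than one long silence.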

Today I will explain something very interesting about artificial intelligence systems and how they are changing the world.

Use:

Today I will explain something interesting.<#0.3#>
Artificial intelligence is changing the world.<#0.4#>
Let me show you how.

7. Scene Actions

Scene actions influence how the avatar appears in the generated video.

Best actions

Directly talking to camera
Standing and addressing the viewer
Subtle head movement

Avoid

Actions where the character is far from the camera
Scenes where the mouth becomes too small
Scenes where the face turns away from the viewer

Our workflow lets you supply a motion guidance prompt through Scene Action.

Good Example:

person speaking naturally, subtle head movement, realistic lip motion

Avoid prompts like:

talking excitedly
laughing
shouting

They break lip sync.

Use premium models only if you need more elaborate scenes. The lip sync itself will be good enough with the basic model, provided you follow the best practices outlined here.