AnimaSync extracts emotion from speech and generates lip sync, facial expressions, and body motion in real time — no server required.
One engine handles the full animation pipeline — from raw audio to animated avatar.
ONNX neural inference maps speech phonemes to 52 ARKit blendshapes at 30fps. Crisp mouth movements with natural co-articulation.
Voice energy and pitch automatically drive brows, cheeks, eyes, and smile. Emotion follows the speaker naturally.
Stochastic blink injection at 2.5–4.5s intervals with 15% double-blink probability. No dead-eyed avatars.
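The blink timing described above can be sketched as a small scheduler. This is an illustrative sketch, not the engine's implementation; the `Rng` injection point and `BlinkEvent` shape are assumptions for testability.

```typescript
// Sketch of stochastic blink scheduling: intervals drawn uniformly from
// 2.5–4.5 s, with a 15% chance of a quick follow-up (double) blink.
// `rng` is injectable so the schedule can be made deterministic in tests.
type Rng = () => number; // returns a float in [0, 1)

interface BlinkEvent {
  timeSec: number;   // when the blink starts
  isDouble: boolean; // whether a second blink follows shortly after
}

function scheduleBlinks(durationSec: number, rng: Rng = Math.random): BlinkEvent[] {
  const events: BlinkEvent[] = [];
  let t = 0;
  while (true) {
    t += 2.5 + rng() * 2.0;        // next interval in [2.5, 4.5) s
    if (t >= durationSec) break;
    const isDouble = rng() < 0.15; // 15% double-blink probability
    events.push({ timeSec: t, isDouble });
    if (isDouble) t += 0.15;       // leave room for the second blink
  }
  return events;
}
```

Each event would then drive the `eyeBlinkLeft`/`eyeBlinkRight` ARKit blendshapes over a short open-close envelope.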
Embedded VRMA bone animation clips with smooth idle-to-speaking crossfade. Breathing, gestures, and posture shifts.
AudioWorklet captures microphone at 16kHz. Process chunks as they arrive — no need to wait for complete audio.
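Streaming means the capture side delivers small, arbitrary-sized chunks (AudioWorklet processes 128-sample render quanta) while a 30 fps model wants fixed-size hops. A minimal frame assembler bridging the two could look like this; the hop size is derived from the stated 16 kHz / 30 fps figures and is an illustrative assumption, not the engine's actual internal value.

```typescript
// Sketch of frame assembly for streamed audio: accept arbitrary-sized
// chunks, emit fixed hop-sized frames (16000 / 30 ≈ 533 samples per
// animation frame). Leftover samples stay buffered for the next chunk.
const SAMPLE_RATE = 16_000;
const FPS = 30;
const HOP = Math.round(SAMPLE_RATE / FPS); // 533 samples per frame

class FrameAssembler {
  private buffer: number[] = [];

  /** Feed one incoming chunk; returns every complete hop-sized frame. */
  push(chunk: Float32Array | number[]): number[][] {
    for (const s of chunk) this.buffer.push(s);
    const frames: number[][] = [];
    while (this.buffer.length >= HOP) {
      frames.push(this.buffer.splice(0, HOP)); // remove HOP samples from the front
    }
    return frames;
  }
}
```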
Rust/WASM + ONNX Runtime Web. No server, no API calls, no data leaves the browser. Works offline after first load.
Install from npm, initialize the engine, and start generating animation frames. Works with any Three.js + VRM setup.
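The integration flow is: initialize once, feed audio, apply each generated frame to the avatar. The sketch below shows only the frame-application step with a caller-supplied setter, because the engine's actual exported names are not shown in this section; `BlendshapeFrame` and `applyFrame` are hypothetical names, and in a real Three.js + VRM setup the setter would forward to the avatar's expression manager.

```typescript
// Sketch of applying one generated animation frame to an avatar.
// A frame is a map of ARKit blendshape names to 0..1 weights.
interface BlendshapeFrame {
  [arkitName: string]: number;
}

function applyFrame(
  frame: BlendshapeFrame,
  setWeight: (name: string, value: number) => void, // e.g. VRM expression setter
): void {
  for (const [name, value] of Object.entries(frame)) {
    // Clamp defensively: blendshape weights outside [0, 1] distort the mesh.
    setWeight(name, Math.min(1, Math.max(0, value)));
  }
}
```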
Two engines, one API surface. Pick the engine that fits your project.
Phoneme classification engine — 111-dim output with full expression control. Built-in IdleExpressionGenerator, VoiceActivityDetector, and VRM 18-dim mode.
Emotion model — 52-dim ARKit blendshape prediction with 5-dim FiLM conditioning (neutral, joy, anger, sadness, surprise).
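FiLM (feature-wise linear modulation) conditions a network by scaling and shifting hidden features per channel: y = γ(e) ⊙ x + β(e), where γ and β are projected from the conditioning vector e. A minimal sketch with the 5-dim emotion vector, using placeholder projection weights rather than the trained model's parameters:

```typescript
// Sketch of FiLM conditioning: the 5-dim emotion vector
// (neutral, joy, anger, sadness, surprise) is linearly projected to a
// per-channel scale gamma and shift beta, which modulate features x.
function film(
  x: number[],       // hidden features, one value per channel
  emotion: number[], // 5-dim weights, e.g. [0, 1, 0, 0, 0] = joy
  gammaW: number[][], // [channels][5] scale projection (placeholder weights)
  betaW: number[][],  // [channels][5] shift projection (placeholder weights)
): number[] {
  return x.map((xi, c) => {
    const gamma = gammaW[c].reduce((acc, w, k) => acc + w * emotion[k], 0);
    const beta = betaW[c].reduce((acc, w, k) => acc + w * emotion[k], 0);
    return gamma * xi + beta; // y = γ(e)·x + β(e)
  });
}
```

Because γ and β depend only on the emotion vector, the same speech features yield different mouth and face shapes as the emotion input changes.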
Interactive demos you can try right now — no install needed.
6-step interactive tutorial. Choose V1 or V2 engine, adjust emotion in real time (V2), load a VRM avatar, apply lip sync — with live demos at each step.
V1 phoneme engine — 111-dim output mapped to 52 ARKit blendshapes. ONNX inference with real-time visualization. Try it →

V2 emotion model — 52 ARKit blendshapes with 5-dim FiLM conditioning. Emotion-aware lip sync, real-time rendering. Try it →

Same voice input, two animation engines, two avatars. See the difference live in a dual-panel view. Try it →

Two engines for different needs. Both produce ARKit-compatible output at 30fps.
| Feature | V1 (recommended) | V2 |
|---|---|---|
| Output | 111-dim output, mapped to 52 ARKit blendshapes | 52-dim ARKit blendshapes |
| Architecture | Phoneme classification + viseme mapping | Emotion model + FiLM conditioning |
| Post-processing | OneEuroFilter + anatomical constraints | crisp_mouth + fade + auto-blink |
| Idle expressions | Built-in IdleExpressionGenerator | Blink injection in post-process |
| Voice activity | Built-in VoiceActivityDetector | — |
| Emotion control | — | 5-dim FiLM conditioning (neutral, joy, anger, sadness, surprise) |
| Best for | Full expression control, custom avatars | Emotion-aware lip sync, quick integration |
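The One Euro filter named in the V1 post-processing row is a published adaptive low-pass filter: it raises its cutoff when the signal moves fast (minimizing lag) and lowers it when the signal is slow (minimizing jitter). A sketch of the standard formulation follows; the parameter values are common defaults, not the engine's tuned settings.

```typescript
// Sketch of a One Euro filter for smoothing per-frame blendshape weights.
// alpha(cutoff, dt) converts a cutoff frequency into an exponential
// smoothing factor; the cutoff itself adapts to the estimated speed.
class OneEuroFilter {
  private xPrev: number | null = null;
  private dxPrev = 0;

  constructor(
    private minCutoff = 1.0, // Hz: smoothing floor for slow motion
    private beta = 0.05,     // speed coefficient: higher = less lag
    private dCutoff = 1.0,   // Hz: cutoff for the derivative estimate
  ) {}

  private static alpha(cutoff: number, dt: number): number {
    const tau = 1 / (2 * Math.PI * cutoff);
    return 1 / (1 + tau / dt);
  }

  filter(x: number, dt: number): number {
    if (this.xPrev === null) {
      this.xPrev = x; // first sample passes through unfiltered
      return x;
    }
    const ad = OneEuroFilter.alpha(this.dCutoff, dt);
    const dx = (x - this.xPrev) / dt;
    const dxHat = ad * dx + (1 - ad) * this.dxPrev; // smoothed speed
    const cutoff = this.minCutoff + this.beta * Math.abs(dxHat);
    const a = OneEuroFilter.alpha(cutoff, dt);
    const xHat = a * x + (1 - a) * this.xPrev; // adaptive low-pass
    this.xPrev = xHat;
    this.dxPrev = dxHat;
    return xHat;
  }
}
```

Running one filter per blendshape channel at dt = 1/30 s removes inference jitter without the mushy mouth shapes a fixed low-pass would produce.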