AnimaSync extracts emotion from speech and generates lip sync, facial expressions, and body motion in real time — no server required.
One engine handles the full animation pipeline — from raw audio to animated avatar.
ONNX neural inference maps speech phonemes to 52 ARKit blendshapes at 30fps. Crisp mouth movements with natural co-articulation.
Voice energy and pitch automatically drive brows, cheeks, eyes, and smile. Emotion follows the speaker naturally.
Stochastic blink injection at 2.5–4.5s intervals with 15% double-blink probability. No dead-eyed avatars.
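The blink timing described above can be sketched as a small scheduler. This is an illustrative sketch, not the engine's implementation; the `Rng` injection point and `BlinkEvent` shape are assumptions for testability.

```typescript
// Sketch of stochastic blink scheduling: intervals drawn uniformly from
// 2.5–4.5 s, with a 15% chance of a quick follow-up (double) blink.
// `rng` is injectable so the schedule can be made deterministic in tests.
type Rng = () => number; // returns a float in [0, 1)

interface BlinkEvent {
  timeSec: number;   // when the blink starts
  isDouble: boolean; // whether a second blink follows shortly after
}

function scheduleBlinks(durationSec: number, rng: Rng = Math.random): BlinkEvent[] {
  const events: BlinkEvent[] = [];
  let t = 0;
  while (true) {
    t += 2.5 + rng() * 2.0;        // next interval in [2.5, 4.5) s
    if (t >= durationSec) break;
    const isDouble = rng() < 0.15; // 15% double-blink probability
    events.push({ timeSec: t, isDouble });
    if (isDouble) t += 0.15;       // leave room for the second blink
  }
  return events;
}
```

Each event would then drive the `eyeBlinkLeft`/`eyeBlinkRight` ARKit blendshapes over a short open-close envelope.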
Embedded VRMA bone animation clips with smooth idle-to-speaking crossfade. Breathing, gestures, and posture shifts.
AudioWorklet captures microphone at 16kHz. Process chunks as they arrive — no need to wait for complete audio.
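Streaming means the capture side delivers small, arbitrary-sized chunks (AudioWorklet processes 128-sample render quanta) while a 30 fps model wants fixed-size hops. A minimal frame assembler bridging the two could look like this; the hop size is derived from the stated 16 kHz / 30 fps figures and is an illustrative assumption, not the engine's actual internal value.

```typescript
// Sketch of frame assembly for streamed audio: accept arbitrary-sized
// chunks, emit fixed hop-sized frames (16000 / 30 ≈ 533 samples per
// animation frame). Leftover samples stay buffered for the next chunk.
const SAMPLE_RATE = 16_000;
const FPS = 30;
const HOP = Math.round(SAMPLE_RATE / FPS); // 533 samples per frame

class FrameAssembler {
  private buffer: number[] = [];

  /** Feed one incoming chunk; returns every complete hop-sized frame. */
  push(chunk: Float32Array | number[]): number[][] {
    for (const s of chunk) this.buffer.push(s);
    const frames: number[][] = [];
    while (this.buffer.length >= HOP) {
      frames.push(this.buffer.splice(0, HOP)); // remove HOP samples from the front
    }
    return frames;
  }
}
```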
Rust/WASM + ONNX Runtime Web. No server, no API calls, no data leaves the browser. Works offline after first load.
Install from npm, initialize the engine, and start generating animation frames. Works with any Three.js + VRM setup.
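The integration flow is: initialize once, feed audio, apply each generated frame to the avatar. The sketch below shows only the frame-application step with a caller-supplied setter, because the engine's actual exported names are not shown in this section; `BlendshapeFrame` and `applyFrame` are hypothetical names, and in a real Three.js + VRM setup the setter would forward to the avatar's expression manager.

```typescript
// Sketch of applying one generated animation frame to an avatar.
// A frame is a map of ARKit blendshape names to 0..1 weights.
interface BlendshapeFrame {
  [arkitName: string]: number;
}

function applyFrame(
  frame: BlendshapeFrame,
  setWeight: (name: string, value: number) => void, // e.g. VRM expression setter
): void {
  for (const [name, value] of Object.entries(frame)) {
    // Clamp defensively: blendshape weights outside [0, 1] distort the mesh.
    setWeight(name, Math.min(1, Math.max(0, value)));
  }
}
```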
Two engines, one API surface. Pick the engine that fits your project.
Phoneme classification engine — 111-dim output with full expression control. Built-in IdleExpressionGenerator, VoiceActivityDetector, and VRM 18-dim mode.
Emotion model — 52-dim ARKit blendshape prediction with 5-dim FiLM conditioning (neutral, joy, anger, sadness, surprise).
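FiLM (feature-wise linear modulation) conditions a network by scaling and shifting hidden features per channel: y = γ(e) ⊙ x + β(e), where γ and β are projected from the conditioning vector e. A minimal sketch with the 5-dim emotion vector, using placeholder projection weights rather than the trained model's parameters:

```typescript
// Sketch of FiLM conditioning: the 5-dim emotion vector
// (neutral, joy, anger, sadness, surprise) is linearly projected to a
// per-channel scale gamma and shift beta, which modulate features x.
function film(
  x: number[],       // hidden features, one value per channel
  emotion: number[], // 5-dim weights, e.g. [0, 1, 0, 0, 0] = joy
  gammaW: number[][], // [channels][5] scale projection (placeholder weights)
  betaW: number[][],  // [channels][5] shift projection (placeholder weights)
): number[] {
  return x.map((xi, c) => {
    const gamma = gammaW[c].reduce((acc, w, k) => acc + w * emotion[k], 0);
    const beta = betaW[c].reduce((acc, w, k) => acc + w * emotion[k], 0);
    return gamma * xi + beta; // y = γ(e)·x + β(e)
  });
}
```

Because γ and β depend only on the emotion vector, the same speech features yield different mouth and face shapes as the emotion input changes.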
Interactive demos you can try right now — no install needed.
6-step interactive tutorial. Choose V1 or V2 engine, adjust emotion in real time (V2), load a VRM avatar, apply lip sync — with live demos at each step.
V1 phoneme engine — 111-dim output mapped to 52 ARKit blendshapes. ONNX inference with real-time visualization. Try it →

V2 emotion model — 52 ARKit blendshapes with 5-dim FiLM conditioning. Emotion-aware lip sync, real-time rendering. Try it →

Same voice input, two animation engines, two avatars. See the difference live in a dual-panel view. Try it →

Two engines for different needs. Both produce ARKit-compatible output at 30fps.
| Feature | V1 (recommended) | V2 |
|---|---|---|
| Output | 111-dim output, mapped to 52 ARKit blendshapes | 52-dim ARKit blendshapes |
| Architecture | Phoneme classification + viseme mapping | Emotion model + FiLM conditioning |
| Post-processing | OneEuroFilter + anatomical constraints | crisp_mouth + fade + auto-blink |
| Idle expressions | Built-in IdleExpressionGenerator | Blink injection in post-process |
| Voice activity | Built-in VoiceActivityDetector | — |
| Emotion control | — | 5-dim FiLM conditioning (neutral, joy, anger, sadness, surprise) |
| Best for | Full expression control, custom avatars | Emotion-aware lip sync, quick integration |
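The One Euro filter named in the V1 post-processing row is a published adaptive low-pass filter: it raises its cutoff when the signal moves fast (minimizing lag) and lowers it when the signal is slow (minimizing jitter). A sketch of the standard formulation follows; the parameter values are common defaults, not the engine's tuned settings.

```typescript
// Sketch of a One Euro filter for smoothing per-frame blendshape weights.
// alpha(cutoff, dt) converts a cutoff frequency into an exponential
// smoothing factor; the cutoff itself adapts to the estimated speed.
class OneEuroFilter {
  private xPrev: number | null = null;
  private dxPrev = 0;

  constructor(
    private minCutoff = 1.0, // Hz: smoothing floor for slow motion
    private beta = 0.05,     // speed coefficient: higher = less lag
    private dCutoff = 1.0,   // Hz: cutoff for the derivative estimate
  ) {}

  private static alpha(cutoff: number, dt: number): number {
    const tau = 1 / (2 * Math.PI * cutoff);
    return 1 / (1 + tau / dt);
  }

  filter(x: number, dt: number): number {
    if (this.xPrev === null) {
      this.xPrev = x; // first sample passes through unfiltered
      return x;
    }
    const ad = OneEuroFilter.alpha(this.dCutoff, dt);
    const dx = (x - this.xPrev) / dt;
    const dxHat = ad * dx + (1 - ad) * this.dxPrev; // smoothed speed
    const cutoff = this.minCutoff + this.beta * Math.abs(dxHat);
    const a = OneEuroFilter.alpha(cutoff, dt);
    const xHat = a * x + (1 - a) * this.xPrev; // adaptive low-pass
    this.xPrev = xHat;
    this.dxPrev = dxHat;
    return xHat;
  }
}
```

Running one filter per blendshape channel at dt = 1/30 s removes inference jitter without the mushy mouth shapes a fixed low-pass would produce.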