Overview
A prototype public-service kiosk: walk up, ask a question in Urdu, and a photoreal on-screen person answers — with a cloned voice and lip-synced video.
What it does
- Photoreal talking avatar — lip-sync generated with Wav2Lip over studio footage of a real presenter.
- Cloned natural voice — neural TTS shaped through an RVC voice model so the avatar sounds like the actual person.
- Local speech pipeline — speech recognition and synthesis run on local GPU hardware, with an LLM handling the intent layer.
- Intent-driven answers — a curated set of public-service intents, written for the local audience rather than translated.
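The stages listed above can be sketched as a simple chain. This is a minimal illustration, not the project's actual code: the function names and stub bodies are hypothetical, and the ordering (recognition, intent, TTS, RVC voice conversion, Wav2Lip) is inferred from the descriptions above. Each stub only tags its input so the data flow is visible.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the five models. A real stage would call
# the corresponding model; these stubs just wrap the input in a label.
def recognize_speech(audio: str) -> str:
    return f"text({audio})"

def resolve_intent(text: str) -> str:
    return f"answer({text})"

def synthesize_speech(answer: str) -> str:
    return f"tts({answer})"

def convert_voice(speech: str) -> str:
    return f"rvc({speech})"

def lip_sync(voiced: str) -> str:
    return f"wav2lip({voiced})"

# Assumed stage order, inferred from the feature list.
STAGES: List[Tuple[str, Callable[[str], str]]] = [
    ("asr", recognize_speech),
    ("intent", resolve_intent),
    ("tts", synthesize_speech),
    ("voice", convert_voice),
    ("lipsync", lip_sync),
]

def run_pipeline(audio: str) -> str:
    """Feed the output of each stage into the next one."""
    result = audio
    for _name, stage in STAGES:
        result = stage(result)
    return result

print(run_pipeline("mic_input"))
# prints: wav2lip(rvc(tts(answer(text(mic_input)))))
```

Keeping the stages in a plain list makes it easy to time, log, or swap any single stage without touching the others.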
Status
Working prototype, demoed to stakeholders. Originally scoped for Pashto, the project pivoted to Urdu as the primary language after stakeholder feedback; the architecture supports both languages.
My role
Everything: the speech pipeline, avatar rendering, intent design, and the GPU server it runs on. A fun reminder that the hard part of "AI kiosks" isn't the model — it's latency, audio handling, and making a pipeline of five models fail gracefully.
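One way to make a multi-model pipeline fail gracefully is to wrap every stage in a guard that falls back to a canned response. The sketch below assumes that approach; it is not taken from the project. `FALLBACK_CLIP`, `run_with_fallback`, and the 5-second budget are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, List, Tuple

# Hypothetical asset: a pre-recorded clip played when anything goes wrong.
FALLBACK_CLIP = "prerecorded_apology.mp4"

Stage = Tuple[str, Callable[[str], str]]

def run_with_fallback(
    stages: List[Stage], payload: str, budget_s: float = 5.0
) -> Tuple[str, str]:
    """Run stages in order; return (result, status).

    If a stage raises, or the elapsed time exceeds the latency budget,
    return the fallback clip along with the reason.
    """
    start = time.monotonic()
    for name, stage in stages:
        try:
            payload = stage(payload)
        except Exception as exc:
            return FALLBACK_CLIP, f"{name} failed: {exc}"
        # Checked after the stage returns, so it cannot interrupt a hung stage.
        if time.monotonic() - start > budget_s:
            return FALLBACK_CLIP, f"{name} exceeded latency budget"
    return payload, "ok"

def broken_asr(audio: str) -> str:
    raise ValueError("no audio")

print(run_with_fallback([("asr", str.upper)], "hello"))
print(run_with_fallback([("asr", broken_asr)], "hello"))
```

Because the budget is only checked between stages, a production version would likely also need per-stage timeouts (for example, running each model call in a worker that can be abandoned).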