Persona-1-Live
Today, we’re excited to share Persona-1-Live, a real-time streaming variant of Persona-1 that can hold dynamic, human-feeling conversations with low latency.
Architecture Overview
Persona-1-Live enhances Persona-1’s flow-based image transformer with a next-token prediction scheme optimized for interactive video generation. Crucially, our implementation enables conversational interactivity without sacrificing Persona-1’s speed, expressive fidelity, or hallmark low cost.
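To illustrate why a next-token scheme suits interactive use, here is a toy sketch of a streaming loop. Each output chunk is predicted from the chunks emitted so far plus the most recent audio, so output can be consumed before the full sequence exists. Everything in it is a hypothetical stand-in for illustration (the `stream_chunks` and `toy_predictor` names, the integer "chunks", the context size); it is not Persona-1-Live's actual model or API.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List


def stream_chunks(
    audio_chunks: Iterable[int],
    predict_next_chunk: Callable[[List[int], int], int],
    context_size: int = 4,
) -> Iterator[int]:
    """Yield one output chunk per incoming audio chunk.

    Hypothetical sketch: `predict_next_chunk` stands in for a model call
    that takes the recent output history and the newest audio chunk.
    """
    # Bounded history: only the last `context_size` outputs are kept.
    history: Deque[int] = deque(maxlen=context_size)
    for audio in audio_chunks:
        chunk = predict_next_chunk(list(history), audio)
        history.append(chunk)
        # Emit immediately, so a caller can render this chunk while
        # later audio is still arriving.
        yield chunk


def toy_predictor(history: List[int], audio: int) -> int:
    """Placeholder 'model': sum of recent outputs plus the new input."""
    return sum(history) + audio


frames = list(stream_chunks([1, 2, 3], toy_predictor))
```

Because `stream_chunks` is a generator, a caller can start consuming output after the first audio chunk rather than waiting for the whole turn, which is the property that matters for conversational use.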
New prediction scheme
Empirically, we’ve also found the new next-token prediction scheme exploits our relatively small training sets more effectively, resulting in broader and more nuanced expression diversity.
Limitations & Future Work
We’ve discovered that our current audio encoder performs best with longer windows, much longer than the new next-token prediction scheme likely needs. As a result, the typical delay between the end of a user’s turn and the persona’s response is roughly six seconds. Experiments with alternative audio encoders are showing promising results, and we expect to bring this delay down by roughly 50%, to about three seconds, in the near future.
Pricing & Availability
Persona-1-Live is currently in early access, with pricing comparable to Persona-1: just a few cents per minute, roughly half the cost of the cheapest competing models.
We’re excited to see what sorts of novel applications today’s live streaming upgrade to Persona-1 enables. If you’d like to experiment with the model or collaborate on new use cases, apply here.
