Overview
Integrated Kyutai Labs' 100M-parameter Pocket TTS model as a lightweight, CPU-only text-to-speech option for AssetFlow. Unlike the GPU-dependent F5-TTS and Qwen3-TTS engines, Pocket TTS runs on consumer hardware with only 2 CPU cores while delivering inference roughly 6x faster than real time.
The model achieves approximately 200ms latency to first audio chunk, making it viable for near-real-time applications. It supports voice cloning from reference audio and handles unlimited-length text input through automatic chunking. This gives AssetFlow a cost-effective TTS option that doesn't require GPU infrastructure.
Key Features
- 100M-parameter model running on CPU only (2 cores)
- 6x faster than real-time inference speed
- ~200ms latency to first audio chunk
- Voice cloning from reference audio samples
- Automatic text chunking for unlimited-length input
- No GPU infrastructure required—runs on consumer hardware
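To put the throughput figure above in context, a 6x real-time factor means a minute of audio takes about ten seconds of compute. A quick sketch using only the numbers from this page:

```python
# Back-of-envelope timing from the figures above; nothing here calls
# the model, it just applies the stated real-time factor.

RTF = 6.0  # 6x faster than real time

def synthesis_time_s(audio_duration_s: float) -> float:
    """Approximate wall-clock seconds to synthesize a clip of given length."""
    return audio_duration_s / RTF

print(synthesis_time_s(60.0))  # one minute of audio -> ~10 s of compute
```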
Architecture
Text Input → Auto-Chunking → Pocket TTS (100M) → Audio Concat → Output
                                    ↑
             Reference Audio → Voice Clone embedding

CPU-only (2 cores) · ~200ms to first audio chunk
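The chunk-and-concatenate pipeline above can be sketched in a few lines. This is illustrative only: the model call is stubbed out, and `MAX_CHARS` is an assumed per-chunk budget, not Pocket TTS's actual context limit.

```python
# Sketch of: text -> auto-chunking -> per-chunk synthesis -> concat.
import re

MAX_CHARS = 300  # assumed per-chunk budget for illustration

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks under the budget.

    A single sentence longer than the budget becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize(chunk: str, voice_embedding=None) -> list[float]:
    """Stand-in for the Pocket TTS call; returns placeholder PCM samples."""
    return [0.0] * len(chunk)

def tts(text: str, voice_embedding=None) -> list[float]:
    """Unlimited-length input: synthesize each chunk, concatenate the audio."""
    audio: list[float] = []
    for chunk in chunk_text(text):
        audio += synthesize(chunk, voice_embedding)
    return audio
```

Sentence-boundary splitting keeps prosody natural at chunk seams; a character budget is a crude proxy for the model's real tokenized context window.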
Integration:
AssetFlow → TTS Engine Selector → Pocket TTS (CPU) / F5-TTS (GPU) / Qwen3-TTS (GPU)
Text is automatically chunked into segments sized for the model's context window. Each chunk runs through the Pocket TTS model on CPU with voice cloning from a reference audio embedding. Chunks are concatenated into the final audio output. Within AssetFlow, the TTS engine selector lets users choose between Pocket TTS (free, CPU-based) and the GPU-based alternatives depending on quality and speed requirements.
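The selector logic can be sketched as a simple dispatch: prefer the requested engine, but fall back to CPU-only Pocket TTS when a GPU engine is requested and no GPU is present. Engine names mirror this page; the `gpu_available` flag stands in for a real device probe.

```python
# Sketch of the TTS engine selector: GPU engines when possible,
# Pocket TTS as the CPU fallback.
from dataclasses import dataclass

@dataclass(frozen=True)
class Engine:
    name: str
    requires_gpu: bool

ENGINES = {
    "pocket-tts": Engine("pocket-tts", requires_gpu=False),
    "f5-tts": Engine("f5-tts", requires_gpu=True),
    "qwen3-tts": Engine("qwen3-tts", requires_gpu=True),
}

def select_engine(requested: str, gpu_available: bool) -> Engine:
    engine = ENGINES[requested]
    if engine.requires_gpu and not gpu_available:
        return ENGINES["pocket-tts"]  # CPU fallback keeps TTS working
    return engine
```

The fallback means AssetFlow always has a working TTS path, even on hosts with no GPU at all.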
Sample Output
Voice Clone Sample