Overview
Integrated Kyutai Labs' 100M-parameter Pocket TTS model as a lightweight, CPU-only text-to-speech option for AssetFlow. Unlike the GPU-dependent F5-TTS and Qwen3-TTS engines, Pocket TTS runs on consumer hardware with only 2 CPU cores while delivering inference roughly 6x faster than real time.
The model achieves approximately 200ms latency to first audio chunk, making it viable for near-real-time applications. It supports voice cloning from reference audio and handles unlimited-length text input through automatic chunking. This gives AssetFlow a cost-effective TTS option that doesn't require GPU infrastructure.
Key Features
- 100M-parameter model running on CPU only (2 cores)
- 6x faster than real-time inference speed
- ~200ms latency to first audio chunk
- Voice cloning from reference audio samples
- Automatic text chunking for unlimited-length input
- No GPU infrastructure required—runs on consumer hardware
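To put the throughput figure above in context, a 6x real-time factor means a minute of audio takes about ten seconds of compute. A quick sketch using only the numbers from this page:

```python
# Back-of-envelope timing from the figures above; nothing here calls
# the model, it just applies the stated real-time factor.

RTF = 6.0  # 6x faster than real time

def synthesis_time_s(audio_duration_s: float) -> float:
    """Approximate wall-clock seconds to synthesize a clip of given length."""
    return audio_duration_s / RTF

print(synthesis_time_s(60.0))  # one minute of audio -> ~10 s of compute
```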
Architecture
Text Input → Auto-Chunking → Pocket TTS (100M) → Audio Concat → Output
                                    ↑
             Reference Audio → Voice Clone embedding

CPU-only (2 cores) · ~200ms to first audio chunk
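The chunk-and-concatenate pipeline above can be sketched in a few lines. This is illustrative only: the model call is stubbed out, and `MAX_CHARS` is an assumed per-chunk budget, not Pocket TTS's actual context limit.

```python
# Sketch of: text -> auto-chunking -> per-chunk synthesis -> concat.
import re

MAX_CHARS = 300  # assumed per-chunk budget for illustration

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks under the budget.

    A single sentence longer than the budget becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize(chunk: str, voice_embedding=None) -> list[float]:
    """Stand-in for the Pocket TTS call; returns placeholder PCM samples."""
    return [0.0] * len(chunk)

def tts(text: str, voice_embedding=None) -> list[float]:
    """Unlimited-length input: synthesize each chunk, concatenate the audio."""
    audio: list[float] = []
    for chunk in chunk_text(text):
        audio += synthesize(chunk, voice_embedding)
    return audio
```

Sentence-boundary splitting keeps prosody natural at chunk seams; a character budget is a crude proxy for the model's real tokenized context window.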
Integration:
AssetFlow → TTS Engine Selector → Pocket TTS (CPU) / F5-TTS (GPU) / Qwen3-TTS (GPU)
Text is automatically chunked into segments sized for the model's context window. Each chunk runs through the Pocket TTS model on CPU with voice cloning from a reference audio embedding. Chunks are concatenated into the final audio output. Within AssetFlow, the TTS engine selector lets users choose between Pocket TTS (free, CPU-based) and the GPU-based alternatives depending on quality and speed requirements.
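The selector logic can be sketched as a simple dispatch: prefer the requested engine, but fall back to CPU-only Pocket TTS when a GPU engine is requested and no GPU is present. Engine names mirror this page; the `gpu_available` flag stands in for a real device probe.

```python
# Sketch of the TTS engine selector: GPU engines when possible,
# Pocket TTS as the CPU fallback.
from dataclasses import dataclass

@dataclass(frozen=True)
class Engine:
    name: str
    requires_gpu: bool

ENGINES = {
    "pocket-tts": Engine("pocket-tts", requires_gpu=False),
    "f5-tts": Engine("f5-tts", requires_gpu=True),
    "qwen3-tts": Engine("qwen3-tts", requires_gpu=True),
}

def select_engine(requested: str, gpu_available: bool) -> Engine:
    engine = ENGINES[requested]
    if engine.requires_gpu and not gpu_available:
        return ENGINES["pocket-tts"]  # CPU fallback keeps TTS working
    return engine
```

The fallback means AssetFlow always has a working TTS path, even on hosts with no GPU at all.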
Sample Output
Voice Clone Sample