

Qwen3-TTS Voice Cloning

1.7B-parameter model on serverless GPU

Production · Deployed on RunPod

Overview

Deployed Qwen3-TTS, a 1.7-billion-parameter text-to-speech model, on serverless GPU as an additional voice cloning endpoint alongside F5-TTS. The model runs with flash attention for efficient inference, producing high-quality voice synthesis with natural prosody and intonation.

Generated audio is stored on S3 for reliable delivery and integration with the AssetFlow pipeline. The deployment provides a second TTS engine option, giving the platform flexibility to choose the best voice quality for different content types.
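As a rough sketch of how a client might invoke the serverless endpoint: the payload field names, the `RUNPOD_API_KEY` variable, and the endpoint URL shape are assumptions for illustration, not the deployed API contract (RunPod serverless does wrap request bodies in an `"input"` object).

```python
import json
import os
import urllib.request


def build_payload(text: str, reference_audio_url: str) -> dict:
    """Assemble the request body for a synthesis job.

    Field names inside "input" are illustrative; the real handler's
    input schema may differ.
    """
    return {"input": {"text": text, "reference_audio": reference_audio_url}}


def synthesize(endpoint_url: str, payload: dict) -> dict:
    """POST a job to the RunPod serverless endpoint and return the JSON reply."""
    req = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('RUNPOD_API_KEY', '')}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


payload = build_payload("Hello from Qwen3-TTS.", "s3://bucket/reference.wav")
```

The reply would carry a pointer to the generated audio (e.g. its S3 location) rather than raw bytes, keeping responses small.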

Key Features

  • 1.7-billion-parameter model with natural prosody
  • Flash attention for efficient GPU memory usage and faster inference
  • Voice cloning from reference audio samples
  • S3 storage integration for generated audio
  • RunPod Serverless deployment with auto-scaling
  • Custom Docker container with optimized dependencies
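A minimal sketch of what the custom container's Dockerfile might look like. The base image tag, package list, and `handler.py` entrypoint are assumptions for illustration; the actual deployment may pin different versions.

```dockerfile
# Assumed base image; the real build may use a different CUDA/PyTorch combo.
FROM runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

WORKDIR /app

# flash-attn compiles against the CUDA toolchain in the base image,
# so it is installed separately with build isolation disabled.
RUN pip install --no-cache-dir runpod boto3 transformers && \
    pip install --no-cache-dir flash-attn --no-build-isolation

COPY handler.py .

CMD ["python", "-u", "handler.py"]
```

Baking the model weights or a download step into the image trades image size against cold-start latency on the serverless workers.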

Architecture

Text Input → Tokenization → Qwen3-TTS (1.7B, Flash Attention) → Audio Generation → S3 Upload
                                      ↑
                Reference Audio → Speaker Embedding (Voice Cloning)

Deployment:
Docker Container → RunPod Serverless GPU → API Endpoint → AssetFlow

Text is tokenized and processed by the Qwen3-TTS model with flash attention enabled for memory-efficient inference. A speaker embedding derived from reference audio drives voice cloning. The generated audio is uploaded to S3 and made available to the AssetFlow pipeline for narration workflows.

Sample Output

Voice Clone Sample

Tech Stack

Qwen3-TTS · PyTorch · Flash Attention · Docker · RunPod Serverless · S3 · Python

I build and deploy production AI systems.

Let's talk about your next project.