Overview
I fine-tuned the F5-TTS model on approximately 2 hours of custom voice recordings for production-quality voice cloning. The model runs on serverless GPU infrastructure in a custom Docker container built on CUDA 12.1, behind an API that supports multiple output formats.
During development, I identified and documented reference audio requirements, absent from the upstream documentation, that eliminate common synthesis artifacts; meeting them was essential to consistently natural-sounding output. The deployment uses RunPod Serverless for auto-scaling GPU inference with cold-start optimization.
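The specific requirements aren't reproduced here, but the idea can be sketched as a pre-flight check on reference clips. In the sketch below, the duration, sample-rate, and channel thresholds are illustrative assumptions, not the documented values from this project:

```python
import wave

# Illustrative pre-flight check for a WAV reference clip. The thresholds
# below are placeholder assumptions, not this project's documented values.
MIN_SECONDS = 3.0
MAX_SECONDS = 12.0
EXPECTED_RATE = 24_000  # assumed model sample rate

def check_reference_audio(path: str) -> list[str]:
    """Return a list of problems found in a WAV reference clip."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
        if rate != EXPECTED_RATE:
            problems.append(f"sample rate {rate} Hz, expected {EXPECTED_RATE}")
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(
                f"duration {duration:.1f}s outside [{MIN_SECONDS}, {MAX_SECONDS}]s"
            )
        if wf.getnchannels() != 1:
            problems.append("expected mono audio")
    return problems
```

Running a check like this before synthesis turns silent quality regressions into explicit errors.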
Key Features
- Fine-tuned on ~2 hours of custom voice recordings
- Custom Docker container with CUDA 12.1 and optimized dependencies
- API supporting WAV, MP3, and streaming output formats
- Serverless GPU deployment with auto-scaling on RunPod
- Documented reference audio requirements for artifact-free synthesis
- Integrated as a TTS engine option in AssetFlow
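The API surface isn't documented on this page; as a sketch, a synthesis request might carry the text, a voice identifier, and one of the supported output formats. The field names and payload shape below are assumptions — only the format set (WAV, MP3, streaming) comes from the feature list:

```python
# Sketch of a request payload for the synthesis API. The field names
# ("text", "voice", "format") are assumptions; only the supported formats
# come from the feature list above.
SUPPORTED_FORMATS = {"wav", "mp3", "stream"}

def build_tts_request(text: str, output_format: str = "wav",
                      voice: str = "default") -> dict:
    """Build a JSON-serializable payload for a TTS synthesis request."""
    if not text.strip():
        raise ValueError("text must be non-empty")
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")
    return {"input": {"text": text, "voice": voice, "format": output_format}}
```

Validating the format client-side fails fast instead of burning a GPU cold start on a bad request.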
Architecture
Text Input → Chunking → F5-TTS Model → Audio Post-processing → Output (WAV/MP3)
                             ↑
     Reference Audio → Voice Embedding
     (Custom recordings)
Deployment:
Docker Container (CUDA 12.1) → RunPod Serverless → API Endpoint → AssetFlow
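RunPod Serverless wraps inference in a handler function that receives a job payload. A minimal handler sketch follows — `synthesize` is a placeholder for the actual fine-tuned F5-TTS inference and FFmpeg conversion, and the event schema follows RunPod's convention of a JSON body under `event["input"]`:

```python
import base64

def synthesize(text: str, fmt: str) -> bytes:
    """Placeholder for fine-tuned F5-TTS inference + FFmpeg conversion."""
    return b"RIFF...audio bytes..."  # stand-in output

def handler(event: dict) -> dict:
    """RunPod-style job handler: validate input, synthesize, return JSON."""
    job = event.get("input", {})
    text = job.get("text", "")
    fmt = job.get("format", "wav")
    if not text:
        return {"error": "missing 'text'"}
    audio = synthesize(text, fmt)
    # Audio is base64-encoded so the JSON response stays valid.
    return {"format": fmt, "audio_b64": base64.b64encode(audio).decode()}

def main() -> None:
    # Requires the runpod SDK inside the container image; imported here so
    # the handler itself stays testable without it.
    import runpod
    runpod.serverless.start({"handler": handler})
```

Keeping model loading at module import time (outside the handler) is what makes cold-start optimization matter: the weights load once per container, not once per request.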
Text is chunked into optimal segments, then processed by the fine-tuned F5-TTS model using a voice embedding derived from reference audio recordings. The output goes through post-processing (normalization, format conversion via FFmpeg) before delivery. The entire pipeline runs in a custom Docker container on RunPod's serverless GPU infrastructure.
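The exact chunking strategy isn't specified above; one simple approach is to split on sentence boundaries and pack sentences up to a character budget, so each segment stays within the model's comfortable range. The 250-character budget below is an illustrative assumption, not this project's actual tuning:

```python
import re

# Illustrative text chunker: split on sentence boundaries, then pack
# sentences into chunks under a character budget.
MAX_CHARS = 250  # assumed budget, not the project's actual value

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # A single over-long sentence is kept whole rather than
            # split mid-word.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Sentence-aligned chunks avoid cutting mid-phrase, which is one common source of prosody artifacts at chunk boundaries.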
Tech Stack
- F5-TTS (fine-tuned checkpoint)
- Python
- Docker (CUDA 12.1 base image)
- FFmpeg (audio normalization and format conversion)
- RunPod Serverless
- AssetFlow (TTS engine integration)