Overview
I fine-tuned the F5-TTS model on approximately 2 hours of custom voice recordings for production-quality voice cloning. The model runs on serverless GPU infrastructure in a custom Docker container built on CUDA 12.1, behind an API that supports multiple output formats.
During development, I identified and documented reference audio requirements, absent from the upstream documentation, that eliminate common synthesis artifacts; meeting them was essential to consistently natural-sounding output. The deployment uses RunPod Serverless for auto-scaling GPU inference with cold-start optimization.
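The specific requirements aren't reproduced here, but the idea can be sketched as a pre-flight check on reference clips. In the sketch below, the duration, sample-rate, and channel thresholds are illustrative assumptions, not the documented values from this project:

```python
import wave

# Illustrative pre-flight check for a WAV reference clip. The thresholds
# below are placeholder assumptions, not this project's documented values.
MIN_SECONDS = 3.0
MAX_SECONDS = 12.0
EXPECTED_RATE = 24_000  # assumed model sample rate

def check_reference_audio(path: str) -> list[str]:
    """Return a list of problems found in a WAV reference clip."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
        if rate != EXPECTED_RATE:
            problems.append(f"sample rate {rate} Hz, expected {EXPECTED_RATE}")
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(
                f"duration {duration:.1f}s outside [{MIN_SECONDS}, {MAX_SECONDS}]s"
            )
        if wf.getnchannels() != 1:
            problems.append("expected mono audio")
    return problems
```

Running a check like this before synthesis turns silent quality regressions into explicit errors.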
Key Features
- Fine-tuned on ~2 hours of custom voice recordings
- Custom Docker container with CUDA 12.1 and optimized dependencies
- API supporting WAV, MP3, and streaming output formats
- Serverless GPU deployment with auto-scaling on RunPod
- Documented reference audio requirements for artifact-free synthesis
- Integrated as a TTS engine option in AssetFlow
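The API surface isn't documented on this page; as a sketch, a synthesis request might carry the text, a voice identifier, and one of the supported output formats. The field names and payload shape below are assumptions — only the format set (WAV, MP3, streaming) comes from the feature list:

```python
# Sketch of a request payload for the synthesis API. The field names
# ("text", "voice", "format") are assumptions; only the supported formats
# come from the feature list above.
SUPPORTED_FORMATS = {"wav", "mp3", "stream"}

def build_tts_request(text: str, output_format: str = "wav",
                      voice: str = "default") -> dict:
    """Build a JSON-serializable payload for a TTS synthesis request."""
    if not text.strip():
        raise ValueError("text must be non-empty")
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")
    return {"input": {"text": text, "voice": voice, "format": output_format}}
```

Validating the format client-side fails fast instead of burning a GPU cold start on a bad request.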
Architecture
Text Input → Chunking → F5-TTS Model → Audio Post-processing → Output (WAV/MP3)
                             ↑
     Reference Audio → Voice Embedding
     (Custom recordings)
Deployment:
Docker Container (CUDA 12.1) → RunPod Serverless → API Endpoint → AssetFlow
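RunPod Serverless wraps inference in a handler function that receives a job payload. A minimal handler sketch follows — `synthesize` is a placeholder for the actual fine-tuned F5-TTS inference and FFmpeg conversion, and the event schema follows RunPod's convention of a JSON body under `event["input"]`:

```python
import base64

def synthesize(text: str, fmt: str) -> bytes:
    """Placeholder for fine-tuned F5-TTS inference + FFmpeg conversion."""
    return b"RIFF...audio bytes..."  # stand-in output

def handler(event: dict) -> dict:
    """RunPod-style job handler: validate input, synthesize, return JSON."""
    job = event.get("input", {})
    text = job.get("text", "")
    fmt = job.get("format", "wav")
    if not text:
        return {"error": "missing 'text'"}
    audio = synthesize(text, fmt)
    # Audio is base64-encoded so the JSON response stays valid.
    return {"format": fmt, "audio_b64": base64.b64encode(audio).decode()}

def main() -> None:
    # Requires the runpod SDK inside the container image; imported here so
    # the handler itself stays testable without it.
    import runpod
    runpod.serverless.start({"handler": handler})
```

Keeping model loading at module import time (outside the handler) is what makes cold-start optimization matter: the weights load once per container, not once per request.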
Text is chunked into optimal segments, then processed by the fine-tuned F5-TTS model using a voice embedding derived from reference audio recordings. The output goes through post-processing (normalization, format conversion via FFmpeg) before delivery. The entire pipeline runs in a custom Docker container on RunPod's serverless GPU infrastructure.
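The exact chunking strategy isn't specified above; one simple approach is to split on sentence boundaries and pack sentences up to a character budget, so each segment stays within the model's comfortable range. The 250-character budget below is an illustrative assumption, not this project's actual tuning:

```python
import re

# Illustrative text chunker: split on sentence boundaries, then pack
# sentences into chunks under a character budget.
MAX_CHARS = 250  # assumed budget, not the project's actual value

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # A single over-long sentence is kept whole rather than
            # split mid-word.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Sentence-aligned chunks avoid cutting mid-phrase, which is one common source of prosody artifacts at chunk boundaries.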
Tech Stack
- F5-TTS (fine-tuned checkpoint)
- Python
- Docker (CUDA 12.1 base image)
- FFmpeg (audio normalization and format conversion)
- RunPod Serverless
- AssetFlow (TTS engine integration)