Project
voice-training — A Local TTS Fine-Tuning Pipeline
A fully on-device, FOSS pipeline that fine-tunes a custom text-to-speech voice from a handful of audio samples: decode, loudness-normalize, segment, transcribe, fine-tune, synthesize, evaluate. It is the model factory that produces Jarvis's voice, and it was built with AI assistance.
Jarvis needed a voice that wasn’t a cloud API call. voice-training is the model
factory that makes one — and it runs entirely on-device, with no audio uploaded
to a cloud voice vendor.
The problem
Turn a small set of audio samples into a usable, custom text-to-speech voice (locally, reproducibly, and packaged to release quality), then hand that voice to a separate assistant as a black box.
The pipeline
A narrow scope, done end to end:
decode → loudness-normalize → segment → transcribe → fine-tune / clone → synthesize → evaluate
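One way to picture that chain is as an ordered list of stage functions, each taking the pipeline state and returning an updated copy. The sketch below is illustrative only: the `State` dictionary, `make_placeholder`, and `run_pipeline` are hypothetical names, and the stages are stand-ins that only record that they ran, not the project's real implementations.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Stage = Callable[[State], State]

# Stage names taken from the chain above ("fine-tune / clone" shortened).
STAGE_ORDER: List[str] = [
    "decode",
    "loudness-normalize",
    "segment",
    "transcribe",
    "fine-tune",
    "synthesize",
    "evaluate",
]


def make_placeholder(name: str) -> Stage:
    """Return a stand-in stage that only records that it ran (hypothetical)."""

    def stage(state: State) -> State:
        history = list(state.get("history", []))
        history.append(name)
        # Return a new dict so each stage's input is left unmodified.
        return {**state, "history": history}

    return stage


def run_pipeline(stages: List[Tuple[str, Stage]], state: State) -> State:
    """Apply each stage in order, threading the state through."""
    for _name, stage in stages:
        state = stage(state)
    return state


stages = [(name, make_placeholder(name)) for name in STAGE_ORDER]
result = run_pipeline(stages, {"source": "samples/"})
```

Keeping every stage behind the same signature is one way to make a single step easy to rerun or swap without touching the others.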
Clean data beats a fancy model
Most of the work is the data contract, and the pipeline enforces it: every training clip is mono, at a single sample rate (24 kHz), loudness-normalized to ≈ −23 LUFS, two to twelve seconds long, single-speaker, and free of any music bed. Voice-activity detection segments the source, faster-whisper transcribes it, and anything that fails the contract is dropped rather than left to poison the model.
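A contract like this can be expressed as a small validation function that runs over per-clip metadata. The following is a minimal sketch, not the project's actual code: the `ClipInfo` fields, the function names, and the ±1 LU loudness tolerance are all assumptions made for illustration, and the measurements themselves (loudness, speaker count, music detection) are assumed to come from earlier stages.

```python
from dataclasses import dataclass
from typing import List

TARGET_SAMPLE_RATE = 24_000   # Hz, from the contract
TARGET_LUFS = -23.0           # from the contract
LUFS_TOLERANCE = 1.0          # assumption: how far "≈" is allowed to drift
MIN_SECONDS = 2.0
MAX_SECONDS = 12.0


@dataclass(frozen=True)
class ClipInfo:
    """Measurements for one candidate clip (hypothetical structure)."""
    channels: int
    sample_rate: int
    duration_s: float
    lufs: float
    speakers: int
    has_music: bool


def contract_violations(clip: ClipInfo) -> List[str]:
    """Return every contract rule the clip breaks; empty means it passes."""
    problems: List[str] = []
    if clip.channels != 1:
        problems.append("not mono")
    if clip.sample_rate != TARGET_SAMPLE_RATE:
        problems.append("wrong sample rate")
    if not (MIN_SECONDS <= clip.duration_s <= MAX_SECONDS):
        problems.append("duration outside 2-12 s")
    if abs(clip.lufs - TARGET_LUFS) > LUFS_TOLERANCE:
        problems.append("loudness off target")
    if clip.speakers != 1:
        problems.append("not single-speaker")
    if clip.has_music:
        problems.append("music bed present")
    return problems


def keep_valid(clips: List[ClipInfo]) -> List[ClipInfo]:
    """Drop any clip that fails the contract instead of repairing it."""
    return [clip for clip in clips if not contract_violations(clip)]
```

Returning the list of broken rules, rather than a bare boolean, makes it possible to log why a clip was dropped.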
The model
The pipeline fine-tunes Coqui XTTS-v2 (and can also use it for zero-shot
cloning), and I tuned the decoding configuration (sampling temperature, top-k,
top-p) to lock in a consistent delivery rather than a different read every run.
The result is a reproducible text → speech path with a pinned config.
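Pinning a decoding configuration can be as simple as freezing the sampling parameters in an immutable object and fingerprinting it, so any drift is detectable. The sketch below assumes that approach. The class name, the numeric values, and the `seed` field are placeholders for illustration, not the project's tuned settings, and nothing here calls the XTTS API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Sampling parameters held fixed across runs (values are placeholders)."""
    temperature: float = 0.65
    top_k: int = 50
    top_p: float = 0.85
    seed: int = 1234  # assumption: a fixed RNG seed is part of the pin

    def fingerprint(self) -> str:
        """Short, stable hash of the config for logs and output metadata."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


PINNED = DecodingConfig()
```

Recording `PINNED.fingerprint()` alongside each synthesized file would let a later run confirm it used the same decoding settings.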
A note on scope
This is a personal, non-commercial project: a study in voice cloning and audio-data engineering. The synthesized output is a private, fan-made voice — never presented as a real person or performer, and never used for impersonation.
Why it’s here
Even set apart from the voice it produces, the engineering stands on its own: real model fine-tuning (not an API call), audio-signal-processing and data-curation discipline (LUFS normalization, VAD segmentation, a hard data contract), reproducible packaging, and a clean train-here / serve-there seam.