voice-training — A Local TTS Fine-Tuning Pipeline

Jarvis needed a voice that wasn’t a cloud API call. voice-training is the model factory that makes one. It runs entirely on-device, and no audio is uploaded to a cloud voice vendor.

The problem

Turn a small set of audio samples into a usable, custom text-to-speech voice (locally, reproducibly, and at a quality fit for packaging), then hand that voice to a separate assistant as a black box.

The pipeline

A narrow scope, done end to end:

decode → loudness-normalize → segment → transcribe → fine-tune / clone → synthesize → evaluate

Clean data beats a fancy model

Most of the work is the data contract, and the pipeline enforces it: every training clip is mono, at a single sample rate (24 kHz), loudness-normalized to ≈ −23 LUFS, two to twelve seconds long, single-speaker, and free of any music bed. Voice-activity detection segments the source, faster-whisper transcribes each segment, and anything that fails the contract is dropped rather than left to poison the model.
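As a rough illustration of what enforcing that contract could look like, here is a minimal sketch in plain Python. It is not the project's actual code: the `ClipInfo` record, the `contract_violations` and `keep_clean` helpers, and the ±1 LU loudness tolerance are all assumptions made for the example. Only the thresholds themselves (mono, 24 kHz, about −23 LUFS, 2 to 12 seconds, one speaker, no music) come from the text above.

```python
from dataclasses import dataclass
from typing import List

# Values taken from the data contract described above.
TARGET_SAMPLE_RATE_HZ = 24_000
TARGET_LUFS = -23.0
MIN_SECONDS = 2.0
MAX_SECONDS = 12.0

# Assumption: the text says "about -23 LUFS" without stating a tolerance.
LUFS_TOLERANCE = 1.0


@dataclass(frozen=True)
class ClipInfo:
    """Hypothetical per-clip measurements produced by earlier stages."""
    channels: int
    sample_rate_hz: int
    lufs: float
    seconds: float
    speakers: int
    has_music: bool


def contract_violations(clip: ClipInfo) -> List[str]:
    """Return every rule the clip breaks; an empty list means it passes."""
    problems: List[str] = []
    if clip.channels != 1:
        problems.append("not mono")
    if clip.sample_rate_hz != TARGET_SAMPLE_RATE_HZ:
        problems.append("wrong sample rate")
    if abs(clip.lufs - TARGET_LUFS) > LUFS_TOLERANCE:
        problems.append("loudness off target")
    if not MIN_SECONDS <= clip.seconds <= MAX_SECONDS:
        problems.append("duration out of range")
    if clip.speakers != 1:
        problems.append("not single-speaker")
    if clip.has_music:
        problems.append("music bed present")
    return problems


def keep_clean(clips: List[ClipInfo]) -> List[ClipInfo]:
    """Drop any clip that fails the contract instead of training on it."""
    return [c for c in clips if not contract_violations(c)]
```

Returning the list of violations, rather than a bare pass/fail flag, makes it easy to log why a clip was dropped, which helps when tuning the segmentation stage upstream.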

The model

The pipeline fine-tunes (and can zero-shot clone with) Coqui XTTS-v2. I tuned the decoding configuration (sampling temperature, top-k, top-p) to lock in a consistent delivery rather than a different read every run. The result is a reproducible text → speech path with a pinned config.
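One way to make "pinned" concrete is to freeze the decoding settings in an immutable record and derive a short fingerprint from them, so any synthesized clip can be traced back to the exact settings that produced it. The sketch below is an assumption about how that could be done, not the project's code. The `DecodingConfig` class, the `fingerprint` helper, the `seed` field, and every numeric value are placeholders chosen for illustration, not the tuned values.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical pinned sampling settings; the numbers are placeholders."""
    temperature: float = 0.65
    top_k: int = 50
    top_p: float = 0.85
    seed: int = 1234  # assumption: a fixed seed is part of the pinned config


def fingerprint(cfg: DecodingConfig) -> str:
    """Stable short hash of the config, independent of field order."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


pinned = DecodingConfig()
# The same settings always hash to the same tag; any change produces a new one.
print(fingerprint(pinned))
```

Storing the fingerprint next to each generated file gives a cheap check that two runs used identical decoding settings, without having to compare the fields by hand.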

A note on scope

This is a personal, non-commercial project: a study in voice cloning and audio-data engineering. The synthesized output is a private, fan-made voice — never presented as a real person or performer, and never used for impersonation.

Why it’s here

Even set apart from the voice it produces, the engineering stands on its own: real model fine-tuning (not an API call), audio-signal-processing and data-curation discipline (LUFS normalization, VAD segmentation, a hard data contract), reproducible packaging, and a clean train-here / serve-there seam.