Project
Jarvis — A Local-First Voice Assistant
The LLM-era successor to Walter: a local-first voice assistant — wake word, voice-activity detection, streaming speech-to-text, a pluggable LLM brain that calls tools over MCP, and a fine-tuned voice — all wrapped around a single security gate. Built AI-assisted, in active development.
Walter was the pre-LLM version of this question: how far would the speech stack of its day carry a homemade assistant? Jarvis is the answer once foundation models arrived — and a deliberate bet on keeping the whole thing local.
The problem
I wanted an assistant that could listen, reason, and act — control a PC, a home-automation setup, a server — without handing a live microphone feed and my home to a vendor’s cloud. The reasoning could come from a frontier model, but the pipeline around it, and the option to run entirely offline, should be mine.
Approach — a streaming pipeline around a reasoning brain
A voice assistant is really a streaming pipeline wrapped around a brain that can call tools:
mic → wake → VAD → STT → BRAIN ⇄ tools (MCP) → TTS → speaker
- Wake / endpoint: a registry of openWakeWord models runs continuously; Silero VAD closes the utterance.
- STT: faster-whisper, streaming.
- Brain: a pluggable LLM — a cloud model or a fully-local one — that reasons and calls tools.
- TTS: a fine-tuned voice (see the sibling fine-tuning pipeline), with a small post-processing chain.
The lever that makes it feel instant is sentence streaming — the brain emits a sentence, it goes straight to speech, and audio starts playing while the next sentence is still being generated. The pipeline never waits for a full response before it starts talking.
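The sentence-streaming idea can be sketched as a generator that buffers the brain's token stream and releases each sentence the moment its boundary appears. This is a minimal sketch, not Jarvis's actual code: the name `stream_sentences` and the boundary rule (`.`, `!`, or `?` followed by whitespace) are assumptions made for illustration, and a real splitter would need to handle abbreviations such as "e.g.".

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; emit it now
        # instead of waiting for the rest of the response.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Hel", "lo there. ", "How are", " you? I am", " fine"]
    for sentence in stream_sentences(llm_tokens):
        # In a real loop this is where the sentence would be handed to TTS.
        print(sentence)
```

Because the generator is lazy, the consumer (TTS in this pipeline) receives the first sentence after reading only the tokens that make it up, which is what lets playback start while generation continues.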
Pluggable by design
wake, stt, llm, and tts are each an interface with swappable backends; no single backend is load-bearing in the core. A cloud brain and a fully-local one are both first-class and selectable. Capabilities are data, not code: skills are tool servers spoken to over the Model Context Protocol, and adding one is a config entry, not a core change.
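One way to picture "an interface with swappable backends" selected from config is a small registry keyed by stage and backend name. The sketch below is illustrative only: the `TTS` protocol, the `register` and `build` helpers, the `SilentTTS` stand-in, and the config shape are all hypothetical names, not Jarvis's real API.

```python
from typing import Callable, Dict, Protocol, Tuple


class TTS(Protocol):
    """Interface every text-to-speech backend is assumed to satisfy."""

    def synthesize(self, text: str) -> bytes: ...


# (stage, backend name) -> factory. A hypothetical registry for illustration.
_BACKENDS: Dict[Tuple[str, str], Callable[..., object]] = {}


def register(stage: str, name: str):
    """Class decorator that makes a backend selectable from config."""
    def decorator(factory):
        _BACKENDS[(stage, name)] = factory
        return factory
    return decorator


@register("tts", "silent")
class SilentTTS:
    """Stand-in backend that returns 16-bit PCM silence."""

    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate

    def synthesize(self, text: str) -> bytes:
        samples = (self.sample_rate // 20) * len(text)  # 50 ms per character
        return b"\x00\x00" * samples


def build(stage: str, config: dict):
    """Instantiate the backend named in config; other keys are its options."""
    options = dict(config)
    name = options.pop("backend")
    try:
        factory = _BACKENDS[(stage, name)]
    except KeyError:
        raise ValueError(f"unknown {stage} backend: {name!r}") from None
    return factory(**options)


# Swapping the backend is a change to this data, not to the core loop.
config = {"tts": {"backend": "silent", "sample_rate": 8000}}
tts: TTS = build("tts", config["tts"])
```

The core only ever sees the `TTS` interface, so adding a backend means registering one more class and naming it in config.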
One gate for anything that matters
Every outward or destructive action passes through a single action-tier security gate rather than scattered per-skill checks. Arbitrary shell and SSH are blocked by default, and secrets never reach the brain. Security lives in one place by design, so it’s auditable.
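A single gate of this kind can be sketched as one function that every tool call must go through, with each tool mapped to a tier. The tier names, the `Gate` class, the `ActionDenied` exception, and the tool names below are assumptions for illustration; the source does not specify how Jarvis defines its tiers.

```python
from enum import IntEnum
from typing import Any, Callable, Dict


class Tier(IntEnum):
    READ = 0         # no side effects
    ACT = 1          # outward but reversible action
    DESTRUCTIVE = 2  # irreversible action
    BLOCKED = 3      # never allowed, e.g. arbitrary shell or SSH


class ActionDenied(Exception):
    """Raised when the gate refuses a tool call."""


def _deny(tool: str, args: Dict[str, Any]) -> bool:
    return False  # default policy: nothing above the auto tier is approved


class Gate:
    """Single choke point: every tool call is dispatched through call()."""

    def __init__(self, tiers: Dict[str, Tier],
                 max_auto_tier: Tier = Tier.READ,
                 confirm: Callable[[str, Dict[str, Any]], bool] = _deny):
        self.tiers = tiers
        self.max_auto_tier = max_auto_tier
        self.confirm = confirm

    def call(self, tool: str, args: Dict[str, Any],
             handler: Callable[..., Any]) -> Any:
        # Tools missing from the table are treated as blocked by default.
        tier = self.tiers.get(tool, Tier.BLOCKED)
        if tier is Tier.BLOCKED:
            raise ActionDenied(f"{tool} is blocked")
        if tier > self.max_auto_tier and not self.confirm(tool, args):
            raise ActionDenied(f"{tool} needs confirmation (tier {tier.name})")
        return handler(**args)


TIERS = {
    "lights.read": Tier.READ,
    "lights.set": Tier.ACT,
    "shell.run": Tier.BLOCKED,
}
gate = Gate(TIERS)
```

Keeping the tier table and the decision logic in one object is what makes the policy reviewable in a single place; individual skills never decide for themselves whether they may run.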
Train here, serve there
The voice itself is fine-tuned in a separate pipeline and handed over as a versioned voice bundle — never a code dependency on training internals. That seam keeps the model factory and the serving loop independent.
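The seam between training and serving can be illustrated with a loader that trusts only a manifest, never the training code. The bundle layout here is entirely hypothetical (a `manifest.json` with `schema`, `name`, `version`, and a `model` entry carrying a file name and SHA-256 digest); the source does not describe the real bundle format.

```python
import hashlib
import json
from dataclasses import dataclass
from pathlib import Path

SUPPORTED_SCHEMA = 1  # assumed manifest schema version


@dataclass(frozen=True)
class VoiceBundle:
    name: str
    version: str
    model_path: Path


def load_voice_bundle(bundle_dir: Path) -> VoiceBundle:
    """Validate a bundle directory and return a handle the TTS stage can use."""
    manifest_path = bundle_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    # Refuse bundles written for a schema this loader does not understand.
    if manifest["schema"] != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported bundle schema: {manifest['schema']}")

    # Check that the weights are exactly the file the training side exported.
    model_path = bundle_dir / manifest["model"]["file"]
    digest = hashlib.sha256(model_path.read_bytes()).hexdigest()
    if digest != manifest["model"]["sha256"]:
        raise ValueError(f"checksum mismatch for {model_path.name}")

    return VoiceBundle(manifest["name"], manifest["version"], model_path)
```

With a contract like this, the serving loop depends only on the manifest fields, so the training pipeline can change internally without touching the assistant.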
Status
In active, rapid development, and built AI-assisted — I architect and direct; the model writes much of the implementation under that direction. It’s a working local voice loop and a study in how far a small, security-conscious, local-first design can go.