Project

Jarvis — A Local-First Voice Assistant

The LLM-era successor to Walter: a local-first voice assistant that combines wake-word detection, voice-activity detection, streaming speech-to-text, a pluggable LLM brain that calls tools over MCP, and a fine-tuned voice, all wrapped around a single security gate. Built AI-assisted and in active development.

  • Python
  • openWakeWord
  • Silero VAD
  • faster-whisper
  • Ollama / Claude
  • MCP
  • XTTS

Walter was the pre-LLM version of this question: how far would the speech stack of its day carry a homemade assistant? Jarvis is the answer once foundation models arrived — and a deliberate bet on keeping the whole thing local.

The problem

I wanted an assistant that could listen, reason, and act — control a PC, a home-automation setup, a server — without handing a live microphone feed and my home to a vendor’s cloud. The reasoning could come from a frontier model, but the pipeline around it, and the option to run entirely offline, should be mine.

Approach — a streaming pipeline around a reasoning brain

A voice assistant is really a streaming pipeline wrapped around a brain that can call tools:

mic → wake → VAD → STT → BRAIN ⇄ tools (MCP) → TTS → speaker
  • Wake / endpoint: a registry of openWakeWord models runs continuously, listening for the wake word; Silero VAD detects the end of speech and closes the utterance.
  • STT: faster-whisper, streaming.
  • Brain: a pluggable LLM — a cloud model or a fully-local one — that reasons and calls tools.
  • TTS: a fine-tuned voice (see the sibling fine-tuning pipeline), with a small post-processing chain.

The lever that makes it feel instant is sentence streaming: as soon as the brain emits a sentence, it goes straight to speech, and audio starts playing while the next sentence is still being generated. The pipeline never waits for a full response before it starts talking.
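A minimal sketch of sentence streaming, assuming the brain exposes its reply as an iterator of text chunks. The function name stream_sentences and the boundary rule (a ".", "!", or "?" followed by whitespace) are illustrative assumptions, not the project's actual code; a real splitter would also need to handle abbreviations and numbers.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in the chunk stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last split point is a finished sentence;
        # the final piece is an unfinished tail that stays buffered.
        parts = _SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the brain stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Because this is a generator, a TTS stage that iterates over it can start synthesizing the first sentence while later chunks are still arriving from the model.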

Pluggable by design

The wake, stt, llm, and tts stages are each an interface with swappable backends, so no single backend is load-bearing in the core. A cloud brain and a fully local one are both first-class and selectable. Capabilities are data, not code: skills are tool servers that the brain talks to over the Model Context Protocol (MCP), and adding one is a config entry, not a core change.
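One way to picture this is a small interface plus a registry keyed by a config value. The sketch below is an assumption-laden illustration: LLMBackend, the two stub brains, build_llm, and the config layout (including the "home_skill" module name) are all hypothetical, and the stubs only echo text rather than calling Ollama or Claude.

```python
from typing import Callable, Dict, List, Protocol


class LLMBackend(Protocol):
    """Hypothetical interface every brain backend would satisfy."""

    def reply(self, prompt: str) -> str: ...


class LocalBrainStub:
    """Stand-in for a fully local backend; echoes instead of running a model."""

    def reply(self, prompt: str) -> str:
        return f"[local] {prompt}"


class CloudBrainStub:
    """Stand-in for a cloud backend; echoes instead of calling an API."""

    def reply(self, prompt: str) -> str:
        return f"[cloud] {prompt}"


# Registry: the core looks backends up by name and never imports one directly.
LLM_BACKENDS: Dict[str, Callable[[], LLMBackend]] = {
    "local": LocalBrainStub,
    "cloud": CloudBrainStub,
}


def build_llm(config: dict) -> LLMBackend:
    """Instantiate whichever brain the config selects."""
    name = config["llm"]["backend"]
    if name not in LLM_BACKENDS:
        raise ValueError(f"unknown llm backend: {name!r}")
    return LLM_BACKENDS[name]()


def skill_names(config: dict) -> List[str]:
    """Skills are plain config entries, each describing a tool server to launch."""
    return [skill["name"] for skill in config.get("skills", [])]


# Assumed config shape; adding a skill means appending one more entry here.
CONFIG = {
    "llm": {"backend": "local"},
    "skills": [
        {"name": "home", "command": ["python", "-m", "home_skill"]},
    ],
}
```

Switching brains is then a one-word config change ("local" to "cloud"), and the same pattern would apply to the wake, stt, and tts stages.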

One gate for anything that matters

Every outward or destructive action passes through a single action-tier security gate rather than scattered per-skill checks. Arbitrary shell and SSH are blocked by default, and secrets never reach the brain. Security lives in one place by design, which makes it auditable.
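A single gate of this kind could look roughly like the sketch below. The tier names, the Action type, the confirmation flag, and the blocked-tool set are assumptions made for illustration; the real gate's tiers and policy are not described here.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet


class Tier(IntEnum):
    """Assumed tiers, ordered from least to most risky."""

    READ = 0         # lookups with no side effects
    OUTWARD = 1      # changes something outside the assistant
    DESTRUCTIVE = 2  # hard or impossible to undo


@dataclass(frozen=True)
class Action:
    tool: str
    tier: Tier


class ActionDenied(Exception):
    """Raised when the gate refuses an action."""


# Blocked by default, mirroring the "arbitrary shell and SSH" rule.
BLOCKED_TOOLS: FrozenSet[str] = frozenset({"shell", "ssh"})


def gate(
    action: Action,
    max_tier: Tier = Tier.READ,
    confirmed: bool = False,
    unblocked: FrozenSet[str] = frozenset(),
) -> Action:
    """The one checkpoint every tool call would pass before it is executed."""
    if action.tool in BLOCKED_TOOLS and action.tool not in unblocked:
        raise ActionDenied(f"{action.tool} is blocked by default")
    if action.tier > max_tier and not confirmed:
        raise ActionDenied(f"{action.tool} needs confirmation for tier {action.tier.name}")
    return action
```

Routing every tool call through one function like this means the policy can be read and tested in one place, instead of being inferred from checks spread across skills.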

Train here, serve there

The voice itself is fine-tuned in a separate pipeline and handed over as a versioned voice bundle; the serving side never takes a code dependency on training internals. That seam keeps the model factory and the serving loop independent.
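To make the seam concrete, here is a sketch of how a serving loop might load such a bundle using only files on disk. The bundle layout is entirely hypothetical: a directory containing a manifest.json with "schema", "name", "version", and "model_file" keys is an assumption for illustration, not the project's actual format.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Union

# Assumed manifest schema version this loader understands.
SUPPORTED_SCHEMA = 1


@dataclass(frozen=True)
class VoiceBundle:
    name: str
    version: str
    model_path: Path


def load_voice_bundle(bundle_dir: Union[str, Path]) -> VoiceBundle:
    """Read a bundle's manifest and locate its model file, importing no training code."""
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))

    # Refuse bundles written in a layout this loader does not know.
    if manifest.get("schema") != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported bundle schema: {manifest.get('schema')!r}")

    model_path = root / manifest["model_file"]
    if not model_path.is_file():
        raise FileNotFoundError(f"bundle is missing its model file: {model_path}")

    return VoiceBundle(manifest["name"], manifest["version"], model_path)
```

The only contract between training and serving in this sketch is the manifest, so either side can be rewritten as long as the bundle format holds.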

Status

In active, rapid development, and built AI-assisted — I architect and direct; the model writes much of the implementation under that direction. It’s a working local voice loop and a study in how far a small, security-conscious, local-first design can go.