Project

Walter — A Pre-LLM Voice Assistant

A pre-LLM voice-assistant spike assembling the speech-and-vision stack of its day — Google STT, neural punctuation restoration, nltk parsing, pyttsx3 speech, and an OpenCV face pipeline — built before any of it was a single API call.

  • Python
  • speech_recognition
  • pyttsx3
  • nltk
  • OpenCV
  • Punctuator (BRNN)

Before large language models made conversational AI a commodity, building an assistant that could listen, understand, and respond meant assembling it from parts. Walter was that exercise — a spike to see how far the speech-and-vision stack of the day would carry me.

The problem

I wanted a hands-free assistant that could take a spoken command, work out what was asked, answer aloud, and recognize who it was talking to — one API call today, a systems-integration project then. The question was never whether the pieces existed; it was how much glue stood between them and something that felt like a conversation.

Constraints

  • No foundation models. Nothing to offload understanding to — every stage was its own library with its own failure modes.
  • Ordinary hardware. It had to run on an everyday machine with nothing more than a microphone and a webcam.

Approach

Walter wired together the stack a pre-LLM assistant needed:

  • Speech-to-text via speech_recognition, calling Google’s Web Speech API.
  • Punctuation restoration — the part I’m proudest of. Cloud STT returned a naked, unpunctuated string, useless to any parser, so the transcript first ran through a bidirectional-RNN model (the punctuator library) to put sentence boundaries back. In 2020, getting from raw speech to a parseable sentence already took a neural net of its own.
  • NLP with nltk — tokenizing and part-of-speech tagging the cleaned text, with Stanford NER on hand for pulling names out of an utterance.
  • Text-to-speech via pyttsx3, plus synthesized cues for events like “I’m listening.”
  • Facial recognition — a separate OpenCV pipeline: Haar-cascade detection, a labeled training set, an LBPH recognizer.
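The speech half of that list can be sketched as a single conversational turn. This is a minimal reconstruction under stated assumptions, not Walter's actual source: the function names (transcribe_from_microphone, make_punctuator, tag_parts_of_speech, speak, run_turn) and the apology text are invented for illustration, the punctuator model path is a placeholder for whichever pretrained model file is on disk, and nltk is assumed to have its tokenizer and tagger data downloaded. Third-party imports sit inside the functions so the orchestration can be exercised without a microphone.

```python
from typing import Callable, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def transcribe_from_microphone() -> str:
    """Capture one utterance and send it to Google's Web Speech API."""
    import speech_recognition as sr  # lazy import: needs a working microphone

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        # Returns a lowercase, unpunctuated transcript.
        return recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError):
        # Unintelligible audio or an unreachable API: report "nothing heard".
        return ""


def make_punctuator(model_path: str) -> Callable[[str], str]:
    """Load the BRNN punctuation model once and return its punctuate method."""
    from punctuator import Punctuator  # lazy import: loading the model is slow

    return Punctuator(model_path).punctuate  # model_path is a placeholder


def tag_parts_of_speech(text: str) -> TaggedSentence:
    """Tokenize and POS-tag the punctuated text (assumes nltk data is installed)."""
    import nltk

    return nltk.pos_tag(nltk.word_tokenize(text))


def speak(text: str) -> None:
    """Say the text aloud through the local TTS engine."""
    import pyttsx3

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


def run_turn(
    transcribe: Callable[[], str],
    punctuate: Callable[[str], str],
    tag: Callable[[str], TaggedSentence],
    say: Callable[[str], None],
) -> TaggedSentence:
    """One turn: cue the user, listen, restore punctuation, then tag."""
    say("I'm listening.")
    raw = transcribe()
    if not raw:
        say("Sorry, I didn't catch that.")  # hypothetical failure cue
        return []
    return tag(punctuate(raw))
```

Passing the stages in as callables is a choice made for this sketch: it keeps each library's failure modes behind its own function and lets the turn logic run against stand-ins. With real hardware, a call would look like run_turn(transcribe_from_microphone, make_punctuator("model.pcl"), tag_parts_of_speech, speak), where "model.pcl" stands in for an actual model file.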

Outcome

Walter was a spike, not a shipped product, and that gap is the lesson. The path I did wire up ran from start to finish: greet, listen, capture, transcribe through Google, restore punctuation. The other pieces (face recognition, POS tagging, and NER) each worked in isolation, but I never closed the loop from a parsed sentence to a dispatched action. There was no command table and there were no skill handlers; the hard mile from “I understand the words” to “I know what you want, and I’ll go do it” is the one I didn’t finish.
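For a sense of what that missing step involves, here is a hypothetical sketch of a command table. Walter never had one, so everything in it is an assumption made for illustration: the handler names, the trigger verbs, the fallback reply, and the rule of dispatching on the first recognized verb. The input is assumed to be a list of (word, tag) pairs in the Penn Treebank style that nltk's tagger produces.

```python
from typing import Callable, Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]
Handler = Callable[[TaggedSentence], str]

FALLBACK = "I don't know how to do that yet."  # hypothetical reply


def nouns(tagged: TaggedSentence) -> List[str]:
    """Collect the words tagged as nouns (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged if tag.startswith("NN")]


def play_media(tagged: TaggedSentence) -> str:
    """Hypothetical skill: pretend to play whatever nouns were mentioned."""
    return "Playing " + " ".join(nouns(tagged)) + "."


def take_note(tagged: TaggedSentence) -> str:
    """Hypothetical skill: pretend to store the nouns as a note."""
    return "Noted: " + " ".join(nouns(tagged)) + "."


# The command table: a trigger verb mapped to the handler that acts on it.
COMMAND_TABLE: Dict[str, Handler] = {
    "play": play_media,
    "remember": take_note,
}


def dispatch(tagged: TaggedSentence) -> str:
    """Run the handler for the first verb found in the table, else fall back."""
    for word, tag in tagged:
        handler = COMMAND_TABLE.get(word.lower())
        if tag.startswith("VB") and handler is not None:
            return handler(tagged)
    return FALLBACK


reply = dispatch([("Play", "VB"), ("some", "DT"), ("jazz", "NN"), (".", ".")])
print(reply)  # Playing jazz.
```

Even this toy version shows where the brittleness lives. It depends on the tagger labeling the command word as a verb, which is not guaranteed for a sentence-initial imperative, and it has no notion of context, synonyms, or partial matches.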

Building Walter pre-LLM is why the current wave reads to me as a change in degree, not kind. The hard problems — intent, context, graceful failure, latency — were always the real ones. A single model now swallows the brittle middle I was wiring by hand. The pieces didn’t get easier; they got absorbed.