Project
Walter — A Pre-LLM Voice Assistant
A pre-LLM voice-assistant spike assembling the speech-and-vision stack of its day — Google STT, neural punctuation restoration, nltk parsing, pyttsx3 speech, and an OpenCV face pipeline — built before any of it was a single API call.
Before large language models made conversational AI a commodity, building an assistant that could listen, understand, and respond meant assembling it from parts. Walter was that exercise — a spike to see how far the speech-and-vision stack of the day would carry me.
The problem
I wanted a hands-free assistant that could take a spoken command, work out what was asked, answer aloud, and recognize who it was talking to — one API call today, a systems-integration project then. The question was never whether the pieces existed; it was how much glue stood between them and something that felt like a conversation.
Constraints
- No foundation models. Nothing to offload understanding to — every stage was its own library with its own failure modes.
- It had to run on ordinary hardware with a microphone and a webcam.
Approach
Walter wired together the stack a pre-LLM assistant needed:
- Speech-to-text via speech_recognition, calling Google’s Web Speech API.
- Punctuation restoration: the part I’m proudest of. Cloud STT returned a naked, unpunctuated string, useless to any parser, so the transcript first ran through a bidirectional-RNN model (the punctuator library) to put sentence boundaries back. In 2020, getting from raw speech to a parseable sentence already took a neural net of its own.
- NLP with nltk: tokenizing and part-of-speech tagging the cleaned text, with Stanford NER on hand for pulling names out of an utterance.
- Text-to-speech via pyttsx3, plus synthesized cues for events like “I’m listening.”
- Facial recognition as a separate OpenCV pipeline: Haar-cascade detection, a labeled training set, an LBPH recognizer.
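The speech half of that list can be sketched as a chain of small stage functions. This is a minimal illustration, not Walter's actual code: the function names, the model_path parameter, and the fallback phrase are assumptions made for the example. The library calls themselves (Recognizer.recognize_google, Punctuator.punctuate, nltk.pos_tag, pyttsx3's say and runAndWait) are the standard entry points of those packages, which are imported lazily so each stage can be swapped or stubbed on its own.

```python
from typing import Callable, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, part-of-speech tag) pairs


def transcribe_from_mic() -> Optional[str]:
    """Capture one utterance and send it to Google's Web Speech API."""
    import speech_recognition as sr  # third-party: SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError):
        # Unintelligible audio, or the API could not be reached.
        return None


def make_punctuator(model_path: str) -> Callable[[str], str]:
    """Load a trained punctuator model once; model_path is an assumption."""
    from punctuator import Punctuator  # third-party: punctuator

    return Punctuator(model_path).punctuate


def tag_words(sentence: str) -> Tagged:
    """Tokenize and POS-tag a punctuated sentence."""
    import nltk  # the tokenizer and tagger data must be downloaded first

    return nltk.pos_tag(nltk.word_tokenize(sentence))


def speak(text: str) -> None:
    """Say a phrase aloud through the local TTS engine."""
    import pyttsx3

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


def run_once(
    transcribe: Callable[[], Optional[str]],
    punctuate: Callable[[str], str],
    tag: Callable[[str], Tagged],
    say: Callable[[str], None],
) -> Optional[Tagged]:
    """One pass of the wired path: cue, listen, transcribe, punctuate, tag."""
    say("I'm listening.")
    raw = transcribe()
    if not raw:
        say("Sorry, I didn't catch that.")  # hypothetical fallback phrase
        return None
    return tag(punctuate(raw))
```

With the real stages plugged in, a single pass would be run_once(transcribe_from_mic, make_punctuator(model_path), tag_words, speak), where model_path points at whichever trained punctuator model is available. Passing the stages in as arguments keeps each library's failure modes isolated and lets the chain be exercised without a microphone.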
Outcome
Walter was a spike, not a shipped product — and that gap is the lesson. The wired path ran end to end as far as it went: greet, listen, capture, transcribe through Google, restore punctuation. The other pieces — face recognition, POS and NER — worked on their own, but I never closed the loop from a parsed sentence to a dispatched action. No command table, no skill handlers; the hard mile from “I understand the words” to “I know what you want, and I’ll go do it” is the one I didn’t finish.
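For readers unfamiliar with the terms, the sketch below shows the rough shape a command table with skill handlers could take. It is purely hypothetical: Walter never had this layer, and the handler names, the COMMANDS mapping, and the keyword-matching rule are all invented for illustration. Matching on a single keyword is also far cruder than real intent recognition, which is the hard part the paragraph above describes.

```python
from datetime import datetime
from typing import Callable, Dict, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, part-of-speech tag) pairs
Handler = Callable[[Tagged], str]


def tell_time(_tagged: Tagged) -> str:
    """Hypothetical skill: report the current time."""
    return datetime.now().strftime("It is %H:%M.")


def greet(tagged: Tagged) -> str:
    """Hypothetical skill: greet by name if a proper noun was tagged."""
    names = [word for word, tag in tagged if tag == "NNP"]
    return f"Hello, {names[0]}." if names else "Hello."


# Hypothetical command table: trigger keyword -> skill handler.
COMMANDS: Dict[str, Handler] = {
    "time": tell_time,
    "hello": greet,
}


def dispatch(tagged: Tagged, commands: Dict[str, Handler] = COMMANDS) -> Optional[str]:
    """Run the handler for the first trigger word found, or return None."""
    for word, _tag in tagged:
        handler = commands.get(word.lower())
        if handler is not None:
            return handler(tagged)
    return None
```

Given tagged output such as [("hello", "UH"), ("Walter", "NNP")], dispatch returns a spoken reply string, and an utterance with no trigger word falls through to None so the caller can decide how to fail gracefully.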
Building Walter pre-LLM is why the current wave reads to me as a change in degree, not kind. The hard problems — intent, context, graceful failure, latency — were always the real ones. A single model now swallows the brittle middle I was wiring by hand. The pieces didn’t get easier; they got absorbed.