Live, private, in-browser captioning with simulated speaker grouping.
Try it now: mgifford.github.io/EchoLocate
EchoLocate is designed as an accessibility-first captioning tool, especially for deaf and hard-of-hearing users who need live, glanceable transcripts in meetings and conversations.
The app runs fully client-side. Audio stays on-device. There is no backend speech pipeline.
EchoLocate combines two browser pipelines in parallel:
- Speech-to-text pipeline
  - Input: browser microphone stream
  - Engine: Web Speech API (`SpeechRecognition`/`webkitSpeechRecognition`)
  - Output: transcript chunks with confidence values
- Voice differentiation pipeline
  - Input: same microphone stream via the Web Audio API
  - Engine: Meyda feature extraction
  - Output: per-utterance voice fingerprint used to choose a speaker lane
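The speech-to-text side emits results incrementally. A minimal sketch of pulling transcript chunks with confidence values out of a `SpeechRecognition` result event (the `extractChunks` helper and the mock event are illustrative, not EchoLocate's actual code):

```javascript
// Hypothetical helper: flatten a SpeechRecognition `onresult` event into
// plain {text, confidence, isFinal} chunks for rendering.
function extractChunks(event) {
  const chunks = [];
  // event.results is a SpeechRecognitionResultList; resultIndex marks
  // where the new/updated results begin.
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const alt = result[0]; // best alternative
    chunks.push({
      text: alt.transcript,
      confidence: alt.confidence, // 0..1, browser-estimated
      isFinal: result.isFinal,
    });
  }
  return chunks;
}

// Example with a mock event shaped like the browser's event object:
const mockEvent = {
  resultIndex: 0,
  results: [
    Object.assign([{ transcript: "hello world", confidence: 0.92 }],
                  { isFinal: true }),
  ],
};
console.log(extractChunks(mockEvent));
```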
Rendering and persistence stack:
- HTMX posts caption payloads to local routes
- A Service Worker intercepts `/api/add-card` and `/api/add-chat-msg`
- The Service Worker returns HTML fragments (cards/chat messages)
- The frontend inserts fragments without server round-trips
- Session data is stored in `localStorage`
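The fragment-over-the-wire idea can be sketched as a pure renderer that a Service Worker `fetch` handler would call. `renderCardFragment`, the payload shape, and the markup here are illustrative, not the project's actual code:

```javascript
// In the real app, a Service Worker would intercept the POST and respond
// with this markup, roughly:
//
//   self.addEventListener("fetch", (e) => {
//     if (new URL(e.request.url).pathname === "/api/add-card") {
//       e.respondWith(e.request.json().then((p) =>
//         new Response(renderCardFragment(p), {
//           headers: { "Content-Type": "text/html" },
//         })));
//     }
//   });

function escapeHtml(s) {
  return s.replace(/[&<>"]/g, (c) =>
    ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" }[c]));
}

// Hypothetical renderer: caption payload in, HTML card fragment out.
function renderCardFragment({ speaker, text, confidence }) {
  const pct = Math.round(confidence * 100);
  return `<div class="card" data-speaker="${escapeHtml(speaker)}">` +
         `<span class="text">${escapeHtml(text)}</span>` +
         `<span class="confidence">${pct}%</span></div>`;
}

console.log(renderCardFragment({ speaker: "Speaker 1", text: "Hello", confidence: 0.9 }));
```

Keeping the renderer pure makes it easy to unit test outside the Service Worker context.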
EchoLocate uses vector comparison instead of a single scalar pitch comparison.
Per-frame feature vector includes:
- 13 MFCC coefficients
- Spectral flatness
- Spectral slope
For lane assignment, the current feature vector is compared against each existing speaker profile using cosine similarity.
Behavior:
- If best similarity is high enough, append to that lane
- Otherwise, create a new guest lane (up to configured maximum)
- Profiles are updated incrementally over time to adapt to natural voice variation
Why this matters: when a person raises or lowers pitch, timbre features (MFCC texture + slope/flatness) are often more stable than pitch alone.
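The matching step above can be sketched as follows; the 0.85 threshold, the 4-lane cap, the smoothing factor, and the helper names are illustrative assumptions, not the project's actual values:

```javascript
// Cosine similarity between two equal-length feature vectors
// (13 MFCCs + spectral flatness + spectral slope = 15 dims here).
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the best-matching lane, or signal that a new guest lane is needed.
// `profiles` is an array of running-average vectors, one per lane.
function assignLane(vector, profiles, threshold = 0.85, maxLanes = 4) {
  let best = -1, bestSim = -Infinity;
  profiles.forEach((p, i) => {
    const sim = cosineSimilarity(vector, p);
    if (sim > bestSim) { bestSim = sim; best = i; }
  });
  if (best >= 0 && bestSim >= threshold) return { lane: best, sim: bestSim };
  if (profiles.length < maxLanes) return { lane: profiles.length, isNew: true };
  return { lane: best, sim: bestSim }; // lanes full: fall back to closest
}

// Incremental profile update so a lane adapts to natural voice variation.
function updateProfile(profile, vector, alpha = 0.1) {
  return profile.map((v, i) => (1 - alpha) * v + alpha * vector[i]);
}
```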
To reduce lane hopping during continuous speech:
- Hysteresis lock: once a lane is selected, it is temporarily favored for 400ms unless another lane is significantly stronger
- Temporal smoothing: recent match results are buffered and smoothed over the last 3 decisions
This keeps one sentence from bouncing between two lanes.
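These two stabilizers might look like the following; the 400 ms window and 3-decision buffer come from the description above, while the function names, state shape, and similarity margin are illustrative:

```javascript
// Temporal smoothing: majority vote over the last `window` lane decisions.
function smoothDecision(history, candidate, window = 3) {
  history.push(candidate);
  if (history.length > window) history.shift();
  const counts = {};
  let best = candidate, bestCount = 0;
  for (const lane of history) {
    counts[lane] = (counts[lane] || 0) + 1;
    if (counts[lane] > bestCount) { bestCount = counts[lane]; best = lane; }
  }
  return best;
}

// Hysteresis lock: keep the current lane for `lockMs` after selection
// unless a challenger is stronger by at least `margin`.
function applyHysteresis(state, candidate, now, lockMs = 400, margin = 0.05) {
  const locked = state.lane !== null && now - state.lockedAt < lockMs;
  if (locked && candidate.lane !== state.lane &&
      candidate.sim < state.sim + margin) {
    return state.lane; // challenger not significantly stronger: stay put
  }
  state.lane = candidate.lane;
  state.sim = candidate.sim;
  state.lockedAt = now;
  return state.lane;
}
```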
Web Speech can silently stall in real browsers. EchoLocate adds a watchdog to recover automatically.
- If the app is running and no result is received for 10 seconds, recognition is restarted
- If `onend` fires while the app state is still running, recognition warm-restarts automatically
- If the user intentionally stops, the watchdog is cleared and no restart occurs
This is critical for accessibility reliability: silent failure is a communication failure.
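The restart decision can be reduced to a pure check that a periodic timer would run; the state shape, field names, and logic here mirror the three rules above but are otherwise illustrative:

```javascript
// Decide whether the recognizer should be restarted right now.
// `state.running`    — user intends captioning to be active
// `state.ended`      — onend fired since the last start
// `state.lastResult` — timestamp (ms) of the last result event
function shouldRestart(state, now, silenceMs = 10000) {
  if (!state.running) return false;            // user stopped: never restart
  if (state.ended) return true;                // warm-restart after onend
  return now - state.lastResult >= silenceMs;  // silent stall: watchdog kicks in
}
```

A `setInterval` loop (or the `onend` handler itself) would call this and, when it returns true, stop and re-create the recognizer.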
- Per-card confidence meter (0-100%) so users can quickly gauge transcript trust
- Active lane energy ring so users can see which speaker lane is currently focused
- Merge lanes controls to combine mistaken duplicate lanes in long sessions
- Language selector with `None (Auto)` mode and mismatch hints during low-recognition scenarios
- Chat or lane layout toggle for small screens and varied reading preferences
Export uses WebVTT and includes speaker metadata tags:
```
00:00:01.000 --> 00:00:04.000
<v Speaker 1>Hello world</v>
```

This makes the transcript more useful in subtitle-capable tools that understand speaker cues.
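Serializing cues with voice tags is straightforward; this sketch (`toWebVTT`, the cue shape, and the timestamp helper are illustrative, not the project's export code) shows the general idea:

```javascript
// Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm).
function vttTime(ms) {
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor(ms / 60000) % 60).padStart(2, "0");
  const s = String(Math.floor(ms / 1000) % 60).padStart(2, "0");
  const frac = String(ms % 1000).padStart(3, "0");
  return `${h}:${m}:${s}.${frac}`;
}

// Serialize cues to a WebVTT document with <v> speaker tags.
function toWebVTT(cues) {
  const body = cues.map((c) =>
    `${vttTime(c.start)} --> ${vttTime(c.end)}\n` +
    `<v ${c.speaker}>${c.text}</v>`
  ).join("\n\n");
  return `WEBVTT\n\n${body}\n`;
}

console.log(toWebVTT([
  { start: 1000, end: 4000, speaker: "Speaker 1", text: "Hello world" },
]));
```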
- Audio processing happens in-browser
- Transcript data is stored locally in browser storage
- No transcript/audio is sent to external cloud services by default
- Offline operation is supported because vendor assets are committed in-repo
```
git clone https://github.com/mgifford/EchoLocate.git
cd EchoLocate
python3 server.py
```

Then open http://localhost:8080/ in Chrome or Edge.
Optional model/dependency refresh scripts:
```
./download-deps.sh
./download-models.sh
```
See INSTALL.txt for installation and troubleshooting details.
- Chrome desktop: supported
- Edge desktop: supported
- Firefox/Safari: not supported due to Web Speech API limitations
Contributions are welcome, especially feedback from deaf and hard-of-hearing users on real-world conversation quality.
Project repo: github.com/mgifford/EchoLocate
Before committing:
```
node --check app.js && node --check sw.js
```

Check out Airtime2 to highlight who spoke and for how long. Note that this works much better when working directly with a .vtt file from a tool like Zoom.