This document describes EchoLocate from an implementation perspective: what each feature does, how it works under the hood, and why it matters for real-time accessibility.
EchoLocate is a browser-only application deployed as static files.
Core runtime files:
- `index.html` — semantic UI structure, controls, accessibility landmarks
- `style.css` — adaptive layout, dark/light theming, responsive behavior, confidence visuals
- `app.js` — speech recognition, audio analysis, speaker lane logic, persistence, export
- `sw.js` — local fragment rendering API for HTMX (`/api/add-card`, `/api/add-chat-msg`)
Design principle:
- No backend is required for standard operation.
- Transcript and analysis remain in-browser.
Speech recognition path:
- Browser mic capture via `getUserMedia`
- `SpeechRecognition`/`webkitSpeechRecognition` receives the audio stream
- Interim and final transcript events are produced
- Final transcript is combined with speaker profile metadata
- Card payload is posted to local route intercepted by Service Worker
- Returned HTML fragment is inserted into lane/chat containers
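The last three steps can be sketched as follows. This is an illustrative shape, not the exact code in `app.js`: `buildCardPayload` and `renderCard` are hypothetical helper names, and the card fields are assumptions; the `/api/add-card` route matches the Service Worker API described above.

```javascript
// Combine a final transcript with speaker profile metadata into a card payload.
// Field names here are illustrative assumptions.
function buildCardPayload(transcript, profile, confidence) {
  return {
    text: transcript.trim(),
    speaker: profile.name,   // e.g. "Speaker 1"
    laneId: profile.laneId,
    confidence,              // 0..1 from the recognition result
    timestamp: Date.now(),
  };
}

// Post the payload to the local route intercepted by the Service Worker,
// then insert the returned HTML fragment into the lane container.
async function renderCard(payload, laneEl) {
  const res = await fetch('/api/add-card', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  laneEl.insertAdjacentHTML('beforeend', await res.text());
}
```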
Reliability behavior:
- A watchdog timer restarts recognition when no results are received for a configured interval.
- Warm restart logic handles unexpected `onend` events while the app is still running.
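A minimal sketch of the watchdog idea, with the restart callback, interval, and clock injectable for clarity; the actual constants and wiring in `app.js` may differ.

```javascript
// Watchdog sketch: restart recognition when no results have been received
// for a configured interval. Interval value is an illustrative assumption.
class RecognitionWatchdog {
  constructor(restartFn, intervalMs = 15000, now = Date.now) {
    this.restartFn = restartFn;
    this.intervalMs = intervalMs;
    this.now = now;
    this.lastResultAt = now();
  }
  // Call from the recognizer's onresult handler to mark it as alive.
  onResult() { this.lastResultAt = this.now(); }
  // Call periodically (e.g. via setInterval); returns true if a restart fired.
  check() {
    if (this.now() - this.lastResultAt > this.intervalMs) {
      this.restartFn();
      this.lastResultAt = this.now();
      return true;
    }
    return false;
  }
}
```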
Audio analysis path uses Web Audio API + Meyda.
Per-frame extracted features include:
- MFCC (13 coefficients)
- Spectral flatness
- Spectral slope
- Spectral centroid
- Spectral rolloff
- Zero crossing rate (ZCR)
- RMS energy
Feature vectors are collected during utterance windows and aggregated to represent voice texture rather than a single pitch scalar.
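The aggregation step can be sketched as a simple per-dimension mean over the frames of an utterance window. The frame layout (13 MFCCs plus the scalar features listed above) is as described; the exact aggregation used in `app.js` may differ.

```javascript
// Aggregate per-frame feature vectors collected over an utterance window
// into a single mean vector representing "voice texture".
function aggregateFrames(frames) {
  // frames: array of equal-length numeric feature vectors
  const dim = frames[0].length;
  const mean = new Array(dim).fill(0);
  for (const f of frames) {
    for (let i = 0; i < dim; i++) mean[i] += f[i];
  }
  for (let i = 0; i < dim; i++) mean[i] /= frames.length;
  return mean;
}
```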
For each utterance, EchoLocate constructs a voice fingerprint vector from timbral and spectral features.
Each active lane profile is compared with the incoming fingerprint using cosine similarity:
- If similarity clears threshold, assign utterance to the best matching lane
- If not, create a new guest lane (up to max speaker limit)
- Lane profiles update incrementally with each new utterance
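The matching logic above can be sketched as follows. The similarity threshold, lane cap, and EMA update rate are illustrative values, not the app's tuned constants.

```javascript
// Cosine similarity between two feature vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Assign an utterance fingerprint to the best-matching lane, creating a
// new guest lane when nothing clears the threshold.
function assignLane(fingerprint, lanes, threshold = 0.85, maxLanes = 6) {
  let best = null, bestSim = -1;
  for (const lane of lanes) {
    const sim = cosineSimilarity(fingerprint, lane.profile);
    if (sim > bestSim) { bestSim = sim; best = lane; }
  }
  if (best && bestSim >= threshold) {
    // Incremental profile update: exponential moving average.
    const alpha = 0.2;
    best.profile = best.profile.map((v, i) => (1 - alpha) * v + alpha * fingerprint[i]);
    return best;
  }
  if (lanes.length < maxLanes) {
    const guest = { id: lanes.length, profile: fingerprint.slice() };
    lanes.push(guest);
    return guest;
  }
  return best; // at capacity: fall back to the closest existing lane
}
```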
Why this matters:
- Timbral vectors are more robust than pitch-only matching when a speaker changes intonation.
To reduce lane hopping:
- Hysteresis lock: keep recent lane preference for a minimum duration unless a meaningfully better candidate appears
- Temporal smoothing: evaluate recent match history (median/majority over last N decisions)
Result:
- Better sentence continuity and fewer rapid lane switches.
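The temporal smoothing half of this can be sketched as a majority vote over the last N lane decisions, so a single outlier match does not flip the active lane. The window size is an illustrative assumption.

```javascript
// Majority vote over the recent decision history. `history` is mutated
// in place as a sliding window of lane ids.
function smoothedLane(history, candidate, windowSize = 5) {
  history.push(candidate);
  if (history.length > windowSize) history.shift();
  const counts = new Map();
  for (const id of history) counts.set(id, (counts.get(id) || 0) + 1);
  let winner = candidate, max = 0;
  for (const [id, n] of counts) {
    if (n > max) { max = n; winner = id; }
  }
  return winner;
}
```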
Language features include:
- Selectable recognition language list, including a `None (Auto)` mode
- Optional text-based language detection fallback using `franc-min`
- Visual feedback when the detected text language and the selected recognition language diverge
Purpose:
- Make multilingual conversation behavior visible and debuggable for users.
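The divergence check can be sketched as below. Note that `franc-min` returns ISO 639-3 codes (e.g. `"eng"`, or `"und"` for undetermined) while recognition languages are BCP-47 tags (e.g. `"en-US"`), so some mapping is needed; the lookup table here is a small illustrative subset, not the app's full map.

```javascript
// Illustrative subset mapping ISO 639-3 codes to BCP-47 primary subtags.
const ISO6393_TO_BCP47_PREFIX = { eng: 'en', spa: 'es', fra: 'fr', deu: 'de' };

// Return true when the detected text language disagrees with the selected
// recognition language, which drives the mismatch indicator.
function languagesDiverge(detectedIso3, selectedTag) {
  if (!selectedTag || detectedIso3 === 'und') return false; // auto mode / undetermined
  const prefix = ISO6393_TO_BCP47_PREFIX[detectedIso3];
  if (!prefix) return false; // unknown code: do not flag
  return !selectedTag.toLowerCase().startsWith(prefix);
}
```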
- Lanes view: parallel speaker columns for at-a-glance differentiation
- Chat view: single stream useful on mobile and constrained screens
- Current lane receives an energy ring / active state styling to show where focus is landing.
- Confidence meter (0-100%) is attached to transcript cards/messages
- Low-confidence text is visually marked
- Merge controls allow users to merge two speaker lanes when automatic grouping splits one person into multiple lanes.
sw.js behaves as a local fragment server:
- `/api/add-card` (POST): returns a lane card fragment
- `/api/add-chat-msg` (POST): returns a chat message fragment
- `/api/clear` (POST): local clear acknowledgment
Security behavior:
- Inputs are escaped/sanitized before HTML output
- Attribute-safe escaping is applied for user-provided content
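A minimal sketch of attribute-safe escaping applied before user content is placed in a fragment; the exact helper in `sw.js` may differ.

```javascript
// Escape the five HTML-significant characters so user-provided content is
// safe both in element text and inside quoted attribute values.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```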
Operational advantages:
- No remote templating required
- HTMX interactions stay local
- Offline resilience with cache fallback for same-origin assets
Storage:
- Session cards are stored in `localStorage` (`echolocate_v1`)
- Startup restore rebuilds lanes and chat view from stored cards
- Clear operation wipes stored conversation state
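The save/restore/clear round-trip can be sketched as below, with the storage backend injectable so the same logic runs against `localStorage` in the browser. The key matches the one noted above; the card shape is illustrative.

```javascript
const STORAGE_KEY = 'echolocate_v1';

// Persist the current session's cards as JSON.
function saveCards(cards, storage) {
  storage.setItem(STORAGE_KEY, JSON.stringify(cards));
}

// Restore cards at startup; corrupt or missing state falls back to empty.
function restoreCards(storage) {
  try {
    return JSON.parse(storage.getItem(STORAGE_KEY)) || [];
  } catch {
    return [];
  }
}

// Wipe stored conversation state.
function clearCards(storage) {
  storage.removeItem(STORAGE_KEY);
}
```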
Tradeoff:
- Fast and private, but scoped to browser/device profile.
Exported transcript format:
- WebVTT with speaker metadata tags, e.g. `<v Speaker 1>...</v>`
- Time windows are normalized relative to the first utterance
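The export can be sketched as below: timestamps normalized to the first utterance and formatted in WebVTT's `HH:MM:SS.mmm` style, with speaker names carried in `<v>` voice tags. The card field names are illustrative assumptions.

```javascript
// Format a millisecond offset as a WebVTT timestamp (HH:MM:SS.mmm).
function vttTimestamp(ms) {
  const pad = (n, w) => String(n).padStart(w, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor(ms / 60000) % 60;
  const s = Math.floor(ms / 1000) % 60;
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms % 1000, 3)}`;
}

// Build a WebVTT document from transcript cards, normalizing all cue
// times relative to the first utterance.
function exportVtt(cards) {
  if (cards.length === 0) return 'WEBVTT\n';
  const t0 = cards[0].startMs;
  const cues = cards.map((c) =>
    `${vttTimestamp(c.startMs - t0)} --> ${vttTimestamp(c.endMs - t0)}\n` +
    `<v ${c.speaker}>${c.text}</v>`
  );
  return 'WEBVTT\n\n' + cues.join('\n\n') + '\n';
}
```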
Benefit:
- Better interoperability with subtitle tools and downstream review workflows.
Privacy posture:
- Audio remains local in browser
- Transcript content is not sent to external cloud APIs by default
- Processing and rendering happen on device
Offline posture:
- Key dependencies are vendored in `vendor/`
- A local server (`server.py`) supports localhost operation
- Service Worker provides local route handling and cache fallback
Required browser capabilities:
- Web Speech API (best support: Chromium-based desktop browsers)
- Web Audio API
- Service Worker support
Known constraint:
- Browsers without Web Speech API support cannot provide live transcription in this architecture.
EchoLocate is optimized for practical meeting use where reliability and transparency matter:
- Voice texture matching improves speaker grouping stability
- Watchdog recovery reduces silent transcript dropouts
- Confidence and mismatch indicators expose uncertainty instead of hiding it
- Merge controls let users correct AI mistakes quickly
- Fully local execution supports privacy-sensitive environments
For implementation details, see: