Live, private, in-browser captioning with simulated speaker grouping.
Try it now: mgifford.github.io/EchoLocate
EchoLocate is designed as an accessibility-first captioning tool, especially for deaf and hard-of-hearing users who need live, glanceable transcripts in meetings and conversations.
The app runs fully client-side. Audio stays on-device. There is no backend speech pipeline.
EchoLocate combines two browser pipelines in parallel:
- Speech-to-text pipeline
  - Input: browser microphone stream
  - Engine: Web Speech API (`SpeechRecognition`/`webkitSpeechRecognition`)
  - Output: transcript chunks with confidence values
- Voice differentiation pipeline
  - Input: same microphone stream via the Web Audio API
  - Engine: Meyda feature extraction
  - Output: per-utterance voice fingerprint used to choose a speaker lane
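The speech-to-text side emits results incrementally. A minimal sketch of pulling transcript chunks with confidence values out of a `SpeechRecognition` result event (the `extractChunks` helper and the mock event are illustrative, not EchoLocate's actual code):

```javascript
// Hypothetical helper: flatten a SpeechRecognition `onresult` event into
// plain {text, confidence, isFinal} chunks for rendering.
function extractChunks(event) {
  const chunks = [];
  // event.results is a SpeechRecognitionResultList; resultIndex marks
  // where the new/updated results begin.
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const alt = result[0]; // best alternative
    chunks.push({
      text: alt.transcript,
      confidence: alt.confidence, // 0..1, browser-estimated
      isFinal: result.isFinal,
    });
  }
  return chunks;
}

// Example with a mock event shaped like the browser's event object:
const mockEvent = {
  resultIndex: 0,
  results: [
    Object.assign([{ transcript: "hello world", confidence: 0.92 }],
                  { isFinal: true }),
  ],
};
console.log(extractChunks(mockEvent));
```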
Rendering and persistence stack:
- HTMX posts caption payloads to local routes
- A Service Worker intercepts `/api/add-card` and `/api/add-chat-msg`
- The Service Worker returns HTML fragments (cards/chat messages)
- The frontend inserts fragments without server round-trips
- Session data is stored in `localStorage`
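The fragment-over-the-wire idea can be sketched as a pure renderer that a Service Worker `fetch` handler would call. `renderCardFragment`, the payload shape, and the markup here are illustrative, not the project's actual code:

```javascript
// In the real app, a Service Worker would intercept the POST and respond
// with this markup, roughly:
//
//   self.addEventListener("fetch", (e) => {
//     if (new URL(e.request.url).pathname === "/api/add-card") {
//       e.respondWith(e.request.json().then((p) =>
//         new Response(renderCardFragment(p), {
//           headers: { "Content-Type": "text/html" },
//         })));
//     }
//   });

function escapeHtml(s) {
  return s.replace(/[&<>"]/g, (c) =>
    ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" }[c]));
}

// Hypothetical renderer: caption payload in, HTML card fragment out.
function renderCardFragment({ speaker, text, confidence }) {
  const pct = Math.round(confidence * 100);
  return `<div class="card" data-speaker="${escapeHtml(speaker)}">` +
         `<span class="text">${escapeHtml(text)}</span>` +
         `<span class="confidence">${pct}%</span></div>`;
}

console.log(renderCardFragment({ speaker: "Speaker 1", text: "Hello", confidence: 0.9 }));
```

Keeping the renderer pure makes it easy to unit test outside the Service Worker context.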
EchoLocate uses vector comparison instead of a single scalar pitch comparison.
Per-frame feature vector includes:
- 13 MFCC coefficients
- Spectral flatness
- Spectral slope
For lane assignment, the current feature vector is compared against each existing speaker profile using cosine similarity.
Behavior:
- If best similarity is high enough, append to that lane
- Otherwise, create a new guest lane (up to configured maximum)
- Profiles are updated incrementally over time to adapt to natural voice variation
Why this matters: when a person raises or lowers pitch, timbre features (MFCC texture + slope/flatness) are often more stable than pitch alone.
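The matching step above can be sketched as follows; the 0.85 threshold, the 4-lane cap, the smoothing factor, and the helper names are illustrative assumptions, not the project's actual values:

```javascript
// Cosine similarity between two equal-length feature vectors
// (13 MFCCs + spectral flatness + spectral slope = 15 dims here).
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the best-matching lane, or signal that a new guest lane is needed.
// `profiles` is an array of running-average vectors, one per lane.
function assignLane(vector, profiles, threshold = 0.85, maxLanes = 4) {
  let best = -1, bestSim = -Infinity;
  profiles.forEach((p, i) => {
    const sim = cosineSimilarity(vector, p);
    if (sim > bestSim) { bestSim = sim; best = i; }
  });
  if (best >= 0 && bestSim >= threshold) return { lane: best, sim: bestSim };
  if (profiles.length < maxLanes) return { lane: profiles.length, isNew: true };
  return { lane: best, sim: bestSim }; // lanes full: fall back to closest
}

// Incremental profile update so a lane adapts to natural voice variation.
function updateProfile(profile, vector, alpha = 0.1) {
  return profile.map((v, i) => (1 - alpha) * v + alpha * vector[i]);
}
```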
To reduce lane hopping during continuous speech:
- Hysteresis lock: once a lane is selected, it is temporarily favored for 400ms unless another lane is significantly stronger
- Temporal smoothing: recent match results are buffered and smoothed over the last 3 decisions
This keeps one sentence from bouncing between two lanes.
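These two stabilizers might look like the following; the 400 ms window and 3-decision buffer come from the description above, while the function names, state shape, and similarity margin are illustrative:

```javascript
// Temporal smoothing: majority vote over the last `window` lane decisions.
function smoothDecision(history, candidate, window = 3) {
  history.push(candidate);
  if (history.length > window) history.shift();
  const counts = {};
  let best = candidate, bestCount = 0;
  for (const lane of history) {
    counts[lane] = (counts[lane] || 0) + 1;
    if (counts[lane] > bestCount) { bestCount = counts[lane]; best = lane; }
  }
  return best;
}

// Hysteresis lock: keep the current lane for `lockMs` after selection
// unless a challenger is stronger by at least `margin`.
function applyHysteresis(state, candidate, now, lockMs = 400, margin = 0.05) {
  const locked = state.lane !== null && now - state.lockedAt < lockMs;
  if (locked && candidate.lane !== state.lane &&
      candidate.sim < state.sim + margin) {
    return state.lane; // challenger not significantly stronger: stay put
  }
  state.lane = candidate.lane;
  state.sim = candidate.sim;
  state.lockedAt = now;
  return state.lane;
}
```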
Web Speech can silently stall in real browsers. EchoLocate adds a watchdog to recover automatically.
- If the app is running and no result is received for 10 seconds, recognition is restarted
- If `onend` fires while the app state is still running, recognition warm-restarts automatically
- If the user intentionally stops, the watchdog is cleared and no restart occurs
This is critical for accessibility reliability: silent failure is a communication failure.
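The restart decision can be reduced to a pure check that a periodic timer would run; the state shape, field names, and logic here mirror the three rules above but are otherwise illustrative:

```javascript
// Decide whether the recognizer should be restarted right now.
// `state.running`    — user intends captioning to be active
// `state.ended`      — onend fired since the last start
// `state.lastResult` — timestamp (ms) of the last result event
function shouldRestart(state, now, silenceMs = 10000) {
  if (!state.running) return false;            // user stopped: never restart
  if (state.ended) return true;                // warm-restart after onend
  return now - state.lastResult >= silenceMs;  // silent stall: watchdog kicks in
}
```

A `setInterval` loop (or the `onend` handler itself) would call this and, when it returns true, stop and re-create the recognizer.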
- Per-card confidence meter (0-100%) so users can quickly gauge transcript trust
- Active lane energy ring so users can see which speaker lane is currently focused
- Merge lanes controls to combine mistaken duplicate lanes in long sessions
- Language selector with `None (Auto)` mode and mismatch hints during low-recognition scenarios
- Chat or lane layout toggle for small screens and varied reading preferences
Export uses WebVTT and includes speaker metadata tags:
```
00:00:01.000 --> 00:00:04.000
<v Speaker 1>Hello world</v>
```

This makes the transcript more useful in subtitle-capable tools that understand speaker cues.
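Serializing cues with voice tags is straightforward; this sketch (`toWebVTT`, the cue shape, and the timestamp helper are illustrative, not the project's export code) shows the general idea:

```javascript
// Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm).
function vttTime(ms) {
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor(ms / 60000) % 60).padStart(2, "0");
  const s = String(Math.floor(ms / 1000) % 60).padStart(2, "0");
  const frac = String(ms % 1000).padStart(3, "0");
  return `${h}:${m}:${s}.${frac}`;
}

// Serialize cues to a WebVTT document with <v> speaker tags.
function toWebVTT(cues) {
  const body = cues.map((c) =>
    `${vttTime(c.start)} --> ${vttTime(c.end)}\n` +
    `<v ${c.speaker}>${c.text}</v>`
  ).join("\n\n");
  return `WEBVTT\n\n${body}\n`;
}

console.log(toWebVTT([
  { start: 1000, end: 4000, speaker: "Speaker 1", text: "Hello world" },
]));
```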
- Audio processing happens in-browser
- Transcript data is stored locally in browser storage
- No transcript/audio is sent to external cloud services by default
- Offline operation is supported because vendor assets are committed in-repo
```
git clone https://github.com/mgifford/EchoLocate.git
cd EchoLocate
python3 server.py
```

Then open http://localhost:8080/ in Chrome or Edge.
Optional model/dependency refresh scripts:
```
./download-deps.sh
./download-models.sh
```
See INSTALL.txt for installation and troubleshooting details.
- Chrome desktop: supported
- Edge desktop: supported
- Firefox/Safari: not supported due to Web Speech API limitations
Contributions are welcome, especially feedback from deaf and hard-of-hearing users on real-world conversation quality.
Project repo: github.com/mgifford/EchoLocate
Before committing:
```
node --check app.js && node --check sw.js
```

Check out Airtime2 to highlight who spoke and for how long. Note that this works much better when working directly with a .vtt file from a tool like Zoom.