EchoLocate Technical Features

This document describes EchoLocate from an implementation perspective: what each feature does, how it works under the hood, and why it matters for real-time accessibility.

1. Runtime Architecture

EchoLocate is a browser-only application deployed as static files.

Core runtime files:

index.html — semantic UI structure, controls, accessibility landmarks
style.css — adaptive layout, dark/light theming, responsive behavior, confidence visuals
app.js — speech recognition, audio analysis, speaker lane logic, persistence, export
sw.js — local fragment rendering API for HTMX (/api/add-card, /api/add-chat-msg)

Design principle:

No backend is required for standard operation.
Transcript and analysis remain in-browser.

2. Caption Pipeline

Speech recognition path:

Browser mic capture via getUserMedia
SpeechRecognition / webkitSpeechRecognition receives audio stream
Interim and final transcript events are produced
Final transcript is combined with speaker profile metadata
Card payload is posted to local route intercepted by Service Worker
Returned HTML fragment is inserted into lane/chat containers

Reliability behavior:

A watchdog timer restarts recognition when no results are received for a configured interval.
Warm restart logic handles unexpected onend while the app is still running.

3. Audio Feature Extraction

Audio analysis path uses Web Audio API + Meyda.

Per-frame extracted features include:

MFCC (13 coefficients)
Spectral flatness
Spectral slope
Spectral centroid
Spectral rolloff
Zero crossing rate (ZCR)
RMS energy

Feature vectors are collected during utterance windows and aggregated to represent voice texture rather than a single pitch scalar.

4. Voice Fingerprint + Lane Matching

4.1 Vector fingerprint

For each utterance, EchoLocate constructs a voice fingerprint vector from timbral and spectral features.

4.2 Similarity scoring

Each active lane profile is compared with the incoming fingerprint using cosine similarity:

$$ \text{similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}||\mathbf{B}|} $$

4.3 Assignment behavior

If similarity clears threshold, assign utterance to the best matching lane
If not, create a new guest lane (up to max speaker limit)
Lane profiles update incrementally with each new utterance

Why this matters:

Timbral vectors are more robust than pitch-only matching when a speaker changes intonation.

5. Anti-Flicker Stability

To reduce lane hopping:

Hysteresis lock: keep recent lane preference for a minimum duration unless a meaningfully better candidate appears
Temporal smoothing: evaluate recent match history (median/majority over last N decisions)

Result:

Better sentence continuity and fewer rapid lane switches.

6. Language Handling

Language features include:

Selectable recognition language list, including None (Auto) mode
Optional text-based language detection fallback using franc-min
Visual feedback when detected text language and selected recognition language diverge

Purpose:

Make multilingual conversation behavior visible and debuggable for users.

7. Accessibility-Centered UI Features

7.1 Lane and chat views

Lanes view: parallel speaker columns for at-a-glance differentiation
Chat view: single stream useful on mobile and constrained screens

7.2 Active lane indicator

Current lane receives an energy ring / active state styling to show where focus is landing.

7.3 Confidence visibility

Confidence meter (0-100%) is attached to transcript cards/messages
Low-confidence text is visually marked

7.4 Human-in-the-loop correction

Merge controls allow users to merge two speaker lanes when automatic grouping splits one person into multiple lanes.

8. Service Worker Local Rendering API

sw.js behaves as a local fragment server:

/api/add-card (POST): returns lane card fragment
/api/add-chat-msg (POST): returns chat message fragment
/api/clear (POST): local clear acknowledgment

Security behavior:

Inputs are escaped/sanitized before HTML output
Attribute-safe escaping is applied for user-provided content

Operational advantages:

No remote templating required
HTMX interactions stay local
Offline resilience with cache fallback for same-origin assets

9. Persistence Model

Storage:

Session cards are stored in localStorage (echolocate_v1)
Startup restore rebuilds lanes and chat view from stored cards
Clear operation wipes stored conversation state

Tradeoff:

Fast and private, but scoped to browser/device profile.

10. Export Model (VTT)

Exported transcript format:

WebVTT with speaker metadata tags, e.g. <v Speaker 1>...</v>
Time windows are normalized relative to first utterance

Benefit:

Better interoperability with subtitle tools and downstream review workflows.

11. Privacy and Offline Operation

Privacy posture:

Audio remains local in browser
Transcript content is not sent to external cloud APIs by default
Processing and rendering happen on device

Offline posture:

Key dependencies are vendored in vendor/
Local server (server.py) supports localhost operation
Service Worker provides local route handling and cache fallback

12. Dependency and Platform Constraints

Required browser capabilities:

Web Speech API (best support: Chromium-based desktop browsers)
Web Audio API
Service Worker support

Known constraint:

Browsers without Web Speech API support cannot provide live transcription in this architecture.

13. Why This Stack Is Useful for Accessibility

EchoLocate is optimized for practical meeting use where reliability and transparency matter:

Voice texture matching improves speaker grouping stability
Watchdog recovery reduces silent transcript dropouts
Confidence and mismatch indicators expose uncertainty instead of hiding it
Merge controls let users correct AI mistakes quickly
Fully local execution supports privacy-sensitive environments

For implementation details, see:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EchoLocate Technical Features

1. Runtime Architecture

2. Caption Pipeline

3. Audio Feature Extraction

4. Voice Fingerprint + Lane Matching

4.1 Vector fingerprint

4.2 Similarity scoring

4.3 Assignment behavior

5. Anti-Flicker Stability

6. Language Handling

7. Accessibility-Centered UI Features

7.1 Lane and chat views

7.2 Active lane indicator

7.3 Confidence visibility

7.4 Human-in-the-loop correction

8. Service Worker Local Rendering API

9. Persistence Model

10. Export Model (VTT)

11. Privacy and Offline Operation

12. Dependency and Platform Constraints

13. Why This Stack Is Useful for Accessibility

FilesExpand file tree

FEATURES.md

Latest commit

History

FEATURES.md

File metadata and controls

EchoLocate Technical Features

1. Runtime Architecture

2. Caption Pipeline

3. Audio Feature Extraction

4. Voice Fingerprint + Lane Matching

4.1 Vector fingerprint

4.2 Similarity scoring

4.3 Assignment behavior

5. Anti-Flicker Stability

6. Language Handling

7. Accessibility-Centered UI Features

7.1 Lane and chat views

7.2 Active lane indicator

7.3 Confidence visibility

7.4 Human-in-the-loop correction

8. Service Worker Local Rendering API

9. Persistence Model

10. Export Model (VTT)

11. Privacy and Offline Operation

12. Dependency and Platform Constraints

13. Why This Stack Is Useful for Accessibility