Deploy VibeVoice ASR model as a high-performance API service using vLLM. This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support.
- 🚀 High-Performance Serving: Optimized for high-throughput ASR inference with vLLM's continuous batching
- 📡 OpenAI-Compatible API: Standard `/v1/chat/completions` endpoint with streaming support
- 🎵 Long Audio Support: Process 60+ minutes of audio in a single request
- 🔌 Plugin Architecture: No vLLM source code modification required - just install and run
- ⚡ Data Parallel (DP): Run independent model replicas across multiple GPUs with automatic load balancing behind a single port
Using Official vLLM Docker Image (Recommended)
- Clone the repository

```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
```

- Launch the server (background mode)

```bash
docker run -d --gpus all --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py"
```

The launcher supports two types of GPU parallelism via the `--tp` and `--dp` flags:
| Flag | Name | What it does |
|---|---|---|
| `--tp N` | Tensor Parallel | Splits one model across N GPUs (for models too large for a single GPU) |
| `--dp N` | Data Parallel | Runs N independent replicas, one per GPU, with automatic load balancing behind a single port |
Run N independent replicas on N GPUs with automatic load balancing behind a single port.
When --dp N is specified (N > 1), the launcher automatically starts N independent vLLM
processes behind an nginx reverse proxy (2×N workers) for optimal throughput:
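The proxy layout described above can be sketched in a few lines. Note the backend port numbering (8001..8000+N) and the generated config text are illustrative assumptions, not the launcher's actual output; only the 2×N worker count comes from the text above:

```python
# Sketch: generate an nginx reverse-proxy config for N data-parallel vLLM replicas.
# Assumption (not from the launcher source): replicas listen on local ports
# 8001..8000+N, while nginx serves the single public port 8000 with 2*N workers.

def nginx_config(dp: int, public_port: int = 8000) -> str:
    backends = "\n".join(
        f"        server 127.0.0.1:{public_port + i + 1};" for i in range(dp)
    )
    return f"""\
worker_processes {2 * dp};
http {{
    upstream vllm_replicas {{
{backends}
    }}
    server {{
        listen {public_port};
        location / {{
            proxy_pass http://vllm_replicas;
        }}
    }}
}}
"""

print(nginx_config(4))
```

With `--dp 4` this yields 8 nginx workers balancing across 4 local vLLM backends, all reachable through port 8000.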
```bash
docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 4"
```

Run on all 8 GPUs:
```bash
docker run -d --gpus all --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 8"
```

Split a single model across 2 GPUs (useful if GPU memory is limited):
```bash
docker run -d --gpus '"device=0,1"' --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --tp 2"
```

Combine both: e.g., 2 replicas, each split across 2 GPUs (4 GPUs total):
```bash
docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 2 --tp 2"
```

Note: Total GPUs required = `dp × tp`. Make sure to expose enough GPU devices in the Docker `--gpus` flag.
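The GPU arithmetic from the note above can be verified before launching. The helper below is a hypothetical pre-flight check, not part of `start_server.py`:

```python
# Pre-flight sketch: verify that enough GPU devices are exposed for dp x tp.

def required_gpus(dp: int, tp: int) -> int:
    """Total GPUs needed: dp replicas, each sharded across tp devices."""
    return dp * tp

def check_gpus(dp: int, tp: int, visible: int) -> None:
    need = required_gpus(dp, tp)
    if visible < need:
        raise RuntimeError(
            f"--dp {dp} --tp {tp} needs {need} GPUs, "
            f"but only {visible} are visible; adjust the Docker --gpus flag"
        )

check_gpus(dp=2, tp=2, visible=4)  # the --dp 2 --tp 2 example above: 4 GPUs, OK
```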
- View logs

```bash
docker logs -f vibevoice-vllm
```

Note:
- The `-d` flag runs the container in the background (detached mode)
- Use `docker stop vibevoice-vllm` to stop the service
- The model will be downloaded to the HuggingFace cache (`~/.cache/huggingface`) inside the container
Once the vLLM server is running, test it with the provided scripts:

```bash
# Basic transcription
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav

# With hotwords for better recognition of specific terms
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav --hotwords "Microsoft,VibeVoice"

# With auto-recovery from repetition loops (for long audio)
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav

# Auto-recover with hotwords
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
```

Note:
- The audio/video file must be inside the mounted directory (`/app` in the container). Copy your files to the VibeVoice folder before testing.
- Hotwords help improve recognition of domain-specific terms such as proper nouns, technical terms, and speaker names.
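Beyond the test scripts, you can call the endpoint directly over HTTP. The request shape below (audio as a base64 data URL inside an `audio_url` content part) and the model name are assumptions based on vLLM's general multimodal chat format; consult `vllm_plugin/tests/test_api.py` for the schema the plugin actually expects:

```python
# Sketch of a direct HTTP client for the OpenAI-compatible endpoint.
# The payload shape and the model name "vibevoice" are assumptions; see
# vllm_plugin/tests/test_api.py for the exact schema.
import base64
import json
import urllib.request

def build_transcription_request(audio_path: str, model: str = "vibevoice") -> dict:
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "audio_url",
                "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"},
            }],
        }],
        "stream": False,
    }

def transcribe(audio_path: str, base_url: str = "http://localhost:8000") -> str:
    payload = build_transcription_request(audio_path)
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```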
| Variable | Description | Default |
|---|---|---|
| `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` | Maximum FFmpeg processes for audio decoding | `64` |
| `PYTORCH_ALLOC_CONF` | PyTorch memory allocator config | `expandable_segments:True` |
- GPU Memory: Use `--gpu-memory-utilization 0.9` for maximum throughput if you have a dedicated GPU
- Batch Size: Increase `--max-num-seqs` for higher concurrency (requires more GPU memory)
- FFmpeg Concurrency: Tune `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` based on CPU cores
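One way to pick a starting value for the FFmpeg concurrency, assuming roughly one decoder process per CPU core (a heuristic, not a documented rule):

```python
# Heuristic sketch for choosing VIBEVOICE_FFMPEG_MAX_CONCURRENCY.
# One FFmpeg process per CPU core, capped at the default of 64, is an
# assumption, not a documented tuning rule.
import os

def suggested_ffmpeg_concurrency(cap: int = 64) -> int:
    cores = os.cpu_count() or 1
    return min(cores, cap)

print(suggested_ffmpeg_concurrency())
```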
- "CUDA out of memory"
  - Reduce `--gpu-memory-utilization`
  - Reduce `--max-num-seqs`
  - Use a smaller `--max-model-len`
- "Audio decoding failed"
  - Ensure FFmpeg is installed: `ffmpeg -version`
  - Check that the audio file format is supported
- "Model not found"
  - Ensure the model path contains `config.json` and model weights
  - Generate tokenizer files if missing
- "Plugin not loaded"
  - Verify installation: `pip show vibevoice`
  - Check entry point: `pip show -f vibevoice | grep entry`