VibeVoice vLLM ASR Deployment


Deploy VibeVoice ASR model as a high-performance API service using vLLM. This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support.

🔥 Key Features

  • 🚀 High-Performance Serving: Optimized for high-throughput ASR inference with vLLM's continuous batching
  • 📡 OpenAI-Compatible API: Standard /v1/chat/completions endpoint with streaming support
  • 🎵 Long Audio Support: Transcribe 60+ minutes of audio in a single request
  • 🔌 Plugin Architecture: No vLLM source code modification required - just install and run
  • ⚡ Data Parallel (DP): Run independent model replicas across multiple GPUs with automatic load balancing behind a single port

🛠️ Installation

Using Official vLLM Docker Image (Recommended)

  1. Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
  2. Launch the server (background mode)
docker run -d --gpus all --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py"

⚡ Multi-GPU Deployment

The launcher supports two types of GPU parallelism via --tp and --dp flags:

| Flag | Name | What it does |
| --- | --- | --- |
| `--tp N` | Tensor Parallel | Splits one model across N GPUs (for models too large for a single GPU) |
| `--dp N` | Data Parallel | Runs N independent replicas, one per GPU, with automatic load balancing behind a single port |

Data Parallel (Recommended for scaling throughput)

When --dp N is specified (N > 1), the launcher automatically starts N independent vLLM processes behind an nginx reverse proxy (2×N workers) for optimal throughput:

docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 4"

Run on all 8 GPUs:

docker run -d --gpus all --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 8"

Tensor Parallel

Split a single model across 2 GPUs (useful if GPU memory is limited):

docker run -d --gpus '"device=0,1"' --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --tp 2"

Hybrid (DP × TP)

Combine both — e.g., 2 replicas, each split across 2 GPUs (4 GPUs total):

docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py --dp 2 --tp 2"

Note: Total GPUs required = dp × tp. Make sure to expose enough GPU devices in the Docker --gpus flag.
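The GPU arithmetic above can be checked before launching. This is a small illustrative helper, not part of the launcher itself:

```python
def required_gpus(dp: int, tp: int) -> int:
    """Total GPUs a dp x tp deployment needs: dp replicas, each sharded over tp GPUs."""
    if dp < 1 or tp < 1:
        raise ValueError("dp and tp must both be >= 1")
    return dp * tp

def validate_layout(dp: int, tp: int, visible_gpus: int) -> None:
    """Fail fast if the Docker --gpus flag exposes fewer devices than dp x tp."""
    need = required_gpus(dp, tp)
    if need > visible_gpus:
        raise RuntimeError(
            f"--dp {dp} --tp {tp} needs {need} GPUs, "
            f"but only {visible_gpus} are visible to the container"
        )
```

For example, the hybrid `--dp 2 --tp 2` launch above needs 4 visible devices, which is why its `--gpus` flag lists `device=0,1,2,3`.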

  3. View logs
docker logs -f vibevoice-vllm

Note:

  • The -d flag runs the container in background (detached mode)
  • Use docker stop vibevoice-vllm to stop the service
  • The model is downloaded to the HuggingFace cache (~/.cache/huggingface) inside the container

🚀 Usage

Test the API

Once the vLLM server is running, test it with the provided script:

# Basic transcription
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav

# With hotwords for better recognition of specific terms
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav --hotwords "Microsoft,VibeVoice"

# With auto-recovery from repetition loops (for long audio)
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav

# Auto-recover with hotwords
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav --hotwords "Microsoft,VibeVoice"

Note:

  • The audio/video file must be inside the mounted directory (/app in the container). Copy your files to the VibeVoice folder before testing.
  • Hotwords help improve recognition of domain-specific terms like proper nouns, technical terms, and speaker names.
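You can also call the endpoint from your own client instead of the test scripts. The sketch below builds an OpenAI-style chat-completions payload with base64-encoded audio; the exact content schema and model name the plugin expects are assumptions here (`vllm_plugin/tests/test_api.py` is the authoritative reference for the real request format):

```python
import base64

def build_transcription_request(audio_path, hotwords=None, stream=True):
    """Sketch of a chat-completions payload for transcription.

    Field names follow the OpenAI audio-input convention; whether the
    VibeVoice plugin uses this exact schema is an assumption.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    text = "Transcribe this audio."
    if hotwords:
        text += f" Hotwords: {hotwords}"  # hypothetical way to pass hotwords
    return {
        "model": "vibevoice",  # placeholder model name
        "stream": stream,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    }
```

POST the resulting dict as JSON to `http://localhost:8000/v1/chat/completions`; with `stream=True` the transcript arrives as server-sent events.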

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` | Maximum concurrent FFmpeg processes for audio decoding | `64` |
| `PYTORCH_ALLOC_CONF` | PyTorch memory allocator config | `expandable_segments:True` |

📊 Performance Tips

  1. GPU Memory: Use --gpu-memory-utilization 0.9 for maximum throughput if the GPU is dedicated to this service
  2. Batch Size: Increase --max-num-seqs for higher concurrency (requires more GPU memory)
  3. FFmpeg Concurrency: Tune VIBEVOICE_FFMPEG_MAX_CONCURRENCY based on CPU cores

🚨 Troubleshooting

Common Issues

  1. "CUDA out of memory"

    • Reduce --gpu-memory-utilization
    • Reduce --max-num-seqs
    • Use smaller --max-model-len
  2. "Audio decoding failed"

    • Ensure FFmpeg is installed: ffmpeg -version
    • Check audio file format is supported
  3. "Model not found"

    • Ensure model path contains config.json and model weights
    • Generate tokenizer files if missing
  4. "Plugin not loaded"

    • Verify installation: pip show vibevoice
    • Check entry point: pip show -f vibevoice | grep entry
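The plugin check can also be done from Python. `vllm.general_plugins` is vLLM's entry-point group for out-of-tree plugins; whether VibeVoice registers under this exact group is an assumption, so `pip show -f vibevoice` remains the authoritative check:

```python
from importlib.metadata import PackageNotFoundError, entry_points, version

def check_plugin():
    """Report the installed vibevoice version and any vLLM plugin entry points."""
    try:
        print("vibevoice", version("vibevoice"))
    except PackageNotFoundError:
        print("vibevoice is not installed")
        return []
    # Entry points vLLM discovers at startup (group name is an assumption).
    eps = entry_points(group="vllm.general_plugins")
    return [ep.name for ep in eps]
```

If the returned list is empty even though the package is installed, reinstall the plugin so its entry point is re-registered.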