This directory contains scripts for LoRA (Low-Rank Adaptation) fine-tuning of the VibeVoice ASR model.

```bash
# Install vibevoice first
pip install -e .
pip install peft
```

Note: The `toy_dataset/` included in this directory contains synthetic audio generated by VibeVoice TTS for demonstration purposes only. It is NOT a full fine-tuning dataset. When using your own data, you should:
- Prepare real audio recordings with accurate transcriptions
- Adjust hyperparameters (learning rate, epochs, LoRA rank) based on your dataset size and domain
- Consider the audio quality and speaker diversity in your data
Training data should be organized as pairs of audio files and JSON labels in the same directory:

```
toy_dataset/
├── 0.mp3
├── 0.json
├── 1.mp3
├── 1.json
└── ...
```
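Before training, a directory laid out this way can be sanity-checked with a short script. `check_dataset` is a hypothetical helper sketched here for illustration, not part of the repo; the required key names follow the JSON label schema described below.

```python
import json
from pathlib import Path


def check_dataset(data_dir):
    """Verify every .mp3 has a matching .json label with the expected keys."""
    problems = []
    for audio in sorted(Path(data_dir).glob("*.mp3")):
        label = audio.with_suffix(".json")
        if not label.exists():
            problems.append(f"missing label for {audio.name}")
            continue
        meta = json.loads(label.read_text())
        # Top-level keys required by the label schema
        for key in ("audio_duration", "audio_path", "segments"):
            if key not in meta:
                problems.append(f"{label.name}: missing key '{key}'")
        # Each segment needs speaker, text, and timestamps
        for seg in meta.get("segments", []):
            if not {"speaker", "text", "start", "end"} <= seg.keys():
                problems.append(f"{label.name}: incomplete segment {seg}")
    return problems
```

Running it over `toy_dataset/` should return an empty list; any string it returns points at a file to fix before launching training.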
Each JSON file should have the following structure:

```json
{
  "audio_duration": 351.73,
  "audio_path": "0.mp3",
  "segments": [
    {
      "speaker": 0,
      "text": "Hey everyone, welcome back...",
      "start": 0.0,
      "end": 38.68
    },
    {
      "speaker": 1,
      "text": "Thanks for having me...",
      "start": 38.75,
      "end": 77.88
    }
  ],
  "customized_context": ["Tea Brew", "Aiden Host", "The property is near Meter Street."]
}
```

The `customized_context` field is optional; it holds domain-specific terms or context sentences.

```bash
# 1 GPU
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```
```bash
# Specific GPUs (e.g., GPUs 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```

The script uses HuggingFace's TrainingArguments, so all standard options are available:
```bash
torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 100 \
    --gradient_checkpointing \
    --bf16 \
    --report_to none
```

| Parameter | Default | Description |
|---|---|---|
| `--lora_r` | 16 | LoRA rank (lower = fewer params, higher = more expressive) |
| `--lora_alpha` | 32 | LoRA scaling factor (typically 2x rank) |
| `--lora_dropout` | 0.05 | Dropout for LoRA layers |
| `--per_device_train_batch_size` | 8 | Batch size per device |
| `--gradient_accumulation_steps` | 1 | Effective batch size = batch_size × grad_accum |
| `--learning_rate` | 5e-5 | Learning rate (1e-4 to 2e-4 typical for LoRA) |
| `--gradient_checkpointing` | False | Enable to reduce memory usage |
| `--use_customized_context` | True | Include customized_context from JSON as additional context |
| `--max_audio_length` | None | Skip audio longer than this (seconds) |
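The batch-related flags interact: the optimizer sees `per_device_train_batch_size × gradient_accumulation_steps × num_gpus` samples per update. A minimal sketch of the arithmetic (illustrative only, not part of the training script):

```python
def effective_batch_size(per_device, grad_accum, num_gpus):
    """Samples the optimizer averages over per update step."""
    return per_device * grad_accum * num_gpus


def updates_per_epoch(num_samples, per_device, grad_accum, num_gpus):
    """Optimizer steps in one pass over the data (ceiling division)."""
    eff = effective_batch_size(per_device, grad_accum, num_gpus)
    return -(-num_samples // eff)


# The 4-GPU command above: batch 1 per device, 4 accumulation steps, 4 GPUs
print(effective_batch_size(1, 4, 4))  # 16
```

So raising `--gradient_accumulation_steps` is a memory-free way to reach a larger effective batch when per-device memory caps `--per_device_train_batch_size`.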
```bash
python inference_lora.py \
    --base_model microsoft/VibeVoice-ASR \
    --lora_path ./output \
    --audio_file ./toy_dataset/0.mp3 \
    --context_info "Tea Brew, Aiden Host"
```

To merge LoRA weights into the base model for faster inference:
```python
from peft import PeftModel
# VibeVoiceASRForConditionalGeneration comes from the vibevoice package

# Load base model + LoRA adapter
model = VibeVoiceASRForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR", ...)
model = PeftModel.from_pretrained(model, "./output")

# Merge the adapter into the base weights and save
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
```
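Numerically, `merge_and_unload()` folds each low-rank update back into its frozen weight matrix: W' = W + (alpha / r) · B · A, so inference no longer pays for the extra adapter matmuls. A pure-Python sketch on a tiny matrix (illustrative only; PEFT applies this per adapted layer):

```python
def matmul(X, Y):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, r, alpha):
    """Return W + (alpha / r) * B @ A, the merged weight."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# rank-1 adapter on a 2x2 weight, alpha = 2 -> scale = 2.0
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]           # r x d_in
B = [[0.5], [0.25]]        # d_out x r
merged = merge_lora(W, A, B, r=1, alpha=2)
# merged == [[2.0, 1.0], [0.5, 1.5]]
```

After merging, the saved model in `./merged_model` behaves like an ordinary fine-tuned checkpoint and no longer needs `peft` at load time.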