This directory contains scripts for LoRA (Low-Rank Adaptation) fine-tuning of the VibeVoice ASR model.

```bash
# Install vibevoice first
pip install -e .
pip install peft
```

Note: The `toy_dataset/` included in this directory contains synthetic audio generated by VibeVoice TTS for demonstration purposes only. It is NOT a full fine-tuning dataset. When using your own data, you should:
- Prepare real audio recordings with accurate transcriptions
- Adjust hyperparameters (learning rate, epochs, LoRA rank) based on your dataset size and domain
- Consider the audio quality and speaker diversity in your data
Training data should be organized as pairs of audio files and JSON labels in the same directory:

```
toy_dataset/
├── 0.mp3
├── 0.json
├── 1.mp3
├── 1.json
└── ...
```
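Before training, a directory laid out this way can be sanity-checked with a short script. `check_dataset` is a hypothetical helper sketched here for illustration, not part of the repo; the required key names follow the JSON label schema described below.

```python
import json
from pathlib import Path


def check_dataset(data_dir):
    """Verify every .mp3 has a matching .json label with the expected keys."""
    problems = []
    for audio in sorted(Path(data_dir).glob("*.mp3")):
        label = audio.with_suffix(".json")
        if not label.exists():
            problems.append(f"missing label for {audio.name}")
            continue
        meta = json.loads(label.read_text())
        # Top-level keys required by the label schema
        for key in ("audio_duration", "audio_path", "segments"):
            if key not in meta:
                problems.append(f"{label.name}: missing key '{key}'")
        # Each segment needs speaker, text, and timestamps
        for seg in meta.get("segments", []):
            if not {"speaker", "text", "start", "end"} <= seg.keys():
                problems.append(f"{label.name}: incomplete segment {seg}")
    return problems
```

Running it over `toy_dataset/` should return an empty list; any string it returns points at a file to fix before launching training.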
Each JSON file should have the following structure:

```json
{
  "audio_duration": 351.73,
  "audio_path": "0.mp3",
  "segments": [
    {
      "speaker": 0,
      "text": "Hey everyone, welcome back...",
      "start": 0.0,
      "end": 38.68
    },
    {
      "speaker": 1,
      "text": "Thanks for having me...",
      "start": 38.75,
      "end": 77.88
    }
  ],
  "customized_context": ["Tea Brew", "Aiden Host", "The property is near Meter Street."]
}
```

The `customized_context` field is optional; it holds domain-specific terms or context sentences.

```bash
# 1 GPU
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```
```bash
# Specific GPUs (e.g., GPUs 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```

The script uses HuggingFace's TrainingArguments, so all standard options are available:
```bash
torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 100 \
    --gradient_checkpointing \
    --bf16 \
    --report_to none
```

| Parameter | Default | Description |
|---|---|---|
| `--lora_r` | 16 | LoRA rank (lower = fewer params, higher = more expressive) |
| `--lora_alpha` | 32 | LoRA scaling factor (typically 2x rank) |
| `--lora_dropout` | 0.05 | Dropout for LoRA layers |
| `--per_device_train_batch_size` | 8 | Batch size per device |
| `--gradient_accumulation_steps` | 1 | Effective batch size = batch_size × grad_accum |
| `--learning_rate` | 5e-5 | Learning rate (1e-4 to 2e-4 typical for LoRA) |
| `--gradient_checkpointing` | False | Enable to reduce memory usage |
| `--use_customized_context` | True | Include customized_context from JSON as additional context |
| `--max_audio_length` | None | Skip audio longer than this (seconds) |
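The batch-related flags interact: the optimizer sees `per_device_train_batch_size × gradient_accumulation_steps × num_gpus` samples per update. A minimal sketch of the arithmetic (illustrative only, not part of the training script):

```python
def effective_batch_size(per_device, grad_accum, num_gpus):
    """Samples the optimizer averages over per update step."""
    return per_device * grad_accum * num_gpus


def updates_per_epoch(num_samples, per_device, grad_accum, num_gpus):
    """Optimizer steps in one pass over the data (ceiling division)."""
    eff = effective_batch_size(per_device, grad_accum, num_gpus)
    return -(-num_samples // eff)


# The 4-GPU command above: batch 1 per device, 4 accumulation steps, 4 GPUs
print(effective_batch_size(1, 4, 4))  # 16
```

So raising `--gradient_accumulation_steps` is a memory-free way to reach a larger effective batch when per-device memory caps `--per_device_train_batch_size`.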
```bash
python inference_lora.py \
    --base_model microsoft/VibeVoice-ASR \
    --lora_path ./output \
    --audio_file ./toy_dataset/0.mp3 \
    --context_info "Tea Brew, Aiden Host"
```

To merge LoRA weights into the base model for faster inference:
```python
from peft import PeftModel
# VibeVoiceASRForConditionalGeneration comes from the vibevoice package

# Load base model + LoRA adapter
model = VibeVoiceASRForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR", ...)
model = PeftModel.from_pretrained(model, "./output")

# Merge the adapter into the base weights and save
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
```
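Numerically, `merge_and_unload()` folds each low-rank update back into its frozen weight matrix: W' = W + (alpha / r) · B · A, so inference no longer pays for the extra adapter matmuls. A pure-Python sketch on a tiny matrix (illustrative only; PEFT applies this per adapted layer):

```python
def matmul(X, Y):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, r, alpha):
    """Return W + (alpha / r) * B @ A, the merged weight."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# rank-1 adapter on a 2x2 weight, alpha = 2 -> scale = 2.0
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]           # r x d_in
B = [[0.5], [0.25]]        # d_out x r
merged = merge_lora(W, A, B, r=1, alpha=2)
# merged == [[2.0, 1.0], [0.5, 1.5]]
```

After merging, the saved model in `./merged_model` behaves like an ordinary fine-tuned checkpoint and no longer needs `peft` at load time.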