
VibeVoice ASR LoRA Fine-tuning

This directory contains scripts for LoRA (Low-Rank Adaptation) fine-tuning of the VibeVoice ASR model.

Requirements

# Install vibevoice first
pip install -e .

pip install peft
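
If the install succeeded, the packages should import cleanly. A quick check (the vibevoice import name is assumed from the editable install above):

# Quick import check; versions printed for debugging
import torch, peft
import vibevoice  # assumes `pip install -e .` was run from the repo root
print(torch.__version__, peft.__version__)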

Toy Dataset

Note: The toy_dataset/ included in this directory contains synthetic audio generated by VibeVoice TTS for demonstration purposes only. It is NOT a full fine-tuning dataset.

When using your own data, you should:

  • Prepare real audio recordings with accurate transcriptions
  • Adjust hyperparameters (learning rate, epochs, LoRA rank) based on your dataset size and domain
  • Consider the audio quality and speaker diversity in your data

Data Format

Training data should be organized as pairs of audio files and JSON labels in the same directory:

toy_dataset/
├── 0.mp3
├── 0.json
├── 1.mp3
├── 1.json
└── ...
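
As a sanity check on this layout, here is a minimal sketch (standard library only; the directory name matches the toy dataset above) that pairs each audio file with its label and flags any missing labels:

from pathlib import Path

data_dir = Path("./toy_dataset")
for audio in sorted(data_dir.glob("*.mp3")):
    label = audio.with_suffix(".json")  # 0.mp3 pairs with 0.json
    if not label.exists():
        print(f"missing label for {audio.name}")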

JSON Label Format

Each JSON file should have the following structure:

{
  "audio_duration": 351.73,
  "audio_path": "0.mp3",
  "segments": [
    {
      "speaker": 0,
      "text": "Hey everyone, welcome back...",
      "start": 0.0,
      "end": 38.68
    },
    {
      "speaker": 1,
      "text": "Thanks for having me...",
      "start": 38.75,
      "end": 77.88
    }
  ],
  "customized_context": ["Tea Brew", "Aiden Host", "The property is near Meter Street."]  // optional, domain-specific terms or context sentences
}
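
A minimal sketch that loads one label and checks the fields shown above; the ordering and bounds checks are assumptions about well-formed data, not requirements enforced by the training script:

import json

with open("toy_dataset/0.json") as f:
    label = json.load(f)

for seg in label["segments"]:
    # each segment carries a speaker id, transcript text, and start/end times
    assert 0.0 <= seg["start"] <= seg["end"] <= label["audio_duration"]
    assert isinstance(seg["text"], str) and seg["text"]
print(len(label["segments"]), "segments;",
      "context:", label.get("customized_context", []))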

Training

Basic

# 1 GPU
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none

# Specific GPUs (e.g., GPU 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none

Full Options

The script uses HuggingFace's TrainingArguments, so all standard options are available:

torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 100 \
    --gradient_checkpointing \
    --bf16 \
    --report_to none

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| --lora_r | 16 | LoRA rank (lower = fewer trainable params, higher = more expressive) |
| --lora_alpha | 32 | LoRA scaling factor (typically 2× the rank) |
| --lora_dropout | 0.05 | Dropout applied to LoRA layers |
| --per_device_train_batch_size | 8 | Batch size per device |
| --gradient_accumulation_steps | 1 | Effective batch size = batch_size × grad_accum × number of GPUs |
| --learning_rate | 5e-5 | Learning rate (1e-4 to 2e-4 is typical for LoRA) |
| --gradient_checkpointing | False | Enable to reduce memory usage |
| --use_customized_context | True | Include customized_context from the JSON labels as additional context |
| --max_audio_length | None | Skip audio clips longer than this many seconds |
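
As a rough sizing guide (standard LoRA arithmetic, not specific to this script): adapting a weight matrix of shape d_out × d_in with rank r adds r × (d_in + d_out) trainable parameters, and the learned update is scaled by alpha / r. For example, r = 16 on a hypothetical 4096 × 4096 projection adds 16 × (4096 + 4096) = 131,072 parameters per adapted matrix.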

Inference with Fine-tuned Model

python inference_lora.py \
    --base_model microsoft/VibeVoice-ASR \
    --lora_path ./output \
    --audio_file ./toy_dataset/0.mp3 \
    --context_info "Tea Brew, Aiden Host"

Merging LoRA Weights (Optional)

To merge LoRA weights into the base model for faster inference:

from peft import PeftModel
# VibeVoiceASRForConditionalGeneration comes from the vibevoice package installed
# above; adjust the import path to match your checkout.

# Load the base model, then attach the LoRA adapters saved by training
model = VibeVoiceASRForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR", ...)
model = PeftModel.from_pretrained(model, "./output")

# Fold the adapters into the base weights and save a standalone checkpoint
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
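
Once merged and saved, ./merged_model behaves like a plain checkpoint: it can be loaded with the model class's usual from_pretrained call, with no peft dependency at inference time.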