ViGent2/models/Qwen3-TTS/finetuning/README.md at ec16e08bdb020cebf24cf2904bd980ba6d4c7e1c

Files

Kevin Wong 4a3dd2b225 更新

2026-01-28 17:22:31 +08:00

3.1 KiB

Raw Blame History

Fine Tuning Qwen3-TTS-12Hz-1.7B/0.6B-Base

The Qwen3-TTS-12Hz-1.7B/0.6B-Base model series currently supports single-speaker fine-tuning. Please run pip install qwen-tts first, then run the command below:

git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS/finetuning

Then follow the steps below to complete the entire fine-tuning workflow. Multi-speaker fine-tuning and other advanced fine-tuning features will be supported in future releases.

1) Input JSONL format

Prepare your training file as a JSONL (one JSON object per line). Each line must contain:

audio: path to the target training audio (wav)
text: transcript corresponding to audio
ref_audio: path to the reference speaker audio (wav)

Example:

{"audio":"./data/utt0001.wav","text":"其实我真的有发现，我是一个特别善于观察别人情绪的人。","ref_audio":"./data/ref.wav"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav"}

ref_audio recommendation:

Strongly recommended: use the same ref_audio for all samples.
Keeping ref_audio identical across the dataset usually improves speaker consistency and stability during generation.

2) Prepare data (extract `audio_codes`)

Convert train_raw.jsonl into a training JSONL that includes audio_codes:

python prepare_data.py \
  --device cuda:0 \
  --tokenizer_model_path Qwen/Qwen3-TTS-Tokenizer-12Hz \
  --input_jsonl train_raw.jsonl \
  --output_jsonl train_with_codes.jsonl

3) Fine-tune

Run SFT using the prepared JSONL:

python sft_12hz.py \
  --init_model_path Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --output_model_path output \
  --train_jsonl train_with_codes.jsonl \
  --batch_size 2 \
  --lr 2e-5 \
  --num_epochs 3 \
  --speaker_name speaker_test

Checkpoints will be written to:

output/checkpoint-epoch-0
output/checkpoint-epoch-1
output/checkpoint-epoch-2
...

4) Quick inference test

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

device = "cuda:0"
tts = Qwen3TTSModel.from_pretrained(
    "output/checkpoint-epoch-2",
    device_map=device,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = tts.generate_custom_voice(
    text="She said she would be here by noon.",
    speaker="speaker_test",
)
sf.write("output.wav", wavs[0], sr)

One-click shell script example

#!/usr/bin/env bash
set -e

DEVICE="cuda:0"
TOKENIZER_MODEL_PATH="Qwen/Qwen3-TTS-Tokenizer-12Hz"
INIT_MODEL_PATH="Qwen/Qwen3-TTS-12Hz-1.7B-Base"

RAW_JSONL="train_raw.jsonl"
TRAIN_JSONL="train_with_codes.jsonl"
OUTPUT_DIR="output"

BATCH_SIZE=2
LR=2e-5
EPOCHS=3
SPEAKER_NAME="speaker_1"

python prepare_data.py \
  --device ${DEVICE} \
  --tokenizer_model_path ${TOKENIZER_MODEL_PATH} \
  --input_jsonl ${RAW_JSONL} \
  --output_jsonl ${TRAIN_JSONL}

python sft_12hz.py \
  --init_model_path ${INIT_MODEL_PATH} \
  --output_model_path ${OUTPUT_DIR} \
  --train_jsonl ${TRAIN_JSONL} \
  --batch_size ${BATCH_SIZE} \
  --lr ${LR} \
  --num_epochs ${EPOCHS} \
  --speaker_name ${SPEAKER_NAME}

3.1 KiB Raw Blame History