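# Integrating Qwen3-TTS Voice Cloning into ViGent2

## Requirements Overview
1. The frontend supports uploading reference audio or recording it in the browser (wav, mp3, m4a, etc.).
2. Keep the existing EdgeTTS voices and add a Qwen3-TTS voice cloning interface.
3. Present both TTS methods in one unified interface (tab switching).
4. Voice cloning uses the same voice-over script input as EdgeTTS.
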
## Architecture Design

### GPU Allocation
| GPU | Service | Model |
|-----|------|------|
| GPU0 | Qwen3-TTS | 0.6B-Base (voice cloning) |
| GPU1 | LatentSync | 1.6 (lip sync) |

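One way to keep the two services on their assigned cards is a single lookup that every service consults. The sketch below is illustrative only: `GPU_ASSIGNMENT` and `device_for` are hypothetical names, and the `cuda:N` string assumes a PyTorch-style device identifier.

```python
# Hypothetical mapping that mirrors the GPU allocation table.
GPU_ASSIGNMENT = {"qwen3_tts": 0, "latentsync": 1}


def device_for(service: str) -> str:
    """Return the device string for a service, e.g. 'cuda:0'."""
    try:
        return f"cuda:{GPU_ASSIGNMENT[service]}"
    except KeyError:
        raise ValueError(f"no GPU assigned to service {service!r}") from None
```

With this in place, `device_for("qwen3_tts")` yields `cuda:0` and an unknown service name fails loudly instead of silently landing on a default GPU.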
### Storage
- New Supabase bucket: `ref_audios`
- Path format: `{user_id}/{timestamp}_{filename}.wav`

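A minimal sketch of building that object key, assuming the timestamp is Unix seconds and that the filename is sanitized before upload (neither is specified above). `build_ref_audio_path` is a hypothetical helper, not part of the existing storage module.

```python
import re
import time
from pathlib import PurePosixPath
from typing import Optional


def build_ref_audio_path(user_id: str, filename: str, timestamp: Optional[int] = None) -> str:
    """Build `{user_id}/{timestamp}_{filename}.wav` for the ref_audios bucket."""
    ts = int(time.time()) if timestamp is None else timestamp
    # Drop directories and the original extension: uploads are stored as wav.
    stem = PurePosixPath(filename.replace("\\", "/")).stem
    # Assumption: keep only [A-Za-z0-9_-] so the object key stays URL-safe.
    safe_stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_") or "audio"
    return f"{user_id}/{ts}_{safe_stem}.wav"
```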
---

## Implementation Steps

### 1. Backend: Add the Voice Cloning Service
**File**: `backend/app/services/voice_clone_service.py`

```python
import asyncio


class VoiceCloneService:
    def __init__(self):
        self.gpu_id = 0
        self.model_path = "models/Qwen3-TTS/checkpoints/0.6B-Base"
        self._model = None  # model handle; None until loaded
        self._lock = asyncio.Lock()  # guards concurrent generation calls

    async def generate_audio(self, text, ref_audio_path, ref_text, output_path, language="Chinese"):
        # Use Qwen3TTSModel.generate_voice_clone()
        raise NotImplementedError
```

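The `_model` and `_lock` attributes suggest lazy loading plus serialized GPU access. The sketch below shows that pattern in isolation, assuming inference is a blocking call that should run off the event loop (it needs Python 3.9+ for `asyncio.to_thread`). `SerializedModelRunner` and the injected `load_model` callable are hypothetical; the real Qwen3-TTS loading and generation calls are not shown here.

```python
import asyncio
from typing import Any, Callable, Optional


class SerializedModelRunner:
    """Lazily loads a model and lets only one generation call run at a time."""

    def __init__(self, load_model: Callable[[], Any]):
        self._load_model = load_model  # hypothetical loader, injected by the caller
        self._model: Optional[Any] = None
        self._lock = asyncio.Lock()

    async def run(self, fn: Callable[..., Any], *args: Any) -> Any:
        async with self._lock:
            if self._model is None:
                # Load in a worker thread so the event loop stays responsive.
                self._model = await asyncio.to_thread(self._load_model)
            # Blocking inference also runs in a worker thread, still under the lock.
            return await asyncio.to_thread(fn, self._model, *args)
```

Holding the lock across both the load and the inference call means concurrent requests queue up rather than competing for GPU memory, at the cost of one request being processed at a time.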
### 2. Backend: Add the Reference Audio API
**File**: `backend/app/api/ref_audios.py`

| Endpoint | Method | Purpose |
|------|------|------|
| `/api/ref-audios` | POST | Upload reference audio + ref_text |
| `/api/ref-audios` | GET | List the user's reference audio |
| `/api/ref-audios/{id}` | DELETE | Delete a reference audio |

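The three endpoints map onto add, list, and delete operations scoped to the current user. The in-memory sketch below only illustrates that contract; `RefAudio`, `RefAudioRegistry`, and the rule that a user can delete only their own audio are assumptions, and the real API would persist records rather than keep them in a dict.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RefAudio:
    id: str
    user_id: str
    path: str
    ref_text: str


@dataclass
class RefAudioRegistry:
    """In-memory stand-in for whatever persistence the real API uses."""
    _items: Dict[str, RefAudio] = field(default_factory=dict)

    def add(self, user_id: str, path: str, ref_text: str) -> RefAudio:  # POST
        item = RefAudio(uuid.uuid4().hex, user_id, path, ref_text)
        self._items[item.id] = item
        return item

    def list_for(self, user_id: str) -> List[RefAudio]:  # GET
        return [a for a in self._items.values() if a.user_id == user_id]

    def delete(self, user_id: str, audio_id: str) -> bool:  # DELETE
        item = self._items.get(audio_id)
        if item is None or item.user_id != user_id:
            return False  # unknown id, or owned by another user
        del self._items[audio_id]
        return True
```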
Uploads are automatically converted to wav (16 kHz mono), and the ref_text is stored as metadata.

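One possible way to do the conversion is to shell out to ffmpeg, which is an assumption here since the plan does not name a tool. `ffmpeg_to_wav_cmd` and `convert_to_wav` are hypothetical helper names; the flags are standard ffmpeg options.

```python
import subprocess
from typing import List


def ffmpeg_to_wav_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that converts `src` to 16 kHz mono audio at `dst`."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file: wav, mp3, m4a, webm, ...
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        "-vn",           # ignore any video stream
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    # check=True raises CalledProcessError if ffmpeg exits with a non-zero status.
    subprocess.run(ffmpeg_to_wav_cmd(src, dst), check=True, capture_output=True)
```

Because ffmpeg infers the output format from the `.wav` extension of `dst`, the same command would handle uploaded wav, mp3, and m4a files as well as the `recording.webm` produced by in-browser recording.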
### 3. Backend: Update the Video Generation API
**File**: `backend/app/api/videos.py`

Extend GenerateRequest:
```python
from typing import Optional

from pydantic import BaseModel


class GenerateRequest(BaseModel):
    text: str
    voice: str = "zh-CN-YunxiNeural"
    material_path: str
    # New fields
    tts_mode: str = "edgetts"  # "edgetts" | "voiceclone"
    ref_audio_id: Optional[str] = None
    ref_text: Optional[str] = None
```

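Since the new fields are optional at the schema level, a voice-clone request can arrive without them. A minimal check, assuming `ref_audio_id` and `ref_text` are both mandatory in `voiceclone` mode (the UI marks the reference text as required); `validate_tts_fields` is a hypothetical helper:

```python
from typing import Optional


def validate_tts_fields(tts_mode: str, ref_audio_id: Optional[str], ref_text: Optional[str]) -> None:
    """Raise ValueError if the request fields do not match the chosen TTS mode."""
    if tts_mode not in ("edgetts", "voiceclone"):
        raise ValueError(f"unsupported tts_mode: {tts_mode!r}")
    if tts_mode == "voiceclone":
        if not ref_audio_id:
            raise ValueError("ref_audio_id is required when tts_mode is 'voiceclone'")
        if not (ref_text and ref_text.strip()):
            raise ValueError("ref_text is required when tts_mode is 'voiceclone'")
```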
Modify `_process_video_generation()`:
```python
if req.tts_mode == "voiceclone":
    await voice_clone_service.generate_audio(...)
else:
    await tts_service.generate_audio(...)
```

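The same branch can also be written as a lookup table, which avoids growing the `if`/`else` chain if more TTS modes are added later. This is an alternative sketch, not the plan's code: `synthesize`, `TTSBackend`, and the backend call signature are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Each backend is an async callable: (text, output_path, **options) -> None.
TTSBackend = Callable[..., Awaitable[None]]


async def synthesize(backends: Dict[str, TTSBackend], tts_mode: str,
                     text: str, output_path: str, **options) -> None:
    """Route one synthesis call to the backend registered for `tts_mode`."""
    try:
        backend = backends[tts_mode]
    except KeyError:
        raise ValueError(f"unsupported tts_mode: {tts_mode!r}") from None
    await backend(text, output_path, **options)
```

An unknown mode raises `ValueError` here, whereas the `if`/`else` form falls through to EdgeTTS for any value other than `"voiceclone"`.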
### 4. Backend: Register the Router
**File**: `backend/app/main.py`

```python
from app.api import ref_audios

app.include_router(ref_audios.router, prefix="/api/ref-audios", tags=["ref-audios"])
```

### 5. Frontend: Rework the Voice Selection Area
**File**: `frontend/src/app/page.tsx`

**New state**:
```typescript
const [ttsMode, setTtsMode] = useState<'edgetts' | 'voiceclone'>('edgetts');
const [refAudios, setRefAudios] = useState<RefAudio[]>([]);
const [selectedRefAudio, setSelectedRefAudio] = useState<RefAudio | null>(null);
const [refText, setRefText] = useState('');

// In-browser recording
const [isRecording, setIsRecording] = useState(false);
const [recordedBlob, setRecordedBlob] = useState<Blob | null>(null);
const [recordingTime, setRecordingTime] = useState(0);
const mediaRecorderRef = useRef<MediaRecorder | null>(null);
```

**UI structure**:
```
┌─────────────────────────────────────────────┐
│ 🎙️ Choose a voice-over method               │
├─────────────────────────────────────────────┤
│ [EdgeTTS voices] [Voice cloning]  ← Tab     │
├─────────────────────────────────────────────┤
│ Tab 1: existing voices, 2x3 grid            │
│                                             │
│ Tab 2: voice cloning                        │
│ ┌─────────────────────────────────────────┐ │
│ │ 📁 My reference audio                   │ │
│ │ [ref1] [ref2] [+ Upload]                │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ 🎤 Or record online                     │ │
│ │ [Start recording] [Stop] Duration: 0:05 │ │
│ │ (Preview/Use buttons after recording)   │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ 📝 Reference audio text (required)      │ │
│ │ [textarea]                              │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
```

**In-browser recording logic**:
```typescript
const startRecording = async () => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
  const chunks: BlobPart[] = [];

  mediaRecorder.ondataavailable = (e) => chunks.push(e.data);
  mediaRecorder.onstop = () => {
    const blob = new Blob(chunks, { type: 'audio/webm' });
    setRecordedBlob(blob);
    // Release the microphone once recording has stopped
    stream.getTracks().forEach(track => track.stop());
  };

  mediaRecorder.start();
  setIsRecording(true);
  mediaRecorderRef.current = mediaRecorder;
};

const stopRecording = () => {
  mediaRecorderRef.current?.stop();
  setIsRecording(false);
};

const useRecording = async () => {
  if (!recordedBlob) return; // nothing recorded yet
  // Upload the recorded Blob to the backend
  const formData = new FormData();
  formData.append('file', recordedBlob, 'recording.webm');
  formData.append('ref_text', refText);
  const { data } = await api.post('/api/ref-audios', formData);
  // After a successful upload, refresh the list and select the new entry
  fetchRefAudios();
  setSelectedRefAudio(data);
};
```

### 6. Frontend: Update the Generation Request
```typescript
const handleGenerate = async () => {
  // Voice cloning needs a selected reference audio
  if (ttsMode === 'voiceclone' && !selectedRefAudio) return;
  const payload = {
    material_path: materialObj.path,
    text: text,
    tts_mode: ttsMode,
    ...(ttsMode === 'edgetts'
      ? { voice }
      : { ref_audio_id: selectedRefAudio?.id, ref_text: refText })
  };
  await api.post('/api/videos/generate', payload);
};
```

---

## File List

### New
| File | Description |
|------|------|
| `backend/app/services/voice_clone_service.py` | Voice cloning service |
| `backend/app/api/ref_audios.py` | Reference audio management API |

### Modified
| File | Changes |
|------|----------|
| `backend/app/api/videos.py` | Extend GenerateRequest; update the TTS call logic |
| `backend/app/main.py` | Register the ref_audios router |
| `backend/app/services/storage.py` | Add BUCKET_REF_AUDIOS |
| `frontend/src/app/page.tsx` | Tab switching UI, reference audio selection, refText input |

---

## Verification

1. **Backend tests**:
   ```bash
   # Start the backend
   cd backend && uvicorn app.main:app --port 8006

   # Test reference audio upload
   curl -X POST http://localhost:8006/api/ref-audios \
     -F "file=@test.wav" -F "ref_text=测试文字"

   # Test voice cloning generation
   curl -X POST http://localhost:8006/api/videos/generate \
     -H "Content-Type: application/json" \
     -d '{"text":"测试文案","tts_mode":"voiceclone","ref_audio_id":"xxx","ref_text":"参考文字","material_path":"..."}'
   ```

2. **Frontend tests**:
   - Open the home page and confirm that tab switching works
   - Upload reference audio and confirm that it appears in the list
   - Select voice cloning mode, fill in the reference text, and click generate
   - Confirm that the generated video uses the cloned voice

3. **End-to-end test**:
   - Upload reference audio → select voice cloning → enter the voice-over script → generate the video → play it back to verify the voice

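The backend upload test needs a `test.wav`. If no recording is at hand, a short tone can be generated with the standard library; `write_test_wav` is a hypothetical helper. A sine tone only exercises the upload and conversion path, so checking cloned voice quality still needs a real speech sample with matching `ref_text`.

```python
import math
import struct
import wave


def write_test_wav(path: str, seconds: float = 1.0, freq_hz: float = 440.0,
                   sample_rate: int = 16000) -> None:
    """Write a mono 16-bit PCM sine tone, matching the 16 kHz mono upload format."""
    n_frames = int(seconds * sample_rate)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
```

Calling `write_test_wav("test.wav")` in the directory where `curl` is run produces the file that the upload command refers to.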