# Day 13 - Voice Cloning Integration + Subtitles

**Date**: 2026-01-29

---

## 🎙️ Qwen3-TTS Service Integration

### Background

With the Qwen3-TTS model deployed on Day 12, today's focus is integrating it into the ViGent2 system to provide end-to-end voice cloning.

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Frontend (Next.js)                      │
│  Upload reference audio → Select TTS mode → Request video   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Backend (FastAPI :8006)                   │
│  ref-audios API → voice_clone_service → video_service       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Qwen3-TTS service (FastAPI :8009)              │
│  HTTP /generate → returns cloned audio                      │
└─────────────────────────────────────────────────────────────┘
```

### Qwen3-TTS HTTP Service (`qwen_tts_server.py`)

A standalone FastAPI service running on port 8009:

```python
import io
import os
import tempfile

import soundfile as sf
import torch
from fastapi import FastAPI, Form, HTTPException, UploadFile
from fastapi.responses import Response
from qwen_tts import Qwen3TTSModel

app = FastAPI(title="Qwen3-TTS Voice Clone Service")

# GPU configuration
GPU_ID = os.getenv("QWEN_TTS_GPU_ID", "0")
model = None


@app.on_event("startup")
async def load_model():
    global model
    model = Qwen3TTSModel.from_pretrained(
        "./checkpoints/0.6B-Base",
        device_map=f"cuda:{GPU_ID}",
        dtype=torch.bfloat16,
    )


@app.get("/health")
async def health():
    return {"service": "Qwen3-TTS", "ready": model is not None, "gpu_id": GPU_ID}


@app.post("/generate")
async def generate(
    ref_audio: UploadFile,
    text: str = Form(...),
    ref_text: str = Form(""),
    language: str = Form("Chinese"),
):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded yet")

    # Save the reference audio to a temp file. The path is not built from the
    # client-supplied filename, which could contain "../".
    suffix = os.path.splitext(ref_audio.filename or "")[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
        f.write(await ref_audio.read())
        ref_path = f.name

    try:
        # Generate the cloned audio
        wavs, sr = model.generate_voice_clone(
            text=text,
            language=language,
            ref_audio=ref_path,
            # Placeholder transcript ("A piece of reference audio.")
            ref_text=ref_text or "一段参考音频。",
        )
    finally:
        os.remove(ref_path)

    # Return WAV audio
    buffer = io.BytesIO()
    sf.write(buffer, wavs[0], sr, format="WAV")
    return Response(content=buffer.getvalue(), media_type="audio/wav")
```

### Backend Voice Clone Service (`voice_clone_service.py`)

Calls the Qwen3-TTS service over HTTP:

```python
import aiohttp
from loguru import logger

QWEN_TTS_URL = "http://localhost:8009"


async def generate_cloned_audio(
    ref_audio_path: str,
    text: str,
    output_path: str,
    ref_text: str = "",
) -> str:
    """Call the Qwen3-TTS service to generate cloned audio."""
    # Read the reference audio up front so no file handle stays open
    # while the request is in flight.
    with open(ref_audio_path, "rb") as f:
        ref_bytes = f.read()

    data = aiohttp.FormData()
    data.add_field("ref_audio", ref_bytes, filename="ref.wav", content_type="audio/wav")
    data.add_field("text", text)
    data.add_field("ref_text", ref_text)

    async with aiohttp.ClientSession() as session:
        async with session.post(f"{QWEN_TTS_URL}/generate", data=data) as resp:
            if resp.status != 200:
                detail = await resp.text()
                logger.error(f"Qwen3-TTS error {resp.status}: {detail}")
                raise RuntimeError(f"Qwen3-TTS error: {resp.status}")
            audio_data = await resp.read()

    with open(output_path, "wb") as out:
        out.write(audio_data)
    return output_path
```

---

## 📂 Reference Audio Management API

### New API Endpoints (`ref_audios.py`)

| Endpoint | Method | Purpose |
|------|------|------|
| `/api/ref-audios` | GET | List reference audio files |
| `/api/ref-audios` | POST | Upload a reference audio file |
| `/api/ref-audios/{id}` | DELETE | Delete a reference audio file |

### Supabase Bucket Configuration

A dedicated storage bucket for reference audio:

```sql
-- Create the ref-audios bucket
INSERT INTO storage.buckets (id, name, public)
VALUES ('ref-audios', 'ref-audios', true)
ON CONFLICT (id) DO NOTHING;

-- RLS policies (note: these grant the anon role insert, select, and delete on this bucket)
CREATE POLICY "Allow public uploads" ON storage.objects
  FOR INSERT TO anon WITH CHECK (bucket_id = 'ref-audios');

CREATE POLICY "Allow public read" ON storage.objects
  FOR SELECT TO anon USING (bucket_id = 'ref-audios');

CREATE POLICY "Allow public delete" ON storage.objects
  FOR DELETE TO anon USING (bucket_id = 'ref-audios');
```

---

## 🎨 Frontend Voice Cloning UI

### TTS Mode Selection

The video generation page gains a voice cloning option (the second button's label, 声音克隆, means "voice cloning"):

```tsx
{/* TTS mode selector */}
<div className="flex gap-2 mb-4">
  <button
    onClick={() => setTtsMode("edge")}
    className={`px-4 py-2 rounded-lg ${ttsMode === "edge" ? "bg-purple-600" : "bg-white/10"}`}
  >
    🔊 EdgeTTS
  </button>
  <button
    onClick={() => setTtsMode("clone")}
    className={`px-4 py-2 rounded-lg ${ttsMode === "clone" ? "bg-purple-600" : "bg-white/10"}`}
  >
    🎙️ 声音克隆
  </button>
</div>
```

### Reference Audio Management

Adds reference audio upload and a list view:

| Feature | Implementation |
|------|------|
| Audio upload | Drag-and-drop WAV/MP3, uploaded directly to Supabase |
| List view | Shows filename, duration, and upload time |
| Quick select | Click an item to use it as the reference audio |
| Delete | Remove reference audio that is no longer needed |
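
The list view shows a duration per file, but the log does not say where that value is computed. As a minimal sketch, assuming it is derived on the Python side and only for WAV input, the standard-library `wave` module is enough. `wav_duration_seconds` and `make_silent_wav` are hypothetical helpers, not names from the project, and MP3 would need a separate decoder.

```python
import io
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of a WAV file given its raw bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only to exercise the helper."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()
```

For a 3-second clip like the one used in the end-to-end test, `wav_duration_seconds(make_silent_wav(3.0))` gives the frame count divided by the sample rate.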
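
---

## ✅ End-to-End Test

### Test Flow

1. **Upload reference audio**: a 3-second reference clip → Supabase `ref-audios` bucket
2. **Select voice cloning mode**: switch the TTS mode to "声音克隆" (voice cloning)
3. **Enter the script**: a test voice-over script
4. **Generate the video**:
   - TTS stage calls Qwen3-TTS (17.7s)
   - LipSync stage calls LatentSync (122.8s)
5. **Playback check**: the voice in the video matches the reference timbre

### Test Results

- ✅ Reference audio uploaded successfully
- ✅ Qwen3-TTS generated the cloned audio (15s of inference for 4.6s of audio)
- ✅ LatentSync lip sync works
- ✅ Total generation time 143.1s
- ✅ Video plays correctly in the frontend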
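
The reported timings can be broken down with a little arithmetic. The snippet below uses only the numbers listed above, and the variable names are illustrative. It shows how much of the total falls outside the two named stages, how much LipSync accounts for, and the TTS real-time factor.

```python
# Timings reported in the test run above, in seconds.
TTS_STAGE = 17.7
LIPSYNC_STAGE = 122.8
TOTAL = 143.1
TTS_INFERENCE = 15.0
AUDIO_LENGTH = 4.6

# Time not attributed to either named stage (upload, compositing, I/O, ...).
other = TOTAL - (TTS_STAGE + LIPSYNC_STAGE)
# Fraction of the whole run spent in lip sync.
lipsync_share = LIPSYNC_STAGE / TOTAL
# Seconds of inference per second of generated audio.
real_time_factor = TTS_INFERENCE / AUDIO_LENGTH

print(f"outside TTS and LipSync: {other:.1f}s")
print(f"LipSync share of total: {lipsync_share:.0%}")
print(f"TTS real-time factor: {real_time_factor:.1f}x")
```

This prints 2.6s, 86%, and 3.3x, so in this run LipSync dominates and TTS is a comparatively small part of the total.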

---

## 🔧 PM2 Service Configuration

### New Qwen3-TTS Service

**Install prerequisites**:

```bash
conda activate qwen-tts
pip install fastapi uvicorn python-multipart
```

Startup script `run_qwen_tts.sh` (in the project **root directory**):

```bash
#!/bin/bash
cd /home/rongye/ProgramFiles/ViGent2/models/Qwen3-TTS
/home/rongye/ProgramFiles/miniconda3/envs/qwen-tts/bin/python qwen_tts_server.py
```

PM2 management commands:

```bash
# Start from the project root
cd /home/rongye/ProgramFiles/ViGent2
pm2 start ./run_qwen_tts.sh --name vigent2-qwen-tts
pm2 save

# Check status
pm2 status

# View logs
pm2 logs vigent2-qwen-tts --lines 50
```
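
`pm2 status` only shows that the process is alive, not that the model has finished loading. A minimal sketch of a readiness check against the `/health` endpoint defined earlier, using only the standard library, could look like this. `parse_health` and `qwen_tts_ready` are hypothetical helper names, and the default URL assumes the service runs locally on port 8009.

```python
import json
import urllib.request


def parse_health(body: bytes) -> bool:
    """Return True if a /health response body reports ready=true."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        return False
    return isinstance(payload, dict) and payload.get("ready") is True


def qwen_tts_ready(base_url: str = "http://localhost:8009", timeout: float = 2.0) -> bool:
    """Return True only if the service answers /health and the model is loaded."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return parse_health(resp.read())
    except OSError:
        # URLError subclasses OSError, so this covers refused connections and timeouts.
        return False
```

The backend could call `qwen_tts_ready()` before offering the voice cloning mode, or a deploy script could poll it after `pm2 start`.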

### Full Service List

| Service | Port | Purpose |
|--------|------|------|
| vigent2-backend | 8006 | FastAPI backend |
| vigent2-frontend | 3002 | Next.js frontend |
| vigent2-latentsync | 8007 | LatentSync lip sync |
| vigent2-qwen-tts | 8009 | Qwen3-TTS voice cloning |

---

## 📁 Files Changed Today

| File | Change | Description |
|------|----------|------|
| `models/Qwen3-TTS/qwen_tts_server.py` | Added | Qwen3-TTS HTTP inference service |
| `run_qwen_tts.sh` | Added | PM2 startup script (root directory) |
| `backend/app/services/voice_clone_service.py` | Added | Voice clone service (HTTP client) |
| `backend/app/api/ref_audios.py` | Added | Reference audio management API |
| `backend/app/main.py` | Modified | Registers the ref-audios router |
| `frontend/src/app/page.tsx` | Modified | TTS mode selection + reference audio UI |

---

## 🔗 Related Documents

- [TASK_COMPLETE.md](../TASK_COMPLETE.md) - Task overview
- [Day12.md](./Day12.md) - iOS compatibility and Qwen3-TTS deployment
- [QWEN3_TTS_DEPLOY.md](../QWEN3_TTS_DEPLOY.md) - Qwen3-TTS deployment guide
- [SUBTITLE_DEPLOY.md](../SUBTITLE_DEPLOY.md) - Subtitle feature deployment guide
- [DEPLOY_MANUAL.md](../DEPLOY_MANUAL.md) - Full deployment manual

---

## 🎬 Per-Character Highlighted Subtitles + Opening Title

### Background

To improve video quality, this adds per-character highlighted subtitles (karaoke effect) and an opening title.

### Technical Approach

| Component | Technology | Notes |
|------|------|------|
| Subtitle alignment | **faster-whisper** | Produces character-level timestamps |
| Video rendering | **Remotion** | React-based video composition framework |

### Architecture

```
Previous flow:
Text → EdgeTTS → Audio → LatentSync → FFmpeg compositing → Final video

New flow:
Text → EdgeTTS → Audio ─┬→ LatentSync → Lip-synced video ──┐
                        └→ faster-whisper → Subtitle JSON ─┴→ Remotion compositing → Final video
```
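
The new flow has two branches that both start from the same audio file. The log does not say whether they run concurrently. As a sketch, assuming they are independent and may overlap, the fan-out and join can be written with `asyncio.gather`. The three functions below are hypothetical stand-ins, not the project's real service calls.

```python
import asyncio


async def run_lipsync(audio_path: str) -> str:
    """Stand-in for the LatentSync call; returns the lip-synced video path."""
    await asyncio.sleep(0.02)
    return audio_path.replace(".wav", "_lipsync.mp4")


async def run_alignment(audio_path: str) -> str:
    """Stand-in for the faster-whisper call; returns the captions JSON path."""
    await asyncio.sleep(0.01)
    return audio_path.replace(".wav", "_captions.json")


async def run_branches(audio_path: str) -> tuple[str, str]:
    """Run both branches on the same audio and wait for both results."""
    video_path, captions_path = await asyncio.gather(
        run_lipsync(audio_path),
        run_alignment(audio_path),
    )
    return video_path, captions_path
```

If both models share one GPU, running the branches one after the other may be the safer choice. The join point before Remotion stays the same either way.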

### New Backend Services

#### 1. Subtitle Service (`whisper_service.py`)

Generates character-level timestamps with faster-whisper:

```python
from faster_whisper import WhisperModel


class WhisperService:
    def __init__(self, model_size="large-v3", device="cuda"):
        self.model = WhisperModel(model_size, device=device)

    async def align(self, audio_path: str, text: str, output_path: str):
        segments, info = self.model.transcribe(audio_path, word_timestamps=True)
        # Split each word into single characters, linearly interpolating timestamps
        result = {"segments": [...]}
        # Save to JSON
```

**Character-splitting algorithm**: for Chinese, faster-whisper returns word-level timestamps, so the service splits each word into single characters and linearly interpolates their timestamps:

```python
# Input: {"word": "大家好", "start": 0.0, "end": 0.9}
# Output:
[
    {"word": "大", "start": 0.0, "end": 0.3},
    {"word": "家", "start": 0.3, "end": 0.6},
    {"word": "好", "start": 0.6, "end": 0.9}
]
```
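
The splitting step itself is not shown in the log. A minimal sketch of linear interpolation over one word-level entry follows, assuming every character gets an equal share of the word's duration. `split_word_to_chars` is a hypothetical name, and rounding to milliseconds is a choice made here, not something the source specifies.

```python
def split_word_to_chars(word: dict) -> list[dict]:
    """Split one word-level entry into per-character entries.

    Each character gets an equal share of the word's duration.
    """
    # Drop whitespace so a leading space on the token gets no time slot.
    chars = [c for c in word["word"] if not c.isspace()]
    if not chars:
        return []
    start, end = word["start"], word["end"]
    step = (end - start) / len(chars)
    result = []
    for i, char in enumerate(chars):
        char_start = round(start + i * step, 3)
        # Pin the last character to the word's own end to avoid rounding drift.
        char_end = end if i == len(chars) - 1 else round(start + (i + 1) * step, 3)
        result.append({"word": char, "start": char_start, "end": char_end})
    return result
```

Calling it on `{"word": "大家好", "start": 0.0, "end": 0.9}` reproduces the three entries shown above. Equal shares are a simplification, since real speech does not spend the same time on every character.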

#### 2. Remotion Render Service (`remotion_service.py`)

Invokes Remotion to render the subtitles and title (abridged; `...` marks parts omitted from this log):

```python
class RemotionService:
    async def render(self, video_path, output_path, captions_path, title, ...):
        cmd = f"npx ts-node render.ts --video {video_path} --output {output_path} ..."
        # Run the render
```
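
The abridged snippet builds the command as a single string. If that string goes through a shell, a path containing spaces or shell metacharacters will break it. A sketch of the same call built as an argument list and run without a shell follows. `build_render_command` and `run_render` are hypothetical names, and only the `--video` and `--output` flags come from the log; the remaining flags are elided there and left out here.

```python
import asyncio


def build_render_command(video_path: str, output_path: str) -> list[str]:
    """Build the render command as an argument list (no shell quoting needed)."""
    return [
        "npx", "ts-node", "render.ts",
        "--video", video_path,
        "--output", output_path,
        # Further flags (captions, title, ...) are elided in the log and omitted here.
    ]


async def run_render(cmd: list[str], cwd: str) -> None:
    """Run the command without a shell and raise if it exits non-zero."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        cwd=cwd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        message = stderr.decode(errors="replace")
        raise RuntimeError(f"render failed ({proc.returncode}): {message}")
```

Raising on a non-zero exit code gives the caller a clear signal for switching to the FFmpeg path.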

### Remotion Project Structure

```
remotion/
├── package.json            # Node.js dependencies
├── render.ts               # Server-side render script
└── src/
    ├── Video.tsx           # Main video component
    ├── components/
    │   ├── Title.tsx       # Opening title (fade in/out)
    │   ├── Subtitles.tsx   # Per-character highlighted subtitles
    │   └── VideoLayer.tsx  # Video layer
    └── utils/
        └── captions.ts     # Caption data types
```

### Frontend UI

Adds a title and subtitle settings section:

| Feature | Notes |
|------|------|
| Opening title input | Optional; shown for 3 seconds at the start of the video |
| Subtitle toggle | On by default; can be turned off |
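
The title is shown for 3 seconds with a fade in and fade out, but the log gives no fade durations or frame rate. As an illustration only, written in Python rather than the Remotion component's TypeScript, the opacity per frame could be computed as below. The 30 fps and 0.5-second fades are assumptions, and `title_opacity` is a hypothetical name.

```python
def title_opacity(
    frame: int,
    fps: int = 30,
    show_seconds: float = 3.0,
    fade_seconds: float = 0.5,
) -> float:
    """Opacity of the opening title at a given frame, in [0.0, 1.0]."""
    t = frame / fps
    if t < 0 or t >= show_seconds:
        return 0.0
    fade_in = t / fade_seconds                    # ramps 0 → 1 over the first fade
    fade_out = (show_seconds - t) / fade_seconds  # ramps 1 → 0 over the last fade
    return max(0.0, min(1.0, fade_in, fade_out))
```

Taking the minimum of the two ramps and clamping at 1.0 yields a trapezoid: rising at the start, flat in the middle, falling at the end.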

### Issues and Fixes

#### Issue 1: `fs` module error

**Symptom**: Remotion bundling fails with `fs.js doesn't exist`

**Cause**: `captions.ts` contained a `loadCaptions` function that used the Node.js `fs` module

**Fix**: Removed the unused `loadCaptions` function

#### Issue 2: Video file fails to load

**Symptom**: The local video cannot be read over the `file://` protocol

**Fix**:

1. `render.ts` points `publicDir` at the video's directory
2. `VideoLayer.tsx` loads the video with `staticFile()`

```typescript
// render.ts
const publicDir = path.dirname(path.resolve(options.videoPath));
const bundleLocation = await bundle({
  entryPoint: path.resolve(__dirname, './src/index.ts'),
  publicDir, // key setting
});

// VideoLayer.tsx
const videoUrl = staticFile(videoSrc);
```

### Test Results

- ✅ faster-whisper subtitle alignment succeeded (~1s)
- ✅ Remotion render succeeded (~10s)
- ✅ Per-character highlighting works
- ✅ Opening title fades in and out correctly
- ✅ Fallback works (falls back to FFmpeg when Remotion fails)
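
The fallback in the last item is not shown in the log. A minimal sketch of the pattern follows, assuming both renderers share the same call shape. `render_with_fallback` is a hypothetical helper, and the two `fake_*` functions are stand-ins for the real Remotion and FFmpeg steps.

```python
import asyncio
from typing import Awaitable, Callable

# A renderer takes (video_path, output_path) and returns the output path.
Renderer = Callable[[str, str], Awaitable[str]]


async def render_with_fallback(
    video_path: str,
    output_path: str,
    primary: Renderer,
    fallback: Renderer,
) -> tuple[str, str]:
    """Try the primary renderer; on any failure use the fallback.

    Returns (output_path, name_of_renderer_used).
    """
    try:
        return await primary(video_path, output_path), "primary"
    except Exception as exc:  # broad on purpose: any render failure triggers the fallback
        print(f"primary renderer failed, falling back: {exc}")
        return await fallback(video_path, output_path), "fallback"


async def fake_remotion(video_path: str, output_path: str) -> str:
    """Hypothetical stand-in that always fails, to exercise the fallback path."""
    raise RuntimeError("bundle error")


async def fake_ffmpeg(video_path: str, output_path: str) -> str:
    """Hypothetical stand-in for the FFmpeg compositing step."""
    return output_path
```

Returning which renderer was used lets the caller record that a video was produced without subtitles or a title.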

---

## 📁 Files Changed Today (Complete)

| File | Change | Description |
|------|----------|------|
| `models/Qwen3-TTS/qwen_tts_server.py` | Added | Qwen3-TTS HTTP inference service |
| `run_qwen_tts.sh` | Added | PM2 startup script (root directory) |
| `backend/app/services/voice_clone_service.py` | Added | Voice clone service (HTTP client) |
| `backend/app/services/whisper_service.py` | Added | Subtitle alignment service (faster-whisper) |
| `backend/app/services/remotion_service.py` | Added | Remotion render service |
| `backend/app/api/ref_audios.py` | Added | Reference audio management API |
| `backend/app/api/videos.py` | Modified | Integrates subtitles and title |
| `backend/app/main.py` | Modified | Registers the ref-audios router |
| `backend/requirements.txt` | Modified | Adds the faster-whisper dependency |
| `remotion/` | Added | Remotion video rendering project |
| `frontend/src/app/page.tsx` | Modified | TTS mode selection + title and subtitle UI |
| `Docs/SUBTITLE_DEPLOY.md` | Added | Subtitle feature deployment document |