Compare commits

2 commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Kevin Wong | 48bc78fe38 | Update | 2026-03-02 16:35:16 +08:00 |
| Kevin Wong | abf005f225 | Update | 2026-02-28 17:49:32 +08:00 |

28 changed files with 1350 additions and 335 deletions
@@ -70,6 +70,18 @@ run_cosyvoice.sh # PM2 startup script
| ref_text | string | Yes | Transcript of the reference audio |
| language | string | No | Language (default "Chinese"; CosyVoice auto-detects) |
| speed | float | No | Speech rate (default 1.0, range 0.5-2.0, recommended 0.8-1.2) |
| instruct_text | string | No | Tone instruction (default ""; when non-empty, switches to `inference_instruct2` mode) |
**Inference mode branching:**
- `instruct_text` empty → `inference_zero_shot(text, prompt_text, ref_audio)` — plain voice cloning
- `instruct_text` non-empty → `inference_instruct2(text, instruct_text, ref_audio)` — voice cloning with tone/emotion control
**Supported tone instruction examples:**
```
"You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>"
"You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>"
"You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>"
```
**Returns:** a WAV audio file


@@ -1,8 +1,8 @@
## Subtitle Sync Fix + Lip Parameter Tuning + Full Video Pipeline Optimization + Preview Background Fix (Day 29)
## Subtitle Sync Fix + Lip Parameter Tuning + Full Video Pipeline Optimization + Preview Background Fix + CosyVoice Tone Control (Day 29)
### Overview
This round does a full review and optimization of the video generation pipeline: fixes the subtitle/speech desync (Whisper timestamp smoothing + source-text rhythm mapping), tunes LatentSync lip parameters, switches compose to stream copy to drop a redundant re-encode, and adds FFmpeg timeout protection, a global concurrency limit, Redis task TTLs, temp-file cleanup, and dead-code removal. Also fixes the style-preview background CORS breakage caused by the frontend domain migration (vigent.hbyrkj.top → ipagent.ai-labz.cn).
This round does a full review and optimization of the video generation pipeline: fixes the subtitle/speech desync (Whisper timestamp smoothing + source-text rhythm mapping), tunes LatentSync lip parameters, switches compose to stream copy to drop a redundant re-encode, and adds FFmpeg timeout protection, a global concurrency limit, Redis task TTLs, temp-file cleanup, and dead-code removal. Fixes the style-preview background CORS breakage caused by the frontend domain migration. Adds CosyVoice tone control: in voice-clone mode, supports happy/sad/angry emotional delivery (based on `inference_instruct2`).
---
@@ -194,6 +194,79 @@ ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay
---
### 12. CosyVoice Tone Control
- **Feature**: Voice-clone mode gains a "tone" dropdown (Normal/Cheerful/Somber/Serious), using CosyVoice3's `inference_instruct2()` to control tone and emotion via natural-language instructions
- **Default behavior unchanged**: selecting "Normal" still goes through `inference_zero_shot()`, identical to before the change
#### Data Flow
```
User selects tone → setEmotion("happy") → persisted to localStorage
→ Generate audio → emotion mapped to instruct_text
→ POST /api/generated-audios/generate { instruct_text }
→ voice_clone_service → POST localhost:8010/generate { instruct_text }
→ instruct_text non-empty ? inference_instruct2() : inference_zero_shot()
```
#### CosyVoice Service — `cosyvoice_server.py`
- The `/generate` endpoint adds an `instruct_text: str = Form("")` parameter
- Inference branching: empty → `inference_zero_shot()`; non-empty → `inference_instruct2(text, instruct_text, ref_audio_path, ...)`
- `inference_instruct2` needs no `prompt_text`; it takes `instruct_text` + `prompt_wav` directly
#### Backend Pass-Through
- `schemas.py`: `GenerateAudioRequest` adds `instruct_text: Optional[str] = None`
- `service.py`: the voiceclone branch of `generate_audio_task()` passes `instruct_text=req.instruct_text or ""`
- `voice_clone_service.py`: `_generate_once()` and `generate_audio()` gain an `instruct_text` parameter
#### Frontend
- `useHomeController.ts`: adds `emotion` state + an `emotionToInstruct` mapping table
- `useHomePersistence.ts`: tone selection persisted to localStorage
- `useGeneratedAudios.ts`: `generateAudio` params gain `instruct_text`
- `GeneratedAudiosPanel.tsx`: tone dropdown (left of the speed button), reusing the speed dropdown styling; visible only in voiceclone mode
- `HomePage.tsx`: passes through `emotion` / `onEmotionChange`
#### instruct_text format (from the CosyVoice3 instruct_list)
```
Normal:   "" (uses inference_zero_shot)
Cheerful: "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>"
Somber:   "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>"
Serious:  "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>"
```
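A minimal sketch of the mapping and mode selection the flow above implies. `EMOTION_TO_INSTRUCT` and `pick_inference_mode` are illustrative names, not the project's actual identifiers; the prompt strings are the ones listed above.

```python
# Hypothetical sketch of the emotion → instruct_text mapping and the
# zero-shot vs instruct2 branching; names are illustrative only.
EMOTION_TO_INSTRUCT = {
    "normal": "",
    "happy": "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
    "sad": "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
    "angry": "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
}

def pick_inference_mode(emotion: str) -> tuple:
    """Return (mode, instruct_text): an empty instruct_text means zero-shot cloning."""
    instruct_text = EMOTION_TO_INSTRUCT.get(emotion, "")
    mode = "inference_instruct2" if instruct_text else "inference_zero_shot"
    return mode, instruct_text
```

An unknown or missing emotion falls back to the empty string, so the default path stays `inference_zero_shot`, matching the "default behavior unchanged" guarantee.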
---
## 📁 Modified Files
| File | Change |
|------|--------|
| `backend/app/services/whisper_service.py` | Timestamp smoothing + source-text rhythm mapping + per-character duration clamping |
| `remotion/src/utils/captions.ts` | Added `getCurrentSegment` / `getCurrentWordIndex` |
| `backend/app/services/video_service.py` | compose stream copy + FFmpeg timeout protection |
| `backend/app/modules/videos/workflow.py` | Semaphore(2) concurrency limit + font cleanup + Whisper logic dedup |
| `backend/app/modules/videos/task_store.py` | Redis TTL + expired index cleanup |
| `backend/app/services/lipsync_service.py` | Removed `_preprocess_video()` dead code |
| `backend/app/services/remotion_service.py` | concurrency 16 → 4 |
| `remotion/render.ts` | Added concurrency parameter support |
| `backend/app/modules/materials/router.py` | Added `/stream/{material_id}` same-origin proxy endpoint |
| `frontend/.../useVideoFrameCapture.ts` | Removed crossOrigin |
| `frontend/.../useHomeController.ts` | Frame-capture URL switched to same-origin proxy + emotion state + emotionToInstruct mapping |
| `backend/.env` | Lip parameters + Alipay domain update |
| `models/CosyVoice/cosyvoice_server.py` | `/generate` adds `instruct_text`, branching `inference_instruct2` / `inference_zero_shot` |
| `backend/app/services/voice_clone_service.py` | `_generate_once` / `generate_audio` pass through `instruct_text` |
| `backend/app/modules/generated_audios/schemas.py` | `GenerateAudioRequest` adds an `instruct_text` field |
| `backend/app/modules/generated_audios/service.py` | voiceclone branch passes `instruct_text` |
| `frontend/.../useGeneratedAudios.ts` | `generateAudio` params add `instruct_text` |
| `frontend/.../useHomePersistence.ts` | emotion persistence (localStorage) |
| `frontend/.../GeneratedAudiosPanel.tsx` | Tone dropdown UI (embedded + standalone) |
| `frontend/.../HomePage.tsx` | Passes through emotion / onEmotionChange |
---
## 🔍 Verification
1. **Subtitle sync**: generate a video and watch the per-character highlighting; no leading/lagging/gaps should appear
@@ -204,3 +277,7 @@ ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay
6. **Font cleanup**: after generating a video, no font files should remain in the temp directory
7. **Preview background**: select a material → click "Preview style"; the first video frame should show (not a gradient)
8. **Alipay**: after initiating payment, callback and redirect URLs use the new domain
9. **Tone control**: in voice-clone mode, select "Happy"/"Angry" and generate audio; the CosyVoice log shows `🎭 Instruct mode` and the audio tone changes noticeably
10. **Tone default**: selecting "Normal" behaves exactly as before the change (uses `inference_zero_shot`)
11. **Tone persistence**: after switching tone and refreshing the page, the dropdown restores the last selection
12. **Tone visibility**: the tone dropdown shows only in voiceclone mode, not in edgetts mode

Docs/DevLogs/Day30.md (new file, 363 lines)
@@ -0,0 +1,363 @@
## Remotion Cache Fix + Encoding Pipeline Quality Optimization + Lip-Sync Fault Tolerance + Model Selection (Day 30)
### Overview
This round addresses four areas: (1) a severe bug where Remotion's bundle cache caused titles/subtitles to disappear; (2) a full optimization of the LatentSync + MuseTalk dual-engine encoding pipeline to eliminate redundant lossy encodes; (3) stronger LatentSync robustness, continuing inference instead of aborting the task when some material frames have no detectable face; (4) frontend lip-sync model selection, letting users switch between default/fast/advanced models.
---
## ✅ Changes
### 1. Remotion Bundle Cache 404 Fix (Severe Bug)
- **Symptom**: generated videos lacked titles and subtitles; Remotion rendering failed and silently fell back to FFmpeg, which has no text-overlay capability
- **Root cause**: Remotion's bundle cache copies `publicDir` (where videos/fonts live) only on the first bundling. Once the code stabilizes the cache keeps hitting, so newly generated video and font files are absent from the old cache's `public/` directory → the Remotion HTTP server returns 404 → rendering fails
- **Attempt**: first tried `fs.symlinkSync` symlinks, but Remotion's internal HTTP server does not follow symlinks
- **Final fix**: use `fs.linkSync` hard links (zero-copy on the same filesystem, fully transparent to the app), falling back to `fs.copyFileSync` across filesystems
**File**: `remotion/render.ts`
```typescript
function ensureInCachedPublic(cachedPublicDir, srcAbsPath, fileName) {
  // Check whether the file already exists as the same inode
  // Prefer a hard link (zero-copy); fall back to copying across filesystems
  try {
    fs.linkSync(srcAbsPath, cachedPath);
  } catch {
    fs.copyFileSync(srcAbsPath, cachedPath);
  }
}
```
When a cached bundle is used, the files the current render needs (video + fonts) are hard-linked into the cache's `public/` directory:
- The video file (`videoFileName`)
- Font files (extracted from the `font_file` fields of `subtitleStyle` / `titleStyle` / `secondaryTitleStyle`)
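The same link-or-copy pattern can be sketched in Python; `ensure_in_dir` is an illustrative name, not part of the project:

```python
import os
import shutil

def ensure_in_dir(src: str, dst_dir: str) -> str:
    """Hard-link src into dst_dir; fall back to a copy across filesystems.

    Skips work when the destination already points at the same inode.
    """
    dst = os.path.join(dst_dir, os.path.basename(src))
    if os.path.exists(dst) and os.path.samefile(src, dst):
        return dst  # already the same inode, nothing to do
    try:
        os.link(src, dst)  # zero-copy hard link on the same filesystem
    except OSError:
        shutil.copyfile(src, dst)  # cross-filesystem fallback
    return dst
```

A hard link makes the cached `public/` entry and the freshly generated file the same inode, so the server sees the new content with no copy cost; `OSError` covers both `EXDEV` (cross-device) and an already-existing destination.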
---
### 2. Video Encoding Pipeline Quality Optimization
A full review of the pipeline found that from material upload to final output, a video undergoes up to **5-6 lossy re-encodes**, versus only 1-2 in the official LatentSync demo.
#### Encoding chain before optimization
| # | Stage | CRF | Problem |
|---|-------|-----|---------|
| 1 | Orientation normalization | 23 | Conditionally triggered |
| 2 | `prepare_segment` scale + duration | 23 | Always runs; quality too low |
| 3 | LatentSync `read_video` FPS conversion | 18 | **Re-encodes even when already 25fps** |
| 4 | LatentSync `imageio` frame writing | 13 | Model output |
| 5 | LatentSync final mux | 18 | **CRF 13 output immediately re-encoded at CRF 18** |
| 6 | compose | copy | Already optimized on Day 29 |
| 7 | Multi-material concat | 23 | **Segment parameters already unified; re-encode unnecessary** |
| 8 | Remotion render | ~18 | Always runs (text overlay) |
#### Optimizations
##### 2a. LatentSync `read_video`: skip the redundant FPS re-encode
**File**: `models/LatentSync/latentsync/utils/util.py`
- The original code unconditionally ran `ffmpeg -r 25 -crf 18`, even when the input video was already 25fps
- Added FPS detection: when `abs(current_fps - 25.0) < 0.5`, the original file is used directly
- Our `prepare_segment` already outputs a uniform 25fps, so this step was entirely redundant
```python
cap = cv2.VideoCapture(video_path)
current_fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()
if abs(current_fps - 25.0) < 0.5:
    print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion")
    target_video_path = video_path
else:
    # Re-encode only when the input is not 25fps
    command = f"ffmpeg ... -r 25 -crf 18 ..."
```
##### 2b. LatentSync final mux: stream copy instead of re-encoding
**File**: `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py`
- Original: after `imageio` wrote frames at high-quality CRF 13, the final mux fully re-encoded with `libx264 -crf 18`
- Fix: switch to `-c:v copy` stream copy, muxing only the audio track; the video stream is untouched
```diff
- ffmpeg ... -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0
+ ffmpeg ... -c:v copy -c:a aac -q:a 0
```
##### 2c. `prepare_segment` + `normalize_orientation`: CRF 23 → 18
**File**: `backend/app/services/video_service.py`
- `normalize_orientation`: CRF 23 → 18
- `prepare_segment` trim temp file: CRF 23 → 18
- `prepare_segment` main command: CRF 23 → 18
- CRF 18 is the "high quality" level, matching LatentSync's internal standard
##### 2d. Multi-material concat stream copy
**File**: `backend/app/services/video_service.py`
- The original code re-encoded the concatenation with `libx264 -crf 23`
- All segments are already unified by `prepare_segment` to the same resolution/frame rate/encoding parameters
- Switched to `-c:v copy` stream copy, eliminating one full re-encode
```diff
- -vsync cfr -r 25 -c:v libx264 -preset fast -crf 23 -pix_fmt yuv420p
+ -c:v copy
```
#### Encoding chain after optimization
| # | Stage | CRF | Status |
|---|-------|-----|--------|
| 1 | Orientation normalization | **18** | Quality raised (conditionally triggered) |
| 2 | `prepare_segment` | **18** | Quality raised (always runs) |
| 3 | ~~LatentSync FPS conversion~~ | - | **Eliminated** |
| 4 | LatentSync model output | 13 | Unchanged (unavoidable) |
| 5 | ~~LatentSync final mux~~ | - | **Eliminated (copy)** |
| 6 | compose | copy | Unchanged |
| 7 | ~~Multi-material concat~~ | - | **Eliminated (copy)** |
| 8 | Remotion render | ~18 | Unchanged (unavoidable) |
**Total: 5-6 lossy encodes → 3** (prepare_segment → LatentSync model output → Remotion); quality loss nearly halved.
---
## 📁 Modified Files
| File | Change |
|------|--------|
| `remotion/render.ts` | Hard-link videos + fonts into the public directory when the cached bundle is used |
| `models/LatentSync/latentsync/utils/util.py` | `read_video` detects FPS; skips re-encode at 25fps |
| `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py` | Final mux `-c:v copy`; faceless-frame tolerance (affine_transform + restore_video) |
| `backend/app/services/video_service.py` | `normalize_orientation` CRF 23→18; `prepare_segment` CRF 23→18; `concat_videos` `-c:v copy` |
| `backend/app/modules/videos/workflow.py` | Fall back to the original video when single-material LatentSync throws |
---
### 3. LatentSync Faceless-Frame Tolerance
- **Problem**: if some frames in a material have no detectable face (head turns, occlusion, empty shots), `affine_transform` throws and the whole inference task fails
- **Changes**:
  - `affine_transform_video`: catch per-frame exceptions and fill with the nearest valid frame's face/box/affine_matrix (keeping the tensor batch dimension intact); still raise when no frame has a face
  - `restore_video`: new `valid_face_flags` parameter; faceless frames keep the original picture (no mouth replacement)
  - `loop_video`: `valid_face_flags` follows looping and flipping
  - `workflow.py`: on the single-material path, if `lipsync.generate()` throws overall, copy the original video and continue; the task no longer fails
---
### 4. MuseTalk Encoding Chain Optimization
#### 4a. FFmpeg rawvideo pipe direct encoding (eliminates the lossy intermediate file)
**File**: `models/MuseTalk/scripts/server.py`
- **Old flow**: UNet inference frames → `cv2.VideoWriter(mp4v)` intermediate file (lossy) → FFmpeg re-encode + audio mux (lossy again)
- **New flow**: UNet inference frames → FFmpeg rawvideo stdin pipe → a single libx264 encode + audio mux
```python
ffmpeg_cmd = [
    "ffmpeg", "-y", "-v", "warning",
    "-f", "rawvideo", "-pix_fmt", "bgr24",
    "-s", f"{w}x{h}", "-r", str(fps),
    "-i", "-",  # stdin pipe input
    "-i", audio_path,
    "-c:v", "libx264", "-preset", ENCODE_PRESET, "-crf", str(ENCODE_CRF),
    "-pix_fmt", "yuv420p",
    "-c:a", "copy", "-shortest",
    output_vid_path,
]
ffmpeg_proc = subprocess.Popen(ffmpeg_cmd, stdin=subprocess.PIPE, ...)
# Each frame is written directly: pipe_in.write(frame.tobytes())
```
Key implementation details:
- `-pix_fmt bgr24` matches OpenCV's native frame format, with zero conversion overhead
- `np.ascontiguousarray` ensures frame memory is contiguous
- `BrokenPipeError` is caught and the return code checked to cover failure paths
- `pipe_in.close()` precedes `ffmpeg_proc.wait()`, correctly sending EOF
- Compositing fallbacks (resize failure, mask failure, blending failure) all emit the original frame via `_write_pipe_frame`
#### 4b. MuseTalk parameters moved to environment variables + quality-first tier
**Files**: `models/MuseTalk/scripts/server.py` + `backend/.env`
All inference and encoding parameters moved from hard-coded values to `.env`; the current tier is "quality-first":
| Parameter | Old default | Quality-first | Effect |
|-----------|-------------|---------------|--------|
| `MUSETALK_DETECT_EVERY` | 5 | **2** | Face detection 2.5x more frequent; steadier tracking |
| `MUSETALK_BLEND_CACHE_EVERY` | 5 | **2** | More frequent mask updates; cleaner facial-edge blending |
| `MUSETALK_EXTRA_MARGIN` | 15 | **14** | Chin-region fine-tuning |
| `MUSETALK_BLEND_MODE` | auto | **jaw** | Explicit jaw mode for v1.5 |
| `MUSETALK_ENCODE_CRF` | 18 | **14** | Near visually lossless (the output is re-encoded by Remotion anyway) |
| `MUSETALK_ENCODE_PRESET` | medium | **slow** | Better compression efficiency at the same CRF |
| `MUSETALK_AUDIO_PADDING` | 2/2 | 2/2 | Unchanged |
| `MUSETALK_FACEPARSING_CHEEK` | 90/90 | 90/90 | Unchanged |
Full list of newly configurable parameters: `DETECT_EVERY`, `BLEND_CACHE_EVERY`, `AUDIO_PADDING_LEFT/RIGHT`, `EXTRA_MARGIN`, `DELAY_FRAME`, `BLEND_MODE`, `FACEPARSING_LEFT/RIGHT_CHEEK_WIDTH`, `ENCODE_CRF`, `ENCODE_PRESET`
---
### 5. Workflow Async De-Blocking + compose Skip Optimization
#### 5a. Blocking calls moved to the thread pool
**File**: `backend/app/modules/videos/workflow.py`
Several synchronous FFmpeg calls in the workflow blocked the asyncio event loop, leaving other API requests (health checks, task-status queries) unanswered. A generic helper `_run_blocking()` now routes all blocking calls through the thread pool:
```python
async def _run_blocking(func, *args):
    """Run a blocking function in the thread pool so the event loop is not stalled."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)
```
Converted blocking call sites:
| Call | Location | Notes |
|------|----------|-------|
| `video.normalize_orientation()` | Single-material rotation normalization | FFmpeg rotate/transcode |
| `video.prepare_segment()` | Multi-material segment preparation | FFmpeg scale + duration trim, parallelized across segments with `asyncio.gather` |
| `video.concat_videos()` | Multi-material concatenation | FFmpeg concat |
| `video.prepare_segment()` | Single-material prepare | FFmpeg scale + duration trim |
| `video.mix_audio()` | BGM mixing | FFmpeg audio mixing |
| `video._get_duration()` | Audio/video duration probing (3 sites) | ffprobe subprocess |
#### 5b. `prepare_segment`: skip scale for matching resolutions
**File**: `backend/app/modules/videos/workflow.py`
Previously, `target_resolution` was always passed to `prepare_segment` regardless of whether the material already matched the target, forcing a scale filter + libx264 re-encode. Now resolutions are compared per material:
- **Multi-material**: per-segment check; matching segments pass `None` (`prepare_target_res = None if res == base_res else base_res`), taking the `-c:v copy` branch
- **Single material**: `get_resolution` is compared first; pass `None` on a match
When the resolution matches and there is no trimming, looping, or frame-rate change, `prepare_segment` internally takes `-c:v copy`, a completely lossless path.
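The per-segment decision reduces to a small pure function; `pick_target_res` is an illustrative name for the one-liner quoted above:

```python
from typing import Optional, Tuple

Res = Tuple[int, int]

def pick_target_res(segment_res: Res, base_res: Res) -> Optional[Res]:
    """Return None when the segment already matches the base resolution,
    letting prepare_segment take its -c:v copy branch; otherwise return
    the resolution to scale to."""
    return None if segment_res == base_res else base_res
```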
#### 5c. `_get_duration()` moved to the thread pool
**File**: `backend/app/modules/videos/workflow.py`
The 3 synchronous `video._get_duration()` ffprobe calls become `await _run_blocking(video._get_duration, ...)`, keeping the event loop unblocked.
#### 5d. Unified CRF for the compose looping path
**File**: `backend/app/services/video_service.py`
When `compose()` must loop the video, its encode is raised from CRF 23 to CRF 18, matching the pipeline-wide quality standard.
#### 5e. Multi-material segment validation
**File**: `backend/app/modules/videos/workflow.py`
After multi-material `prepare_segment` completes, a segment-count consistency check prevents empty segments from reaching concat and throwing.
#### 5f. De-blocking inside compose()
**File**: `backend/app/services/video_service.py`
`compose()` becomes `async def`; its internal `_get_duration()` and `_run_ffmpeg()` run in the thread pool via `loop.run_in_executor`.
#### 5g. Pass-through when a second compose is unnecessary
**File**: `backend/app/modules/videos/workflow.py`
When there is no BGM (`final_audio_path == audio_path`), the LatentSync/MuseTalk output already carries the correct audio track, so the redundant compose step is skipped:
```python
needs_audio_compose = str(final_audio_path) != str(audio_path)
```
- **Remotion path**: if the audio is unchanged, skip the pre-compose and feed the lipsync output straight into Remotion
- **Non-Remotion path**: if the audio is unchanged, `shutil.copy` passes the lipsync output through without compose
---
### 6. Frontend Lip-Sync Model Selection
A model dropdown to the right of the generate button lets users pick the lip-sync engine, passed through the full chain to backend routing.
#### Model options
| Option | Value | Routing logic |
|--------|-------|---------------|
| Default model | `default` | Keeps the existing threshold policy (`LIPSYNC_DURATION_THRESHOLD` as the divide: short videos → LatentSync, long videos → MuseTalk) |
| Fast model | `fast` | Force MuseTalk (fall back to LatentSync if unavailable) |
| Advanced model | `advanced` | Force LatentSync (skip MuseTalk) |
All three modes ultimately fall back to LatentSync, so no mode can end up without a usable model.
#### Data Flow
```
Frontend select → setLipsyncModelMode("fast") → persisted to localStorage
User clicks "Generate video" → handleGenerate()
→ payload.lipsync_model = lipsyncModelMode
→ POST /api/videos/generate { ..., lipsync_model: "fast" }
→ workflow: req.lipsync_model passed to lipsync.generate(model_mode=...)
→ lipsync_service.generate(): routes by model_mode
   → fast: force MuseTalk → fall back to LatentSync
   → advanced: force LatentSync
   → default: threshold policy
```
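The three-way routing above can be sketched as a pure function. `route_engine` and its parameters are illustrative, not the actual `lipsync_service.generate()` signature:

```python
# Hypothetical routing helper mirroring the options table; the real service
# signature and threshold handling may differ.
LIPSYNC_DURATION_THRESHOLD = 120  # seconds; >= this uses MuseTalk by default

def route_engine(model_mode: str, audio_duration: float,
                 musetalk_available: bool = True) -> str:
    if model_mode == "advanced":
        return "latentsync"  # force LatentSync, skip MuseTalk
    if model_mode == "fast":
        # Force MuseTalk, with LatentSync as the fallback
        return "musetalk" if musetalk_available else "latentsync"
    # default: duration-threshold policy
    if audio_duration >= LIPSYNC_DURATION_THRESHOLD and musetalk_available:
        return "musetalk"
    return "latentsync"
```

Every path that cannot reach MuseTalk lands on LatentSync, matching the "all three modes fall back to LatentSync" guarantee.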
#### Changed files
| File | Change |
|------|--------|
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | New model `<select>` dropdown right of the generate button |
| `frontend/src/features/home/ui/HomePage.tsx` | Passes through `modelMode` / `onModelModeChange` |
| `frontend/src/features/home/model/useHomeController.ts` | `lipsyncModelMode` state + payload pass-through |
| `frontend/src/features/home/model/useHomePersistence.ts` | Read/validate/write persistence |
| `backend/app/modules/videos/schemas.py` | `lipsync_model: Literal["default", "fast", "advanced"]` |
| `backend/app/modules/videos/workflow.py` | `model_mode=req.lipsync_model` pass-through at both multi- and single-material sites |
| `backend/app/services/lipsync_service.py` | `generate()` gains a `model_mode` parameter with three-way routing |
---
## 📁 Full Modified-File List
| File | Change |
|------|--------|
| `remotion/render.ts` | Hard-link videos + fonts into the public directory when the cached bundle is used |
| `models/LatentSync/latentsync/utils/util.py` | `read_video` detects FPS; skips re-encode at 25fps |
| `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py` | Final mux `-c:v copy`; faceless-frame tolerance |
| `backend/app/services/video_service.py` | CRF 23→18; `concat_videos` copy; `compose()` async + loop CRF 18 |
| `backend/app/modules/videos/workflow.py` | Thread-pooling; skip scale on matching resolution; compose skip; segment validation; model-selection pass-through |
| `backend/app/modules/videos/schemas.py` | New `lipsync_model` field |
| `backend/app/services/lipsync_service.py` | `generate()` gains `model_mode` three-way routing |
| `models/MuseTalk/scripts/server.py` | FFmpeg rawvideo pipe; parameters via environment variables |
| `backend/.env` | New MuseTalk quality-first parameters |
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | Model dropdown UI |
| `frontend/src/features/home/ui/HomePage.tsx` | Model-state pass-through |
| `frontend/src/features/home/model/useHomeController.ts` | `lipsyncModelMode` state + payload |
| `frontend/src/features/home/model/useHomePersistence.ts` | Model-selection persistence |
---
## 🔍 Verification
1. **Titles/subtitles restored**: generated videos should have titles and per-character highlighted subtitles (Remotion renders successfully; no FFmpeg fallback)
2. **Remotion log**: should show `Hardlinked into cached bundle:` or `Copied into cached bundle:` instead of 404
3. **LatentSync FPS skip**: log should show `Video already at 25.0fps, skipping FPS conversion`
4. **LatentSync mux**: the final mux in the FFmpeg log should be `-c:v copy`
5. **Quality comparison**: same material + audio; the mouth region (especially teeth) should be visibly sharper than before
6. **Multi-material concat**: the concat step should be a stream copy, dropping from seconds to milliseconds
7. **Faceless-frame tolerance**: materials containing head-turn/occluded frames no longer fail the task; faceless frames keep the original picture
8. **MuseTalk pipe encoding**: no intermediate mp4v file in the logs; the compositing stage writes directly through the pipe
9. **MuseTalk quality parameters**: `curl localhost:8011/health` confirms the service is up; mouth edges in generated videos are cleaner
10. **Event loop not blocked**: while a video is generating, endpoints such as `/api/tasks/{id}` respond normally without timeouts
11. **compose skip**: with no BGM, the log should show `Audio unchanged, skip pre-Remotion compose`
12. **Same-resolution scale skip**: when the material already matches the target resolution, `prepare_segment` should take `-c:v copy` (no scale filter in the log)
13. **compose loop CRF**: the looping-path encode should be CRF 18 (not 23)
14. **Model-selection UI**: a Default/Fast/Advanced model dropdown appears right of the generate button
15. **Model-selection persistence**: after switching models and refreshing, the dropdown restores the last selection
16. **Fast-model routing**: selecting "Fast model" should produce the backend log `强制快速模型MuseTalk`
17. **Advanced-model routing**: selecting "Advanced model" should produce the backend log `强制高级模型LatentSync`
18. **Default model unchanged**: selecting "Default model" behaves exactly as before (threshold routing)


@@ -37,6 +37,7 @@ The ViGent2 frontend, built with Next.js 16 + TailwindCSS.
- **Re-recognition**: old reference audio can be re-transcribed and re-trimmed (RotateCw button).
- **One-click cloning**: selecting a reference audio automatically calls the CosyVoice 3.0 service.
- **Speed control**: voice-clone mode supports 5 speed steps (0.8-1.2), with the selection persisted (Day 23).
- **Tone control**: voice-clone mode supports 4 tones (Normal/Cheerful/Somber/Serious), based on CosyVoice3 `inference_instruct2`, with the selection persisted (Day 29).
- **Multi-language support**: EdgeTTS 10-language voice list; voice-clone `language` pass-through (Day 22).
### 4. Audio-First + Timeline Arrangement [added Day 23]


@@ -201,6 +201,29 @@ LatentSync 1.6 needs ~18GB VRAM. On OOM errors:
- `inference_steps`: raising to 30-50 improves quality
- `guidance_scale`: raising improves lip sync, but too high may cause jitter
### Encoding Pipeline Optimization (Day 30)
Two redundant encodes in LatentSync's default internal flow have been optimized away:
1. **`read_video` FPS conversion**: originally an unconditional `ffmpeg -r 25 -crf 18`; now detects FPS and skips at 25fps (our `prepare_segment` already outputs 25fps)
2. **final mux double encode**: originally `imageio` wrote frames at CRF 13 and then re-encoded with `libx264 -crf 18` for the mux; now `-c:v copy` stream copy
These two optimizations live in:
- `latentsync/utils/util.py` — the `read_video()` function
- `latentsync/pipelines/lipsync_pipeline.py` — the final mux command
---
### Faceless-Frame Tolerance (Day 30)
When some material frames have no detectable face (head turns, occlusion, empty shots), inference no longer aborts:
- `affine_transform_video`: fills a failed frame from the nearest valid frame; still raises when no frame has a face
- `restore_video`: faceless frames keep the original picture, with no mouth replacement
- Backend `workflow.py`: on an overall LatentSync failure, automatically falls back to the original video; the task does not fail
The changes live in `latentsync/pipelines/lipsync_pipeline.py`.
---
## References


@@ -1,6 +1,6 @@
# MuseTalk Deployment Guide
> **Updated**: 2026-02-27
> **Updated**: 2026-03-02
> **Applies to**: MuseTalk v1.5 (resident-service mode)
> **Architecture**: FastAPI resident service + PM2 process management
@@ -173,17 +173,36 @@ curl http://localhost:8011/health
Relevant variables in `backend/.env`:
```ini
# MuseTalk configuration
# MuseTalk base configuration
MUSETALK_GPU_ID=0                      # GPU index (shared with CosyVoice)
MUSETALK_API_URL=http://localhost:8011 # resident service address
MUSETALK_BATCH_SIZE=32                 # inference batch size
MUSETALK_VERSION=v15                   # model version
MUSETALK_USE_FLOAT16=true              # half-precision acceleration
# Inference quality parameters
MUSETALK_DETECT_EVERY=2                # face-detection downsampling interval (frames; smaller = more accurate but slower)
MUSETALK_BLEND_CACHE_EVERY=2           # BiSeNet mask cache update interval (frames)
MUSETALK_AUDIO_PADDING_LEFT=2          # Whisper temporal context (left)
MUSETALK_AUDIO_PADDING_RIGHT=2         # Whisper temporal context (right)
MUSETALK_EXTRA_MARGIN=14               # v1.5 chin-region expansion in pixels
MUSETALK_DELAY_FRAME=0                 # audio-to-mouth alignment offset (frames)
MUSETALK_BLEND_MODE=jaw                # blend mode: auto / jaw / raw
MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH=90  # cheek width (v1.5 only)
MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH=90
# Encoding quality parameters
MUSETALK_ENCODE_CRF=14                 # lower CRF = sharper (14 ≈ near visually lossless)
MUSETALK_ENCODE_PRESET=slow            # x264 preset (slow = better compression efficiency)
# Hybrid lip-sync routing
LIPSYNC_DURATION_THRESHOLD=120         # seconds; >= this value uses MuseTalk
```
> **Parameter tier reference**
> - Speed-first: `DETECT_EVERY=5, BLEND_CACHE_EVERY=5, ENCODE_CRF=18, ENCODE_PRESET=medium`
> - Quality-first (current): `DETECT_EVERY=2, BLEND_CACHE_EVERY=2, ENCODE_CRF=14, ENCODE_PRESET=slow`
---
## Related Files
@@ -207,22 +226,36 @@ LIPSYNC_DURATION_THRESHOLD=120 # seconds; >= this value uses MuseTalk
| Optimization | Notes |
|--------------|-------|
| `MUSETALK_BATCH_SIZE` 8→32 | RTX 3090 has ample VRAM; ~3x faster UNet inference |
| cv2.VideoCapture direct frame reads | Skips the ffmpeg→PNG→imread chain |
| Face detection downsampled (every 5 frames) | DWPose + FaceAlignment run only on sampled frames; bbox linearly interpolated between |
| BiSeNet mask cache (every 5 frames) | `get_image_prepare_material` runs every 5 frames; `get_image_blending` reuses it in between |
| cv2.VideoWriter direct writes | Skips per-frame PNG writes + ffmpeg re-encode |
| Face detection downsampled (every N frames) | DWPose + FaceAlignment run only on sampled frames; bbox linearly interpolated between |
| BiSeNet mask cache (every N frames) | `get_image_prepare_material` runs every N frames and is reused in between |
| FFmpeg rawvideo pipe direct encoding | Replaces the lossy `cv2.VideoWriter(mp4v)` intermediate file with direct stdin-pipe writes, removing one redundant lossy encode |
| Parameters via environment variables | All inference/encoding parameters read from `.env`; quick switching between speed-first and quality-first |
| Per-stage timing | Precise timing across 7 stages for later tuning |
### Encoding Chain
```
UNet inference frames (raw BGR24)
→ FFmpeg rawvideo stdin pipe
→ single libx264 encode (CRF 14, preset slow) + audio mux
→ final .mp4 output
```
Compared with the old flow: the lossy `cv2.VideoWriter(mp4v)` intermediate file is gone, and the encode count drops from 2 to 1.
### Tuning Parameters
Previously tunable at the top of `models/MuseTalk/scripts/server.py`:
```python
DETECT_EVERY = 5       # face-detection downsampling interval (frames)
BLEND_CACHE_EVERY = 5  # BiSeNet mask cache interval (frames)
```
Now all parameters are configured via `backend/.env` (restart the MuseTalk service after changes):
```ini
MUSETALK_DETECT_EVERY=2       # face-detection downsampling interval (frames); 2 for quality-first, 5 for speed-first
MUSETALK_BLEND_CACHE_EVERY=2  # BiSeNet mask cache interval (frames)
MUSETALK_ENCODE_CRF=14        # encoding quality (14 ≈ near visually lossless, 18 = high quality)
MUSETALK_ENCODE_PRESET=slow   # encoding speed (slow = better compression; medium = balanced)
```
> For talking-head footage (the face barely moves), the interpolation error at detect_every=5 is negligible.
> For scenes with heavy face motion, or for best quality, use detect_every=2.
---


@@ -241,6 +241,15 @@ const bundleLocation = await bundle({
const videoUrl = staticFile(videoSrc); // use staticFile
```
**Problem**: Remotion render fails with 404, video file not found (bundle cache issue)
Remotion uses a bundle cache to speed up packaging. On a cache hit, newly generated video/font files must be hard-linked into the cache's `public/` directory. If 404 errors appear, clear the cache and retry:
```bash
rm -rf /home/rongye/ProgramFiles/ViGent2/remotion/.remotion-bundle-cache
pm2 restart vigent2-backend
```
**Problem**: Remotion render fails
Check the backend log:
@@ -296,3 +305,4 @@ WhisperService(device="cuda:0") # or "cuda:1"
| 2026-02-27 | 1.3.0 | Architecture diagram updated for MuseTalk hybrid routing; Remotion concurrent rendering raised from 8 to 16; GPU allocation notes updated |
| 2026-02-28 | 1.3.1 | MuseTalk compositing optimization: pure-numpy blending + FFmpeg pipe NVENC GPU hardware encoding replacing the double encode |
| 2026-02-28 | 1.4.0 | compose stream copy instead of re-encode; FFmpeg timeout protection (600s/30s); Remotion concurrency 16→4; Whisper timestamp smoothing + source-text rhythm mapping; global video-generation Semaphore(2); Redis task TTL |
| 2026-03-02 | 1.5.0 | Remotion bundle cache fix (hard-link videos/fonts into the cached public directory); encoding pipeline optimization: prepare_segment/normalize CRF 23→18; multi-material concat switched to stream copy |


@@ -1,8 +1,8 @@
# ViGent2 Development Task Log
**Project**: ViGent2 digital-human talking-head video generation system
**Progress**: 100% (Day 28 - CosyVoice FP16 acceleration + full documentation update)
**Updated**: 2026-02-27
**Progress**: 100% (Day 30 - Remotion cache fix + encoding pipeline quality optimization)
**Updated**: 2026-03-02
---
@@ -10,7 +10,37 @@
> Daily core development work and milestones are recorded here.
### Day 28: CosyVoice FP16 Acceleration + Full Documentation Update (Current)
### Day 30: Remotion Cache Fix + Encoding Pipeline Quality Optimization + Lip-Sync Fault Tolerance (Current)
- [x] **Remotion cache 404 fix**: on a bundle-cache hit, newly generated video/font files were missing from the old cache's `public/` directory → 404 → FFmpeg fallback with no titles/subtitles. Now hard-links (`fs.linkSync`) the files the current render needs into the cache directory.
- [x] **LatentSync `read_video` skips the redundant FPS re-encode**: detects input FPS; skips the `ffmpeg -r 25 -crf 18` re-encode when already 25fps.
- [x] **LatentSync final mux stream copy**: the mux step after `imageio` writes frames at CRF 13 changed from `libx264 -crf 18` to `-c:v copy`, removing a redundant double encode.
- [x] **`prepare_segment` + `normalize_orientation` quality raise**: CRF 23 → 18, matching LatentSync's internal quality standard.
- [x] **Multi-material concat stream copy**: segment parameters are already unified; `concat_videos` changed from `libx264 -crf 23` to `-c:v copy`.
- [x] **Total encode count**: down from 5-6 lossy encodes to 3 (prepare_segment → LatentSync/MuseTalk model output → Remotion).
- [x] **LatentSync faceless-frame tolerance**: inference no longer aborts when some material frames have no detectable face; faceless frames keep the original picture; single-material failures fall back to the original video.
- [x] **MuseTalk pipe direct encoding**: the lossy `cv2.VideoWriter(mp4v)` intermediate file replaced with an FFmpeg rawvideo stdin pipe, removing one redundant lossy encode.
- [x] **MuseTalk parameters via environment variables**: inference and encoding parameters (detect_every/blend_cache/CRF/preset etc.) moved from hard-coded values to `backend/.env`; currently on the quality-first tier (CRF 14, preset slow, detect_every 2, blend_cache_every 2).
- [x] **Workflow async de-blocking**: new `_run_blocking()` thread-pool helper; 5 synchronous FFmpeg call sites (orientation normalization / prepare_segment / concat / BGM mixing) switched to `await _run_blocking()`; the event loop is no longer blocked.
- [x] **compose skip optimization**: with no BGM, `final_audio_path == audio_path`, so the redundant compose step is skipped (Remotion path uses the lipsync output directly; non-Remotion path passes it through with `shutil.copy`).
- [x] **compose() made async**: `compose()` becomes `async def`; internal `_get_duration` and `_run_ffmpeg` go through `run_in_executor`.
- [x] **Same-resolution scale skip**: multi-material segments compare resolutions per segment and pass `None` on a match to take the copy branch; single-material likewise. Avoids pointless re-encodes when already at the target resolution.
- [x] **`_get_duration()` thread-pooling**: 3 synchronous ffprobe probes in workflow switched to `await _run_blocking()`.
- [x] **compose loop CRF unified**: looping path CRF 23 → 18, matching the pipeline-wide quality standard.
- [x] **Multi-material segment validation**: after prepare completes, segment counts are checked for consistency, preventing empty segments from reaching concat.
- [x] **Frontend lip-sync model selection**: model dropdown right of the generate button (Default/Fast/Advanced), with `lipsync_model` passed through the full chain to backend routing. Default keeps the threshold policy; Fast forces MuseTalk; Advanced forces LatentSync; all three modes fall back to LatentSync. Selection persisted in localStorage.
### Day 29: Video Pipeline Optimization + CosyVoice Tone Control
- [x] **Subtitle sync fix**: three-step Whisper timestamp smoothing (monotonic increase + overlap removal + gap filling) + source-text rhythm mapping (linear interpolation + per-character duration clamping).
- [x] **LatentSync lip parameter tuning**: inference_steps 16→20, guidance_scale 2.0, DeepCache enabled, Remotion concurrency 16→4.
- [x] **compose stream copy**: when not looping, `-c:v copy` replaces the libx264 re-encode; compose time drops from minutes to seconds.
- [x] **FFmpeg timeout protection**: `_run_ffmpeg()` timeout=600, `_get_duration()` timeout=30.
- [x] **Global concurrency limit**: `asyncio.Semaphore(2)` caps simultaneously running generation tasks.
- [x] **Redis task TTL**: create 24h, completed/failed 2h, list auto-cleans expired indexes.
- [x] **Temp font cleanup**: font files added to the temp_files cleanup list.
- [x] **Preview background CORS fix**: material same-origin proxy `/api/materials/stream/{id}` fully bypasses cross-origin issues.
- [x] **CosyVoice tone control**: voice-clone mode gains a tone dropdown (Normal/Cheerful/Somber/Serious), driven by `inference_instruct2()` natural-language instructions, with instruct_text passed through the full chain; the "Normal" default behaves unchanged.
### Day 28: CosyVoice FP16 Acceleration + Full Documentation Update
- [x] **CosyVoice FP16 half-precision acceleration**: `AutoModel()` enables `fp16=True`; LLM inference and Flow Matching run in automatic mixed precision; estimated 30-40% speedup and ~30% lower VRAM.
- [x] **Full documentation update**: README.md / DEPLOY_MANUAL.md / SUBTITLE_DEPLOY.md / BACKEND_README.md expanded with MuseTalk hybrid lip sync, performance optimization, and Remotion concurrent rendering.
@@ -258,7 +288,7 @@
| **Core API** | 100% | ✅ Stable |
| **Web UI** | 100% | ✅ Stable (mobile-adapted) |
| **Lip sync** | 100% | ✅ LatentSync 1.6 |
| **TTS dubbing** | 100% | ✅ EdgeTTS + CosyVoice 3.0 + audio-first workflow + timeline arrangement + auto transcription + speed control |
| **TTS dubbing** | 100% | ✅ EdgeTTS + CosyVoice 3.0 + audio-first workflow + timeline arrangement + auto transcription + speed control + tone control |
| **Auto publishing** | 100% | ✅ Douyin / WeChat Channels / Bilibili / Xiaohongshu |
| **User auth** | 100% | ✅ Phone number + JWT |
| **Paid membership** | 100% | ✅ Alipay desktop web payment + auto activation |


@@ -16,8 +16,8 @@
## ✨ Features
### Core Capabilities
- 🎬 **HD lip sync** - Hybrid scheme: short videos (<120s) use LatentSync 1.6 (high-quality latent diffusion), long videos (>=120s) use MuseTalk 1.5 (realtime-class single-step inference), with automatic routing + fallback.
- 🎙️ **Multi-modal dubbing** - Supports **EdgeTTS** (Microsoft ultra-natural voices, 10 languages) and **CosyVoice 3.0** (3-second instant voice cloning, 9 languages + 18 dialects, adjustable speed). Uploaded reference audio is auto-transcribed by Whisper and smart-trimmed. Audio-first workflow: generate dubbing → pick materials → generate video.
- 🎬 **HD lip sync** - Hybrid scheme: short videos (<120s) use LatentSync 1.6 (high-quality latent diffusion), long videos (>=120s) use MuseTalk 1.5 (realtime-class single-step inference), with automatic routing + fallback. Frontend model selection: Default (threshold auto-routing) / Fast (force MuseTalk) / Advanced (force LatentSync).
- 🎙️ **Multi-modal dubbing** - Supports **EdgeTTS** (Microsoft ultra-natural voices, 10 languages) and **CosyVoice 3.0** (3-second instant voice cloning, 9 languages + 18 dialects, adjustable speed/tone). Uploaded reference audio is auto-transcribed by Whisper and smart-trimmed. Audio-first workflow: generate dubbing → pick materials → generate video.
- 📝 **Smart subtitles** - Integrates faster-whisper + Remotion for auto-generated per-character highlighted (karaoke-style) subtitles.
- 🎨 **Style presets** - 12 title + 8 subtitle style presets, with preview + font-size adjustment + custom font library. CSS-native stroke rendering, crisp with no ghosting.
- 🏷️ **Title display modes** - Opening titles support `brief` / `persistent` display (default brief, 4 seconds); user preference auto-persisted.
@@ -37,7 +37,7 @@
- 💳 **Paid membership** - Alipay desktop web payment auto-activates membership; auto-deactivation on expiry with renewal prompts; manual admin activation also supported.
- 🔐 **Auth & isolation** - Supabase-based user isolation, with phone-number signup/login and password management.
- 🛡️ **Service watchdog** - Built-in watchdog automatically monitors and restarts hung services for 7x24h stability.
- 🚀 **Performance** - compose stream copy avoids re-encoding, FFmpeg timeout protection, global video-generation concurrency limit (Semaphore(2)), Remotion 4-way concurrent rendering, MuseTalk NVENC GPU hardware encoding + pure-numpy blending, resident model services, dual-GPU pipeline concurrency, Redis task TTL auto-cleanup.
- 🚀 **Performance** - Encoding pipeline trimmed from 5-6 lossy encodes to 3 (prepare_segment → model output → Remotion), compose stream copy avoids re-encoding, scale skipped at matching resolutions, FFmpeg timeout protection, global video-generation concurrency limit (Semaphore(2)), Remotion 4-way concurrent rendering, MuseTalk rawvideo pipe direct encoding (no lossy intermediate file), resident model services, dual-GPU pipeline concurrency, Redis task TTL auto-cleanup, workflow blocking calls thread-pooled.
---


@@ -25,10 +25,10 @@ LATENTSYNC_USE_SERVER=true
# LATENTSYNC_API_URL=http://localhost:8007
# Inference steps (20-50; higher = better quality, slower)
LATENTSYNC_INFERENCE_STEPS=20
LATENTSYNC_INFERENCE_STEPS=30
# Guidance scale (1.0-3.0; higher = tighter lip sync but may jitter)
LATENTSYNC_GUIDANCE_SCALE=2.0
LATENTSYNC_GUIDANCE_SCALE=1.9
# Enable DeepCache acceleration (recommended)
LATENTSYNC_ENABLE_DEEPCACHE=true
@@ -52,9 +52,36 @@ MUSETALK_VERSION=v15
# Half-precision acceleration
MUSETALK_USE_FLOAT16=true
# Face-detection downsampling interval (frames; smaller = steadier quality but slower)
MUSETALK_DETECT_EVERY=2
# BiSeNet mask cache update interval (frames; smaller = steadier quality but slower)
MUSETALK_BLEND_CACHE_EVERY=2
# Whisper temporal context (larger = smoother but duller mouth response)
MUSETALK_AUDIO_PADDING_LEFT=2
MUSETALK_AUDIO_PADDING_RIGHT=2
# v1.5 chin-region expansion in pixels (larger shows more lower lip/teeth but risks unstable edges)
MUSETALK_EXTRA_MARGIN=14
# Audio-to-mouth alignment offset (frames; positive = mouth later, negative = mouth earlier)
MUSETALK_DELAY_FRAME=0
# Blend mode: auto (per version) / jaw / raw
MUSETALK_BLEND_MODE=jaw
# FaceParsing cheek width (v1.5 only; affects the blend-mask extent)
MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH=90
MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH=90
# Final encode quality (lower CRF = sharper but larger files)
MUSETALK_ENCODE_CRF=14
MUSETALK_ENCODE_PRESET=slow
# =============== Hybrid lip-sync routing ===============
# Audio duration >= this threshold (seconds) uses MuseTalk; below it uses LatentSync
LIPSYNC_DURATION_THRESHOLD=120
LIPSYNC_DURATION_THRESHOLD=100
# =============== Upload configuration ===============
# Max upload file size (MB)


@@ -10,6 +10,7 @@ class GenerateAudioRequest(BaseModel):
ref_text: Optional[str] = None
language: str = "zh-CN"
speed: float = 1.0
instruct_text: Optional[str] = None
class RenameAudioRequest(BaseModel):


@@ -81,6 +81,7 @@ async def generate_audio_task(task_id: str, req: GenerateAudioRequest, user_id:
output_path=audio_path,
language=_locale_to_tts_lang(req.language),
speed=req.speed,
instruct_text=req.instruct_text or "",
)
finally:
if os.path.exists(ref_local):


@@ -38,3 +38,4 @@ class GenerateRequest(BaseModel):
bgm_volume: Optional[float] = 0.2
custom_assignments: Optional[List[CustomAssignment]] = None
output_aspect_ratio: Literal["9:16", "16:9"] = "9:16"
lipsync_model: Literal["default", "fast", "advanced"] = "default"


@@ -94,6 +94,12 @@ def _update_task(task_id: str, **updates: Any) -> None:
task_store.update(task_id, updates)
async def _run_blocking(func, *args):
"""在线程池执行阻塞函数,避免卡住事件循环。"""
loop = asyncio.get_running_loop()
return await loop.run_in_executor(None, func, *args)
# ── Multi-material helpers ──
@@ -214,7 +220,8 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
# Normalize rotation metadata (e.g. iPhone MOV 1920x1080 + rotation=-90)
normalized_input_path = temp_dir / f"{task_id}_input_norm.mp4"
normalized_result = video.normalize_orientation(
normalized_result = await _run_blocking(
video.normalize_orientation,
str(input_material_path),
str(normalized_input_path),
)
@@ -317,7 +324,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
result = _split_equal(captions_data["segments"], material_paths)
else:
logger.warning("[MultiMat] Whisper 无数据,按时长均分")
audio_dur = video._get_duration(str(audio_path))
audio_dur = await _run_blocking(video._get_duration, str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0
seg_dur = audio_dur / len(material_paths)
@@ -378,7 +385,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
assignments, captions_path = await _whisper_and_split()
# Extend segments to cover the full audio range: first starts at 0, last ends at the audio end
audio_duration = video._get_duration(str(audio_path))
audio_duration = await _run_blocking(video._get_duration, str(audio_path))
if assignments and audio_duration > 0:
assignments[0]["start"] = 0.0
assignments[-1]["end"] = audio_duration
@@ -402,9 +409,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
await _download_material(assignment["material_path"], material_local)
normalized_material = temp_dir / f"{task_id}_material_{i}_norm.mp4"
loop = asyncio.get_event_loop()
normalized_result = await loop.run_in_executor(
None,
normalized_result = await _run_blocking(
video.normalize_orientation,
str(material_local),
str(normalized_material),
@@ -432,22 +437,21 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
logger.info(f"[MultiMat] 素材分辨率不一致,统一到 {base_res[0]}x{base_res[1]}")
# ── Step 2: trim each material segment to its duration, in parallel ──
prepared_segments: List[Path] = [None] * num_segments
prepared_segments: List[Optional[Path]] = [None] * num_segments
async def _prepare_one_segment(i: int, assignment: dict):
"""将单个素材裁剪/循环到对应时长"""
seg_dur = assignment["end"] - assignment["start"]
prepared_path = temp_dir / f"{task_id}_prepared_{i}.mp4"
temp_files.append(prepared_path)
prepare_target_res = None if resolutions[i] == base_res else base_res
loop = asyncio.get_event_loop()
await loop.run_in_executor(
None,
await _run_blocking(
video.prepare_segment,
str(material_locals[i]),
seg_dur,
str(prepared_path),
base_res,
prepare_target_res,
assignment.get("source_start", 0.0),
assignment.get("source_end"),
25,
@@ -472,10 +476,14 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
_update_task(task_id, progress=50, message="正在拼接素材片段...")
concat_path = temp_dir / f"{task_id}_concat.mp4"
temp_files.append(concat_path)
video.concat_videos(
[str(p) for p in prepared_segments],
prepared_segment_paths = [str(p) for p in prepared_segments if p is not None]
if len(prepared_segment_paths) != num_segments:
raise RuntimeError("Multi-material: prepared segments mismatch")
await _run_blocking(
video.concat_videos,
prepared_segment_paths,
str(concat_path),
target_fps=25,
25,
)
# ── 第三步:一次 LatentSync 推理 ──
@@ -485,7 +493,12 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
_update_task(task_id, progress=55, message="正在合成唇形 (LatentSync)...")
print(f"[LipSync] Multi-material: single LatentSync on concatenated video")
try:
await lipsync.generate(str(concat_path), str(audio_path), str(lipsync_video_path))
await lipsync.generate(
str(concat_path),
str(audio_path),
str(lipsync_video_path),
model_mode=req.lipsync_model,
)
except Exception as e:
logger.warning(f"[LipSync] Failed, fallback to concat without lipsync: {e}")
import shutil
@@ -519,18 +532,22 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
single_source_end = req.custom_assignments[0].source_end
_update_task(task_id, progress=20, message="正在准备素材片段...")
audio_dur = video._get_duration(str(audio_path))
audio_dur = await _run_blocking(video._get_duration, str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0
single_res = await _run_blocking(video.get_resolution, str(input_material_path))
single_target_res = None if single_res == target_resolution else target_resolution
prepared_single_path = temp_dir / f"{task_id}_prepared_single.mp4"
temp_files.append(prepared_single_path)
video.prepare_segment(
await _run_blocking(
video.prepare_segment,
str(input_material_path),
audio_dur,
str(prepared_single_path),
target_resolution=target_resolution,
source_start=single_source_start,
source_end=single_source_end,
single_target_res,
single_source_start,
single_source_end,
None,
)
input_material_path = prepared_single_path
@@ -543,7 +560,18 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
if is_ready:
print(f"[LipSync] Starting LatentSync inference...")
_update_task(task_id, progress=35, message="正在运行 LatentSync 推理...")
await lipsync.generate(str(input_material_path), str(audio_path), str(lipsync_video_path))
try:
await lipsync.generate(
str(input_material_path),
str(audio_path),
str(lipsync_video_path),
model_mode=req.lipsync_model,
)
except Exception as e:
logger.warning(f"[LipSync] Failed on single-material, fallback to prepared video: {e}")
_update_task(task_id, message="唇形同步失败,使用原始视频...")
import shutil
shutil.copy(str(input_material_path), str(lipsync_video_path))
else:
print(f"[LipSync] LatentSync not ready, copying original video")
_update_task(task_id, message="唇形同步不可用,使用原始视频...")
@@ -564,6 +592,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
final_audio_path = audio_path
_whisper_task = None
_bgm_task = None
mix_output_path: Optional[Path] = None
# 单素材模式下 Whisper 尚未执行,这里与 BGM 并行启动
need_whisper = not is_multi and req.enable_subtitles and captions_path is None
@@ -604,10 +633,8 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
async def _run_bgm():
_update_task(task_id, message="正在合成背景音乐...", progress=86)
loop = asyncio.get_event_loop()
try:
await loop.run_in_executor(
None,
await _run_blocking(
video.mix_audio,
_voice_path,
_bgm_path,
@@ -633,7 +660,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
captions_path = None
result_idx += 1
if _bgm_task is not None:
if results[result_idx]:
if results[result_idx] and mix_output_path is not None:
final_audio_path = mix_output_path
@@ -705,14 +732,19 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
final_output_local_path = temp_dir / f"{task_id}_output.mp4"
temp_files.append(final_output_local_path)
needs_audio_compose = str(final_audio_path) != str(audio_path)
if use_remotion:
_update_task(task_id, message="正在合成视频 (Remotion)...", progress=87)
remotion_input_path = lipsync_video_path
composed_video_path = temp_dir / f"{task_id}_composed.mp4"
temp_files.append(composed_video_path)
await video.compose(str(lipsync_video_path), str(final_audio_path), str(composed_video_path))
if needs_audio_compose:
composed_video_path = temp_dir / f"{task_id}_composed.mp4"
temp_files.append(composed_video_path)
await video.compose(str(lipsync_video_path), str(final_audio_path), str(composed_video_path))
remotion_input_path = composed_video_path
else:
logger.info("[Pipeline] Audio unchanged, skip pre-Remotion compose")
remotion_health = await remotion_service.check_health()
if remotion_health.get("ready"):
@@ -729,7 +761,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
title_duration = max(0.5, min(float(req.title_duration or 4.0), 30.0))
await remotion_service.render(
video_path=str(composed_video_path),
video_path=str(remotion_input_path),
output_path=str(final_output_local_path),
captions_path=str(captions_path) if captions_path else None,
title=req.title,
@@ -747,15 +779,18 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
except Exception as e:
logger.warning(f"Remotion render failed, using FFmpeg fallback: {e}")
import shutil
shutil.copy(str(composed_video_path), final_output_local_path)
shutil.copy(str(remotion_input_path), str(final_output_local_path))
else:
logger.warning(f"Remotion not ready: {remotion_health.get('error')}, using FFmpeg")
import shutil
shutil.copy(str(composed_video_path), final_output_local_path)
shutil.copy(str(remotion_input_path), str(final_output_local_path))
else:
_update_task(task_id, message="正在合成最终视频...", progress=90)
await video.compose(str(lipsync_video_path), str(final_audio_path), str(final_output_local_path))
if needs_audio_compose:
await video.compose(str(lipsync_video_path), str(final_audio_path), str(final_output_local_path))
else:
import shutil
shutil.copy(str(lipsync_video_path), str(final_output_local_path))
total_time = time.time() - start_time

View File

@@ -11,12 +11,12 @@ import asyncio
import httpx
from pathlib import Path
from loguru import logger
from typing import Optional
from typing import Optional, Literal
from app.core.config import settings
class LipSyncService:
class LipSyncService:
"""唇形同步服务 - LatentSync 1.6 + MuseTalk 1.5 混合方案"""
def __init__(self):
@@ -121,47 +121,43 @@ class LipSyncService:
logger.warning(f"⚠️ 视频循环异常: {e}")
return video_path
async def generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int = 25
) -> str:
"""生成唇形同步视频"""
logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
if self.use_local:
return await self._local_generate(video_path, audio_path, output_path, fps)
else:
return await self._remote_generate(video_path, audio_path, output_path, fps)
async def generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int = 25,
model_mode: Literal["default", "fast", "advanced"] = "default",
) -> str:
"""生成唇形同步视频"""
logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
normalized_mode: Literal["default", "fast", "advanced"] = model_mode
if normalized_mode not in ("default", "fast", "advanced"):
normalized_mode = "default"
logger.info(f"🧠 Lipsync 模式: {normalized_mode}")
if self.use_local:
return await self._local_generate(video_path, audio_path, output_path, fps, normalized_mode)
else:
return await self._remote_generate(video_path, audio_path, output_path, fps, normalized_mode)
async def _local_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int
) -> str:
"""使用 subprocess 调用 LatentSync conda 环境"""
# 检查前置条件
if not self._check_conda_env():
logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path
if not self._check_weights():
logger.warning("⚠️ 模型权重不存在,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path
logger.info("⏳ 等待 GPU 资源 (排队中)...")
async with self._lock:
# 使用临时目录存放中间文件
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
async def _local_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int,
model_mode: Literal["default", "fast", "advanced"],
) -> str:
"""使用 subprocess 调用 LatentSync conda 环境"""
logger.info("⏳ 等待 GPU 资源 (排队中)...")
async with self._lock:
# 使用临时目录存放中间文件
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# 获取音频和视频时长
audio_duration = self._get_media_duration(audio_path)
@@ -176,24 +172,53 @@ class LipSyncService:
str(looped_video),
audio_duration
)
else:
actual_video_path = video_path
# 混合路由: 长视频走 MuseTalk短视频走 LatentSync
if audio_duration and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD:
logger.info(
f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s路由到 MuseTalk"
)
musetalk_result = await self._call_musetalk_server(
actual_video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync长视频会较慢")
if self.use_server:
# 模式 A: 调用常驻服务 (加速模式)
return await self._call_persistent_server(actual_video_path, audio_path, output_path)
else:
actual_video_path = video_path
# 模型路由
force_musetalk = model_mode == "fast"
force_latentsync = model_mode == "advanced"
auto_to_musetalk = (
model_mode == "default"
and audio_duration is not None
and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD
)
if force_musetalk:
logger.info("⚡ 强制快速模型MuseTalk")
musetalk_result = await self._call_musetalk_server(
actual_video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,快速模型回退到 LatentSync")
elif auto_to_musetalk:
logger.info(
f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s路由到 MuseTalk"
)
musetalk_result = await self._call_musetalk_server(
actual_video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync长视频会较慢")
elif force_latentsync:
logger.info("🎯 强制高级模型LatentSync")
# 检查 LatentSync 前置条件(仅在需要回退或使用 LatentSync 时)
if not self._check_conda_env():
logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path
if not self._check_weights():
logger.warning("⚠️ 模型权重不存在,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path
if self.use_server:
# 模式 A: 调用常驻服务 (加速模式)
return await self._call_persistent_server(actual_video_path, audio_path, output_path)
logger.info("🔄 调用 LatentSync 推理 (subprocess)...")
@@ -388,15 +413,18 @@ class LipSyncService:
"请确保 LatentSync 服务已启动 (cd models/LatentSync && python scripts/server.py)"
)
async def _remote_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int
) -> str:
"""调用远程 LatentSync API 服务"""
logger.info(f"📡 调用远程 API: {self.api_url}")
async def _remote_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int,
model_mode: Literal["default", "fast", "advanced"],
) -> str:
"""调用远程 LatentSync API 服务"""
if model_mode == "fast":
logger.warning("⚠️ 远程模式未接入 MuseTalk快速模型将使用远程 LatentSync")
logger.info(f"📡 调用远程 API: {self.api_url}")
try:
async with httpx.AsyncClient(timeout=600.0) as client:
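The model routing introduced in `_local_generate` reduces to a small decision table: `fast` forces MuseTalk, `advanced` forces LatentSync, and `default` auto-routes long audio to MuseTalk. A pure-function sketch of that branch (`route_model` is illustrative, not a function in the codebase; MuseTalk-unavailable fallback is handled separately at the call site):

```python
from typing import Literal, Optional

Mode = Literal["default", "fast", "advanced"]


def route_model(mode: Mode, audio_duration: Optional[float], threshold: float) -> str:
    """Pick the lipsync backend from the user-selected mode and audio length."""
    if mode == "fast":
        return "musetalk"       # 快速模型:强制 MuseTalk
    if mode == "advanced":
        return "latentsync"     # 高级模型:强制 LatentSync
    # default: 长音频自动路由到 MuseTalk,否则 LatentSync
    if audio_duration is not None and audio_duration >= threshold:
        return "musetalk"
    return "latentsync"
```

Keeping the decision in one place like this would also make the routing unit-testable independently of subprocess and GPU setup.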

View File

@@ -1,6 +1,7 @@
"""
视频合成服务
"""
import asyncio
import os
import subprocess
import json
@@ -96,7 +97,7 @@ class VideoService:
"-map", "0:a?",
"-c:v", "libx264",
"-preset", "fast",
"-crf", "23",
"-crf", "18",
"-c:a", "copy",
"-movflags", "+faststart",
output_path,
@@ -199,9 +200,10 @@ class VideoService:
"""合成视频"""
# Ensure output dir
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
video_duration = self._get_duration(video_path)
audio_duration = self._get_duration(audio_path)
loop = asyncio.get_running_loop()
video_duration = await loop.run_in_executor(None, self._get_duration, video_path)
audio_duration = await loop.run_in_executor(None, self._get_duration, audio_path)
# Audio loop if needed
loop_count = 1
@@ -228,7 +230,7 @@ class VideoService:
# 不需要循环时用流复制(几乎瞬间完成),需要循环时才重编码
if loop_count > 1:
cmd.extend([
"-c:v", "libx264", "-preset", "fast", "-crf", "23",
"-c:v", "libx264", "-preset", "fast", "-crf", "18",
])
else:
cmd.extend(["-c:v", "copy"])
@@ -242,7 +244,8 @@ class VideoService:
cmd.append(output_path)
if self._run_ffmpeg(cmd):
ok = await loop.run_in_executor(None, self._run_ffmpeg, cmd)
if ok:
return output_path
else:
raise RuntimeError("FFmpeg composition failed")
@@ -267,12 +270,7 @@ class VideoService:
"-fflags", "+genpts",
"-i", str(list_path),
"-an",
"-vsync", "cfr",
"-r", str(target_fps),
"-c:v", "libx264",
"-preset", "fast",
"-crf", "23",
"-pix_fmt", "yuv420p",
"-c:v", "copy",
"-movflags", "+faststart",
output_path,
]
@@ -346,6 +344,7 @@ class VideoService:
needs_loop = target_duration > available
needs_scale = target_resolution is not None
needs_fps = bool(target_fps and target_fps > 0)
target_fps_value = int(target_fps) if needs_fps and target_fps is not None else None
has_source_end = clip_end < video_dur
# 当需要循环且存在截取范围时,先裁剪出片段,再循环裁剪后的文件
@@ -360,7 +359,7 @@ class VideoService:
"-i", video_path,
"-t", str(available),
"-an",
"-c:v", "libx264", "-preset", "fast", "-crf", "23",
"-c:v", "libx264", "-preset", "fast", "-crf", "18",
trim_temp,
]
if not self._run_ffmpeg(trim_cmd):
@@ -380,20 +379,20 @@ class VideoService:
cmd.extend(["-i", actual_input, "-t", str(target_duration), "-an"])
filters = []
if needs_fps:
filters.append(f"fps={int(target_fps)}")
if target_fps_value is not None:
filters.append(f"fps={target_fps_value}")
if needs_scale:
w, h = target_resolution
filters.append(f"scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:(ow-iw)/2:(oh-ih)/2")
if filters:
cmd.extend(["-vf", ",".join(filters)])
if needs_fps:
cmd.extend(["-vsync", "cfr", "-r", str(int(target_fps))])
if target_fps_value is not None:
cmd.extend(["-vsync", "cfr", "-r", str(target_fps_value)])
# 需要循环、缩放或指定起点时必须重编码,否则用 stream copy 保持原画质
if needs_loop or needs_scale or source_start > 0 or has_source_end or needs_fps:
cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "18"])
else:
cmd.extend(["-c:v", "copy"])

View File

@@ -32,6 +32,7 @@ class VoiceCloneService:
ref_text: str,
language: str,
speed: float = 1.0,
instruct_text: str = "",
max_retries: int = 4,
) -> bytes:
timeout = httpx.Timeout(240.0)
@@ -39,15 +40,18 @@ class VoiceCloneService:
for attempt in range(max_retries):
try:
async with httpx.AsyncClient(timeout=timeout) as client:
data = {
"text": text,
"ref_text": ref_text,
"language": language,
"speed": str(speed),
}
if instruct_text:
data["instruct_text"] = instruct_text
response = await client.post(
f"{self.base_url}/generate",
files={"ref_audio": ("ref.wav", ref_audio_data, "audio/wav")},
data={
"text": text,
"ref_text": ref_text,
"language": language,
"speed": str(speed),
},
data=data,
)
retryable = False
@@ -99,6 +103,7 @@ class VoiceCloneService:
output_path: str,
language: str = "Chinese",
speed: float = 1.0,
instruct_text: str = "",
) -> str:
"""
使用声音克隆生成语音
@@ -132,6 +137,7 @@ class VoiceCloneService:
ref_text=ref_text,
language=language,
speed=speed,
instruct_text=instruct_text,
)
with open(output_path, "wb") as f:
f.write(audio_bytes)
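Note that `instruct_text` is only added to the multipart form when non-empty, so an older CosyVoice server that does not know the field keeps working unchanged. The payload assembly, isolated as a sketch (`build_form` is illustrative, not an actual helper in the service):

```python
from typing import Dict


def build_form(text: str, ref_text: str, language: str,
               speed: float, instruct_text: str = "") -> Dict[str, str]:
    """Assemble the form fields for POST /generate; omit instruct_text when empty."""
    data = {
        "text": text,
        "ref_text": ref_text,
        "language": language,
        "speed": str(speed),
    }
    if instruct_text:
        data["instruct_text"] = instruct_text
    return data
```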

View File

@@ -127,6 +127,7 @@ export const useGeneratedAudios = ({
ref_text?: string;
language: string;
speed?: number;
instruct_text?: string;
}) => {
setIsGeneratingAudio(true);
setAudioTask({ status: "pending", progress: 0, message: "正在提交..." });

View File

@@ -124,6 +124,8 @@ interface RefAudio {
created_at: number;
}
type LipsyncModelMode = "default" | "fast" | "advanced";
import type { Material } from "@/shared/types/material";
export const useHomeController = () => {
@@ -155,6 +157,7 @@ export const useHomeController = () => {
const [titleDisplayMode, setTitleDisplayMode] = useState<"short" | "persistent">("short");
const [subtitleBottomMargin, setSubtitleBottomMargin] = useState<number>(80);
const [outputAspectRatio, setOutputAspectRatio] = useState<"9:16" | "16:9">("9:16");
const [lipsyncModelMode, setLipsyncModelMode] = useState<LipsyncModelMode>("default");
const [showStylePreview, setShowStylePreview] = useState<boolean>(false);
const [materialDimensions, setMaterialDimensions] = useState<{ width: number; height: number } | null>(null);
@@ -182,6 +185,9 @@ export const useHomeController = () => {
// 语速控制
const [speed, setSpeed] = useState<number>(1.0);
// 语气控制(仅声音克隆模式)
const [emotion, setEmotion] = useState<string>("normal");
// ClipTrimmer 模态框状态
const [clipTrimmerOpen, setClipTrimmerOpen] = useState(false);
const [clipTrimmerSegmentId, setClipTrimmerSegmentId] = useState<string | null>(null);
@@ -489,6 +495,8 @@ export const useHomeController = () => {
setSubtitleBottomMargin,
outputAspectRatio,
setOutputAspectRatio,
lipsyncModelMode,
setLipsyncModelMode,
selectedBgmId,
setSelectedBgmId,
bgmVolume,
@@ -502,6 +510,8 @@ export const useHomeController = () => {
setSelectedAudioId,
speed,
setSpeed,
emotion,
setEmotion,
});
const { savedScripts, saveScript, deleteScript: deleteSavedScript } = useSavedScripts(storageKey);
@@ -876,6 +886,13 @@ export const useHomeController = () => {
return;
}
const emotionToInstruct: Record<string, string> = {
normal: "",
happy: "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
sad: "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
angry: "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
};
const params = {
text: text.trim(),
tts_mode: ttsMode,
@@ -884,6 +901,7 @@ export const useHomeController = () => {
ref_text: ttsMode === "voiceclone" ? refText : undefined,
language: textLang,
speed: ttsMode === "voiceclone" ? speed : undefined,
instruct_text: ttsMode === "voiceclone" ? emotionToInstruct[emotion] || "" : undefined,
};
await generateAudio(params);
};
@@ -921,6 +939,7 @@ export const useHomeController = () => {
text: selectedAudio.text || text,
generated_audio_id: selectedAudio.id,
language: selectedAudio.language || textLang,
lipsync_model: lipsyncModelMode,
title: videoTitle.trim() || undefined,
enable_subtitles: true,
output_aspect_ratio: outputAspectRatio,
@@ -1141,6 +1160,8 @@ export const useHomeController = () => {
setSubtitleBottomMargin,
outputAspectRatio,
setOutputAspectRatio,
lipsyncModelMode,
setLipsyncModelMode,
resolveAssetUrl,
getFontFormat,
buildTextShadow,
@@ -1215,6 +1236,8 @@ export const useHomeController = () => {
selectAudio,
speed,
setSpeed,
emotion,
setEmotion,
timelineSegments,
reorderSegments,
setSourceRange,
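Before submitting, the frontend maps the selected emotion to a CosyVoice3 instruct prompt (`emotionToInstruct` above); "normal" maps to an empty string, which keeps the backend on the zero-shot path. The same mapping sketched in Python for reference on the server side:

```python
# Mirrors the frontend emotionToInstruct table; unknown keys fall back to "".
EMOTION_TO_INSTRUCT = {
    "normal": "",
    "happy": "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
    "sad": "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
    "angry": "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
}


def to_instruct(emotion: str) -> str:
    return EMOTION_TO_INSTRUCT.get(emotion, "")
```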

View File

@@ -52,6 +52,8 @@ interface UseHomePersistenceOptions {
setSubtitleBottomMargin: React.Dispatch<React.SetStateAction<number>>;
outputAspectRatio: '9:16' | '16:9';
setOutputAspectRatio: React.Dispatch<React.SetStateAction<'9:16' | '16:9'>>;
lipsyncModelMode: 'default' | 'fast' | 'advanced';
setLipsyncModelMode: React.Dispatch<React.SetStateAction<'default' | 'fast' | 'advanced'>>;
selectedBgmId: string;
setSelectedBgmId: React.Dispatch<React.SetStateAction<string>>;
bgmVolume: number;
@@ -65,6 +67,8 @@ interface UseHomePersistenceOptions {
setSelectedAudioId: React.Dispatch<React.SetStateAction<string | null>>;
speed: number;
setSpeed: React.Dispatch<React.SetStateAction<number>>;
emotion: string;
setEmotion: React.Dispatch<React.SetStateAction<string>>;
}
export const useHomePersistence = ({
@@ -109,6 +113,8 @@ export const useHomePersistence = ({
setSubtitleBottomMargin,
outputAspectRatio,
setOutputAspectRatio,
lipsyncModelMode,
setLipsyncModelMode,
selectedBgmId,
setSelectedBgmId,
bgmVolume,
@@ -122,6 +128,8 @@ export const useHomePersistence = ({
setSelectedAudioId,
speed,
setSpeed,
emotion,
setEmotion,
}: UseHomePersistenceOptions) => {
const [isRestored, setIsRestored] = useState(false);
@@ -152,7 +160,9 @@ export const useHomePersistence = ({
const savedTitleDisplayMode = localStorage.getItem(`vigent_${storageKey}_titleDisplayMode`);
const savedSubtitleBottomMargin = localStorage.getItem(`vigent_${storageKey}_subtitleBottomMargin`);
const savedOutputAspectRatio = localStorage.getItem(`vigent_${storageKey}_outputAspectRatio`);
const savedLipsyncModelMode = localStorage.getItem(`vigent_${storageKey}_lipsyncModelMode`);
const savedSpeed = localStorage.getItem(`vigent_${storageKey}_speed`);
const savedEmotion = localStorage.getItem(`vigent_${storageKey}_emotion`);
setText(savedText || "大家好,欢迎来到我的频道,今天给大家分享一些有趣的内容。");
setVideoTitle(savedTitle ? clampTitle(savedTitle) : "");
@@ -230,11 +240,21 @@ export const useHomePersistence = ({
setOutputAspectRatio(savedOutputAspectRatio);
}
if (
savedLipsyncModelMode === 'default'
|| savedLipsyncModelMode === 'fast'
|| savedLipsyncModelMode === 'advanced'
) {
setLipsyncModelMode(savedLipsyncModelMode);
}
if (savedSpeed) {
const parsed = parseFloat(savedSpeed);
if (!Number.isNaN(parsed)) setSpeed(parsed);
}
if (savedEmotion) setEmotion(savedEmotion);
// eslint-disable-next-line react-hooks/set-state-in-effect
setIsRestored(true);
}, [
@@ -249,6 +269,7 @@ export const useHomePersistence = ({
setSelectedVideoId,
setSelectedAudioId,
setSpeed,
setEmotion,
setSubtitleFontSize,
setSubtitleSizeLocked,
setText,
@@ -262,6 +283,7 @@ export const useHomePersistence = ({
setTitleDisplayMode,
setSubtitleBottomMargin,
setOutputAspectRatio,
setLipsyncModelMode,
setTtsMode,
setVideoTitle,
setVideoSecondaryTitle,
@@ -377,6 +399,12 @@ export const useHomePersistence = ({
}
}, [outputAspectRatio, storageKey, isRestored]);
useEffect(() => {
if (isRestored) {
localStorage.setItem(`vigent_${storageKey}_lipsyncModelMode`, lipsyncModelMode);
}
}, [lipsyncModelMode, storageKey, isRestored]);
useEffect(() => {
if (isRestored) {
localStorage.setItem(`vigent_${storageKey}_bgmId`, selectedBgmId);
@@ -427,5 +455,11 @@ export const useHomePersistence = ({
}
}, [speed, storageKey, isRestored]);
useEffect(() => {
if (isRestored) {
localStorage.setItem(`vigent_${storageKey}_emotion`, emotion);
}
}, [emotion, storageKey, isRestored]);
return { isRestored };
};

View File

@@ -1,10 +1,14 @@
import { Rocket } from "lucide-react";
type LipsyncModelMode = "default" | "fast" | "advanced";
interface GenerateActionBarProps {
isGenerating: boolean;
progress: number;
disabled: boolean;
materialCount?: number;
modelMode: LipsyncModelMode;
onModelModeChange: (value: LipsyncModelMode) => void;
onGenerate: () => void;
}
@@ -13,45 +17,61 @@ export function GenerateActionBar({
progress,
disabled,
materialCount = 1,
modelMode,
onModelModeChange,
onGenerate,
}: GenerateActionBarProps) {
return (
<div>
<button
onClick={onGenerate}
disabled={disabled}
className={`w-full py-4 rounded-xl font-bold text-lg transition-all ${disabled
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
}`}
>
{isGenerating ? (
<span className="flex items-center justify-center gap-3">
<svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
<circle
className="opacity-25"
cx="12"
cy="12"
r="10"
stroke="currentColor"
strokeWidth="4"
fill="none"
/>
<path
className="opacity-75"
fill="currentColor"
d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z"
/>
</svg>
... {progress}%
</span>
) : (
<span className="flex items-center justify-center gap-2">
<Rocket className="h-5 w-5" />
</span>
)}
</button>
<div className="flex items-center gap-2">
<button
onClick={onGenerate}
disabled={disabled}
className={`flex-1 py-4 rounded-xl font-bold text-lg transition-all ${disabled
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
}`}
>
{isGenerating ? (
<span className="flex items-center justify-center gap-3">
<svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
<circle
className="opacity-25"
cx="12"
cy="12"
r="10"
stroke="currentColor"
strokeWidth="4"
fill="none"
/>
<path
className="opacity-75"
fill="currentColor"
d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z"
/>
</svg>
... {progress}%
</span>
) : (
<span className="flex items-center justify-center gap-2">
<Rocket className="h-5 w-5" />
</span>
)}
</button>
<select
value={modelMode}
onChange={(e) => onModelModeChange(e.target.value as LipsyncModelMode)}
disabled={isGenerating}
className="h-[58px] rounded-xl border border-white/15 bg-black/30 px-3 text-sm text-gray-200 outline-none focus:border-purple-400"
title="选择唇形模型"
>
<option value="default"></option>
<option value="fast"></option>
<option value="advanced"></option>
</select>
</div>
{!isGenerating && materialCount >= 2 && (
<p className="text-xs text-gray-400 text-center mt-1.5">
({materialCount} )

View File

@@ -23,6 +23,8 @@ interface GeneratedAudiosPanelProps {
speed: number;
onSpeedChange: (speed: number) => void;
ttsMode: string;
emotion: string;
onEmotionChange: (e: string) => void;
embedded?: boolean;
}
@@ -41,14 +43,18 @@ export function GeneratedAudiosPanel({
speed,
onSpeedChange,
ttsMode,
emotion,
onEmotionChange,
embedded = false,
}: GeneratedAudiosPanelProps) {
const [editingId, setEditingId] = useState<string | null>(null);
const [editName, setEditName] = useState("");
const [playingId, setPlayingId] = useState<string | null>(null);
const [speedOpen, setSpeedOpen] = useState(false);
const [emotionOpen, setEmotionOpen] = useState(false);
const audioRef = useRef<HTMLAudioElement | null>(null);
const speedRef = useRef<HTMLDivElement>(null);
const emotionRef = useRef<HTMLDivElement>(null);
const stopPlaying = useCallback(() => {
if (audioRef.current) {
@@ -80,6 +86,17 @@ export function GeneratedAudiosPanel({
return () => document.removeEventListener("mousedown", handler);
}, [speedOpen]);
// Close emotion dropdown on click outside
useEffect(() => {
const handler = (e: MouseEvent) => {
if (emotionRef.current && !emotionRef.current.contains(e.target as Node)) {
setEmotionOpen(false);
}
};
if (emotionOpen) document.addEventListener("mousedown", handler);
return () => document.removeEventListener("mousedown", handler);
}, [emotionOpen]);
const togglePlay = (audio: GeneratedAudio, e: React.MouseEvent) => {
e.stopPropagation();
if (playingId === audio.id) {
@@ -125,12 +142,48 @@ export function GeneratedAudiosPanel({
] as const;
const currentSpeedLabel = speedOptions.find((o) => o.value === speed)?.label ?? "正常";
const emotionOptions = [
{ value: "normal", label: "正常" },
{ value: "happy", label: "欢快" },
{ value: "sad", label: "低沉" },
{ value: "angry", label: "严肃" },
] as const;
const currentEmotionLabel = emotionOptions.find((o) => o.value === emotion)?.label ?? "正常";
const content = (
<>
{embedded ? (
<>
{/* Row 1: 语速 + 生成配音 (right-aligned) */}
{/* Row 1: 语气 + 语速 + 生成配音 (right-aligned) */}
<div className="flex justify-end items-center gap-1.5 mb-3">
{ttsMode === "voiceclone" && (
<div ref={emotionRef} className="relative">
<button
onClick={() => setEmotionOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
>
语气: {currentEmotionLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${emotionOpen ? "rotate-180" : ""}`} />
</button>
{emotionOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[80px]">
{emotionOptions.map((opt) => (
<button
key={opt.value}
onClick={() => { onEmotionChange(opt.value); setEmotionOpen(false); }}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
emotion === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
)}
{ttsMode === "voiceclone" && (
<div ref={speedRef} className="relative">
<button
@@ -192,6 +245,34 @@ export function GeneratedAudiosPanel({
</h2>
<div className="flex gap-1.5">
{ttsMode === "voiceclone" && (
<div ref={emotionRef} className="relative">
<button
onClick={() => setEmotionOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
>
语气: {currentEmotionLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${emotionOpen ? "rotate-180" : ""}`} />
</button>
{emotionOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[80px]">
{emotionOptions.map((opt) => (
<button
key={opt.value}
onClick={() => { onEmotionChange(opt.value); setEmotionOpen(false); }}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
emotion === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
)}
{ttsMode === "voiceclone" && (
<div ref={speedRef} className="relative">
<button

View File

@@ -97,6 +97,8 @@ export function HomePage() {
setTitleDisplayMode,
outputAspectRatio,
setOutputAspectRatio,
lipsyncModelMode,
setLipsyncModelMode,
resolveAssetUrl,
getFontFormat,
buildTextShadow,
@@ -168,6 +170,8 @@ export function HomePage() {
selectAudio,
speed,
setSpeed,
emotion,
setEmotion,
timelineSegments,
reorderSegments,
setSourceRange,
@@ -293,6 +297,8 @@ export function HomePage() {
speed={speed}
onSpeedChange={setSpeed}
ttsMode={ttsMode}
emotion={emotion}
onEmotionChange={setEmotion}
/>
</div>
@@ -427,6 +433,8 @@ export function HomePage() {
progress={currentTask?.progress || 0}
materialCount={selectedMaterials.length}
disabled={isGenerating || selectedMaterials.length === 0 || !selectedAudio}
modelMode={lipsyncModelMode}
onModelModeChange={setLipsyncModelMode}
onGenerate={handleGenerate}
/>
</div>

View File

@@ -174,6 +174,7 @@ async def generate(
ref_text: str = Form(...),
language: str = Form("Chinese"),
speed: float = Form(1.0),
instruct_text: str = Form(""),
):
"""
声音克隆生成
@@ -236,16 +237,30 @@ async def generate(
# CosyVoice3 的 prompt_text 格式
prompt_text = f"You are a helpful assistant.<|endofprompt|>{ref_text}"
use_instruct = bool(instruct_text.strip())
if use_instruct:
print(f"🎭 Instruct mode: {instruct_text[:60]}...")
def _do_inference():
"""在线程池中执行推理"""
results = list(_model.inference_zero_shot(
text,
prompt_text,
ref_audio_path,
stream=False,
speed=speed,
text_frontend=True,
))
if use_instruct:
results = list(_model.inference_instruct2(
text,
instruct_text,
ref_audio_path,
stream=False,
speed=speed,
text_frontend=True,
))
else:
results = list(_model.inference_zero_shot(
text,
prompt_text,
ref_audio_path,
stream=False,
speed=speed,
text_frontend=True,
))
if not results:
raise RuntimeError("CosyVoice returned empty results")

View File

@@ -253,21 +253,58 @@ class LipsyncPipeline(DiffusionPipeline):
faces = []
boxes = []
affine_matrices = []
valid_face_flags = []
print(f"Affine transforming {len(video_frames)} faces...")
for frame in tqdm.tqdm(video_frames):
face, box, affine_matrix = self.image_processor.affine_transform(frame)
faces.append(face)
boxes.append(box)
affine_matrices.append(affine_matrix)
try:
face, box, affine_matrix = self.image_processor.affine_transform(frame)
faces.append(face)
boxes.append(box)
affine_matrices.append(affine_matrix)
valid_face_flags.append(True)
except Exception:
faces.append(None)
boxes.append(None)
affine_matrices.append(None)
valid_face_flags.append(False)
valid_indices = [i for i, flag in enumerate(valid_face_flags) if flag]
if not valid_indices:
raise RuntimeError("Face not detected in any frame")
for i in range(len(faces)):
if faces[i] is not None:
continue
nearest_idx = min(valid_indices, key=lambda idx: abs(idx - i))
faces[i] = faces[nearest_idx].clone()
boxes[i] = boxes[nearest_idx]
affine_matrices[i] = affine_matrices[nearest_idx]
missing_count = len(valid_face_flags) - len(valid_indices)
if missing_count > 0:
print(
f"Warning: face not detected in {missing_count}/{len(valid_face_flags)} frames. "
"Those frames will keep original content."
)
faces = torch.stack(faces)
return faces, boxes, affine_matrices
return faces, boxes, affine_matrices, valid_face_flags
def restore_video(self, faces: torch.Tensor, video_frames: np.ndarray, boxes: list, affine_matrices: list):
def restore_video(
self,
faces: torch.Tensor,
video_frames: np.ndarray,
boxes: list,
affine_matrices: list,
valid_face_flags: Optional[list] = None,
):
video_frames = video_frames[: len(faces)]
out_frames = []
print(f"Restoring {len(faces)} faces...")
for index, face in enumerate(tqdm.tqdm(faces)):
if valid_face_flags is not None and not valid_face_flags[index]:
out_frames.append(video_frames[index])
continue
x1, y1, x2, y2 = boxes[index]
height = int(y2 - y1)
width = int(x2 - x1)
@@ -281,33 +318,37 @@ class LipsyncPipeline(DiffusionPipeline):
def loop_video(self, whisper_chunks: list, video_frames: np.ndarray):
# If the audio is longer than the video, we need to loop the video
if len(whisper_chunks) > len(video_frames):
faces, boxes, affine_matrices = self.affine_transform_video(video_frames)
faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames)
num_loops = math.ceil(len(whisper_chunks) / len(video_frames))
loop_video_frames = []
loop_faces = []
loop_boxes = []
loop_affine_matrices = []
loop_valid_face_flags = []
for i in range(num_loops):
if i % 2 == 0:
loop_video_frames.append(video_frames)
loop_faces.append(faces)
loop_boxes += boxes
loop_affine_matrices += affine_matrices
loop_valid_face_flags += valid_face_flags
else:
loop_video_frames.append(video_frames[::-1])
loop_faces.append(faces.flip(0))
loop_boxes += boxes[::-1]
loop_affine_matrices += affine_matrices[::-1]
loop_valid_face_flags += valid_face_flags[::-1]
video_frames = np.concatenate(loop_video_frames, axis=0)[: len(whisper_chunks)]
faces = torch.cat(loop_faces, dim=0)[: len(whisper_chunks)]
boxes = loop_boxes[: len(whisper_chunks)]
affine_matrices = loop_affine_matrices[: len(whisper_chunks)]
valid_face_flags = loop_valid_face_flags[: len(whisper_chunks)]
else:
video_frames = video_frames[: len(whisper_chunks)]
faces, boxes, affine_matrices = self.affine_transform_video(video_frames)
faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames)
return video_frames, faces, boxes, affine_matrices
return video_frames, faces, boxes, affine_matrices, valid_face_flags
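The looping strategy in `loop_video` (even passes play forward, odd passes reversed, so there is no visual jump at the loop point) reduces to a ping-pong repeat plus truncation:

```python
import math

# Ping-pong loop: repeat the frame sequence, alternating direction each pass,
# until target_len frames are available, then truncate.
def pingpong_loop(frames, target_len):
    if target_len <= len(frames):
        return frames[:target_len]
    out = []
    for i in range(math.ceil(target_len / len(frames))):
        out.extend(frames if i % 2 == 0 else frames[::-1])
    return out[:target_len]
```

The pipeline applies the same index pattern to frames, faces, boxes, affine matrices, and the new `valid_face_flags` so all five stay aligned.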
@torch.no_grad()
def __call__(
@@ -367,7 +408,7 @@ class LipsyncPipeline(DiffusionPipeline):
audio_samples = read_audio(audio_path)
video_frames = read_video(video_path, use_decord=False)
video_frames, faces, boxes, affine_matrices = self.loop_video(whisper_chunks, video_frames)
video_frames, faces, boxes, affine_matrices, valid_face_flags = self.loop_video(whisper_chunks, video_frames)
synced_video_frames = []
@@ -457,7 +498,13 @@ class LipsyncPipeline(DiffusionPipeline):
)
synced_video_frames.append(decoded_latents)
synced_video_frames = self.restore_video(torch.cat(synced_video_frames), video_frames, boxes, affine_matrices)
synced_video_frames = self.restore_video(
torch.cat(synced_video_frames),
video_frames,
boxes,
affine_matrices,
valid_face_flags=valid_face_flags,
)
audio_samples_remain_length = int(synced_video_frames.shape[0] / video_fps * audio_sample_rate)
audio_samples = audio_samples[:audio_samples_remain_length].cpu().numpy()
@@ -473,5 +520,5 @@ class LipsyncPipeline(DiffusionPipeline):
sf.write(os.path.join(temp_dir, "audio.wav"), audio_samples, audio_sample_rate)
command = f"ffmpeg -y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0 {video_out_path}"
command = f"ffmpeg -y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v copy -c:a aac -q:a 0 {video_out_path}"
subprocess.run(command, shell=True)

View File

@@ -49,11 +49,22 @@ def read_video(video_path: str, change_fps=True, use_decord=True):
if os.path.exists(temp_dir):
shutil.rmtree(temp_dir)
os.makedirs(temp_dir, exist_ok=True)
command = (
f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}"
)
subprocess.run(command, shell=True)
target_video_path = os.path.join(temp_dir, "video.mp4")
# Probe the input video FPS; skip re-encoding when it is already 25fps
cap = cv2.VideoCapture(video_path)
current_fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()
if abs(current_fps - 25.0) < 0.5:
# Already 25fps: use the original file directly and avoid one lossy re-encode
print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion")
target_video_path = video_path
else:
command = (
f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}"
)
subprocess.run(command, shell=True)
target_video_path = os.path.join(temp_dir, "video.mp4")
else:
target_video_path = video_path
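The skip decision above isolates to a tolerance check around the probed FPS. A sketch, with the 0.5fps tolerance taken from the diff:

```python
# Returns True when an ffmpeg re-encode to target_fps is needed; a source
# within `tolerance` of the target is used as-is to avoid a lossy pass.
def needs_fps_conversion(current_fps, target_fps=25.0, tolerance=0.5):
    return abs(current_fps - target_fps) >= tolerance
```

In the pipeline the probe comes from `cv2.VideoCapture(...).get(cv2.CAP_PROP_FPS)`, which can report slightly off-integer values (e.g. 25.02), hence the tolerance rather than an equality test.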

View File

@@ -4,14 +4,14 @@ MuseTalk v1.5 resident inference service (optimized v2)
- GPU: reads MUSETALK_GPU_ID from backend/.env (default 0)
- Architecture: FastAPI + lifespan (same pattern as LatentSync server.py)
Optimizations (vs v1):
1. cv2.VideoCapture reads frames directly (skips ffmpeg→PNG→imread)
2. Subsampled face detection (detect every N frames, interpolate bboxes in between)
3. BiSeNet mask caching (refresh every N frames, reuse in between)
4. cv2.VideoWriter writes video directly (skips per-frame PNG dumps)
5. batch_size 8→32
6. Per-stage timing
"""
Optimizations (vs v1):
1. cv2.VideoCapture reads frames directly (skips ffmpeg→PNG→imread)
2. Subsampled face detection (detect every N frames, interpolate bboxes in between)
3. BiSeNet mask caching (refresh every N frames, reuse in between)
4. FFmpeg rawvideo pipe encodes directly (removes the intermediate lossy mp4v)
5. batch_size 8→32
6. Per-stage timing
"""
import os
import sys
@@ -84,17 +84,28 @@ from musetalk.utils.utils import get_file_type, get_video_fps, datagen, load_all
from musetalk.utils.preprocessing import get_landmark_and_bbox, read_imgs, coord_placeholder
# --- Read extra configuration from .env ---
def load_env_config():
"""Read MuseTalk-related environment variables."""
config = {
"batch_size": 32,
"version": "v15",
"use_float16": True,
}
try:
env_path = musetalk_root.parent.parent / "backend" / ".env"
if env_path.exists():
with open(env_path, "r", encoding="utf-8") as f:
def load_env_config():
"""Read MuseTalk-related environment variables."""
config = {
"batch_size": 32,
"version": "v15",
"use_float16": True,
"detect_every": 5,
"blend_cache_every": 5,
"audio_padding_left": 2,
"audio_padding_right": 2,
"extra_margin": 15,
"delay_frame": 0,
"blend_mode": "auto",
"faceparsing_left_cheek_width": 90,
"faceparsing_right_cheek_width": 90,
"encode_crf": 18,
"encode_preset": "medium",
}
try:
env_path = musetalk_root.parent.parent / "backend" / ".env"
if env_path.exists():
with open(env_path, "r", encoding="utf-8") as f:
for line in f:
line = line.strip()
if line.startswith("MUSETALK_BATCH_SIZE="):
@@ -105,22 +116,78 @@ def load_env_config():
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["version"] = val
elif line.startswith("MUSETALK_USE_FLOAT16="):
val = line.split("=")[1].strip().split("#")[0].strip().lower()
config["use_float16"] = val in ("true", "1", "yes")
except Exception as e:
print(f"⚠️ 读取额外配置失败: {e}")
return config
env_config = load_env_config()
elif line.startswith("MUSETALK_USE_FLOAT16="):
val = line.split("=")[1].strip().split("#")[0].strip().lower()
config["use_float16"] = val in ("true", "1", "yes")
elif line.startswith("MUSETALK_DETECT_EVERY="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["detect_every"] = max(1, int(val))
elif line.startswith("MUSETALK_BLEND_CACHE_EVERY="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["blend_cache_every"] = max(1, int(val))
elif line.startswith("MUSETALK_AUDIO_PADDING_LEFT="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["audio_padding_left"] = max(0, int(val))
elif line.startswith("MUSETALK_AUDIO_PADDING_RIGHT="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["audio_padding_right"] = max(0, int(val))
elif line.startswith("MUSETALK_EXTRA_MARGIN="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["extra_margin"] = max(0, int(val))
elif line.startswith("MUSETALK_DELAY_FRAME="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["delay_frame"] = int(val)
elif line.startswith("MUSETALK_BLEND_MODE="):
val = line.split("=")[1].strip().split("#")[0].strip().lower()
if val in ("auto", "jaw", "raw"):
config["blend_mode"] = val
elif line.startswith("MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["faceparsing_left_cheek_width"] = max(0, int(val))
elif line.startswith("MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["faceparsing_right_cheek_width"] = max(0, int(val))
elif line.startswith("MUSETALK_ENCODE_CRF="):
val = line.split("=")[1].strip().split("#")[0].strip()
if val:
config["encode_crf"] = min(51, max(0, int(val)))
elif line.startswith("MUSETALK_ENCODE_PRESET="):
val = line.split("=")[1].strip().split("#")[0].strip().lower()
if val in (
"ultrafast", "superfast", "veryfast", "faster", "fast",
"medium", "slow", "slower", "veryslow"
):
config["encode_preset"] = val
except Exception as e:
print(f"⚠️ 读取额外配置失败: {e}")
return config
env_config = load_env_config()
# Global model cache
models = {}
# ===================== Tuning parameters =====================
DETECT_EVERY = 5 # Subsampled face detection: detect once every N frames
BLEND_CACHE_EVERY = 5 # BiSeNet mask cache: refresh once every N frames
# ====================================================
# ===================== Tuning parameters =====================
DETECT_EVERY = int(env_config["detect_every"]) # Subsampled face detection: detect once every N frames
BLEND_CACHE_EVERY = int(env_config["blend_cache_every"]) # BiSeNet mask cache: refresh once every N frames
AUDIO_PADDING_LEFT = int(env_config["audio_padding_left"])
AUDIO_PADDING_RIGHT = int(env_config["audio_padding_right"])
EXTRA_MARGIN = int(env_config["extra_margin"])
DELAY_FRAME = int(env_config["delay_frame"])
BLEND_MODE = str(env_config["blend_mode"])
FACEPARSING_LEFT_CHEEK_WIDTH = int(env_config["faceparsing_left_cheek_width"])
FACEPARSING_RIGHT_CHEEK_WIDTH = int(env_config["faceparsing_right_cheek_width"])
ENCODE_CRF = int(env_config["encode_crf"])
ENCODE_PRESET = str(env_config["encode_preset"])
# ====================================================
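Every `MUSETALK_*` handler above follows one pattern: take the value after `=`, strip an inline `#` comment, coerce to int, clamp to a sane range. A generic sketch of that pattern (the helper name is illustrative):

```python
# Parse a "KEY=value  # comment" line into a clamped int, using the same
# strip/split convention as the MUSETALK_* handlers.
def parse_env_int(line, key, lo=None, hi=None):
    if not line.startswith(key + "="):
        return None
    # Value is everything after the first "=", minus any inline "#" comment.
    val = line.split("=", 1)[1].strip().split("#")[0].strip()
    if not val:
        return None
    n = int(val)
    if lo is not None:
        n = max(lo, n)  # clamp from below (e.g. detect_every >= 1)
    if hi is not None:
        n = min(hi, n)  # clamp from above (e.g. crf <= 51)
    return n
```

Factoring the handlers through such a helper would shrink `load_env_config` considerably, at the cost of a slightly less grep-friendly key list.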
def run_ffmpeg(cmd):
@@ -191,11 +258,14 @@ async def lifespan(app: FastAPI):
whisper = whisper.to(device=device, dtype=weight_dtype).eval()
whisper.requires_grad_(False)
# FaceParsing
if version == "v15":
fp = FaceParsing(left_cheek_width=90, right_cheek_width=90)
else:
fp = FaceParsing()
# FaceParsing
if version == "v15":
fp = FaceParsing(
left_cheek_width=FACEPARSING_LEFT_CHEEK_WIDTH,
right_cheek_width=FACEPARSING_RIGHT_CHEEK_WIDTH,
)
else:
fp = FaceParsing()
# Restore the working directory
os.chdir(original_cwd)
@@ -211,9 +281,13 @@ async def lifespan(app: FastAPI):
models["version"] = version
models["timesteps"] = torch.tensor([0], device=device)
print("✅ MuseTalk v1.5 模型加载完成,服务就绪!")
print(f"⚙️ 优化参数: batch_size={env_config['batch_size']}, "
f"detect_every={DETECT_EVERY}, blend_cache_every={BLEND_CACHE_EVERY}")
print("✅ MuseTalk v1.5 模型加载完成,服务就绪!")
print(f"⚙️ 优化参数: batch_size={env_config['batch_size']}, "
f"detect_every={DETECT_EVERY}, blend_cache_every={BLEND_CACHE_EVERY}, "
f"audio_padding=({AUDIO_PADDING_LEFT},{AUDIO_PADDING_RIGHT}), extra_margin={EXTRA_MARGIN}, "
f"delay_frame={DELAY_FRAME}, blend_mode={BLEND_MODE}, "
f"faceparsing_cheek=({FACEPARSING_LEFT_CHEEK_WIDTH},{FACEPARSING_RIGHT_CHEEK_WIDTH}), "
f"encode=libx264/{ENCODE_PRESET}/crf{ENCODE_CRF}")
yield
models.clear()
torch.cuda.empty_cache()
@@ -354,15 +428,15 @@ def _detect_faces_subsampled(frames, detect_every=5):
# Core inference (optimized)
# =====================================================================
@torch.no_grad()
def _run_inference(req: LipSyncRequest) -> dict:
"""
Optimized inference logic:
1. cv2.VideoCapture reads frames directly (skips ffmpeg→PNG→imread)
2. Subsampled face detection (every N frames, interpolated in between)
3. BiSeNet mask cache (refreshed every N frames)
4. cv2.VideoWriter writes directly (skips per-frame PNGs)
5. Per-stage timing
def _run_inference(req: LipSyncRequest) -> dict:
"""
Optimized inference logic:
1. cv2.VideoCapture reads frames directly (skips ffmpeg→PNG→imread)
2. Subsampled face detection (every N frames, interpolated in between)
3. BiSeNet mask cache (refreshed every N frames)
4. FFmpeg rawvideo pipe encodes directly (no intermediate lossy file)
5. Per-stage timing
"""
vae = models["vae"]
unet = models["unet"]
pe = models["pe"]
@@ -411,12 +485,12 @@ def _run_inference(req: LipSyncRequest) -> dict:
# ===== Phase 2: Whisper audio features =====
t0 = time.time()
whisper_input_features, librosa_length = audio_processor.get_audio_feature(audio_path)
whisper_chunks = audio_processor.get_whisper_chunk(
whisper_input_features, device, weight_dtype, whisper, librosa_length,
fps=fps,
audio_padding_length_left=2,
audio_padding_length_right=2,
)
whisper_chunks = audio_processor.get_whisper_chunk(
whisper_input_features, device, weight_dtype, whisper, librosa_length,
fps=fps,
audio_padding_length_left=AUDIO_PADDING_LEFT,
audio_padding_length_right=AUDIO_PADDING_RIGHT,
)
timings["2_whisper"] = time.time() - t0
print(f"🎵 Whisper 特征 [{timings['2_whisper']:.1f}s]")
@@ -427,12 +501,12 @@ def _run_inference(req: LipSyncRequest) -> dict:
print(f"🔍 人脸检测 [{timings['3_face']:.1f}s]")
# ===== Phase 4: VAE latent encoding =====
t0 = time.time()
input_latent_list = []
extra_margin = 15
for bbox, frame in zip(coord_list, frames):
if bbox == coord_placeholder:
continue
t0 = time.time()
input_latent_list = []
extra_margin = EXTRA_MARGIN
for bbox, frame in zip(coord_list, frames):
if bbox == coord_placeholder:
continue
x1, y1, x2, y2 = bbox
if version == "v15":
y2 = min(y2 + extra_margin, frame.shape[0])
@@ -453,13 +527,13 @@ def _run_inference(req: LipSyncRequest) -> dict:
input_latent_list_cycle = input_latent_list + input_latent_list[::-1]
video_num = len(whisper_chunks)
gen = datagen(
whisper_chunks=whisper_chunks,
vae_encode_latents=input_latent_list_cycle,
batch_size=batch_size,
delay_frame=0,
device=device,
)
gen = datagen(
whisper_chunks=whisper_chunks,
vae_encode_latents=input_latent_list_cycle,
batch_size=batch_size,
delay_frame=DELAY_FRAME,
device=device,
)
res_frame_list = []
total_batches = int(np.ceil(float(video_num) / batch_size))
@@ -479,21 +553,44 @@ def _run_inference(req: LipSyncRequest) -> dict:
timings["5_unet"] = time.time() - t0
print(f"✅ UNet 推理: {len(res_frame_list)} 帧 [{timings['5_unet']:.1f}s]")
# ===== Phase 6: compositing (cv2.VideoWriter + pure numpy blending) =====
t0 = time.time()
h, w = frames[0].shape[:2]
temp_raw_path = output_vid_path + ".raw.mp4"
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
writer = cv2.VideoWriter(temp_raw_path, fourcc, fps, (w, h))
if not writer.isOpened():
raise RuntimeError(f"cv2.VideoWriter 打开失败: {temp_raw_path}")
cached_mask = None
cached_crop_box = None
blend_mode = "jaw" if version == "v15" else "raw"
# ===== Phase 6: composite and write into the FFmpeg rawvideo pipe =====
t0 = time.time()
h, w = frames[0].shape[:2]
ffmpeg_cmd = [
"ffmpeg", "-y", "-v", "warning",
"-f", "rawvideo",
"-pix_fmt", "bgr24",
"-s", f"{w}x{h}",
"-r", str(fps),
"-i", "-",
"-i", audio_path,
"-c:v", "libx264", "-preset", ENCODE_PRESET, "-crf", str(ENCODE_CRF), "-pix_fmt", "yuv420p",
"-c:a", "copy", "-shortest",
output_vid_path,
]
ffmpeg_proc = subprocess.Popen(
ffmpeg_cmd,
stdin=subprocess.PIPE,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
)
pipe_in = ffmpeg_proc.stdin
if pipe_in is None:
raise RuntimeError("FFmpeg 管道初始化失败")
def _write_pipe_frame(frame: np.ndarray):
try:
pipe_in.write(np.ascontiguousarray(frame, dtype=np.uint8).tobytes())
except BrokenPipeError as exc:
raise RuntimeError("FFmpeg 管道写入失败") from exc
cached_mask = None
cached_crop_box = None
if BLEND_MODE == "auto":
blend_mode = "jaw" if version == "v15" else "raw"
else:
blend_mode = BLEND_MODE
for i in tqdm(range(len(res_frame_list)), desc="合成"):
res_frame = res_frame_list[i]
@@ -503,26 +600,26 @@ def _run_inference(req: LipSyncRequest) -> dict:
x1, y1, x2, y2 = bbox
if version == "v15":
y2 = min(y2 + extra_margin, ori_frame.shape[0])
adjusted_bbox = (x1, y1, x2, y2)
try:
res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
except Exception:
writer.write(ori_frame)
continue
adjusted_bbox = (x1, y1, x2, y2)
try:
res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
except Exception:
_write_pipe_frame(ori_frame)
continue
# Refresh the BiSeNet face-parsing mask every N frames; reuse the cache otherwise
if i % BLEND_CACHE_EVERY == 0 or cached_mask is None:
try:
cached_mask, cached_crop_box = get_image_prepare_material(
ori_frame, adjusted_bbox, mode=blend_mode, fp=fp)
except Exception:
# If prepare fails, fall back to the full blending path
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
writer.write(combine_frame)
continue
except Exception:
# If prepare fails, fall back to the full blending path
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
_write_pipe_frame(combine_frame)
continue
try:
combine_frame = get_image_blending_fast(
@@ -532,35 +629,25 @@ def _run_inference(req: LipSyncRequest) -> dict:
try:
combine_frame = get_image_blending(
ori_frame, res_frame, adjusted_bbox, cached_mask, cached_crop_box)
except Exception:
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
writer.write(combine_frame)
writer.release()
timings["6_blend"] = time.time() - t0
print(f"🎨 合成 [{timings['6_blend']:.1f}s]")
# ===== Phase 7: FFmpeg H.264 encode + audio mux =====
t0 = time.time()
cmd = [
"ffmpeg", "-y", "-v", "warning",
"-i", temp_raw_path, "-i", audio_path,
"-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p",
"-c:a", "copy", "-shortest",
output_vid_path
]
if not run_ffmpeg(cmd):
raise RuntimeError("FFmpeg 重编码+音频合并失败")
# Clean up the temporary file
if os.path.exists(temp_raw_path):
os.unlink(temp_raw_path)
timings["7_encode"] = time.time() - t0
print(f"🔊 编码+音频 [{timings['7_encode']:.1f}s]")
except Exception:
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
_write_pipe_frame(combine_frame)
pipe_in.close()
timings["6_blend"] = time.time() - t0
print(f"🎨 合成 [{timings['6_blend']:.1f}s]")
# ===== Phase 7: wait for FFmpeg encoding to finish =====
t0 = time.time()
return_code = ffmpeg_proc.wait()
if return_code != 0:
raise RuntimeError("FFmpeg 编码+音频合并失败")
timings["7_encode"] = time.time() - t0
print(f"🔊 编码+音频 [{timings['7_encode']:.1f}s]")
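The pipe-based encode replaces the old write-mp4v-then-re-encode flow with a single ffmpeg process reading BGR24 frames from stdin. A sketch of the command construction, with the arguments taken from the diff (the helper name is illustrative):

```python
# Build the ffmpeg command for the rawvideo stdin pipe: frames arrive as
# packed BGR24 on "-", audio is muxed in unchanged, video goes out as H.264.
def build_rawvideo_cmd(width, height, fps, audio_path, out_path,
                       preset="medium", crf=18):
    return [
        "ffmpeg", "-y", "-v", "warning",
        "-f", "rawvideo", "-pix_fmt", "bgr24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",            # raw frames read from stdin
        "-i", audio_path,
        "-c:v", "libx264", "-preset", preset, "-crf", str(crf),
        "-pix_fmt", "yuv420p",
        "-c:a", "copy", "-shortest",
        out_path,
    ]
```

Each frame is then written as `np.ascontiguousarray(frame, dtype=np.uint8).tobytes()`, exactly `width * height * 3` bytes, and closing stdin lets ffmpeg flush and finalize the file.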
# ===== Summary =====
total_time = time.time() - t_total

View File

@@ -185,9 +185,50 @@ async function main() {
const currentHash = getSourceHash();
let bundleLocation: string;
// Helper: ensure the file is reachable in the cached public dir (hardlink > copy)
function ensureInCachedPublic(cachedPublicDir: string, srcAbsPath: string, fileName: string) {
const cachedPath = path.join(cachedPublicDir, fileName);
// Already present with matching size and inode (same hardlink): skip
try {
if (fs.existsSync(cachedPath)) {
const srcStat = fs.statSync(srcAbsPath);
const cachedStat = fs.statSync(cachedPath);
if (srcStat.size === cachedStat.size && srcStat.ino === cachedStat.ino) return;
}
} catch { /* file doesn't exist or broken, will recreate */ }
// Remove any stale file/link
try { fs.unlinkSync(cachedPath); } catch { /* doesn't exist, fine */ }
// Prefer a hardlink (zero-copy, transparent to the app); fall back to copying across filesystems
try {
fs.linkSync(srcAbsPath, cachedPath);
console.log(`Hardlinked into cached bundle: ${fileName}`);
} catch {
fs.copyFileSync(srcAbsPath, cachedPath);
console.log(`Copied into cached bundle: ${fileName}`);
}
}
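The hardlink-first strategy in `ensureInCachedPublic` (zero-copy when source and cache share a filesystem, plain copy otherwise) translates directly to other runtimes; a Python sketch of the same idea:

```python
import os
import shutil

# Hardlink-first placement: same-inode link when possible, copy as fallback.
# os.link raises OSError across filesystems or on unsupported targets,
# mirroring the try/catch around fs.linkSync above.
def link_or_copy(src, dst):
    if os.path.exists(dst):
        os.unlink(dst)  # remove any stale file/link first
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError:
        shutil.copyfile(src, dst)
        return "copy"
```

A hardlink shares the inode with the source, which is why the size+inode check above can detect "already linked" and skip the work on subsequent renders.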
if (fs.existsSync(hashFile) && fs.readFileSync(hashFile, 'utf-8') === currentHash) {
bundleLocation = BUNDLE_CACHE_DIR;
console.log('Using cached bundle');
// Make sure files needed by the current render are reachable in the cached bundle's public dir
const cachedPublicDir = path.join(BUNDLE_CACHE_DIR, 'public');
if (!fs.existsSync(cachedPublicDir)) {
fs.mkdirSync(cachedPublicDir, { recursive: true });
}
// 1) Video file
ensureInCachedPublic(cachedPublicDir, path.resolve(options.videoPath), videoFileName);
// 2) Font files (extracted from subtitleStyle / titleStyle / secondaryTitleStyle)
const styleSources = [options.subtitleStyle, options.titleStyle, options.secondaryTitleStyle];
for (const style of styleSources) {
const fontFile = (style as Record<string, unknown>)?.font_file as string | undefined;
if (fontFile) {
const fontSrcPath = path.join(publicDir, fontFile);
if (fs.existsSync(fontSrcPath)) {
ensureInCachedPublic(cachedPublicDir, path.resolve(fontSrcPath), fontFile);
}
}
}
} else {
console.log('Bundling Remotion project...');
console.log(`Entry point: ${entryPoint}`);