## Remotion Cache Fix + Encoding Pipeline Quality Optimization + Lip-Sync Fault Tolerance + Model Selection (Day 30)

### Overview

This round covers four areas: (1) a critical bug where Remotion's bundle cache caused titles/subtitles to disappear; (2) a full pass over the LatentSync + MuseTalk dual-engine encoding pipeline to eliminate redundant lossy encodes; (3) more robust LatentSync behavior, continuing inference instead of aborting when some frames contain no detectable face; (4) frontend lip-sync model selection, letting users switch between default/fast/advanced models on demand.
---

## ✅ Changes
### 1. Remotion Bundle Cache 404 Fix (Critical Bug)

- **Symptom**: Generated videos had no title or subtitles; when the Remotion render failed, the pipeline silently fell back to FFmpeg (which cannot overlay text).
- **Root cause**: Remotion's bundle cache copies `publicDir` (where videos/fonts live) only on the first bundle. Once the code stabilized the cache kept hitting, so newly generated video and font files were missing from the old cache's `public/` directory → Remotion's HTTP server returned 404 → the render failed.
- **First attempt**: `fs.symlinkSync` symlinks, but Remotion's internal HTTP server does not follow symlinks.
- **Final fix**: `fs.linkSync` hard links (zero-copy on the same filesystem, fully transparent to the app), automatically falling back to `fs.copyFileSync` across filesystems.

**File**: `remotion/render.ts`

```typescript
import fs from "fs";
import path from "path";

function ensureInCachedPublic(cachedPublicDir: string, srcAbsPath: string, fileName: string) {
  const cachedPath = path.join(cachedPublicDir, fileName);
  // Skip if the file already exists and is the same inode as the source
  if (fs.existsSync(cachedPath)) {
    const cached = fs.statSync(cachedPath);
    const src = fs.statSync(srcAbsPath);
    if (cached.ino === src.ino && cached.dev === src.dev) return;
    fs.unlinkSync(cachedPath);
  }
  try {
    // Prefer a hard link (zero-copy); fall back to copying across filesystems
    fs.linkSync(srcAbsPath, cachedPath);
  } catch {
    fs.copyFileSync(srcAbsPath, cachedPath);
  }
}
```

When the cached bundle is used, the files needed by the current render (video + fonts) are hard-linked into its `public/` directory:
- the video file (`videoFileName`)
- font files (extracted from the `font_file` field of `subtitleStyle` / `titleStyle` / `secondaryTitleStyle`)

---

### 2. Video Encoding Pipeline Quality Optimization

A full audit of the pipeline revealed that, from material upload to final output, a video could go through up to **5-6 lossy re-encodes**, while the official LatentSync demo has only 1-2.

#### Encoding Chain Before Optimization

| # | Stage | CRF | Issue |
|---|------|-----|------|
| 1 | Orientation normalization | 23 | Conditional |
| 2 | `prepare_segment` scale + duration | 23 | Always runs; quality too low |
| 3 | LatentSync `read_video` FPS conversion | 18 | **Re-encodes even when already 25fps** |
| 4 | LatentSync `imageio` frame writing | 13 | Model output |
| 5 | LatentSync final mux | 18 | **Re-encodes at CRF 18 right after the CRF 13 write** |
| 6 | compose | copy | Already optimized on Day 29 |
| 7 | Multi-material concat | 23 | **Segments already share parameters; re-encode unnecessary** |
| 8 | Remotion render | ~18 | Always runs (text overlay) |

#### Optimizations

##### 2a. LatentSync `read_video`: Skip the Redundant FPS Re-encode

**File**: `models/LatentSync/latentsync/utils/util.py`

- The original code unconditionally ran `ffmpeg -r 25 -crf 18`, even when the input video was already 25fps.
- Added an FPS check: when `abs(current_fps - 25.0) < 0.5`, the original file is used directly.
- Our `prepare_segment` already outputs 25fps, so this step was entirely redundant.

```python
cap = cv2.VideoCapture(video_path)
current_fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

if abs(current_fps - 25.0) < 0.5:
    print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion")
    target_video_path = video_path
else:
    # Re-encode only when the input is not 25fps
    command = f"ffmpeg ... -r 25 -crf 18 ..."
```

##### 2b. LatentSync Final Mux: Stream Copy Instead of Re-encode

**File**: `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py`

- Original code: after `imageio` wrote frames at high quality (CRF 13), the final mux fully re-encoded with `libx264 -crf 18`.
- Fix: switch to `-c:v copy` stream copy and mux only the audio track; the video stream is untouched.

```diff
- ffmpeg ... -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0
+ ffmpeg ... -c:v copy -c:a aac -q:a 0
```

##### 2c. `prepare_segment` + `normalize_orientation`: CRF 23 → 18

**File**: `backend/app/services/video_service.py`

- `normalize_orientation`: CRF 23 → 18
- `prepare_segment` trim temp file: CRF 23 → 18
- `prepare_segment` main command: CRF 23 → 18
- CRF 18 is the "high quality" tier and matches LatentSync's internal standard.

##### 2d. Multi-material Concat: Stream Copy

**File**: `backend/app/services/video_service.py`

- The original code re-encoded the concatenation with `libx264 -crf 23`.
- All segments are already normalized by `prepare_segment` to the same resolution/frame rate/encoding parameters.
- Switched to `-c:v copy` stream copy, eliminating one full re-encode.

```diff
- -vsync cfr -r 25 -c:v libx264 -preset fast -crf 23 -pix_fmt yuv420p
+ -c:v copy
```
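
A stream-copy concat with FFmpeg's concat demuxer needs a list file naming each segment. A minimal sketch of how such a command could be assembled — the helper name, list-file name, and segment paths are illustrative, not the project's actual code:

```python
from pathlib import Path

def build_concat_command(segments: list[str], output: str,
                         list_path: str = "concat_list.txt") -> list[str]:
    # The concat demuxer reads "file '<path>'" lines from a list file
    Path(list_path).write_text("".join(f"file '{s}'\n" for s in segments))
    # -c copy is only safe because all segments share codec/resolution/fps
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-c", "copy",
        output,
    ]

cmd = build_concat_command(["seg_0.mp4", "seg_1.mp4"], "out.mp4")
```

The resulting command would then be run via `subprocess`; stream copy remuxes containers without touching encoded frames, which is why it runs in milliseconds.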

#### Encoding Chain After Optimization

| # | Stage | CRF | Status |
|---|------|-----|------|
| 1 | Orientation normalization | **18** | Higher quality (conditional) |
| 2 | `prepare_segment` | **18** | Higher quality (always runs) |
| 3 | ~~LatentSync FPS conversion~~ | - | **Eliminated** |
| 4 | LatentSync model output | 13 | Unchanged (unavoidable) |
| 5 | ~~LatentSync final mux~~ | - | **Eliminated (copy)** |
| 6 | compose | copy | Unchanged |
| 7 | ~~Multi-material concat~~ | - | **Eliminated (copy)** |
| 8 | Remotion render | ~18 | Unchanged (unavoidable) |

**Total: 5-6 lossy encodes → 3** (prepare_segment → LatentSync model output → Remotion), cutting quality loss roughly in half.

---

## 📁 Modified Files

| File | Change |
|------|------|
| `remotion/render.ts` | Hard-link videos + fonts into the cached bundle's public directory |
| `models/LatentSync/latentsync/utils/util.py` | `read_video` detects FPS and skips re-encode at 25fps |
| `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py` | Final mux `-c:v copy`; faceless-frame tolerance (affine_transform + restore_video) |
| `backend/app/services/video_service.py` | `normalize_orientation` CRF 23→18; `prepare_segment` CRF 23→18; `concat_videos` `-c:v copy` |
| `backend/app/modules/videos/workflow.py` | Single-material LatentSync failure falls back to the original video |

---

### 3. LatentSync Faceless-Frame Tolerance

- **Problem**: if some frames in the material contain no detectable face (head turns, occlusion, empty shots), `affine_transform` raised an exception and the entire inference task failed.
- **Changes**:
  - `affine_transform_video`: catch per-frame exceptions and fill with the most recent valid frame's face/box/affine_matrix (keeping the tensor batch dimension intact); still raise if no frame has a face.
  - `restore_video`: new `valid_face_flags` parameter; faceless frames keep the original picture (no mouth replacement).
  - `loop_video`: `valid_face_flags` follows the looping and flipping.
  - `workflow.py`: in the single-material path, if `lipsync.generate()` fails outright, copy the original video and continue; the task no longer fails.
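
The fill-from-last-valid-frame idea can be sketched independently of the real tensor types. This is a simplified illustration, not the actual `affine_transform_video` code: `detections` stands in for per-frame face results, with `None` marking a faceless frame, and the returned flags correspond to the `valid_face_flags` described above.

```python
def fill_missing_faces(detections):
    """Fill faceless frames from the most recent valid detection.

    Returns (filled, valid_face_flags); raises when no frame has a face,
    mirroring the all-faceless behavior described above.
    """
    if all(d is None for d in detections):
        raise ValueError("No face detected in any frame")
    # Backfill faceless leading frames from the first valid detection
    last = next(d for d in detections if d is not None)
    filled, flags = [], []
    for d in detections:
        if d is None:
            filled.append(last)   # reuse last valid face/box/affine_matrix
            flags.append(False)   # restore_video keeps the original pixels here
        else:
            filled.append(d)
            flags.append(True)
            last = d
    return filled, flags
```

Filling keeps the batch dimension intact for the model, while the flags let the restore step leave those frames visually untouched.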

---

### 4. MuseTalk Encoding Chain Optimization

#### 4a. FFmpeg rawvideo Pipe Direct Encode (No Lossy Intermediate File)

**File**: `models/MuseTalk/scripts/server.py`

- **Old flow**: UNet inference frames → `cv2.VideoWriter(mp4v)` intermediate file (lossy) → FFmpeg re-encode + audio mux (lossy again)
- **New flow**: UNet inference frames → FFmpeg rawvideo stdin pipe → one libx264 encode + audio mux

```python
ffmpeg_cmd = [
    "ffmpeg", "-y", "-v", "warning",
    "-f", "rawvideo", "-pix_fmt", "bgr24",
    "-s", f"{w}x{h}", "-r", str(fps),
    "-i", "-",  # frames arrive on stdin
    "-i", audio_path,
    "-c:v", "libx264", "-preset", ENCODE_PRESET, "-crf", str(ENCODE_CRF),
    "-pix_fmt", "yuv420p",
    "-c:a", "copy", "-shortest",
    output_vid_path,
]
ffmpeg_proc = subprocess.Popen(ffmpeg_cmd, stdin=subprocess.PIPE, ...)
# each frame is written directly: pipe_in.write(frame.tobytes())
```

Key implementation details:
- `-pix_fmt bgr24` matches OpenCV's native frame format, so no conversion is needed
- `np.ascontiguousarray` guarantees contiguous frame memory
- `BrokenPipeError` handling plus a return-code check covers the failure paths
- `pipe_in.close()` runs before `ffmpeg_proc.wait()` so EOF is sent correctly
- every compositing fallback (resize failure, mask failure, blending failure) still emits the original frame through `_write_pipe_frame`

#### 4b. MuseTalk Parameters via Environment Variables + Quality-First Profile

**Files**: `models/MuseTalk/scripts/server.py` + `backend/.env`

All inference and encoding parameters moved from hard-coded values to `.env`; the current deployment uses the "quality-first" profile:

| Parameter | Old default | Quality-first | Effect |
|------|----------|-----------|------|
| `MUSETALK_DETECT_EVERY` | 5 | **2** | Face detection 2.5x more frequent; steadier tracking |
| `MUSETALK_BLEND_CACHE_EVERY` | 5 | **2** | Mask refreshed more often; cleaner facial-edge blending |
| `MUSETALK_EXTRA_MARGIN` | 15 | **14** | Chin-region fine-tune |
| `MUSETALK_BLEND_MODE` | auto | **jaw** | Explicit jaw mode for v1.5 |
| `MUSETALK_ENCODE_CRF` | 18 | **14** | Near visually lossless (the output is re-encoded by Remotion anyway) |
| `MUSETALK_ENCODE_PRESET` | medium | **slow** | Better compression efficiency at the same CRF |
| `MUSETALK_AUDIO_PADDING` | 2/2 | 2/2 | Unchanged |
| `MUSETALK_FACEPARSING_CHEEK` | 90/90 | 90/90 | Unchanged |

Full list of newly configurable parameters: `DETECT_EVERY`, `BLEND_CACHE_EVERY`, `AUDIO_PADDING_LEFT/RIGHT`, `EXTRA_MARGIN`, `DELAY_FRAME`, `BLEND_MODE`, `FACEPARSING_LEFT/RIGHT_CHEEK_WIDTH`, `ENCODE_CRF`, `ENCODE_PRESET`.

---

### 5. Workflow Async Non-blocking + compose Skip Optimization

#### 5a. Blocking Calls Moved to the Thread Pool

**File**: `backend/app/modules/videos/workflow.py`

Several synchronous FFmpeg calls in the workflow blocked the asyncio event loop, leaving other API requests (health checks, task-status queries) unable to respond. A shared helper, `_run_blocking()`, now routes all blocking calls through the thread pool:

```python
async def _run_blocking(func, *args):
    """Run a blocking function in the thread pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)
```

Blocking call sites converted so far:

| Call | Location | Notes |
|------|------|------|
| `video.normalize_orientation()` | Single-material rotation normalization | FFmpeg rotate/transcode |
| `video.prepare_segment()` | Multi-material segment prep | FFmpeg scale + trim, run in parallel with `asyncio.gather` |
| `video.concat_videos()` | Multi-material concatenation | FFmpeg concat |
| `video.prepare_segment()` | Single-material prep | FFmpeg scale + trim |
| `video.mix_audio()` | BGM mixing | FFmpeg audio mix |
| `video._get_duration()` | Audio/video duration probing (3 sites) | ffprobe subprocess |
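
The effect of the pattern can be demonstrated in isolation. A self-contained sketch, where `time.sleep` stands in for a synchronous FFmpeg/ffprobe call and `heartbeat` stands in for a health-check handler that must keep running while the blocking work proceeds:

```python
import asyncio
import time

async def _run_blocking(func, *args):
    """Run a blocking function in the default thread pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)

async def main():
    ticks = []

    async def heartbeat():
        # Keeps ticking while the "FFmpeg call" runs in a worker thread
        for _ in range(5):
            ticks.append(time.monotonic())
            await asyncio.sleep(0.02)

    _, result = await asyncio.gather(heartbeat(), _run_blocking(time.sleep, 0.1))
    return len(ticks), result
```

Had `time.sleep(0.1)` been awaited directly on the event loop thread, the heartbeat would stall for the whole 0.1s instead of interleaving.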

#### 5b. `prepare_segment`: Skip scale When Resolutions Match

**File**: `backend/app/modules/videos/workflow.py`

Previously, `target_resolution` was always passed to `prepare_segment` regardless of whether the material already matched the target, triggering the scale filter plus a libx264 re-encode. Now each material's resolution is compared first:

- **Multi-material**: per-segment check; matching segments pass `None` (`prepare_target_res = None if res == base_res else base_res`) and take the `-c:v copy` branch
- **Single-material**: `get_resolution` is compared first; `None` is passed on a match

When the resolution matches and no trimming, looping, or frame-rate change is needed, `prepare_segment` internally takes the `-c:v copy` path — a fully lossless pass-through.
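
The per-segment decision reduces to a small pure function, sketched here with a hypothetical name (`pick_target_resolutions`) around the one-liner quoted above:

```python
def pick_target_resolutions(resolutions, base_res):
    """Per segment: None (→ stream-copy branch) when a segment already matches
    the base resolution, otherwise request a rescale to base_res."""
    return [None if res == base_res else base_res for res in resolutions]
```

Keeping the decision data-driven makes it easy to log exactly which segments will be re-encoded before any FFmpeg process starts.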

#### 5c. `_get_duration()` Moved to the Thread Pool

**File**: `backend/app/modules/videos/workflow.py`

Three synchronous `video._get_duration()` ffprobe calls became `await _run_blocking(video._get_duration, ...)`, keeping the event loop unblocked.

#### 5d. compose Loop Path: CRF Unified

**File**: `backend/app/services/video_service.py`

When `compose()` needs to loop the video, its encode moved from CRF 23 to CRF 18, matching the pipeline-wide quality standard.

#### 5e. Multi-material Segment Validation

**File**: `backend/app/modules/videos/workflow.py`

After multi-material `prepare_segment` finishes, a segment-count consistency check prevents empty segments from reaching concat and raising.

#### 5f. Non-blocking `compose()` Internals

**File**: `backend/app/services/video_service.py`

`compose()` became `async def`; its internal `_get_duration()` and `_run_ffmpeg()` calls now run in the thread pool via `loop.run_in_executor`.

#### 5g. Pass-through When No Second compose Is Needed

**File**: `backend/app/modules/videos/workflow.py`

When there is no BGM (`final_audio_path == audio_path`), the LatentSync/MuseTalk output already carries the correct audio track, so the redundant compose step is skipped:

```python
needs_audio_compose = str(final_audio_path) != str(audio_path)
```

- **Remotion path**: if the audio is unchanged, skip the pre-compose and feed the lipsync output straight into Remotion
- **Non-Remotion path**: if the audio is unchanged, `shutil.copy` passes the lipsync output through without compose

---

### 6. Frontend Lip-Sync Model Selection

A model dropdown next to the frontend Generate button lets users pick the lip-sync engine, passed end-to-end to the backend router.

#### Model Options

| Option | Value | Routing |
|------|------|------|
| Default | `default` | Keep the existing threshold policy (`LIPSYNC_DURATION_THRESHOLD` cutoff: short videos → LatentSync, long videos → MuseTalk) |
| Fast | `fast` | Force MuseTalk; fall back to LatentSync if unavailable |
| Advanced | `advanced` | Force LatentSync; skip MuseTalk |

All three modes ultimately fall back to LatentSync, so a no-model-available state cannot occur.

#### Data Flow

```
frontend select → setLipsyncModelMode("fast") → persisted to localStorage
        ↓
user clicks "Generate Video" → handleGenerate()
  → payload.lipsync_model = lipsyncModelMode
  → POST /api/videos/generate { ..., lipsync_model: "fast" }
  → workflow: req.lipsync_model passed to lipsync.generate(model_mode=...)
  → lipsync_service.generate(): routes by model_mode
      → fast: force MuseTalk → fall back to LatentSync
      → advanced: force LatentSync
      → default: threshold policy
```
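
The three-way routing can be sketched as one pure function. This is an illustration, not the actual `lipsync_service.generate()` implementation; the function name, the `musetalk_available` flag, and the returned engine labels are assumptions, while the mode values and the 120s threshold come from the text above.

```python
def route_lipsync(model_mode: str, audio_duration: float,
                  musetalk_available: bool, threshold: float = 120.0) -> str:
    """Pick the lip-sync engine; LatentSync is the final fallback in every mode."""
    if model_mode == "advanced":
        return "latentsync"
    if model_mode == "fast":
        return "musetalk" if musetalk_available else "latentsync"
    # default: duration threshold — long videos go to MuseTalk when it is up
    if audio_duration >= threshold and musetalk_available:
        return "musetalk"
    return "latentsync"
```

Keeping the routing pure makes each branch (and the LatentSync fallback guarantee) trivially unit-testable.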

#### Changed Files

| File | Change |
|------|------|
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | New model `<select>` dropdown next to the Generate button |
| `frontend/src/features/home/ui/HomePage.tsx` | Pass through `modelMode` / `onModelModeChange` |
| `frontend/src/features/home/model/useHomeController.ts` | `lipsyncModelMode` state + payload pass-through |
| `frontend/src/features/home/model/useHomePersistence.ts` | Read/validate/write persistence |
| `backend/app/modules/videos/schemas.py` | `lipsync_model: Literal["default", "fast", "advanced"]` |
| `backend/app/modules/videos/workflow.py` | `model_mode=req.lipsync_model` passed at both the multi- and single-material call sites |
| `backend/app/services/lipsync_service.py` | `generate()` gains a `model_mode` parameter with three-way routing |

---

## 📁 Full Modified File List

| File | Change |
|------|------|
| `remotion/render.ts` | Hard-link videos + fonts into the cached bundle's public directory |
| `models/LatentSync/latentsync/utils/util.py` | `read_video` detects FPS and skips re-encode at 25fps |
| `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py` | Final mux `-c:v copy`; faceless-frame tolerance |
| `backend/app/services/video_service.py` | CRF 23→18; `concat_videos` copy; `compose()` made async + loop CRF 18 |
| `backend/app/modules/videos/workflow.py` | Thread-pooling; same-resolution scale skip; compose skip; segment validation; model-selection pass-through |
| `backend/app/modules/videos/schemas.py` | New `lipsync_model` field |
| `backend/app/services/lipsync_service.py` | `generate()` gains `model_mode` three-way routing |
| `models/MuseTalk/scripts/server.py` | FFmpeg rawvideo pipe; parameters via environment variables |
| `backend/.env` | New MuseTalk quality-first parameters |
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | Model dropdown UI |
| `frontend/src/features/home/ui/HomePage.tsx` | Model state pass-through |
| `frontend/src/features/home/model/useHomeController.ts` | `lipsyncModelMode` state + payload |
| `frontend/src/features/home/model/useHomePersistence.ts` | Model-choice persistence |

---

## 🔍 Verification

1. **Titles/subtitles restored**: generated videos should carry the title and word-by-word highlighted subtitles (Remotion render succeeds, no FFmpeg fallback)
2. **Remotion logs**: expect `Hardlinked into cached bundle:` or `Copied into cached bundle:` instead of 404s
3. **LatentSync FPS skip**: logs should show `Video already at 25.0fps, skipping FPS conversion`
4. **LatentSync mux**: the final mux in the FFmpeg logs should use `-c:v copy`
5. **Quality comparison**: with the same material + audio, the mouth region (especially teeth) should be sharper than before the optimization
6. **Multi-material concat**: the concat step should be a stream copy, dropping from seconds to milliseconds
7. **Faceless-frame tolerance**: materials with head-turn/occluded frames no longer fail the task; faceless frames keep the original picture
8. **MuseTalk pipe encode**: logs should show no intermediate mp4v file; the compositing stage writes straight into the pipe
9. **MuseTalk quality parameters**: `curl localhost:8011/health` confirms the service is up; mouth edges in generated videos should look cleaner
10. **Event loop stays responsive**: while a video is generating, endpoints such as `/api/tasks/{id}` should respond without timeouts
11. **compose skip**: with no BGM, logs should show `Audio unchanged, skip pre-Remotion compose`
12. **Same-resolution scale skip**: when the material already matches the target resolution, `prepare_segment` should take `-c:v copy` (no scale filter in the logs)
13. **compose loop CRF**: the loop path should encode at CRF 18 (not 23)
14. **Model selection UI**: a Default/Fast/Advanced model dropdown should appear next to the Generate button
15. **Model selection persistence**: after switching models and reloading the page, the dropdown should restore the last choice
16. **Fast-model routing**: choosing Fast should produce the backend log `强制快速模型:MuseTalk`
17. **Advanced-model routing**: choosing Advanced should produce the backend log `强制高级模型:LatentSync`
18. **Default model unchanged**: choosing Default behaves exactly as before the change (threshold routing)

@@ -201,6 +201,29 @@ LatentSync 1.6 needs ~18GB VRAM. If you hit OOM errors:

- `inference_steps`: raising to 30-50 improves quality
- `guidance_scale`: raising it improves lip sync, but too high can cause jitter

### Encoding Pipeline Optimization (Day 30)

Two redundant encodes in LatentSync's default internal flow have been optimized:

1. **`read_video` FPS conversion**: the original code unconditionally ran `ffmpeg -r 25 -crf 18`; it now detects FPS and skips at 25fps (our `prepare_segment` already outputs 25fps)
2. **Final mux double encode**: the original code re-encoded with `libx264 -crf 18` after `imageio` wrote frames at CRF 13; it now uses `-c:v copy` stream copy

These optimizations live in:
- `latentsync/utils/util.py` — the `read_video()` function
- `latentsync/pipelines/lipsync_pipeline.py` — the final mux command

---

### Faceless-Frame Tolerance (Day 30)

When some frames contain no detectable face (head turns, occlusion, empty shots), inference no longer aborts:

- `affine_transform_video`: per-frame failures are filled from the most recent valid frame; still raises when no frame has a face
- `restore_video`: faceless frames keep the original picture, with no mouth replacement
- backend `workflow.py`: if LatentSync fails outright, the original video is used and the task does not fail

The changes live in `latentsync/pipelines/lipsync_pipeline.py`.

---

## References

@@ -1,6 +1,6 @@

# MuseTalk Deployment Guide

> **Updated**: 2026-03-02
> **Applies to**: MuseTalk v1.5 (resident service mode)
> **Architecture**: FastAPI resident service + PM2 process management

@@ -173,17 +173,36 @@ curl http://localhost:8011/health

Relevant variables in `backend/.env`:

```ini
# MuseTalk base configuration
MUSETALK_GPU_ID=0                        # GPU index (shared with CosyVoice)
MUSETALK_API_URL=http://localhost:8011   # resident service address
MUSETALK_BATCH_SIZE=32                   # inference batch size
MUSETALK_VERSION=v15                     # model version
MUSETALK_USE_FLOAT16=true                # half-precision acceleration

# Inference quality parameters
MUSETALK_DETECT_EVERY=2                  # face-detection interval (frames; smaller = more accurate but slower)
MUSETALK_BLEND_CACHE_EVERY=2             # BiSeNet mask cache refresh interval (frames)
MUSETALK_AUDIO_PADDING_LEFT=2            # Whisper temporal context (left)
MUSETALK_AUDIO_PADDING_RIGHT=2           # Whisper temporal context (right)
MUSETALK_EXTRA_MARGIN=14                 # v1.5 chin-region expansion in pixels
MUSETALK_DELAY_FRAME=0                   # audio-to-mouth alignment offset (frames)
MUSETALK_BLEND_MODE=jaw                  # blend mode: auto / jaw / raw
MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH=90 # cheek width (v1.5 only)
MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH=90

# Encoding quality parameters
MUSETALK_ENCODE_CRF=14                   # lower CRF = sharper (14 ≈ near visually lossless)
MUSETALK_ENCODE_PRESET=slow              # x264 preset (slow = better compression efficiency)

# Hybrid lip-sync routing
LIPSYNC_DURATION_THRESHOLD=120           # seconds; >= this value uses MuseTalk
```

> **Profile reference**:
> - Speed-first: `DETECT_EVERY=5, BLEND_CACHE_EVERY=5, ENCODE_CRF=18, ENCODE_PRESET=medium`
> - Quality-first (current): `DETECT_EVERY=2, BLEND_CACHE_EVERY=2, ENCODE_CRF=14, ENCODE_PRESET=slow`

---

## Related Files

@@ -207,22 +226,36 @@ LIPSYNC_DURATION_THRESHOLD=120 # seconds; >= this value uses MuseTalk

|--------|------|
| `MUSETALK_BATCH_SIZE` 8→32 | RTX 3090 has VRAM to spare; ~3x faster UNet inference |
| Direct frame reads via cv2.VideoCapture | Skips the ffmpeg→PNG→imread chain |
| Face detection every N frames | DWPose + FaceAlignment run only on sampled frames; bboxes linearly interpolated in between |
| BiSeNet mask cache every N frames | `get_image_prepare_material` runs every N frames; intermediate frames reuse it via `get_image_blending` |
| FFmpeg rawvideo pipe direct encode | The lossy `cv2.VideoWriter(mp4v)` intermediate file replaced by direct stdin pipe writes, removing one redundant lossy encode |
| Parameters via environment variables | All inference/encoding parameters read from `.env`; quick switching between speed-first and quality-first |
| Per-stage timing | Seven stages timed precisely to guide future tuning |

### Encoding Chain

```
UNet inference frames (raw BGR24)
  → FFmpeg rawvideo stdin pipe
  → one libx264 encode (CRF 14, preset slow) + audio mux
  → final .mp4 output
```

Compared with the old flow: the lossy `cv2.VideoWriter(mp4v)` intermediate file is gone, and the encode count drops from 2 to 1.

### Tuning Parameters

All parameters are configured via `backend/.env` (restart the MuseTalk service after changes):

```ini
MUSETALK_DETECT_EVERY=2       # face-detection interval (frames); 2 for quality-first, 5 for speed-first
MUSETALK_BLEND_CACHE_EVERY=2  # BiSeNet mask cache interval (frames)
MUSETALK_ENCODE_CRF=14        # encode quality (14 ≈ near visually lossless, 18 = high quality)
MUSETALK_ENCODE_PRESET=slow   # encode speed (slow = better compression, medium = balanced)
```

> For talking-head footage (the face barely moves), interpolation error at detect_every=5 is negligible.
> For heavy face motion, or for best quality, use detect_every=2.

---
@@ -241,6 +241,15 @@ const bundleLocation = await bundle({

const videoUrl = staticFile(videoSrc); // use staticFile
```

**Issue**: Remotion render fails with 404, video file not found (bundle cache problem)

Remotion uses a bundle cache to speed up packaging. On a cache hit, newly generated video/font files must be hard-linked into the cache's `public/` directory. If 404s appear, clear the cache and retry:

```bash
rm -rf /home/rongye/ProgramFiles/ViGent2/remotion/.remotion-bundle-cache
pm2 restart vigent2-backend
```

**Issue**: Remotion render fails

Check the backend logs:

@@ -296,3 +305,4 @@ WhisperService(device="cuda:0") # or "cuda:1"

| 2026-02-27 | 1.3.0 | Architecture diagram updated for MuseTalk hybrid routing; Remotion concurrency raised from 8 to 16; GPU allocation notes updated |
| 2026-02-28 | 1.3.1 | MuseTalk compositing optimization: pure-numpy blending + FFmpeg pipe NVENC GPU hardware encode replacing the double encode |
| 2026-02-28 | 1.4.0 | compose stream copy instead of re-encode; FFmpeg timeout guards (600s/30s); Remotion concurrency 16→4; Whisper timestamp smoothing + source-text rhythm mapping; global video-generation Semaphore(2); Redis task TTL |
| 2026-03-02 | 1.5.0 | Remotion bundle cache fix (hard-link videos/fonts into the cached public directory); encoding pipeline: prepare_segment/normalize CRF 23→18; multi-material concat switched to stream copy |

@@ -1,8 +1,8 @@

# ViGent2 Development Task Log

**Project**: ViGent2 digital-human talking-head video generation system
**Progress**: 100% (Day 30 - Remotion cache fix + encoding pipeline quality optimization)
**Updated**: 2026-03-02

---

@@ -10,7 +10,26 @@

> Each day's core development work and milestones are recorded here.

### Day 30: Remotion Cache Fix + Encoding Pipeline Quality Optimization + Lip-Sync Fault Tolerance (Current)
- [x] **Remotion cache 404 fix**: on a bundle-cache hit, newly generated video/font files were missing from the old cache's `public/` directory → 404 → FFmpeg fallback (no titles/subtitles). Now hard-links (`fs.linkSync`) the files the current render needs into the cache directory.
- [x] **LatentSync `read_video` skips the redundant FPS re-encode**: input FPS is detected, and the `ffmpeg -r 25 -crf 18` re-encode is skipped at 25fps.
- [x] **LatentSync final mux stream copy**: the mux after `imageio`'s CRF 13 frame write moved from `libx264 -crf 18` to `-c:v copy`, removing the redundant double encode.
- [x] **`prepare_segment` + `normalize_orientation` quality bump**: CRF 23 → 18, matching LatentSync's internal standard.
- [x] **Multi-material concat stream copy**: segments already share parameters, so `concat_videos` moved from `libx264 -crf 23` to `-c:v copy`.
- [x] **Total encode count**: down from 5-6 lossy encodes to 3 (prepare_segment → LatentSync/MuseTalk model output → Remotion).
- [x] **LatentSync faceless-frame tolerance**: inference no longer aborts when some frames have no detectable face; faceless frames keep the original picture, and single-material failures fall back to the original video.
- [x] **MuseTalk pipe direct encode**: the lossy `cv2.VideoWriter(mp4v)` intermediate file replaced by an FFmpeg rawvideo stdin pipe, removing one redundant lossy encode.
- [x] **MuseTalk parameters via environment variables**: inference/encoding parameters (detect_every/blend_cache/CRF/preset etc.) moved from hard-coded values to `backend/.env`; currently on the quality-first profile (CRF 14, preset slow, detect_every 2, blend_cache_every 2).
- [x] **Workflow async non-blocking**: new `_run_blocking()` thread-pool helper; 5 synchronous FFmpeg call sites (rotation normalization / prepare_segment / concat / BGM mixing) now use `await _run_blocking()`, so the event loop is never blocked.
- [x] **compose skip optimization**: with no BGM (`final_audio_path == audio_path`), the redundant compose step is skipped; the Remotion path uses the lipsync output directly, and the non-Remotion path passes it through via `shutil.copy`.
- [x] **`compose()` made async**: `compose()` is now `async def`; its internal `_get_duration` and `_run_ffmpeg` run via `run_in_executor`.
- [x] **Same-resolution scale skip**: multi-material segments compare resolutions per segment and pass `None` (copy branch) on a match; same for single-material. Avoids pointless re-encodes when the material is already at the target resolution.
- [x] **`_get_duration()` thread-pooled**: 3 synchronous ffprobe probes in the workflow now use `await _run_blocking()`.
- [x] **compose loop CRF unified**: loop path CRF 23 → 18, consistent with the pipeline-wide standard.
- [x] **Multi-material segment validation**: segment counts are checked after prep, keeping empty segments out of concat.
- [x] **Frontend lip-sync model selection**: a Default/Fast/Advanced dropdown next to the Generate button passes `lipsync_model` end-to-end to the backend router. Default keeps the threshold policy, Fast forces MuseTalk, Advanced forces LatentSync; all three fall back to LatentSync, and the choice persists in localStorage.

### Day 29: Video Pipeline Optimization + CosyVoice Tone Control
- [x] **Subtitle sync fix**: three-step Whisper timestamp smoothing (monotonic increase + overlap removal + gap filling) + source-text rhythm mapping (linear interpolation + per-character duration clamping).
- [x] **LatentSync mouth-shape tuning**: inference_steps 16→20, guidance_scale 2.0, DeepCache enabled, Remotion concurrency 16→4.
- [x] **compose stream copy**: `-c:v copy` replaces the libx264 re-encode when no looping is needed, cutting compose time from minutes to seconds.
@@ -16,7 +16,7 @@

## ✨ Features

### Core Capabilities
- 🎬 **HD lip sync** - Hybrid scheme: short videos (<120s) use LatentSync 1.6 (high-quality latent diffusion), long videos (>=120s) use MuseTalk 1.5 (real-time single-step inference), with automatic routing + fallback. Frontend model choice: Default (threshold auto-routing) / Fast (force MuseTalk) / Advanced (force LatentSync).
- 🎙️ **Multi-modal voiceover** - Supports **EdgeTTS** (Microsoft neural voices, 10 languages) and **CosyVoice 3.0** (3-second voice cloning, 9 languages + 18 dialects, adjustable pace/tone). Uploaded reference audio is auto-transcribed with Whisper and smart-trimmed. Voiceover-first workflow: generate the voiceover → pick materials → generate the video.
- 📝 **Smart subtitles** - faster-whisper + Remotion generate word-by-word highlighted (karaoke-style) subtitles automatically.
- 🎨 **Style presets** - 12 title + 8 subtitle style presets with preview, font-size control, and a custom font library. Native CSS stroke rendering: crisp, no ghosting.
@@ -37,7 +37,7 @@

- 💳 **Paid membership** - Alipay web payment auto-activates membership; expiry auto-deactivates with a renewal prompt; admins can activate manually.
- 🔐 **Auth & isolation** - Supabase-backed per-user isolation with phone-number signup/login and password management.
- 🛡️ **Service watchdog** - Built-in watchdog monitors and restarts hung services for 7x24 stability.
- 🚀 **Performance** - Encoding pipeline trimmed from 5-6 lossy encodes to 3 (prepare_segment → model output → Remotion), compose stream copy with no re-encode, same-resolution scale skip, FFmpeg timeout guards, global generation concurrency limit (Semaphore(2)), Remotion 4-way concurrent rendering, MuseTalk rawvideo pipe direct encode (no lossy intermediate file), resident model services, dual-GPU pipeline concurrency, Redis task TTL auto-cleanup, thread-pooled workflow blocking calls.

---

@@ -25,10 +25,10 @@ LATENTSYNC_USE_SERVER=true
# LATENTSYNC_API_URL=http://localhost:8007

# Inference steps (20-50; higher = better quality, slower)
LATENTSYNC_INFERENCE_STEPS=30

# Guidance scale (1.0-3.0; higher = tighter lip sync but possible jitter)
LATENTSYNC_GUIDANCE_SCALE=1.9

# Enable DeepCache acceleration (recommended)
LATENTSYNC_ENABLE_DEEPCACHE=true

@@ -52,9 +52,36 @@ MUSETALK_VERSION=v15

# Half-precision acceleration
MUSETALK_USE_FLOAT16=true

# Face-detection interval (frames; smaller = steadier quality but slower)
MUSETALK_DETECT_EVERY=2

# BiSeNet mask cache refresh interval (frames; smaller = steadier quality but slower)
MUSETALK_BLEND_CACHE_EVERY=2

# Whisper temporal context (larger = smoother, but mouth response gets duller)
MUSETALK_AUDIO_PADDING_LEFT=2
MUSETALK_AUDIO_PADDING_RIGHT=2

# v1.5 chin-region expansion in pixels (larger shows more lower lip/teeth, but edges get less stable)
MUSETALK_EXTRA_MARGIN=14

# Audio-to-mouth alignment offset (frames; positive = mouth later, negative = earlier)
MUSETALK_DELAY_FRAME=0

# Blend mode: auto (per version) / jaw / raw
MUSETALK_BLEND_MODE=jaw

# FaceParsing cheek widths (v1.5 only; affects the blend-mask extent)
MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH=90
MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH=90

# Final encode quality (lower CRF = sharper but larger files)
MUSETALK_ENCODE_CRF=14
MUSETALK_ENCODE_PRESET=slow

# =============== Hybrid lip-sync routing ===============
# Audio duration >= this threshold (seconds) uses MuseTalk; below it, LatentSync
LIPSYNC_DURATION_THRESHOLD=100

# =============== Upload configuration ===============
# Max upload file size (MB)

@@ -38,3 +38,4 @@ class GenerateRequest(BaseModel):
    bgm_volume: Optional[float] = 0.2
    custom_assignments: Optional[List[CustomAssignment]] = None
    output_aspect_ratio: Literal["9:16", "16:9"] = "9:16"
    lipsync_model: Literal["default", "fast", "advanced"] = "default"

@@ -94,6 +94,12 @@ def _update_task(task_id: str, **updates: Any) -> None:
    task_store.update(task_id, updates)


async def _run_blocking(func, *args):
    """Run a blocking function in the thread pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)


# ── multi-material helpers ──

@@ -214,7 +220,8 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

    # Normalize rotation metadata (e.g. iPhone MOV 1920x1080 + rotation=-90)
    normalized_input_path = temp_dir / f"{task_id}_input_norm.mp4"
    normalized_result = await _run_blocking(
        video.normalize_orientation,
        str(input_material_path),
        str(normalized_input_path),
    )

@@ -317,7 +324,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

        result = _split_equal(captions_data["segments"], material_paths)
    else:
        logger.warning("[MultiMat] Whisper 无数据,按时长均分")
        audio_dur = await _run_blocking(video._get_duration, str(audio_path))
        if audio_dur <= 0:
            audio_dur = 30.0
        seg_dur = audio_dur / len(material_paths)

@@ -378,7 +385,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

    assignments, captions_path = await _whisper_and_split()

    # Extend segments to cover the full audio range: first from 0, last to the end
    audio_duration = await _run_blocking(video._get_duration, str(audio_path))
    if assignments and audio_duration > 0:
        assignments[0]["start"] = 0.0
        assignments[-1]["end"] = audio_duration

@@ -402,9 +409,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

        await _download_material(assignment["material_path"], material_local)

        normalized_material = temp_dir / f"{task_id}_material_{i}_norm.mp4"
        normalized_result = await _run_blocking(
            video.normalize_orientation,
            str(material_local),
            str(normalized_material),

@@ -432,22 +437,21 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

        logger.info(f"[MultiMat] 素材分辨率不一致,统一到 {base_res[0]}x{base_res[1]}")

    # ── step 2: trim each segment's material to its duration, in parallel ──
    prepared_segments: List[Optional[Path]] = [None] * num_segments

    async def _prepare_one_segment(i: int, assignment: dict):
        """Trim/loop a single material to its target duration."""
        seg_dur = assignment["end"] - assignment["start"]
        prepared_path = temp_dir / f"{task_id}_prepared_{i}.mp4"
        temp_files.append(prepared_path)
        prepare_target_res = None if resolutions[i] == base_res else base_res

        await _run_blocking(
            video.prepare_segment,
            str(material_locals[i]),
            seg_dur,
            str(prepared_path),
            prepare_target_res,
            assignment.get("source_start", 0.0),
            assignment.get("source_end"),
            25,

@@ -472,10 +476,14 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

        _update_task(task_id, progress=50, message="正在拼接素材片段...")
        concat_path = temp_dir / f"{task_id}_concat.mp4"
        temp_files.append(concat_path)
        prepared_segment_paths = [str(p) for p in prepared_segments if p is not None]
        if len(prepared_segment_paths) != num_segments:
            raise RuntimeError("Multi-material: prepared segments mismatch")
        await _run_blocking(
            video.concat_videos,
            prepared_segment_paths,
            str(concat_path),
            25,
        )

    # ── step 3: one LatentSync inference pass ──

@@ -485,7 +493,12 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

        _update_task(task_id, progress=55, message="正在合成唇形 (LatentSync)...")
        print(f"[LipSync] Multi-material: single LatentSync on concatenated video")
        try:
            await lipsync.generate(
                str(concat_path),
                str(audio_path),
                str(lipsync_video_path),
                model_mode=req.lipsync_model,
            )
        except Exception as e:
            logger.warning(f"[LipSync] Failed, fallback to concat without lipsync: {e}")
            import shutil

@@ -519,18 +532,22 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

        single_source_end = req.custom_assignments[0].source_end

        _update_task(task_id, progress=20, message="正在准备素材片段...")
        audio_dur = await _run_blocking(video._get_duration, str(audio_path))
        if audio_dur <= 0:
            audio_dur = 30.0
        single_res = await _run_blocking(video.get_resolution, str(input_material_path))
        single_target_res = None if single_res == target_resolution else target_resolution
        prepared_single_path = temp_dir / f"{task_id}_prepared_single.mp4"
        temp_files.append(prepared_single_path)
        await _run_blocking(
            video.prepare_segment,
            str(input_material_path),
            audio_dur,
            str(prepared_single_path),
            single_target_res,
            single_source_start,
            single_source_end,
            None,
        )
        input_material_path = prepared_single_path

@@ -543,7 +560,18 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

        if is_ready:
            print(f"[LipSync] Starting LatentSync inference...")
            _update_task(task_id, progress=35, message="正在运行 LatentSync 推理...")
            try:
                await lipsync.generate(
                    str(input_material_path),
                    str(audio_path),
                    str(lipsync_video_path),
                    model_mode=req.lipsync_model,
                )
            except Exception as e:
                logger.warning(f"[LipSync] Failed on single-material, fallback to prepared video: {e}")
                _update_task(task_id, message="唇形同步失败,使用原始视频...")
                import shutil
                shutil.copy(str(input_material_path), str(lipsync_video_path))
        else:
            print(f"[LipSync] LatentSync not ready, copying original video")
            _update_task(task_id, message="唇形同步不可用,使用原始视频...")
|
||||
@@ -564,6 +592,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
     final_audio_path = audio_path
     _whisper_task = None
     _bgm_task = None
+    mix_output_path: Optional[Path] = None

     # 单素材模式下 Whisper 尚未执行,这里与 BGM 并行启动
     need_whisper = not is_multi and req.enable_subtitles and captions_path is None
@@ -604,10 +633,8 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

     async def _run_bgm():
         _update_task(task_id, message="正在合成背景音乐...", progress=86)
-        loop = asyncio.get_event_loop()
         try:
-            await loop.run_in_executor(
-                None,
+            await _run_blocking(
                 video.mix_audio,
                 _voice_path,
                 _bgm_path,
@@ -633,7 +660,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
         captions_path = None
     result_idx += 1
     if _bgm_task is not None:
-        if results[result_idx]:
+        if results[result_idx] and mix_output_path is not None:
             final_audio_path = mix_output_path

@@ -705,14 +732,19 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us

     final_output_local_path = temp_dir / f"{task_id}_output.mp4"
     temp_files.append(final_output_local_path)
+    needs_audio_compose = str(final_audio_path) != str(audio_path)

     if use_remotion:
         _update_task(task_id, message="正在合成视频 (Remotion)...", progress=87)
+        remotion_input_path = lipsync_video_path

-        composed_video_path = temp_dir / f"{task_id}_composed.mp4"
-        temp_files.append(composed_video_path)
-
-        await video.compose(str(lipsync_video_path), str(final_audio_path), str(composed_video_path))
+        if needs_audio_compose:
+            composed_video_path = temp_dir / f"{task_id}_composed.mp4"
+            temp_files.append(composed_video_path)
+            await video.compose(str(lipsync_video_path), str(final_audio_path), str(composed_video_path))
+            remotion_input_path = composed_video_path
+        else:
+            logger.info("[Pipeline] Audio unchanged, skip pre-Remotion compose")

         remotion_health = await remotion_service.check_health()
         if remotion_health.get("ready"):
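The skip rule added in this hunk can be isolated as a small pure helper for testing. This is only a sketch; `pick_remotion_input` is a hypothetical name, not a function that exists in the repo:

```python
from pathlib import Path


def pick_remotion_input(lipsync_video: Path, final_audio: Path,
                        original_audio: Path, composed_video: Path) -> Path:
    """Mirror of the pipeline rule above: only when BGM mixing produced a
    different audio file do we pay for an extra compose pass before Remotion;
    otherwise the lipsync output already carries the right audio track."""
    needs_audio_compose = str(final_audio) != str(original_audio)
    return composed_video if needs_audio_compose else lipsync_video
```

Keeping the decision as string comparison of paths matches the diff above, where `final_audio_path` is only reassigned when BGM mixing succeeds.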
@@ -729,7 +761,7 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
             title_duration = max(0.5, min(float(req.title_duration or 4.0), 30.0))

             await remotion_service.render(
-                video_path=str(composed_video_path),
+                video_path=str(remotion_input_path),
                 output_path=str(final_output_local_path),
                 captions_path=str(captions_path) if captions_path else None,
                 title=req.title,
@@ -747,15 +779,18 @@ async def _process_video_generation_inner(task_id: str, req: GenerateRequest, us
         except Exception as e:
             logger.warning(f"Remotion render failed, using FFmpeg fallback: {e}")
             import shutil
-            shutil.copy(str(composed_video_path), final_output_local_path)
+            shutil.copy(str(remotion_input_path), str(final_output_local_path))
     else:
         logger.warning(f"Remotion not ready: {remotion_health.get('error')}, using FFmpeg")
         import shutil
-        shutil.copy(str(composed_video_path), final_output_local_path)
+        shutil.copy(str(remotion_input_path), str(final_output_local_path))
 else:
     _update_task(task_id, message="正在合成最终视频...", progress=90)

-    await video.compose(str(lipsync_video_path), str(final_audio_path), str(final_output_local_path))
+    if needs_audio_compose:
+        await video.compose(str(lipsync_video_path), str(final_audio_path), str(final_output_local_path))
+    else:
+        import shutil
+        shutil.copy(str(lipsync_video_path), str(final_output_local_path))

 total_time = time.time() - start_time

@@ -11,12 +11,12 @@ import asyncio
 import httpx
 from pathlib import Path
 from loguru import logger
-from typing import Optional
+from typing import Optional, Literal

 from app.core.config import settings


 class LipSyncService:
     """唇形同步服务 - LatentSync 1.6 + MuseTalk 1.5 混合方案"""

     def __init__(self):
@@ -121,47 +121,43 @@ class LipSyncService:
             logger.warning(f"⚠️ 视频循环异常: {e}")
             return video_path

     async def generate(
         self,
         video_path: str,
         audio_path: str,
         output_path: str,
-        fps: int = 25
+        fps: int = 25,
+        model_mode: Literal["default", "fast", "advanced"] = "default",
     ) -> str:
         """生成唇形同步视频"""
         logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}")
         Path(output_path).parent.mkdir(parents=True, exist_ok=True)

+        normalized_mode: Literal["default", "fast", "advanced"] = model_mode
+        if normalized_mode not in ("default", "fast", "advanced"):
+            normalized_mode = "default"
+        logger.info(f"🧠 Lipsync 模式: {normalized_mode}")
+
         if self.use_local:
-            return await self._local_generate(video_path, audio_path, output_path, fps)
+            return await self._local_generate(video_path, audio_path, output_path, fps, normalized_mode)
         else:
-            return await self._remote_generate(video_path, audio_path, output_path, fps)
+            return await self._remote_generate(video_path, audio_path, output_path, fps, normalized_mode)

     async def _local_generate(
         self,
         video_path: str,
         audio_path: str,
         output_path: str,
-        fps: int
+        fps: int,
+        model_mode: Literal["default", "fast", "advanced"],
     ) -> str:
         """使用 subprocess 调用 LatentSync conda 环境"""

-        # 检查前置条件
-        if not self._check_conda_env():
-            logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
-            shutil.copy(video_path, output_path)
-            return output_path
-
-        if not self._check_weights():
-            logger.warning("⚠️ 模型权重不存在,使用 Fallback")
-            shutil.copy(video_path, output_path)
-            return output_path
-
         logger.info("⏳ 等待 GPU 资源 (排队中)...")
         async with self._lock:
             # 使用临时目录存放中间文件
             with tempfile.TemporaryDirectory() as tmpdir:
                 tmpdir = Path(tmpdir)

                 # 获取音频和视频时长
                 audio_duration = self._get_media_duration(audio_path)
@@ -176,24 +172,53 @@ class LipSyncService:
                         str(looped_video),
                         audio_duration
                     )
                 else:
                     actual_video_path = video_path

-                # 混合路由: 长视频走 MuseTalk,短视频走 LatentSync
-                if audio_duration and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD:
-                    logger.info(
-                        f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s,路由到 MuseTalk"
-                    )
-                    musetalk_result = await self._call_musetalk_server(
-                        actual_video_path, audio_path, output_path
-                    )
-                    if musetalk_result:
-                        return musetalk_result
-                    logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync(长视频,会较慢)")
-
-                if self.use_server:
-                    # 模式 A: 调用常驻服务 (加速模式)
-                    return await self._call_persistent_server(actual_video_path, audio_path, output_path)
+                # 模型路由
+                force_musetalk = model_mode == "fast"
+                force_latentsync = model_mode == "advanced"
+                auto_to_musetalk = (
+                    model_mode == "default"
+                    and audio_duration is not None
+                    and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD
+                )
+
+                if force_musetalk:
+                    logger.info("⚡ 强制快速模型:MuseTalk")
+                    musetalk_result = await self._call_musetalk_server(
+                        actual_video_path, audio_path, output_path
+                    )
+                    if musetalk_result:
+                        return musetalk_result
+                    logger.warning("⚠️ MuseTalk 不可用,快速模型回退到 LatentSync")
+                elif auto_to_musetalk:
+                    logger.info(
+                        f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s,路由到 MuseTalk"
+                    )
+                    musetalk_result = await self._call_musetalk_server(
+                        actual_video_path, audio_path, output_path
+                    )
+                    if musetalk_result:
+                        return musetalk_result
+                    logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync(长视频,会较慢)")
+                elif force_latentsync:
+                    logger.info("🎯 强制高级模型:LatentSync")
+
+                # 检查 LatentSync 前置条件(仅在需要回退或使用 LatentSync 时)
+                if not self._check_conda_env():
+                    logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
+                    shutil.copy(video_path, output_path)
+                    return output_path
+
+                if not self._check_weights():
+                    logger.warning("⚠️ 模型权重不存在,使用 Fallback")
+                    shutil.copy(video_path, output_path)
+                    return output_path
+
+                if self.use_server:
+                    # 模式 A: 调用常驻服务 (加速模式)
+                    return await self._call_persistent_server(actual_video_path, audio_path, output_path)

                 logger.info("🔄 调用 LatentSync 推理 (subprocess)...")

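The three-way routing above is worth reducing to a pure function that can be unit-tested without any GPU. A minimal sketch, assuming the same three mode names and the duration threshold from settings; `choose_engine` is an illustrative helper, not an API in the repo:

```python
from typing import Optional


def choose_engine(model_mode: str, audio_duration: Optional[float],
                  threshold: float) -> str:
    """'fast' forces MuseTalk, 'advanced' forces LatentSync, and 'default'
    auto-routes long audio to MuseTalk (short audio stays on LatentSync)."""
    if model_mode == "fast":
        return "musetalk"
    if model_mode == "advanced":
        return "latentsync"
    if audio_duration is not None and audio_duration >= threshold:
        return "musetalk"
    return "latentsync"
```

Note the real code additionally falls back from MuseTalk to LatentSync when the MuseTalk server is unavailable; that availability check is deliberately left out of this pure routing sketch.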
@@ -388,15 +413,18 @@ class LipSyncService:
                 "请确保 LatentSync 服务已启动 (cd models/LatentSync && python scripts/server.py)"
             )

     async def _remote_generate(
         self,
         video_path: str,
         audio_path: str,
         output_path: str,
-        fps: int
+        fps: int,
+        model_mode: Literal["default", "fast", "advanced"],
     ) -> str:
         """调用远程 LatentSync API 服务"""
+        if model_mode == "fast":
+            logger.warning("⚠️ 远程模式未接入 MuseTalk,快速模型将使用远程 LatentSync")
         logger.info(f"📡 调用远程 API: {self.api_url}")

         try:
             async with httpx.AsyncClient(timeout=600.0) as client:
@@ -1,6 +1,7 @@
 """
 视频合成服务
 """
+import asyncio
 import os
 import subprocess
 import json
@@ -96,7 +97,7 @@ class VideoService:
             "-map", "0:a?",
             "-c:v", "libx264",
             "-preset", "fast",
-            "-crf", "23",
+            "-crf", "18",
             "-c:a", "copy",
             "-movflags", "+faststart",
             output_path,
@@ -199,9 +200,10 @@ class VideoService:
         """合成视频"""
         # Ensure output dir
         Path(output_path).parent.mkdir(parents=True, exist_ok=True)

-        video_duration = self._get_duration(video_path)
-        audio_duration = self._get_duration(audio_path)
+        loop = asyncio.get_running_loop()
+        video_duration = await loop.run_in_executor(None, self._get_duration, video_path)
+        audio_duration = await loop.run_in_executor(None, self._get_duration, audio_path)

         # Audio loop if needed
         loop_count = 1
@@ -228,7 +230,7 @@ class VideoService:
         # 不需要循环时用流复制(几乎瞬间完成),需要循环时才重编码
         if loop_count > 1:
             cmd.extend([
-                "-c:v", "libx264", "-preset", "fast", "-crf", "23",
+                "-c:v", "libx264", "-preset", "fast", "-crf", "18",
             ])
         else:
             cmd.extend(["-c:v", "copy"])
@@ -242,7 +244,8 @@ class VideoService:

         cmd.append(output_path)

-        if self._run_ffmpeg(cmd):
+        ok = await loop.run_in_executor(None, self._run_ffmpeg, cmd)
+        if ok:
             return output_path
         else:
             raise RuntimeError("FFmpeg composition failed")
@@ -267,12 +270,7 @@ class VideoService:
             "-fflags", "+genpts",
             "-i", str(list_path),
             "-an",
-            "-vsync", "cfr",
-            "-r", str(target_fps),
-            "-c:v", "libx264",
-            "-preset", "fast",
-            "-crf", "23",
-            "-pix_fmt", "yuv420p",
+            "-c:v", "copy",
             "-movflags", "+faststart",
             output_path,
         ]
@@ -346,6 +344,7 @@ class VideoService:
         needs_loop = target_duration > available
         needs_scale = target_resolution is not None
         needs_fps = bool(target_fps and target_fps > 0)
+        target_fps_value = int(target_fps) if needs_fps and target_fps is not None else None
         has_source_end = clip_end < video_dur

         # 当需要循环且存在截取范围时,先裁剪出片段,再循环裁剪后的文件
@@ -360,7 +359,7 @@ class VideoService:
             "-i", video_path,
             "-t", str(available),
             "-an",
-            "-c:v", "libx264", "-preset", "fast", "-crf", "23",
+            "-c:v", "libx264", "-preset", "fast", "-crf", "18",
             trim_temp,
         ]
         if not self._run_ffmpeg(trim_cmd):
@@ -380,20 +379,20 @@ class VideoService:
         cmd.extend(["-i", actual_input, "-t", str(target_duration), "-an"])

         filters = []
-        if needs_fps:
-            filters.append(f"fps={int(target_fps)}")
+        if target_fps_value is not None:
+            filters.append(f"fps={target_fps_value}")
         if needs_scale:
             w, h = target_resolution
             filters.append(f"scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:(ow-iw)/2:(oh-ih)/2")

         if filters:
             cmd.extend(["-vf", ",".join(filters)])
-        if needs_fps:
-            cmd.extend(["-vsync", "cfr", "-r", str(int(target_fps))])
+        if target_fps_value is not None:
+            cmd.extend(["-vsync", "cfr", "-r", str(target_fps_value)])

         # 需要循环、缩放或指定起点时必须重编码,否则用 stream copy 保持原画质
         if needs_loop or needs_scale or source_start > 0 or has_source_end or needs_fps:
-            cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
+            cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "18"])
         else:
             cmd.extend(["-c:v", "copy"])

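The stream-copy-versus-re-encode rule at the end of this hunk is the key quality lever: `-c:v copy` is lossless and near-instant, but is only valid when no filter, trim, or loop touches the frames. A testable sketch of that rule (the function name is illustrative, not from the repo):

```python
def needs_reencode(needs_loop: bool, needs_scale: bool, source_start: float,
                   has_source_end: bool, needs_fps: bool) -> bool:
    """Any frame-level operation (looping, scaling, trimming, fps conversion)
    forces a libx264 re-encode (now at CRF 18); otherwise FFmpeg can
    stream-copy the video track and preserve the original quality."""
    return needs_loop or needs_scale or source_start > 0 or has_source_end or needs_fps
```

This is why lowering CRF from 23 to 18 matters only on the re-encode branch: the copy branch never touches the pixels at all.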
@@ -124,6 +124,8 @@ interface RefAudio {
   created_at: number;
 }

+type LipsyncModelMode = "default" | "fast" | "advanced";
+
 import type { Material } from "@/shared/types/material";

 export const useHomeController = () => {
@@ -155,6 +157,7 @@ export const useHomeController = () => {
   const [titleDisplayMode, setTitleDisplayMode] = useState<"short" | "persistent">("short");
   const [subtitleBottomMargin, setSubtitleBottomMargin] = useState<number>(80);
   const [outputAspectRatio, setOutputAspectRatio] = useState<"9:16" | "16:9">("9:16");
+  const [lipsyncModelMode, setLipsyncModelMode] = useState<LipsyncModelMode>("default");
   const [showStylePreview, setShowStylePreview] = useState<boolean>(false);
   const [materialDimensions, setMaterialDimensions] = useState<{ width: number; height: number } | null>(null);

@@ -492,6 +495,8 @@ export const useHomeController = () => {
     setSubtitleBottomMargin,
     outputAspectRatio,
     setOutputAspectRatio,
+    lipsyncModelMode,
+    setLipsyncModelMode,
     selectedBgmId,
     setSelectedBgmId,
     bgmVolume,
@@ -934,6 +939,7 @@ export const useHomeController = () => {
       text: selectedAudio.text || text,
       generated_audio_id: selectedAudio.id,
       language: selectedAudio.language || textLang,
+      lipsync_model: lipsyncModelMode,
       title: videoTitle.trim() || undefined,
       enable_subtitles: true,
       output_aspect_ratio: outputAspectRatio,
@@ -1154,6 +1160,8 @@ export const useHomeController = () => {
     setSubtitleBottomMargin,
     outputAspectRatio,
     setOutputAspectRatio,
+    lipsyncModelMode,
+    setLipsyncModelMode,
     resolveAssetUrl,
     getFontFormat,
     buildTextShadow,
@@ -52,6 +52,8 @@ interface UseHomePersistenceOptions {
   setSubtitleBottomMargin: React.Dispatch<React.SetStateAction<number>>;
   outputAspectRatio: '9:16' | '16:9';
   setOutputAspectRatio: React.Dispatch<React.SetStateAction<'9:16' | '16:9'>>;
+  lipsyncModelMode: 'default' | 'fast' | 'advanced';
+  setLipsyncModelMode: React.Dispatch<React.SetStateAction<'default' | 'fast' | 'advanced'>>;
   selectedBgmId: string;
   setSelectedBgmId: React.Dispatch<React.SetStateAction<string>>;
   bgmVolume: number;
@@ -111,6 +113,8 @@ export const useHomePersistence = ({
   setSubtitleBottomMargin,
   outputAspectRatio,
   setOutputAspectRatio,
+  lipsyncModelMode,
+  setLipsyncModelMode,
   selectedBgmId,
   setSelectedBgmId,
   bgmVolume,
@@ -156,6 +160,7 @@ export const useHomePersistence = ({
     const savedTitleDisplayMode = localStorage.getItem(`vigent_${storageKey}_titleDisplayMode`);
    const savedSubtitleBottomMargin = localStorage.getItem(`vigent_${storageKey}_subtitleBottomMargin`);
    const savedOutputAspectRatio = localStorage.getItem(`vigent_${storageKey}_outputAspectRatio`);
+    const savedLipsyncModelMode = localStorage.getItem(`vigent_${storageKey}_lipsyncModelMode`);
     const savedSpeed = localStorage.getItem(`vigent_${storageKey}_speed`);
     const savedEmotion = localStorage.getItem(`vigent_${storageKey}_emotion`);

@@ -235,6 +240,14 @@ export const useHomePersistence = ({
       setOutputAspectRatio(savedOutputAspectRatio);
     }

+    if (
+      savedLipsyncModelMode === 'default'
+      || savedLipsyncModelMode === 'fast'
+      || savedLipsyncModelMode === 'advanced'
+    ) {
+      setLipsyncModelMode(savedLipsyncModelMode);
+    }
+
     if (savedSpeed) {
       const parsed = parseFloat(savedSpeed);
       if (!Number.isNaN(parsed)) setSpeed(parsed);
@@ -270,6 +283,7 @@ export const useHomePersistence = ({
     setTitleDisplayMode,
     setSubtitleBottomMargin,
     setOutputAspectRatio,
+    setLipsyncModelMode,
     setTtsMode,
     setVideoTitle,
     setVideoSecondaryTitle,
@@ -385,6 +399,12 @@ export const useHomePersistence = ({
     }
   }, [outputAspectRatio, storageKey, isRestored]);

+  useEffect(() => {
+    if (isRestored) {
+      localStorage.setItem(`vigent_${storageKey}_lipsyncModelMode`, lipsyncModelMode);
+    }
+  }, [lipsyncModelMode, storageKey, isRestored]);
+
   useEffect(() => {
     if (isRestored) {
       localStorage.setItem(`vigent_${storageKey}_bgmId`, selectedBgmId);
@@ -1,10 +1,14 @@
 import { Rocket } from "lucide-react";

+type LipsyncModelMode = "default" | "fast" | "advanced";
+
 interface GenerateActionBarProps {
   isGenerating: boolean;
   progress: number;
   disabled: boolean;
   materialCount?: number;
+  modelMode: LipsyncModelMode;
+  onModelModeChange: (value: LipsyncModelMode) => void;
   onGenerate: () => void;
 }

@@ -13,45 +17,61 @@ export function GenerateActionBar({
   progress,
   disabled,
   materialCount = 1,
+  modelMode,
+  onModelModeChange,
   onGenerate,
 }: GenerateActionBarProps) {
   return (
     <div>
-      <button
-        onClick={onGenerate}
-        disabled={disabled}
-        className={`w-full py-4 rounded-xl font-bold text-lg transition-all ${disabled
-          ? "bg-gray-600 cursor-not-allowed text-gray-400"
-          : "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
-          }`}
-      >
-        {isGenerating ? (
-          <span className="flex items-center justify-center gap-3">
-            <svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
-              <circle className="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" strokeWidth="4" fill="none" />
-              <path className="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z" />
-            </svg>
-            生成中... {progress}%
-          </span>
-        ) : (
-          <span className="flex items-center justify-center gap-2">
-            <Rocket className="h-5 w-5" />
-            生成视频
-          </span>
-        )}
-      </button>
+      <div className="flex items-center gap-2">
+        <button
+          onClick={onGenerate}
+          disabled={disabled}
+          className={`flex-1 py-4 rounded-xl font-bold text-lg transition-all ${disabled
+            ? "bg-gray-600 cursor-not-allowed text-gray-400"
+            : "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
+            }`}
+        >
+          {isGenerating ? (
+            <span className="flex items-center justify-center gap-3">
+              <svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
+                <circle className="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" strokeWidth="4" fill="none" />
+                <path className="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z" />
+              </svg>
+              生成中... {progress}%
+            </span>
+          ) : (
+            <span className="flex items-center justify-center gap-2">
+              <Rocket className="h-5 w-5" />
+              生成视频
+            </span>
+          )}
+        </button>
+
+        <select
+          value={modelMode}
+          onChange={(e) => onModelModeChange(e.target.value as LipsyncModelMode)}
+          disabled={isGenerating}
+          className="h-[58px] rounded-xl border border-white/15 bg-black/30 px-3 text-sm text-gray-200 outline-none focus:border-purple-400"
+          title="选择唇形模型"
+        >
+          <option value="default">默认模型</option>
+          <option value="fast">快速模型</option>
+          <option value="advanced">高级模型</option>
+        </select>
+      </div>
       {!isGenerating && materialCount >= 2 && (
         <p className="text-xs text-gray-400 text-center mt-1.5">
           多素材模式 ({materialCount} 个机位),生成耗时较长
@@ -97,6 +97,8 @@ export function HomePage() {
   setTitleDisplayMode,
   outputAspectRatio,
   setOutputAspectRatio,
+  lipsyncModelMode,
+  setLipsyncModelMode,
   resolveAssetUrl,
   getFontFormat,
   buildTextShadow,
@@ -431,6 +433,8 @@ export function HomePage() {
   progress={currentTask?.progress || 0}
   materialCount={selectedMaterials.length}
   disabled={isGenerating || selectedMaterials.length === 0 || !selectedAudio}
+  modelMode={lipsyncModelMode}
+  onModelModeChange={setLipsyncModelMode}
   onGenerate={handleGenerate}
 />
 </div>
@@ -253,21 +253,58 @@ class LipsyncPipeline(DiffusionPipeline):
         faces = []
         boxes = []
         affine_matrices = []
+        valid_face_flags = []
         print(f"Affine transforming {len(video_frames)} faces...")
         for frame in tqdm.tqdm(video_frames):
-            face, box, affine_matrix = self.image_processor.affine_transform(frame)
-            faces.append(face)
-            boxes.append(box)
-            affine_matrices.append(affine_matrix)
+            try:
+                face, box, affine_matrix = self.image_processor.affine_transform(frame)
+                faces.append(face)
+                boxes.append(box)
+                affine_matrices.append(affine_matrix)
+                valid_face_flags.append(True)
+            except Exception:
+                faces.append(None)
+                boxes.append(None)
+                affine_matrices.append(None)
+                valid_face_flags.append(False)
+
+        valid_indices = [i for i, flag in enumerate(valid_face_flags) if flag]
+        if not valid_indices:
+            raise RuntimeError("Face not detected in any frame")
+
+        for i in range(len(faces)):
+            if faces[i] is not None:
+                continue
+            nearest_idx = min(valid_indices, key=lambda idx: abs(idx - i))
+            faces[i] = faces[nearest_idx].clone()
+            boxes[i] = boxes[nearest_idx]
+            affine_matrices[i] = affine_matrices[nearest_idx]
+
+        missing_count = len(valid_face_flags) - len(valid_indices)
+        if missing_count > 0:
+            print(
+                f"Warning: face not detected in {missing_count}/{len(valid_face_flags)} frames. "
+                "Those frames will keep original content."
+            )

         faces = torch.stack(faces)
-        return faces, boxes, affine_matrices
+        return faces, boxes, affine_matrices, valid_face_flags

-    def restore_video(self, faces: torch.Tensor, video_frames: np.ndarray, boxes: list, affine_matrices: list):
+    def restore_video(
+        self,
+        faces: torch.Tensor,
+        video_frames: np.ndarray,
+        boxes: list,
+        affine_matrices: list,
+        valid_face_flags: Optional[list] = None,
+    ):
         video_frames = video_frames[: len(faces)]
         out_frames = []
         print(f"Restoring {len(faces)} faces...")
         for index, face in enumerate(tqdm.tqdm(faces)):
+            if valid_face_flags is not None and not valid_face_flags[index]:
+                out_frames.append(video_frames[index])
+                continue
             x1, y1, x2, y2 = boxes[index]
             height = int(y2 - y1)
             width = int(x2 - x1)
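The "borrow from the nearest valid frame" fallback above is easy to get wrong off-by-one, so it helps to check the index logic in isolation. A pure-Python sketch of the same idea (`fill_from_nearest` is an illustrative name, not part of the pipeline):

```python
def fill_from_nearest(values: list, valid_flags: list) -> list:
    """For every slot marked invalid, substitute the value from the nearest
    valid index, mirroring the face-detection fallback above."""
    valid_indices = [i for i, ok in enumerate(valid_flags) if ok]
    if not valid_indices:
        # Same contract as the pipeline: at least one frame must have a face
        raise RuntimeError("Face not detected in any frame")
    out = list(values)
    for i, ok in enumerate(valid_flags):
        if not ok:
            nearest = min(valid_indices, key=lambda idx: abs(idx - i))
            out[i] = values[nearest]
    return out
```

In the real pipeline the borrowed face tensor is additionally `.clone()`d so that `torch.stack` does not alias storage, and `restore_video` later pastes the original frame back wherever the flag is False, so the borrowed face only serves as a placeholder through the diffusion batch.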
@@ -281,33 +318,37 @@ class LipsyncPipeline(DiffusionPipeline):
     def loop_video(self, whisper_chunks: list, video_frames: np.ndarray):
         # If the audio is longer than the video, we need to loop the video
         if len(whisper_chunks) > len(video_frames):
-            faces, boxes, affine_matrices = self.affine_transform_video(video_frames)
+            faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames)
             num_loops = math.ceil(len(whisper_chunks) / len(video_frames))
             loop_video_frames = []
             loop_faces = []
             loop_boxes = []
             loop_affine_matrices = []
+            loop_valid_face_flags = []
             for i in range(num_loops):
                 if i % 2 == 0:
                     loop_video_frames.append(video_frames)
                     loop_faces.append(faces)
                     loop_boxes += boxes
                     loop_affine_matrices += affine_matrices
+                    loop_valid_face_flags += valid_face_flags
                 else:
                     loop_video_frames.append(video_frames[::-1])
                     loop_faces.append(faces.flip(0))
                     loop_boxes += boxes[::-1]
                     loop_affine_matrices += affine_matrices[::-1]
+                    loop_valid_face_flags += valid_face_flags[::-1]

             video_frames = np.concatenate(loop_video_frames, axis=0)[: len(whisper_chunks)]
             faces = torch.cat(loop_faces, dim=0)[: len(whisper_chunks)]
             boxes = loop_boxes[: len(whisper_chunks)]
             affine_matrices = loop_affine_matrices[: len(whisper_chunks)]
+            valid_face_flags = loop_valid_face_flags[: len(whisper_chunks)]
         else:
             video_frames = video_frames[: len(whisper_chunks)]
-            faces, boxes, affine_matrices = self.affine_transform_video(video_frames)
+            faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames)

-        return video_frames, faces, boxes, affine_matrices
+        return video_frames, faces, boxes, affine_matrices, valid_face_flags

     @torch.no_grad()
     def __call__(
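`loop_video` extends the clip by alternating forward and reversed passes (a ping-pong loop, which avoids a visible jump cut at the loop seam) and then truncates to the audio length. The index pattern can be sketched and verified on its own; `pingpong_indices` is an illustrative helper, not a function from the pipeline:

```python
import math


def pingpong_indices(num_frames: int, needed: int) -> list:
    """Frame indices for looping a clip forward/backward alternately,
    truncated to the needed length, as loop_video does above."""
    order = []
    for i in range(math.ceil(needed / num_frames)):
        # Even passes play forward, odd passes play in reverse
        block = range(num_frames) if i % 2 == 0 else range(num_frames - 1, -1, -1)
        order.extend(block)
    return order[:needed]
```

Note the flags list must be ping-ponged with exactly the same pattern as the frames, which is why the diff adds `loop_valid_face_flags` alongside every other per-frame list.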
@@ -367,7 +408,7 @@ class LipsyncPipeline(DiffusionPipeline):
         audio_samples = read_audio(audio_path)
         video_frames = read_video(video_path, use_decord=False)

-        video_frames, faces, boxes, affine_matrices = self.loop_video(whisper_chunks, video_frames)
+        video_frames, faces, boxes, affine_matrices, valid_face_flags = self.loop_video(whisper_chunks, video_frames)

         synced_video_frames = []

@@ -457,7 +498,13 @@ class LipsyncPipeline(DiffusionPipeline):
             )
             synced_video_frames.append(decoded_latents)

-        synced_video_frames = self.restore_video(torch.cat(synced_video_frames), video_frames, boxes, affine_matrices)
+        synced_video_frames = self.restore_video(
+            torch.cat(synced_video_frames),
+            video_frames,
+            boxes,
+            affine_matrices,
+            valid_face_flags=valid_face_flags,
+        )

         audio_samples_remain_length = int(synced_video_frames.shape[0] / video_fps * audio_sample_rate)
         audio_samples = audio_samples[:audio_samples_remain_length].cpu().numpy()
@@ -473,5 +520,5 @@ class LipsyncPipeline(DiffusionPipeline):

         sf.write(os.path.join(temp_dir, "audio.wav"), audio_samples, audio_sample_rate)

-        command = f"ffmpeg -y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0 {video_out_path}"
+        command = f"ffmpeg -y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v copy -c:a aac -q:a 0 {video_out_path}"
         subprocess.run(command, shell=True)
@@ -49,11 +49,22 @@ def read_video(video_path: str, change_fps=True, use_decord=True):
         if os.path.exists(temp_dir):
             shutil.rmtree(temp_dir)
         os.makedirs(temp_dir, exist_ok=True)
-        command = (
-            f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}"
-        )
-        subprocess.run(command, shell=True)
-        target_video_path = os.path.join(temp_dir, "video.mp4")
+
+        # 检测输入视频 FPS,已是 25fps 时跳过重编码
+        cap = cv2.VideoCapture(video_path)
+        current_fps = cap.get(cv2.CAP_PROP_FPS)
+        cap.release()
+
+        if abs(current_fps - 25.0) < 0.5:
+            # 已是 25fps,直接使用原文件(避免一次有损重编码)
+            print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion")
+            target_video_path = video_path
+        else:
+            command = (
+                f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}"
+            )
+            subprocess.run(command, shell=True)
+            target_video_path = os.path.join(temp_dir, "video.mp4")
     else:
         target_video_path = video_path
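The tolerance check above (comparing against 25fps with a 0.5fps window rather than exact equality, since container metadata often reports e.g. 25.02) reduces to a one-line predicate. A sketch with the same tolerance; `needs_fps_conversion` is an illustrative name only:

```python
def needs_fps_conversion(current_fps: float, target_fps: float = 25.0,
                         tol: float = 0.5) -> bool:
    """Skip the lossy FFmpeg pass when the clip is already close enough to
    the target frame rate, mirroring the read_video check above."""
    return abs(current_fps - target_fps) >= tol
```

Since `prepare_segment` upstream already outputs 25fps material, this branch turns the pipeline's step 3 from an unconditional CRF-18 re-encode into a no-op for the common case.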
@@ -4,14 +4,14 @@ MuseTalk v1.5 常驻推理服务 (优化版 v2)
 - GPU: 从 backend/.env 读取 MUSETALK_GPU_ID (默认 0)
 - 架构: FastAPI + lifespan (与 LatentSync server.py 同模式)

 优化项 (vs v1):
 1. cv2.VideoCapture 直读帧 (跳过 ffmpeg→PNG→imread)
 2. 人脸检测降频 (每 N 帧检测, 中间插值 bbox)
 3. BiSeNet mask 缓存 (每 N 帧更新, 中间复用)
-4. cv2.VideoWriter 直写视频 (跳过逐帧 PNG 写盘)
+4. FFmpeg rawvideo 管道直编码 (去掉中间有损 mp4v)
 5. batch_size 8→32
 6. 每阶段计时
 """

 import os
 import sys
Defaults for the new tunables live in `load_env_config`:

```diff
@@ -84,17 +84,28 @@ from musetalk.utils.utils import get_file_type, get_video_fps, datagen, load_all
 from musetalk.utils.preprocessing import get_landmark_and_bbox, read_imgs, coord_placeholder

 # --- 从 .env 读取额外配置 ---
-def load_env_config():
-    """读取 MuseTalk 相关环境变量"""
-    config = {
-        "batch_size": 32,
-        "version": "v15",
-        "use_float16": True,
-    }
-    try:
-        env_path = musetalk_root.parent.parent / "backend" / ".env"
-        if env_path.exists():
-            with open(env_path, "r", encoding="utf-8") as f:
+def load_env_config():
+    """读取 MuseTalk 相关环境变量"""
+    config = {
+        "batch_size": 32,
+        "version": "v15",
+        "use_float16": True,
+        "detect_every": 5,
+        "blend_cache_every": 5,
+        "audio_padding_left": 2,
+        "audio_padding_right": 2,
+        "extra_margin": 15,
+        "delay_frame": 0,
+        "blend_mode": "auto",
+        "faceparsing_left_cheek_width": 90,
+        "faceparsing_right_cheek_width": 90,
+        "encode_crf": 18,
+        "encode_preset": "medium",
+    }
+    try:
+        env_path = musetalk_root.parent.parent / "backend" / ".env"
+        if env_path.exists():
+            with open(env_path, "r", encoding="utf-8") as f:
                 for line in f:
                     line = line.strip()
                     if line.startswith("MUSETALK_BATCH_SIZE="):
```
Each tunable is parsed from `backend/.env` and exported as a module-level constant:

```diff
@@ -105,22 +116,78 @@ def load_env_config():
                     val = line.split("=")[1].strip().split("#")[0].strip()
                     if val:
                         config["version"] = val
-                elif line.startswith("MUSETALK_USE_FLOAT16="):
-                    val = line.split("=")[1].strip().split("#")[0].strip().lower()
-                    config["use_float16"] = val in ("true", "1", "yes")
-    except Exception as e:
-        print(f"⚠️ 读取额外配置失败: {e}")
-    return config
-
-env_config = load_env_config()
+                elif line.startswith("MUSETALK_USE_FLOAT16="):
+                    val = line.split("=")[1].strip().split("#")[0].strip().lower()
+                    config["use_float16"] = val in ("true", "1", "yes")
+                elif line.startswith("MUSETALK_DETECT_EVERY="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["detect_every"] = max(1, int(val))
+                elif line.startswith("MUSETALK_BLEND_CACHE_EVERY="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["blend_cache_every"] = max(1, int(val))
+                elif line.startswith("MUSETALK_AUDIO_PADDING_LEFT="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["audio_padding_left"] = max(0, int(val))
+                elif line.startswith("MUSETALK_AUDIO_PADDING_RIGHT="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["audio_padding_right"] = max(0, int(val))
+                elif line.startswith("MUSETALK_EXTRA_MARGIN="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["extra_margin"] = max(0, int(val))
+                elif line.startswith("MUSETALK_DELAY_FRAME="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["delay_frame"] = int(val)
+                elif line.startswith("MUSETALK_BLEND_MODE="):
+                    val = line.split("=")[1].strip().split("#")[0].strip().lower()
+                    if val in ("auto", "jaw", "raw"):
+                        config["blend_mode"] = val
+                elif line.startswith("MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["faceparsing_left_cheek_width"] = max(0, int(val))
+                elif line.startswith("MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["faceparsing_right_cheek_width"] = max(0, int(val))
+                elif line.startswith("MUSETALK_ENCODE_CRF="):
+                    val = line.split("=")[1].strip().split("#")[0].strip()
+                    if val:
+                        config["encode_crf"] = min(51, max(0, int(val)))
+                elif line.startswith("MUSETALK_ENCODE_PRESET="):
+                    val = line.split("=")[1].strip().split("#")[0].strip().lower()
+                    if val in (
+                        "ultrafast", "superfast", "veryfast", "faster", "fast",
+                        "medium", "slow", "slower", "veryslow"
+                    ):
+                        config["encode_preset"] = val
+    except Exception as e:
+        print(f"⚠️ 读取额外配置失败: {e}")
+    return config
+
+env_config = load_env_config()

 # 全局模型缓存
 models = {}

-# ===================== 优化参数 =====================
-DETECT_EVERY = 5        # 人脸检测降频: 每 N 帧检测一次
-BLEND_CACHE_EVERY = 5   # BiSeNet mask 缓存: 每 N 帧更新一次
-# ====================================================
+# ===================== 优化参数 =====================
+DETECT_EVERY = int(env_config["detect_every"])            # 人脸检测降频: 每 N 帧检测一次
+BLEND_CACHE_EVERY = int(env_config["blend_cache_every"])  # BiSeNet mask 缓存: 每 N 帧更新一次
+AUDIO_PADDING_LEFT = int(env_config["audio_padding_left"])
+AUDIO_PADDING_RIGHT = int(env_config["audio_padding_right"])
+EXTRA_MARGIN = int(env_config["extra_margin"])
+DELAY_FRAME = int(env_config["delay_frame"])
+BLEND_MODE = str(env_config["blend_mode"])
+FACEPARSING_LEFT_CHEEK_WIDTH = int(env_config["faceparsing_left_cheek_width"])
+FACEPARSING_RIGHT_CHEEK_WIDTH = int(env_config["faceparsing_right_cheek_width"])
+ENCODE_CRF = int(env_config["encode_crf"])
+ENCODE_PRESET = str(env_config["encode_preset"])
+# ====================================================


 def run_ffmpeg(cmd):
```
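The branch bodies all repeat one parsing idiom: take the text after `=`, strip an inline `#` comment, coerce, and clamp. A condensed sketch of that idiom (function names are illustrative, not from the codebase):

```python
def parse_env_value(line: str) -> str:
    """Extract the value part of KEY=VALUE, dropping an inline '#' comment."""
    return line.split("=", 1)[1].strip().split("#")[0].strip()

def clamped_int(raw: str, lo: int, hi: int, default: int) -> int:
    """Coerce to int and clamp to [lo, hi]; fall back to default on bad input."""
    try:
        return min(hi, max(lo, int(raw)))
    except ValueError:
        return default

line = "MUSETALK_ENCODE_CRF=16  # near-lossless"
print(clamped_int(parse_env_value(line), 0, 51, 18))  # 16
print(clamped_int("99", 0, 51, 18))                   # clamped to 51
```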
FaceParsing cheek widths are now configurable:

```diff
@@ -191,11 +258,14 @@ async def lifespan(app: FastAPI):
     whisper = whisper.to(device=device, dtype=weight_dtype).eval()
     whisper.requires_grad_(False)

-    # FaceParsing
-    if version == "v15":
-        fp = FaceParsing(left_cheek_width=90, right_cheek_width=90)
-    else:
-        fp = FaceParsing()
+    # FaceParsing
+    if version == "v15":
+        fp = FaceParsing(
+            left_cheek_width=FACEPARSING_LEFT_CHEEK_WIDTH,
+            right_cheek_width=FACEPARSING_RIGHT_CHEEK_WIDTH,
+        )
+    else:
+        fp = FaceParsing()

     # 恢复工作目录
     os.chdir(original_cwd)
```
Startup logging now prints the full effective configuration:

```diff
@@ -211,9 +281,13 @@ async def lifespan(app: FastAPI):
     models["version"] = version
     models["timesteps"] = torch.tensor([0], device=device)

-    print("✅ MuseTalk v1.5 模型加载完成,服务就绪!")
-    print(f"⚙️ 优化参数: batch_size={env_config['batch_size']}, "
-          f"detect_every={DETECT_EVERY}, blend_cache_every={BLEND_CACHE_EVERY}")
+    print("✅ MuseTalk v1.5 模型加载完成,服务就绪!")
+    print(f"⚙️ 优化参数: batch_size={env_config['batch_size']}, "
+          f"detect_every={DETECT_EVERY}, blend_cache_every={BLEND_CACHE_EVERY}, "
+          f"audio_padding=({AUDIO_PADDING_LEFT},{AUDIO_PADDING_RIGHT}), extra_margin={EXTRA_MARGIN}, "
+          f"delay_frame={DELAY_FRAME}, blend_mode={BLEND_MODE}, "
+          f"faceparsing_cheek=({FACEPARSING_LEFT_CHEEK_WIDTH},{FACEPARSING_RIGHT_CHEEK_WIDTH}), "
+          f"encode=libx264/{ENCODE_PRESET}/crf{ENCODE_CRF}")
     yield
     models.clear()
     torch.cuda.empty_cache()
```
The inference docstring follows suit:

```diff
@@ -354,15 +428,15 @@ def _detect_faces_subsampled(frames, detect_every=5):
 # 核心推理 (优化版)
 # =====================================================================
 @torch.no_grad()
-def _run_inference(req: LipSyncRequest) -> dict:
-    """
-    优化版推理逻辑:
-    1. cv2.VideoCapture 直读帧 (跳过 ffmpeg→PNG→imread)
-    2. 人脸检测降频 (每 N 帧, 中间插值)
-    3. BiSeNet mask 缓存 (每 N 帧更新)
-    4. cv2.VideoWriter 直写 (跳过逐帧 PNG)
-    5. 每阶段计时
+def _run_inference(req: LipSyncRequest) -> dict:
+    """
+    优化版推理逻辑:
+    1. cv2.VideoCapture 直读帧 (跳过 ffmpeg→PNG→imread)
+    2. 人脸检测降频 (每 N 帧, 中间插值)
+    3. BiSeNet mask 缓存 (每 N 帧更新)
+    4. FFmpeg rawvideo 管道直编码 (无中间有损文件)
+    5. 每阶段计时
     """
     vae = models["vae"]
     unet = models["unet"]
     pe = models["pe"]
```
Whisper audio padding comes from configuration:

```diff
@@ -411,12 +485,12 @@ def _run_inference(req: LipSyncRequest) -> dict:
     # ===== Phase 2: Whisper 音频特征 =====
     t0 = time.time()
     whisper_input_features, librosa_length = audio_processor.get_audio_feature(audio_path)
-    whisper_chunks = audio_processor.get_whisper_chunk(
-        whisper_input_features, device, weight_dtype, whisper, librosa_length,
-        fps=fps,
-        audio_padding_length_left=2,
-        audio_padding_length_right=2,
-    )
+    whisper_chunks = audio_processor.get_whisper_chunk(
+        whisper_input_features, device, weight_dtype, whisper, librosa_length,
+        fps=fps,
+        audio_padding_length_left=AUDIO_PADDING_LEFT,
+        audio_padding_length_right=AUDIO_PADDING_RIGHT,
+    )
     timings["2_whisper"] = time.time() - t0
     print(f"🎵 Whisper 特征 [{timings['2_whisper']:.1f}s]")
```
The VAE crop margin is read from `EXTRA_MARGIN`:

```diff
@@ -427,12 +501,12 @@ def _run_inference(req: LipSyncRequest) -> dict:
     print(f"🔍 人脸检测 [{timings['3_face']:.1f}s]")

     # ===== Phase 4: VAE 潜空间编码 =====
-    t0 = time.time()
-    input_latent_list = []
-    extra_margin = 15
-    for bbox, frame in zip(coord_list, frames):
-        if bbox == coord_placeholder:
-            continue
+    t0 = time.time()
+    input_latent_list = []
+    extra_margin = EXTRA_MARGIN
+    for bbox, frame in zip(coord_list, frames):
+        if bbox == coord_placeholder:
+            continue
         x1, y1, x2, y2 = bbox
         if version == "v15":
             y2 = min(y2 + extra_margin, frame.shape[0])
```
`datagen` now takes the configurable `DELAY_FRAME`:

```diff
@@ -453,13 +527,13 @@ def _run_inference(req: LipSyncRequest) -> dict:
     input_latent_list_cycle = input_latent_list + input_latent_list[::-1]

     video_num = len(whisper_chunks)
-    gen = datagen(
-        whisper_chunks=whisper_chunks,
-        vae_encode_latents=input_latent_list_cycle,
-        batch_size=batch_size,
-        delay_frame=0,
-        device=device,
-    )
+    gen = datagen(
+        whisper_chunks=whisper_chunks,
+        vae_encode_latents=input_latent_list_cycle,
+        batch_size=batch_size,
+        delay_frame=DELAY_FRAME,
+        device=device,
+    )

     res_frame_list = []
     total_batches = int(np.ceil(float(video_num) / batch_size))
```
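`input_latent_list + input_latent_list[::-1]` is the standard ping-pong trick: when the audio is longer than the clip, source frames are reused forward-then-backward so the loop point has no visual jump. A sketch of the indexing it enables:

```python
def pingpong(items):
    """Mirror a sequence so indexing with i % len(...) sweeps back and forth."""
    return items + items[::-1]

frames = ["f0", "f1", "f2", "f3"]
cycle = pingpong(frames)
picked = [cycle[i % len(cycle)] for i in range(10)]
print(picked)  # ['f0', 'f1', 'f2', 'f3', 'f3', 'f2', 'f1', 'f0', 'f0', 'f1']
```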
The core change: composed frames go straight into an FFmpeg rawvideo pipe instead of a lossy intermediate `mp4v` file:

```diff
@@ -479,21 +553,44 @@ def _run_inference(req: LipSyncRequest) -> dict:
     timings["5_unet"] = time.time() - t0
     print(f"✅ UNet 推理: {len(res_frame_list)} 帧 [{timings['5_unet']:.1f}s]")

-    # ===== Phase 6: 合成 (cv2.VideoWriter + 纯 numpy blending) =====
-    t0 = time.time()
-
-    h, w = frames[0].shape[:2]
-    temp_raw_path = output_vid_path + ".raw.mp4"
-
-    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
-    writer = cv2.VideoWriter(temp_raw_path, fourcc, fps, (w, h))
-
-    if not writer.isOpened():
-        raise RuntimeError(f"cv2.VideoWriter 打开失败: {temp_raw_path}")
-
-    cached_mask = None
-    cached_crop_box = None
-    blend_mode = "jaw" if version == "v15" else "raw"
+    # ===== Phase 6: 合成并写入 FFmpeg rawvideo 管道 =====
+    t0 = time.time()
+
+    h, w = frames[0].shape[:2]
+    ffmpeg_cmd = [
+        "ffmpeg", "-y", "-v", "warning",
+        "-f", "rawvideo",
+        "-pix_fmt", "bgr24",
+        "-s", f"{w}x{h}",
+        "-r", str(fps),
+        "-i", "-",
+        "-i", audio_path,
+        "-c:v", "libx264", "-preset", ENCODE_PRESET, "-crf", str(ENCODE_CRF), "-pix_fmt", "yuv420p",
+        "-c:a", "copy", "-shortest",
+        output_vid_path,
+    ]
+    ffmpeg_proc = subprocess.Popen(
+        ffmpeg_cmd,
+        stdin=subprocess.PIPE,
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.DEVNULL,
+    )
+    pipe_in = ffmpeg_proc.stdin
+    if pipe_in is None:
+        raise RuntimeError("FFmpeg 管道初始化失败")
+
+    def _write_pipe_frame(frame: np.ndarray):
+        try:
+            pipe_in.write(np.ascontiguousarray(frame, dtype=np.uint8).tobytes())
+        except BrokenPipeError as exc:
+            raise RuntimeError("FFmpeg 管道写入失败") from exc
+
+    cached_mask = None
+    cached_crop_box = None
+    if BLEND_MODE == "auto":
+        blend_mode = "jaw" if version == "v15" else "raw"
+    else:
+        blend_mode = BLEND_MODE

     for i in tqdm(range(len(res_frame_list)), desc="合成"):
         res_frame = res_frame_list[i]
```
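The handoff contract for the rawvideo pipe is simple: every write must be exactly `h * w * 3` bytes of packed BGR data, and FFmpeg segments the stream into frames using the `-s`/`-pix_fmt` flags. A sketch of the sizing math and the command shape (file paths are placeholders):

```python
import numpy as np

def rawvideo_cmd(w: int, h: int, fps: float, audio: str, out: str,
                 crf: int = 18, preset: str = "medium") -> list[str]:
    """Build an FFmpeg command mirroring the flags used above."""
    return [
        "ffmpeg", "-y", "-v", "warning",
        "-f", "rawvideo", "-pix_fmt", "bgr24", "-s", f"{w}x{h}", "-r", str(fps),
        "-i", "-",                       # frames arrive on stdin
        "-i", audio,
        "-c:v", "libx264", "-preset", preset, "-crf", str(crf),
        "-pix_fmt", "yuv420p",
        "-c:a", "copy", "-shortest",
        out,
    ]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # one 1280x720 BGR frame
payload = np.ascontiguousarray(frame).tobytes()
print(len(payload))                                # 2764800 == 1280 * 720 * 3
```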
All frame writes in the blend loop switch to `_write_pipe_frame`:

```diff
@@ -503,26 +600,26 @@ def _run_inference(req: LipSyncRequest) -> dict:
         x1, y1, x2, y2 = bbox
         if version == "v15":
             y2 = min(y2 + extra_margin, ori_frame.shape[0])
-        adjusted_bbox = (x1, y1, x2, y2)
-
-        try:
-            res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
-        except Exception:
-            writer.write(ori_frame)
-            continue
+        adjusted_bbox = (x1, y1, x2, y2)
+
+        try:
+            res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
+        except Exception:
+            _write_pipe_frame(ori_frame)
+            continue

         # 每 N 帧更新 BiSeNet 人脸解析 mask, 其余帧复用缓存
         if i % BLEND_CACHE_EVERY == 0 or cached_mask is None:
             try:
                 cached_mask, cached_crop_box = get_image_prepare_material(
                     ori_frame, adjusted_bbox, mode=blend_mode, fp=fp)
-            except Exception:
-                # 如果 prepare 失败, 用完整方式
-                combine_frame = get_image(
-                    ori_frame, res_frame, list(adjusted_bbox),
-                    mode=blend_mode, fp=fp)
-                writer.write(combine_frame)
-                continue
+            except Exception:
+                # 如果 prepare 失败, 用完整方式
+                combine_frame = get_image(
+                    ori_frame, res_frame, list(adjusted_bbox),
+                    mode=blend_mode, fp=fp)
+                _write_pipe_frame(combine_frame)
+                continue

         try:
             combine_frame = get_image_blending_fast(
```
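The `i % BLEND_CACHE_EVERY == 0 or cached_mask is None` guard is a plain recompute-every-N cache; the expensive BiSeNet parse runs on a sparse schedule and intermediate frames reuse the last mask. Isolated (names illustrative):

```python
def recompute_frames(total: int, every: int) -> list:
    """Frames on which the expensive mask would be recomputed."""
    return [i for i in range(total) if i % every == 0]

print(recompute_frames(12, 5))  # [0, 5, 10] -> 3 mask computations instead of 12
print(recompute_frames(12, 1))  # every frame, i.e. the unoptimized behaviour
```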
The final mux is gone; Phase 7 just waits for the encoder to drain:

```diff
@@ -532,35 +629,25 @@ def _run_inference(req: LipSyncRequest) -> dict:
             try:
                 combine_frame = get_image_blending(
                     ori_frame, res_frame, adjusted_bbox, cached_mask, cached_crop_box)
-            except Exception:
-                combine_frame = get_image(
-                    ori_frame, res_frame, list(adjusted_bbox),
-                    mode=blend_mode, fp=fp)
-
-        writer.write(combine_frame)
-
-    writer.release()
-    timings["6_blend"] = time.time() - t0
-    print(f"🎨 合成 [{timings['6_blend']:.1f}s]")
-
-    # ===== Phase 7: FFmpeg H.264 编码 + 合并音频 =====
-    t0 = time.time()
-    cmd = [
-        "ffmpeg", "-y", "-v", "warning",
-        "-i", temp_raw_path, "-i", audio_path,
-        "-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p",
-        "-c:a", "copy", "-shortest",
-        output_vid_path
-    ]
-    if not run_ffmpeg(cmd):
-        raise RuntimeError("FFmpeg 重编码+音频合并失败")
-
-    # 清理临时文件
-    if os.path.exists(temp_raw_path):
-        os.unlink(temp_raw_path)
-
-    timings["7_encode"] = time.time() - t0
-    print(f"🔊 编码+音频 [{timings['7_encode']:.1f}s]")
+            except Exception:
+                combine_frame = get_image(
+                    ori_frame, res_frame, list(adjusted_bbox),
+                    mode=blend_mode, fp=fp)
+
+        _write_pipe_frame(combine_frame)
+
+    pipe_in.close()
+    timings["6_blend"] = time.time() - t0
+    print(f"🎨 合成 [{timings['6_blend']:.1f}s]")
+
+    # ===== Phase 7: 等待 FFmpeg 编码完成 =====
+    t0 = time.time()
+    return_code = ffmpeg_proc.wait()
+    if return_code != 0:
+        raise RuntimeError("FFmpeg 编码+音频合并失败")
+
+    timings["7_encode"] = time.time() - t0
+    print(f"🔊 编码+音频 [{timings['7_encode']:.1f}s]")

     # ===== 汇总 =====
     total_time = time.time() - t_total
```
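One subtlety of the pipe design: if FFmpeg dies mid-stream, the next `stdin.write` raises `BrokenPipeError` deep inside the blend loop. Wrapping it (as `_write_pipe_frame` does) surfaces one clear task failure instead of a confusing traceback. A self-contained sketch with a stand-in pipe object:

```python
class DeadPipe:
    """Stand-in for an encoder process that has already exited."""
    def write(self, data: bytes) -> None:
        raise BrokenPipeError

def write_frame(pipe, frame_bytes: bytes) -> None:
    try:
        pipe.write(frame_bytes)
    except BrokenPipeError as exc:
        raise RuntimeError("FFmpeg pipe write failed") from exc

try:
    write_frame(DeadPipe(), b"\x00" * 16)
except RuntimeError as err:
    print(err)  # FFmpeg pipe write failed
```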
The Remotion fix in `remotion/render.ts`:

```diff
@@ -185,9 +185,50 @@ async function main() {
   const currentHash = getSourceHash();
   let bundleLocation: string;

+  // 辅助函数: 确保文件在缓存 public 目录中可访问 (硬链接 > 复制)
+  function ensureInCachedPublic(cachedPublicDir: string, srcAbsPath: string, fileName: string) {
+    const cachedPath = path.join(cachedPublicDir, fileName);
+    // 已存在且大小一致,跳过
+    try {
+      if (fs.existsSync(cachedPath)) {
+        const srcStat = fs.statSync(srcAbsPath);
+        const cachedStat = fs.statSync(cachedPath);
+        if (srcStat.size === cachedStat.size && srcStat.ino === cachedStat.ino) return;
+      }
+    } catch { /* file doesn't exist or broken, will recreate */ }
+    // 移除旧的文件/链接
+    try { fs.unlinkSync(cachedPath); } catch { /* doesn't exist, fine */ }
+    // 优先硬链接(零拷贝,对应用透明),跨文件系统时回退为复制
+    try {
+      fs.linkSync(srcAbsPath, cachedPath);
+      console.log(`Hardlinked into cached bundle: ${fileName}`);
+    } catch {
+      fs.copyFileSync(srcAbsPath, cachedPath);
+      console.log(`Copied into cached bundle: ${fileName}`);
+    }
+  }
+
   if (fs.existsSync(hashFile) && fs.readFileSync(hashFile, 'utf-8') === currentHash) {
     bundleLocation = BUNDLE_CACHE_DIR;
     console.log('Using cached bundle');
+    // 确保当前渲染所需的文件在缓存 bundle 的 public 目录中可访问
+    const cachedPublicDir = path.join(BUNDLE_CACHE_DIR, 'public');
+    if (!fs.existsSync(cachedPublicDir)) {
+      fs.mkdirSync(cachedPublicDir, { recursive: true });
+    }
+    // 1) 视频文件
+    ensureInCachedPublic(cachedPublicDir, path.resolve(options.videoPath), videoFileName);
+    // 2) 字体文件 (从 subtitleStyle / titleStyle / secondaryTitleStyle 中提取)
+    const styleSources = [options.subtitleStyle, options.titleStyle, options.secondaryTitleStyle];
+    for (const style of styleSources) {
+      const fontFile = (style as Record<string, unknown>)?.font_file as string | undefined;
+      if (fontFile) {
+        const fontSrcPath = path.join(publicDir, fontFile);
+        if (fs.existsSync(fontSrcPath)) {
+          ensureInCachedPublic(cachedPublicDir, path.resolve(fontSrcPath), fontFile);
+        }
+      }
+    }
   } else {
     console.log('Bundling Remotion project...');
     console.log(`Entry point: ${entryPoint}`);
```
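The same link-then-copy strategy, sketched in Python for comparison: `os.link` is the `fs.linkSync` equivalent, raising `OSError` across filesystems, where a real copy is the fallback (file names below are placeholders):

```python
import os
import shutil
import tempfile

def ensure_in_dir(src: str, dst_dir: str) -> str:
    """Make src visible in dst_dir: hard link when possible, else copy."""
    dst = os.path.join(dst_dir, os.path.basename(src))
    if os.path.exists(dst):
        os.unlink(dst)
    try:
        os.link(src, dst)          # same inode, zero bytes copied
    except OSError:
        shutil.copyfile(src, dst)  # cross-filesystem fallback
    return dst

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "NotoSans.ttf")
with open(src, "wb") as f:
    f.write(b"glyph data")
public = os.path.join(tmp, "public")
os.makedirs(public)
dst = ensure_in_dir(src, public)
with open(dst, "rb") as f:
    data = f.read()
same_inode = os.stat(src).st_ino == os.stat(dst).st_ino
print(same_inode, data == b"glyph data")  # True True when on one filesystem
shutil.rmtree(tmp)
```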