diff --git a/Docs/DevLogs/Day30.md b/Docs/DevLogs/Day30.md
new file mode 100644
index 0000000..7c98fbb
--- /dev/null
+++ b/Docs/DevLogs/Day30.md
@@ -0,0 +1,363 @@
## Remotion Cache Fix + Encoding-Pipeline Quality Optimization + Lip-Sync Fault Tolerance + Model Selection (Day 30)

### Overview

This round addresses four areas: (1) a serious bug where Remotion's bundle cache caused titles/subtitles to disappear; (2) a full overhaul of the LatentSync + MuseTalk dual-engine encoding pipeline to eliminate redundant lossy encodes; (3) improved LatentSync robustness, so inference continues instead of aborting when a face cannot be detected in some frames; (4) a front-end lip-sync model selector that lets users switch between default/fast/advanced models.

---

## ✅ Changes

### 1. Remotion Bundle Cache 404 Fix (Serious Bug)

- **Symptom**: generated videos had no title or subtitles; Remotion rendering failed and silently fell back to FFmpeg (which cannot overlay text)
- **Root cause**: Remotion's bundle cache copies `publicDir` (where videos/fonts live) only on the first bundle. Once the code stabilizes, the cache keeps hitting, and newly generated video and font files are absent from the old cache's `public/` directory → Remotion's HTTP server returns 404 → rendering fails
- **First attempt**: symlinks via `fs.symlinkSync`, but Remotion's internal HTTP server does not follow symlinks
- **Final fix**: hard links via `fs.linkSync` (zero-copy on the same filesystem, fully transparent to the app), falling back to `fs.copyFileSync` across filesystems

**File**: `remotion/render.ts`

```typescript
function ensureInCachedPublic(cachedPublicDir, srcAbsPath, fileName) {
  const cachedPath = path.join(cachedPublicDir, fileName);
  // Skip if the target already exists and shares the same inode
  // Prefer a hard link (zero-copy); fall back to a copy across filesystems
  try {
    fs.linkSync(srcAbsPath, cachedPath);
  } catch {
    fs.copyFileSync(srcAbsPath, cachedPath);
  }
}
```

When the cached bundle is used, the files the current render needs (video + fonts) are hard-linked into the cached `public/` directory:
- the video file (`videoFileName`)
- font files (extracted from the `font_file` field of `subtitleStyle` / `titleStyle` / `secondaryTitleStyle`)

---

### 2. Video Encoding Pipeline Quality Optimization

A top-to-bottom audit of the pipeline found that, from material upload to final output, a video undergoes up to **5-6 lossy re-encodes**, versus only 1-2 in the official LatentSync demo.

#### Encoding chain before

| # | Stage | CRF | Problem |
|---|------|-----|------|
| 1 | Orientation normalization | 23 | Conditional |
| 2 | `prepare_segment` scale + duration | 23 | Always runs; quality too low |
| 3 | LatentSync `read_video` FPS conversion | 18 | **Re-encodes even when already 25fps** |
| 4 | LatentSync `imageio` frame writing | 13 | Model output |
| 5 | LatentSync final mux | 18 | **Re-encoded at CRF 18 immediately after being written at CRF 13** |
| 6 | compose | copy | Already optimized on Day 29 |
| 7 | Multi-material concat | 23 | **Segment parameters already unified; re-encode unnecessary** |
| 8 | Remotion render | ~18 | Always runs (text overlay) |

#### Optimizations

##### 2a. LatentSync `read_video`: skip the redundant FPS re-encode

**File**: `models/LatentSync/latentsync/utils/util.py`

- The original code unconditionally ran `ffmpeg -r 25 -crf 18`, even when the input was already 25fps
- Added an FPS check: when `abs(current_fps - 25.0) < 0.5`, use the original file directly
- Our `prepare_segment` already outputs a uniform 25fps, so this step was entirely redundant

```python
cap = cv2.VideoCapture(video_path)
current_fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

if abs(current_fps - 25.0) < 0.5:
    print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion")
    target_video_path = video_path
else:
    # Re-encode only when the input is not 25fps
    command = f"ffmpeg ... -r 25 -crf 18 ..."
```

##### 2b. LatentSync final mux: stream copy instead of re-encoding

**File**: `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py`

- Original: after `imageio` wrote frames at high quality (CRF 13), the final mux fully re-encoded them with `libx264 -crf 18`
- Fix: switch to `-c:v copy` stream copy and only mux in the audio track; the video stream is untouched

```diff
- ffmpeg ... -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0
+ ffmpeg ... -c:v copy -c:a aac -q:a 0
```

##### 2c. `prepare_segment` + `normalize_orientation`: CRF 23 → 18

**File**: `backend/app/services/video_service.py`

- `normalize_orientation`: CRF 23 → 18
- `prepare_segment` trim temp file: CRF 23 → 18
- `prepare_segment` main command: CRF 23 → 18
- CRF 18 is the "high quality" level and matches LatentSync's internal standard
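The FPS check in 2a reduces to a small pure predicate, which makes the skip decision easy to unit-test without touching FFmpeg. A minimal sketch (the function name and parameterized tolerance are ours; the logic mirrors the `abs(current_fps - 25.0) < 0.5` test above):

```python
def needs_fps_conversion(current_fps: float, target_fps: float = 25.0,
                         tolerance: float = 0.5) -> bool:
    """Return True when the input must be re-encoded to hit the target FPS.

    Mirrors the check added to read_video(): anything within `tolerance`
    of the target is treated as already conformed and reused as-is.
    """
    return abs(current_fps - target_fps) >= tolerance


# Segments produced by prepare_segment are exactly 25fps -> no re-encode.
print(needs_fps_conversion(25.0))   # False
# Typical 29.97fps camera footage still needs conversion.
print(needs_fps_conversion(29.97))  # True
```

The 0.5fps tolerance also absorbs container metadata that reports 25fps as, say, 24.99.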
##### 2d. Multi-material concat via stream copy

**File**: `backend/app/services/video_service.py`

- The original code concatenated with a full `libx264 -crf 23` re-encode
- All segments have already been normalized by `prepare_segment` to the same resolution, frame rate, and encoding parameters
- Switched to `-c:v copy` stream copy, eliminating one complete re-encode

```diff
- -vsync cfr -r 25 -c:v libx264 -preset fast -crf 23 -pix_fmt yuv420p
+ -c:v copy
```

#### Encoding chain after

| # | Stage | CRF | Status |
|---|------|-----|------|
| 1 | Orientation normalization | **18** | quality raised (conditional) |
| 2 | `prepare_segment` | **18** | quality raised (always runs) |
| 3 | ~~LatentSync FPS conversion~~ | - | **eliminated** |
| 4 | LatentSync model output | 13 | unchanged (unavoidable) |
| 5 | ~~LatentSync final mux~~ | - | **eliminated (copy)** |
| 6 | compose | copy | unchanged |
| 7 | ~~Multi-material concat~~ | - | **eliminated (copy)** |
| 8 | Remotion render | ~18 | unchanged (unavoidable) |

**Total: 5-6 lossy encodes → 3** (prepare_segment → LatentSync model output → Remotion), cutting quality loss roughly in half.

---

## 📁 Changed Files

| File | Change |
|------|------|
| `remotion/render.ts` | hard-link video + fonts into the cached bundle's public dir when the cache is used |
| `models/LatentSync/latentsync/utils/util.py` | `read_video` detects FPS and skips re-encoding at 25fps |
| `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py` | final mux `-c:v copy`; faceless-frame tolerance (affine_transform + restore_video) |
| `backend/app/services/video_service.py` | `normalize_orientation` CRF 23→18; `prepare_segment` CRF 23→18; `concat_videos` `-c:v copy` |
| `backend/app/modules/videos/workflow.py` | single-material path falls back to the original video on LatentSync failure |

---

### 3. LatentSync Tolerance for Faceless Frames

- **Problem**: if a face cannot be detected in some frames of the material (head turn, occlusion, empty shot), `affine_transform` raises and the entire inference task fails
- **Changes**:
  - `affine_transform_video`: catch per-frame exceptions and fill with the face/box/affine_matrix of the nearest valid frame (keeping the tensor batch dimension intact); still raise when no frame contains a face
  - `restore_video`: new `valid_face_flags` parameter; faceless frames keep the original picture (no mouth replacement)
  - `loop_video`: `valid_face_flags` follows looping and flipping
  - `workflow.py`: on the single-material path, if `lipsync.generate()` fails as a whole, copy the original video and continue; the task no longer fails
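The nearest-valid-frame fill in change 3 can be isolated as a pure index-mapping step. A minimal sketch (the function name is ours) of how each faceless frame borrows its face data from the closest detected neighbor, matching the `min(valid_indices, key=...)` fallback added to `affine_transform_video`:

```python
def fill_from_nearest_valid(valid_face_flags: list[bool]) -> list[int]:
    """For each frame, return the index of the nearest frame with a detected face.

    Faceless frames reuse the face/box/affine_matrix of the closest valid
    frame so the tensor batch keeps its full length; raises when no frame
    has a face, as the pipeline does.
    """
    valid_indices = [i for i, flag in enumerate(valid_face_flags) if flag]
    if not valid_indices:
        raise RuntimeError("Face not detected in any frame")
    return [
        i if flag else min(valid_indices, key=lambda idx: abs(idx - i))
        for i, flag in enumerate(valid_face_flags)
    ]


# Frame 2 borrows from frame 1; frame 3 borrows from frame 4.
print(fill_from_nearest_valid([True, True, False, False, True]))  # [0, 1, 1, 4, 4]
```

On a distance tie, `min` keeps the first (earlier) valid index, which biases borrowing toward the preceding frame.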
### 4. MuseTalk Encoding Chain Optimization

#### 4a. Direct encoding through an FFmpeg rawvideo pipe (no intermediate lossy file)

**File**: `models/MuseTalk/scripts/server.py`

- **Old flow**: UNet output frames → `cv2.VideoWriter(mp4v)` intermediate file (lossy) → FFmpeg re-encode + audio mux (lossy again)
- **New flow**: UNet output frames → FFmpeg rawvideo stdin pipe → a single libx264 encode + audio mux

```python
ffmpeg_cmd = [
    "ffmpeg", "-y", "-v", "warning",
    "-f", "rawvideo", "-pix_fmt", "bgr24",
    "-s", f"{w}x{h}", "-r", str(fps),
    "-i", "-",  # stdin pipe input
    "-i", audio_path,
    "-c:v", "libx264", "-preset", ENCODE_PRESET, "-crf", str(ENCODE_CRF),
    "-pix_fmt", "yuv420p",
    "-c:a", "copy", "-shortest",
    output_vid_path,
]
ffmpeg_proc = subprocess.Popen(ffmpeg_cmd, stdin=subprocess.PIPE, ...)
# Each frame goes straight to pipe_in.write(frame.tobytes())
```

Key implementation details:
- `-pix_fmt bgr24` matches OpenCV's native frame format, so no pixel conversion is needed
- `np.ascontiguousarray` guarantees contiguous frame memory
- `BrokenPipeError` handling plus a return-code check cover the failure paths
- `pipe_in.close()` before `ffmpeg_proc.wait()` correctly signals EOF
- every compositing fallback (resize failure, mask failure, blending failure) writes the original frame via `_write_pipe_frame`

#### 4b. MuseTalk parameters moved to environment variables + a quality-first profile

**File**: `models/MuseTalk/scripts/server.py` + `backend/.env`

All inference and encoding parameters moved from hard-coded values to `.env`, currently set to a "quality-first" profile:

| Parameter | Old default | Quality-first | Effect |
|------|----------|-----------|------|
| `MUSETALK_DETECT_EVERY` | 5 | **2** | face detection 2.5x more frequent, steadier tracking |
| `MUSETALK_BLEND_CACHE_EVERY` | 5 | **2** | mask refreshed more often, cleaner face-edge blending |
| `MUSETALK_EXTRA_MARGIN` | 15 | **14** | jaw-region fine-tuning |
| `MUSETALK_BLEND_MODE` | auto | **jaw** | explicit jaw mode for v1.5 |
| `MUSETALK_ENCODE_CRF` | 18 | **14** | near visually lossless (the output is re-encoded by Remotion anyway) |
| `MUSETALK_ENCODE_PRESET` | medium | **slow** | better compression at the same CRF |
| `MUSETALK_AUDIO_PADDING` | 2/2 | 2/2 | unchanged |
| `MUSETALK_FACEPARSING_CHEEK` | 90/90 | 90/90 | unchanged |

Full list of newly configurable parameters: `DETECT_EVERY`, `BLEND_CACHE_EVERY`, `AUDIO_PADDING_LEFT/RIGHT`, `EXTRA_MARGIN`, `DELAY_FRAME`, `BLEND_MODE`, `FACEPARSING_LEFT/RIGHT_CHEEK_WIDTH`, `ENCODE_CRF`, `ENCODE_PRESET`.
### 5. Workflow Async Non-Blocking + compose Skip Optimizations

#### 5a. Thread-pooling blocking calls

**File**: `backend/app/modules/videos/workflow.py`

Several synchronous FFmpeg calls in the workflow blocked the asyncio event loop, leaving other API requests (health checks, task status queries) unanswered. A new helper `_run_blocking()` routes every blocking call through the thread pool:

```python
async def _run_blocking(func, *args):
    """Run a blocking function in the thread pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)
```

Blocking call sites converted:

| Call | Location | Notes |
|------|------|------|
| `video.normalize_orientation()` | single-material rotation normalization | FFmpeg rotate/transcode |
| `video.prepare_segment()` | multi-material segment preparation | FFmpeg scale + trim, parallelized per segment with `asyncio.gather` |
| `video.concat_videos()` | multi-material concatenation | FFmpeg concat |
| `video.prepare_segment()` | single-material prepare | FFmpeg scale + trim |
| `video.mix_audio()` | BGM mixing | FFmpeg audio mix |
| `video._get_duration()` | audio/video duration probing (3 sites) | ffprobe subprocess |

#### 5b. `prepare_segment` skips scaling at matching resolutions

**File**: `backend/app/modules/videos/workflow.py`

Previously `target_resolution` was always passed to `prepare_segment`, whether or not the material already matched, triggering the scale filter plus a libx264 re-encode. Now the resolution is compared per material:

- **Multi-material**: per-segment check; matching segments pass `None` (`prepare_target_res = None if res == base_res else base_res`) and take the `-c:v copy` branch
- **Single-material**: `get_resolution` first; pass `None` on a match

When the resolution matches and no trimming, looping, or frame-rate change is needed, `prepare_segment` internally uses `-c:v copy` for fully lossless handling.

#### 5c. `_get_duration()` thread-pooled

**File**: `backend/app/modules/videos/workflow.py`

The 3 synchronous `video._get_duration()` ffprobe calls became `await _run_blocking(video._get_duration, ...)`, keeping the event loop unblocked.

#### 5d. Unified CRF for the compose looping path

**File**: `backend/app/services/video_service.py`

When `compose()` needs to loop the video, its encode moved from CRF 23 to CRF 18, matching the pipeline-wide quality standard.

#### 5e. Multi-material segment validation

**File**: `backend/app/modules/videos/workflow.py`

After multi-material `prepare_segment` completes, a segment-count consistency check prevents empty segments from reaching concat and raising.

#### 5f. compose() internal non-blocking

**File**: `backend/app/services/video_service.py`

`compose()` became `async def`; its internal `_get_duration()` and `_run_ffmpeg()` calls run in the thread pool via `loop.run_in_executor`.
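A self-contained demo of the `_run_blocking` pattern from 5a (the `slow_probe` function is a stand-in for an ffprobe/FFmpeg subprocess call, not real workflow code): while the blocking work runs in a worker thread, the event loop remains free to serve other coroutines such as a health check.

```python
import asyncio
import time


async def _run_blocking(func, *args):
    """Run a blocking function in the thread pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)


def slow_probe(path: str) -> str:
    """Stand-in for a blocking ffprobe/FFmpeg subprocess call."""
    time.sleep(0.2)
    return f"duration-of-{path}"


async def main():
    # The "health check" completes while slow_probe sleeps in a worker thread.
    probe = asyncio.create_task(_run_blocking(slow_probe, "clip.mp4"))
    health = await asyncio.sleep(0, result="ok")  # event loop not blocked
    return health, await probe


print(asyncio.run(main()))  # ('ok', 'duration-of-clip.mp4')
```

Calling `slow_probe` directly with `await` would instead freeze the loop for the full 0.2s, which is exactly the failure mode the workflow exhibited.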
#### 5g. Pass through directly when no second compose is needed

**File**: `backend/app/modules/videos/workflow.py`

When there is no BGM (`final_audio_path == audio_path`), the LatentSync/MuseTalk output already carries the correct audio track, so the redundant compose step is skipped:

```python
needs_audio_compose = str(final_audio_path) != str(audio_path)
```

- **Remotion path**: if the audio is unchanged, skip the pre-compose and feed the lipsync output straight into Remotion
- **Non-Remotion path**: if the audio is unchanged, `shutil.copy` the lipsync output through instead of running compose

---

### 6. Front-End Lip-Sync Model Selection

A model dropdown next to the generate button lets users pick the lip-sync engine on demand; the choice is passed through the whole chain to the backend routing.

#### Model options

| Option | Value | Routing logic |
|------|------|------|
| Default model | `default` | keep the existing threshold strategy (`LIPSYNC_DURATION_THRESHOLD` split: short videos → LatentSync, long videos → MuseTalk) |
| Fast model | `fast` | force MuseTalk, falling back to LatentSync if unavailable |
| Advanced model | `advanced` | force LatentSync, skipping MuseTalk |

All three modes ultimately fall back to LatentSync, so there is never a state with no usable model.

#### Data flow

```
front-end select → setLipsyncModelMode("fast") → persisted in localStorage
  ↓
user clicks "Generate video" → handleGenerate()
  → payload.lipsync_model = lipsyncModelMode
  → POST /api/videos/generate { ..., lipsync_model: "fast" }
  → workflow: req.lipsync_model passed to lipsync.generate(model_mode=...)
  → lipsync_service.generate(): route by model_mode
      → fast: force MuseTalk → fall back to LatentSync
      → advanced: force LatentSync
      → default: threshold strategy
```

#### Changed files

| File | Change |
|------|------|
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | model `select` dropdown added to the right of the generate button |

Surviving fragment of the `GenerateActionBar.tsx` change (the `<select>`/`<option>` tags and the rest of the table were lost in extraction; attributes kept verbatim):

```tsx
onModelModeChange(e.target.value as LipsyncModelMode)}
disabled={isGenerating}
className="h-[58px] rounded-xl border border-white/15 bg-black/30 px-3 text-sm text-gray-200 outline-none focus:border-purple-400"
title="选择唇形模型"
>

{!isGenerating && materialCount >= 2 && (
  多素材模式 ({materialCount} 个机位),生成耗时较长
```

diff --git a/frontend/src/features/home/ui/HomePage.tsx b/frontend/src/features/home/ui/HomePage.tsx
index e82f486..1d933e2 100644
--- a/frontend/src/features/home/ui/HomePage.tsx
+++ b/frontend/src/features/home/ui/HomePage.tsx
@@ -97,6 +97,8 @@ export function HomePage() {
     setTitleDisplayMode,
     outputAspectRatio,
     setOutputAspectRatio,
+    lipsyncModelMode,
+    setLipsyncModelMode,
     resolveAssetUrl,
     getFontFormat,
     buildTextShadow,
@@ -431,6 +433,8 @@
         progress={currentTask?.progress || 0}
         materialCount={selectedMaterials.length}
         disabled={isGenerating || selectedMaterials.length === 0 || !selectedAudio}
+        modelMode={lipsyncModelMode}
+        onModelModeChange={setLipsyncModelMode}
         onGenerate={handleGenerate}
       />
diff --git a/models/LatentSync/latentsync/pipelines/lipsync_pipeline.py b/models/LatentSync/latentsync/pipelines/lipsync_pipeline.py
index 693155e..53ed5f7 100644
--- a/models/LatentSync/latentsync/pipelines/lipsync_pipeline.py
+++ b/models/LatentSync/latentsync/pipelines/lipsync_pipeline.py
@@ -253,21 +253,58 @@ class LipsyncPipeline(DiffusionPipeline):
         faces = []
         boxes = []
         affine_matrices = []
+        valid_face_flags = []
         print(f"Affine transforming {len(video_frames)} faces...")
         for frame in tqdm.tqdm(video_frames):
-            face, box, affine_matrix = self.image_processor.affine_transform(frame)
-            faces.append(face)
-            boxes.append(box)
-            affine_matrices.append(affine_matrix)
+            try:
+                face, box, affine_matrix = self.image_processor.affine_transform(frame)
+                faces.append(face)
+                boxes.append(box)
+                affine_matrices.append(affine_matrix)
+                valid_face_flags.append(True)
+            except Exception:
+                faces.append(None)
+                boxes.append(None)
+                affine_matrices.append(None)
+                valid_face_flags.append(False)
+
+        valid_indices = [i for i, flag in enumerate(valid_face_flags) if flag]
+        if not valid_indices:
+            raise RuntimeError("Face not detected in any frame")
+
+        for i in range(len(faces)):
+            if faces[i] is not None:
+                continue
+            
nearest_idx = min(valid_indices, key=lambda idx: abs(idx - i)) + faces[i] = faces[nearest_idx].clone() + boxes[i] = boxes[nearest_idx] + affine_matrices[i] = affine_matrices[nearest_idx] + + missing_count = len(valid_face_flags) - len(valid_indices) + if missing_count > 0: + print( + f"Warning: face not detected in {missing_count}/{len(valid_face_flags)} frames. " + "Those frames will keep original content." + ) faces = torch.stack(faces) - return faces, boxes, affine_matrices + return faces, boxes, affine_matrices, valid_face_flags - def restore_video(self, faces: torch.Tensor, video_frames: np.ndarray, boxes: list, affine_matrices: list): + def restore_video( + self, + faces: torch.Tensor, + video_frames: np.ndarray, + boxes: list, + affine_matrices: list, + valid_face_flags: Optional[list] = None, + ): video_frames = video_frames[: len(faces)] out_frames = [] print(f"Restoring {len(faces)} faces...") for index, face in enumerate(tqdm.tqdm(faces)): + if valid_face_flags is not None and not valid_face_flags[index]: + out_frames.append(video_frames[index]) + continue x1, y1, x2, y2 = boxes[index] height = int(y2 - y1) width = int(x2 - x1) @@ -281,33 +318,37 @@ class LipsyncPipeline(DiffusionPipeline): def loop_video(self, whisper_chunks: list, video_frames: np.ndarray): # If the audio is longer than the video, we need to loop the video if len(whisper_chunks) > len(video_frames): - faces, boxes, affine_matrices = self.affine_transform_video(video_frames) + faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames) num_loops = math.ceil(len(whisper_chunks) / len(video_frames)) loop_video_frames = [] loop_faces = [] loop_boxes = [] loop_affine_matrices = [] + loop_valid_face_flags = [] for i in range(num_loops): if i % 2 == 0: loop_video_frames.append(video_frames) loop_faces.append(faces) loop_boxes += boxes loop_affine_matrices += affine_matrices + loop_valid_face_flags += valid_face_flags else: 
loop_video_frames.append(video_frames[::-1]) loop_faces.append(faces.flip(0)) loop_boxes += boxes[::-1] loop_affine_matrices += affine_matrices[::-1] + loop_valid_face_flags += valid_face_flags[::-1] video_frames = np.concatenate(loop_video_frames, axis=0)[: len(whisper_chunks)] faces = torch.cat(loop_faces, dim=0)[: len(whisper_chunks)] boxes = loop_boxes[: len(whisper_chunks)] affine_matrices = loop_affine_matrices[: len(whisper_chunks)] + valid_face_flags = loop_valid_face_flags[: len(whisper_chunks)] else: video_frames = video_frames[: len(whisper_chunks)] - faces, boxes, affine_matrices = self.affine_transform_video(video_frames) + faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames) - return video_frames, faces, boxes, affine_matrices + return video_frames, faces, boxes, affine_matrices, valid_face_flags @torch.no_grad() def __call__( @@ -367,7 +408,7 @@ class LipsyncPipeline(DiffusionPipeline): audio_samples = read_audio(audio_path) video_frames = read_video(video_path, use_decord=False) - video_frames, faces, boxes, affine_matrices = self.loop_video(whisper_chunks, video_frames) + video_frames, faces, boxes, affine_matrices, valid_face_flags = self.loop_video(whisper_chunks, video_frames) synced_video_frames = [] @@ -457,7 +498,13 @@ class LipsyncPipeline(DiffusionPipeline): ) synced_video_frames.append(decoded_latents) - synced_video_frames = self.restore_video(torch.cat(synced_video_frames), video_frames, boxes, affine_matrices) + synced_video_frames = self.restore_video( + torch.cat(synced_video_frames), + video_frames, + boxes, + affine_matrices, + valid_face_flags=valid_face_flags, + ) audio_samples_remain_length = int(synced_video_frames.shape[0] / video_fps * audio_sample_rate) audio_samples = audio_samples[:audio_samples_remain_length].cpu().numpy() @@ -473,5 +520,5 @@ class LipsyncPipeline(DiffusionPipeline): sf.write(os.path.join(temp_dir, "audio.wav"), audio_samples, audio_sample_rate) - command = f"ffmpeg 
-y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0 {video_out_path}" + command = f"ffmpeg -y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v copy -c:a aac -q:a 0 {video_out_path}" subprocess.run(command, shell=True) diff --git a/models/LatentSync/latentsync/utils/util.py b/models/LatentSync/latentsync/utils/util.py index 6aa9ab3..371d42d 100644 --- a/models/LatentSync/latentsync/utils/util.py +++ b/models/LatentSync/latentsync/utils/util.py @@ -49,11 +49,22 @@ def read_video(video_path: str, change_fps=True, use_decord=True): if os.path.exists(temp_dir): shutil.rmtree(temp_dir) os.makedirs(temp_dir, exist_ok=True) - command = ( - f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}" - ) - subprocess.run(command, shell=True) - target_video_path = os.path.join(temp_dir, "video.mp4") + + # 检测输入视频 FPS,已是 25fps 时跳过重编码 + cap = cv2.VideoCapture(video_path) + current_fps = cap.get(cv2.CAP_PROP_FPS) + cap.release() + + if abs(current_fps - 25.0) < 0.5: + # 已是 25fps,直接使用原文件(避免一次有损重编码) + print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion") + target_video_path = video_path + else: + command = ( + f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}" + ) + subprocess.run(command, shell=True) + target_video_path = os.path.join(temp_dir, "video.mp4") else: target_video_path = video_path diff --git a/models/MuseTalk/scripts/server.py b/models/MuseTalk/scripts/server.py index ec8f61e..81048b1 100644 --- a/models/MuseTalk/scripts/server.py +++ b/models/MuseTalk/scripts/server.py @@ -4,14 +4,14 @@ MuseTalk v1.5 常驻推理服务 (优化版 v2) - GPU: 从 backend/.env 读取 MUSETALK_GPU_ID (默认 0) - 架构: FastAPI + lifespan (与 LatentSync server.py 同模式) -优化项 (vs v1): -1. cv2.VideoCapture 直读帧 (跳过 ffmpeg→PNG→imread) -2. 
人脸检测降频 (每 N 帧检测, 中间插值 bbox) -3. BiSeNet mask 缓存 (每 N 帧更新, 中间复用) -4. cv2.VideoWriter 直写视频 (跳过逐帧 PNG 写盘) -5. batch_size 8→32 -6. 每阶段计时 -""" +优化项 (vs v1): +1. cv2.VideoCapture 直读帧 (跳过 ffmpeg→PNG→imread) +2. 人脸检测降频 (每 N 帧检测, 中间插值 bbox) +3. BiSeNet mask 缓存 (每 N 帧更新, 中间复用) +4. FFmpeg rawvideo 管道直编码 (去掉中间有损 mp4v) +5. batch_size 8→32 +6. 每阶段计时 +""" import os import sys @@ -84,17 +84,28 @@ from musetalk.utils.utils import get_file_type, get_video_fps, datagen, load_all from musetalk.utils.preprocessing import get_landmark_and_bbox, read_imgs, coord_placeholder # --- 从 .env 读取额外配置 --- -def load_env_config(): - """读取 MuseTalk 相关环境变量""" - config = { - "batch_size": 32, - "version": "v15", - "use_float16": True, - } - try: - env_path = musetalk_root.parent.parent / "backend" / ".env" - if env_path.exists(): - with open(env_path, "r", encoding="utf-8") as f: +def load_env_config(): + """读取 MuseTalk 相关环境变量""" + config = { + "batch_size": 32, + "version": "v15", + "use_float16": True, + "detect_every": 5, + "blend_cache_every": 5, + "audio_padding_left": 2, + "audio_padding_right": 2, + "extra_margin": 15, + "delay_frame": 0, + "blend_mode": "auto", + "faceparsing_left_cheek_width": 90, + "faceparsing_right_cheek_width": 90, + "encode_crf": 18, + "encode_preset": "medium", + } + try: + env_path = musetalk_root.parent.parent / "backend" / ".env" + if env_path.exists(): + with open(env_path, "r", encoding="utf-8") as f: for line in f: line = line.strip() if line.startswith("MUSETALK_BATCH_SIZE="): @@ -105,22 +116,78 @@ def load_env_config(): val = line.split("=")[1].strip().split("#")[0].strip() if val: config["version"] = val - elif line.startswith("MUSETALK_USE_FLOAT16="): - val = line.split("=")[1].strip().split("#")[0].strip().lower() - config["use_float16"] = val in ("true", "1", "yes") - except Exception as e: - print(f"⚠️ 读取额外配置失败: {e}") - return config - -env_config = load_env_config() + elif line.startswith("MUSETALK_USE_FLOAT16="): + val = 
line.split("=")[1].strip().split("#")[0].strip().lower() + config["use_float16"] = val in ("true", "1", "yes") + elif line.startswith("MUSETALK_DETECT_EVERY="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["detect_every"] = max(1, int(val)) + elif line.startswith("MUSETALK_BLEND_CACHE_EVERY="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["blend_cache_every"] = max(1, int(val)) + elif line.startswith("MUSETALK_AUDIO_PADDING_LEFT="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["audio_padding_left"] = max(0, int(val)) + elif line.startswith("MUSETALK_AUDIO_PADDING_RIGHT="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["audio_padding_right"] = max(0, int(val)) + elif line.startswith("MUSETALK_EXTRA_MARGIN="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["extra_margin"] = max(0, int(val)) + elif line.startswith("MUSETALK_DELAY_FRAME="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["delay_frame"] = int(val) + elif line.startswith("MUSETALK_BLEND_MODE="): + val = line.split("=")[1].strip().split("#")[0].strip().lower() + if val in ("auto", "jaw", "raw"): + config["blend_mode"] = val + elif line.startswith("MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["faceparsing_left_cheek_width"] = max(0, int(val)) + elif line.startswith("MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["faceparsing_right_cheek_width"] = max(0, int(val)) + elif line.startswith("MUSETALK_ENCODE_CRF="): + val = line.split("=")[1].strip().split("#")[0].strip() + if val: + config["encode_crf"] = min(51, max(0, int(val))) + elif line.startswith("MUSETALK_ENCODE_PRESET="): + val = line.split("=")[1].strip().split("#")[0].strip().lower() + if val in ( + "ultrafast", "superfast", 
"veryfast", "faster", "fast", + "medium", "slow", "slower", "veryslow" + ): + config["encode_preset"] = val + except Exception as e: + print(f"⚠️ 读取额外配置失败: {e}") + return config + +env_config = load_env_config() # 全局模型缓存 models = {} -# ===================== 优化参数 ===================== -DETECT_EVERY = 5 # 人脸检测降频: 每 N 帧检测一次 -BLEND_CACHE_EVERY = 5 # BiSeNet mask 缓存: 每 N 帧更新一次 -# ==================================================== +# ===================== 优化参数 ===================== +DETECT_EVERY = int(env_config["detect_every"]) # 人脸检测降频: 每 N 帧检测一次 +BLEND_CACHE_EVERY = int(env_config["blend_cache_every"]) # BiSeNet mask 缓存: 每 N 帧更新一次 +AUDIO_PADDING_LEFT = int(env_config["audio_padding_left"]) +AUDIO_PADDING_RIGHT = int(env_config["audio_padding_right"]) +EXTRA_MARGIN = int(env_config["extra_margin"]) +DELAY_FRAME = int(env_config["delay_frame"]) +BLEND_MODE = str(env_config["blend_mode"]) +FACEPARSING_LEFT_CHEEK_WIDTH = int(env_config["faceparsing_left_cheek_width"]) +FACEPARSING_RIGHT_CHEEK_WIDTH = int(env_config["faceparsing_right_cheek_width"]) +ENCODE_CRF = int(env_config["encode_crf"]) +ENCODE_PRESET = str(env_config["encode_preset"]) +# ==================================================== def run_ffmpeg(cmd): @@ -191,11 +258,14 @@ async def lifespan(app: FastAPI): whisper = whisper.to(device=device, dtype=weight_dtype).eval() whisper.requires_grad_(False) - # FaceParsing - if version == "v15": - fp = FaceParsing(left_cheek_width=90, right_cheek_width=90) - else: - fp = FaceParsing() + # FaceParsing + if version == "v15": + fp = FaceParsing( + left_cheek_width=FACEPARSING_LEFT_CHEEK_WIDTH, + right_cheek_width=FACEPARSING_RIGHT_CHEEK_WIDTH, + ) + else: + fp = FaceParsing() # 恢复工作目录 os.chdir(original_cwd) @@ -211,9 +281,13 @@ async def lifespan(app: FastAPI): models["version"] = version models["timesteps"] = torch.tensor([0], device=device) - print("✅ MuseTalk v1.5 模型加载完成,服务就绪!") - print(f"⚙️ 优化参数: batch_size={env_config['batch_size']}, " - 
f"detect_every={DETECT_EVERY}, blend_cache_every={BLEND_CACHE_EVERY}") + print("✅ MuseTalk v1.5 模型加载完成,服务就绪!") + print(f"⚙️ 优化参数: batch_size={env_config['batch_size']}, " + f"detect_every={DETECT_EVERY}, blend_cache_every={BLEND_CACHE_EVERY}, " + f"audio_padding=({AUDIO_PADDING_LEFT},{AUDIO_PADDING_RIGHT}), extra_margin={EXTRA_MARGIN}, " + f"delay_frame={DELAY_FRAME}, blend_mode={BLEND_MODE}, " + f"faceparsing_cheek=({FACEPARSING_LEFT_CHEEK_WIDTH},{FACEPARSING_RIGHT_CHEEK_WIDTH}), " + f"encode=libx264/{ENCODE_PRESET}/crf{ENCODE_CRF}") yield models.clear() torch.cuda.empty_cache() @@ -354,15 +428,15 @@ def _detect_faces_subsampled(frames, detect_every=5): # 核心推理 (优化版) # ===================================================================== @torch.no_grad() -def _run_inference(req: LipSyncRequest) -> dict: - """ - 优化版推理逻辑: - 1. cv2.VideoCapture 直读帧 (跳过 ffmpeg→PNG→imread) - 2. 人脸检测降频 (每 N 帧, 中间插值) - 3. BiSeNet mask 缓存 (每 N 帧更新) - 4. cv2.VideoWriter 直写 (跳过逐帧 PNG) - 5. 每阶段计时 +def _run_inference(req: LipSyncRequest) -> dict: """ + 优化版推理逻辑: + 1. cv2.VideoCapture 直读帧 (跳过 ffmpeg→PNG→imread) + 2. 人脸检测降频 (每 N 帧, 中间插值) + 3. BiSeNet mask 缓存 (每 N 帧更新) + 4. FFmpeg rawvideo 管道直编码 (无中间有损文件) + 5. 
每阶段计时 + """ vae = models["vae"] unet = models["unet"] pe = models["pe"] @@ -411,12 +485,12 @@ def _run_inference(req: LipSyncRequest) -> dict: # ===== Phase 2: Whisper 音频特征 ===== t0 = time.time() whisper_input_features, librosa_length = audio_processor.get_audio_feature(audio_path) - whisper_chunks = audio_processor.get_whisper_chunk( - whisper_input_features, device, weight_dtype, whisper, librosa_length, - fps=fps, - audio_padding_length_left=2, - audio_padding_length_right=2, - ) + whisper_chunks = audio_processor.get_whisper_chunk( + whisper_input_features, device, weight_dtype, whisper, librosa_length, + fps=fps, + audio_padding_length_left=AUDIO_PADDING_LEFT, + audio_padding_length_right=AUDIO_PADDING_RIGHT, + ) timings["2_whisper"] = time.time() - t0 print(f"🎵 Whisper 特征 [{timings['2_whisper']:.1f}s]") @@ -427,12 +501,12 @@ def _run_inference(req: LipSyncRequest) -> dict: print(f"🔍 人脸检测 [{timings['3_face']:.1f}s]") # ===== Phase 4: VAE 潜空间编码 ===== - t0 = time.time() - input_latent_list = [] - extra_margin = 15 - for bbox, frame in zip(coord_list, frames): - if bbox == coord_placeholder: - continue + t0 = time.time() + input_latent_list = [] + extra_margin = EXTRA_MARGIN + for bbox, frame in zip(coord_list, frames): + if bbox == coord_placeholder: + continue x1, y1, x2, y2 = bbox if version == "v15": y2 = min(y2 + extra_margin, frame.shape[0]) @@ -453,13 +527,13 @@ def _run_inference(req: LipSyncRequest) -> dict: input_latent_list_cycle = input_latent_list + input_latent_list[::-1] video_num = len(whisper_chunks) - gen = datagen( - whisper_chunks=whisper_chunks, - vae_encode_latents=input_latent_list_cycle, - batch_size=batch_size, - delay_frame=0, - device=device, - ) + gen = datagen( + whisper_chunks=whisper_chunks, + vae_encode_latents=input_latent_list_cycle, + batch_size=batch_size, + delay_frame=DELAY_FRAME, + device=device, + ) res_frame_list = [] total_batches = int(np.ceil(float(video_num) / batch_size)) @@ -479,21 +553,44 @@ def _run_inference(req: 
LipSyncRequest) -> dict: timings["5_unet"] = time.time() - t0 print(f"✅ UNet 推理: {len(res_frame_list)} 帧 [{timings['5_unet']:.1f}s]") - # ===== Phase 6: 合成 (cv2.VideoWriter + 纯 numpy blending) ===== - t0 = time.time() - - h, w = frames[0].shape[:2] - temp_raw_path = output_vid_path + ".raw.mp4" - - fourcc = cv2.VideoWriter_fourcc(*'mp4v') - writer = cv2.VideoWriter(temp_raw_path, fourcc, fps, (w, h)) - - if not writer.isOpened(): - raise RuntimeError(f"cv2.VideoWriter 打开失败: {temp_raw_path}") - - cached_mask = None - cached_crop_box = None - blend_mode = "jaw" if version == "v15" else "raw" + # ===== Phase 6: 合成并写入 FFmpeg rawvideo 管道 ===== + t0 = time.time() + + h, w = frames[0].shape[:2] + ffmpeg_cmd = [ + "ffmpeg", "-y", "-v", "warning", + "-f", "rawvideo", + "-pix_fmt", "bgr24", + "-s", f"{w}x{h}", + "-r", str(fps), + "-i", "-", + "-i", audio_path, + "-c:v", "libx264", "-preset", ENCODE_PRESET, "-crf", str(ENCODE_CRF), "-pix_fmt", "yuv420p", + "-c:a", "copy", "-shortest", + output_vid_path, + ] + ffmpeg_proc = subprocess.Popen( + ffmpeg_cmd, + stdin=subprocess.PIPE, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + ) + pipe_in = ffmpeg_proc.stdin + if pipe_in is None: + raise RuntimeError("FFmpeg 管道初始化失败") + + def _write_pipe_frame(frame: np.ndarray): + try: + pipe_in.write(np.ascontiguousarray(frame, dtype=np.uint8).tobytes()) + except BrokenPipeError as exc: + raise RuntimeError("FFmpeg 管道写入失败") from exc + + cached_mask = None + cached_crop_box = None + if BLEND_MODE == "auto": + blend_mode = "jaw" if version == "v15" else "raw" + else: + blend_mode = BLEND_MODE for i in tqdm(range(len(res_frame_list)), desc="合成"): res_frame = res_frame_list[i] @@ -503,26 +600,26 @@ def _run_inference(req: LipSyncRequest) -> dict: x1, y1, x2, y2 = bbox if version == "v15": y2 = min(y2 + extra_margin, ori_frame.shape[0]) - adjusted_bbox = (x1, y1, x2, y2) - - try: - res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1)) - except Exception: - 
writer.write(ori_frame) - continue + adjusted_bbox = (x1, y1, x2, y2) + + try: + res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1)) + except Exception: + _write_pipe_frame(ori_frame) + continue # 每 N 帧更新 BiSeNet 人脸解析 mask, 其余帧复用缓存 if i % BLEND_CACHE_EVERY == 0 or cached_mask is None: try: cached_mask, cached_crop_box = get_image_prepare_material( ori_frame, adjusted_bbox, mode=blend_mode, fp=fp) - except Exception: - # 如果 prepare 失败, 用完整方式 - combine_frame = get_image( - ori_frame, res_frame, list(adjusted_bbox), - mode=blend_mode, fp=fp) - writer.write(combine_frame) - continue + except Exception: + # 如果 prepare 失败, 用完整方式 + combine_frame = get_image( + ori_frame, res_frame, list(adjusted_bbox), + mode=blend_mode, fp=fp) + _write_pipe_frame(combine_frame) + continue try: combine_frame = get_image_blending_fast( @@ -532,35 +629,25 @@ def _run_inference(req: LipSyncRequest) -> dict: try: combine_frame = get_image_blending( ori_frame, res_frame, adjusted_bbox, cached_mask, cached_crop_box) - except Exception: - combine_frame = get_image( - ori_frame, res_frame, list(adjusted_bbox), - mode=blend_mode, fp=fp) - - writer.write(combine_frame) - - writer.release() - timings["6_blend"] = time.time() - t0 - print(f"🎨 合成 [{timings['6_blend']:.1f}s]") - - # ===== Phase 7: FFmpeg H.264 编码 + 合并音频 ===== - t0 = time.time() - cmd = [ - "ffmpeg", "-y", "-v", "warning", - "-i", temp_raw_path, "-i", audio_path, - "-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p", - "-c:a", "copy", "-shortest", - output_vid_path - ] - if not run_ffmpeg(cmd): - raise RuntimeError("FFmpeg 重编码+音频合并失败") - - # 清理临时文件 - if os.path.exists(temp_raw_path): - os.unlink(temp_raw_path) - - timings["7_encode"] = time.time() - t0 - print(f"🔊 编码+音频 [{timings['7_encode']:.1f}s]") + except Exception: + combine_frame = get_image( + ori_frame, res_frame, list(adjusted_bbox), + mode=blend_mode, fp=fp) + + _write_pipe_frame(combine_frame) + + pipe_in.close() + timings["6_blend"] = time.time() - t0 + 
print(f"🎨 合成 [{timings['6_blend']:.1f}s]") + + # ===== Phase 7: 等待 FFmpeg 编码完成 ===== + t0 = time.time() + return_code = ffmpeg_proc.wait() + if return_code != 0: + raise RuntimeError("FFmpeg 编码+音频合并失败") + + timings["7_encode"] = time.time() - t0 + print(f"🔊 编码+音频 [{timings['7_encode']:.1f}s]") # ===== 汇总 ===== total_time = time.time() - t_total diff --git a/remotion/render.ts b/remotion/render.ts index 2e7bf09..a60abc6 100644 --- a/remotion/render.ts +++ b/remotion/render.ts @@ -185,9 +185,50 @@ async function main() { const currentHash = getSourceHash(); let bundleLocation: string; + // 辅助函数: 确保文件在缓存 public 目录中可访问 (硬链接 > 复制) + function ensureInCachedPublic(cachedPublicDir: string, srcAbsPath: string, fileName: string) { + const cachedPath = path.join(cachedPublicDir, fileName); + // 已存在且大小一致,跳过 + try { + if (fs.existsSync(cachedPath)) { + const srcStat = fs.statSync(srcAbsPath); + const cachedStat = fs.statSync(cachedPath); + if (srcStat.size === cachedStat.size && srcStat.ino === cachedStat.ino) return; + } + } catch { /* file doesn't exist or broken, will recreate */ } + // 移除旧的文件/链接 + try { fs.unlinkSync(cachedPath); } catch { /* doesn't exist, fine */ } + // 优先硬链接(零拷贝,对应用透明),跨文件系统时回退为复制 + try { + fs.linkSync(srcAbsPath, cachedPath); + console.log(`Hardlinked into cached bundle: ${fileName}`); + } catch { + fs.copyFileSync(srcAbsPath, cachedPath); + console.log(`Copied into cached bundle: ${fileName}`); + } + } + if (fs.existsSync(hashFile) && fs.readFileSync(hashFile, 'utf-8') === currentHash) { bundleLocation = BUNDLE_CACHE_DIR; console.log('Using cached bundle'); + // 确保当前渲染所需的文件在缓存 bundle 的 public 目录中可访问 + const cachedPublicDir = path.join(BUNDLE_CACHE_DIR, 'public'); + if (!fs.existsSync(cachedPublicDir)) { + fs.mkdirSync(cachedPublicDir, { recursive: true }); + } + // 1) 视频文件 + ensureInCachedPublic(cachedPublicDir, path.resolve(options.videoPath), videoFileName); + // 2) 字体文件 (从 subtitleStyle / titleStyle / secondaryTitleStyle 中提取) + const styleSources = 
[options.subtitleStyle, options.titleStyle, options.secondaryTitleStyle]; + for (const style of styleSources) { + const fontFile = (style as Record)?.font_file as string | undefined; + if (fontFile) { + const fontSrcPath = path.join(publicDir, fontFile); + if (fs.existsSync(fontSrcPath)) { + ensureInCachedPublic(cachedPublicDir, path.resolve(fontSrcPath), fontFile); + } + } + } } else { console.log('Bundling Remotion project...'); console.log(`Entry point: ${entryPoint}`);