diff --git a/Docs/BACKEND_DEV.md b/Docs/BACKEND_DEV.md index f8c5f47..cc3a60a 100644 --- a/Docs/BACKEND_DEV.md +++ b/Docs/BACKEND_DEV.md @@ -194,6 +194,15 @@ backend/user_data/{user_uuid}/cookies/ - `MUSETALK_USE_FLOAT16` (半精度,默认 true) - `LIPSYNC_DURATION_THRESHOLD` (秒,>=此值用 MuseTalk;代码默认 120,本仓库当前 `.env` 配置 100) +### 小脸口型质量补偿(本地唇形路径) +- `LIPSYNC_SMALL_FACE_ENHANCE` (总开关,默认 false) +- `LIPSYNC_SMALL_FACE_THRESHOLD` (触发阈值,默认 256) +- `LIPSYNC_SMALL_FACE_UPSCALER` (`gfpgan` / `codeformer`) +- `LIPSYNC_SMALL_FACE_GPU_ID` (超分 GPU,默认 0) +- `LIPSYNC_SMALL_FACE_FAIL_OPEN` (失败回退,默认 true) + +> 部署与验证细节见 `Docs/FACEENHANCE_DEPLOY.md`。 + ### 微信视频号 - `WEIXIN_HEADLESS_MODE` (headful/headless-new) - `WEIXIN_CHROME_PATH` / `WEIXIN_BROWSER_CHANNEL` diff --git a/Docs/BACKEND_README.md b/Docs/BACKEND_README.md index 98cb3d7..54177c6 100644 --- a/Docs/BACKEND_README.md +++ b/Docs/BACKEND_README.md @@ -126,9 +126,11 @@ backend/ > - `analysis_id` 绑定 `user_id` 且有 TTL(默认 20 分钟),用于后续“生成文案”阶段安全读取标题上下文。 10. **健康检查** - * `GET /api/videos/lipsync/health`: 唇形同步服务健康状态(含 LatentSync + MuseTalk + 混合路由阈值) + * `GET /api/videos/lipsync/health`: 唇形同步服务健康状态(含 LatentSync + MuseTalk + 混合路由阈值 + `data.small_face_enhance`) * `GET /api/videos/voiceclone/health`: CosyVoice 3.0 服务健康状态 +> 小脸口型质量补偿链路健康字段说明:`data.small_face_enhance.enabled`(总开关)、`threshold`(触发阈值)、`detector_loaded`(SCRFD 是否已懒加载)。 + 11. 
**支付 (Payment)** * `POST /api/payment/create-order`: 创建支付宝电脑网站支付订单(需 payment_token) * `POST /api/payment/notify`: 支付宝异步通知回调(返回纯文本 success/fail) @@ -260,6 +262,13 @@ MUSETALK_API_URL=http://localhost:8011 MUSETALK_BATCH_SIZE=32 LIPSYNC_DURATION_THRESHOLD=100 +# 小脸口型质量补偿(默认关闭,建议灰度开启) +LIPSYNC_SMALL_FACE_ENHANCE=false +LIPSYNC_SMALL_FACE_THRESHOLD=256 +LIPSYNC_SMALL_FACE_UPSCALER=gfpgan +LIPSYNC_SMALL_FACE_GPU_ID=0 +LIPSYNC_SMALL_FACE_FAIL_OPEN=true + # MuseTalk 可调参数(示例) MUSETALK_DETECT_EVERY=2 MUSETALK_BLEND_CACHE_EVERY=2 @@ -267,6 +276,8 @@ MUSETALK_ENCODE_CRF=14 MUSETALK_ENCODE_PRESET=slow ``` +> 小脸口型质量补偿链路部署、权重与回滚说明见 `Docs/FACEENHANCE_DEPLOY.md`(仅本地 `_local_generate()` 路径接入,远程模式暂不接入)。 + ### 4. 启动服务 **开发模式 (热重载)**: diff --git a/Docs/DEPLOY_MANUAL.md b/Docs/DEPLOY_MANUAL.md index ba3934d..cbd3562 100644 --- a/Docs/DEPLOY_MANUAL.md +++ b/Docs/DEPLOY_MANUAL.md @@ -99,8 +99,11 @@ python -m scripts.server # 测试能否启动,Ctrl+C 退出 > MuseTalk 是单步潜空间修复模型(非扩散模型),推理速度接近实时,适合达到路由阈值的长视频(本仓库当前 `.env` 示例为 >=100s)。与 CosyVoice 共享 GPU0,fp16 推理约需 4-8GB 显存。合成阶段已改为 FFmpeg rawvideo 管道直编码(`libx264` + 可配 CRF/preset)并保留 numpy blending,减少中间有损文件。 -请参考详细的独立部署指南: -**[MuseTalk 部署指南](MUSETALK_DEPLOY.md)** +请参考详细的独立部署指南: +**[MuseTalk 部署指南](MUSETALK_DEPLOY.md)** + +小脸口型质量补偿(可选)部署与验证: +**[小脸口型质量补偿链路部署指南](FACEENHANCE_DEPLOY.md)** 简要步骤: 1. 
创建独立的 `musetalk` Conda 环境 (Python 3.10 + PyTorch 2.0.1 + CUDA 11.8) @@ -248,9 +251,14 @@ cd /home/rongye/ProgramFiles/ViGent2/backend | `MUSETALK_GPU_ID` | 0 | MuseTalk GPU 编号 | | `MUSETALK_API_URL` | `http://localhost:8011` | MuseTalk 常驻服务地址 | | `MUSETALK_BATCH_SIZE` | 32 | MuseTalk 推理批大小 | -| `MUSETALK_VERSION` | v15 | MuseTalk 模型版本 | -| `MUSETALK_USE_FLOAT16` | true | MuseTalk 半精度加速 | +| `MUSETALK_VERSION` | v15 | MuseTalk 模型版本 | +| `MUSETALK_USE_FLOAT16` | true | MuseTalk 半精度加速 | | `LIPSYNC_DURATION_THRESHOLD` | 100 | 秒,>=此值用 MuseTalk,<此值用 LatentSync(代码默认 120,建议在 `.env` 显式配置) | +| `LIPSYNC_SMALL_FACE_ENHANCE` | false | 小脸口型质量补偿总开关(建议先关闭,灰度验证后开启) | +| `LIPSYNC_SMALL_FACE_THRESHOLD` | 256 | 小脸触发阈值(像素) | +| `LIPSYNC_SMALL_FACE_UPSCALER` | gfpgan | 超分模型(`gfpgan` / `codeformer`) | +| `LIPSYNC_SMALL_FACE_GPU_ID` | 0 | 小脸补偿超分 GPU(建议与 MuseTalk 同卡) | +| `LIPSYNC_SMALL_FACE_FAIL_OPEN` | true | 补偿链路失败时是否自动回退原流程 | | `ALIPAY_APP_ID` | 空 | 支付宝应用 APPID | | `ALIPAY_PRIVATE_KEY_PATH` | 空 | 应用私钥 PEM 文件路径 | | `ALIPAY_PUBLIC_KEY_PATH` | 空 | 支付宝公钥 PEM 文件路径 | diff --git a/Docs/DevLogs/Day35.md b/Docs/DevLogs/Day35.md new file mode 100644 index 0000000..bc2fa11 --- /dev/null +++ b/Docs/DevLogs/Day35.md @@ -0,0 +1,165 @@ +## 小脸口型质量补偿落地 + 部署验证 (Day 35) + +### 概述 + +完成「小脸口型质量补偿(Small-Face LipSync Compensation)」后端落地与部署收口。核心目标是在不改变用户模型选择语义(`default/fast/advanced`)的前提下,对远景小脸素材增加质量补偿链路(检测 -> 裁切 -> 稀疏超分 -> 模型推理 -> 贴回),并保持默认关闭、失败回退(fail-open)、线上可快速回滚。 + +--- + +## ✅ 1) 后端能力落地 + +### 1.1 配置与开关 + +新增 5 个配置项(默认保守): + +- `LIPSYNC_SMALL_FACE_ENHANCE`(默认 `false`) +- `LIPSYNC_SMALL_FACE_THRESHOLD`(默认 `256`) +- `LIPSYNC_SMALL_FACE_UPSCALER`(`gfpgan | codeformer`) +- `LIPSYNC_SMALL_FACE_GPU_ID`(默认 `0`) +- `LIPSYNC_SMALL_FACE_FAIL_OPEN`(默认 `true`) + +对应代码入口:`backend/app/core/config.py`、`backend/.env`。 + +### 1.2 新增小脸增强服务 + +新增 `backend/app/services/small_face_enhance_service.py`,实现完整补偿链路: + +1. 
**小脸判定**(CPU) + - SCRFD(`det_10g.onnx`,复用 LatentSync 权重) + - 从视频 10%-30% 区间均匀采样 24 帧 + - 用最大脸宽中位数与阈值比较触发 + +2. **裁切与轨迹**(CPU) + - 每 8 帧检测一次,其余帧前向填充 + EMA 平滑 + - bbox 外扩 `padding=0.28` + +3. **稀疏超分**(GPU0) + - 检测帧走 GFPGAN/CodeFormer + - 非检测帧走 bicubic resize + - 目标尺寸 `512x512` + +4. **贴回融合**(CPU) + - 口型局部 mask(起点 68% + 侧边留白 16%)+ 高斯羽化(15px) + - `cv2.seamlessClone`,失败回退 alpha blend + +5. **帧数保护** + - 贴回前校验 `lipsync_frames <= original_frames` + - 仅当 `lipsync_frames > original_frames` 时报错(异常),其余按 lipsync 帧数正常贴回 + +--- + +## ✅ 2) LipSyncService 集成 + +`backend/app/services/lipsync_service.py` 关键改造: + +- 在 `_local_generate()` 内按顺序执行: + - `video looping` -> `small face enhance` -> `model infer` -> `blend back` +- 抽取 `_run_selected_model()` 统一模型路由(MuseTalk / LatentSync server / LatentSync subprocess) +- 小脸增强分支全链路 `try/except`,受 `LIPSYNC_SMALL_FACE_FAIL_OPEN` 控制 +- `check_health()` 新增 `small_face_enhance` 状态字段 + +语义保持: + +- 前端与 API 协议不变 +- 用户选择模型优先,不因小脸强制换模型 +- 仅本地路径(`_local_generate`)接入;远程路径暂不接入 + +--- + +## ✅ 3) 依赖与权重 + +### 3.1 依赖 + +`backend/requirements.txt` 新增: + +- `opencv-python-headless>=4.8.0` +- `gfpgan>=1.3.8` + +### 3.2 权重 + +- `models/FaceEnhance/GFPGANv1.4.pth`(新增目录与权重) +- `models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx`(复用) + +--- + +## ✅ 4) 稳定性修复(部署后补丁) + +为解决实际部署中的依赖兼容、帧数估算偏差、贴回误判与输出质量问题,补充九处修复: + +1. **懒加载 + 守卫** + - `cv2/numpy` 改为 `try/except` 导入 + - 用 `_CV2_AVAILABLE` 守卫增强入口 + - 缺依赖时跳过增强,不影响主流程 + +2. **类型注解与 torchvision 兼容补丁** + - 增加 `from __future__ import annotations`,避免 `np.ndarray` 在缺依赖场景下导入期报错 + - 在 `_ensure_upscaler()` 中注入 + `sys.modules['torchvision.transforms.functional_tensor']` + 兼容 `torchvision>=0.20` 与 `gfpgan/basicsr` 旧引用 + +3. **ffprobe 帧率与帧数估算修复** + - `_get_video_info()` 从 `csv` 切到 `json` 字段访问,避免 `nb_frames` 缺失导致字段错位 + - fps 取值改为优先 `avg_frame_rate`,`r_frame_rate` 仅作为 fallback + +4. 
**轨迹帧数与贴回检查修复** + - `_build_face_track()` 记录 ffmpeg 实际读帧数,覆盖估算 `nb_frames` + - `blend_back()` 放宽检查为 `lipsync <= original` 正常贴回,仅 `>` 报错 + +5. **空输出防护** + - `blend_back()` 增加 `ls_frames <= 0` 异常分支 + - 由外层 `FAIL_OPEN` 捕获并回退常规路径,避免写出空视频 + +6. **时基对齐修复(慢动作/重影)** + - `_crop_and_upscale_video()` 输出 fps 改为跟随源视频 fps,避免增强视频时间轴拉伸 + - `blend_back()` 按 `orig_fps/ls_fps` 映射原始帧索引,避免只贴回前段帧导致动作变慢/重影 + +7. **无声视频修复** + - 小脸贴回成功后新增音轨封装(mux)步骤 + - 强制将当前任务 `audio_path` 封装回贴回视频,防止增强路径无声音 + +8. **眼部重影修复** + - 口型 mask 起点进一步下移到 68%,并增加左右 16% 留白,减少眼周/鼻翼参与融合 + - `seamlessClone` 后对结果做 mask 限域二次融合,抑制 Poisson 扩散到眼部上方 + +9. **畸形规避(运行侧)** + - `LIPSYNC_SMALL_FACE_THRESHOLD=9999` 仅用于链路冒烟,不用于质量评估 + - 质量验证前统一恢复 `LIPSYNC_SMALL_FACE_THRESHOLD=256` + +--- + +## ✅ 5) 部署文档与验证 + +新增并回写部署文档:`Docs/FACEENHANCE_DEPLOY.md`。 + +文档修正点: + +- 健康检查地址修正为:`/api/videos/lipsync/health` +- 响应示例补齐 `success/data` 外层包装 + +实际验证要点: + +- `GET /api/videos/lipsync/health` 返回 `data.small_face_enhance` +- 默认 `enabled=false`,开关关闭时行为与旧版一致 +- `detector_loaded=false`(懒加载)符合预期 + +--- + +## 📁 今日修改文件 + +| 文件 | 改动 | +|------|------| +| `backend/app/core/config.py` | 新增 `LIPSYNC_SMALL_FACE_*` 配置项(5 个) | +| `backend/.env` | 增加小脸增强开关与参数 | +| `backend/app/services/small_face_enhance_service.py` | 新增:检测/裁切/超分/贴回主服务;后续补丁含懒加载与兼容修复 | +| `backend/app/services/lipsync_service.py` | 集成增强链路、抽取 `_run_selected_model`、health 增强状态 | +| `backend/requirements.txt` | 新增 `opencv-python-headless`、`gfpgan` | +| `models/FaceEnhance/GFPGANv1.4.pth` | 新增超分权重 | +| `Docs/FACEENHANCE_DEPLOY.md` | 新增部署文档并修正健康检查路径/返回示例 | + +--- + +## ⚠️ 已知限制 + +- 仅本地唇形路径接入(`_local_generate()`);远程模式未接入小脸补偿 +- 多镜头场景当前仍为全局判定,暂不做逐段小脸判定 +- v1 优先单人自拍稳定性,多人脸切换策略后续再补 diff --git a/Docs/FACEENHANCE_DEPLOY.md b/Docs/FACEENHANCE_DEPLOY.md new file mode 100644 index 0000000..03a0914 --- /dev/null +++ b/Docs/FACEENHANCE_DEPLOY.md @@ -0,0 +1,428 @@ +# 小脸口型质量补偿链路部署指南 + +> **更新时间**:2026-03-10 v1.4 +> **适用版本**:SmallFaceEnhance v1.4 (内嵌于 Backend 进程) +> **架构**:LipSyncService 内部模块,无独立进程 + +--- + +## 架构概览 
+ +小脸口型质量补偿链路(简称"小脸增强")作为 `LipSyncService._local_generate()` 的**前处理分支**,在 lipsync 推理前自动检测小脸并增强输入质量: + +``` +原视频 + 音频 + → video looping (已有逻辑) + → 小脸检测 (SCRFD, CPU) + → [非小脸] 直接用用户所选模型推理 (现有路径) + → [小脸] + A. 裁切主脸区域 (带 padding) + B. 稀疏关键帧超分到 512px (GFPGAN, GPU0) + C. 用用户所选模型推理 (MuseTalk 或 LatentSync) + D. 下半脸 mask 羽化 + seamlessClone 贴回原帧 + → 进入现有后续流程 (字幕/BGM/上传) +``` + +**关键约束**: +- 不改前端、不改 API 协议 +- 模型选择权归用户,不因小脸自动换模型 +- 默认 fail-open:增强链任何一步失败,自动回退原流程 +- 无独立进程/PM2,跟随 `vigent2-backend` 运行 + +--- + +## 硬件要求 + +| 配置 | 说明 | +|------|------| +| 检测器 | SCRFD (det_10g.onnx),CPU 推理,无额外 GPU 开销 | +| 超分 | GFPGAN,GPU0 (与 MuseTalk 同卡,顺序执行),约 2-3GB 显存 | +| 内存 | 流式 ffmpeg pipe 逐帧处理,不额外占用大量内存 | + +> 超分与 MuseTalk 共享 GPU0,顺序执行不会同时占用显存。 + +--- + +## 依赖安装 + +### 1. pip 依赖 + +已在 `backend/requirements.txt` 中添加: + +``` +opencv-python-headless>=4.8.0 +gfpgan>=1.3.8 +``` + +安装: + +```bash +cd /home/rongye/ProgramFiles/ViGent2/backend +pip install opencv-python-headless gfpgan +``` + +> `gfpgan` 会自动拉取 `basicsr`、`facexlib` 等依赖。 +> `onnxruntime` 需单独确认已安装(LatentSync 环境中已有 1.23.2)。 +> 如果 backend 虚拟环境中缺少 onnxruntime,需额外安装:`pip install onnxruntime` + +### 2. 
系统依赖 + +- `ffmpeg` / `ffprobe`:已有(视频处理必需) + +--- + +## 模型权重 + +### 目录结构 + +``` +models/ +├── FaceEnhance/ +│ └── GFPGANv1.4.pth ← 超分权重 (~333MB) +└── LatentSync/checkpoints/auxiliary/ + └── models/buffalo_l/ + └── det_10g.onnx ← 人脸检测权重 (~16MB, 复用已有) +``` + +### 下载方式 + +**GFPGAN 权重**(已下载): + +```bash +cd /home/rongye/ProgramFiles/ViGent2/models/FaceEnhance +wget -O GFPGANv1.4.pth "https://github.com/TencentARC/GFPGAN/releases/download/v1.3.4/GFPGANv1.4.pth" +``` + +**SCRFD 检测器权重**: + +复用 LatentSync 已有的 `det_10g.onnx`,无需额外下载。代码自动引用路径: +`models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx` + +> 权重缺失时自动 fail-open 跳过增强,不会导致任务失败。 + +--- + +## 后端配置 + +`backend/.env` 中的相关变量: + +```ini +# =============== 小脸口型质量补偿链路 =============== +LIPSYNC_SMALL_FACE_ENHANCE=false # 总开关 (true/false) +LIPSYNC_SMALL_FACE_THRESHOLD=256 # 触发阈值 (像素,脸宽 < 此值触发) +LIPSYNC_SMALL_FACE_UPSCALER=gfpgan # 超分模型: gfpgan | codeformer +LIPSYNC_SMALL_FACE_GPU_ID=0 # 超分 GPU (与 MuseTalk 同卡) +LIPSYNC_SMALL_FACE_FAIL_OPEN=true # 失败回退 (true=回退原流程, false=报错) +``` + +`backend/app/core/config.py` 中的默认值: + +```python +LIPSYNC_SMALL_FACE_ENHANCE: bool = False +LIPSYNC_SMALL_FACE_THRESHOLD: int = 256 +LIPSYNC_SMALL_FACE_UPSCALER: str = "codeformer" +LIPSYNC_SMALL_FACE_GPU_ID: int = 0 +LIPSYNC_SMALL_FACE_FAIL_OPEN: bool = True +``` + +> `.env` 优先于 `config.py` 默认值。`config.py` 仅在 `.env` 未设置时生效。 + +### 模块内部常量 + +以下参数固定为代码常量(`small_face_enhance_service.py`),暂不走 env: + +| 常量 | 值 | 说明 | +|------|-----|------| +| `PADDING` | 0.28 | bbox 外扩比例 | +| `DETECT_EVERY` | 8 | 每 N 帧检测,中间帧 EMA 插值 | +| `TARGET_SIZE` | 512 | 超分目标尺寸 | +| `MASK_FEATHER` | 15 | 下半脸 mask 羽化像素 | +| `MASK_UPPER_RATIO` | 0.68 | 口型 mask 起始位置 (crop 高度的 68%,仅覆盖嘴部/下巴) | +| `MASK_SIDE_MARGIN` | 0.16 | 左右留白比例,避免改动面颊/鼻翼 | +| `SAMPLE_FRAMES` | 24 | 小脸判定采样帧数 | +| `SAMPLE_WINDOW` | (0.10, 0.30) | 采样窗口 (视频 10%~30%) | +| `ENCODE_FPS` | 25 | 中间视频编码帧率 fallback(优先跟随源视频 fps,源 fps 不可用时回退 25) | +| `ENCODE_CRF` | 18 | 中间视频编码质量 | +| `EMA_ALPHA` | 0.3 | bbox EMA 平滑系数 
| + +--- + +## 启用与验证 + +### 1. 开启小脸口型质量补偿链路 + +```bash +# 编辑 backend/.env +LIPSYNC_SMALL_FACE_ENHANCE=true +``` + +重启后端: + +```bash +pm2 restart vigent2-backend +``` + +### 2. 强制触发测试 + +设置极大阈值,使任何视频都触发增强: + +```ini +LIPSYNC_SMALL_FACE_THRESHOLD=9999 +``` + +> 仅用于链路冒烟测试,不用于质量评估。`9999` 会强制大脸素材进入增强分支,可能出现中脸变形/鼻翼细节异常。 + +提交一个视频任务,检查日志: + +```bash +pm2 logs vigent2-backend --lines 50 +``` + +应看到类似输出: + +``` +小脸增强: face_w=320px < threshold=9999px, 触发增强 +✅ SCRFD 检测器已加载 +✅ 超分器已加载: gfpgan +小脸增强: face_w=320px threshold=9999px enhanced=True upscaler=gfpgan time=12.3s +✅ 小脸增强 + 唇形同步完成: /path/to/output.mp4 +``` + +### 3. 调回正常阈值 + +验证通过后,改回合理阈值: + +```ini +LIPSYNC_SMALL_FACE_THRESHOLD=256 +``` + +并重启 backend:`pm2 restart vigent2-backend`。 + +### 4. 健康检查 + +```bash +curl http://localhost:8006/api/videos/lipsync/health | python3 -m json.tool +``` + +应包含 `data.small_face_enhance`: + +```json +{ + "success": true, + "data": { + "small_face_enhance": { + "enabled": true, + "threshold": 256, + "detector_loaded": true + } + } +} +``` + +--- + +## 相关文件 + +| 文件 | 说明 | +|------|------| +| `backend/app/services/small_face_enhance_service.py` | 小脸增强主服务 (检测 + 裁切 + 超分 + 贴回) | +| `backend/app/services/lipsync_service.py` | 混合路由 + 小脸增强集成 + `_run_selected_model()` | +| `backend/app/core/config.py` | `LIPSYNC_SMALL_FACE_*` 配置项 | +| `models/FaceEnhance/GFPGANv1.4.pth` | GFPGAN 超分权重 | +| `models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx` | SCRFD 检测器权重 (复用) | +| `Temp/小脸增强分支-实施计划.md` | 详细方案文档 | + +--- + +## 处理流程详解 + +### 1. 检测阶段 (CPU) + +- 从视频 10%~30% 区间均匀采 24 帧 +- SCRFD (det_10g.onnx) 检测最大脸,取中位数脸宽 +- `脸宽 < THRESHOLD` 时触发增强 + +### 2. 裁切 + 轨迹 (CPU) + +- 每 8 帧检测人脸 bbox,中间帧 EMA 插值平滑 +- bbox + 0.28 padding 外扩,clamp 到帧边界 +- 实际读取帧数回写 `track.frame_count`,修正 ffprobe 估算偏差 +- ffmpeg pipe 流式裁切,输出 512x512 视频 + +### 3. 超分 (GPU0) + +- 检测帧 (每 8 帧):GFPGAN 全量超分 +- 非检测帧:bicubic resize 到 512x512 +- 增强视频输出 fps 跟随源视频 fps(不再固定写 25fps),避免时基拉伸 +- 推理后自动 `torch.cuda.empty_cache()` + +### 4. 
Lipsync 推理 + +- 用户选择的模型 (fast/default/advanced) 对增强后的人脸视频推理 +- 模型选择语义不变 + +### 5. 贴回 (CPU) + +- 口型局部 mask (从 68% 高度开始 + 左右留白 16%) + 高斯羽化 15px(仅覆盖嘴部/下巴) +- `cv2.seamlessClone(NORMAL_CLONE)` 贴回原帧 +- 对 seamlessClone 结果再按 mask 区域做二次 alpha 限域,避免融合扩散到眼部上方 +- seamlessClone 失败时 fallback alpha 混合 +- 贴回按时间轴映射原始帧索引(`orig_fps/ls_fps`),避免只使用前段帧导致动作变慢/重影 +- 帧数保护:lipsync 按音频时长输出,帧数通常 <= 原始 looped 视频;仅 `lipsync帧数 > 原始帧数` 时报错,`<=` 时正常贴回 +- 空输出保护:`lipsync帧数 <= 0` 直接抛异常,外层 `FAIL_OPEN` 回退原流程,避免写出空视频 +- 音轨封装:贴回后强制复用 `audio_path` 重新 mux 音轨,避免增强路径出现无声视频 + +--- + +## 回滚方案 + +**一级回滚 (秒级)**: + +```ini +LIPSYNC_SMALL_FACE_ENHANCE=false +``` + +重启 backend 即可,所有任务走原流程。 + +**二级回滚 (版本级)**: + +回退 `lipsync_service.py` 增强接入提交,配置项保留但不生效。 + +--- + +## 常见问题 + +### onnxruntime 未安装 + +``` +⚠️ SCRFD 初始化失败: No module named 'onnxruntime' +``` + +**解决**: + +```bash +pip install onnxruntime +``` + +### GFPGAN 权重缺失 + +``` +⚠️ GFPGAN 权重不存在: .../models/FaceEnhance/GFPGANv1.4.pth +``` + +**解决**:参考上方"模型权重"章节下载。权重缺失时超分自动降级为 bicubic resize。 + +### 帧数异常导致 fail-open + +``` +⚠️ 小脸贴回失败,回退原流程: 帧数异常: lipsync=300 > original=250 +``` + +**说明**:v1.1 已放宽帧数检查。lipsync 模型按音频时长输出帧数,通常 <= looped 视频帧数,此时正常贴回。仅当 lipsync 输出帧数**大于**原始帧数时才报错(异常情况)。 + +### lipsync 输出为空导致回退 + +``` +⚠️ 小脸贴回失败,回退原流程: lipsync 输出帧数为 0,跳过贴回 +``` + +**说明**:v1.2 新增空输出保护。`ls_frames <= 0` 时立即抛错,由外层 fail-open 回退到常规唇形路径,避免生成空视频文件。 + +### 增强后动作变慢 / 眼睛重影 + +**原因**:原视频与 lipsync 输出 fps 不一致时,若按同帧号直接贴回,可能出现时间轴错位(只贴回前段帧)。 + +**修复**:v1.3 已改为按 `orig_fps/ls_fps` 做时间轴映射,贴回阶段使用时间对应帧而非同索引帧,同时增强视频输出 fps 跟随源 fps。 + +**进一步修复(v1.4)**: +- mask 起点进一步下移到 68%,并增加左右 16% 留白,减少眼周/鼻翼参与融合 +- 对 seamlessClone 输出增加 mask 限域,防止 Poisson 扩散造成眼部上方重影 + +### 增强后脸部畸形(鼻翼/中脸异常) + +**高概率原因**:使用了测试阈值 `LIPSYNC_SMALL_FACE_THRESHOLD=9999`,把本不需要增强的大脸素材强制送入补偿链路。 + +**建议处理**: +- 先改回 `LIPSYNC_SMALL_FACE_THRESHOLD=256` 并重启 backend。 +- 如仍有异常,临时关闭 `LIPSYNC_SMALL_FACE_ENHANCE=false` 做 A/B 对比,再继续调参。 + +### 增强后无声音 + +**原因**:贴回阶段 rawvideo 写出默认不带音轨。 + +**修复**:v1.3 已在贴回后强制执行音轨封装(mux),使用当前任务 `audio_path` 写回音频。 + +> v1.0 
使用严格一致性检查(`lipsync != original` 即失败),在 looped 视频帧数远大于音频帧数时会误判失败。v1.1 已修复。 + +### 增强后口型有偏移 + +检查 `PADDING` 常量是否合理。过小的 padding 可能导致裁切区域不够,过大会引入太多背景。当前默认 0.28 (28%) 适用于大多数单人自拍场景。 + +### torchvision 兼容性 (functional_tensor) + +``` +No module named 'torchvision.transforms.functional_tensor' +``` + +**原因**:torchvision >= 0.20 移除了 `functional_tensor` 模块,但 `basicsr`(gfpgan 依赖)仍引用。 + +**解决**:代码已内置兼容 shim(`_ensure_upscaler()` 中自动注入 `sys.modules`),无需手动处理。如仍出现,检查 `_ensure_upscaler` 方法是否正常执行。 + +### cv2/numpy 未安装 + +``` +⚠️ cv2 未安装,小脸增强不可用 +``` + +**说明**:`cv2` 和 `numpy` 为 lazy import(`try/except`),缺失时小脸增强自动禁用,不影响后端启动和其他功能。安装 `opencv-python-headless` 即可恢复。 + +--- + +## 已知限制 (v1.4) + +- 仅覆盖本地 lipsync 路径 (`_local_generate()`),远程模式 (`_remote_generate()`) 暂不接入 +- 多镜头仅全局判定,不做逐段小脸检测 +- 仅保证单人 (主脸) 场景稳定,不做多人脸切换 +- CodeFormer 超分需额外安装 `basicsr`,当前推荐使用 GFPGAN + +--- + +## v1.3 → v1.4 变更记录 + +| 修复项 | 说明 | +|--------|------| +| 眼部重影修复 | mask 起点下移到 68% + 左右 16% 留白,减少上半脸与鼻翼参与融合 | +| Poisson 扩散抑制 | seamlessClone 后按 mask 二次限域,避免眼部上方 ghosting | + +--- + +## v1.2 → v1.3 变更记录 + +| 修复项 | 说明 | +|--------|------| +| 时基修复 | `_crop_and_upscale_video()` 输出 fps 跟随源视频 fps,避免增强视频时间轴被拉伸 | +| 贴回对齐修复 | `blend_back()` 改为按 `orig_fps/ls_fps` 映射原始帧索引,减少动作变慢与重影 | +| 音轨修复 | 贴回成功后新增音轨封装(mux),避免增强路径无声音 | + +--- + +## v1.1 → v1.2 变更记录 + +| 修复项 | 说明 | +|--------|------| +| 空输出保护 | `blend_back()` 新增 `ls_frames <= 0` 判断,直接抛错并由外层 fail-open 回退,避免写出空视频 | + +--- + +## v1.0 → v1.1 变更记录 + +| 修复项 | 说明 | +|--------|------| +| ffprobe 解析 | CSV → JSON 格式,字段名访问,不再受 `nb_frames` 缺失导致的字段错位影响 | +| fps 选取 | 优先 `avg_frame_rate`(真实平均帧率),`r_frame_rate` 作为 fallback;避免 `60/1` 等 timebase 倍数导致帧数估算偏大 | +| 实际帧数回写 | `_build_face_track()` 用 ffmpeg 实际读到的帧数覆盖估算值,`track.frame_count` 更准确 | +| 贴回帧数检查 | 放宽为 `lipsync <= original` 时正常贴回,仅 `>` 时报错;适配 MuseTalk/LatentSync 按音频时长输出的行为 | +| 边界防护 | `streams` 为空时 return None;`r_frame_rate` 分母为 0 时 fallback 25fps | +| torchvision 兼容 | `_ensure_upscaler()` 中注入 `functional_tensor` shim,兼容 torchvision >= 0.20 | +| lazy 
import | `cv2`/`numpy` 包装在 `try/except`,缺失时增强自动禁用不影响后端启动 | +| 类型注解 | `from __future__ import annotations` 避免依赖缺失时 `np.ndarray` 等注解触发 NameError | diff --git a/Docs/task_complete.md b/Docs/task_complete.md index b46833f..a3669cb 100644 --- a/Docs/task_complete.md +++ b/Docs/task_complete.md @@ -1,8 +1,8 @@ # ViGent2 开发任务清单 (Task Log) **项目**: ViGent2 数字人口播视频生成系统 -**进度**: 100% (Day 34 - 多镜头时间轴重构 + 文案深度学习弹窗防误触关闭) -**更新时间**: 2026-03-09 +**进度**: 100% (Day 35 - 小脸口型质量补偿落地 + 部署验证) +**更新时间**: 2026-03-10 --- @@ -10,7 +10,25 @@ > 这里记录了每一天的核心开发内容与 milestone。 -### Day 34: 多镜头时间轴重构 + 文案深度学习弹窗防误触关闭 + Code Review 修复 (Current) +### Day 35: 小脸口型质量补偿落地 + 部署验证 + 稳定性补丁 (Current) +- [x] **小脸口型质量补偿落地**: 新增 `small_face_enhance_service.py`,实现 SCRFD 小脸检测(10%-30% 采样)-> 裁切轨迹(每 8 帧检测 + EMA)-> 稀疏关键帧超分(GFPGAN/CodeFormer)-> 下半脸贴回(seamlessClone/alpha fallback)完整链路。 +- [x] **后端集成完成**: `lipsync_service.py` 在 `_local_generate()` 内完成 looping 后插入增强,抽取 `_run_selected_model()` 统一模型路由,增强失败按 `FAIL_OPEN` 自动回退原流程。 +- [x] **配置与依赖**: 新增 5 个 `LIPSYNC_SMALL_FACE_*` 配置项;`requirements.txt` 增加 `opencv-python-headless`、`gfpgan`;新增 `models/FaceEnhance/GFPGANv1.4.pth` 权重目录。 +- [x] **部署文档新增**: 新增并回写 `Docs/FACEENHANCE_DEPLOY.md`,补齐部署、权重、开关、验证、回滚说明。 +- [x] **线上稳定性修复**: + - `small_face_enhance_service.py` 增加 `cv2/numpy` 懒加载守卫,缺依赖时跳过增强不影响主流程。 + - 增加 `from __future__ import annotations`,避免 `np.ndarray` 注解在缺依赖场景导入期报错。 + - 增加 `torchvision.transforms.functional_tensor` shim,修复 `torchvision>=0.20` 下 GFPGAN 初始化失败。 + - `_get_video_info()` 改为 JSON 字段解析并优先 `avg_frame_rate`,修复 `nb_frames` 缺失导致的帧数估算偏差。 + - `_build_face_track()` 回写实际读帧数;`blend_back()` 帧数校验放宽为 `lipsync <= original` 正常贴回,仅 `>` 报错。 + - `blend_back()` 新增 `ls_frames <= 0` 空输出保护,异常时由 `FAIL_OPEN` 回退常规路径,避免写出空视频。 + - 时基修复:增强视频输出 fps 跟随源视频 fps;贴回按 `orig_fps/ls_fps` 映射原始帧索引,修复动作变慢与重影。 + - 音轨修复:贴回成功后新增 mux 音轨步骤,确保小脸增强路径输出视频包含声音。 + - 眼部重影修复:mask 起点下移到 68% 并增加左右 16% 留白,对 seamlessClone 结果做 mask 限域二次融合,减少眼部上方 ghosting。 + - 运行策略收口:`LIPSYNC_SMALL_FACE_THRESHOLD=9999` 
仅用于链路冒烟,质量验证与日常运行统一回归 `256`。 +- [x] **部署校验通过**: `GET /api/videos/lipsync/health` 已返回 `data.small_face_enhance`;默认 `enabled=false`,开关关闭下行为与原流程一致。 + +### Day 34: 多镜头时间轴重构 + 文案深度学习弹窗防误触关闭 + Code Review 修复 - [x] **时间轴模型重构**: 多素材从”等分顺序片段”升级为”主素材连续播放 + 插入镜头块”,支持自由插入、拖拽移动。 - [x] **前端链路落地**: 重写 `useTimelineEditor` 与 `TimelineEditor`,新增主素材/插入候选语义,`useHomeController` / `HomePage` / `MaterialSelector` 全链路适配。 - [x] **后端生成链路适配**: `workflow.py` 完成 `material_paths` 来源修正、`custom_assignments` 新校验、素材下载去重与段处理并发限制,保持单素材兼容。 diff --git a/README.md b/README.md index 4a0166b..9bf5f3b 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,8 @@ ## ✨ 功能特性 ### 核心能力 -- 🎬 **高清唇形同步** - 混合方案:短视频(本仓库当前 `.env` 阈值 100s,可配)用 LatentSync 1.6(高质量 Latent Diffusion),长视频用 MuseTalk 1.5(实时级单步推理),自动路由 + 回退。前端可选模型:默认模型(阈值自动路由)/ 快速模型(速度优先)/ 高级模型(质量优先)。 +- 🎬 **高清唇形同步** - 混合方案:短视频(本仓库当前 `.env` 阈值 100s,可配)用 LatentSync 1.6(高质量 Latent Diffusion),长视频用 MuseTalk 1.5(实时级单步推理),自动路由 + 回退。前端可选模型:默认模型(阈值自动路由)/ 快速模型(速度优先)/ 高级模型(质量优先)。 +- 🧠 **小脸口型质量补偿(可选)** - 本地唇形路径支持小脸检测 + 裁切 + 稀疏关键帧超分 + 下半脸贴回补偿链路;默认关闭(`LIPSYNC_SMALL_FACE_ENHANCE=false`),失败自动回退原流程(fail-open)。 - 🎙️ **多模态配音** - 支持 **EdgeTTS** (微软超自然语音, 10 语言) 和 **CosyVoice 3.0** (3秒极速声音克隆, 9语言+18方言, 语速/语气可调)。上传参考音频自动 Whisper 转写 + 智能截取。配音前置工作流:先生成配音 → 选素材 → 生成视频。 - 📝 **智能字幕** - 集成 faster-whisper + Remotion,自动生成逐字高亮 (卡拉OK效果) 字幕。 - 🎨 **样式预设** - 12 种标题 + 8 种字幕样式预设,支持预览 + 字号调节 + 自定义字体库。CSS 原生描边渲染,清晰无重影。 @@ -67,9 +68,10 @@ - **[部署手册 (DEPLOY_MANUAL.md)](Docs/DEPLOY_MANUAL.md)** - 👈 **部署请看这里**!包含完整的环境搭建步骤。 - [多平台发布部署说明 (PUBLISH_DEPLOY.md)](Docs/PUBLISH_DEPLOY.md) - 抖音/微信视频号/B站/小红书登录与自动化发布专项文档。 - [参考音频服务部署 (COSYVOICE3_DEPLOY.md)](Docs/COSYVOICE3_DEPLOY.md) - 声音克隆模型部署指南。 -- [LatentSync 部署指南 (LATENTSYNC_DEPLOY.md)](Docs/LATENTSYNC_DEPLOY.md) - 唇形同步模型独立部署。 -- [MuseTalk 部署指南 (MUSETALK_DEPLOY.md)](Docs/MUSETALK_DEPLOY.md) - 长视频唇形同步模型部署。 -- [Supabase 部署指南 (SUPABASE_DEPLOY.md)](Docs/SUPABASE_DEPLOY.md) - Supabase 与认证系统配置。 +- [LatentSync 部署指南 (LATENTSYNC_DEPLOY.md)](Docs/LATENTSYNC_DEPLOY.md) - 
唇形同步模型独立部署。
+- [MuseTalk 部署指南 (MUSETALK_DEPLOY.md)](Docs/MUSETALK_DEPLOY.md) - 长视频唇形同步模型部署。
+- [小脸口型质量补偿链路部署指南 (FACEENHANCE_DEPLOY.md)](Docs/FACEENHANCE_DEPLOY.md) - 小脸口型质量补偿链路部署与验证。
+- [Supabase 部署指南 (SUPABASE_DEPLOY.md)](Docs/SUPABASE_DEPLOY.md) - Supabase 与认证系统配置。
 - [支付宝部署指南 (ALIPAY_DEPLOY.md)](Docs/ALIPAY_DEPLOY.md) - 支付宝付费开通会员配置。

 ### 开发文档
diff --git a/backend/.env.example b/backend/.env.example
index a930669..ea5651e 100644
--- a/backend/.env.example
+++ b/backend/.env.example
@@ -83,6 +83,13 @@ MUSETALK_ENCODE_PRESET=slow
 # 音频时长 >= 此阈值(秒)用 MuseTalk,< 此阈值用 LatentSync
 LIPSYNC_DURATION_THRESHOLD=100

+# =============== 小脸口型质量补偿链路 ===============
+LIPSYNC_SMALL_FACE_ENHANCE=false
+LIPSYNC_SMALL_FACE_THRESHOLD=256
+LIPSYNC_SMALL_FACE_UPSCALER=gfpgan
+LIPSYNC_SMALL_FACE_GPU_ID=0
+LIPSYNC_SMALL_FACE_FAIL_OPEN=true
+
 # =============== 上传配置 ===============
 # 最大上传文件大小 (MB)
 MAX_UPLOAD_SIZE_MB=500
diff --git a/backend/app/core/config.py b/backend/app/core/config.py
index cb62f3e..e30977f 100644
--- a/backend/app/core/config.py
+++ b/backend/app/core/config.py
@@ -37,22 +37,22 @@ class Settings(BaseSettings):
     DOUYIN_BROWSER_CHANNEL: str = ""
     DOUYIN_FORCE_SWIFTSHADER: bool = True

-    # Douyin 调试录屏
-    DOUYIN_DEBUG_ARTIFACTS: bool = False
-    DOUYIN_RECORD_VIDEO: bool = False
-    DOUYIN_KEEP_SUCCESS_VIDEO: bool = False
-    DOUYIN_RECORD_VIDEO_WIDTH: int = 1280
-    DOUYIN_RECORD_VIDEO_HEIGHT: int = 720
-
-    # Xiaohongshu Playwright 配置
-    XIAOHONGSHU_HEADLESS_MODE: str = "headless-new"
-    XIAOHONGSHU_USER_AGENT: str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36"
-    XIAOHONGSHU_LOCALE: str = "zh-CN"
-    XIAOHONGSHU_TIMEZONE_ID: str = "Asia/Shanghai"
-    XIAOHONGSHU_CHROME_PATH: str = "/usr/bin/google-chrome"
-    XIAOHONGSHU_BROWSER_CHANNEL: str = ""
-    XIAOHONGSHU_FORCE_SWIFTSHADER: bool = True
-    XIAOHONGSHU_DEBUG_ARTIFACTS: bool = False
+    # Douyin 调试录屏
+    DOUYIN_DEBUG_ARTIFACTS: bool = False
+    DOUYIN_RECORD_VIDEO: bool = False
+
DOUYIN_KEEP_SUCCESS_VIDEO: bool = False + DOUYIN_RECORD_VIDEO_WIDTH: int = 1280 + DOUYIN_RECORD_VIDEO_HEIGHT: int = 720 + + # Xiaohongshu Playwright 配置 + XIAOHONGSHU_HEADLESS_MODE: str = "headless-new" + XIAOHONGSHU_USER_AGENT: str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36" + XIAOHONGSHU_LOCALE: str = "zh-CN" + XIAOHONGSHU_TIMEZONE_ID: str = "Asia/Shanghai" + XIAOHONGSHU_CHROME_PATH: str = "/usr/bin/google-chrome" + XIAOHONGSHU_BROWSER_CHANNEL: str = "" + XIAOHONGSHU_FORCE_SWIFTSHADER: bool = True + XIAOHONGSHU_DEBUG_ARTIFACTS: bool = False # TTS 配置 DEFAULT_TTS_VOICE: str = "zh-CN-YunxiNeural" @@ -78,6 +78,13 @@ class Settings(BaseSettings): # 混合唇形同步路由 LIPSYNC_DURATION_THRESHOLD: float = 120.0 # 秒,>=此值用 MuseTalk + # 小脸口型质量补偿链路 + LIPSYNC_SMALL_FACE_ENHANCE: bool = False + LIPSYNC_SMALL_FACE_THRESHOLD: int = 256 + LIPSYNC_SMALL_FACE_UPSCALER: str = "codeformer" + LIPSYNC_SMALL_FACE_GPU_ID: int = 0 + LIPSYNC_SMALL_FACE_FAIL_OPEN: bool = True + # Supabase 配置 SUPABASE_URL: str = "" SUPABASE_PUBLIC_URL: str = "" # 公网访问地址,用于生成前端可访问的 URL diff --git a/backend/app/services/lipsync_service.py b/backend/app/services/lipsync_service.py index ee3dae1..0ed2f29 100644 --- a/backend/app/services/lipsync_service.py +++ b/backend/app/services/lipsync_service.py @@ -11,12 +11,13 @@ import asyncio import httpx from pathlib import Path from loguru import logger -from typing import Optional, Literal +from typing import Optional, Literal from app.core.config import settings +from app.services.small_face_enhance_service import SmallFaceEnhanceService -class LipSyncService: +class LipSyncService: """唇形同步服务 - LatentSync 1.6 + MuseTalk 1.5 混合方案""" def __init__(self): @@ -38,6 +39,9 @@ class LipSyncService: # 运行时检测 self._weights_available: Optional[bool] = None + + # 小脸增强 + self._face_enhance = SmallFaceEnhanceService() def _check_weights(self) -> bool: """检查模型权重是否存在""" @@ -93,7 +97,7 @@ class LipSyncService: 
logger.warning(f"⚠️ 获取媒体时长失败: {e}") return None - def _loop_video_to_duration(self, video_path: str, output_path: str, target_duration: float) -> str: + def _loop_video_to_duration(self, video_path: str, output_path: str, target_duration: float) -> str: """ 循环视频以匹配目标时长 使用 FFmpeg stream_loop 实现无缝循环 @@ -117,47 +121,70 @@ class LipSyncService: else: logger.warning(f"⚠️ 视频循环失败: {result.stderr[:200]}") return video_path - except Exception as e: - logger.warning(f"⚠️ 视频循环异常: {e}") - return video_path + except Exception as e: + logger.warning(f"⚠️ 视频循环异常: {e}") + return video_path + + def _mux_audio_to_video(self, video_path: str, audio_path: str, output_path: str) -> bool: + """将音轨封装到视频,避免增强路径出现无声输出。""" + try: + cmd = [ + "ffmpeg", "-y", + "-i", video_path, + "-i", audio_path, + "-map", "0:v:0", + "-map", "1:a:0", + "-c:v", "copy", + "-c:a", "aac", + "-shortest", + output_path, + ] + result = subprocess.run(cmd, capture_output=True, text=True, timeout=120) + if result.returncode == 0 and Path(output_path).exists(): + return True + logger.warning(f"⚠️ 音轨封装失败: {result.stderr[:200]}") + return False + except Exception as e: + logger.warning(f"⚠️ 音轨封装异常: {e}") + return False - async def generate( - self, - video_path: str, - audio_path: str, - output_path: str, - fps: int = 25, - model_mode: Literal["default", "fast", "advanced"] = "default", - ) -> str: - """生成唇形同步视频""" - logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}") - Path(output_path).parent.mkdir(parents=True, exist_ok=True) - - normalized_mode: Literal["default", "fast", "advanced"] = model_mode - if normalized_mode not in ("default", "fast", "advanced"): - normalized_mode = "default" - logger.info(f"🧠 Lipsync 模式: {normalized_mode}") - - if self.use_local: - return await self._local_generate(video_path, audio_path, output_path, fps, normalized_mode) - else: - return await self._remote_generate(video_path, audio_path, output_path, fps, normalized_mode) + async def generate( + self, + 
video_path: str, + audio_path: str, + output_path: str, + fps: int = 25, + model_mode: Literal["default", "fast", "advanced"] = "default", + ) -> str: + """生成唇形同步视频""" + logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}") + Path(output_path).parent.mkdir(parents=True, exist_ok=True) + + normalized_mode: Literal["default", "fast", "advanced"] = model_mode + if normalized_mode not in ("default", "fast", "advanced"): + normalized_mode = "default" + logger.info(f"🧠 Lipsync 模式: {normalized_mode}") + + if self.use_local: + return await self._local_generate(video_path, audio_path, output_path, fps, normalized_mode) + else: + return await self._remote_generate(video_path, audio_path, output_path, fps, normalized_mode) - async def _local_generate( - self, - video_path: str, - audio_path: str, - output_path: str, - fps: int, - model_mode: Literal["default", "fast", "advanced"], - ) -> str: - """使用 subprocess 调用 LatentSync conda 环境""" - - logger.info("⏳ 等待 GPU 资源 (排队中)...") - async with self._lock: - # 使用临时目录存放中间文件 - with tempfile.TemporaryDirectory() as tmpdir: - tmpdir = Path(tmpdir) + async def _local_generate( + self, + video_path: str, + audio_path: str, + output_path: str, + fps: int, + model_mode: Literal["default", "fast", "advanced"], + ) -> str: + """使用 subprocess 调用 LatentSync conda 环境""" + + logger.info("⏳ 等待 GPU 资源 (排队中)...") + async with self._lock: + # 使用临时目录存放中间文件 + with tempfile.TemporaryDirectory() as tmpdir: + tmpdir = Path(tmpdir) # 获取音频和视频时长 audio_duration = self._get_media_duration(audio_path) @@ -172,133 +199,206 @@ class LipSyncService: str(looped_video), audio_duration ) - else: - actual_video_path = video_path - - # 模型路由 - force_musetalk = model_mode == "fast" - force_latentsync = model_mode == "advanced" - auto_to_musetalk = ( - model_mode == "default" - and audio_duration is not None - and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD - ) - - if force_musetalk: - logger.info("⚡ 强制快速模型:MuseTalk") - musetalk_result = 
await self._call_musetalk_server( - actual_video_path, audio_path, output_path - ) - if musetalk_result: - return musetalk_result - logger.warning("⚠️ MuseTalk 不可用,快速模型回退到 LatentSync") - elif auto_to_musetalk: - logger.info( - f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s,路由到 MuseTalk" - ) - musetalk_result = await self._call_musetalk_server( - actual_video_path, audio_path, output_path - ) - if musetalk_result: - return musetalk_result - logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync(长视频,会较慢)") - elif force_latentsync: - logger.info("🎯 强制高级模型:LatentSync") - - # 检查 LatentSync 前置条件(仅在需要回退或使用 LatentSync 时) - if not self._check_conda_env(): - logger.warning("⚠️ Conda 环境不可用,使用 Fallback") - shutil.copy(video_path, output_path) - return output_path - - if not self._check_weights(): - logger.warning("⚠️ 模型权重不存在,使用 Fallback") - shutil.copy(video_path, output_path) - return output_path - - if self.use_server: - # 模式 A: 调用常驻服务 (加速模式) - return await self._call_persistent_server(actual_video_path, audio_path, output_path) + else: + actual_video_path = video_path - logger.info("🔄 调用 LatentSync 推理 (subprocess)...") - - temp_output = tmpdir / "output.mp4" - - # 构建命令 - cmd = [ - str(self.conda_python), - "-m", "scripts.inference", - "--unet_config_path", "configs/unet/stage2_512.yaml", - "--inference_ckpt_path", "checkpoints/latentsync_unet.pt", - "--inference_steps", str(settings.LATENTSYNC_INFERENCE_STEPS), - "--guidance_scale", str(settings.LATENTSYNC_GUIDANCE_SCALE), - "--video_path", str(actual_video_path), # 使用预处理后的视频 - "--audio_path", str(audio_path), - "--video_out_path", str(temp_output), - "--seed", str(settings.LATENTSYNC_SEED), - "--temp_dir", str(tmpdir / "cache"), - ] - - if settings.LATENTSYNC_ENABLE_DEEPCACHE: - cmd.append("--enable_deepcache") - - # 设置环境变量 - env = os.environ.copy() - env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id) - - logger.info(f"🖥️ 执行命令: {' '.join(cmd[:8])}...") - logger.info(f"🖥️ GPU: 
CUDA_VISIBLE_DEVICES={self.gpu_id}") - + # ── 小脸增强 ── + enhance_result = None try: - # 使用 asyncio subprocess 实现真正的异步执行 - # 这样事件循环可以继续处理其他请求(如进度查询) - process = await asyncio.create_subprocess_exec( - *cmd, - cwd=str(self.latentsync_dir), - env=env, - stdout=asyncio.subprocess.PIPE, - stderr=asyncio.subprocess.PIPE, + enhance_result = self._face_enhance.enhance_if_needed( + video_path=str(actual_video_path), + tmpdir=tmpdir, + gpu_id=settings.LIPSYNC_SMALL_FACE_GPU_ID, ) - - # 等待进程完成,带超时 - try: - stdout, stderr = await asyncio.wait_for( - process.communicate(), - timeout=900 # 15分钟超时 - ) - except asyncio.TimeoutError: - process.kill() - await process.wait() - logger.error("⏰ LatentSync 推理超时 (15分钟)") - shutil.copy(video_path, output_path) - return output_path - - stdout_text = stdout.decode() if stdout else "" - stderr_text = stderr.decode() if stderr else "" - - if process.returncode != 0: - logger.error(f"LatentSync 推理失败:\n{stderr_text}") - logger.error(f"stdout:\n{stdout_text[-1000:] if stdout_text else 'N/A'}") - # Fallback - shutil.copy(video_path, output_path) - return output_path - - logger.info(f"LatentSync 输出:\n{stdout_text[-500:] if stdout_text else 'N/A'}") - - # 检查输出文件 - if temp_output.exists(): - shutil.copy(temp_output, output_path) - logger.info(f"✅ 唇形同步完成: {output_path}") - return output_path - else: - logger.warning("⚠️ 未找到输出文件,使用 Fallback") - shutil.copy(video_path, output_path) - return output_path - except Exception as e: - logger.error(f"❌ 推理异常: {e}") - shutil.copy(video_path, output_path) - return output_path + if settings.LIPSYNC_SMALL_FACE_FAIL_OPEN: + logger.warning(f"⚠️ 小脸增强失败,跳过: {e}") + else: + raise + + if enhance_result and enhance_result.was_enhanced: + track = enhance_result.track + if track is None: + raise RuntimeError("小脸增强轨迹缺失") + + # 增强路径:模型推理增强后的人脸视频 → 贴回原视频 + temp_sync = tmpdir / "face_sync.mp4" + await self._run_selected_model( + video_path=enhance_result.video_path, + audio_path=audio_path, + output_path=str(temp_sync), + 
tmpdir=tmpdir, + model_mode=model_mode, + audio_duration=audio_duration, + original_video_path=video_path, + ) + + try: + blended = self._face_enhance.blend_back( + original_video=str(actual_video_path), + lipsync_video=str(temp_sync), + track=track, + tmpdir=tmpdir, + ) + blended_with_audio = tmpdir / "blended_with_audio.mp4" + if not self._mux_audio_to_video( + video_path=str(blended), + audio_path=audio_path, + output_path=str(blended_with_audio), + ): + raise RuntimeError("贴回视频音轨封装失败") + + shutil.copy(str(blended_with_audio), output_path) + logger.info(f"✅ 小脸增强 + 唇形同步完成: {output_path}") + return output_path + except Exception as e: + if settings.LIPSYNC_SMALL_FACE_FAIL_OPEN: + logger.warning(f"⚠️ 小脸贴回失败,回退原流程: {e}") + else: + raise + + # 常规路径(未增强或增强失败) + return await self._run_selected_model( + video_path=str(actual_video_path), + audio_path=audio_path, + output_path=output_path, + tmpdir=tmpdir, + model_mode=model_mode, + audio_duration=audio_duration, + original_video_path=video_path, + ) + + async def _run_selected_model( + self, + video_path: str, + audio_path: str, + output_path: str, + tmpdir: Path, + model_mode: Literal["default", "fast", "advanced"], + audio_duration: Optional[float], + original_video_path: str, + ) -> str: + """模型路由 + 执行(MuseTalk / LatentSync 常驻服务 / LatentSync subprocess)""" + + # 模型路由 + force_musetalk = model_mode == "fast" + force_latentsync = model_mode == "advanced" + auto_to_musetalk = ( + model_mode == "default" + and audio_duration is not None + and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD + ) + + if force_musetalk: + logger.info("⚡ 强制快速模型:MuseTalk") + musetalk_result = await self._call_musetalk_server( + video_path, audio_path, output_path + ) + if musetalk_result: + return musetalk_result + logger.warning("⚠️ MuseTalk 不可用,快速模型回退到 LatentSync") + elif auto_to_musetalk: + logger.info( + f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s,路由到 MuseTalk" + ) + musetalk_result = await 
self._call_musetalk_server( + video_path, audio_path, output_path + ) + if musetalk_result: + return musetalk_result + logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync(长视频,会较慢)") + elif force_latentsync: + logger.info("🎯 强制高级模型:LatentSync") + + # 检查 LatentSync 前置条件 + if not self._check_conda_env(): + logger.warning("⚠️ Conda 环境不可用,使用 Fallback") + shutil.copy(original_video_path, output_path) + return output_path + + if not self._check_weights(): + logger.warning("⚠️ 模型权重不存在,使用 Fallback") + shutil.copy(original_video_path, output_path) + return output_path + + if self.use_server: + # 模式 A: 调用常驻服务 (加速模式) + return await self._call_persistent_server(video_path, audio_path, output_path) + + logger.info("🔄 调用 LatentSync 推理 (subprocess)...") + + temp_output = tmpdir / "output.mp4" + + # 构建命令 + cmd = [ + str(self.conda_python), + "-m", "scripts.inference", + "--unet_config_path", "configs/unet/stage2_512.yaml", + "--inference_ckpt_path", "checkpoints/latentsync_unet.pt", + "--inference_steps", str(settings.LATENTSYNC_INFERENCE_STEPS), + "--guidance_scale", str(settings.LATENTSYNC_GUIDANCE_SCALE), + "--video_path", str(video_path), + "--audio_path", str(audio_path), + "--video_out_path", str(temp_output), + "--seed", str(settings.LATENTSYNC_SEED), + "--temp_dir", str(tmpdir / "cache"), + ] + + if settings.LATENTSYNC_ENABLE_DEEPCACHE: + cmd.append("--enable_deepcache") + + # 设置环境变量 + env = os.environ.copy() + env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id) + + logger.info(f"🖥️ 执行命令: {' '.join(cmd[:8])}...") + logger.info(f"🖥️ GPU: CUDA_VISIBLE_DEVICES={self.gpu_id}") + + try: + process = await asyncio.create_subprocess_exec( + *cmd, + cwd=str(self.latentsync_dir), + env=env, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + ) + + try: + stdout, stderr = await asyncio.wait_for( + process.communicate(), + timeout=900 # 15分钟超时 + ) + except asyncio.TimeoutError: + process.kill() + await process.wait() + logger.error("⏰ LatentSync 推理超时 (15分钟)") + 
shutil.copy(original_video_path, output_path) + return output_path + + stdout_text = stdout.decode() if stdout else "" + stderr_text = stderr.decode() if stderr else "" + + if process.returncode != 0: + logger.error(f"LatentSync 推理失败:\n{stderr_text}") + logger.error(f"stdout:\n{stdout_text[-1000:] if stdout_text else 'N/A'}") + shutil.copy(original_video_path, output_path) + return output_path + + logger.info(f"LatentSync 输出:\n{stdout_text[-500:] if stdout_text else 'N/A'}") + + if temp_output.exists(): + shutil.copy(temp_output, output_path) + logger.info(f"✅ 唇形同步完成: {output_path}") + return output_path + else: + logger.warning("⚠️ 未找到输出文件,使用 Fallback") + shutil.copy(original_video_path, output_path) + return output_path + + except Exception as e: + logger.error(f"❌ 推理异常: {e}") + shutil.copy(original_video_path, output_path) + return output_path async def _call_musetalk_server( self, video_path: str, audio_path: str, output_path: str @@ -413,18 +513,18 @@ class LipSyncService: "请确保 LatentSync 服务已启动 (cd models/LatentSync && python scripts/server.py)" ) - async def _remote_generate( - self, - video_path: str, - audio_path: str, - output_path: str, - fps: int, - model_mode: Literal["default", "fast", "advanced"], - ) -> str: - """调用远程 LatentSync API 服务""" - if model_mode == "fast": - logger.warning("⚠️ 远程模式未接入 MuseTalk,快速模型将使用远程 LatentSync") - logger.info(f"📡 调用远程 API: {self.api_url}") + async def _remote_generate( + self, + video_path: str, + audio_path: str, + output_path: str, + fps: int, + model_mode: Literal["default", "fast", "advanced"], + ) -> str: + """调用远程 LatentSync API 服务""" + if model_mode == "fast": + logger.warning("⚠️ 远程模式未接入 MuseTalk,快速模型将使用远程 LatentSync") + logger.info(f"📡 调用远程 API: {self.api_url}") try: async with httpx.AsyncClient(timeout=600.0) as client: @@ -499,4 +599,9 @@ class LipSyncService: "ready": conda_ok and weights_ok and gpu_ok, "musetalk_ready": musetalk_ready, "lipsync_threshold": settings.LIPSYNC_DURATION_THRESHOLD, + 
"small_face_enhance": { + "enabled": settings.LIPSYNC_SMALL_FACE_ENHANCE, + "threshold": settings.LIPSYNC_SMALL_FACE_THRESHOLD, + "detector_loaded": self._face_enhance._detector_session is not None, + }, } diff --git a/backend/app/services/small_face_enhance_service.py b/backend/app/services/small_face_enhance_service.py new file mode 100644 index 0000000..a24d33e --- /dev/null +++ b/backend/app/services/small_face_enhance_service.py @@ -0,0 +1,872 @@ +""" +小脸增强服务 +远景小脸场景下,裁切 + 超分 -> lipsync 推理 -> 贴回,提升输入质量。 + +单文件单类,供 LipSyncService 调用。 +""" +from __future__ import annotations + +import subprocess +import time +from dataclasses import dataclass, field +from pathlib import Path +from typing import Optional, Tuple, List + +from loguru import logger + +from app.core.config import settings + +try: + import cv2 + import numpy as np + _CV2_AVAILABLE = True +except ImportError: + _CV2_AVAILABLE = False + +# ── 模块常量 ── +PADDING = 0.28 # bbox 外扩比例 +DETECT_EVERY = 8 # 每 N 帧检测一次 +TARGET_SIZE = 512 # 超分目标尺寸 +MASK_FEATHER = 15 # 羽化像素 +MASK_UPPER_RATIO = 0.68 # 口型区域起始位置(仅覆盖嘴部/下巴) +MASK_SIDE_MARGIN = 0.16 # 左右留白比例,避免改动面颊/鼻翼 +SAMPLE_FRAMES = 24 # 采样帧数 +SAMPLE_WINDOW = (0.10, 0.30) # 采样窗口 (10%~30%) +ENCODE_FPS = 25 # 编码帧率 +ENCODE_CRF = 18 # 编码质量 +EMA_ALPHA = 0.3 # EMA 平滑系数 + +# 检测过滤 +MIN_FACE_WIDTH = 50 +FACE_ASPECT_MIN = 0.2 +FACE_ASPECT_MAX = 1.5 +DET_SCORE_THRESH = 0.5 +NMS_IOU_THRESH = 0.4 + +# 权重路径 +_PROJECT_ROOT = Path(__file__).resolve().parent.parent.parent.parent +DET_MODEL_PATH = ( + _PROJECT_ROOT + / "models" / "LatentSync" / "checkpoints" + / "auxiliary" / "models" / "buffalo_l" / "det_10g.onnx" +) + + +# ── 数据结构 ── + +@dataclass +class FaceTrack: + """每帧的人脸追踪数据(用于裁切 + 贴回)""" + crop_boxes: List[Tuple[int, int, int, int]] # 每帧 (x1,y1,x2,y2) + face_width_median: float + frame_count: int + frame_w: int + frame_h: int + + +@dataclass +class EnhanceResult: + """enhance_if_needed 返回值""" + video_path: str + was_enhanced: bool + track: Optional[FaceTrack] = None + face_width: 
float = 0.0 + + +class SmallFaceEnhanceService: + """小脸增强服务:检测 → 裁切 → 超分 → (lipsync) → 贴回""" + + def __init__(self): + self._detector_session = None + self._sr_model = None + self._sr_type: Optional[str] = None + + # ================================================================ + # SCRFD 人脸检测(det_10g.onnx,CPU 推理) + # ================================================================ + + def _ensure_detector(self) -> bool: + if self._detector_session is not None: + return True + if not DET_MODEL_PATH.exists(): + logger.warning(f"⚠️ SCRFD 权重不存在: {DET_MODEL_PATH}") + return False + try: + import onnxruntime as ort + self._detector_session = ort.InferenceSession( + str(DET_MODEL_PATH), + providers=["CPUExecutionProvider"], + ) + logger.info("✅ SCRFD 检测器已加载") + return True + except Exception as e: + logger.warning(f"⚠️ SCRFD 初始化失败: {e}") + return False + + def _detect_faces(self, img_bgr: np.ndarray) -> List[Tuple[np.ndarray, float]]: + """ + 用 SCRFD 检测人脸。 + Returns: [(bbox_xyxy, score), ...] 
按面积降序。 + """ + if self._detector_session is None: + return [] + + h, w = img_bgr.shape[:2] + input_h, input_w = 640, 640 + + # ── Preprocess ── + ratio = min(input_h / h, input_w / w) + new_h, new_w = int(h * ratio), int(w * ratio) + resized = cv2.resize(img_bgr, (new_w, new_h)) + + padded = np.full((input_h, input_w, 3), 127.5, dtype=np.float32) + padded[:new_h, :new_w] = resized.astype(np.float32) + + # BGR → RGB → normalize + blob = padded[:, :, ::-1].copy() + blob = (blob - 127.5) / 128.0 + blob = blob.transpose(2, 0, 1)[np.newaxis].astype(np.float32) + + # ── Inference ── + input_name = self._detector_session.get_inputs()[0].name + outputs = self._detector_session.run(None, {input_name: blob}) + + # det_10g outputs: [scores_s8, scores_s16, scores_s32, + # bbox_s8, bbox_s16, bbox_s32, + # kps_s8, kps_s16, kps_s32] + strides = [8, 16, 32] + all_bboxes = [] + all_scores = [] + + for i, stride in enumerate(strides): + scores = outputs[i].flatten() + bboxes = outputs[i + 3].reshape(-1, 4) + + # 生成 anchor 中心 + feat_h = input_h // stride + feat_w = input_w // stride + anchors = [] + for y in range(feat_h): + for x in range(feat_w): + cx, cy = x * stride, y * stride + anchors.append([cx, cy]) + anchors.append([cx, cy]) # 2 anchors per cell + anchors = np.array(anchors, dtype=np.float32) + + # 置信度过滤 + mask = scores > DET_SCORE_THRESH + if not mask.any(): + continue + + f_scores = scores[mask] + f_bboxes = bboxes[mask] + f_anchors = anchors[mask] + + # Decode: distance * stride → xyxy + decoded = np.empty_like(f_bboxes) + decoded[:, 0] = f_anchors[:, 0] - f_bboxes[:, 0] * stride + decoded[:, 1] = f_anchors[:, 1] - f_bboxes[:, 1] * stride + decoded[:, 2] = f_anchors[:, 0] + f_bboxes[:, 2] * stride + decoded[:, 3] = f_anchors[:, 1] + f_bboxes[:, 3] * stride + + # 缩放回原始图像坐标 + decoded /= ratio + + all_bboxes.append(decoded) + all_scores.append(f_scores) + + if not all_bboxes: + return [] + + bboxes_cat = np.concatenate(all_bboxes) + scores_cat = np.concatenate(all_scores) + 
+ # NMS + keep = self._nms(bboxes_cat, scores_cat, NMS_IOU_THRESH) + + # 尺寸 + 宽高比过滤 + results = [] + for idx in keep: + bbox = bboxes_cat[idx] + score = float(scores_cat[idx]) + bw = bbox[2] - bbox[0] + bh = bbox[3] - bbox[1] + if bw < MIN_FACE_WIDTH or bh < MIN_FACE_WIDTH: + continue + aspect = bw / max(bh, 1) + if aspect < FACE_ASPECT_MIN or aspect > FACE_ASPECT_MAX: + continue + results.append((bbox.copy(), score)) + + results.sort(key=lambda x: (x[0][2] - x[0][0]) * (x[0][3] - x[0][1]), reverse=True) + return results + + @staticmethod + def _nms(bboxes: np.ndarray, scores: np.ndarray, threshold: float) -> List[int]: + x1 = bboxes[:, 0] + y1 = bboxes[:, 1] + x2 = bboxes[:, 2] + y2 = bboxes[:, 3] + areas = (x2 - x1) * (y2 - y1) + order = scores.argsort()[::-1] + keep = [] + while order.size > 0: + i = order[0] + keep.append(int(i)) + if order.size == 1: + break + xx1 = np.maximum(x1[i], x1[order[1:]]) + yy1 = np.maximum(y1[i], y1[order[1:]]) + xx2 = np.minimum(x2[i], x2[order[1:]]) + yy2 = np.minimum(y2[i], y2[order[1:]]) + inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1) + iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-6) + inds = np.where(iou <= threshold)[0] + order = order[inds + 1] + return keep + + # ================================================================ + # 视频工具 + # ================================================================ + + @staticmethod + def _get_video_info(video_path: str) -> Optional[Tuple[int, int, int, float]]: + """返回 (width, height, frame_count, fps)""" + try: + import json as _json + cmd = [ + "ffprobe", "-v", "error", + "-select_streams", "v:0", + "-show_entries", "stream=width,height,nb_frames,r_frame_rate,avg_frame_rate", + "-of", "json", + video_path, + ] + r = subprocess.run(cmd, capture_output=True, text=True, timeout=10) + if r.returncode != 0: + return None + info = _json.loads(r.stdout) + streams = info.get("streams") + if not streams: + return None + stream = streams[0] + w, h = 
int(stream["width"]), int(stream["height"]) + # nb_frames 可能为 "N/A" 或缺失 + nb_raw = stream.get("nb_frames", "N/A") + nb = int(nb_raw) if nb_raw not in ("N/A", "") else 0 + + def _parse_fps(s: str) -> float: + if "/" in s: + num, den = s.split("/") + return float(num) / float(den) if float(den) != 0 else 0.0 + return float(s) if s else 0.0 + + # 优先 avg_frame_rate(真实平均帧率),r_frame_rate 可能是 timebase 倍数 + avg_fps = _parse_fps(stream.get("avg_frame_rate", "0/0")) + r_fps = _parse_fps(stream.get("r_frame_rate", "25/1")) + fps = avg_fps if avg_fps > 0 else (r_fps if r_fps > 0 else 25.0) + + if nb == 0: + cmd2 = [ + "ffprobe", "-v", "error", + "-show_entries", "format=duration", + "-of", "default=noprint_wrappers=1:nokey=1", + video_path, + ] + r2 = subprocess.run(cmd2, capture_output=True, text=True, timeout=10) + if r2.returncode == 0 and r2.stdout.strip(): + nb = int(float(r2.stdout.strip()) * fps) + return w, h, nb, fps + except Exception as e: + logger.warning(f"⚠️ 获取视频信息失败: {e}") + return None + + @staticmethod + def _open_video_reader(video_path: str, w: int, h: int, + seek_sec: float = 0, duration_sec: float = 0): + """打开 ffmpeg rawvideo 读取管道""" + cmd = ["ffmpeg"] + if seek_sec > 0: + cmd += ["-ss", f"{seek_sec:.3f}"] + cmd += ["-i", video_path] + if duration_sec > 0: + cmd += ["-t", f"{duration_sec:.3f}"] + cmd += ["-f", "rawvideo", "-pix_fmt", "bgr24", "-v", "quiet", "-"] + return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL) + + @staticmethod + def _read_one_frame(proc, w: int, h: int) -> Optional[np.ndarray]: + raw = proc.stdout.read(w * h * 3) + if len(raw) < w * h * 3: + return None + return np.frombuffer(raw, dtype=np.uint8).reshape(h, w, 3).copy() + + @staticmethod + def _open_video_writer(output_path: str, w: int, h: int, + fps: int = ENCODE_FPS, crf: int = ENCODE_CRF): + """打开 ffmpeg rawvideo 写入管道""" + cmd = [ + "ffmpeg", "-y", + "-f", "rawvideo", "-pix_fmt", "bgr24", + "-s", f"{w}x{h}", "-r", str(fps), "-i", "-", + "-c:v", 
"libx264", "-crf", str(crf), + "-preset", "fast", "-pix_fmt", "yuv420p", + output_path, + ] + return subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL) + + # ================================================================ + # Phase 2: 人脸尺寸检测 + # ================================================================ + + def _detect_face_size(self, video_path: str) -> Optional[float]: + """ + 从视频 10%~30% 区间均匀采样,检测最大脸宽度中位数。 + 返回 None 表示未检测到人脸或检测器不可用。 + """ + if not self._ensure_detector(): + return None + + info = self._get_video_info(video_path) + if info is None: + return None + w, h, nb_frames, fps = info + if nb_frames < 1 or fps <= 0: + return None + + # 计算采样区间 + start_frame = int(nb_frames * SAMPLE_WINDOW[0]) + end_frame = int(nb_frames * SAMPLE_WINDOW[1]) + end_frame = max(end_frame, start_frame + 1) + n_sample = min(SAMPLE_FRAMES, end_frame - start_frame) + if n_sample <= 0: + return None + + step = max(1, (end_frame - start_frame) // n_sample) + sample_indices = set(range(start_frame, end_frame, step)) + + # 用 ffmpeg seek 定位到采样起点 + seek_sec = start_frame / fps + duration_sec = (end_frame - start_frame) / fps + 0.5 # 余量 + + proc = self._open_video_reader(video_path, w, h, seek_sec, duration_sec) + face_widths = [] + try: + for local_idx in range(end_frame - start_frame + 1): + frame = self._read_one_frame(proc, w, h) + if frame is None: + break + global_idx = start_frame + local_idx + if global_idx not in sample_indices: + continue + faces = self._detect_faces(frame) + if faces: + bbox = faces[0][0] # 最大脸 + face_widths.append(float(bbox[2] - bbox[0])) + finally: + proc.stdout.close() + proc.terminate() + proc.wait() + + if not face_widths: + return None + + face_widths.sort() + mid = len(face_widths) // 2 + if len(face_widths) % 2 == 0: + return (face_widths[mid - 1] + face_widths[mid]) / 2 + return face_widths[mid] + + # ================================================================ + # Phase 3: 裁切 + 轨迹 + # 
================================================================ + + def _build_face_track(self, video_path: str, + w: int, h: int, nb_frames: int) -> Optional[FaceTrack]: + """ + 逐帧人脸追踪:每 DETECT_EVERY 帧检测,中间帧 EMA 插值。 + 返回 FaceTrack 或 None(检测失败)。 + """ + if not self._ensure_detector(): + return None + + detect_set = set(range(0, nb_frames, DETECT_EVERY)) + + # 第一遍:检测帧 + proc = self._open_video_reader(video_path, w, h) + keyframe_bboxes = {} + actual_frames = 0 + try: + for idx in range(nb_frames): + frame = self._read_one_frame(proc, w, h) + if frame is None: + break + actual_frames = idx + 1 + if idx not in detect_set: + continue + faces = self._detect_faces(frame) + if faces: + keyframe_bboxes[idx] = faces[0][0].copy() + finally: + proc.stdout.close() + proc.terminate() + proc.wait() + + if not keyframe_bboxes: + return None + + # 用实际读到的帧数,避免 _get_video_info 估算偏差 + nb_frames = actual_frames + + # 前向填充 + EMA 平滑 + sorted_keys = sorted(keyframe_bboxes.keys()) + raw_bboxes: List[np.ndarray] = [None] * nb_frames # type: ignore + + for k in sorted_keys: + raw_bboxes[k] = keyframe_bboxes[k] + + prev = keyframe_bboxes[sorted_keys[0]] + for i in range(nb_frames): + if raw_bboxes[i] is not None: + prev = raw_bboxes[i] + else: + raw_bboxes[i] = prev.copy() + + # EMA 平滑 + smoothed = [raw_bboxes[0].copy()] + for i in range(1, nb_frames): + s = EMA_ALPHA * raw_bboxes[i] + (1 - EMA_ALPHA) * smoothed[-1] + smoothed.append(s) + + # 带 padding 的 crop box(clamp 到帧边界) + crop_boxes = [] + for bbox in smoothed: + x1, y1, x2, y2 = bbox + bw, bh = x2 - x1, y2 - y1 + pad_w, pad_h = bw * PADDING, bh * PADDING + cx1 = max(0, int(x1 - pad_w)) + cy1 = max(0, int(y1 - pad_h)) + cx2 = min(w, int(x2 + pad_w)) + cy2 = min(h, int(y2 + pad_h)) + crop_boxes.append((cx1, cy1, cx2, cy2)) + + # 中位数脸宽 + widths = sorted(float(b[2] - b[0]) for b in smoothed) + median_w = widths[len(widths) // 2] + + return FaceTrack( + crop_boxes=crop_boxes, + face_width_median=median_w, + frame_count=nb_frames, + 
frame_w=w, + frame_h=h, + ) + + # ================================================================ + # Phase 3: 超分 + # ================================================================ + + def _ensure_upscaler(self, upscaler: str, gpu_id: int) -> bool: + """懒加载超分模型""" + if self._sr_model is not None and self._sr_type == upscaler: + return True + try: + import sys + import torch + + # torchvision >= 0.20 移除了 functional_tensor,但 basicsr 仍引用 + if "torchvision.transforms.functional_tensor" not in sys.modules: + try: + import torchvision.transforms.functional as _F + sys.modules["torchvision.transforms.functional_tensor"] = _F + except ImportError: + pass + + device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu") + + if upscaler == "gfpgan": + from gfpgan import GFPGANer + model_path = _PROJECT_ROOT / "models" / "FaceEnhance" / "GFPGANv1.4.pth" + if not model_path.exists(): + logger.warning(f"⚠️ GFPGAN 权重不存在: {model_path}") + return False + self._sr_model = GFPGANer( + model_path=str(model_path), + upscale=2, + arch="clean", + channel_multiplier=2, + bg_upsampler=None, + device=device, + ) + elif upscaler == "codeformer": + from basicsr.archs.codeformer_arch import CodeFormer as CodeFormerArch + model_path = _PROJECT_ROOT / "models" / "FaceEnhance" / "codeformer.pth" + if not model_path.exists(): + logger.warning(f"⚠️ CodeFormer 权重不存在: {model_path}") + # 尝试回退 gfpgan + return self._ensure_upscaler("gfpgan", gpu_id) + net = CodeFormerArch( + dim_embd=512, codebook_size=1024, n_head=8, n_layers=9, + connect_list=["32", "64", "128", "256"], + ).to(device) + ckpt = torch.load(str(model_path), map_location=device, weights_only=False) + net.load_state_dict(ckpt.get("params_ema", ckpt.get("params", ckpt))) + net.eval() + self._sr_model = net + self._sr_device = device + else: + logger.warning(f"⚠️ 未知超分器: {upscaler}") + return False + + self._sr_type = upscaler + logger.info(f"✅ 超分器已加载: {upscaler}") + return True + except Exception as e: + 
logger.warning(f"⚠️ 超分器初始化失败 ({upscaler}): {e}")
+            return False
+
+    def _upscale_face(self, face_img: np.ndarray, target_size: int) -> np.ndarray:
+        """用已加载的超分模型增强单帧,失败回退 bicubic(GFPGAN 需 paste_back=True 才返回整图)"""
+        try:
+            if self._sr_type == "gfpgan":
+                _, _, output = self._sr_model.enhance(
+                    face_img, paste_back=True, has_aligned=False,
+                )
+                if output is not None:
+                    return cv2.resize(
+                        output, (target_size, target_size),
+                        interpolation=cv2.INTER_LANCZOS4,
+                    )
+            elif self._sr_type == "codeformer":
+                import torch
+                img = cv2.resize(face_img, (512, 512))
+                img_t = (
+                    torch.from_numpy(img.astype(np.float32) / 255.0)
+                    .permute(2, 0, 1)
+                    .unsqueeze(0)
+                    .to(self._sr_device)
+                )
+                with torch.no_grad():
+                    out = self._sr_model(img_t, w=0.7)[0]
+                out_np = (
+                    out.squeeze().permute(1, 2, 0).cpu().numpy() * 255
+                ).clip(0, 255).astype(np.uint8)
+                return cv2.resize(
+                    out_np, (target_size, target_size),
+                    interpolation=cv2.INTER_LANCZOS4,
+                )
+        except Exception as e:
+            logger.debug(f"超分失败,回退 bicubic: {e}")
+
+        return cv2.resize(
+            face_img, (target_size, target_size),
+            interpolation=cv2.INTER_CUBIC,
+        )
+
+    # ================================================================
+    # Phase 3: 裁切 + 超分 → 增强视频
+    # ================================================================
+
+    def _crop_and_upscale_video(
+        self,
+        video_path: str,
+        track: FaceTrack,
+        tmpdir: Path,
+        gpu_id: int,
+        source_fps: float,
+    ) -> str:
+        """
+        裁切人脸区域 → 稀疏关键帧超分 → 输出 TARGET_SIZE 视频。
+        流式处理,不占满内存。
+        """
+        output_path = str(tmpdir / "enhanced_face.mp4")
+        w, h = track.frame_w, track.frame_h
+
+        upscaler = settings.LIPSYNC_SMALL_FACE_UPSCALER
+        sr_available = self._ensure_upscaler(upscaler, gpu_id)
+        detect_set = set(range(0, track.frame_count, DETECT_EVERY))
+
+        reader = self._open_video_reader(video_path, w, h)
+        out_fps = max(1, int(round(source_fps))) if source_fps > 0 else ENCODE_FPS
+        writer = self._open_video_writer(output_path, TARGET_SIZE, TARGET_SIZE, fps=out_fps)
+
+        try:
+            for idx in range(track.frame_count):
+                frame 
= self._read_one_frame(reader, w, h) + if frame is None: + break + + cx1, cy1, cx2, cy2 = track.crop_boxes[idx] + cropped = frame[cy1:cy2, cx1:cx2] + + if sr_available and idx in detect_set: + enhanced = self._upscale_face(cropped, TARGET_SIZE) + else: + enhanced = cv2.resize( + cropped, (TARGET_SIZE, TARGET_SIZE), + interpolation=cv2.INTER_CUBIC, + ) + + writer.stdin.write(enhanced.tobytes()) + finally: + reader.stdout.close() + reader.terminate() + reader.wait() + writer.stdin.close() + writer.wait() + + if not Path(output_path).exists(): + raise RuntimeError("增强视频写入失败") + + return output_path + + # ================================================================ + # Phase 3: 贴回 + # ================================================================ + + def blend_back( + self, + original_video: str, + lipsync_video: str, + track: FaceTrack, + tmpdir, + ) -> str: + """ + 将 lipsync 推理结果贴回原视频。 + 下半脸 mask + 高斯羽化 + seamlessClone。 + """ + tmpdir = Path(tmpdir) + output_path = str(tmpdir / "blended_output.mp4") + w, h = track.frame_w, track.frame_h + + # 获取 lipsync 视频尺寸 + ls_info = self._get_video_info(lipsync_video) + if ls_info is None: + raise RuntimeError("无法读取 lipsync 视频信息") + ls_w, ls_h, ls_frames, ls_fps = ls_info + + if ls_fps <= 0: + ls_fps = ENCODE_FPS + + # 帧数保护:lipsync 模型按音频时长输出,帧数通常 <= 原始(looped)视频 + if ls_frames <= 0: + raise RuntimeError(f"lipsync 输出帧数为 {ls_frames},跳过贴回") + if ls_frames > track.frame_count: + raise RuntimeError( + f"帧数异常: lipsync={ls_frames} > original={track.frame_count}" + ) + blend_count = ls_frames + + orig_info = self._get_video_info(original_video) + orig_fps = orig_info[3] if orig_info is not None else 0.0 + if orig_fps <= 0: + orig_fps = ls_fps + + orig_reader = self._open_video_reader(original_video, w, h) + ls_reader = self._open_video_reader(lipsync_video, ls_w, ls_h) + writer = self._open_video_writer( + output_path, + w, + h, + fps=max(1, int(round(ls_fps))), + ) + + current_orig_idx = -1 + current_orig_frame = None + + try: + 
for idx in range(blend_count): + target_orig_idx = min( + track.frame_count - 1, + int(round((idx / ls_fps) * orig_fps)), + ) + + while current_orig_idx < target_orig_idx: + frame = self._read_one_frame(orig_reader, w, h) + if frame is None: + current_orig_frame = None + break + current_orig_idx += 1 + current_orig_frame = frame + + orig_frame = current_orig_frame + ls_frame = self._read_one_frame(ls_reader, ls_w, ls_h) + if orig_frame is None or ls_frame is None: + break + + cx1, cy1, cx2, cy2 = track.crop_boxes[target_orig_idx] + crop_w, crop_h = cx2 - cx1, cy2 - cy1 + + # 将 lipsync 输出 resize 到裁切区域尺寸 + ls_resized = cv2.resize( + ls_frame, (crop_w, crop_h), + interpolation=cv2.INTER_LANCZOS4, + ) + + # 嘴部局部 mask(尽量仅覆盖嘴唇与下巴区域,避免鼻子/眼周被改动) + mask = np.zeros((crop_h, crop_w), dtype=np.uint8) + upper = int(crop_h * MASK_UPPER_RATIO) + left = int(crop_w * MASK_SIDE_MARGIN) + right = int(crop_w * (1.0 - MASK_SIDE_MARGIN)) + if right - left < 8: + left, right = 0, crop_w + + mask[upper:, left:right] = 255 + + # 中央椭圆增强口型区域权重 + ellipse_center = (crop_w // 2, int(crop_h * 0.82)) + ellipse_axes = (max(8, int(crop_w * 0.22)), max(8, int(crop_h * 0.13))) + cv2.ellipse(mask, ellipse_center, ellipse_axes, 0, 0, 360, 255, -1) + mask = cv2.GaussianBlur(mask, (0, 0), MASK_FEATHER) + + # 融合 + blended = self._blend_face_region( + orig_frame, ls_resized, mask, cx1, cy1, cx2, cy2, + ) + writer.stdin.write(blended.tobytes()) + finally: + for p in (orig_reader, ls_reader): + p.stdout.close() + p.terminate() + p.wait() + writer.stdin.close() + writer.wait() + + if not Path(output_path).exists(): + raise RuntimeError("融合视频写入失败") + return output_path + + @staticmethod + def _blend_face_region( + orig: np.ndarray, + face: np.ndarray, + mask: np.ndarray, + x1: int, y1: int, x2: int, y2: int, + ) -> np.ndarray: + """seamlessClone 贴回,失败回退 alpha 混合""" + result = orig.copy() + crop_h, crop_w = face.shape[:2] + + # 尝试 seamlessClone + try: + center_x = (x1 + x2) // 2 + center_y = int(y1 + (y2 - y1) 
* 0.7) + center_x = max(1, min(center_x, orig.shape[1] - 2)) + center_y = max(1, min(center_y, orig.shape[0] - 2)) + + src = np.zeros_like(orig) + src[y1:y2, x1:x2] = face + + full_mask = np.zeros(orig.shape[:2], dtype=np.uint8) + full_mask[y1:y2, x1:x2] = mask + + if full_mask.max() > 0: + cloned = cv2.seamlessClone( + src, orig, full_mask, (center_x, center_y), cv2.NORMAL_CLONE, + ) + + # 限制融合影响范围到 mask 区域,避免 Poisson 扩散导致眼部上方重影 + alpha = mask.astype(np.float32) / 255.0 + alpha_3ch = np.stack([alpha] * 3, axis=-1) + roi_orig = orig[y1:y2, x1:x2].astype(np.float32) + roi_clone = cloned[y1:y2, x1:x2].astype(np.float32) + blended_roi = roi_orig * (1 - alpha_3ch) + roi_clone * alpha_3ch + + result = orig.copy() + result[y1:y2, x1:x2] = blended_roi.astype(np.uint8) + return result + except Exception: + pass + + # Fallback: alpha 混合 + alpha = mask.astype(np.float32) / 255.0 + alpha_3ch = np.stack([alpha] * 3, axis=-1) + crop_region = result[y1:y2, x1:x2].astype(np.float32) + blended = crop_region * (1 - alpha_3ch) + face.astype(np.float32) * alpha_3ch + result[y1:y2, x1:x2] = blended.astype(np.uint8) + return result + + # ================================================================ + # 主入口 + # ================================================================ + + def enhance_if_needed( + self, + video_path: str, + tmpdir, + gpu_id: int, + ) -> EnhanceResult: + """ + 主入口:检测小脸 → 裁切 + 超分 → 返回增强结果。 + 如不需要增强,返回 was_enhanced=False。 + """ + if not settings.LIPSYNC_SMALL_FACE_ENHANCE: + return EnhanceResult(video_path=video_path, was_enhanced=False) + + if not _CV2_AVAILABLE: + logger.warning("⚠️ opencv/numpy 未安装,小脸增强不可用") + return EnhanceResult(video_path=video_path, was_enhanced=False) + + start = time.time() + tmpdir = Path(tmpdir) + face_dir = tmpdir / "face_enhance" + face_dir.mkdir(exist_ok=True) + + # ── 检测 ── + face_width = self._detect_face_size(video_path) + if face_width is None: + logger.info("小脸增强: 未检测到人脸,跳过") + return EnhanceResult(video_path=video_path, 
was_enhanced=False) + + threshold = settings.LIPSYNC_SMALL_FACE_THRESHOLD + if face_width >= threshold: + logger.info( + f"小脸增强: face_w={face_width:.0f}px >= threshold={threshold}px, 跳过" + ) + return EnhanceResult( + video_path=video_path, was_enhanced=False, face_width=face_width, + ) + + logger.info( + f"小脸增强: face_w={face_width:.0f}px < threshold={threshold}px, 触发增强" + ) + + # ── 构建追踪 ── + info = self._get_video_info(video_path) + if info is None: + raise RuntimeError("无法读取视频信息") + w, h, nb_frames, fps = info + + track = self._build_face_track(video_path, w, h, nb_frames) + if track is None: + raise RuntimeError("人脸追踪失败") + + # ── 裁切 + 超分 ── + enhanced_path = self._crop_and_upscale_video( + video_path, + track, + face_dir, + gpu_id, + source_fps=fps, + ) + + # 清理 GPU 缓存 + try: + import torch + if torch.cuda.is_available(): + torch.cuda.empty_cache() + except ImportError: + pass + + elapsed = time.time() - start + logger.info( + f"小脸增强: face_w={face_width:.0f}px threshold={threshold}px " + f"enhanced=True upscaler={settings.LIPSYNC_SMALL_FACE_UPSCALER} " + f"time={elapsed:.1f}s" + ) + + return EnhanceResult( + video_path=enhanced_path, + was_enhanced=True, + track=track, + face_width=face_width, + ) diff --git a/backend/requirements.txt b/backend/requirements.txt index af51f4a..1cc6afe 100644 --- a/backend/requirements.txt +++ b/backend/requirements.txt @@ -38,3 +38,7 @@ faster-whisper>=1.0.0 # 文案提取与AI生成 yt-dlp>=2023.0.0 zai-sdk>=0.2.0 + +# 小脸增强 +opencv-python-headless>=4.8.0 +gfpgan>=1.3.8
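For reference, the forward-fill → EMA smoothing → padded-clamp pass that `_build_face_track` applies to sparse keyframe detections can be sketched in isolation. This is an illustrative standalone version, not part of the diff: the function name `smooth_crop_boxes` is hypothetical, and its `alpha`/`padding` defaults mirror the `EMA_ALPHA` and `PADDING` constants above.

```python
import numpy as np

def smooth_crop_boxes(keyframe_bboxes, n_frames, frame_w, frame_h,
                      alpha=0.3, padding=0.28):
    """Sketch of the track-smoothing pass in _build_face_track.

    keyframe_bboxes: {frame_idx: [x1, y1, x2, y2]} from sparse detection
    (one detection every DETECT_EVERY frames). Returns one padded,
    frame-clamped crop box per frame.
    """
    first = min(keyframe_bboxes)
    prev = np.asarray(keyframe_bboxes[first], dtype=np.float64)
    filled = []
    for i in range(n_frames):                      # forward-fill the gaps
        if i in keyframe_bboxes:
            prev = np.asarray(keyframe_bboxes[i], dtype=np.float64)
        filled.append(prev)

    smoothed = [filled[0].copy()]
    for i in range(1, n_frames):                   # EMA: s_i = a*x_i + (1-a)*s_{i-1}
        smoothed.append(alpha * filled[i] + (1 - alpha) * smoothed[-1])

    crop_boxes = []
    for x1, y1, x2, y2 in smoothed:                # pad, then clamp to frame bounds
        pw, ph = (x2 - x1) * padding, (y2 - y1) * padding
        crop_boxes.append((max(0, int(x1 - pw)), max(0, int(y1 - ph)),
                           min(frame_w, int(x2 + pw)), min(frame_h, int(y2 + ph))))
    return crop_boxes
```

The EMA pass is what keeps the crop window from snapping on each keyframe detection: between detections the box holds its last smoothed position, and a new detection pulls it over a few frames instead of jumping, which avoids visible jitter after `blend_back`.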