Compare commits


1 Commit

Author SHA1 Message Date
Kevin Wong
0939d81e9f Update 2026-03-10 10:59:38 +08:00
12 changed files with 1841 additions and 205 deletions


@@ -194,6 +194,15 @@ backend/user_data/{user_uuid}/cookies/
- `MUSETALK_USE_FLOAT16` (half precision, default true)
- `LIPSYNC_DURATION_THRESHOLD` (seconds; audio >= this value routes to MuseTalk. Code default 120; this repo's current `.env` sets 100)
### Small-face lip-sync quality compensation (local lip-sync path)
- `LIPSYNC_SMALL_FACE_ENHANCE` (master switch, default false)
- `LIPSYNC_SMALL_FACE_THRESHOLD` (trigger threshold, default 256)
- `LIPSYNC_SMALL_FACE_UPSCALER` (`gfpgan` / `codeformer`)
- `LIPSYNC_SMALL_FACE_GPU_ID` (super-resolution GPU, default 0)
- `LIPSYNC_SMALL_FACE_FAIL_OPEN` (fall back on failure, default true)
> See `Docs/FACEENHANCE_DEPLOY.md` for deployment and verification details.
### WeChat Channels
- `WEIXIN_HEADLESS_MODE` (headful/headless-new)
- `WEIXIN_CHROME_PATH` / `WEIXIN_BROWSER_CHANNEL`


@@ -126,9 +126,11 @@ backend/
> - `analysis_id` is bound to `user_id` and carries a TTL (default 20 minutes); it lets the later "generate copy" stage read the title context safely.
10. **Health checks**
* `GET /api/videos/lipsync/health`: lip-sync service health (includes LatentSync + MuseTalk + hybrid routing threshold)
* `GET /api/videos/lipsync/health`: lip-sync service health (includes LatentSync + MuseTalk + hybrid routing threshold + `data.small_face_enhance`)
* `GET /api/videos/voiceclone/health`: CosyVoice 3.0 service health
> Health fields for the small-face compensation chain: `data.small_face_enhance.enabled` (master switch), `threshold` (trigger threshold), `detector_loaded` (whether SCRFD has been lazy-loaded).
11. **Payment**
* `POST /api/payment/create-order`: create an Alipay desktop-website payment order (requires payment_token)
* `POST /api/payment/notify`: Alipay async notification callback (returns plain-text success/fail)
@@ -260,6 +262,13 @@ MUSETALK_API_URL=http://localhost:8011
MUSETALK_BATCH_SIZE=32
LIPSYNC_DURATION_THRESHOLD=100
# Small-face lip-sync quality compensation (off by default; enable via staged rollout)
LIPSYNC_SMALL_FACE_ENHANCE=false
LIPSYNC_SMALL_FACE_THRESHOLD=256
LIPSYNC_SMALL_FACE_UPSCALER=gfpgan
LIPSYNC_SMALL_FACE_GPU_ID=0
LIPSYNC_SMALL_FACE_FAIL_OPEN=true
# MuseTalk tunables (example)
MUSETALK_DETECT_EVERY=2
MUSETALK_BLEND_CACHE_EVERY=2
@@ -267,6 +276,8 @@ MUSETALK_ENCODE_CRF=14
MUSETALK_ENCODE_PRESET=slow
```
> Deployment, weights, and rollback notes for the small-face compensation chain are in `Docs/FACEENHANCE_DEPLOY.md` (only the local `_local_generate()` path is wired up; remote mode is not yet covered).
### 4. Start the service
**Development mode (hot reload)**:


@@ -99,8 +99,11 @@ python -m scripts.server # smoke-test startup (Ctrl+C to exit)
> MuseTalk is a single-step latent-space inpainting model (not a diffusion model); inference runs near real time, which suits long videos at or above the routing threshold (>=100s in this repo's current `.env` example). It shares GPU0 with CosyVoice; fp16 inference needs roughly 4-8GB of VRAM. The compositing stage now encodes directly through an FFmpeg rawvideo pipe (`libx264` with configurable CRF/preset) while keeping numpy blending, removing lossy intermediate files.
See the dedicated deployment guide:
**[MuseTalk Deployment Guide](MUSETALK_DEPLOY.md)**
Small-face lip-sync quality compensation (optional) deployment and verification:
**[Small-Face Compensation Deployment Guide](FACEENHANCE_DEPLOY.md)**
Quick steps:
1. Create a dedicated `musetalk` Conda environment (Python 3.10 + PyTorch 2.0.1 + CUDA 11.8)
@@ -248,9 +251,14 @@ cd /home/rongye/ProgramFiles/ViGent2/backend
| `MUSETALK_GPU_ID` | 0 | MuseTalk GPU index |
| `MUSETALK_API_URL` | `http://localhost:8011` | MuseTalk resident-service address |
| `MUSETALK_BATCH_SIZE` | 32 | MuseTalk inference batch size |
| `MUSETALK_VERSION` | v15 | MuseTalk model version |
| `MUSETALK_USE_FLOAT16` | true | MuseTalk half-precision acceleration |
| `LIPSYNC_DURATION_THRESHOLD` | 100 | Seconds; >= this value uses MuseTalk, below it uses LatentSync (code default 120; set it explicitly in `.env`) |
| `LIPSYNC_SMALL_FACE_ENHANCE` | false | Master switch for small-face compensation (keep off until verified via staged rollout) |
| `LIPSYNC_SMALL_FACE_THRESHOLD` | 256 | Small-face trigger threshold (pixels) |
| `LIPSYNC_SMALL_FACE_UPSCALER` | gfpgan | Super-resolution model (`gfpgan` / `codeformer`) |
| `LIPSYNC_SMALL_FACE_GPU_ID` | 0 | GPU for compensation super-resolution (same card as MuseTalk recommended) |
| `LIPSYNC_SMALL_FACE_FAIL_OPEN` | true | Whether to fall back to the original pipeline when the compensation chain fails |
| `ALIPAY_APP_ID` | empty | Alipay app APPID |
| `ALIPAY_PRIVATE_KEY_PATH` | empty | App private-key PEM path |
| `ALIPAY_PUBLIC_KEY_PATH` | empty | Alipay public-key PEM path |

Docs/DevLogs/Day35.md Normal file

@@ -0,0 +1,165 @@
## Small-Face LipSync Compensation Rollout + Deployment Verification (Day 35)
### Overview
Completed the backend rollout and deployment wrap-up of Small-Face LipSync Compensation. The core goal: without changing the semantics of the user's model choice (`default/fast/advanced`), add a quality-compensation chain for wide-shot small-face footage (detect -> crop -> sparse super-resolution -> model inference -> blend back), keep it off by default, fail open on errors, and stay quickly revertible in production.
---
## ✅ 1) Backend capabilities
### 1.1 Config and switches
Five new config keys (conservative defaults):
- `LIPSYNC_SMALL_FACE_ENHANCE` (default `false`)
- `LIPSYNC_SMALL_FACE_THRESHOLD` (default `256`)
- `LIPSYNC_SMALL_FACE_UPSCALER` (`gfpgan | codeformer`)
- `LIPSYNC_SMALL_FACE_GPU_ID` (default `0`)
- `LIPSYNC_SMALL_FACE_FAIL_OPEN` (default `true`)
Code entry points: `backend/app/core/config.py` and `backend/.env`.
### 1.2 New small-face enhance service
Added `backend/app/services/small_face_enhance_service.py`, implementing the full compensation chain:
1. **Small-face decision** (CPU)
- SCRFD (`det_10g.onnx`, reusing the LatentSync weights)
- Uniformly sample 24 frames from the 10%-30% window of the video
- Trigger by comparing the median of the maximum face width against the threshold
2. **Crop and track** (CPU)
- Detect every 8th frame; forward-fill the rest + EMA smoothing (see the sketch after this list)
- Expand the bbox by `padding=0.28`
3. **Sparse super-resolution** (GPU0)
- Detection frames go through GFPGAN/CodeFormer
- Non-detection frames use bicubic resize
- Target size `512x512`
4. **Blend back** (CPU)
- Local mouth mask (start at 68% height + 16% side margins) + Gaussian feathering (15px)
- `cv2.seamlessClone`, falling back to alpha blending on failure
5. **Frame-count guard**
- Verify `lipsync_frames <= original_frames` before blending back
- Raise only when `lipsync_frames > original_frames`; otherwise blend back normally using the lipsync frame count
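A condensed sketch of the forward-fill + EMA step from `_build_face_track()` (the standalone helper shape and example values are illustrative):

```python
import numpy as np

EMA_ALPHA = 0.3

def smooth_track(keyframe_bboxes: dict, nb_frames: int) -> list:
    raw = [None] * nb_frames
    for k, box in keyframe_bboxes.items():
        raw[k] = box
    prev = keyframe_bboxes[min(keyframe_bboxes)]
    for i in range(nb_frames):              # forward-fill non-detection frames
        if raw[i] is None:
            raw[i] = prev.copy()
        prev = raw[i]
    smoothed = [raw[0].copy()]
    for i in range(1, nb_frames):           # EMA smoothing over the filled track
        smoothed.append(EMA_ALPHA * raw[i] + (1 - EMA_ALPHA) * smoothed[-1])
    return smoothed

boxes = smooth_track({0: np.array([400.0, 300.0, 520.0, 460.0])}, nb_frames=16)
```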
---
## ✅ 2) LipSyncService integration
Key changes in `backend/app/services/lipsync_service.py`:
- Inside `_local_generate()`, run in order:
- `video looping` -> `small face enhance` -> `model infer` -> `blend back`
- Extracted `_run_selected_model()` to unify model routing (MuseTalk / LatentSync server / LatentSync subprocess)
- The small-face branch is wrapped end-to-end in `try/except`, governed by `LIPSYNC_SMALL_FACE_FAIL_OPEN` (see the sketch below)
- `check_health()` gained a `small_face_enhance` status field
Semantics preserved:
- Frontend and API contract unchanged
- The user's model choice wins; a small face never forces a model switch
- Only the local path (`_local_generate`) is wired up; the remote path is not yet
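The fail-open guard, condensed from the `_local_generate()` integration (the full version appears in this commit's `lipsync_service.py` diff below):

```python
# The enhance branch can never take down the task: on any failure it either
# logs and skips (fail-open) or re-raises, per LIPSYNC_SMALL_FACE_FAIL_OPEN.
enhance_result = None
try:
    enhance_result = self._face_enhance.enhance_if_needed(
        video_path=str(actual_video_path),
        tmpdir=tmpdir,
        gpu_id=settings.LIPSYNC_SMALL_FACE_GPU_ID,
    )
except Exception as e:
    if settings.LIPSYNC_SMALL_FACE_FAIL_OPEN:
        logger.warning(f"Small-face enhance failed, skipping: {e}")
    else:
        raise
# enhance_result is None or was_enhanced=False → regular model path
```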
---
## ✅ 3) Dependencies and weights
### 3.1 Dependencies
Added to `backend/requirements.txt`:
- `opencv-python-headless>=4.8.0`
- `gfpgan>=1.3.8`
### 3.2 Weights
- `models/FaceEnhance/GFPGANv1.4.pth` (new directory and weights)
- `models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx` (reused)
---
## ✅ 4) Stability fixes (post-deploy patches)
Nine follow-up fixes addressing dependency compatibility, frame-count estimation drift, blend-back false positives, and output-quality issues found during real deployment:
1. **Lazy loading + guards**
- `cv2/numpy` imports moved into `try/except`
- The enhance entry point is guarded by `_CV2_AVAILABLE`
- Missing dependencies skip enhancement without affecting the main pipeline
2. **Type annotations and a torchvision compatibility patch**
- Added `from __future__ import annotations` so `np.ndarray` annotations can't fail at import time when dependencies are missing
- `_ensure_upscaler()` injects
`sys.modules['torchvision.transforms.functional_tensor']`
to keep `torchvision>=0.20` compatible with the legacy `gfpgan/basicsr` reference
3. **ffprobe fps and frame-count estimation fix** (see the sketch after this list)
- `_get_video_info()` switched from `csv` to `json` field access, so a missing `nb_frames` can no longer shift fields
- fps now prefers `avg_frame_rate`, with `r_frame_rate` only as a fallback
4. **Track frame count and blend-back check fix**
- `_build_face_track()` records the frame count ffmpeg actually read, overriding the estimated `nb_frames`
- `blend_back()` relaxed the check: `lipsync <= original` blends normally; only `>` raises
5. **Empty-output guard**
- `blend_back()` added an `ls_frames <= 0` error branch
- The outer `FAIL_OPEN` catches it and falls back to the regular path, so no empty video is written
6. **Time-base alignment fix (slow motion / ghosting)**
- `_crop_and_upscale_video()` now outputs at the source video's fps, so the enhanced video's timeline is no longer stretched
- `blend_back()` maps original frame indices via `orig_fps/ls_fps`, so blending no longer uses only the leading frames (which caused slowed motion/ghosting)
7. **Silent-video fix**
- After a successful small-face blend-back, a mux step re-attaches the audio track
- The current task's `audio_path` is force-muxed back into the blended video, so the enhanced path never goes silent
8. **Eye-ghosting fix**
- The mouth mask start moved further down to 68%, with 16% margins on both sides, so the eye/nostril areas take less part in blending
- After `seamlessClone`, a second mask-bounded blend suppresses Poisson diffusion above the eyes
9. **Deformation avoidance (operational)**
- `LIPSYNC_SMALL_FACE_THRESHOLD=9999` is for pipeline smoke tests only, never quality evaluation
- Restore `LIPSYNC_SMALL_FACE_THRESHOLD=256` before any quality verification
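The fps selection after fix 3, condensed from `_get_video_info()` (the example `stream` values are illustrative):

```python
stream = {"avg_frame_rate": "30000/1001", "r_frame_rate": "60/1"}  # example ffprobe JSON fields

def parse_rate(s: str) -> float:
    num, _, den = s.partition("/")
    try:
        return float(num) / float(den) if den else float(s)
    except (ValueError, ZeroDivisionError):
        return 0.0

avg_fps = parse_rate(stream.get("avg_frame_rate", "0/0"))
r_fps = parse_rate(stream.get("r_frame_rate", "25/1"))
fps = avg_fps if avg_fps > 0 else (r_fps if r_fps > 0 else 25.0)  # → ~29.97, not 60
```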
---
## ✅ 5) Deployment docs and verification
Added and back-filled the deployment doc: `Docs/FACEENHANCE_DEPLOY.md`.
Doc corrections:
- Health-check URL corrected to `/api/videos/lipsync/health`
- Response examples now include the outer `success/data` wrapper
Verified in practice:
- `GET /api/videos/lipsync/health` returns `data.small_face_enhance`
- Default `enabled=false`; with the switch off, behavior matches the old version
- `detector_loaded=false` (lazy loading) is as expected
---
## 📁 Files changed today
| File | Change |
|------|------|
| `backend/app/core/config.py` | Added the `LIPSYNC_SMALL_FACE_*` config keys (5) |
| `backend/.env` | Added the small-face enhance switch and parameters |
| `backend/app/services/small_face_enhance_service.py` | New: detect/crop/upscale/blend-back service; follow-up patches add lazy loading and compatibility fixes |
| `backend/app/services/lipsync_service.py` | Integrated the enhance chain, extracted `_run_selected_model`, added health status |
| `backend/requirements.txt` | Added `opencv-python-headless` and `gfpgan` |
| `models/FaceEnhance/GFPGANv1.4.pth` | New super-resolution weights |
| `Docs/FACEENHANCE_DEPLOY.md` | New deployment doc; corrected health-check path/response example |
---
## ⚠️ Known limitations
- Only the local lip-sync path (`_local_generate()`) is wired up; remote mode has no small-face compensation
- Multi-shot footage still gets one global decision; no per-segment small-face detection yet
- v1 prioritizes single-person selfie stability; multi-face switching comes later

Docs/FACEENHANCE_DEPLOY.md Normal file

@@ -0,0 +1,428 @@
# Small-Face LipSync Compensation Deployment Guide
> **Updated**: 2026-03-10 (v1.4)
> **Applies to**: SmallFaceEnhance v1.4 (embedded in the Backend process)
> **Architecture**: internal module of LipSyncService; no separate process
---
## Architecture overview
The small-face lip-sync quality compensation chain ("small-face enhance" for short) is a **pre-processing branch** of `LipSyncService._local_generate()`: before lipsync inference, it automatically detects small faces and upgrades the input quality:
```
Original video + audio
→ video looping (existing logic)
→ small-face detection (SCRFD, CPU)
→ [not small] infer directly with the user-selected model (existing path)
→ [small]
A. Crop the main face region (with padding)
B. Sparse keyframe super-resolution to 512px (GFPGAN, GPU0)
C. Infer with the user-selected model (MuseTalk or LatentSync)
D. Lower-face mask feathering + seamlessClone back onto the original frames
→ continue the existing downstream flow (subtitles/BGM/upload)
```
**Key constraints**:
- No frontend changes, no API-contract changes
- Model choice stays with the user; a small face never auto-switches models
- Fail-open by default: if any step of the enhance chain fails, fall back to the original pipeline
- No separate process/PM2; runs inside `vigent2-backend`
---
## Hardware requirements
| Item | Notes |
|------|------|
| Detector | SCRFD (det_10g.onnx), CPU inference, no extra GPU cost |
| Super-resolution | GFPGAN, GPU0 (same card as MuseTalk, run sequentially), ~2-3GB VRAM |
| Memory | Streaming ffmpeg pipe, frame by frame; no large extra memory footprint |
> Super-resolution shares GPU0 with MuseTalk and runs sequentially, so their VRAM is never held at the same time.
---
## Dependency installation
### 1. pip dependencies
Already added to `backend/requirements.txt`:
```
opencv-python-headless>=4.8.0
gfpgan>=1.3.8
```
Install:
```bash
cd /home/rongye/ProgramFiles/ViGent2/backend
pip install opencv-python-headless gfpgan
```
> `gfpgan` pulls in `basicsr`, `facexlib`, and other dependencies automatically.
> Confirm `onnxruntime` separately (the LatentSync environment already has 1.23.2).
> If the backend virtualenv lacks onnxruntime, install it: `pip install onnxruntime`
### 2. System dependencies
- `ffmpeg` / `ffprobe`: already present (required for video processing)
---
## Model weights
### Directory layout
```
models/
├── FaceEnhance/
│ └── GFPGANv1.4.pth ← super-resolution weights (~333MB)
└── LatentSync/checkpoints/auxiliary/
└── models/buffalo_l/
└── det_10g.onnx ← face-detection weights (~16MB, reuses existing file)
```
### Download
**GFPGAN weights** (already downloaded):
```bash
cd /home/rongye/ProgramFiles/ViGent2/models/FaceEnhance
wget -O GFPGANv1.4.pth "https://github.com/TencentARC/GFPGAN/releases/download/v1.3.4/GFPGANv1.4.pth"
```
**SCRFD detector weights**:
Reuses LatentSync's existing `det_10g.onnx`; no extra download. The code resolves the path automatically:
`models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx`
> Missing weights fail open and skip enhancement; they never fail the task.
---
## Backend configuration
Relevant variables in `backend/.env`:
```ini
# =============== Small-face lip-sync compensation ===============
LIPSYNC_SMALL_FACE_ENHANCE=false # master switch (true/false)
LIPSYNC_SMALL_FACE_THRESHOLD=256 # trigger threshold (pixels; face width < this triggers)
LIPSYNC_SMALL_FACE_UPSCALER=gfpgan # SR model: gfpgan | codeformer
LIPSYNC_SMALL_FACE_GPU_ID=0 # SR GPU (same card as MuseTalk)
LIPSYNC_SMALL_FACE_FAIL_OPEN=true # failure fallback (true=fall back to original flow, false=raise)
```
Defaults in `backend/app/core/config.py`:
```python
LIPSYNC_SMALL_FACE_ENHANCE: bool = False
LIPSYNC_SMALL_FACE_THRESHOLD: int = 256
LIPSYNC_SMALL_FACE_UPSCALER: str = "codeformer"
LIPSYNC_SMALL_FACE_GPU_ID: int = 0
LIPSYNC_SMALL_FACE_FAIL_OPEN: bool = True
```
> `.env` takes precedence over the `config.py` defaults; `config.py` applies only when a key is unset in `.env`.
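Why `.env` wins: the backend's `Settings(BaseSettings)` class reads environment/`.env` values on top of its class defaults. A minimal sketch, assuming pydantic v1-style settings (newer pydantic moves `BaseSettings` into the separate `pydantic-settings` package; the project's exact variant may differ):

```python
from pydantic import BaseSettings  # pydantic v1 style; see lead-in

class Settings(BaseSettings):
    # class defaults apply only when .env / the environment leave the key unset
    LIPSYNC_SMALL_FACE_ENHANCE: bool = False
    LIPSYNC_SMALL_FACE_UPSCALER: str = "codeformer"

    class Config:
        env_file = ".env"  # LIPSYNC_SMALL_FACE_UPSCALER=gfpgan here overrides the default

settings = Settings()
```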
### Internal module constants
These parameters are fixed as code constants (`small_face_enhance_service.py`) and are not env-configurable for now:
| Constant | Value | Meaning |
|------|-----|------|
| `PADDING` | 0.28 | bbox expansion ratio |
| `DETECT_EVERY` | 8 | detect every N frames; in-between frames use EMA interpolation |
| `TARGET_SIZE` | 512 | super-resolution target size |
| `MASK_FEATHER` | 15 | lower-face mask feathering, pixels |
| `MASK_UPPER_RATIO` | 0.68 | mouth-mask start (68% of crop height; covers only mouth/chin) |
| `MASK_SIDE_MARGIN` | 0.16 | side-margin ratio, keeps cheeks/nostrils untouched |
| `SAMPLE_FRAMES` | 24 | frames sampled for the small-face decision |
| `SAMPLE_WINDOW` | (0.10, 0.30) | sampling window (10%~30% of the video) |
| `ENCODE_FPS` | 25 | intermediate-encode fps fallback (source fps preferred; 25 only when the source fps is unavailable) |
| `ENCODE_CRF` | 18 | intermediate-encode quality |
| `EMA_ALPHA` | 0.3 | bbox EMA smoothing factor |
---
## Enabling and verification
### 1. Turn on the compensation chain
```bash
# edit backend/.env
LIPSYNC_SMALL_FACE_ENHANCE=true
```
Restart the backend:
```bash
pm2 restart vigent2-backend
```
### 2. Forced-trigger test
Set an extreme threshold so that any video triggers enhancement:
```ini
LIPSYNC_SMALL_FACE_THRESHOLD=9999
```
> For pipeline smoke tests only, never quality evaluation. `9999` forces large-face footage into the enhance branch and may cause mid-face deformation/nostril artifacts.
Submit a video task and check the logs:
```bash
pm2 logs vigent2-backend --lines 50
```
You should see output like:
```
Small-face enhance: face_w=320px < threshold=9999px, triggering
✅ SCRFD detector loaded
✅ Upscaler loaded: gfpgan
Small-face enhance: face_w=320px threshold=9999px enhanced=True upscaler=gfpgan time=12.3s
✅ Small-face enhance + lip sync done: /path/to/output.mp4
```
### 3. Restore the normal threshold
Once verified, switch back to a sensible threshold:
```ini
LIPSYNC_SMALL_FACE_THRESHOLD=256
```
and restart the backend: `pm2 restart vigent2-backend`
### 4. Health check
```bash
curl http://localhost:8006/api/videos/lipsync/health | python3 -m json.tool
```
Should include `data.small_face_enhance`:
```json
{
"success": true,
"data": {
"small_face_enhance": {
"enabled": true,
"threshold": 256,
"detector_loaded": true
}
}
}
```
---
## Related files
| File | Notes |
|------|------|
| `backend/app/services/small_face_enhance_service.py` | Small-face enhance service (detect + crop + upscale + blend back) |
| `backend/app/services/lipsync_service.py` | Hybrid routing + small-face integration + `_run_selected_model()` |
| `backend/app/core/config.py` | `LIPSYNC_SMALL_FACE_*` config keys |
| `models/FaceEnhance/GFPGANv1.4.pth` | GFPGAN super-resolution weights |
| `models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx` | SCRFD detector weights (reused) |
| `Temp/小脸增强分支-实施计划.md` | Detailed design doc |
---
## Processing flow in detail
### 1. Detection (CPU)
- Uniformly sample 24 frames from the 10%~30% window of the video
- SCRFD (det_10g.onnx) detects the largest face; take the median face width
- Trigger when `face width < THRESHOLD` (sketch below)
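The trigger rule in miniature (the sampled widths are example values):

```python
from statistics import median

face_widths = [212.0, 230.5, 224.0]    # sampled max-face widths in px (example values)
triggered = median(face_widths) < 256  # 256 = LIPSYNC_SMALL_FACE_THRESHOLD
```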
### 2. Crop + track (CPU)
- Detect the face bbox every 8th frame; EMA-interpolate the frames in between
- Expand the bbox by 0.28 padding, clamped to the frame bounds (sketch below)
- Write the actually-read frame count back to `track.frame_count`, correcting ffprobe's estimate
- Stream the crop through an ffmpeg pipe, emitting a 512x512 video
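The bbox expansion and clamping, condensed from `_build_face_track()` (example values are illustrative):

```python
bbox = (400.0, 300.0, 520.0, 460.0)  # smoothed face bbox (x1, y1, x2, y2), example
frame_w, frame_h = 1920, 1080
PADDING = 0.28

x1, y1, x2, y2 = bbox
pad_w, pad_h = (x2 - x1) * PADDING, (y2 - y1) * PADDING
crop_box = (max(0, int(x1 - pad_w)), max(0, int(y1 - pad_h)),
            min(frame_w, int(x2 + pad_w)), min(frame_h, int(y2 + pad_h)))
```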
### 3. Super-resolution (GPU0)
- Detection frames (every 8th): full GFPGAN super-resolution
- Non-detection frames: bicubic resize to 512x512 (policy sketch below)
- The enhanced video's output fps follows the source fps (no longer fixed at 25fps), avoiding time-base stretching
- `torch.cuda.empty_cache()` runs automatically after inference
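The sparse policy, condensed from `_crop_and_upscale_video()` (the `upscale_face` stand-in keeps the sketch runnable; the real call goes through GFPGAN/CodeFormer):

```python
import cv2
import numpy as np

DETECT_EVERY, TARGET_SIZE = 8, 512

def upscale_face(img: np.ndarray, size: int) -> np.ndarray:
    # stand-in for the GFPGAN/CodeFormer path
    return cv2.resize(img, (size, size), interpolation=cv2.INTER_LANCZOS4)

crops = [np.zeros((220, 220, 3), np.uint8) for _ in range(24)]  # example face crops
frames = [
    upscale_face(c, TARGET_SIZE) if i % DETECT_EVERY == 0        # detection frames: SR
    else cv2.resize(c, (TARGET_SIZE, TARGET_SIZE), interpolation=cv2.INTER_CUBIC)
    for i, c in enumerate(crops)
]
```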
### 4. Lipsync inference
- The user-selected model (fast/default/advanced) runs on the enhanced face video
- Model-selection semantics unchanged
### 5. Blend back (CPU)
- Local mouth mask (starts at 68% of the crop height + 16% side margins) + 15px Gaussian feathering (covers mouth/chin only)
- `cv2.seamlessClone(NORMAL_CLONE)` pastes it back onto the original frame
- The seamlessClone result gets a second mask-bounded alpha blend, keeping the fusion from diffusing above the eyes
- If seamlessClone fails, fall back to alpha blending
- Blend-back maps original frame indices along the timeline (`orig_fps/ls_fps`, sketch below), so it no longer uses only the leading frames (slowed motion/ghosting)
- Frame-count guard: lipsync output follows the audio length, so its frame count is usually <= the original looped video; raise only when `lipsync frames > original frames`, blend normally when `<=`
- Empty-output guard: `lipsync frames <= 0` raises immediately; the outer `FAIL_OPEN` falls back to the original flow, so no empty video is written
- Audio mux: after blending back, the task's `audio_path` is muxed back in, so the enhanced path never produces a silent video
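The time-base mapping in miniature, condensed from `blend_back()` (the rates are example values; real ones come from ffprobe):

```python
orig_fps, ls_fps = 30.0, 25.0  # example frame rates
orig_frames, ls_frames = 300, 250

for i in range(ls_frames):
    # pair lipsync frame i with the original frame at the same timestamp
    target_orig_idx = min(orig_frames - 1, int(round((i / ls_fps) * orig_fps)))
```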
---
## Rollback plan
**Level 1 (seconds)**:
```ini
LIPSYNC_SMALL_FACE_ENHANCE=false
```
Restart the backend and every task takes the original path.
**Level 2 (version-level)**:
Revert the `lipsync_service.py` integration commit; the config keys remain but become inert.
---
## FAQ
### onnxruntime not installed
```
⚠️ SCRFD init failed: No module named 'onnxruntime'
```
**Fix**:
```bash
pip install onnxruntime
```
### GFPGAN weights missing
```
⚠️ GFPGAN weights not found: .../models/FaceEnhance/GFPGANv1.4.pth
```
**Fix**: download per the "Model weights" section above. With the weights missing, super-resolution degrades automatically to bicubic resize.
### Frame-count anomaly causing fail-open
```
⚠️ Small-face blend-back failed, falling back: Frame-count anomaly: lipsync=300 > original=250
```
**Note**: v1.1 relaxed the frame-count check. The lipsync model emits frames according to the audio length, usually <= the looped video's frame count, in which case blend-back proceeds normally. Only a lipsync frame count **greater than** the original raises (a genuine anomaly).
### Empty lipsync output causing fallback
```
⚠️ Small-face blend-back failed, falling back: lipsync emitted 0 frames, skipping blend-back
```
**Note**: v1.2 added the empty-output guard. When `ls_frames <= 0`, it raises immediately and the outer fail-open reverts to the regular lip-sync path, preventing an empty video file.
### Slowed motion / eye ghosting after enhancement
**Cause**: when the original and lipsync-output fps differ, pasting back by identical frame index misaligns the timeline (only the leading frames get blended back).
**Fix**: v1.3 maps frame indices along the timeline via `orig_fps/ls_fps`, so blend-back uses the time-corresponding frame rather than the same index, and the enhanced video's output fps follows the source fps.
**Further fixes (v1.4)**:
- The mask start moved down to 68% with 16% margins on both sides, reducing eye/nostril involvement in blending
- seamlessClone output is mask-bounded, preventing Poisson diffusion from ghosting above the eyes
### Facial deformation after enhancement (nostrils/mid-face)
**Most likely cause**: the test threshold `LIPSYNC_SMALL_FACE_THRESHOLD=9999` forced large-face footage that needed no enhancement into the compensation chain.
**Suggested handling**:
- First restore `LIPSYNC_SMALL_FACE_THRESHOLD=256` and restart the backend.
- If anomalies persist, temporarily set `LIPSYNC_SMALL_FACE_ENHANCE=false` for an A/B comparison before tuning further.
### Silent output after enhancement
**Cause**: the blend-back stage writes rawvideo without an audio track by default.
**Fix**: v1.3 force-muxes the audio after blend-back, writing the current task's `audio_path` back in.
> v1.0 used a strict equality check (`lipsync != original` failed), which misfired when the looped video had far more frames than the audio. Fixed in v1.1.
### Lip offset after enhancement
Check whether the `PADDING` constant is reasonable. Too little padding crops too tightly; too much pulls in background. The current default, 0.28 (28%), suits most single-person selfie footage.
### torchvision compatibility (functional_tensor)
```
No module named 'torchvision.transforms.functional_tensor'
```
**Cause**: torchvision >= 0.20 removed the `functional_tensor` module, but `basicsr` (a gfpgan dependency) still imports it.
**Fix**: the code ships a compatibility shim (`_ensure_upscaler()` injects it into `sys.modules`); no manual action needed. If the error still appears, check that `_ensure_upscaler` actually ran.
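The shim, as injected by `_ensure_upscaler()`:

```python
import sys

# alias the removed module to torchvision.transforms.functional
# before gfpgan/basicsr try to import it
if "torchvision.transforms.functional_tensor" not in sys.modules:
    try:
        import torchvision.transforms.functional as _F
        sys.modules["torchvision.transforms.functional_tensor"] = _F
    except ImportError:
        pass
```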
### cv2/numpy not installed
```
⚠️ cv2 not installed; small-face enhance unavailable
```
**Note**: `cv2` and `numpy` are lazy imports (`try/except`); when missing, small-face enhance disables itself without affecting backend startup or other features. Installing `opencv-python-headless` restores it.
---
## Known limitations (v1.4)
- Covers only the local lipsync path (`_local_generate()`); remote mode (`_remote_generate()`) is not wired up
- Multi-shot footage gets a single global decision; no per-segment small-face detection
- Stability is guaranteed only for single-person (main-face) scenes; no multi-face switching
- CodeFormer super-resolution needs an extra `basicsr` install; GFPGAN is the current recommendation
---
## v1.3 → v1.4 changes
| Fix | Notes |
|--------|------|
| Eye-ghosting fix | Mask start moved down to 68% + 16% side margins, reducing upper-face and nostril involvement in blending |
| Poisson-diffusion suppression | Second mask-bounded pass after seamlessClone, preventing ghosting above the eyes |
---
## v1.2 → v1.3 changes
| Fix | Notes |
|--------|------|
| Time-base fix | `_crop_and_upscale_video()` outputs at the source video's fps, so the enhanced video's timeline is no longer stretched |
| Blend-back alignment fix | `blend_back()` maps original frame indices via `orig_fps/ls_fps`, reducing slowed motion and ghosting |
| Audio fix | Mux step added after a successful blend-back, so the enhanced path is never silent |
---
## v1.1 → v1.2 changes
| Fix | Notes |
|--------|------|
| Empty-output guard | `blend_back()` raises on `ls_frames <= 0` and the outer fail-open reverts, so no empty video is written |
---
## v1.0 → v1.1 changes
| Fix | Notes |
|--------|------|
| ffprobe parsing | CSV → JSON with named field access; a missing `nb_frames` can no longer shift fields |
| fps selection | Prefer `avg_frame_rate` (true average); `r_frame_rate` as fallback only, avoiding timebase multiples like `60/1` inflating frame estimates |
| Actual frame-count write-back | `_build_face_track()` overrides the estimate with the frame count ffmpeg actually read, making `track.frame_count` accurate |
| Blend-back frame check | Relaxed: `lipsync <= original` blends normally, only `>` raises; matches MuseTalk/LatentSync emitting frames per audio length |
| Boundary guards | Return None when `streams` is empty; fall back to 25fps when `r_frame_rate` has a zero denominator |
| torchvision compat | `_ensure_upscaler()` injects the `functional_tensor` shim for torchvision >= 0.20 |
| lazy import | `cv2`/`numpy` wrapped in `try/except`; when missing, enhancement self-disables without affecting backend startup |
| Type annotations | `from __future__ import annotations` prevents NameError from `np.ndarray` annotations when dependencies are missing |


@@ -1,8 +1,8 @@
# ViGent2 Development Task Log
**Project**: ViGent2 digital-human spoken-video generation system
**Progress**: 100% (Day 34 - multi-shot timeline rework + deep-learning copy dialog mis-close guard)
**Updated**: 2026-03-09
**Progress**: 100% (Day 35 - small-face lip-sync compensation rollout + deployment verification)
**Updated**: 2026-03-10
---
@@ -10,7 +10,25 @@
> Core development work and milestones, day by day.
### Day 34: multi-shot timeline rework + deep-learning copy dialog mis-close guard + code-review fixes (Current)
### Day 35: small-face lip-sync compensation rollout + deployment verification + stability patches (Current)
- [x] **Small-face compensation landed**: added `small_face_enhance_service.py` implementing the full chain: SCRFD small-face detection (10%-30% sampling) -> crop tracking (detect every 8 frames + EMA) -> sparse keyframe super-resolution (GFPGAN/CodeFormer) -> lower-face blend-back (seamlessClone/alpha fallback).
- [x] **Backend integration done**: `lipsync_service.py` inserts enhancement after looping inside `_local_generate()`, extracts `_run_selected_model()` to unify model routing, and falls back automatically per `FAIL_OPEN` when enhancement fails.
- [x] **Config and dependencies**: five new `LIPSYNC_SMALL_FACE_*` config keys; `requirements.txt` adds `opencv-python-headless` and `gfpgan`; new `models/FaceEnhance/GFPGANv1.4.pth` weights directory.
- [x] **Deployment doc added**: added and back-filled `Docs/FACEENHANCE_DEPLOY.md`, covering deployment, weights, switches, verification, and rollback.
- [x] **Production stability fixes**:
- `small_face_enhance_service.py` adds `cv2/numpy` lazy-load guards; missing dependencies skip enhancement without affecting the main pipeline.
- Added `from __future__ import annotations`, so `np.ndarray` annotations can't fail at import time when dependencies are missing.
- Added the `torchvision.transforms.functional_tensor` shim, fixing GFPGAN initialization under `torchvision>=0.20`.
- `_get_video_info()` switched to JSON field parsing and prefers `avg_frame_rate`, fixing frame-count drift from a missing `nb_frames`.
- `_build_face_track()` writes back the actual frame count; `blend_back()` relaxed to blend normally when `lipsync <= original`, raising only on `>`.
- `blend_back()` added an `ls_frames <= 0` empty-output guard; on error `FAIL_OPEN` reverts to the regular path, so no empty video is written.
- Time-base fix: enhanced-video output fps follows the source fps, and blend-back maps original frame indices via `orig_fps/ls_fps`, fixing slowed motion and ghosting.
- Audio fix: mux step after a successful blend-back ensures the small-face path outputs video with sound.
- Eye-ghosting fix: mask start moved down to 68% with 16% side margins, plus a mask-bounded second blend after seamlessClone, reducing ghosting above the eyes.
- Operational policy: `LIPSYNC_SMALL_FACE_THRESHOLD=9999` is for pipeline smoke tests only; quality verification and daily operation stay at `256`.
- [x] **Deployment verified**: `GET /api/videos/lipsync/health` now returns `data.small_face_enhance`; default `enabled=false`, and with the switch off behavior matches the original pipeline.
### Day 34: multi-shot timeline rework + deep-learning copy dialog mis-close guard + code-review fixes
- [x] **Timeline model rework**: multi-asset moved from "equal sequential segments" to "main asset plays continuously + inserted shot blocks", supporting free insertion and drag-to-move.
- [x] **Frontend chain landed**: rewrote `useTimelineEditor` and `TimelineEditor`, added main-asset/insert-candidate semantics, adapted the full `useHomeController` / `HomePage` / `MaterialSelector` chain.
- [x] **Backend generation chain adapted**: `workflow.py` corrected the `material_paths` source, added new `custom_assignments` validation, deduplicated material downloads with bounded segment-processing concurrency, keeping single-asset compatibility.

@@ -16,7 +16,8 @@
## ✨ Features
### Core capabilities
- 🎬 **HD lip sync** - hybrid approach: short videos (below this repo's current `.env` threshold of 100s, configurable) use LatentSync 1.6 (high-quality latent diffusion); long videos use MuseTalk 1.5 (near-real-time single-step inference), with automatic routing + fallback. Frontend model choices: default (threshold auto-routing) / fast (speed first) / advanced (quality first).
- 🧠 **Small-face lip-sync compensation (optional)** - the local lip-sync path supports a small-face detect + crop + sparse keyframe super-resolution + lower-face blend-back compensation chain; off by default (`LIPSYNC_SMALL_FACE_ENHANCE=false`), with automatic fail-open fallback to the original pipeline
- 🎙️ **Multimodal voiceover** - supports **EdgeTTS** (Microsoft ultra-natural voices, 10 languages) and **CosyVoice 3.0** (3-second voice cloning, 9 languages + 18 dialects, adjustable speed/tone). Uploaded reference audio is auto-transcribed with Whisper + smart-trimmed. Voiceover-first workflow: generate the voiceover → pick assets → generate the video.
- 📝 **Smart subtitles** - integrates faster-whisper + Remotion; auto-generates word-level highlighted (karaoke-style) subtitles.
- 🎨 **Style presets** - 12 title + 8 subtitle presets with preview + font-size control + a custom font library. Native CSS stroke rendering, crisp with no ghosting.
@@ -67,9 +68,10 @@
- **[Deployment Manual (DEPLOY_MANUAL.md)](Docs/DEPLOY_MANUAL.md)** - 👈 **start here for deployment**! Full environment setup steps.
- [Multi-platform publishing (PUBLISH_DEPLOY.md)](Docs/PUBLISH_DEPLOY.md) - Douyin/WeChat Channels/Bilibili/Xiaohongshu login and automated publishing.
- [Reference-audio service deployment (COSYVOICE3_DEPLOY.md)](Docs/COSYVOICE3_DEPLOY.md) - voice-cloning model deployment guide.
- [LatentSync deployment guide (LATENTSYNC_DEPLOY.md)](Docs/LATENTSYNC_DEPLOY.md) - standalone lip-sync model deployment.
- [MuseTalk deployment guide (MUSETALK_DEPLOY.md)](Docs/MUSETALK_DEPLOY.md) - long-video lip-sync model deployment.
- [Small-face compensation deployment guide (FACEENHANCE_DEPLOY.md)](Docs/FACEENHANCE_DEPLOY.md) - small-face lip-sync compensation deployment and verification.
- [Supabase deployment guide (SUPABASE_DEPLOY.md)](Docs/SUPABASE_DEPLOY.md) - Supabase and auth system configuration.
- [Alipay deployment guide (ALIPAY_DEPLOY.md)](Docs/ALIPAY_DEPLOY.md) - Alipay paid-membership configuration.
### Development docs


@@ -83,6 +83,13 @@ MUSETALK_ENCODE_PRESET=slow
# Audio duration >= this threshold (seconds) uses MuseTalk; below it uses LatentSync
LIPSYNC_DURATION_THRESHOLD=100
# =============== Small-face lip-sync compensation ===============
LIPSYNC_SMALL_FACE_ENHANCE=true
LIPSYNC_SMALL_FACE_THRESHOLD=256
LIPSYNC_SMALL_FACE_UPSCALER=gfpgan
LIPSYNC_SMALL_FACE_GPU_ID=0
LIPSYNC_SMALL_FACE_FAIL_OPEN=true
# =============== Upload config ===============
# Max upload file size (MB)
MAX_UPLOAD_SIZE_MB=500


@@ -37,22 +37,22 @@ class Settings(BaseSettings):
DOUYIN_BROWSER_CHANNEL: str = ""
DOUYIN_FORCE_SWIFTSHADER: bool = True
# Douyin debug screen recording
DOUYIN_DEBUG_ARTIFACTS: bool = False
DOUYIN_RECORD_VIDEO: bool = False
DOUYIN_KEEP_SUCCESS_VIDEO: bool = False
DOUYIN_RECORD_VIDEO_WIDTH: int = 1280
DOUYIN_RECORD_VIDEO_HEIGHT: int = 720
# Xiaohongshu Playwright config
XIAOHONGSHU_HEADLESS_MODE: str = "headless-new"
XIAOHONGSHU_USER_AGENT: str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36"
XIAOHONGSHU_LOCALE: str = "zh-CN"
XIAOHONGSHU_TIMEZONE_ID: str = "Asia/Shanghai"
XIAOHONGSHU_CHROME_PATH: str = "/usr/bin/google-chrome"
XIAOHONGSHU_BROWSER_CHANNEL: str = ""
XIAOHONGSHU_FORCE_SWIFTSHADER: bool = True
XIAOHONGSHU_DEBUG_ARTIFACTS: bool = False
# TTS config
DEFAULT_TTS_VOICE: str = "zh-CN-YunxiNeural"
@@ -78,6 +78,13 @@ class Settings(BaseSettings):
# Hybrid lip-sync routing
LIPSYNC_DURATION_THRESHOLD: float = 120.0 # seconds; >= this value routes to MuseTalk
# Small-face lip-sync compensation
LIPSYNC_SMALL_FACE_ENHANCE: bool = False
LIPSYNC_SMALL_FACE_THRESHOLD: int = 256
LIPSYNC_SMALL_FACE_UPSCALER: str = "codeformer"
LIPSYNC_SMALL_FACE_GPU_ID: int = 0
LIPSYNC_SMALL_FACE_FAIL_OPEN: bool = True
# Supabase config
SUPABASE_URL: str = ""
SUPABASE_PUBLIC_URL: str = "" # public base URL, used to build frontend-accessible URLs


@@ -11,12 +11,13 @@ import asyncio
import httpx
from pathlib import Path
from loguru import logger
from typing import Optional, Literal
from app.core.config import settings
from app.services.small_face_enhance_service import SmallFaceEnhanceService
class LipSyncService:
"""唇形同步服务 - LatentSync 1.6 + MuseTalk 1.5 混合方案"""
def __init__(self):
@@ -38,6 +39,9 @@ class LipSyncService:
# Runtime detection
self._weights_available: Optional[bool] = None
# Small-face enhance
self._face_enhance = SmallFaceEnhanceService()
def _check_weights(self) -> bool:
"""检查模型权重是否存在"""
@@ -93,7 +97,7 @@ class LipSyncService:
logger.warning(f"⚠️ 获取媒体时长失败: {e}")
return None
def _loop_video_to_duration(self, video_path: str, output_path: str, target_duration: float) -> str:
"""
Loop the video to match the target duration
Uses FFmpeg stream_loop for seamless looping
@@ -117,47 +121,70 @@ class LipSyncService:
else:
logger.warning(f"⚠️ 视频循环失败: {result.stderr[:200]}")
return video_path
except Exception as e:
logger.warning(f"⚠️ 视频循环异常: {e}")
return video_path
except Exception as e:
logger.warning(f"⚠️ 视频循环异常: {e}")
return video_path
def _mux_audio_to_video(self, video_path: str, audio_path: str, output_path: str) -> bool:
"""将音轨封装到视频,避免增强路径出现无声输出。"""
try:
cmd = [
"ffmpeg", "-y",
"-i", video_path,
"-i", audio_path,
"-map", "0:v:0",
"-map", "1:a:0",
"-c:v", "copy",
"-c:a", "aac",
"-shortest",
output_path,
]
result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
if result.returncode == 0 and Path(output_path).exists():
return True
logger.warning(f"⚠️ 音轨封装失败: {result.stderr[:200]}")
return False
except Exception as e:
logger.warning(f"⚠️ 音轨封装异常: {e}")
return False
async def generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int = 25,
model_mode: Literal["default", "fast", "advanced"] = "default",
) -> str:
"""Generate a lip-synced video"""
logger.info(f"🎬 Lip-sync task: {Path(video_path).name} + {Path(audio_path).name}")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
normalized_mode: Literal["default", "fast", "advanced"] = model_mode
if normalized_mode not in ("default", "fast", "advanced"):
normalized_mode = "default"
logger.info(f"🧠 Lipsync mode: {normalized_mode}")
if self.use_local:
return await self._local_generate(video_path, audio_path, output_path, fps, normalized_mode)
else:
return await self._remote_generate(video_path, audio_path, output_path, fps, normalized_mode)
async def _local_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int,
model_mode: Literal["default", "fast", "advanced"],
) -> str:
"""Run inference via subprocess against the LatentSync conda environment"""
logger.info("⏳ Waiting for GPU (queued)...")
async with self._lock:
# Intermediate files live in a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# Get audio and video durations
audio_duration = self._get_media_duration(audio_path)
@@ -172,133 +199,206 @@ class LipSyncService:
str(looped_video),
audio_duration
)
else:
actual_video_path = video_path
# Model routing
force_musetalk = model_mode == "fast"
force_latentsync = model_mode == "advanced"
auto_to_musetalk = (
model_mode == "default"
and audio_duration is not None
and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD
)
if force_musetalk:
logger.info("⚡ Forcing fast model: MuseTalk")
musetalk_result = await self._call_musetalk_server(
actual_video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk unavailable; fast model falls back to LatentSync")
elif auto_to_musetalk:
logger.info(
f"🔄 Audio {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s, routing to MuseTalk"
)
musetalk_result = await self._call_musetalk_server(
actual_video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync长视频会较慢")
elif force_latentsync:
logger.info("🎯 强制高级模型LatentSync")
# 检查 LatentSync 前置条件(仅在需要回退或使用 LatentSync 时)
if not self._check_conda_env():
logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path
if not self._check_weights():
logger.warning("⚠️ 模型权重不存在,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path
if self.use_server:
# 模式 A: 调用常驻服务 (加速模式)
return await self._call_persistent_server(actual_video_path, audio_path, output_path)
else:
actual_video_path = video_path
logger.info("🔄 调用 LatentSync 推理 (subprocess)...")
temp_output = tmpdir / "output.mp4"
# 构建命令
cmd = [
str(self.conda_python),
"-m", "scripts.inference",
"--unet_config_path", "configs/unet/stage2_512.yaml",
"--inference_ckpt_path", "checkpoints/latentsync_unet.pt",
"--inference_steps", str(settings.LATENTSYNC_INFERENCE_STEPS),
"--guidance_scale", str(settings.LATENTSYNC_GUIDANCE_SCALE),
"--video_path", str(actual_video_path), # 使用预处理后的视频
"--audio_path", str(audio_path),
"--video_out_path", str(temp_output),
"--seed", str(settings.LATENTSYNC_SEED),
"--temp_dir", str(tmpdir / "cache"),
]
if settings.LATENTSYNC_ENABLE_DEEPCACHE:
cmd.append("--enable_deepcache")
# Set environment variables
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
logger.info(f"🖥️ Command: {' '.join(cmd[:8])}...")
logger.info(f"🖥️ GPU: CUDA_VISIBLE_DEVICES={self.gpu_id}")
# ── Small-face enhance ──
enhance_result = None
try:
# Use an asyncio subprocess for truly asynchronous execution,
# so the event loop can keep serving other requests (e.g. progress polls)
process = await asyncio.create_subprocess_exec(
*cmd,
cwd=str(self.latentsync_dir),
env=env,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
enhance_result = self._face_enhance.enhance_if_needed(
video_path=str(actual_video_path),
tmpdir=tmpdir,
gpu_id=settings.LIPSYNC_SMALL_FACE_GPU_ID,
)
# Wait for the process to finish, with a timeout
try:
stdout, stderr = await asyncio.wait_for(
process.communicate(),
timeout=900 # 15-minute timeout
)
except asyncio.TimeoutError:
process.kill()
await process.wait()
logger.error("⏰ LatentSync inference timed out (15 minutes)")
shutil.copy(video_path, output_path)
return output_path
stdout_text = stdout.decode() if stdout else ""
stderr_text = stderr.decode() if stderr else ""
if process.returncode != 0:
logger.error(f"LatentSync inference failed:\n{stderr_text}")
logger.error(f"stdout:\n{stdout_text[-1000:] if stdout_text else 'N/A'}")
# Fallback
shutil.copy(video_path, output_path)
return output_path
logger.info(f"LatentSync 输出:\n{stdout_text[-500:] if stdout_text else 'N/A'}")
# 检查输出文件
if temp_output.exists():
shutil.copy(temp_output, output_path)
logger.info(f"✅ 唇形同步完成: {output_path}")
return output_path
else:
logger.warning("⚠️ 未找到输出文件,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path
except Exception as e:
logger.error(f"❌ 推理异常: {e}")
shutil.copy(video_path, output_path)
return output_path
if settings.LIPSYNC_SMALL_FACE_FAIL_OPEN:
logger.warning(f"⚠️ 小脸增强失败,跳过: {e}")
else:
raise
if enhance_result and enhance_result.was_enhanced:
track = enhance_result.track
if track is None:
raise RuntimeError("小脸增强轨迹缺失")
# 增强路径:模型推理增强后的人脸视频 → 贴回原视频
temp_sync = tmpdir / "face_sync.mp4"
await self._run_selected_model(
video_path=enhance_result.video_path,
audio_path=audio_path,
output_path=str(temp_sync),
tmpdir=tmpdir,
model_mode=model_mode,
audio_duration=audio_duration,
original_video_path=video_path,
)
try:
blended = self._face_enhance.blend_back(
original_video=str(actual_video_path),
lipsync_video=str(temp_sync),
track=track,
tmpdir=tmpdir,
)
blended_with_audio = tmpdir / "blended_with_audio.mp4"
if not self._mux_audio_to_video(
video_path=str(blended),
audio_path=audio_path,
output_path=str(blended_with_audio),
):
raise RuntimeError("贴回视频音轨封装失败")
shutil.copy(str(blended_with_audio), output_path)
logger.info(f"✅ 小脸增强 + 唇形同步完成: {output_path}")
return output_path
except Exception as e:
if settings.LIPSYNC_SMALL_FACE_FAIL_OPEN:
logger.warning(f"⚠️ 小脸贴回失败,回退原流程: {e}")
else:
raise
# 常规路径(未增强或增强失败)
return await self._run_selected_model(
video_path=str(actual_video_path),
audio_path=audio_path,
output_path=output_path,
tmpdir=tmpdir,
model_mode=model_mode,
audio_duration=audio_duration,
original_video_path=video_path,
)
async def _run_selected_model(
self,
video_path: str,
audio_path: str,
output_path: str,
tmpdir: Path,
model_mode: Literal["default", "fast", "advanced"],
audio_duration: Optional[float],
original_video_path: str,
) -> str:
"""模型路由 + 执行MuseTalk / LatentSync 常驻服务 / LatentSync subprocess"""
# 模型路由
force_musetalk = model_mode == "fast"
force_latentsync = model_mode == "advanced"
auto_to_musetalk = (
model_mode == "default"
and audio_duration is not None
and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD
)
if force_musetalk:
logger.info("⚡ 强制快速模型MuseTalk")
musetalk_result = await self._call_musetalk_server(
video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,快速模型回退到 LatentSync")
elif auto_to_musetalk:
logger.info(
f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s路由到 MuseTalk"
)
musetalk_result = await self._call_musetalk_server(
video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync长视频会较慢")
elif force_latentsync:
logger.info("🎯 强制高级模型LatentSync")
# 检查 LatentSync 前置条件
if not self._check_conda_env():
logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
shutil.copy(original_video_path, output_path)
return output_path
if not self._check_weights():
logger.warning("⚠️ 模型权重不存在,使用 Fallback")
shutil.copy(original_video_path, output_path)
return output_path
if self.use_server:
# 模式 A: 调用常驻服务 (加速模式)
return await self._call_persistent_server(video_path, audio_path, output_path)
logger.info("🔄 调用 LatentSync 推理 (subprocess)...")
temp_output = tmpdir / "output.mp4"
# 构建命令
cmd = [
str(self.conda_python),
"-m", "scripts.inference",
"--unet_config_path", "configs/unet/stage2_512.yaml",
"--inference_ckpt_path", "checkpoints/latentsync_unet.pt",
"--inference_steps", str(settings.LATENTSYNC_INFERENCE_STEPS),
"--guidance_scale", str(settings.LATENTSYNC_GUIDANCE_SCALE),
"--video_path", str(video_path),
"--audio_path", str(audio_path),
"--video_out_path", str(temp_output),
"--seed", str(settings.LATENTSYNC_SEED),
"--temp_dir", str(tmpdir / "cache"),
]
if settings.LATENTSYNC_ENABLE_DEEPCACHE:
cmd.append("--enable_deepcache")
# Set environment variables
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
logger.info(f"🖥️ Command: {' '.join(cmd[:8])}...")
logger.info(f"🖥️ GPU: CUDA_VISIBLE_DEVICES={self.gpu_id}")
try:
process = await asyncio.create_subprocess_exec(
*cmd,
cwd=str(self.latentsync_dir),
env=env,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:
stdout, stderr = await asyncio.wait_for(
process.communicate(),
timeout=900 # 15-minute timeout
)
except asyncio.TimeoutError:
process.kill()
await process.wait()
logger.error("⏰ LatentSync inference timed out (15 minutes)")
shutil.copy(original_video_path, output_path)
return output_path
stdout_text = stdout.decode() if stdout else ""
stderr_text = stderr.decode() if stderr else ""
if process.returncode != 0:
logger.error(f"LatentSync inference failed:\n{stderr_text}")
logger.error(f"stdout:\n{stdout_text[-1000:] if stdout_text else 'N/A'}")
shutil.copy(original_video_path, output_path)
return output_path
logger.info(f"LatentSync 输出:\n{stdout_text[-500:] if stdout_text else 'N/A'}")
if temp_output.exists():
shutil.copy(temp_output, output_path)
logger.info(f"✅ 唇形同步完成: {output_path}")
return output_path
else:
logger.warning("⚠️ 未找到输出文件,使用 Fallback")
shutil.copy(original_video_path, output_path)
return output_path
except Exception as e:
logger.error(f"❌ 推理异常: {e}")
shutil.copy(original_video_path, output_path)
return output_path
async def _call_musetalk_server(
self, video_path: str, audio_path: str, output_path: str
@@ -413,18 +513,18 @@ class LipSyncService:
"请确保 LatentSync 服务已启动 (cd models/LatentSync && python scripts/server.py)"
)
async def _remote_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int,
model_mode: Literal["default", "fast", "advanced"],
) -> str:
"""Call the remote LatentSync API service"""
if model_mode == "fast":
logger.warning("⚠️ Remote mode has no MuseTalk; the fast model will use remote LatentSync")
logger.info(f"📡 Calling remote API: {self.api_url}")
try:
async with httpx.AsyncClient(timeout=600.0) as client:
@@ -499,4 +599,9 @@ class LipSyncService:
"ready": conda_ok and weights_ok and gpu_ok,
"musetalk_ready": musetalk_ready,
"lipsync_threshold": settings.LIPSYNC_DURATION_THRESHOLD,
"small_face_enhance": {
"enabled": settings.LIPSYNC_SMALL_FACE_ENHANCE,
"threshold": settings.LIPSYNC_SMALL_FACE_THRESHOLD,
"detector_loaded": self._face_enhance._detector_session is not None,
},
}


@@ -0,0 +1,872 @@
"""
Small-face enhance service.
For wide-shot small-face footage: crop + super-resolve -> lipsync inference -> blend back, to raise input quality.
Single file, single class, consumed by LipSyncService.
"""
from __future__ import annotations
import subprocess
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Tuple, List
from loguru import logger
from app.core.config import settings
try:
import cv2
import numpy as np
_CV2_AVAILABLE = True
except ImportError:
_CV2_AVAILABLE = False
# ── Module constants ──
PADDING = 0.28 # bbox expansion ratio
DETECT_EVERY = 8 # detect every N frames
TARGET_SIZE = 512 # super-resolution target size
MASK_FEATHER = 15 # feathering, pixels
MASK_UPPER_RATIO = 0.68 # mouth-region start (covers only mouth/chin)
MASK_SIDE_MARGIN = 0.16 # side-margin ratio, keeps cheeks/nostrils untouched
SAMPLE_FRAMES = 24 # frames sampled
SAMPLE_WINDOW = (0.10, 0.30) # sampling window (10%~30%)
ENCODE_FPS = 25 # encode fps
ENCODE_CRF = 18 # encode quality
EMA_ALPHA = 0.3 # EMA smoothing factor
# Detection filters
MIN_FACE_WIDTH = 50
FACE_ASPECT_MIN = 0.2
FACE_ASPECT_MAX = 1.5
DET_SCORE_THRESH = 0.5
NMS_IOU_THRESH = 0.4
# Weight paths
_PROJECT_ROOT = Path(__file__).resolve().parent.parent.parent.parent
DET_MODEL_PATH = (
_PROJECT_ROOT
/ "models" / "LatentSync" / "checkpoints"
/ "auxiliary" / "models" / "buffalo_l" / "det_10g.onnx"
)
# ── Data structures ──
@dataclass
class FaceTrack:
"""Per-frame face-tracking data (for cropping + blend-back)"""
crop_boxes: List[Tuple[int, int, int, int]] # per-frame (x1,y1,x2,y2)
face_width_median: float
frame_count: int
frame_w: int
frame_h: int
@dataclass
class EnhanceResult:
"""Return value of enhance_if_needed"""
video_path: str
was_enhanced: bool
track: Optional[FaceTrack] = None
face_width: float = 0.0
class SmallFaceEnhanceService:
"""小脸增强服务:检测 → 裁切 → 超分 → (lipsync) → 贴回"""
def __init__(self):
self._detector_session = None
self._sr_model = None
self._sr_type: Optional[str] = None
# ================================================================
# SCRFD face detection (det_10g.onnx, CPU inference)
# ================================================================
def _ensure_detector(self) -> bool:
if self._detector_session is not None:
return True
if not DET_MODEL_PATH.exists():
logger.warning(f"⚠️ SCRFD 权重不存在: {DET_MODEL_PATH}")
return False
try:
import onnxruntime as ort
self._detector_session = ort.InferenceSession(
str(DET_MODEL_PATH),
providers=["CPUExecutionProvider"],
)
logger.info("✅ SCRFD 检测器已加载")
return True
except Exception as e:
logger.warning(f"⚠️ SCRFD 初始化失败: {e}")
return False
def _detect_faces(self, img_bgr: np.ndarray) -> List[Tuple[np.ndarray, float]]:
"""
Detect faces with SCRFD.
Returns: [(bbox_xyxy, score), ...] sorted by area, descending.
"""
if self._detector_session is None:
return []
h, w = img_bgr.shape[:2]
input_h, input_w = 640, 640
# ── Preprocess ──
ratio = min(input_h / h, input_w / w)
new_h, new_w = int(h * ratio), int(w * ratio)
resized = cv2.resize(img_bgr, (new_w, new_h))
padded = np.full((input_h, input_w, 3), 127.5, dtype=np.float32)
padded[:new_h, :new_w] = resized.astype(np.float32)
# BGR → RGB → normalize
blob = padded[:, :, ::-1].copy()
blob = (blob - 127.5) / 128.0
blob = blob.transpose(2, 0, 1)[np.newaxis].astype(np.float32)
# ── Inference ──
input_name = self._detector_session.get_inputs()[0].name
outputs = self._detector_session.run(None, {input_name: blob})
# det_10g outputs: [scores_s8, scores_s16, scores_s32,
# bbox_s8, bbox_s16, bbox_s32,
# kps_s8, kps_s16, kps_s32]
strides = [8, 16, 32]
all_bboxes = []
all_scores = []
for i, stride in enumerate(strides):
scores = outputs[i].flatten()
bboxes = outputs[i + 3].reshape(-1, 4)
# Generate anchor centers
feat_h = input_h // stride
feat_w = input_w // stride
anchors = []
for y in range(feat_h):
for x in range(feat_w):
cx, cy = x * stride, y * stride
anchors.append([cx, cy])
anchors.append([cx, cy]) # 2 anchors per cell
anchors = np.array(anchors, dtype=np.float32)
# Confidence filtering
mask = scores > DET_SCORE_THRESH
if not mask.any():
continue
f_scores = scores[mask]
f_bboxes = bboxes[mask]
f_anchors = anchors[mask]
# Decode: distance * stride → xyxy
decoded = np.empty_like(f_bboxes)
decoded[:, 0] = f_anchors[:, 0] - f_bboxes[:, 0] * stride
decoded[:, 1] = f_anchors[:, 1] - f_bboxes[:, 1] * stride
decoded[:, 2] = f_anchors[:, 0] + f_bboxes[:, 2] * stride
decoded[:, 3] = f_anchors[:, 1] + f_bboxes[:, 3] * stride
# Scale back to original image coordinates
decoded /= ratio
all_bboxes.append(decoded)
all_scores.append(f_scores)
if not all_bboxes:
return []
bboxes_cat = np.concatenate(all_bboxes)
scores_cat = np.concatenate(all_scores)
# NMS
keep = self._nms(bboxes_cat, scores_cat, NMS_IOU_THRESH)
# Size + aspect-ratio filtering
results = []
for idx in keep:
bbox = bboxes_cat[idx]
score = float(scores_cat[idx])
bw = bbox[2] - bbox[0]
bh = bbox[3] - bbox[1]
if bw < MIN_FACE_WIDTH or bh < MIN_FACE_WIDTH:
continue
aspect = bw / max(bh, 1)
if aspect < FACE_ASPECT_MIN or aspect > FACE_ASPECT_MAX:
continue
results.append((bbox.copy(), score))
results.sort(key=lambda x: (x[0][2] - x[0][0]) * (x[0][3] - x[0][1]), reverse=True)
return results
@staticmethod
def _nms(bboxes: np.ndarray, scores: np.ndarray, threshold: float) -> List[int]:
x1 = bboxes[:, 0]
y1 = bboxes[:, 1]
x2 = bboxes[:, 2]
y2 = bboxes[:, 3]
areas = (x2 - x1) * (y2 - y1)
order = scores.argsort()[::-1]
keep = []
while order.size > 0:
i = order[0]
keep.append(int(i))
if order.size == 1:
break
xx1 = np.maximum(x1[i], x1[order[1:]])
yy1 = np.maximum(y1[i], y1[order[1:]])
xx2 = np.minimum(x2[i], x2[order[1:]])
yy2 = np.minimum(y2[i], y2[order[1:]])
inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-6)
inds = np.where(iou <= threshold)[0]
order = order[inds + 1]
return keep
# ================================================================
# Video utilities
# ================================================================
@staticmethod
def _get_video_info(video_path: str) -> Optional[Tuple[int, int, int, float]]:
"""返回 (width, height, frame_count, fps)"""
try:
import json as _json
cmd = [
"ffprobe", "-v", "error",
"-select_streams", "v:0",
"-show_entries", "stream=width,height,nb_frames,r_frame_rate,avg_frame_rate",
"-of", "json",
video_path,
]
r = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
if r.returncode != 0:
return None
info = _json.loads(r.stdout)
streams = info.get("streams")
if not streams:
return None
stream = streams[0]
w, h = int(stream["width"]), int(stream["height"])
# nb_frames may be "N/A" or missing
nb_raw = stream.get("nb_frames", "N/A")
nb = int(nb_raw) if nb_raw not in ("N/A", "") else 0
def _parse_fps(s: str) -> float:
if "/" in s:
num, den = s.split("/")
return float(num) / float(den) if float(den) != 0 else 0.0
return float(s) if s else 0.0
# Prefer avg_frame_rate (true average); r_frame_rate may be a timebase multiple
avg_fps = _parse_fps(stream.get("avg_frame_rate", "0/0"))
r_fps = _parse_fps(stream.get("r_frame_rate", "25/1"))
fps = avg_fps if avg_fps > 0 else (r_fps if r_fps > 0 else 25.0)
if nb == 0:
cmd2 = [
"ffprobe", "-v", "error",
"-show_entries", "format=duration",
"-of", "default=noprint_wrappers=1:nokey=1",
video_path,
]
r2 = subprocess.run(cmd2, capture_output=True, text=True, timeout=10)
if r2.returncode == 0 and r2.stdout.strip():
nb = int(float(r2.stdout.strip()) * fps)
return w, h, nb, fps
except Exception as e:
logger.warning(f"⚠️ 获取视频信息失败: {e}")
return None
@staticmethod
def _open_video_reader(video_path: str, w: int, h: int,
seek_sec: float = 0, duration_sec: float = 0):
"""打开 ffmpeg rawvideo 读取管道"""
cmd = ["ffmpeg"]
if seek_sec > 0:
cmd += ["-ss", f"{seek_sec:.3f}"]
cmd += ["-i", video_path]
if duration_sec > 0:
cmd += ["-t", f"{duration_sec:.3f}"]
cmd += ["-f", "rawvideo", "-pix_fmt", "bgr24", "-v", "quiet", "-"]
return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
@staticmethod
def _read_one_frame(proc, w: int, h: int) -> Optional[np.ndarray]:
raw = proc.stdout.read(w * h * 3)
if len(raw) < w * h * 3:
return None
return np.frombuffer(raw, dtype=np.uint8).reshape(h, w, 3).copy()
@staticmethod
def _open_video_writer(output_path: str, w: int, h: int,
fps: int = ENCODE_FPS, crf: int = ENCODE_CRF):
"""打开 ffmpeg rawvideo 写入管道"""
cmd = [
"ffmpeg", "-y",
"-f", "rawvideo", "-pix_fmt", "bgr24",
"-s", f"{w}x{h}", "-r", str(fps), "-i", "-",
"-c:v", "libx264", "-crf", str(crf),
"-preset", "fast", "-pix_fmt", "yuv420p",
output_path,
]
return subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL)
# ================================================================
# Phase 2: face-size detection
# ================================================================
def _detect_face_size(self, video_path: str) -> Optional[float]:
"""
Uniformly sample the 10%~30% window of the video and detect the median width of the largest face.
Returns None when no face is detected or the detector is unavailable.
"""
if not self._ensure_detector():
return None
info = self._get_video_info(video_path)
if info is None:
return None
w, h, nb_frames, fps = info
if nb_frames < 1 or fps <= 0:
return None
# Compute the sampling window
start_frame = int(nb_frames * SAMPLE_WINDOW[0])
end_frame = int(nb_frames * SAMPLE_WINDOW[1])
end_frame = max(end_frame, start_frame + 1)
n_sample = min(SAMPLE_FRAMES, end_frame - start_frame)
if n_sample <= 0:
return None
step = max(1, (end_frame - start_frame) // n_sample)
sample_indices = set(range(start_frame, end_frame, step))
# Seek ffmpeg to the sampling start
seek_sec = start_frame / fps
duration_sec = (end_frame - start_frame) / fps + 0.5 # headroom
proc = self._open_video_reader(video_path, w, h, seek_sec, duration_sec)
face_widths = []
try:
for local_idx in range(end_frame - start_frame + 1):
frame = self._read_one_frame(proc, w, h)
if frame is None:
break
global_idx = start_frame + local_idx
if global_idx not in sample_indices:
continue
faces = self._detect_faces(frame)
if faces:
bbox = faces[0][0] # largest face
face_widths.append(float(bbox[2] - bbox[0]))
finally:
proc.stdout.close()
proc.terminate()
proc.wait()
if not face_widths:
return None
face_widths.sort()
mid = len(face_widths) // 2
if len(face_widths) % 2 == 0:
return (face_widths[mid - 1] + face_widths[mid]) / 2
return face_widths[mid]
# ================================================================
# Phase 3: crop + track
# ================================================================
def _build_face_track(self, video_path: str,
w: int, h: int, nb_frames: int) -> Optional[FaceTrack]:
"""
Per-frame face tracking: detect every DETECT_EVERY frames, EMA-interpolate in between.
Returns a FaceTrack, or None when detection fails.
"""
if not self._ensure_detector():
return None
detect_set = set(range(0, nb_frames, DETECT_EVERY))
# First pass: detection frames
proc = self._open_video_reader(video_path, w, h)
keyframe_bboxes = {}
actual_frames = 0
try:
for idx in range(nb_frames):
frame = self._read_one_frame(proc, w, h)
if frame is None:
break
actual_frames = idx + 1
if idx not in detect_set:
continue
faces = self._detect_faces(frame)
if faces:
keyframe_bboxes[idx] = faces[0][0].copy()
finally:
proc.stdout.close()
proc.terminate()
proc.wait()
if not keyframe_bboxes:
return None
# Use the frame count actually read, avoiding _get_video_info estimation drift
nb_frames = actual_frames
# Forward fill + EMA smoothing
sorted_keys = sorted(keyframe_bboxes.keys())
raw_bboxes: List[np.ndarray] = [None] * nb_frames # type: ignore
for k in sorted_keys:
raw_bboxes[k] = keyframe_bboxes[k]
prev = keyframe_bboxes[sorted_keys[0]]
for i in range(nb_frames):
if raw_bboxes[i] is not None:
prev = raw_bboxes[i]
else:
raw_bboxes[i] = prev.copy()
# EMA smoothing
smoothed = [raw_bboxes[0].copy()]
for i in range(1, nb_frames):
s = EMA_ALPHA * raw_bboxes[i] + (1 - EMA_ALPHA) * smoothed[-1]
smoothed.append(s)
# Crop box with padding (clamped to frame bounds)
crop_boxes = []
for bbox in smoothed:
x1, y1, x2, y2 = bbox
bw, bh = x2 - x1, y2 - y1
pad_w, pad_h = bw * PADDING, bh * PADDING
cx1 = max(0, int(x1 - pad_w))
cy1 = max(0, int(y1 - pad_h))
cx2 = min(w, int(x2 + pad_w))
cy2 = min(h, int(y2 + pad_h))
crop_boxes.append((cx1, cy1, cx2, cy2))
# Median face width
widths = sorted(float(b[2] - b[0]) for b in smoothed)
median_w = widths[len(widths) // 2]
return FaceTrack(
crop_boxes=crop_boxes,
face_width_median=median_w,
frame_count=nb_frames,
frame_w=w,
frame_h=h,
)
# ================================================================
# Phase 3: super-resolution
# ================================================================
def _ensure_upscaler(self, upscaler: str, gpu_id: int) -> bool:
"""懒加载超分模型"""
if self._sr_model is not None and self._sr_type == upscaler:
return True
try:
import sys
import torch
# torchvision >= 0.20 removed functional_tensor, but basicsr still imports it
if "torchvision.transforms.functional_tensor" not in sys.modules:
try:
import torchvision.transforms.functional as _F
sys.modules["torchvision.transforms.functional_tensor"] = _F
except ImportError:
pass
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
if upscaler == "gfpgan":
from gfpgan import GFPGANer
model_path = _PROJECT_ROOT / "models" / "FaceEnhance" / "GFPGANv1.4.pth"
if not model_path.exists():
logger.warning(f"⚠️ GFPGAN 权重不存在: {model_path}")
return False
self._sr_model = GFPGANer(
model_path=str(model_path),
upscale=2,
arch="clean",
channel_multiplier=2,
bg_upsampler=None,
device=device,
)
elif upscaler == "codeformer":
from basicsr.archs.codeformer_arch import CodeFormer as CodeFormerArch
model_path = _PROJECT_ROOT / "models" / "FaceEnhance" / "codeformer.pth"
if not model_path.exists():
logger.warning(f"⚠️ CodeFormer 权重不存在: {model_path}")
# 尝试回退 gfpgan
return self._ensure_upscaler("gfpgan", gpu_id)
net = CodeFormerArch(
dim_embd=512, codebook_size=1024, n_head=8, n_layers=9,
connect_list=["32", "64", "128", "256"],
).to(device)
ckpt = torch.load(str(model_path), map_location=device, weights_only=False)
net.load_state_dict(ckpt.get("params_ema", ckpt.get("params", ckpt)))
net.eval()
self._sr_model = net
self._sr_device = device
else:
logger.warning(f"⚠️ 未知超分器: {upscaler}")
return False
self._sr_type = upscaler
logger.info(f"✅ 超分器已加载: {upscaler}")
return True
except Exception as e:
logger.warning(f"⚠️ 超分器初始化失败 ({upscaler}): {e}")
return False
def _upscale_face(self, face_img: np.ndarray, target_size: int) -> np.ndarray:
"""用已加载的超分模型增强单帧,失败回退 bicubic"""
try:
if self._sr_type == "gfpgan":
_, _, output = self._sr_model.enhance(
face_img, paste_back=False, has_aligned=False,
)
if output is not None:
return cv2.resize(
output, (target_size, target_size),
interpolation=cv2.INTER_LANCZOS4,
)
elif self._sr_type == "codeformer":
import torch
img = cv2.resize(face_img, (512, 512))
img_t = (
torch.from_numpy(img.astype(np.float32) / 255.0)
.permute(2, 0, 1)
.unsqueeze(0)
.to(self._sr_device)
)
with torch.no_grad():
out = self._sr_model(img_t, w=0.7)[0]
out_np = (
out.squeeze().permute(1, 2, 0).cpu().numpy() * 255
).clip(0, 255).astype(np.uint8)
return cv2.resize(
out_np, (target_size, target_size),
interpolation=cv2.INTER_LANCZOS4,
)
except Exception as e:
logger.debug(f"超分失败,回退 bicubic: {e}")
return cv2.resize(
face_img, (target_size, target_size),
interpolation=cv2.INTER_CUBIC,
)
# ================================================================
# Phase 3: crop + upscale → enhanced video
# ================================================================
def _crop_and_upscale_video(
self,
video_path: str,
track: FaceTrack,
tmpdir: Path,
gpu_id: int,
source_fps: float,
) -> str:
"""
Crop the face region → sparse keyframe super-resolution → emit a TARGET_SIZE video.
Streaming, so memory stays bounded.
"""
output_path = str(tmpdir / "enhanced_face.mp4")
w, h = track.frame_w, track.frame_h
upscaler = settings.LIPSYNC_SMALL_FACE_UPSCALER
sr_available = self._ensure_upscaler(upscaler, gpu_id)
detect_set = set(range(0, track.frame_count, DETECT_EVERY))
reader = self._open_video_reader(video_path, w, h)
out_fps = max(1, int(round(source_fps))) if source_fps > 0 else ENCODE_FPS
writer = self._open_video_writer(output_path, TARGET_SIZE, TARGET_SIZE, fps=out_fps)
try:
for idx in range(track.frame_count):
frame = self._read_one_frame(reader, w, h)
if frame is None:
break
cx1, cy1, cx2, cy2 = track.crop_boxes[idx]
cropped = frame[cy1:cy2, cx1:cx2]
if sr_available and idx in detect_set:
enhanced = self._upscale_face(cropped, TARGET_SIZE)
else:
enhanced = cv2.resize(
cropped, (TARGET_SIZE, TARGET_SIZE),
interpolation=cv2.INTER_CUBIC,
)
writer.stdin.write(enhanced.tobytes())
finally:
reader.stdout.close()
reader.terminate()
reader.wait()
writer.stdin.close()
writer.wait()
if not Path(output_path).exists():
raise RuntimeError("增强视频写入失败")
return output_path
# ================================================================
# Phase 3: blend back
# ================================================================
def blend_back(
self,
original_video: str,
lipsync_video: str,
track: FaceTrack,
tmpdir,
) -> str:
"""
Blend the lipsync inference result back into the original video.
Lower-face mask + Gaussian feathering + seamlessClone.
"""
tmpdir = Path(tmpdir)
output_path = str(tmpdir / "blended_output.mp4")
w, h = track.frame_w, track.frame_h
# Get the lipsync video dimensions
ls_info = self._get_video_info(lipsync_video)
if ls_info is None:
raise RuntimeError("Cannot read lipsync video info")
ls_w, ls_h, ls_frames, ls_fps = ls_info
if ls_fps <= 0:
ls_fps = ENCODE_FPS
# Frame-count guard: the lipsync model emits frames per audio length, usually <= the original (looped) video
if ls_frames <= 0:
raise RuntimeError(f"lipsync emitted {ls_frames} frames, skipping blend-back")
if ls_frames > track.frame_count:
raise RuntimeError(
f"Frame-count anomaly: lipsync={ls_frames} > original={track.frame_count}"
)
blend_count = ls_frames
orig_info = self._get_video_info(original_video)
orig_fps = orig_info[3] if orig_info is not None else 0.0
if orig_fps <= 0:
orig_fps = ls_fps
orig_reader = self._open_video_reader(original_video, w, h)
ls_reader = self._open_video_reader(lipsync_video, ls_w, ls_h)
writer = self._open_video_writer(
output_path,
w,
h,
fps=max(1, int(round(ls_fps))),
)
current_orig_idx = -1
current_orig_frame = None
try:
for idx in range(blend_count):
target_orig_idx = min(
track.frame_count - 1,
int(round((idx / ls_fps) * orig_fps)),
)
while current_orig_idx < target_orig_idx:
frame = self._read_one_frame(orig_reader, w, h)
if frame is None:
current_orig_frame = None
break
current_orig_idx += 1
current_orig_frame = frame
orig_frame = current_orig_frame
ls_frame = self._read_one_frame(ls_reader, ls_w, ls_h)
if orig_frame is None or ls_frame is None:
break
cx1, cy1, cx2, cy2 = track.crop_boxes[target_orig_idx]
crop_w, crop_h = cx2 - cx1, cy2 - cy1
# Resize the lipsync output to the crop-region size
ls_resized = cv2.resize(
ls_frame, (crop_w, crop_h),
interpolation=cv2.INTER_LANCZOS4,
)
# Local mouth mask (covers only lips/chin where possible, keeping nose/eye areas untouched)
mask = np.zeros((crop_h, crop_w), dtype=np.uint8)
upper = int(crop_h * MASK_UPPER_RATIO)
left = int(crop_w * MASK_SIDE_MARGIN)
right = int(crop_w * (1.0 - MASK_SIDE_MARGIN))
if right - left < 8:
left, right = 0, crop_w
mask[upper:, left:right] = 255
# Central ellipse boosts the mouth-region weight
ellipse_center = (crop_w // 2, int(crop_h * 0.82))
ellipse_axes = (max(8, int(crop_w * 0.22)), max(8, int(crop_h * 0.13)))
cv2.ellipse(mask, ellipse_center, ellipse_axes, 0, 0, 360, 255, -1)
mask = cv2.GaussianBlur(mask, (0, 0), MASK_FEATHER)
# Blend
blended = self._blend_face_region(
orig_frame, ls_resized, mask, cx1, cy1, cx2, cy2,
)
writer.stdin.write(blended.tobytes())
finally:
for p in (orig_reader, ls_reader):
p.stdout.close()
p.terminate()
p.wait()
writer.stdin.close()
writer.wait()
if not Path(output_path).exists():
raise RuntimeError("融合视频写入失败")
return output_path
@staticmethod
def _blend_face_region(
orig: np.ndarray,
face: np.ndarray,
mask: np.ndarray,
x1: int, y1: int, x2: int, y2: int,
) -> np.ndarray:
"""seamlessClone 贴回,失败回退 alpha 混合"""
result = orig.copy()
crop_h, crop_w = face.shape[:2]
# Try seamlessClone
try:
center_x = (x1 + x2) // 2
center_y = int(y1 + (y2 - y1) * 0.7)
center_x = max(1, min(center_x, orig.shape[1] - 2))
center_y = max(1, min(center_y, orig.shape[0] - 2))
src = np.zeros_like(orig)
src[y1:y2, x1:x2] = face
full_mask = np.zeros(orig.shape[:2], dtype=np.uint8)
full_mask[y1:y2, x1:x2] = mask
if full_mask.max() > 0:
cloned = cv2.seamlessClone(
src, orig, full_mask, (center_x, center_y), cv2.NORMAL_CLONE,
)
# Confine the blend to the mask region, so Poisson diffusion can't ghost above the eyes
alpha = mask.astype(np.float32) / 255.0
alpha_3ch = np.stack([alpha] * 3, axis=-1)
roi_orig = orig[y1:y2, x1:x2].astype(np.float32)
roi_clone = cloned[y1:y2, x1:x2].astype(np.float32)
blended_roi = roi_orig * (1 - alpha_3ch) + roi_clone * alpha_3ch
result = orig.copy()
result[y1:y2, x1:x2] = blended_roi.astype(np.uint8)
return result
except Exception:
pass
# Fallback: alpha blending
alpha = mask.astype(np.float32) / 255.0
alpha_3ch = np.stack([alpha] * 3, axis=-1)
crop_region = result[y1:y2, x1:x2].astype(np.float32)
blended = crop_region * (1 - alpha_3ch) + face.astype(np.float32) * alpha_3ch
result[y1:y2, x1:x2] = blended.astype(np.uint8)
return result
# ================================================================
# Main entry point
# ================================================================
def enhance_if_needed(
self,
video_path: str,
tmpdir,
gpu_id: int,
) -> EnhanceResult:
"""
Main entry: detect small faces → crop + upscale → return the enhanced result.
If no enhancement is needed, returns was_enhanced=False.
"""
if not settings.LIPSYNC_SMALL_FACE_ENHANCE:
return EnhanceResult(video_path=video_path, was_enhanced=False)
if not _CV2_AVAILABLE:
logger.warning("⚠️ opencv/numpy 未安装,小脸增强不可用")
return EnhanceResult(video_path=video_path, was_enhanced=False)
start = time.time()
tmpdir = Path(tmpdir)
face_dir = tmpdir / "face_enhance"
face_dir.mkdir(exist_ok=True)
# ── Detect ──
face_width = self._detect_face_size(video_path)
if face_width is None:
logger.info("Small-face enhance: no face detected, skipping")
return EnhanceResult(video_path=video_path, was_enhanced=False)
threshold = settings.LIPSYNC_SMALL_FACE_THRESHOLD
if face_width >= threshold:
logger.info(
f"Small-face enhance: face_w={face_width:.0f}px >= threshold={threshold}px, skipping"
)
return EnhanceResult(
video_path=video_path, was_enhanced=False, face_width=face_width,
)
logger.info(
f"Small-face enhance: face_w={face_width:.0f}px < threshold={threshold}px, triggering"
)
# ── Build the track ──
info = self._get_video_info(video_path)
if info is None:
raise RuntimeError("Cannot read video info")
w, h, nb_frames, fps = info
track = self._build_face_track(video_path, w, h, nb_frames)
if track is None:
raise RuntimeError("Face tracking failed")
# ── Crop + upscale ──
enhanced_path = self._crop_and_upscale_video(
video_path,
track,
face_dir,
gpu_id,
source_fps=fps,
)
# Free the GPU cache
try:
import torch
if torch.cuda.is_available():
torch.cuda.empty_cache()
except ImportError:
pass
elapsed = time.time() - start
logger.info(
f"小脸增强: face_w={face_width:.0f}px threshold={threshold}px "
f"enhanced=True upscaler={settings.LIPSYNC_SMALL_FACE_UPSCALER} "
f"time={elapsed:.1f}s"
)
return EnhanceResult(
video_path=enhanced_path,
was_enhanced=True,
track=track,
face_width=face_width,
)


@@ -38,3 +38,7 @@ faster-whisper>=1.0.0
# Copy extraction and AI generation
yt-dlp>=2023.0.0
zai-sdk>=0.2.0
# Small-face enhance
opencv-python-headless>=4.8.0
gfpgan>=1.3.8