Update
@@ -194,6 +194,15 @@ backend/user_data/{user_uuid}/cookies/
- `MUSETALK_USE_FLOAT16` (half precision, default true)
- `LIPSYNC_DURATION_THRESHOLD` (seconds; durations >= this value use MuseTalk; code default 120, this repo's current `.env` sets 100)

### Small-face lip-sync quality compensation (local lip-sync path)
- `LIPSYNC_SMALL_FACE_ENHANCE` (master switch, default false)
- `LIPSYNC_SMALL_FACE_THRESHOLD` (trigger threshold, default 256)
- `LIPSYNC_SMALL_FACE_UPSCALER` (`gfpgan` / `codeformer`)
- `LIPSYNC_SMALL_FACE_GPU_ID` (upscaler GPU, default 0)
- `LIPSYNC_SMALL_FACE_FAIL_OPEN` (fail-open fallback, default true)

> See `Docs/FACEENHANCE_DEPLOY.md` for deployment and verification details.

### WeChat Channels
- `WEIXIN_HEADLESS_MODE` (headful/headless-new)
- `WEIXIN_CHROME_PATH` / `WEIXIN_BROWSER_CHANNEL`

@@ -126,9 +126,11 @@ backend/
> - `analysis_id` is bound to `user_id` and carries a TTL (default 20 minutes); it is used to safely read the title context in the later "generate copy" stage.

10. **Health checks**
    * `GET /api/videos/lipsync/health`: lip-sync service health (covers LatentSync + MuseTalk + hybrid routing threshold + `data.small_face_enhance`)
    * `GET /api/videos/voiceclone/health`: CosyVoice 3.0 service health

> Small-face compensation health fields: `data.small_face_enhance.enabled` (master switch), `threshold` (trigger threshold), `detector_loaded` (whether SCRFD has been lazy-loaded).

11. **Payment**
    * `POST /api/payment/create-order`: create an Alipay desktop-website payment order (requires payment_token)
    * `POST /api/payment/notify`: Alipay async notification callback (returns plain-text success/fail)
@@ -260,6 +262,13 @@ MUSETALK_API_URL=http://localhost:8011
MUSETALK_BATCH_SIZE=32
LIPSYNC_DURATION_THRESHOLD=100

# Small-face lip-sync quality compensation (off by default; recommended for gradual rollout)
LIPSYNC_SMALL_FACE_ENHANCE=false
LIPSYNC_SMALL_FACE_THRESHOLD=256
LIPSYNC_SMALL_FACE_UPSCALER=gfpgan
LIPSYNC_SMALL_FACE_GPU_ID=0
LIPSYNC_SMALL_FACE_FAIL_OPEN=true

# Tunable MuseTalk parameters (example)
MUSETALK_DETECT_EVERY=2
MUSETALK_BLEND_CACHE_EVERY=2
@@ -267,6 +276,8 @@ MUSETALK_ENCODE_CRF=14
MUSETALK_ENCODE_PRESET=slow
```

> Deployment, weight, and rollback notes for the small-face compensation chain are in `Docs/FACEENHANCE_DEPLOY.md` (only the local `_local_generate()` path is wired up; remote mode is not yet integrated).

### 4. Start the services

**Development mode (hot reload)**:

@@ -99,8 +99,11 @@ python -m scripts.server # 测试能否启动,Ctrl+C 退出
> MuseTalk is a single-step latent-space inpainting model (not a diffusion model); inference is near real-time, making it a good fit for long videos that reach the routing threshold (this repo's current `.env` example is >=100s). It shares GPU0 with CosyVoice and needs roughly 4-8GB of VRAM for fp16 inference. The compositing stage now encodes directly through an FFmpeg rawvideo pipe (`libx264` + configurable CRF/preset) while keeping numpy blending, reducing lossy intermediate files.

For the detailed standalone deployment guide, see:
**[MuseTalk deployment guide](MUSETALK_DEPLOY.md)**

Small-face lip-sync quality compensation (optional) deployment and verification:
**[Small-face compensation deployment guide](FACEENHANCE_DEPLOY.md)**

Brief steps:
1. Create a dedicated `musetalk` Conda environment (Python 3.10 + PyTorch 2.0.1 + CUDA 11.8)
@@ -248,9 +251,14 @@ cd /home/rongye/ProgramFiles/ViGent2/backend
| `MUSETALK_GPU_ID` | 0 | MuseTalk GPU index |
| `MUSETALK_API_URL` | `http://localhost:8011` | MuseTalk resident service address |
| `MUSETALK_BATCH_SIZE` | 32 | MuseTalk inference batch size |
| `MUSETALK_VERSION` | v15 | MuseTalk model version |
| `MUSETALK_USE_FLOAT16` | true | MuseTalk half-precision acceleration |
| `LIPSYNC_DURATION_THRESHOLD` | 100 | Seconds; >= this value uses MuseTalk, < uses LatentSync (code default 120; set it explicitly in `.env`) |
| `LIPSYNC_SMALL_FACE_ENHANCE` | false | Small-face compensation master switch (keep off until verified in a gradual rollout) |
| `LIPSYNC_SMALL_FACE_THRESHOLD` | 256 | Small-face trigger threshold (pixels) |
| `LIPSYNC_SMALL_FACE_UPSCALER` | gfpgan | Upscaler model (`gfpgan` / `codeformer`) |
| `LIPSYNC_SMALL_FACE_GPU_ID` | 0 | Upscaler GPU (same card as MuseTalk recommended) |
| `LIPSYNC_SMALL_FACE_FAIL_OPEN` | true | Whether to fall back to the normal flow when the compensation chain fails |
| `ALIPAY_APP_ID` | empty | Alipay app APPID |
| `ALIPAY_PRIVATE_KEY_PATH` | empty | Path to the app private key PEM file |
| `ALIPAY_PUBLIC_KEY_PATH` | empty | Path to the Alipay public key PEM file |

165 Docs/DevLogs/Day35.md Normal file
@@ -0,0 +1,165 @@
## Small-face lip-sync compensation landed + deployment verification (Day 35)

### Overview

Completed the backend implementation and deployment wrap-up of Small-Face LipSync Compensation. The core goal: without changing the user-facing model-selection semantics (`default/fast/advanced`), add a quality-compensation chain for long-shot small-face footage (detect -> crop -> sparse super-resolution -> model inference -> blend back), kept off by default, fail-open on errors, and quickly rollback-able in production.

---

## ✅ 1) Backend capabilities

### 1.1 Configuration and switches

Five new configuration options (conservative defaults):

- `LIPSYNC_SMALL_FACE_ENHANCE` (default `false`)
- `LIPSYNC_SMALL_FACE_THRESHOLD` (default `256`)
- `LIPSYNC_SMALL_FACE_UPSCALER` (`gfpgan | codeformer`)
- `LIPSYNC_SMALL_FACE_GPU_ID` (default `0`)
- `LIPSYNC_SMALL_FACE_FAIL_OPEN` (default `true`)

Code entry points: `backend/app/core/config.py`, `backend/.env`.

### 1.2 New small-face enhancement service

Added `backend/app/services/small_face_enhance_service.py`, implementing the full compensation chain:

1. **Small-face detection** (CPU)
   - SCRFD (`det_10g.onnx`, reusing the LatentSync weights)
   - Uniformly samples 24 frames from the 10%-30% window of the video
   - Triggers by comparing the median of the largest face widths against the threshold

2. **Crop and track** (CPU)
   - Detect every 8 frames; forward-fill + EMA smoothing for the frames in between
   - Expand the bbox by `padding=0.28`

3. **Sparse super-resolution** (GPU0)
   - Detection frames go through GFPGAN/CodeFormer
   - Non-detection frames use bicubic resize
   - Target size `512x512`

4. **Blend back** (CPU)
   - Local mouth mask (starts at 68% height + 16% side margins) + Gaussian feathering (15px)
   - `cv2.seamlessClone`, falling back to alpha blend on failure

5. **Frame-count protection**
   - Verify `lipsync_frames <= original_frames` before blending back
   - Raise only when `lipsync_frames > original_frames`; otherwise blend back normally using the lipsync frame count

---

## ✅ 2) LipSyncService integration

Key changes in `backend/app/services/lipsync_service.py`:

- Inside `_local_generate()` the stages now run in order:
  - `video looping` -> `small face enhance` -> `model infer` -> `blend back`
- Extracted `_run_selected_model()` to unify model routing (MuseTalk / LatentSync server / LatentSync subprocess)
- The small-face branch is wrapped end-to-end in `try/except`, controlled by `LIPSYNC_SMALL_FACE_FAIL_OPEN`
- `check_health()` gains a `small_face_enhance` status field

Semantics preserved:

- Frontend and API contracts unchanged
- The user's model choice wins; a small face never forces a model switch
- Only the local path (`_local_generate`) is wired up; the remote path is not yet integrated
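The fail-open control described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `run_with_fail_open` and `enhance_fn` are hypothetical stand-ins for the guard inside `_local_generate()`.

```python
import logging

logger = logging.getLogger("lipsync")

def run_with_fail_open(enhance_fn, fallback_input: str, fail_open: bool = True) -> str:
    """Run the enhancement branch; on any failure either fall back to the
    untouched input (fail-open) or re-raise (fail-closed)."""
    try:
        return enhance_fn(fallback_input)
    except Exception as exc:
        if fail_open:
            logger.warning("small-face enhance failed, falling back: %s", exc)
            return fallback_input  # original path continues unchanged
        raise
```

With `fail_open=True` (the `LIPSYNC_SMALL_FACE_FAIL_OPEN` default), a broken enhancement chain can never fail the whole lip-sync task.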

---

## ✅ 3) Dependencies and weights

### 3.1 Dependencies

Added to `backend/requirements.txt`:

- `opencv-python-headless>=4.8.0`
- `gfpgan>=1.3.8`

### 3.2 Weights

- `models/FaceEnhance/GFPGANv1.4.pth` (new directory and weight)
- `models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx` (reused)

---

## ✅ 4) Stability fixes (post-deployment patches)

To resolve dependency compatibility, frame-count estimation drift, blend-back false failures, and output-quality issues found during actual deployment, nine follow-up fixes were added:

1. **Lazy loading + guards**
   - `cv2/numpy` imports moved into `try/except`
   - Enhancement entry guarded by `_CV2_AVAILABLE`
   - Missing dependencies skip enhancement without affecting the main flow

2. **Type annotations and torchvision compatibility patch**
   - Added `from __future__ import annotations` so `np.ndarray` annotations no longer fail at import time when dependencies are missing
   - `_ensure_upscaler()` injects
     `sys.modules['torchvision.transforms.functional_tensor']`
     for compatibility between `torchvision>=0.20` and the old `gfpgan/basicsr` import

3. **ffprobe fps and frame-count estimation fix**
   - `_get_video_info()` switched from `csv` to `json` field access, avoiding column drift when `nb_frames` is missing
   - fps now prefers `avg_frame_rate`, with `r_frame_rate` only as fallback

4. **Track frame count and blend-back check fix**
   - `_build_face_track()` records the frame count ffmpeg actually read, overriding the estimated `nb_frames`
   - `blend_back()` relaxed to blend normally when `lipsync <= original`, raising only on `>`

5. **Empty-output protection**
   - `blend_back()` adds an exception branch for `ls_frames <= 0`
   - The outer `FAIL_OPEN` catches it and falls back to the normal path, preventing an empty output video

6. **Time-base alignment fix (slow motion / ghosting)**
   - `_crop_and_upscale_video()` output fps now follows the source video's fps, avoiding timeline stretching in the enhanced video
   - `blend_back()` maps original frame indices by `orig_fps/ls_fps`, so it no longer blends only the leading frames (which caused slowed motion/ghosting)

7. **Silent-video fix**
   - After a successful blend-back, a new audio mux step runs
   - The current task's `audio_path` is force-muxed into the blended video so the enhanced path is never silent

8. **Eye ghosting fix**
   - The mouth mask start moved further down to 68% with 16% side margins, keeping the eye area and nose wings out of the blend
   - After `seamlessClone`, a mask-bounded second blend suppresses Poisson diffusion above the eyes

9. **Distortion avoidance (operations side)**
   - `LIPSYNC_SMALL_FACE_THRESHOLD=9999` is for pipeline smoke tests only, never for quality evaluation
   - Restore `LIPSYNC_SMALL_FACE_THRESHOLD=256` before any quality verification

---

## ✅ 5) Deployment docs and verification

Added and updated the deployment doc: `Docs/FACEENHANCE_DEPLOY.md`.

Doc corrections:

- Health-check URL corrected to `/api/videos/lipsync/health`
- Response examples now include the outer `success/data` wrapper

Verified in practice:

- `GET /api/videos/lipsync/health` returns `data.small_face_enhance`
- Default `enabled=false`; with the switch off, behavior matches the previous version
- `detector_loaded=false` (lazy loading) as expected

---

## 📁 Files changed today

| File | Change |
|------|------|
| `backend/app/core/config.py` | Added the 5 `LIPSYNC_SMALL_FACE_*` options |
| `backend/.env` | Added small-face enhancement switches and parameters |
| `backend/app/services/small_face_enhance_service.py` | New: detect/crop/upscale/blend-back service; follow-up patches add lazy loading and compatibility fixes |
| `backend/app/services/lipsync_service.py` | Integrated the enhancement chain, extracted `_run_selected_model`, health status field |
| `backend/requirements.txt` | Added `opencv-python-headless`, `gfpgan` |
| `models/FaceEnhance/GFPGANv1.4.pth` | New upscaler weight |
| `Docs/FACEENHANCE_DEPLOY.md` | New deployment doc; corrected health-check path/response example |

---

## ⚠️ Known limitations

- Only the local lip-sync path (`_local_generate()`) is wired up; remote mode has no small-face compensation
- Multi-shot footage still uses a single global decision; no per-segment small-face detection yet
- v1 prioritizes single-person selfie stability; multi-face switching comes later
428 Docs/FACEENHANCE_DEPLOY.md Normal file
@@ -0,0 +1,428 @@

# Small-face lip-sync quality compensation deployment guide

> **Updated**: 2026-03-10 v1.4
> **Applies to**: SmallFaceEnhance v1.4 (embedded in the Backend process)
> **Architecture**: internal module of LipSyncService, no separate process

---

## Architecture overview

The small-face lip-sync quality compensation chain ("small-face enhancement" for short) is a **pre-processing branch** of `LipSyncService._local_generate()`: before lip-sync inference it automatically detects small faces and boosts input quality:

```
source video + audio
  → video looping (existing logic)
  → small-face detection (SCRFD, CPU)
    → [not small] run the user-selected model directly (existing path)
    → [small]
       A. crop the main face region (with padding)
       B. sparse keyframe super-resolution to 512px (GFPGAN, GPU0)
       C. run the user-selected model (MuseTalk or LatentSync)
       D. feathered lower-face mask + seamlessClone back onto the original frames
  → continue the existing downstream flow (subtitles/BGM/upload)
```

**Key constraints**:
- No frontend or API protocol changes
- Model choice belongs to the user; a small face never switches models automatically
- Fail-open by default: if any step of the enhancement chain fails, fall back to the original flow
- No separate process/PM2; runs inside `vigent2-backend`

---

## Hardware requirements

| Component | Notes |
|------|------|
| Detector | SCRFD (det_10g.onnx), CPU inference, no extra GPU cost |
| Upscaler | GFPGAN, GPU0 (same card as MuseTalk, sequential execution), ~2-3GB VRAM |
| Memory | streaming ffmpeg pipe, frame-by-frame processing, no large extra memory footprint |

> The upscaler and MuseTalk share GPU0 and run sequentially, so they never hold VRAM at the same time.

---

## Installing dependencies

### 1. pip dependencies

Already added to `backend/requirements.txt`:

```
opencv-python-headless>=4.8.0
gfpgan>=1.3.8
```

Install:

```bash
cd /home/rongye/ProgramFiles/ViGent2/backend
pip install opencv-python-headless gfpgan
```

> `gfpgan` pulls in `basicsr`, `facexlib`, and other dependencies automatically.
> `onnxruntime` must be confirmed separately (the LatentSync environment already has 1.23.2).
> If the backend virtualenv lacks onnxruntime, install it: `pip install onnxruntime`

### 2. System dependencies

- `ffmpeg` / `ffprobe`: already present (required for video processing)

---

## Model weights

### Directory layout

```
models/
├── FaceEnhance/
│   └── GFPGANv1.4.pth          ← upscaler weight (~333MB)
└── LatentSync/checkpoints/auxiliary/
    └── models/buffalo_l/
        └── det_10g.onnx        ← face-detector weight (~16MB, reused)
```

### How to download

**GFPGAN weight** (already downloaded):

```bash
cd /home/rongye/ProgramFiles/ViGent2/models/FaceEnhance
wget -O GFPGANv1.4.pth "https://github.com/TencentARC/GFPGAN/releases/download/v1.3.4/GFPGANv1.4.pth"
```

**SCRFD detector weight**:

Reuses LatentSync's existing `det_10g.onnx`; no extra download. The code resolves the path automatically:
`models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx`

> A missing weight triggers fail-open: enhancement is skipped and the task does not fail.

---

## Backend configuration

Relevant variables in `backend/.env`:

```ini
# =============== small-face lip-sync compensation chain ===============
LIPSYNC_SMALL_FACE_ENHANCE=false        # master switch (true/false)
LIPSYNC_SMALL_FACE_THRESHOLD=256        # trigger threshold (pixels; face width < this triggers)
LIPSYNC_SMALL_FACE_UPSCALER=gfpgan      # upscaler model: gfpgan | codeformer
LIPSYNC_SMALL_FACE_GPU_ID=0             # upscaler GPU (same card as MuseTalk)
LIPSYNC_SMALL_FACE_FAIL_OPEN=true       # fail-open (true=fall back to normal flow, false=raise)
```

Defaults in `backend/app/core/config.py`:

```python
LIPSYNC_SMALL_FACE_ENHANCE: bool = False
LIPSYNC_SMALL_FACE_THRESHOLD: int = 256
LIPSYNC_SMALL_FACE_UPSCALER: str = "codeformer"
LIPSYNC_SMALL_FACE_GPU_ID: int = 0
LIPSYNC_SMALL_FACE_FAIL_OPEN: bool = True
```

> `.env` takes precedence over the `config.py` defaults; the latter apply only when `.env` leaves a variable unset.

### Internal module constants

The following parameters are fixed code constants (`small_face_enhance_service.py`) and are not exposed via env yet:

| Constant | Value | Notes |
|------|-----|------|
| `PADDING` | 0.28 | bbox expansion ratio |
| `DETECT_EVERY` | 8 | detect every N frames, EMA interpolation in between |
| `TARGET_SIZE` | 512 | super-resolution target size |
| `MASK_FEATHER` | 15 | lower-face mask feathering in pixels |
| `MASK_UPPER_RATIO` | 0.68 | mouth-mask start (68% of crop height; covers only mouth/chin) |
| `MASK_SIDE_MARGIN` | 0.16 | side margin ratio, keeps cheeks/nose wings untouched |
| `SAMPLE_FRAMES` | 24 | frames sampled for the small-face decision |
| `SAMPLE_WINDOW` | (0.10, 0.30) | sampling window (10%~30% of the video) |
| `ENCODE_FPS` | 25 | fallback fps for intermediate encodes (source fps preferred; 25 only when unavailable) |
| `ENCODE_CRF` | 18 | intermediate encode quality |
| `EMA_ALPHA` | 0.3 | bbox EMA smoothing factor |

---

## Enabling and verifying

### 1. Turn on the small-face compensation chain

```bash
# edit backend/.env
LIPSYNC_SMALL_FACE_ENHANCE=true
```

Restart the backend:

```bash
pm2 restart vigent2-backend
```

### 2. Force-trigger test

Set an extreme threshold so every video triggers enhancement:

```ini
LIPSYNC_SMALL_FACE_THRESHOLD=9999
```

> Smoke-testing only, not for quality evaluation. `9999` forces large-face footage into the enhancement branch and may produce mid-face distortion or nose-wing artifacts.

Submit a video task and check the logs:

```bash
pm2 logs vigent2-backend --lines 50
```

You should see output similar to:

```
小脸增强: face_w=320px < threshold=9999px, 触发增强
✅ SCRFD 检测器已加载
✅ 超分器已加载: gfpgan
小脸增强: face_w=320px threshold=9999px enhanced=True upscaler=gfpgan time=12.3s
✅ 小脸增强 + 唇形同步完成: /path/to/output.mp4
```

### 3. Restore the normal threshold

Once verified, set the threshold back:

```ini
LIPSYNC_SMALL_FACE_THRESHOLD=256
```

and restart the backend: `pm2 restart vigent2-backend`.

### 4. Health check

```bash
curl http://localhost:8006/api/videos/lipsync/health | python3 -m json.tool
```

The response should contain `data.small_face_enhance`:

```json
{
  "success": true,
  "data": {
    "small_face_enhance": {
      "enabled": true,
      "threshold": 256,
      "detector_loaded": true
    }
  }
}
```

---

## Related files

| File | Notes |
|------|------|
| `backend/app/services/small_face_enhance_service.py` | main small-face service (detect + crop + upscale + blend back) |
| `backend/app/services/lipsync_service.py` | hybrid routing + small-face integration + `_run_selected_model()` |
| `backend/app/core/config.py` | `LIPSYNC_SMALL_FACE_*` options |
| `models/FaceEnhance/GFPGANv1.4.pth` | GFPGAN upscaler weight |
| `models/LatentSync/checkpoints/auxiliary/models/buffalo_l/det_10g.onnx` | SCRFD detector weight (reused) |
| `Temp/小脸增强分支-实施计划.md` | detailed design doc |

---

## Processing flow in detail

### 1. Detection stage (CPU)

- Uniformly sample 24 frames from the 10%~30% window
- SCRFD (det_10g.onnx) detects the largest face per frame; take the median face width
- Trigger enhancement when `face width < THRESHOLD`
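The trigger decision above can be sketched as follows. This is a minimal illustration under stated assumptions: `should_enhance` is a hypothetical helper, and the real logic lives in `small_face_enhance_service.py`.

```python
from statistics import median

def should_enhance(face_widths_px: list, threshold: int = 256) -> bool:
    """Decide whether the small-face path triggers.

    face_widths_px holds the widest detected face per sampled frame
    (24 frames from the 10%-30% window). Enhancement triggers when the
    median face width falls below the threshold."""
    if not face_widths_px:
        return False  # no face detected: leave the normal path untouched
    return median(face_widths_px) < threshold
```

Using the median of 24 samples keeps a few outlier frames (cuts, occlusions) from flipping the decision.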

### 2. Crop + track (CPU)

- Detect the face bbox every 8 frames; EMA-interpolate smoothly for the frames in between
- Expand the bbox by 0.28 padding, clamped to the frame bounds
- Write the actually-read frame count back to `track.frame_count`, correcting ffprobe estimation drift
- Stream-crop via ffmpeg pipe, outputting a 512x512 video
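The sparse-detection + EMA smoothing step can be sketched as follows (an illustrative helper, not the project's code; boxes are `(x1, y1, x2, y2)` and `None` marks a non-detection frame):

```python
def smooth_track(raw_boxes, alpha=0.3):
    """EMA-smooth a sparse bbox track: detection frames update the EMA,
    non-detection frames forward-fill the last smoothed box."""
    smoothed, ema = [], None
    for box in raw_boxes:
        if box is None:
            smoothed.append(ema)  # forward-fill between detections
            continue
        if ema is None:
            ema = box  # first detection seeds the EMA
        else:
            ema = tuple(alpha * b + (1 - alpha) * e for b, e in zip(box, ema))
        smoothed.append(ema)
    return smoothed
```

With `alpha=0.3` (the `EMA_ALPHA` constant), each new detection moves the box only partway, which damps detector jitter between keyframes.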

### 3. Super-resolution (GPU0)

- Detection frames (every 8th): full GFPGAN super-resolution
- Other frames: bicubic resize to 512x512
- The enhanced video's output fps follows the source fps (no longer hard-coded 25fps), avoiding time-base stretching
- `torch.cuda.empty_cache()` after inference
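The sparse cadence above amounts to a simple per-frame plan; a sketch (hypothetical helper, mirroring the `DETECT_EVERY=8` constant):

```python
def plan_upscale(frame_count: int, detect_every: int = 8):
    """Per-frame upscale plan: GPU super-resolution on detection frames
    (every Nth), cheap bicubic resize everywhere else."""
    return ["gfpgan" if i % detect_every == 0 else "bicubic"
            for i in range(frame_count)]
```

Only 1 in 8 frames pays the GFPGAN cost, which keeps the GPU0 budget small while the feathered blend hides the quality difference between neighbors.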

### 4. Lipsync inference

- The user-selected model (fast/default/advanced) runs on the enhanced face video
- Model-selection semantics unchanged

### 5. Blend back (CPU)

- Local mouth mask (starting at 68% height + 16% side margins) + 15px Gaussian feathering (covers only mouth/chin)
- `cv2.seamlessClone(NORMAL_CLONE)` back onto the original frame
- A second mask-bounded alpha pass over the seamlessClone result keeps the blend from diffusing above the eyes
- Fallback to alpha blending if seamlessClone fails
- Blend back by timeline-mapped original frame indices (`orig_fps/ls_fps`), avoiding slowed motion/ghosting from using only the leading frames
- Frame-count protection: lipsync output follows the audio duration, so its frame count is usually <= the looped original; raise only when `lipsync frames > original frames`, blend normally on `<=`
- Empty-output protection: `lipsync frames <= 0` raises immediately; the outer `FAIL_OPEN` falls back to the normal flow, preventing an empty video
- Audio mux: after blending, force-remux the track from `audio_path` so the enhanced path never outputs silent video
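The timeline mapping in the blend-back step can be sketched as follows (illustrative; `map_original_index` is a hypothetical helper for the `orig_fps/ls_fps` mapping described above):

```python
def map_original_index(ls_index: int, orig_fps: float, ls_fps: float,
                       orig_frames: int) -> int:
    """Map a lipsync-output frame index to the time-corresponding frame
    of the original video, clamped to the available frame range."""
    idx = round(ls_index * orig_fps / ls_fps)
    return min(idx, orig_frames - 1)
```

Blending frame `i` of the lipsync output onto original frame `map_original_index(i, ...)` instead of original frame `i` is what removes the slow-motion/ghosting artifact when the two streams have different fps.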

---

## Rollback plan

**Level 1 (seconds)**:

```ini
LIPSYNC_SMALL_FACE_ENHANCE=false
```

Restart the backend; all tasks take the original flow.

**Level 2 (version-level)**:

Revert the `lipsync_service.py` integration commit; the config options remain but have no effect.

---

## FAQ

### onnxruntime not installed

```
⚠️ SCRFD 初始化失败: No module named 'onnxruntime'
```

**Fix**:

```bash
pip install onnxruntime
```

### Missing GFPGAN weight

```
⚠️ GFPGAN 权重不存在: .../models/FaceEnhance/GFPGANv1.4.pth
```

**Fix**: download it as described in "Model weights" above. With the weight missing, super-resolution degrades automatically to bicubic resize.

### Frame-count anomaly causing fail-open

```
⚠️ 小脸贴回失败,回退原流程: 帧数异常: lipsync=300 > original=250
```

**Explanation**: v1.1 relaxed the frame-count check. Lipsync models output frames according to the audio duration, usually <= the looped video's frame count, in which case blend-back proceeds normally. An error is raised only when the lipsync output has **more** frames than the original (an anomaly).

### Empty lipsync output causing fallback

```
⚠️ 小脸贴回失败,回退原流程: lipsync 输出帧数为 0,跳过贴回
```

**Explanation**: v1.2 added empty-output protection. `ls_frames <= 0` raises immediately, and the outer fail-open falls back to the regular lip-sync path instead of writing an empty video file.

### Slowed motion / eye ghosting after enhancement

**Cause**: when the source video and the lipsync output have different fps, blending back by identical frame numbers misaligns the timeline (only the leading frames get blended).

**Fix**: v1.3 maps frame indices along the timeline by `orig_fps/ls_fps`, so blend-back uses time-corresponding frames rather than same-index frames, and the enhanced video's output fps follows the source fps.

**Further fixes (v1.4)**:
- The mask start moved further down to 68% with 16% side margins, keeping the eye area and nose wings out of the blend
- A mask-bounded pass over the seamlessClone output prevents Poisson diffusion from ghosting above the eyes

### Facial distortion after enhancement (nose wings / mid-face)

**Most likely cause**: the test threshold `LIPSYNC_SMALL_FACE_THRESHOLD=9999` forced large-face footage that needed no enhancement into the compensation chain.

**Suggested handling**:
- Restore `LIPSYNC_SMALL_FACE_THRESHOLD=256` and restart the backend.
- If the anomaly persists, temporarily set `LIPSYNC_SMALL_FACE_ENHANCE=false` for an A/B comparison before tuning further.

### No audio after enhancement

**Cause**: the rawvideo write in the blend-back stage carries no audio track by default.

**Fix**: v1.3 force-muxes the audio after blend-back, writing the current task's `audio_path` back into the video.

> v1.0 used a strict consistency check (`lipsync != original` fails), which misfired when the looped video had far more frames than the audio. Fixed in v1.1.

### Lip offset after enhancement

Check whether the `PADDING` constant is reasonable. Too little padding crops the region too tightly; too much brings in excess background. The current default 0.28 (28%) suits most single-person selfie footage.

### torchvision compatibility (functional_tensor)

```
No module named 'torchvision.transforms.functional_tensor'
```

**Cause**: torchvision >= 0.20 removed the `functional_tensor` module, but `basicsr` (a gfpgan dependency) still imports it.

**Fix**: the code ships a compatibility shim (`_ensure_upscaler()` injects it into `sys.modules` automatically); no manual action needed. If the error still appears, check that `_ensure_upscaler` actually ran.
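The shim amounts to registering an alias module before basicsr imports it. A minimal sketch (illustrative; the real shim lives in `_ensure_upscaler()` and may differ in detail):

```python
import sys
import types

def ensure_functional_tensor_shim() -> None:
    """Alias `torchvision.transforms.functional_tensor` to
    `torchvision.transforms.functional`, keeping basicsr's legacy import
    path working on torchvision >= 0.20."""
    name = "torchvision.transforms.functional_tensor"
    if name in sys.modules:
        return  # already present (old torchvision, or shim applied before)
    try:
        from torchvision.transforms import functional as functional_mod
    except ImportError:
        return  # torchvision absent: nothing to patch
    shim = types.ModuleType(name)
    shim.__dict__.update(functional_mod.__dict__)  # re-export everything
    sys.modules[name] = shim
```

Because `sys.modules` is consulted before any import machinery runs, later `import torchvision.transforms.functional_tensor` statements resolve to the shim.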

### cv2/numpy not installed

```
⚠️ cv2 未安装,小脸增强不可用
```

**Explanation**: `cv2` and `numpy` are lazy imports (`try/except`); when missing, small-face enhancement disables itself without affecting backend startup or other features. Installing `opencv-python-headless` restores it.

---

## Known limitations (v1.4)

- Covers only the local lipsync path (`_local_generate()`); remote mode (`_remote_generate()`) is not yet integrated
- Multi-shot footage: global decision only, no per-segment small-face detection
- Only the single (main) face case is guaranteed stable; no multi-face switching
- CodeFormer super-resolution needs an extra `basicsr` install; GFPGAN is the current recommendation

---

## v1.3 → v1.4 changes

| Fix | Notes |
|--------|------|
| Eye ghosting | mask start lowered to 68% + 16% side margins, keeping the upper face and nose wings out of the blend |
| Poisson-diffusion suppression | mask-bounded second pass after seamlessClone, preventing ghosting above the eyes |

---

## v1.2 → v1.3 changes

| Fix | Notes |
|--------|------|
| Time base | `_crop_and_upscale_video()` output fps follows the source fps, avoiding timeline stretching of the enhanced video |
| Blend-back alignment | `blend_back()` maps original frame indices by `orig_fps/ls_fps`, reducing slowed motion and ghosting |
| Audio | audio mux after successful blend-back, so the enhanced path is never silent |

---

## v1.1 → v1.2 changes

| Fix | Notes |
|--------|------|
| Empty-output protection | `blend_back()` checks `ls_frames <= 0`, raising so the outer fail-open falls back instead of writing an empty video |

---

## v1.0 → v1.1 changes

| Fix | Notes |
|--------|------|
| ffprobe parsing | CSV → JSON with field-name access, immune to column drift when `nb_frames` is missing |
| fps selection | prefer `avg_frame_rate` (true average), `r_frame_rate` as fallback; avoids frame-count overestimation from timebase multiples like `60/1` |
| Actual frame count | `_build_face_track()` overrides the estimate with the frames ffmpeg actually read, making `track.frame_count` accurate |
| Blend-back check | relaxed to blend normally on `lipsync <= original`, raising only on `>`; matches MuseTalk/LatentSync output driven by audio duration |
| Boundary guards | return None when `streams` is empty; fall back to 25fps when the `r_frame_rate` denominator is 0 |
| torchvision compat | `_ensure_upscaler()` injects a `functional_tensor` shim for torchvision >= 0.20 |
| lazy import | `cv2`/`numpy` wrapped in `try/except`; enhancement self-disables without affecting backend startup |
| Type annotations | `from __future__ import annotations` prevents NameError from `np.ndarray` annotations when dependencies are missing |
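The ffprobe fixes above (JSON field access, `avg_frame_rate` preference, boundary guards) can be sketched as follows. This is a hypothetical `parse_video_info` operating on ffprobe's `-print_format json` output, not the project's `_get_video_info()` itself:

```python
import json
from fractions import Fraction

def parse_video_info(ffprobe_json: str):
    """Parse ffprobe JSON output: named field access (no CSV column
    drift), avg_frame_rate preferred over r_frame_rate, 25 fps fallback,
    and None when no stream is present."""
    data = json.loads(ffprobe_json)
    streams = data.get("streams") or []
    if not streams:
        return None  # boundary guard: no video stream
    s = streams[0]
    fps = 25.0  # fallback when both rate fields are unusable
    for key in ("avg_frame_rate", "r_frame_rate"):
        try:
            frac = Fraction(s.get(key, ""))
            if frac > 0:
                fps = float(frac)
                break
        except (ValueError, ZeroDivisionError):
            continue  # e.g. "0/0" or missing field
    nb_frames = int(s["nb_frames"]) if "nb_frames" in s else None
    return {"fps": fps, "nb_frames": nb_frames}
```

Field access by name means a missing `nb_frames` simply yields `None` instead of shifting every subsequent CSV column.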
|
||||
@@ -1,8 +1,8 @@

# ViGent2 development task log

**Project**: ViGent2 digital-human talking-head video generation system
**Progress**: 100% (Day 35 - small-face lip-sync compensation landed + deployment verification)
**Updated**: 2026-03-10

---

@@ -10,7 +10,25 @@

> Core daily development content and milestones are recorded here.

### Day 35: Small-face lip-sync compensation landed + deployment verification + stability patches (Current)
- [x] **Small-face compensation landed**: added `small_face_enhance_service.py`, implementing the full chain: SCRFD small-face detection (10%-30% sampling) -> crop track (detect every 8 frames + EMA) -> sparse keyframe super-resolution (GFPGAN/CodeFormer) -> lower-face blend-back (seamlessClone/alpha fallback).
- [x] **Backend integration done**: `lipsync_service.py` inserts enhancement after looping inside `_local_generate()`, extracts `_run_selected_model()` for unified model routing, and falls back automatically per `FAIL_OPEN` when enhancement fails.
- [x] **Config and dependencies**: 5 new `LIPSYNC_SMALL_FACE_*` options; `requirements.txt` adds `opencv-python-headless` and `gfpgan`; new `models/FaceEnhance/GFPGANv1.4.pth` weight directory.
- [x] **Deployment doc added**: new `Docs/FACEENHANCE_DEPLOY.md` covering deployment, weights, switches, verification, and rollback.
- [x] **Production stability fixes**:
  - `small_face_enhance_service.py` adds `cv2/numpy` lazy-load guards; missing dependencies skip enhancement without affecting the main flow.
  - Added `from __future__ import annotations` so `np.ndarray` annotations no longer fail at import time when dependencies are missing.
  - Added a `torchvision.transforms.functional_tensor` shim, fixing GFPGAN initialization under `torchvision>=0.20`.
  - `_get_video_info()` switched to JSON field parsing preferring `avg_frame_rate`, fixing frame-count drift from a missing `nb_frames`.
  - `_build_face_track()` writes back the actually-read frame count; `blend_back()` relaxed to blend on `lipsync <= original`, raising only on `>`.
  - `blend_back()` adds `ls_frames <= 0` empty-output protection; on error `FAIL_OPEN` falls back to the regular path, preventing an empty video.
  - Time-base fix: enhanced-video fps follows the source fps; blend-back maps original frame indices by `orig_fps/ls_fps`, fixing slowed motion and ghosting.
  - Audio fix: mux step after successful blend-back, ensuring the small-face path outputs video with sound.
  - Eye-ghosting fix: mask start lowered to 68% with 16% side margins, plus a mask-bounded second blend over seamlessClone results to reduce ghosting above the eyes.
  - Operational policy: `LIPSYNC_SMALL_FACE_THRESHOLD=9999` is smoke-test only; quality verification and daily operation return to `256`.
- [x] **Deployment verified**: `GET /api/videos/lipsync/health` now returns `data.small_face_enhance`; default `enabled=false`, and with the switch off behavior matches the original flow.

### Day 34: Multi-shot timeline refactor + copywriting deep-learning modal mis-close protection + code-review fixes
- [x] **Timeline model refactor**: multi-asset upgraded from "equal sequential segments" to "main asset playing continuously + inserted shot blocks", supporting free insertion and drag-to-move.
- [x] **Frontend chain landed**: rewrote `useTimelineEditor` and `TimelineEditor`, added main-asset/insert-candidate semantics, adapted `useHomeController` / `HomePage` / `MaterialSelector` end-to-end.
- [x] **Backend generation chain adapted**: `workflow.py` fixed the `material_paths` source, added new `custom_assignments` validation, deduplicated asset downloads and limited segment-processing concurrency, keeping single-asset compatibility.

10 README.md
@@ -16,7 +16,8 @@

## ✨ Features

### Core capabilities
- 🎬 **HD lip sync** - hybrid scheme: short videos (this repo's current `.env` threshold 100s, configurable) use LatentSync 1.6 (high-quality latent diffusion), long videos use MuseTalk 1.5 (real-time-class single-step inference), with automatic routing + fallback. Frontend model choices: default (threshold auto-routing) / fast (speed first) / advanced (quality first).
- 🧠 **Small-face lip-sync quality compensation (optional)** - the local lip-sync path supports a small-face detect + crop + sparse keyframe super-resolution + lower-face blend-back chain; off by default (`LIPSYNC_SMALL_FACE_ENHANCE=false`), with automatic fail-open fallback to the original flow.
- 🎙️ **Multimodal voiceover** - supports **EdgeTTS** (Microsoft ultra-natural voices, 10 languages) and **CosyVoice 3.0** (3-second fast voice cloning, 9 languages + 18 dialects, adjustable speed/tone). Uploaded reference audio gets automatic Whisper transcription + smart trimming. Voiceover-first workflow: generate voiceover → pick assets → generate video.
- 📝 **Smart subtitles** - faster-whisper + Remotion automatically generate word-level highlighted (karaoke-style) subtitles.
- 🎨 **Style presets** - 12 title + 8 subtitle style presets with preview, font-size adjustment, and a custom font library. Native CSS stroke rendering, crisp and ghost-free.
||||
@@ -67,9 +68,10 @@

- **[Deployment manual (DEPLOY_MANUAL.md)](Docs/DEPLOY_MANUAL.md)** - 👈 **start here for deployment**! Full environment setup steps.
- [Multi-platform publishing (PUBLISH_DEPLOY.md)](Docs/PUBLISH_DEPLOY.md) - Douyin/WeChat Channels/Bilibili/Xiaohongshu login and automated publishing.
- [Reference-audio service deployment (COSYVOICE3_DEPLOY.md)](Docs/COSYVOICE3_DEPLOY.md) - voice-cloning model deployment guide.
- [LatentSync deployment guide (LATENTSYNC_DEPLOY.md)](Docs/LATENTSYNC_DEPLOY.md) - standalone lip-sync model deployment.
- [MuseTalk deployment guide (MUSETALK_DEPLOY.md)](Docs/MUSETALK_DEPLOY.md) - long-video lip-sync model deployment.
- [Small-face compensation deployment guide (FACEENHANCE_DEPLOY.md)](Docs/FACEENHANCE_DEPLOY.md) - small-face lip-sync compensation chain deployment and verification.
- [Supabase deployment guide (SUPABASE_DEPLOY.md)](Docs/SUPABASE_DEPLOY.md) - Supabase and auth-system configuration.
- [Alipay deployment guide (ALIPAY_DEPLOY.md)](Docs/ALIPAY_DEPLOY.md) - Alipay paid-membership configuration.

### Developer docs

||||
@@ -83,6 +83,13 @@ MUSETALK_ENCODE_PRESET=slow

# Audio duration >= this threshold (seconds) uses MuseTalk, < uses LatentSync
LIPSYNC_DURATION_THRESHOLD=100

# =============== small-face lip-sync compensation chain ===============
LIPSYNC_SMALL_FACE_ENHANCE=true
LIPSYNC_SMALL_FACE_THRESHOLD=256
LIPSYNC_SMALL_FACE_UPSCALER=gfpgan
LIPSYNC_SMALL_FACE_GPU_ID=0
LIPSYNC_SMALL_FACE_FAIL_OPEN=true

# =============== upload config ===============
# Max upload file size (MB)
MAX_UPLOAD_SIZE_MB=500

@@ -37,22 +37,22 @@ class Settings(BaseSettings):

DOUYIN_BROWSER_CHANNEL: str = ""
DOUYIN_FORCE_SWIFTSHADER: bool = True

# Douyin debug screen recording
DOUYIN_DEBUG_ARTIFACTS: bool = False
DOUYIN_RECORD_VIDEO: bool = False
DOUYIN_KEEP_SUCCESS_VIDEO: bool = False
DOUYIN_RECORD_VIDEO_WIDTH: int = 1280
DOUYIN_RECORD_VIDEO_HEIGHT: int = 720

# Xiaohongshu Playwright config
XIAOHONGSHU_HEADLESS_MODE: str = "headless-new"
XIAOHONGSHU_USER_AGENT: str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36"
XIAOHONGSHU_LOCALE: str = "zh-CN"
XIAOHONGSHU_TIMEZONE_ID: str = "Asia/Shanghai"
XIAOHONGSHU_CHROME_PATH: str = "/usr/bin/google-chrome"
XIAOHONGSHU_BROWSER_CHANNEL: str = ""
XIAOHONGSHU_FORCE_SWIFTSHADER: bool = True
XIAOHONGSHU_DEBUG_ARTIFACTS: bool = False

# TTS config
DEFAULT_TTS_VOICE: str = "zh-CN-YunxiNeural"

@@ -78,6 +78,13 @@ class Settings(BaseSettings):

# Hybrid lip-sync routing
LIPSYNC_DURATION_THRESHOLD: float = 120.0  # seconds; >= this value uses MuseTalk

# Small-face lip-sync compensation chain
LIPSYNC_SMALL_FACE_ENHANCE: bool = False
LIPSYNC_SMALL_FACE_THRESHOLD: int = 256
LIPSYNC_SMALL_FACE_UPSCALER: str = "codeformer"
LIPSYNC_SMALL_FACE_GPU_ID: int = 0
LIPSYNC_SMALL_FACE_FAIL_OPEN: bool = True

# Supabase config
SUPABASE_URL: str = ""
SUPABASE_PUBLIC_URL: str = ""  # public address, used to generate frontend-accessible URLs

@@ -11,12 +11,13 @@ import asyncio

import httpx
from pathlib import Path
from loguru import logger
from typing import Optional, Literal

from app.core.config import settings
from app.services.small_face_enhance_service import SmallFaceEnhanceService


class LipSyncService:
"""Lip-sync service - LatentSync 1.6 + MuseTalk 1.5 hybrid"""

def __init__(self):
@@ -38,6 +39,9 @@ class LipSyncService:

# runtime detection
self._weights_available: Optional[bool] = None

# small-face enhancement
self._face_enhance = SmallFaceEnhanceService()

def _check_weights(self) -> bool:
"""Check whether the model weights exist"""
@@ -93,7 +97,7 @@ class LipSyncService:

logger.warning(f"⚠️ 获取媒体时长失败: {e}")
return None

def _loop_video_to_duration(self, video_path: str, output_path: str, target_duration: float) -> str:
"""
Loop the video to match the target duration
Uses FFmpeg stream_loop for seamless looping
@@ -117,47 +121,70 @@ class LipSyncService:

else:
logger.warning(f"⚠️ 视频循环失败: {result.stderr[:200]}")
return video_path
except Exception as e:
logger.warning(f"⚠️ 视频循环异常: {e}")
return video_path

def _mux_audio_to_video(self, video_path: str, audio_path: str, output_path: str) -> bool:
"""Mux the audio track into the video so the enhanced path never outputs silent video."""
try:
cmd = [
"ffmpeg", "-y",
"-i", video_path,
"-i", audio_path,
"-map", "0:v:0",
"-map", "1:a:0",
"-c:v", "copy",
"-c:a", "aac",
"-shortest",
output_path,
]
result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
if result.returncode == 0 and Path(output_path).exists():
return True
logger.warning(f"⚠️ 音轨封装失败: {result.stderr[:200]}")
return False
except Exception as e:
logger.warning(f"⚠️ 音轨封装异常: {e}")
return False
async def generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int = 25,
model_mode: Literal["default", "fast", "advanced"] = "default",
) -> str:
"""Generate the lip-synced video"""
logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)

normalized_mode: Literal["default", "fast", "advanced"] = model_mode
if normalized_mode not in ("default", "fast", "advanced"):
normalized_mode = "default"
logger.info(f"🧠 Lipsync 模式: {normalized_mode}")

if self.use_local:
return await self._local_generate(video_path, audio_path, output_path, fps, normalized_mode)
else:
return await self._remote_generate(video_path, audio_path, output_path, fps, normalized_mode)
||||
|
||||
    async def _local_generate(
        self,
        video_path: str,
        audio_path: str,
        output_path: str,
        fps: int,
        model_mode: Literal["default", "fast", "advanced"],
    ) -> str:
        """Invoke the LatentSync conda environment via subprocess."""

        logger.info("⏳ 等待 GPU 资源 (排队中)...")
        async with self._lock:
            # Use a temporary directory for intermediate files
            with tempfile.TemporaryDirectory() as tmpdir:
                tmpdir = Path(tmpdir)

                # Get audio and video durations
                audio_duration = self._get_media_duration(audio_path)
@@ -172,133 +199,206 @@ class LipSyncService:
                        str(looped_video),
                        audio_duration
                    )
                else:
                    actual_video_path = video_path
                # ── Small-face enhancement ──
                enhance_result = None
                try:
                    enhance_result = self._face_enhance.enhance_if_needed(
                        video_path=str(actual_video_path),
                        tmpdir=tmpdir,
                        gpu_id=settings.LIPSYNC_SMALL_FACE_GPU_ID,
                    )
                except Exception as e:
                    if settings.LIPSYNC_SMALL_FACE_FAIL_OPEN:
                        logger.warning(f"⚠️ 小脸增强失败,跳过: {e}")
                    else:
                        raise

                if enhance_result and enhance_result.was_enhanced:
                    track = enhance_result.track
                    if track is None:
                        raise RuntimeError("小脸增强轨迹缺失")

                    # Enhanced path: run lipsync on the enhanced face video, then blend back onto the original
                    temp_sync = tmpdir / "face_sync.mp4"
                    await self._run_selected_model(
                        video_path=enhance_result.video_path,
                        audio_path=audio_path,
                        output_path=str(temp_sync),
                        tmpdir=tmpdir,
                        model_mode=model_mode,
                        audio_duration=audio_duration,
                        original_video_path=video_path,
                    )

                    try:
                        blended = self._face_enhance.blend_back(
                            original_video=str(actual_video_path),
                            lipsync_video=str(temp_sync),
                            track=track,
                            tmpdir=tmpdir,
                        )
                        blended_with_audio = tmpdir / "blended_with_audio.mp4"
                        if not self._mux_audio_to_video(
                            video_path=str(blended),
                            audio_path=audio_path,
                            output_path=str(blended_with_audio),
                        ):
                            raise RuntimeError("贴回视频音轨封装失败")

                        shutil.copy(str(blended_with_audio), output_path)
                        logger.info(f"✅ 小脸增强 + 唇形同步完成: {output_path}")
                        return output_path
                    except Exception as e:
                        if settings.LIPSYNC_SMALL_FACE_FAIL_OPEN:
                            logger.warning(f"⚠️ 小脸贴回失败,回退原流程: {e}")
                        else:
                            raise

                # Regular path (not enhanced, or enhancement failed)
                return await self._run_selected_model(
                    video_path=str(actual_video_path),
                    audio_path=audio_path,
                    output_path=output_path,
                    tmpdir=tmpdir,
                    model_mode=model_mode,
                    audio_duration=audio_duration,
                    original_video_path=video_path,
                )

    async def _run_selected_model(
        self,
        video_path: str,
        audio_path: str,
        output_path: str,
        tmpdir: Path,
        model_mode: Literal["default", "fast", "advanced"],
        audio_duration: Optional[float],
        original_video_path: str,
    ) -> str:
        """Model routing + execution (MuseTalk / LatentSync persistent server / LatentSync subprocess)."""

        # Model routing
        force_musetalk = model_mode == "fast"
        force_latentsync = model_mode == "advanced"
        auto_to_musetalk = (
            model_mode == "default"
            and audio_duration is not None
            and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD
        )

        if force_musetalk:
            logger.info("⚡ 强制快速模型:MuseTalk")
            musetalk_result = await self._call_musetalk_server(
                video_path, audio_path, output_path
            )
            if musetalk_result:
                return musetalk_result
            logger.warning("⚠️ MuseTalk 不可用,快速模型回退到 LatentSync")
        elif auto_to_musetalk:
            logger.info(
                f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s,路由到 MuseTalk"
            )
            musetalk_result = await self._call_musetalk_server(
                video_path, audio_path, output_path
            )
            if musetalk_result:
                return musetalk_result
            logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync(长视频,会较慢)")
        elif force_latentsync:
            logger.info("🎯 强制高级模型:LatentSync")

        # Check LatentSync prerequisites
        if not self._check_conda_env():
            logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
            shutil.copy(original_video_path, output_path)
            return output_path

        if not self._check_weights():
            logger.warning("⚠️ 模型权重不存在,使用 Fallback")
            shutil.copy(original_video_path, output_path)
            return output_path

        if self.use_server:
            # Mode A: call the persistent server (accelerated mode)
            return await self._call_persistent_server(video_path, audio_path, output_path)

        logger.info("🔄 调用 LatentSync 推理 (subprocess)...")

        temp_output = tmpdir / "output.mp4"

        # Build the command
        cmd = [
            str(self.conda_python),
            "-m", "scripts.inference",
            "--unet_config_path", "configs/unet/stage2_512.yaml",
            "--inference_ckpt_path", "checkpoints/latentsync_unet.pt",
            "--inference_steps", str(settings.LATENTSYNC_INFERENCE_STEPS),
            "--guidance_scale", str(settings.LATENTSYNC_GUIDANCE_SCALE),
            "--video_path", str(video_path),
            "--audio_path", str(audio_path),
            "--video_out_path", str(temp_output),
            "--seed", str(settings.LATENTSYNC_SEED),
            "--temp_dir", str(tmpdir / "cache"),
        ]

        if settings.LATENTSYNC_ENABLE_DEEPCACHE:
            cmd.append("--enable_deepcache")

        # Environment variables
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)

        logger.info(f"🖥️ 执行命令: {' '.join(cmd[:8])}...")
        logger.info(f"🖥️ GPU: CUDA_VISIBLE_DEVICES={self.gpu_id}")

        try:
            # Use an asyncio subprocess so the event loop stays free
            # to serve other requests (e.g. progress polling)
            process = await asyncio.create_subprocess_exec(
                *cmd,
                cwd=str(self.latentsync_dir),
                env=env,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )

            # Wait for the process with a timeout
            try:
                stdout, stderr = await asyncio.wait_for(
                    process.communicate(),
                    timeout=900  # 15-minute timeout
                )
            except asyncio.TimeoutError:
                process.kill()
                await process.wait()
                logger.error("⏰ LatentSync 推理超时 (15分钟)")
                shutil.copy(original_video_path, output_path)
                return output_path

            stdout_text = stdout.decode() if stdout else ""
            stderr_text = stderr.decode() if stderr else ""

            if process.returncode != 0:
                logger.error(f"LatentSync 推理失败:\n{stderr_text}")
                logger.error(f"stdout:\n{stdout_text[-1000:] if stdout_text else 'N/A'}")
                # Fallback
                shutil.copy(original_video_path, output_path)
                return output_path

            logger.info(f"LatentSync 输出:\n{stdout_text[-500:] if stdout_text else 'N/A'}")

            # Check the output file
            if temp_output.exists():
                shutil.copy(temp_output, output_path)
                logger.info(f"✅ 唇形同步完成: {output_path}")
                return output_path
            else:
                logger.warning("⚠️ 未找到输出文件,使用 Fallback")
                shutil.copy(original_video_path, output_path)
                return output_path

        except Exception as e:
            logger.error(f"❌ 推理异常: {e}")
            shutil.copy(original_video_path, output_path)
            return output_path

    async def _call_musetalk_server(
        self, video_path: str, audio_path: str, output_path: str
@@ -413,18 +513,18 @@ class LipSyncService:
                "请确保 LatentSync 服务已启动 (cd models/LatentSync && python scripts/server.py)"
            )

    async def _remote_generate(
        self,
        video_path: str,
        audio_path: str,
        output_path: str,
        fps: int,
        model_mode: Literal["default", "fast", "advanced"],
    ) -> str:
        """Call the remote LatentSync API service."""
        if model_mode == "fast":
            logger.warning("⚠️ 远程模式未接入 MuseTalk,快速模型将使用远程 LatentSync")
        logger.info(f"📡 调用远程 API: {self.api_url}")

        try:
            async with httpx.AsyncClient(timeout=600.0) as client:
@@ -499,4 +599,9 @@ class LipSyncService:
            "ready": conda_ok and weights_ok and gpu_ok,
            "musetalk_ready": musetalk_ready,
            "lipsync_threshold": settings.LIPSYNC_DURATION_THRESHOLD,
            "small_face_enhance": {
                "enabled": settings.LIPSYNC_SMALL_FACE_ENHANCE,
                "threshold": settings.LIPSYNC_SMALL_FACE_THRESHOLD,
                "detector_loaded": self._face_enhance._detector_session is not None,
            },
        }

backend/app/services/small_face_enhance_service.py · 872 lines (Normal file)
@@ -0,0 +1,872 @@
"""
Small-face enhancement service.
For distant, small-face footage: crop + super-resolve -> lipsync inference -> blend back,
to raise input quality.

Single file, single class; consumed by LipSyncService.
"""
from __future__ import annotations

import subprocess
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Tuple, List

from loguru import logger

from app.core.config import settings

try:
    import cv2
    import numpy as np
    _CV2_AVAILABLE = True
except ImportError:
    _CV2_AVAILABLE = False

# ── Module constants ──
PADDING = 0.28                # bbox expansion ratio
DETECT_EVERY = 8              # detect every N frames
TARGET_SIZE = 512             # super-resolution target size
MASK_FEATHER = 15             # feather radius (pixels)
MASK_UPPER_RATIO = 0.68       # start of the mouth region (covers mouth/chin only)
MASK_SIDE_MARGIN = 0.16       # side margin ratio, avoids touching cheeks/nostrils
SAMPLE_FRAMES = 24            # number of sampled frames
SAMPLE_WINDOW = (0.10, 0.30)  # sampling window (10%~30%)
ENCODE_FPS = 25               # encoding frame rate
ENCODE_CRF = 18               # encoding quality
EMA_ALPHA = 0.3               # EMA smoothing coefficient

# Detection filters
MIN_FACE_WIDTH = 50
FACE_ASPECT_MIN = 0.2
FACE_ASPECT_MAX = 1.5
DET_SCORE_THRESH = 0.5
NMS_IOU_THRESH = 0.4

# Weight path
_PROJECT_ROOT = Path(__file__).resolve().parent.parent.parent.parent
DET_MODEL_PATH = (
    _PROJECT_ROOT
    / "models" / "LatentSync" / "checkpoints"
    / "auxiliary" / "models" / "buffalo_l" / "det_10g.onnx"
)


# ── Data structures ──

@dataclass
class FaceTrack:
    """Per-frame face tracking data (for cropping + blend-back)."""
    crop_boxes: List[Tuple[int, int, int, int]]  # per-frame (x1, y1, x2, y2)
    face_width_median: float
    frame_count: int
    frame_w: int
    frame_h: int


@dataclass
class EnhanceResult:
    """Return value of enhance_if_needed."""
    video_path: str
    was_enhanced: bool
    track: Optional[FaceTrack] = None
    face_width: float = 0.0


class SmallFaceEnhanceService:
    """Small-face enhancement service: detect → crop → super-resolve → (lipsync) → blend back."""

    def __init__(self):
        self._detector_session = None
        self._sr_model = None
        self._sr_type: Optional[str] = None

    # ================================================================
    # SCRFD face detection (det_10g.onnx, CPU inference)
    # ================================================================

    def _ensure_detector(self) -> bool:
        if self._detector_session is not None:
            return True
        if not DET_MODEL_PATH.exists():
            logger.warning(f"⚠️ SCRFD 权重不存在: {DET_MODEL_PATH}")
            return False
        try:
            import onnxruntime as ort
            self._detector_session = ort.InferenceSession(
                str(DET_MODEL_PATH),
                providers=["CPUExecutionProvider"],
            )
            logger.info("✅ SCRFD 检测器已加载")
            return True
        except Exception as e:
            logger.warning(f"⚠️ SCRFD 初始化失败: {e}")
            return False

    def _detect_faces(self, img_bgr: np.ndarray) -> List[Tuple[np.ndarray, float]]:
        """
        Detect faces with SCRFD.
        Returns: [(bbox_xyxy, score), ...] sorted by area, descending.
        """
        if self._detector_session is None:
            return []

        h, w = img_bgr.shape[:2]
        input_h, input_w = 640, 640

        # ── Preprocess ──
        ratio = min(input_h / h, input_w / w)
        new_h, new_w = int(h * ratio), int(w * ratio)
        resized = cv2.resize(img_bgr, (new_w, new_h))

        padded = np.full((input_h, input_w, 3), 127.5, dtype=np.float32)
        padded[:new_h, :new_w] = resized.astype(np.float32)

        # BGR → RGB → normalize
        blob = padded[:, :, ::-1].copy()
        blob = (blob - 127.5) / 128.0
        blob = blob.transpose(2, 0, 1)[np.newaxis].astype(np.float32)

        # ── Inference ──
        input_name = self._detector_session.get_inputs()[0].name
        outputs = self._detector_session.run(None, {input_name: blob})

        # det_10g outputs: [scores_s8, scores_s16, scores_s32,
        #                   bbox_s8, bbox_s16, bbox_s32,
        #                   kps_s8, kps_s16, kps_s32]
        strides = [8, 16, 32]
        all_bboxes = []
        all_scores = []

        for i, stride in enumerate(strides):
            scores = outputs[i].flatten()
            bboxes = outputs[i + 3].reshape(-1, 4)

            # Generate anchor centers
            feat_h = input_h // stride
            feat_w = input_w // stride
            anchors = []
            for y in range(feat_h):
                for x in range(feat_w):
                    cx, cy = x * stride, y * stride
                    anchors.append([cx, cy])
                    anchors.append([cx, cy])  # 2 anchors per cell
            anchors = np.array(anchors, dtype=np.float32)

            # Confidence filtering
            mask = scores > DET_SCORE_THRESH
            if not mask.any():
                continue

            f_scores = scores[mask]
            f_bboxes = bboxes[mask]
            f_anchors = anchors[mask]

            # Decode: distance * stride → xyxy
            decoded = np.empty_like(f_bboxes)
            decoded[:, 0] = f_anchors[:, 0] - f_bboxes[:, 0] * stride
            decoded[:, 1] = f_anchors[:, 1] - f_bboxes[:, 1] * stride
            decoded[:, 2] = f_anchors[:, 0] + f_bboxes[:, 2] * stride
            decoded[:, 3] = f_anchors[:, 1] + f_bboxes[:, 3] * stride

            # Scale back to original image coordinates
            decoded /= ratio

            all_bboxes.append(decoded)
            all_scores.append(f_scores)

        if not all_bboxes:
            return []

        bboxes_cat = np.concatenate(all_bboxes)
        scores_cat = np.concatenate(all_scores)

        # NMS
        keep = self._nms(bboxes_cat, scores_cat, NMS_IOU_THRESH)

        # Size + aspect-ratio filtering
        results = []
        for idx in keep:
            bbox = bboxes_cat[idx]
            score = float(scores_cat[idx])
            bw = bbox[2] - bbox[0]
            bh = bbox[3] - bbox[1]
            if bw < MIN_FACE_WIDTH or bh < MIN_FACE_WIDTH:
                continue
            aspect = bw / max(bh, 1)
            if aspect < FACE_ASPECT_MIN or aspect > FACE_ASPECT_MAX:
                continue
            results.append((bbox.copy(), score))

        results.sort(key=lambda x: (x[0][2] - x[0][0]) * (x[0][3] - x[0][1]), reverse=True)
        return results

    @staticmethod
    def _nms(bboxes: np.ndarray, scores: np.ndarray, threshold: float) -> List[int]:
        x1 = bboxes[:, 0]
        y1 = bboxes[:, 1]
        x2 = bboxes[:, 2]
        y2 = bboxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            if order.size == 1:
                break
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-6)
            inds = np.where(iou <= threshold)[0]
            order = order[inds + 1]
        return keep
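As a standalone sanity check of the greedy IoU suppression implemented by `_nms` above (box coordinates and scores below are made up for illustration), two heavily overlapping boxes collapse to the higher-scoring one while a distant box survives:

```python
import numpy as np

def nms(bboxes, scores, threshold):
    # Same greedy IoU suppression as SmallFaceEnhanceService._nms
    x1, y1, x2, y2 = bboxes[:, 0], bboxes[:, 1], bboxes[:, 2], bboxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-6)
        order = order[np.where(iou <= threshold)[0] + 1]
    return keep

boxes = np.array([[0, 0, 100, 100], [10, 10, 110, 110], [300, 300, 400, 400]], dtype=np.float32)
scores = np.array([0.9, 0.6, 0.8], dtype=np.float32)
print(nms(boxes, scores, 0.4))  # [0, 2] — box 1 overlaps box 0 above the IoU threshold
```

Boxes 0 and 1 have IoU ≈ 0.68 > 0.4, so the lower-scoring box 1 is suppressed.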

    # ================================================================
    # Video utilities
    # ================================================================

    @staticmethod
    def _get_video_info(video_path: str) -> Optional[Tuple[int, int, int, float]]:
        """Returns (width, height, frame_count, fps)."""
        try:
            import json as _json
            cmd = [
                "ffprobe", "-v", "error",
                "-select_streams", "v:0",
                "-show_entries", "stream=width,height,nb_frames,r_frame_rate,avg_frame_rate",
                "-of", "json",
                video_path,
            ]
            r = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
            if r.returncode != 0:
                return None
            info = _json.loads(r.stdout)
            streams = info.get("streams")
            if not streams:
                return None
            stream = streams[0]
            w, h = int(stream["width"]), int(stream["height"])
            # nb_frames may be "N/A" or missing
            nb_raw = stream.get("nb_frames", "N/A")
            nb = int(nb_raw) if nb_raw not in ("N/A", "") else 0

            def _parse_fps(s: str) -> float:
                if "/" in s:
                    num, den = s.split("/")
                    return float(num) / float(den) if float(den) != 0 else 0.0
                return float(s) if s else 0.0

            # Prefer avg_frame_rate (true average); r_frame_rate may be a timebase multiple
            avg_fps = _parse_fps(stream.get("avg_frame_rate", "0/0"))
            r_fps = _parse_fps(stream.get("r_frame_rate", "25/1"))
            fps = avg_fps if avg_fps > 0 else (r_fps if r_fps > 0 else 25.0)

            if nb == 0:
                cmd2 = [
                    "ffprobe", "-v", "error",
                    "-show_entries", "format=duration",
                    "-of", "default=noprint_wrappers=1:nokey=1",
                    video_path,
                ]
                r2 = subprocess.run(cmd2, capture_output=True, text=True, timeout=10)
                if r2.returncode == 0 and r2.stdout.strip():
                    nb = int(float(r2.stdout.strip()) * fps)
            return w, h, nb, fps
        except Exception as e:
            logger.warning(f"⚠️ 获取视频信息失败: {e}")
            return None
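ffprobe reports `r_frame_rate`/`avg_frame_rate` as fractions such as `30000/1001`; the inner `_parse_fps` helper above can be exercised in isolation (re-declared here purely for illustration):

```python
def parse_fps(s: str) -> float:
    # Mirrors the _parse_fps helper inside _get_video_info
    if "/" in s:
        num, den = s.split("/")
        return float(num) / float(den) if float(den) != 0 else 0.0
    return float(s) if s else 0.0

print(round(parse_fps("30000/1001"), 3))  # NTSC ≈ 29.97
print(parse_fps("25/1"))                  # 25.0
print(parse_fps("0/0"))                   # 0.0 → caller falls back to 25.0
```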

    @staticmethod
    def _open_video_reader(video_path: str, w: int, h: int,
                           seek_sec: float = 0, duration_sec: float = 0):
        """Open an ffmpeg rawvideo read pipe."""
        cmd = ["ffmpeg"]
        if seek_sec > 0:
            cmd += ["-ss", f"{seek_sec:.3f}"]
        cmd += ["-i", video_path]
        if duration_sec > 0:
            cmd += ["-t", f"{duration_sec:.3f}"]
        cmd += ["-f", "rawvideo", "-pix_fmt", "bgr24", "-v", "quiet", "-"]
        return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

    @staticmethod
    def _read_one_frame(proc, w: int, h: int) -> Optional[np.ndarray]:
        raw = proc.stdout.read(w * h * 3)
        if len(raw) < w * h * 3:
            return None
        return np.frombuffer(raw, dtype=np.uint8).reshape(h, w, 3).copy()

    @staticmethod
    def _open_video_writer(output_path: str, w: int, h: int,
                           fps: int = ENCODE_FPS, crf: int = ENCODE_CRF):
        """Open an ffmpeg rawvideo write pipe."""
        cmd = [
            "ffmpeg", "-y",
            "-f", "rawvideo", "-pix_fmt", "bgr24",
            "-s", f"{w}x{h}", "-r", str(fps), "-i", "-",
            "-c:v", "libx264", "-crf", str(crf),
            "-preset", "fast", "-pix_fmt", "yuv420p",
            output_path,
        ]
        return subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL)

    # ================================================================
    # Phase 2: face-size detection
    # ================================================================

    def _detect_face_size(self, video_path: str) -> Optional[float]:
        """
        Sample frames uniformly from the 10%~30% window of the video and
        return the median width of the largest face.
        Returns None if no face is found or the detector is unavailable.
        """
        if not self._ensure_detector():
            return None

        info = self._get_video_info(video_path)
        if info is None:
            return None
        w, h, nb_frames, fps = info
        if nb_frames < 1 or fps <= 0:
            return None

        # Compute the sampling window
        start_frame = int(nb_frames * SAMPLE_WINDOW[0])
        end_frame = int(nb_frames * SAMPLE_WINDOW[1])
        end_frame = max(end_frame, start_frame + 1)
        n_sample = min(SAMPLE_FRAMES, end_frame - start_frame)
        if n_sample <= 0:
            return None

        step = max(1, (end_frame - start_frame) // n_sample)
        sample_indices = set(range(start_frame, end_frame, step))

        # Seek ffmpeg to the sampling start
        seek_sec = start_frame / fps
        duration_sec = (end_frame - start_frame) / fps + 0.5  # headroom

        proc = self._open_video_reader(video_path, w, h, seek_sec, duration_sec)
        face_widths = []
        try:
            for local_idx in range(end_frame - start_frame + 1):
                frame = self._read_one_frame(proc, w, h)
                if frame is None:
                    break
                global_idx = start_frame + local_idx
                if global_idx not in sample_indices:
                    continue
                faces = self._detect_faces(frame)
                if faces:
                    bbox = faces[0][0]  # largest face
                    face_widths.append(float(bbox[2] - bbox[0]))
        finally:
            proc.stdout.close()
            proc.terminate()
            proc.wait()

        if not face_widths:
            return None

        face_widths.sort()
        mid = len(face_widths) // 2
        if len(face_widths) % 2 == 0:
            return (face_widths[mid - 1] + face_widths[mid]) / 2
        return face_widths[mid]
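The even/odd midpoint median at the tail of `_detect_face_size` can be sanity-checked on its own (sample widths below are invented):

```python
def median_width(widths):
    # Same sorted-midpoint median as the tail of _detect_face_size
    widths = sorted(widths)
    mid = len(widths) // 2
    if len(widths) % 2 == 0:
        return (widths[mid - 1] + widths[mid]) / 2
    return widths[mid]

print(median_width([210.0, 180.0, 240.0]))         # 210.0 (odd count → middle element)
print(median_width([210.0, 180.0, 240.0, 300.0]))  # 225.0 (even count → mean of the two middle)
```

Using the median rather than the mean keeps a few spurious large or tiny detections from flipping the `LIPSYNC_SMALL_FACE_THRESHOLD` decision.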

    # ================================================================
    # Phase 3: cropping + tracking
    # ================================================================

    def _build_face_track(self, video_path: str,
                          w: int, h: int, nb_frames: int) -> Optional[FaceTrack]:
        """
        Frame-by-frame face tracking: detect every DETECT_EVERY frames,
        EMA-interpolate in between. Returns a FaceTrack, or None on failure.
        """
        if not self._ensure_detector():
            return None

        detect_set = set(range(0, nb_frames, DETECT_EVERY))

        # First pass: keyframe detection
        proc = self._open_video_reader(video_path, w, h)
        keyframe_bboxes = {}
        actual_frames = 0
        try:
            for idx in range(nb_frames):
                frame = self._read_one_frame(proc, w, h)
                if frame is None:
                    break
                actual_frames = idx + 1
                if idx not in detect_set:
                    continue
                faces = self._detect_faces(frame)
                if faces:
                    keyframe_bboxes[idx] = faces[0][0].copy()
        finally:
            proc.stdout.close()
            proc.terminate()
            proc.wait()

        if not keyframe_bboxes:
            return None

        # Use the number of frames actually read, avoiding _get_video_info estimation drift
        nb_frames = actual_frames

        # Forward fill + EMA smoothing
        sorted_keys = sorted(keyframe_bboxes.keys())
        raw_bboxes: List[np.ndarray] = [None] * nb_frames  # type: ignore

        for k in sorted_keys:
            raw_bboxes[k] = keyframe_bboxes[k]

        prev = keyframe_bboxes[sorted_keys[0]]
        for i in range(nb_frames):
            if raw_bboxes[i] is not None:
                prev = raw_bboxes[i]
            else:
                raw_bboxes[i] = prev.copy()

        # EMA smoothing
        smoothed = [raw_bboxes[0].copy()]
        for i in range(1, nb_frames):
            s = EMA_ALPHA * raw_bboxes[i] + (1 - EMA_ALPHA) * smoothed[-1]
            smoothed.append(s)

        # Padded crop boxes (clamped to frame bounds)
        crop_boxes = []
        for bbox in smoothed:
            x1, y1, x2, y2 = bbox
            bw, bh = x2 - x1, y2 - y1
            pad_w, pad_h = bw * PADDING, bh * PADDING
            cx1 = max(0, int(x1 - pad_w))
            cy1 = max(0, int(y1 - pad_h))
            cx2 = min(w, int(x2 + pad_w))
            cy2 = min(h, int(y2 + pad_h))
            crop_boxes.append((cx1, cy1, cx2, cy2))

        # Median face width
        widths = sorted(float(b[2] - b[0]) for b in smoothed)
        median_w = widths[len(widths) // 2]

        return FaceTrack(
            crop_boxes=crop_boxes,
            face_width_median=median_w,
            frame_count=nb_frames,
            frame_w=w,
            frame_h=h,
        )
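The EMA pass in `_build_face_track` damps detector jitter between keyframes; a tiny standalone version with made-up bbox x-coordinates shows the effect:

```python
import numpy as np

EMA_ALPHA = 0.3  # same coefficient as the module constant

# Hypothetical per-frame left-edge coordinates after forward fill
raw = [np.array([100.0]), np.array([110.0]), np.array([90.0])]

# smoothed[i] = alpha * raw[i] + (1 - alpha) * smoothed[i-1]
smoothed = [raw[0].copy()]
for b in raw[1:]:
    smoothed.append(EMA_ALPHA * b + (1 - EMA_ALPHA) * smoothed[-1])

print([round(float(s[0]), 2) for s in smoothed])  # [100.0, 103.0, 99.1]
```

A jump of ±10 px in the raw detections moves the smoothed track by only ~3 px per frame, which keeps the crop window (and therefore the blend-back region) stable.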

    # ================================================================
    # Phase 3: super-resolution
    # ================================================================

    def _ensure_upscaler(self, upscaler: str, gpu_id: int) -> bool:
        """Lazily load the super-resolution model."""
        if self._sr_model is not None and self._sr_type == upscaler:
            return True
        try:
            import sys
            import torch

            # torchvision >= 0.20 removed functional_tensor, but basicsr still imports it
            if "torchvision.transforms.functional_tensor" not in sys.modules:
                try:
                    import torchvision.transforms.functional as _F
                    sys.modules["torchvision.transforms.functional_tensor"] = _F
                except ImportError:
                    pass

            device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")

            if upscaler == "gfpgan":
                from gfpgan import GFPGANer
                model_path = _PROJECT_ROOT / "models" / "FaceEnhance" / "GFPGANv1.4.pth"
                if not model_path.exists():
                    logger.warning(f"⚠️ GFPGAN 权重不存在: {model_path}")
                    return False
                self._sr_model = GFPGANer(
                    model_path=str(model_path),
                    upscale=2,
                    arch="clean",
                    channel_multiplier=2,
                    bg_upsampler=None,
                    device=device,
                )
            elif upscaler == "codeformer":
                from basicsr.archs.codeformer_arch import CodeFormer as CodeFormerArch
                model_path = _PROJECT_ROOT / "models" / "FaceEnhance" / "codeformer.pth"
                if not model_path.exists():
                    logger.warning(f"⚠️ CodeFormer 权重不存在: {model_path}")
                    # Try falling back to gfpgan
                    return self._ensure_upscaler("gfpgan", gpu_id)
                net = CodeFormerArch(
                    dim_embd=512, codebook_size=1024, n_head=8, n_layers=9,
                    connect_list=["32", "64", "128", "256"],
                ).to(device)
                ckpt = torch.load(str(model_path), map_location=device, weights_only=False)
                net.load_state_dict(ckpt.get("params_ema", ckpt.get("params", ckpt)))
                net.eval()
                self._sr_model = net
                self._sr_device = device
            else:
                logger.warning(f"⚠️ 未知超分器: {upscaler}")
                return False

            self._sr_type = upscaler
            logger.info(f"✅ 超分器已加载: {upscaler}")
            return True
        except Exception as e:
            logger.warning(f"⚠️ 超分器初始化失败 ({upscaler}): {e}")
            return False

    def _upscale_face(self, face_img: np.ndarray, target_size: int) -> np.ndarray:
        """Enhance a single frame with the loaded SR model; fall back to bicubic on failure."""
        try:
            if self._sr_type == "gfpgan":
                _, _, output = self._sr_model.enhance(
                    face_img, paste_back=False, has_aligned=False,
                )
                if output is not None:
                    return cv2.resize(
                        output, (target_size, target_size),
                        interpolation=cv2.INTER_LANCZOS4,
                    )
            elif self._sr_type == "codeformer":
                import torch
                img = cv2.resize(face_img, (512, 512))
                img_t = (
                    torch.from_numpy(img.astype(np.float32) / 255.0)
                    .permute(2, 0, 1)
                    .unsqueeze(0)
                    .to(self._sr_device)
                )
                with torch.no_grad():
                    out = self._sr_model(img_t, w=0.7)[0]
                out_np = (
                    out.squeeze().permute(1, 2, 0).cpu().numpy() * 255
                ).clip(0, 255).astype(np.uint8)
                return cv2.resize(
                    out_np, (target_size, target_size),
                    interpolation=cv2.INTER_LANCZOS4,
                )
        except Exception as e:
            logger.debug(f"超分失败,回退 bicubic: {e}")

        return cv2.resize(
            face_img, (target_size, target_size),
            interpolation=cv2.INTER_CUBIC,
        )
|
||||
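The CodeFormer branch above round-trips between OpenCV's uint8 HxWx3 layout and the model's float 1x3xHxW tensor layout. A minimal numpy-only sketch of that conversion (hypothetical standalone snippet, not part of the module; it adds `np.rint` rounding on the way back, whereas the code above truncates via `astype`):

```python
import numpy as np

def hwc_to_chw(img: np.ndarray) -> np.ndarray:
    """uint8 HxWx3 image -> float32 1x3xHxW tensor in [0, 1]."""
    return np.transpose(img.astype(np.float32) / 255.0, (2, 0, 1))[None, ...]

def chw_to_hwc(t: np.ndarray) -> np.ndarray:
    """float32 1x3xHxW tensor in [0, 1] -> uint8 HxWx3 image.

    np.rint avoids off-by-one truncation from float32 arithmetic.
    """
    out = np.rint(np.transpose(t[0], (1, 2, 0)) * 255.0)
    return out.clip(0, 255).astype(np.uint8)

img = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
t = hwc_to_chw(img)
restored = chw_to_hwc(t)
assert t.shape == (1, 3, 512, 512)
assert np.array_equal(restored, img)  # lossless round trip with rounding
```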
    # ================================================================
    # Phase 3: crop + super-resolve → enhanced video
    # ================================================================

    def _crop_and_upscale_video(
        self,
        video_path: str,
        track: FaceTrack,
        tmpdir: Path,
        gpu_id: int,
        source_fps: float,
    ) -> str:
        """
        Crop the face region → super-resolve sparse keyframes → write a TARGET_SIZE video.
        Streams frame by frame to keep memory bounded.
        """
        output_path = str(tmpdir / "enhanced_face.mp4")
        w, h = track.frame_w, track.frame_h

        upscaler = settings.LIPSYNC_SMALL_FACE_UPSCALER
        sr_available = self._ensure_upscaler(upscaler, gpu_id)
        detect_set = set(range(0, track.frame_count, DETECT_EVERY))

        reader = self._open_video_reader(video_path, w, h)
        out_fps = max(1, int(round(source_fps))) if source_fps > 0 else ENCODE_FPS
        writer = self._open_video_writer(output_path, TARGET_SIZE, TARGET_SIZE, fps=out_fps)

        try:
            for idx in range(track.frame_count):
                frame = self._read_one_frame(reader, w, h)
                if frame is None:
                    break

                cx1, cy1, cx2, cy2 = track.crop_boxes[idx]
                cropped = frame[cy1:cy2, cx1:cx2]

                if sr_available and idx in detect_set:
                    enhanced = self._upscale_face(cropped, TARGET_SIZE)
                else:
                    enhanced = cv2.resize(
                        cropped, (TARGET_SIZE, TARGET_SIZE),
                        interpolation=cv2.INTER_CUBIC,
                    )

                writer.stdin.write(enhanced.tobytes())
        finally:
            reader.stdout.close()
            reader.terminate()
            reader.wait()
            writer.stdin.close()
            writer.wait()

        if not Path(output_path).exists():
            raise RuntimeError("增强视频写入失败")

        return output_path
    # ================================================================
    # Phase 3: paste back
    # ================================================================

    def blend_back(
        self,
        original_video: str,
        lipsync_video: str,
        track: FaceTrack,
        tmpdir,
    ) -> str:
        """
        Paste the lipsync inference result back onto the original video.
        Lower-face mask + Gaussian feathering + seamlessClone.
        """
        tmpdir = Path(tmpdir)
        output_path = str(tmpdir / "blended_output.mp4")
        w, h = track.frame_w, track.frame_h

        # Read the lipsync video dimensions
        ls_info = self._get_video_info(lipsync_video)
        if ls_info is None:
            raise RuntimeError("无法读取 lipsync 视频信息")
        ls_w, ls_h, ls_frames, ls_fps = ls_info

        if ls_fps <= 0:
            ls_fps = ENCODE_FPS

        # Frame-count guard: the lipsync model emits frames for the audio
        # duration, so its count is normally <= the original (looped) video's
        if ls_frames <= 0:
            raise RuntimeError(f"lipsync 输出帧数为 {ls_frames},跳过贴回")
        if ls_frames > track.frame_count:
            raise RuntimeError(
                f"帧数异常: lipsync={ls_frames} > original={track.frame_count}"
            )
        blend_count = ls_frames

        orig_info = self._get_video_info(original_video)
        orig_fps = orig_info[3] if orig_info is not None else 0.0
        if orig_fps <= 0:
            orig_fps = ls_fps

        orig_reader = self._open_video_reader(original_video, w, h)
        ls_reader = self._open_video_reader(lipsync_video, ls_w, ls_h)
        writer = self._open_video_writer(
            output_path,
            w,
            h,
            fps=max(1, int(round(ls_fps))),
        )

        current_orig_idx = -1
        current_orig_frame = None

        try:
            for idx in range(blend_count):
                target_orig_idx = min(
                    track.frame_count - 1,
                    int(round((idx / ls_fps) * orig_fps)),
                )

                while current_orig_idx < target_orig_idx:
                    frame = self._read_one_frame(orig_reader, w, h)
                    if frame is None:
                        current_orig_frame = None
                        break
                    current_orig_idx += 1
                    current_orig_frame = frame

                orig_frame = current_orig_frame
                ls_frame = self._read_one_frame(ls_reader, ls_w, ls_h)
                if orig_frame is None or ls_frame is None:
                    break

                cx1, cy1, cx2, cy2 = track.crop_boxes[target_orig_idx]
                crop_w, crop_h = cx2 - cx1, cy2 - cy1

                # Resize the lipsync output to the crop-region size
                ls_resized = cv2.resize(
                    ls_frame, (crop_w, crop_h),
                    interpolation=cv2.INTER_LANCZOS4,
                )

                # Local mouth mask (cover only lips and chin, leave nose/eyes untouched)
                mask = np.zeros((crop_h, crop_w), dtype=np.uint8)
                upper = int(crop_h * MASK_UPPER_RATIO)
                left = int(crop_w * MASK_SIDE_MARGIN)
                right = int(crop_w * (1.0 - MASK_SIDE_MARGIN))
                if right - left < 8:
                    left, right = 0, crop_w

                mask[upper:, left:right] = 255

                # Central ellipse boosts the weight of the mouth region
                ellipse_center = (crop_w // 2, int(crop_h * 0.82))
                ellipse_axes = (max(8, int(crop_w * 0.22)), max(8, int(crop_h * 0.13)))
                cv2.ellipse(mask, ellipse_center, ellipse_axes, 0, 0, 360, 255, -1)
                mask = cv2.GaussianBlur(mask, (0, 0), MASK_FEATHER)

                # Blend
                blended = self._blend_face_region(
                    orig_frame, ls_resized, mask, cx1, cy1, cx2, cy2,
                )
                writer.stdin.write(blended.tobytes())
        finally:
            for p in (orig_reader, ls_reader):
                p.stdout.close()
                p.terminate()
                p.wait()
            writer.stdin.close()
            writer.wait()

        if not Path(output_path).exists():
            raise RuntimeError("融合视频写入失败")
        return output_path
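The `target_orig_idx` computation in `blend_back` maps each lipsync frame onto the original frame shown at the same timestamp, which is what lets, say, a 25 fps lipsync output drive a 30 fps original. A minimal sketch of that mapping (hypothetical standalone snippet; fps values chosen for illustration):

```python
def map_frame_index(idx: int, ls_fps: float, orig_fps: float, frame_count: int) -> int:
    """Map lipsync frame `idx` to the original frame at the same wall-clock time,
    clamped to the last available original frame."""
    return min(frame_count - 1, int(round((idx / ls_fps) * orig_fps)))

# 25 fps lipsync output against a 30 fps original: roughly every 5th
# original frame is skipped to keep the two streams in sync
indices = [map_frame_index(i, 25.0, 30.0, 300) for i in range(10)]
# indices == [0, 1, 2, 4, 5, 6, 7, 8, 10, 11]
```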
    @staticmethod
    def _blend_face_region(
        orig: np.ndarray,
        face: np.ndarray,
        mask: np.ndarray,
        x1: int, y1: int, x2: int, y2: int,
    ) -> np.ndarray:
        """Paste back via seamlessClone; fall back to alpha blending on failure."""
        result = orig.copy()
        crop_h, crop_w = face.shape[:2]

        # Try seamlessClone first
        try:
            center_x = (x1 + x2) // 2
            center_y = int(y1 + (y2 - y1) * 0.7)
            center_x = max(1, min(center_x, orig.shape[1] - 2))
            center_y = max(1, min(center_y, orig.shape[0] - 2))

            src = np.zeros_like(orig)
            src[y1:y2, x1:x2] = face

            full_mask = np.zeros(orig.shape[:2], dtype=np.uint8)
            full_mask[y1:y2, x1:x2] = mask

            if full_mask.max() > 0:
                cloned = cv2.seamlessClone(
                    src, orig, full_mask, (center_x, center_y), cv2.NORMAL_CLONE,
                )

                # Confine the blend to the mask region so Poisson diffusion
                # cannot ghost the area above the eyes
                alpha = mask.astype(np.float32) / 255.0
                alpha_3ch = np.stack([alpha] * 3, axis=-1)
                roi_orig = orig[y1:y2, x1:x2].astype(np.float32)
                roi_clone = cloned[y1:y2, x1:x2].astype(np.float32)
                blended_roi = roi_orig * (1 - alpha_3ch) + roi_clone * alpha_3ch

                result = orig.copy()
                result[y1:y2, x1:x2] = blended_roi.astype(np.uint8)
                return result
        except Exception:
            pass

        # Fallback: alpha blend
        alpha = mask.astype(np.float32) / 255.0
        alpha_3ch = np.stack([alpha] * 3, axis=-1)
        crop_region = result[y1:y2, x1:x2].astype(np.float32)
        blended = crop_region * (1 - alpha_3ch) + face.astype(np.float32) * alpha_3ch
        result[y1:y2, x1:x2] = blended.astype(np.uint8)
        return result
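The alpha-blend fallback in `_blend_face_region` is a per-pixel convex combination weighted by the feathered mask: 0 keeps the original pixel, 255 takes the face pixel, intermediate values mix. A numpy-only sketch (hypothetical standalone snippet; a linear ramp stands in for the Gaussian feather so OpenCV is not required):

```python
import numpy as np

def feathered_alpha_blend(orig: np.ndarray, face: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """mask is uint8 0..255: 0 keeps orig, 255 takes face, in between mixes linearly."""
    alpha = (mask.astype(np.float32) / 255.0)[..., None]  # HxWx1, broadcasts over channels
    out = orig.astype(np.float32) * (1 - alpha) + face.astype(np.float32) * alpha
    return out.clip(0, 255).astype(np.uint8)

h, w = 64, 64
orig = np.zeros((h, w, 3), dtype=np.uint8)            # black background frame
face = np.full((h, w, 3), 200, dtype=np.uint8)        # gray "lipsync" patch
mask = np.zeros((h, w), dtype=np.uint8)
mask[h // 2:, :] = 255                                # hard lower half
mask[h // 2 - 8:h // 2, :] = np.linspace(0, 255, 8, dtype=np.uint8)[:, None]  # linear feather
blended = feathered_alpha_blend(orig, face, mask)
# top rows stay 0, bottom half becomes 200, the feather band lands in between
```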
    # ================================================================
    # Main entry
    # ================================================================

    def enhance_if_needed(
        self,
        video_path: str,
        tmpdir,
        gpu_id: int,
    ) -> EnhanceResult:
        """
        Main entry: detect a small face → crop + super-resolve → return the enhanced result.
        Returns was_enhanced=False when no enhancement is needed.
        """
        if not settings.LIPSYNC_SMALL_FACE_ENHANCE:
            return EnhanceResult(video_path=video_path, was_enhanced=False)

        if not _CV2_AVAILABLE:
            logger.warning("⚠️ opencv/numpy 未安装,小脸增强不可用")
            return EnhanceResult(video_path=video_path, was_enhanced=False)

        start = time.time()
        tmpdir = Path(tmpdir)
        face_dir = tmpdir / "face_enhance"
        face_dir.mkdir(exist_ok=True)

        # ── Detect ──
        face_width = self._detect_face_size(video_path)
        if face_width is None:
            logger.info("小脸增强: 未检测到人脸,跳过")
            return EnhanceResult(video_path=video_path, was_enhanced=False)

        threshold = settings.LIPSYNC_SMALL_FACE_THRESHOLD
        if face_width >= threshold:
            logger.info(
                f"小脸增强: face_w={face_width:.0f}px >= threshold={threshold}px, 跳过"
            )
            return EnhanceResult(
                video_path=video_path, was_enhanced=False, face_width=face_width,
            )

        logger.info(
            f"小脸增强: face_w={face_width:.0f}px < threshold={threshold}px, 触发增强"
        )

        # ── Build the face track ──
        info = self._get_video_info(video_path)
        if info is None:
            raise RuntimeError("无法读取视频信息")
        w, h, nb_frames, fps = info

        track = self._build_face_track(video_path, w, h, nb_frames)
        if track is None:
            raise RuntimeError("人脸追踪失败")

        # ── Crop + super-resolve ──
        enhanced_path = self._crop_and_upscale_video(
            video_path,
            track,
            face_dir,
            gpu_id,
            source_fps=fps,
        )

        # Free GPU cache
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass

        elapsed = time.time() - start
        logger.info(
            f"小脸增强: face_w={face_width:.0f}px threshold={threshold}px "
            f"enhanced=True upscaler={settings.LIPSYNC_SMALL_FACE_UPSCALER} "
            f"time={elapsed:.1f}s"
        )

        return EnhanceResult(
            video_path=enhanced_path,
            was_enhanced=True,
            track=track,
            face_width=face_width,
        )
@@ -38,3 +38,7 @@ faster-whisper>=1.0.0
# Copy extraction & AI generation
yt-dlp>=2023.0.0
zai-sdk>=0.2.0

# Small-face enhancement
opencv-python-headless>=4.8.0
gfpgan>=1.3.8