ViGent2/Docs/DevLogs/Day29.md

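## Subtitle Sync Fix + Lip-Sync Parameter Tuning + Video Pipeline Optimization + Preview Background Fix + CosyVoice Tone Control (Day 29)

### Overview

This round was a full review and optimization of the video generation pipeline. It fixes subtitles drifting out of sync with speech (Whisper timestamp smoothing + original-text rhythm mapping), tunes the LatentSync lip-sync parameters, switches compose to stream copy to skip a redundant re-encode, and adds FFmpeg timeout protection, a global concurrency limit, Redis task TTLs, temp-file cleanup, and dead-code removal. It also fixes the style-preview background, which broke under CORS after the frontend domain migration, and adds CosyVoice tone control, which lets voice-clone mode express emotions such as happy, sad, and angry (based on `inference_instruct2`).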

---

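## ✅ Changes

### 1. Subtitle sync fix (Whisper timestamps + original-text rhythm mapping)

- **Problem**: Subtitle highlighting was out of sync with the speech: subtitles ran ahead or lagged, and the highlight skipped
- **Root cause**: Whisper's per-character timestamps jitter slightly (a character's end can fall after the next character's start), and gaps between characters make the highlight flicker

#### whisper_service.py: timestamp post-processing

Added a `smooth_word_timestamps()` function that smooths in three steps:

1. **Monotonic starts**: a character's start is never earlier than the previous character's start
2. **Overlap removal**: when two characters overlap in time, split at the midpoint
3. **Gap filling**: when the gap between characters is < 50 ms, join them directly so the highlight does not skip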

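```python
def smooth_word_timestamps(words):
    """Smooth per-character timestamps in place and return the list."""
    for i in range(1, len(words)):
        prev, w = words[i - 1], words[i]
        # Monotonic: a start is never earlier than the previous start
        w["start"] = max(w["start"], prev["start"])
        # Overlap -> split at the midpoint
        if w["start"] < prev["end"]:
            mid = (prev["end"] + w["start"]) / 2
            prev["end"] = mid
            w["start"] = mid
        # Tiny gap (< 50 ms) -> join directly
        gap = w["start"] - prev["end"]
        if 0 < gap < 0.05:
            prev["end"] = w["start"]
    return words
```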

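#### whisper_service.py: original-text rhythm mapping

- **Problem**: AI-rewritten or multilingual copy does not match the text Whisper transcribes, so using Whisper's text directly produces garbled subtitles
- **Solution**: When the `original_text` parameter is non-empty, replace Whisper's text with the original characters but keep Whisper's speech-rhythm timestamps
- Implementation: map the N original characters proportionally onto the M Whisper timestamps (linear interpolation)
- Character-count ratio check (warn when the ratio is >1.5x or <0.67x)
- Per-character duration clamp: 40 ms to 800 ms, to prevent extreme drift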
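
The mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in `whisper_service.py`: the function names, the `word`/`start`/`end` dict keys, the choice to skip whitespace, and the reading of the ratio as original count over Whisper count are all assumptions.

```python
import logging

logger = logging.getLogger(__name__)

MIN_CHAR_SEC = 0.04  # 40 ms lower clamp
MAX_CHAR_SEC = 0.80  # 800 ms upper clamp


def _time_at(boundaries, x):
    """Linearly interpolate the boundary list at fractional index x."""
    k = min(int(x), len(boundaries) - 2)
    return boundaries[k] + (x - k) * (boundaries[k + 1] - boundaries[k])


def map_text_to_rhythm(original_text, whisper_words):
    """Spread N original characters over the M-word Whisper timeline."""
    chars = [c for c in original_text if not c.isspace()]  # assumption
    n, m = len(chars), len(whisper_words)
    if n == 0 or m == 0:
        return []
    ratio = n / m
    if ratio > 1.5 or ratio < 0.67:
        logger.warning("character count ratio %.2f looks abnormal", ratio)
    # M word starts plus the final end give M + 1 boundaries
    boundaries = [w["start"] for w in whisper_words]
    boundaries.append(whisper_words[-1]["end"])
    mapped = []
    for i, ch in enumerate(chars):
        start = _time_at(boundaries, i * m / n)
        end = _time_at(boundaries, (i + 1) * m / n)
        # Clamp the duration to 40 ms .. 800 ms
        duration = min(max(end - start, MIN_CHAR_SEC), MAX_CHAR_SEC)
        mapped.append({"word": ch, "start": start, "end": start + duration})
    return mapped
```

Clamping a very short character up to 40 ms can push its end past the next character's start, so in this sketch the result would still need a smoothing pass afterwards.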

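#### captions.ts: subtitle lookup on the Remotion side

Added the `getCurrentSegment()` and `getCurrentWordIndex()` functions:

- Look up exactly which subtitle segment to show and which character to highlight for the current frame time
- Handle gaps between characters (between two characters, return the previous character's index so the highlight stays continuous)
- Once past the last character's end time, return the last character (avoids flicker at the end)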
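
The lookup rules above can be illustrated with a short Python port. The real functions are TypeScript in `captions.ts`; the name `get_current_word_index`, the `start`/`end` keys, and returning `-1` before the first character starts are assumptions made for this sketch.

```python
def get_current_word_index(words, t):
    """Return the index of the character to highlight at time t (seconds).

    `words` must be sorted by start time. Returns -1 before the first start.
    """
    if not words or t < words[0]["start"]:
        return -1
    for i, w in enumerate(words):
        if t < w["start"]:
            # t falls in the gap before this character: keep the previous one
            return i - 1
        if t < w["end"]:
            return i
    # Past the last end time: keep the last character highlighted
    return len(words) - 1
```

A linear scan is enough for a single subtitle segment; a binary search would only matter for very long word lists.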

---

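### 2. LatentSync lip-sync parameter tuning

| Parameter | Day 28 value | Day 29 value | Notes |
|------|----------|----------|------|
| `LATENTSYNC_INFERENCE_STEPS` | 16 | 20 | Moderately more steps to improve lip-sync quality |
| `LATENTSYNC_GUIDANCE_SCALE` | (default) | 2.0 | Balances lip-sync accuracy against naturalness |
| `LATENTSYNC_ENABLE_DEEPCACHE` | (default) | true | DeepCache speeds up inference |
| `LATENTSYNC_SEED` | (default) | 1247 | Fixed seed for reproducibility |
| Remotion concurrency | 16 | 4 | Lower concurrency to avoid resource contention |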

---

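### 3. compose(): stream copy instead of a redundant re-encode (high priority)

**File**: `video_service.py`

- **Problem**: `compose()` only merges a video track with an audio track (mux), yet it ran a full `libx264 -preset medium -crf 20` re-encode every time, which took several minutes. Across the whole pipeline a single video could be x264-encoded up to 5 times
- **Solution**: When no looping is needed (`loop_count == 1`), use `-c:v copy` stream copy, which finishes almost instantly; when looping is needed, still use libx264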

```python
if loop_count > 1:
    cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
else:
    cmd.extend(["-c:v", "copy"])
```

- The compose output is an intermediate artifact (Remotion encodes it again), so stream copy saves one encode with no quality loss

---

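### 4. FFmpeg timeout protection (high priority)

**File**: `video_service.py`

- `_run_ffmpeg()`: added `timeout=600` (10 minutes) and catches `subprocess.TimeoutExpired`
- `_get_duration()`: added `timeout=30`
- Prevents a malformed video from hanging FFmpeg forever and blocking background tasks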
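
The timeout pattern can be sketched as below. This is an assumption-laden illustration rather than the project's `_run_ffmpeg()`: the helper name `run_with_timeout` and its boolean return value are hypothetical, and only the `timeout` argument and the `subprocess.TimeoutExpired` handling come from the notes above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout=600):
    """Run cmd; return True on success, False on non-zero exit or timeout."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so nothing is left running in the background here
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # A child that sleeps longer than the timeout is cut off
    slow = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_with_timeout(slow, timeout=0.5))
```

Returning a boolean keeps the caller simple; a real service would probably also log the captured stderr before giving up.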

---

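### 5. Global task concurrency limit (high priority)

**File**: `workflow.py`

- Module-level `asyncio.Semaphore(2)`, acquired at the entry of `process_video_generation()`
- Queued tasks show the status "排队中..." ("Queued...")
- Prevents multiple requests from running FFmpeg + Remotion at the same time and exhausting CPU/memory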

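```python
import asyncio

# At most two generation tasks run at once; the rest wait here
_generation_semaphore = asyncio.Semaphore(2)

async def process_video_generation(task_id, req, user_id):
    # Status shown while waiting for a slot ("排队中..." = "Queued...")
    _update_task(task_id, message="排队中...")
    async with _generation_semaphore:
        await _process_video_generation_inner(task_id, req, user_id)
```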
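
To see the queuing behavior in isolation, here is a small self-contained demo. It is only a sketch: `demo` and `job` are hypothetical names, and the sleep stands in for the real FFmpeg + Remotion work.

```python
import asyncio


async def demo(n_tasks=3, limit=2):
    """Run n_tasks jobs behind a semaphore and return the peak concurrency."""
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def job():
        nonlocal running, peak
        async with sem:  # jobs beyond the limit wait here
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.05)  # placeholder for the real work
            running -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With three jobs and a limit of two, the third job only starts after one of the first two releases the semaphore, so the peak never exceeds the limit.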

---

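### 6. Redis task TTL + index cleanup (medium priority)

**File**: `task_store.py`

- `create()`: sets a 24-hour TTL (`ex=86400`)
- `update()`: sets a 2-hour TTL (`ex=7200`) for completed/failed tasks, 24 hours otherwise
- `list()`: while iterating, also removes index entries whose keys have expired (`srem`)
- Fixes Redis task keys piling up forever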
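
A rough sketch of the TTL and index-cleanup logic follows. It assumes a redis-py style client created with `decode_responses=True`; the function names, the JSON encoding, and the index key `vigent:tasks:index` are hypothetical, while the TTL values and the `vigent:tasks:<id>` key shape come from these notes.

```python
import json

TASK_TTL_SEC = 86400              # 24 h for tasks still in progress
FINISHED_TTL_SEC = 7200           # 2 h once completed or failed
INDEX_KEY = "vigent:tasks:index"  # hypothetical name for the index set


def task_key(task_id):
    return f"vigent:tasks:{task_id}"


def save_task(client, task_id, task):
    """Write the task with a TTL that depends on its status."""
    finished = task.get("status") in ("completed", "failed")
    ttl = FINISHED_TTL_SEC if finished else TASK_TTL_SEC
    client.set(task_key(task_id), json.dumps(task), ex=ttl)
    client.sadd(INDEX_KEY, task_id)


def list_tasks(client):
    """Return live tasks and drop index entries whose key has expired."""
    tasks = []
    for task_id in client.smembers(INDEX_KEY):
        raw = client.get(task_key(task_id))
        if raw is None:
            client.srem(INDEX_KEY, task_id)  # stale index entry
            continue
        tasks.append(json.loads(raw))
    return tasks
```

Cleaning the index lazily during `list_tasks` avoids a separate sweeper job, at the cost of stale entries lingering until the next listing.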

---

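### 7. Temporary font file cleanup (medium priority)

**File**: `workflow.py`

- `prepare_style_for_remotion()` copies fonts into temp_dir but never added them to the cleanup list
- It now iterates over three prefixes (subtitle/title/secondary_title) × four extensions (.ttf/.otf/.woff/.woff2) and adds every font file that exists to `temp_files`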
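
The enumeration can be sketched like this. The helper name `collect_temp_fonts` and the `<prefix>_font<ext>` file naming are assumptions for illustration only; the notes above specify just the prefixes and extensions.

```python
from pathlib import Path

FONT_PREFIXES = ("subtitle", "title", "secondary_title")
FONT_EXTS = (".ttf", ".otf", ".woff", ".woff2")


def collect_temp_fonts(temp_dir, temp_files):
    """Append every copied font that exists in temp_dir to temp_files."""
    for prefix in FONT_PREFIXES:
        for ext in FONT_EXTS:
            # Hypothetical naming scheme: <prefix>_font<ext>
            candidate = Path(temp_dir) / f"{prefix}_font{ext}"
            if candidate.exists():
                temp_files.append(candidate)
    return temp_files
```

Checking a fixed set of twelve candidate names, rather than globbing the directory, means unrelated files in the temp directory are never scheduled for deletion.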

---

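### 8. Whisper+split logic deduplication (low priority)

**File**: `workflow.py`

- The two branches (custom_assignments mismatch vs. default) contained identical Whisper → `_split_equal` code (36 duplicated lines)
- Extracted it into an internal function, `_whisper_and_split()`, shared by both branches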

---

### 9. LipSync dead-code cleanup (low priority)

**File**: `lipsync_service.py`

- Removed the `_preprocess_video()` method (92 lines), which was not called anywhere in the project

---

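### 10. Title/subtitle preview background CORS fix

- **Problem**: After the frontend domain moved from `vigent.hbyrkj.top` to `ipagent.ai-labz.cn`, the signed material URLs (`api.hbyrkj.top`) no longer share a root domain with the frontend, and the Supabase Kong gateway's CORS configuration does not cover the new domain → `<video crossOrigin="anonymous">` fails to load → canvas frame capture fails → the preview falls back to the gradient background
- **Root cause**: The Day 28 implementation relied on Supabase returning an `Access-Control-Allow-Origin` header; the domain change broke that dependency

**Fix: same-origin proxy (bypasses CORS entirely)**:

| Component | Change |
|------|------|
| `materials/router.py` | Added a `GET /api/materials/stream/{material_id}` endpoint that reads straight from local disk via `get_local_file_path()` and returns a `FileResponse` |
| `useHomeController.ts` | Frame-capture URL changed to `/api/materials/stream/${mat.id}` (same origin); the cross-origin signed URL is no longer used |
| `useVideoFrameCapture.ts` | Removed `crossOrigin = "anonymous"`, which same-origin requests do not need |

Flow: `user clicks preview → /api/materials/stream/xxx → Next.js rewrite → FastAPI FileResponse → same-origin <video> → canvas frame capture succeeds`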

---

### 11. Alipay callback domain update

**File**: `.env`

```
ALIPAY_NOTIFY_URL=https://ipagent.ai-labz.cn/api/payment/notify
ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay
```

---

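### 12. CosyVoice tone control

- **Feature**: Voice-clone mode gains a "Tone" (语气) dropdown with four options, Normal / Cheerful / Low / Serious (正常/欢快/低沉/严肃). It uses CosyVoice3's `inference_instruct2()` method to control tone and emotion through a natural-language instruction
- **Default behavior unchanged**: Selecting "Normal" still goes through `inference_zero_shot()`, exactly as before the change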

#### Data flow

```
User selects a tone → setEmotion("happy") → persisted to localStorage
  → generate voice-over → emotion mapped to instruct_text
    → POST /api/generated-audios/generate { instruct_text }
      → voice_clone_service → POST localhost:8010/generate { instruct_text }
        → instruct_text non-empty ? inference_instruct2() : inference_zero_shot()
```

#### CosyVoice service: `cosyvoice_server.py`

- The `/generate` endpoint gains an `instruct_text: str = Form("")` parameter
- Inference branch: empty → `inference_zero_shot()`, non-empty → `inference_instruct2(text, instruct_text, ref_audio_path, ...)`
- `inference_instruct2` does not need `prompt_text`; it takes `instruct_text` + `prompt_wav` directly

#### Backend pass-through

- `schemas.py`: `GenerateAudioRequest` gains `instruct_text: Optional[str] = None`
- `service.py`: the voiceclone branch of `generate_audio_task()` passes `instruct_text=req.instruct_text or ""`
- `voice_clone_service.py`: `_generate_once()` and `generate_audio()` gain an `instruct_text` parameter

#### Frontend

- `useHomeController.ts`: added the `emotion` state + the `emotionToInstruct` mapping table
- `useHomePersistence.ts`: persists the tone selection to localStorage
- `useGeneratedAudios.ts`: `generateAudio` params gain `instruct_text`
- `GeneratedAudiosPanel.tsx`: tone dropdown (to the left of the speed button), reusing the speed dropdown's styling and visible only in voiceclone mode
- `HomePage.tsx`: passes through `emotion`/`onEmotionChange`

#### instruct_text format (from the CosyVoice3 instruct_list)

```
Normal (正常):   "" (uses inference_zero_shot)
Cheerful (欢快): "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>"
Low (低沉):      "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>"
Serious (严肃):  "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>"
```
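
The mapping and the branch it drives can be sketched in Python as below. In the project the mapping table lives in the TypeScript frontend (`emotionToInstruct`), so this is only an illustration: the function names are hypothetical, and every emotion key except `"happy"` (which appears in the data flow) is an assumption.

```python
_PREFIX = "You are a helpful assistant. "
_SUFFIX = "<|endofprompt|>"

# Keys other than "happy" are assumed for illustration
EMOTION_TO_INSTRUCT = {
    "normal": "",
    "happy": f"{_PREFIX}请非常开心地说一句话。{_SUFFIX}",
    "sad": f"{_PREFIX}请非常伤心地说一句话。{_SUFFIX}",
    "angry": f"{_PREFIX}请非常生气地说一句话。{_SUFFIX}",
}


def resolve_instruct(emotion):
    """Map an emotion key to its instruct_text; unknown keys fall back to ''."""
    return EMOTION_TO_INSTRUCT.get(emotion, "")


def pick_inference_mode(instruct_text):
    """Name the CosyVoice method the server-side branch would call."""
    return "inference_instruct2" if instruct_text else "inference_zero_shot"
```

Falling back to an empty string for unknown keys keeps the default path on `inference_zero_shot()`, which matches the "default behavior unchanged" requirement.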

---

## 📁 Modified files

| File | Change |
|------|------|
| `backend/app/services/whisper_service.py` | Timestamp smoothing + original-text rhythm mapping + per-character duration clamp |
| `remotion/src/utils/captions.ts` | Added `getCurrentSegment` / `getCurrentWordIndex` |
| `backend/app/services/video_service.py` | compose stream copy + FFmpeg timeout protection |
| `backend/app/modules/videos/workflow.py` | Semaphore(2) concurrency limit + font cleanup + Whisper logic deduplication |
| `backend/app/modules/videos/task_store.py` | Redis TTL + expired index cleanup |
| `backend/app/services/lipsync_service.py` | Removed the dead `_preprocess_video()` code |
| `backend/app/services/remotion_service.py` | concurrency 16 → 4 |
| `remotion/render.ts` | Added support for a concurrency parameter |
| `backend/app/modules/materials/router.py` | Added the `/stream/{material_id}` same-origin proxy endpoint |
| `frontend/.../useVideoFrameCapture.ts` | Removed crossOrigin |
| `frontend/.../useHomeController.ts` | Frame-capture URL switched to the same-origin proxy + emotion state + emotionToInstruct mapping |
| `backend/.env` | Lip-sync parameters + Alipay domain update |
| `models/CosyVoice/cosyvoice_server.py` | `/generate` gains the `instruct_text` parameter and branches between `inference_instruct2` / `inference_zero_shot` |
| `backend/app/services/voice_clone_service.py` | `_generate_once` / `generate_audio` pass `instruct_text` through |
| `backend/app/modules/generated_audios/schemas.py` | `GenerateAudioRequest` gains the `instruct_text` field |
| `backend/app/modules/generated_audios/service.py` | voiceclone branch passes `instruct_text` |
| `frontend/.../useGeneratedAudios.ts` | `generateAudio` params gain `instruct_text` |
| `frontend/.../useHomePersistence.ts` | emotion persistence (localStorage) |
| `frontend/.../GeneratedAudiosPanel.tsx` | Tone dropdown UI (embedded + standalone) |
| `frontend/.../HomePage.tsx` | Passes through emotion / onEmotionChange |

---

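## 🔍 Verification

1. **Subtitle sync**: Generate a video and watch the per-character highlight; it should not run ahead, lag, or skip
2. **compose stream copy**: The FFmpeg log for the compose step should show `-c:v copy`, and the step should drop from minutes to seconds
3. **FFmpeg timeout**: Confirm in the code that the timeout parameters were added
4. **Concurrency limit**: Submit 3 tasks back to back; the 3rd should show "排队中" ("Queued") and start only after one of the first 2 finishes
5. **Redis TTL**: `redis-cli TTL vigent:tasks:<id>` should report an expiry time
6. **Font cleanup**: After generating a video, no font files should remain in the temp directory
7. **Preview background**: Select a material → click "Preview style" (预览样式); the first frame of the video should appear (not the gradient)
8. **Alipay**: After starting a payment, the callback and return URLs use the new domain
9. **Tone control**: In voice-clone mode, select "Cheerful" or "Serious" and generate a voice-over; the CosyVoice log should show `🎭 Instruct mode` and the audio's tone should change noticeably
10. **Tone default**: Selecting "Normal" behaves exactly as before the change (goes through `inference_zero_shot`)
11. **Tone persistence**: After switching tone and refreshing the page, the dropdown restores the last selection
12. **Tone visibility**: The tone dropdown appears only in voiceclone mode, not in edgetts mode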