ViGent2/Docs/DevLogs/Day29.md at abf005f2252805aa40bb917c1a0a2cfa5cb54812

Kevin Wong abf005f225 Update

2026-02-28 17:49:32 +08:00

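Subtitle sync fix + lip-sync parameter tuning + full video pipeline optimization + preview background fix + CosyVoice tone control (Day 29)

Overview

This round was a full review and optimization of the video generation pipeline: fixing subtitle/audio desync (Whisper timestamp smoothing + original-text rhythm mapping), tuning the LatentSync lip-sync parameters, using stream copy in compose to skip a redundant re-encode, FFmpeg timeout protection, a global concurrency limit, Redis task TTLs, temporary file cleanup, and dead code removal. It also fixes the style preview background, whose CORS setup broke after the frontend domain migration. Finally, it adds CosyVoice tone control: voice-clone mode now supports emotional delivery such as happy, sad, and angry (based on inference_instruct2).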

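✅ Changes

1. Subtitle sync fix (Whisper timestamps + original-text rhythm mapping)

Problem: Subtitle highlighting was out of sync with the audio: subtitles ran ahead or lagged, and the highlight skipped.
Root cause: Whisper's per-character timestamps carry small jitter (a character's end can be later than the next character's start), and gaps between characters make the highlight flicker.

whisper_service.py: timestamp post-processing

Added smooth_word_timestamps(), which smooths in three steps:

Monotonicity: a character's start is never earlier than the previous character's start
Overlap removal: when two characters overlap in time, split at the midpoint
Gap filling: gaps under 50 ms are closed directly so the highlight does not skip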

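def smooth_word_timestamps(words):
    # Excerpt (monotonicity step not shown); words are dicts with "start"/"end" in seconds
    for i in range(1, len(words)):
        prev, w = words[i - 1], words[i]
        # Overlap -> split at the midpoint
        if w["start"] < prev["end"]:
            mid = (prev["end"] + w["start"]) / 2
            prev["end"] = mid
            w["start"] = mid
        # Tiny gap (under 50 ms) -> join directly
        gap = w["start"] - prev["end"]
        if 0 < gap < 0.05:
            prev["end"] = w["start"]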

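whisper_service.py: original-text rhythm mapping

Problem: AI-rewritten or multilingual copy does not match Whisper's transcription, so using Whisper's text directly produces garbled subtitles.
Solution: When the original_text parameter is non-empty, replace Whisper's text with the original characters while keeping Whisper's speech-rhythm timestamps.
Implementation: map N original characters proportionally onto M Whisper timestamps (linear interpolation)
Character-count ratio check (warn when the ratio is > 1.5x or < 0.67x)
Per-character duration clamp: 40 ms to 800 ms, to prevent extreme drift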
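
The mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in whisper_service.py: the function name map_text_to_rhythm, the "word" key, the direction of the ratio (original count over Whisper count), and the choice to interpolate over word start boundaries are all assumptions. Only the 1.5x / 0.67x thresholds and the 40 ms to 800 ms clamp come from the notes.

```python
import logging

logger = logging.getLogger(__name__)

MIN_CHAR_DURATION = 0.04  # 40 ms, from the notes above
MAX_CHAR_DURATION = 0.80  # 800 ms, from the notes above


def map_text_to_rhythm(original_text, whisper_words):
    """Spread the characters of original_text over Whisper's timing.

    Hypothetical helper: whisper_words is assumed to be a time-ordered
    list of dicts with "start" and "end" in seconds.
    """
    chars = [c for c in original_text if not c.isspace()]
    n, m = len(chars), len(whisper_words)
    if n == 0 or m == 0:
        return []

    # Warn when the character counts differ too much to trust the mapping.
    # (Assumed direction: original count divided by Whisper count.)
    ratio = n / m
    if ratio > 1.5 or ratio < 0.67:
        logger.warning("char count ratio %.2f is unusual (%d vs %d)", ratio, n, m)

    # m + 1 boundaries: every word start, plus the end of the last word.
    bounds = [w["start"] for w in whisper_words] + [whisper_words[-1]["end"]]

    def time_at(pos):
        """Linearly interpolate a time for a fractional word position in [0, m]."""
        i = min(int(pos), m - 1)
        return bounds[i] + (pos - i) * (bounds[i + 1] - bounds[i])

    mapped = []
    for k, ch in enumerate(chars):
        start = time_at(k * m / n)
        end = time_at((k + 1) * m / n)
        # Clamp the duration so a single character cannot drift to extremes.
        duration = min(max(end - start, MIN_CHAR_DURATION), MAX_CHAR_DURATION)
        mapped.append({"word": ch, "start": start, "end": start + duration})
    return mapped
```

Because clamping can push an end past the next start, a smoothing pass such as smooth_word_timestamps() would presumably run afterwards; the notes do not say in which order the real code applies the two.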

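captions.ts: subtitle lookup on the Remotion side

Added getCurrentSegment() and getCurrentWordIndex():

Look up the subtitle segment to display and the index of the highlighted character from the current frame time
Handle gaps between characters (between two characters, return the previous character's index so the highlight stays continuous)
Past the end time of the last character, return the last character (avoids flicker at the end)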
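
The lookup rules above can be illustrated with a short sketch. The real helper lives in captions.ts (TypeScript); Python is used here only to keep the examples in one language. The snake_case name, the -1 return before the first character, and the "start"/"end" keys in seconds are assumptions.

```python
def get_current_word_index(words, t):
    """Return the index of the character to highlight at time t (seconds).

    Returns -1 before the first character starts (an assumed convention).
    """
    if not words or t < words[0]["start"]:
        return -1
    for i, w in enumerate(words):
        if w["start"] <= t < w["end"]:
            return i          # inside this character
        if t < w["start"]:
            return i - 1      # in the gap before it: keep the previous one lit
    return len(words) - 1     # past the end: stay on the last character
```

A linear scan is enough for a single subtitle segment; a binary search would only matter for very long word lists.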

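2. LatentSync lip-sync parameter tuning

| Parameter | Day 28 value | Day 29 value | Notes |
|---|---|---|---|
| `LATENTSYNC_INFERENCE_STEPS` | 16 | 20 | More steps for better lip-sync quality |
| `LATENTSYNC_GUIDANCE_SCALE` | (default) | 2.0 | Balances lip-sync accuracy against naturalness |
| `LATENTSYNC_ENABLE_DEEPCACHE` | (default) | true | DeepCache speeds up inference |
| `LATENTSYNC_SEED` | (default) | 1247 | Fixed seed for reproducibility |
| Remotion concurrency | 16 | 4 | Lower concurrency to avoid resource contention |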
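
As a rough illustration of how these variables might be consumed, the sketch below reads them from the environment. The LatentSyncSettings class and load_latentsync_settings function are hypothetical, and the fallback values are simply the Day 29 values from the table; the notes do not state the real defaults or how the backend parses them.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentSyncSettings:
    inference_steps: int
    guidance_scale: float
    enable_deepcache: bool
    seed: int


def load_latentsync_settings(env=os.environ):
    """Parse the LATENTSYNC_* variables from a mapping (hypothetical helper)."""
    return LatentSyncSettings(
        inference_steps=int(env.get("LATENTSYNC_INFERENCE_STEPS", "20")),
        guidance_scale=float(env.get("LATENTSYNC_GUIDANCE_SCALE", "2.0")),
        # Treat any casing of "true" as enabled; everything else disables it.
        enable_deepcache=env.get("LATENTSYNC_ENABLE_DEEPCACHE", "true").lower() == "true",
        seed=int(env.get("LATENTSYNC_SEED", "1247")),
    )
```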

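3. compose(): stream copy instead of a redundant re-encode (high priority)

File: video_service.py

Problem: compose() only muxes the video track with the audio track, yet it ran a full libx264 -preset medium -crf 20 re-encode every time, which took several minutes. Across the whole pipeline a single video could be x264-encoded up to 5 times.
Solution: When no looping is needed (loop_count == 1), use -c:v copy (stream copy), which finishes almost instantly; when looping is needed, still use libx264.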

if loop_count > 1:
    cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
else:
    cmd.extend(["-c:v", "copy"])

The compose output is an intermediate artifact (Remotion encodes it again), so stream copy saves one encode with no quality loss.
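
To show where that branch sits in a complete command, here is a minimal sketch of a mux command builder. The function name build_compose_cmd, the use of -stream_loop for looping, the stream mapping, and the AAC audio codec are assumptions; only the -c:v branch comes from the excerpt above.

```python
def build_compose_cmd(video_path, audio_path, output_path, loop_count=1):
    """Build an ffmpeg command that muxes one video track with one audio track."""
    cmd = ["ffmpeg", "-y"]
    if loop_count > 1:
        # -stream_loop N replays the next input N extra times, so N = loop_count - 1.
        cmd.extend(["-stream_loop", str(loop_count - 1)])
    cmd.extend(["-i", video_path, "-i", audio_path])
    # Take video from the first input and audio from the second.
    cmd.extend(["-map", "0:v:0", "-map", "1:a:0"])
    if loop_count > 1:
        cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
    else:
        cmd.extend(["-c:v", "copy"])  # no re-encode: packets are copied as-is
    cmd.extend(["-c:a", "aac", "-shortest", output_path])
    return cmd
```

Returning the argument list instead of running it keeps the branch easy to unit-test without FFmpeg installed.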

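4. FFmpeg timeout protection (high priority)

File: video_service.py

_run_ffmpeg(): added timeout=600 (10 minutes) and catches subprocess.TimeoutExpired
_get_duration(): added timeout=30
Prevents a malformed video from hanging FFmpeg forever and blocking background tasks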
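
The timeout pattern can be sketched as below. The helper name run_with_timeout and its boolean return convention are assumptions for illustration; the notes only state that _run_ffmpeg() passes timeout=600 and catches subprocess.TimeoutExpired.

```python
import subprocess


def run_with_timeout(cmd, timeout=600):
    """Run cmd; return True on exit code 0, False on failure or timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before re-raising,
        # so no process is left hanging after a timeout.
        return False
    return result.returncode == 0
```

For example, run_with_timeout(["ffmpeg", "-i", "in.mp4", "out.mp4"]) would give up after 10 minutes instead of blocking the worker indefinitely.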

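5. Global task concurrency limit (high priority)

File: workflow.py

Module-level asyncio.Semaphore(2), acquired at the entry of process_video_generation()
Queued tasks show the status "排队中..." ("Queued...")
Prevents multiple requests from running FFmpeg + Remotion at once and exhausting CPU and memory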

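import asyncio

# At most two generation tasks run at the same time
_generation_semaphore = asyncio.Semaphore(2)

async def process_video_generation(task_id, req, user_id):
    # Shown while the task waits for a slot ("排队中..." = "Queued...")
    _update_task(task_id, message="排队中...")
    async with _generation_semaphore:
        await _process_video_generation_inner(task_id, req, user_id)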
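
The effect of the semaphore can be checked in isolation with a small self-contained demo. Everything here (run_limited, the sleep standing in for real work) is illustrative and not part of workflow.py.

```python
import asyncio


async def run_limited(n_tasks, limit=2, work_seconds=0.01):
    """Run n_tasks coroutines behind a semaphore and report peak concurrency."""
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def job():
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(work_seconds)  # stand-in for FFmpeg + Remotion work
            running -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak
```

Calling asyncio.run(run_limited(5)) never reports more than two jobs inside the guarded section, which is the behavior the queue status relies on.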

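6. Redis task TTL + index cleanup (medium priority)

File: task_store.py

create(): sets a 24-hour TTL (ex=86400)
update(): sets a 2-hour TTL (ex=7200) for completed/failed tasks, 24 hours otherwise
list(): while iterating, also removes expired entries from the index (srem)
Fixes task keys piling up in Redis indefinitely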
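
A minimal sketch of the TTL and index-cleanup logic follows. It assumes a redis-py style client created with decode_responses=True (so set members come back as strings), a set-based index under the hypothetical key vigent:tasks:index, and JSON-encoded task bodies. The function names are illustrative; the real task_store.py methods are create(), update(), and list().

```python
import json

ACTIVE_TTL = 86400    # 24 hours, for tasks still pending or running
FINISHED_TTL = 7200   # 2 hours, once a task is completed or failed
INDEX_KEY = "vigent:tasks:index"  # hypothetical name for the index set


def ttl_for_status(status):
    """Pick the expiry in seconds for a task in the given status."""
    return FINISHED_TTL if status in ("completed", "failed") else ACTIVE_TTL


def save_task(client, task_id, task):
    """Write the task with a status-dependent TTL and index its id."""
    key = f"vigent:tasks:{task_id}"
    client.set(key, json.dumps(task), ex=ttl_for_status(task.get("status")))
    client.sadd(INDEX_KEY, task_id)


def list_tasks(client):
    """Return live tasks and drop index entries whose key has expired."""
    tasks = []
    for task_id in list(client.smembers(INDEX_KEY)):
        raw = client.get(f"vigent:tasks:{task_id}")
        if raw is None:
            client.srem(INDEX_KEY, task_id)  # key expired: clean the index
            continue
        tasks.append(json.loads(raw))
    return tasks
```

Cleaning the index lazily during listing avoids a separate sweeper job, at the cost of stale ids lingering until the next list call.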

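7. Temporary font file cleanup (medium priority)

File: workflow.py

prepare_style_for_remotion() copies fonts into temp_dir but never added them to the cleanup list
It now iterates over three prefixes (subtitle/title/secondary_title) × four extensions (.ttf/.otf/.woff/.woff2) and adds every font file that exists to temp_files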
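
The prefix × extension scan might look like the sketch below. The helper name collect_temp_fonts and the file naming scheme "<prefix>_font<ext>" are assumptions; the notes give only the prefixes and extensions.

```python
from pathlib import Path

FONT_PREFIXES = ("subtitle", "title", "secondary_title")
FONT_EXTENSIONS = (".ttf", ".otf", ".woff", ".woff2")


def collect_temp_fonts(temp_dir):
    """Return the font files in temp_dir that should be cleaned up."""
    temp_dir = Path(temp_dir)
    found = []
    for prefix in FONT_PREFIXES:
        for ext in FONT_EXTENSIONS:
            # Hypothetical naming scheme: "<prefix>_font<ext>".
            candidate = temp_dir / f"{prefix}_font{ext}"
            if candidate.exists():
                found.append(candidate)
    return found
```

The caller would then extend its cleanup list with the result, for example temp_files.extend(collect_temp_fonts(temp_dir)).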

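8. Whisper+split logic deduplication (low priority)

File: workflow.py

The two branches (custom_assignments mismatch vs. default) contained identical Whisper → _split_equal code (36 duplicated lines)
Extracted into an inner function, _whisper_and_split(), shared by both branches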

9. LipSync dead code cleanup (low priority)

File: lipsync_service.py

Removed the _preprocess_video() method (92 lines), which was not called anywhere in the project

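10. Title/subtitle preview background CORS fix

Problem: After the frontend domain moved from vigent.hbyrkj.top to ipagent.ai-labz.cn, the signed material URLs (api.hbyrkj.top) no longer share a root domain with the frontend, and the Supabase Kong gateway's CORS configuration does not cover the new domain → <video crossOrigin="anonymous"> fails to load → canvas frame capture fails → the preview falls back to a gradient background
Root cause: The Day 28 implementation relied on Supabase returning an Access-Control-Allow-Origin header, and that dependency broke with the domain change

Fix: same-origin proxy (sidesteps CORS entirely):

| Component | Change |
|---|---|
| `materials/router.py` | Added the `GET /api/materials/stream/{material_id}` endpoint, which reads straight from local disk via `get_local_file_path()` and returns a `FileResponse` |
| `useHomeController.ts` | Frame-capture URL changed to `/api/materials/stream/${mat.id}` (same origin) instead of the cross-origin signed URL |
| `useVideoFrameCapture.ts` | Removed `crossOrigin = "anonymous"`, which same-origin requests do not need |

Flow: user clicks preview → /api/materials/stream/xxx → Next.js rewrite → FastAPI FileResponse → same-origin <video> → canvas frame capture succeeds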

11. Alipay callback domain update

File: .env

ALIPAY_NOTIFY_URL=https://ipagent.ai-labz.cn/api/payment/notify
ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay

📁 Modified files

| File | Change |
|---|---|
| `backend/app/services/whisper_service.py` | Timestamp smoothing + original-text rhythm mapping + per-character duration clamp |
| `remotion/src/utils/captions.ts` | Added `getCurrentSegment` / `getCurrentWordIndex` |
| `backend/app/services/video_service.py` | compose stream copy + FFmpeg timeout protection |
| `backend/app/modules/videos/workflow.py` | Semaphore(2) concurrency limit + font cleanup + Whisper logic deduplication |
| `backend/app/modules/videos/task_store.py` | Redis TTL + expired index cleanup |
| `backend/app/services/lipsync_service.py` | Removed the dead `_preprocess_video()` code |
| `backend/app/services/remotion_service.py` | concurrency 16 → 4 |
| `remotion/render.ts` | Added support for a concurrency parameter |
| `backend/app/modules/materials/router.py` | Added the `/stream/{material_id}` same-origin proxy endpoint |
| `frontend/.../useVideoFrameCapture.ts` | Removed crossOrigin |
| `frontend/.../useHomeController.ts` | Frame-capture URL switched to the same-origin proxy |
| `backend/.env` | Lip-sync parameters + Alipay domain update |

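12. CosyVoice tone control

Feature: Voice-clone mode gets a new "Tone" dropdown with the options Normal (正常), Cheerful (欢快), Low (低沉), and Serious (严肃). It uses CosyVoice3's inference_instruct2() method to control tone and emotion through natural-language instructions
Default behavior unchanged: Choosing "Normal" still goes through inference_zero_shot(), exactly as before the change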

Data flow

User picks a tone → setEmotion("happy") → persisted to localStorage
  → generate voice-over → emotion mapped to instruct_text
    → POST /api/generated-audios/generate { instruct_text }
      → voice_clone_service → POST localhost:8010/generate { instruct_text }
        → instruct_text non-empty ? inference_instruct2() : inference_zero_shot()

CosyVoice service: `cosyvoice_server.py`

The /generate endpoint gains an instruct_text: str = Form("") parameter
Inference branch: empty → inference_zero_shot(), non-empty → inference_instruct2(text, instruct_text, ref_audio_path, ...)
inference_instruct2 does not need prompt_text; it takes instruct_text + prompt_wav directly

Backend pass-through

schemas.py: GenerateAudioRequest gains instruct_text: Optional[str] = None
service.py: the voiceclone branch of generate_audio_task() passes instruct_text=req.instruct_text or ""
voice_clone_service.py: _generate_once() and generate_audio() gain an instruct_text parameter

Frontend

useHomeController.ts: added emotion state + the emotionToInstruct mapping table
useHomePersistence.ts: tone selection persisted to localStorage
useGeneratedAudios.ts: generateAudio params gain instruct_text
GeneratedAudiosPanel.tsx: tone dropdown (left of the speed button), reusing the speed dropdown's styling, visible only in voiceclone mode
HomePage.tsx: passes emotion/onEmotionChange through

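instruct_text format (from the CosyVoice3 instruct_list)

Normal (正常): "" (uses inference_zero_shot)
Cheerful (欢快): "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>" (the Chinese instruction means "say a sentence very happily")
Low (低沉): "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>" ("say a sentence very sadly")
Serious (严肃): "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>" ("say a sentence very angrily")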
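
The mapping and the branch it drives can be summarized in a short sketch. The real mapping table (emotionToInstruct) lives in the TypeScript frontend and the branch in cosyvoice_server.py; the Python names below are illustrative, and apart from "happy", which appears in the data flow above, the emotion keys are assumptions. The function only returns the name of the method that would be called and does not invoke CosyVoice.

```python
# Instruction strings copied from the list above; keys other than "happy" are assumed.
EMOTION_TO_INSTRUCT = {
    "normal": "",
    "happy": "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
    "sad": "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
    "angry": "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
}


def choose_inference_method(emotion):
    """Return (method_name, instruct_text) for the given emotion key."""
    # Unknown emotions fall back to the default zero-shot path.
    instruct_text = EMOTION_TO_INSTRUCT.get(emotion, "")
    if instruct_text:
        return "inference_instruct2", instruct_text
    return "inference_zero_shot", ""
```

Using the empty string as the "no instruction" marker matches the Form("") default on the /generate endpoint, so the default path needs no extra flag.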

📁 Modified files

| File | Change |
|---|---|
| `backend/app/services/whisper_service.py` | Timestamp smoothing + original-text rhythm mapping + per-character duration clamp |
| `remotion/src/utils/captions.ts` | Added `getCurrentSegment` / `getCurrentWordIndex` |
| `backend/app/services/video_service.py` | compose stream copy + FFmpeg timeout protection |
| `backend/app/modules/videos/workflow.py` | Semaphore(2) concurrency limit + font cleanup + Whisper logic deduplication |
| `backend/app/modules/videos/task_store.py` | Redis TTL + expired index cleanup |
| `backend/app/services/lipsync_service.py` | Removed the dead `_preprocess_video()` code |
| `backend/app/services/remotion_service.py` | concurrency 16 → 4 |
| `remotion/render.ts` | Added support for a concurrency parameter |
| `backend/app/modules/materials/router.py` | Added the `/stream/{material_id}` same-origin proxy endpoint |
| `frontend/.../useVideoFrameCapture.ts` | Removed crossOrigin |
| `frontend/.../useHomeController.ts` | Frame-capture URL switched to the same-origin proxy + emotion state + emotionToInstruct mapping |
| `backend/.env` | Lip-sync parameters + Alipay domain update |
| `models/CosyVoice/cosyvoice_server.py` | `/generate` gains an `instruct_text` parameter and branches between `inference_instruct2` / `inference_zero_shot` |
| `backend/app/services/voice_clone_service.py` | `_generate_once` / `generate_audio` pass `instruct_text` through |
| `backend/app/modules/generated_audios/schemas.py` | `GenerateAudioRequest` gains an `instruct_text` field |
| `backend/app/modules/generated_audios/service.py` | The voiceclone branch passes `instruct_text` |
| `frontend/.../useGeneratedAudios.ts` | `generateAudio` params gain `instruct_text` |
| `frontend/.../useHomePersistence.ts` | emotion persistence (localStorage) |
| `frontend/.../GeneratedAudiosPanel.tsx` | Tone dropdown UI (embedded + standalone) |
| `frontend/.../HomePage.tsx` | Passes emotion / onEmotionChange through |

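🔍 Verification

Subtitle sync: generate a video and watch the per-character highlight; it should not run ahead, lag, or skip
compose stream copy: the FFmpeg log for the compose step should show -c:v copy, and the step should drop from minutes to seconds
FFmpeg timeout: confirm in the code that the timeout parameters were added
Concurrency limit: submit 3 tasks in a row; the third should show "排队中" ("Queued") and start only after one of the first two finishes
Redis TTL: redis-cli TTL vigent:tasks:<id> should report an expiry time
Font cleanup: no font files should remain in the temp directory after a video is generated
Preview background: select a material → click "预览样式" (Preview style); the first frame of the video should appear (not the gradient)
Alipay: after starting a payment, the callback and return URLs use the new domain
Tone control: in voice-clone mode, generate a voice-over with the happy or angry tone ("欢快" / "严肃" in the dropdown); the CosyVoice log should show 🎭 Instruct mode and the audio tone should change noticeably
Tone default: choosing "Normal" behaves exactly as before the change (uses inference_zero_shot)
Tone persistence: after switching the tone and refreshing the page, the dropdown restores the last selection
Tone visibility: the tone dropdown appears only in voiceclone mode, not in edgetts mode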
