Compare commits

...

2 Commits

Author SHA1 Message Date
Kevin Wong
abf005f225 Update 2026-02-28 17:49:32 +08:00
Kevin Wong
9de2cb40b4 Update 2026-02-28 14:44:51 +08:00
29 changed files with 748 additions and 284 deletions

View File

@@ -65,6 +65,7 @@ backend/
* `POST /api/materials`: upload a material
* `GET /api/materials`: list materials
* `PUT /api/materials/{material_id}`: rename a material
* `GET /api/materials/stream/{material_id}`: stream a material file from the same origin (used for frontend canvas frame capture; avoids cross-origin CORS taint)
4. **Social publishing (Publish)**
* `POST /api/publish`: publish a video to Douyin / WeChat Channels / Bilibili / Xiaohongshu
@@ -160,6 +161,18 @@ backend/
- Multi-material segments are uniformly re-encoded before concatenation, forcing `25fps + CFR` to reduce stutter caused by inconsistent time bases at segment boundaries.
- The concat flow enables `+genpts` to rebuild timestamps, improving timeline continuity after splicing.
- MOV materials carrying rotation metadata are orientation-normalized first, before the resolution check and the rest of the pipeline.
- The compose stage (muxing the video and audio tracks) uses `-c:v copy` stream copy instead of re-encoding and finishes almost instantly.
- FFmpeg subprocesses have timeout protection: 600 s for `_run_ffmpeg()`, 30 s for `_get_duration()`, preventing malformed files from hanging forever.
### Global concurrency control
- The video-generation entry point uses `asyncio.Semaphore(2)` to cap execution at 2 concurrent tasks; queued tasks show the "排队中..." (queued) status.
- Redis task keys carry TTLs (24 hours on creation; 2 hours once completed/failed); `list()` cleans up expired index entries automatically.
### Subtitle timestamp optimization
- Whisper output goes through `smooth_word_timestamps()`, a three-step smoothing pass: enforce monotonic increase, eliminate overlaps (midpoint split), and fill tiny gaps (<50ms).
- Supports `original_text` rhythm mapping: original characters are mapped proportionally onto Whisper timestamps, resolving mismatches between AI-rewritten/multilingual copy and the transcription.
## 📦 Asset library and static assets

View File

@@ -70,6 +70,18 @@ run_cosyvoice.sh # PM2 launch script
| ref_text | string | Yes | Transcript of the reference audio |
| language | string | No | Language (default "Chinese"; CosyVoice auto-detects) |
| speed | float | No | Speech rate (default 1.0, range 0.5-2.0, recommended 0.8-1.2) |
| instruct_text | string | No | Tone instruction (default ""; a non-empty value switches to `inference_instruct2` mode) |
**Inference mode branching:**
- `instruct_text` empty → `inference_zero_shot(text, prompt_text, ref_audio)` — pure voice cloning
- `instruct_text` non-empty → `inference_instruct2(text, instruct_text, ref_audio)` — voice cloning with tone/emotion control
**Supported tone instruction examples:**
```
"You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>"
"You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>"
"You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>"
```
**Returns:** a WAV audio file

View File

@@ -211,8 +211,10 @@ cp .env.example .env
| `SUPABASE_PUBLIC_URL` | `https://api.hbyrkj.top` | Supabase API public URL (frontend access) |
| `LATENTSYNC_GPU_ID` | 1 | GPU selection (0 or 1) |
| `LATENTSYNC_USE_SERVER` | false | Set to true to enable the resident-service speedup |
| `LATENTSYNC_INFERENCE_STEPS` | 16 | Inference steps (16-50) |
| `LATENTSYNC_GUIDANCE_SCALE` | 1.5 | Guidance scale (1.0-3.0) |
| `LATENTSYNC_INFERENCE_STEPS` | 20 | Inference steps (16-50) |
| `LATENTSYNC_GUIDANCE_SCALE` | 2.0 | Guidance scale (1.0-3.0) |
| `LATENTSYNC_ENABLE_DEEPCACHE` | true | DeepCache inference speedup |
| `LATENTSYNC_SEED` | 1247 | Fixed random seed (reproducible) |
| `DEBUG` | true | Set to false in production |
| `REDIS_URL` | `redis://localhost:6379/0` | Task-state store (falls back to memory when unavailable) |
| `WEIXIN_HEADLESS_MODE` | headless-new | Channels Playwright mode (headful/headless-new) |

Docs/DevLogs/Day29.md Normal file
View File

@@ -0,0 +1,283 @@
## Subtitle sync fix + lip-sync parameter tuning + full video-pipeline optimization + preview background fix + CosyVoice tone control (Day 29)
### Overview
This round is a full review-and-optimize pass over the video generation pipeline: fixed subtitle/audio desync (Whisper timestamp smoothing + original-text rhythm mapping), tuned LatentSync lip-sync parameters, replaced the redundant compose re-encode with stream copy, added FFmpeg timeout protection, a global concurrency limit, Redis task TTLs, temp-file cleanup, and dead-code removal. Fixed the style-preview background CORS breakage caused by the frontend domain migration. Added CosyVoice tone control: voice-clone mode now supports happy/sad/angry emotional delivery (based on `inference_instruct2`).
---
## ✅ Changes
### 1. Subtitle sync fix (Whisper timestamps + original-text rhythm mapping)
- **Problem**: subtitle highlighting drifted out of sync with the audio (running ahead/behind, with blank highlight gaps)
- **Root cause**: Whisper's per-character timestamps contain tiny jitter (a character's end > the next character's start), and inter-character gaps make the highlight "flicker"
#### whisper_service.py — timestamp post-processing
Added a `smooth_word_timestamps()` function with three smoothing steps:
1. **Monotonic increase**: a character's start is never earlier than the previous character's start
2. **Overlap elimination**: when two characters overlap in time, split at the midpoint
3. **Gap filling**: gaps < 50 ms between characters are joined directly, avoiding highlight gaps
```python
def smooth_word_timestamps(words):
    for i in range(1, len(words)):
        prev, w = words[i - 1], words[i]
        # overlap → split at the midpoint
        if w["start"] < prev["end"]:
            mid = (prev["end"] + w["start"]) / 2
            prev["end"] = mid; w["start"] = mid
        # tiny gap → join directly
        gap = w["start"] - prev["end"]
        if 0 < gap < 0.05:
            prev["end"] = w["start"]
```
#### whisper_service.py — original-text rhythm mapping
- **Problem**: AI-rewritten or multilingual copy differs from Whisper's transcription; using Whisper's text directly produces garbled subtitles
- **Approach**: when the `original_text` parameter is non-empty, substitute the original characters for Whisper's text while keeping Whisper's speech-rhythm timestamps
- Implementation: map the N original characters proportionally onto the M Whisper timestamps (linear interpolation)
- Character-count ratio anomaly detection (warns when >1.5x or <0.67x)
- Per-character duration clamping (40ms to 800ms) to prevent extreme drift
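To make the proportional mapping concrete, here is a minimal sketch of the idea. The function name `rhythm_map` and its input shape are assumptions for illustration; the real `whisper_service.py` works on its own character dicts:

```python
def rhythm_map(orig_chars, w_starts, w_final_end):
    """Map N original characters onto M Whisper timestamps by linear interpolation.

    orig_chars: characters from original_text. w_starts: Whisper per-character
    start times (sorted). w_final_end: end time of the last Whisper character.
    (Hypothetical helper; the production code additionally clamps durations.)
    """
    n_o, n_w = len(orig_chars), len(w_starts)
    grid = list(w_starts) + [w_final_end]  # n_w + 1 boundary times

    def t(pos):
        # interpolate a fractional grid position into a timestamp
        lo = min(int(pos), n_w - 1)
        frac = pos - lo
        return grid[lo] + frac * (grid[lo + 1] - grid[lo])

    out = []
    for i, ch in enumerate(orig_chars):
        a = i * n_w / n_o        # fractional start position on the Whisper grid
        b = (i + 1) * n_w / n_o  # fractional end position
        out.append({"char": ch, "start": round(t(a), 3), "end": round(t(b), 3)})
    return out
```

For 4 original characters over 2 Whisper characters spanning 0.0-2.0 s, each original character gets a 0.5 s slot, and adjacent slots share boundaries, so the highlight never skips.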
#### captions.ts — Remotion-side subtitle lookup
Added `getCurrentSegment()` and `getCurrentWordIndex()`:
- Look up the exact subtitle segment and highlighted-character index for the current frame time
- Handle inter-character gaps (between two characters, return the previous character's index so the highlight stays continuous)
- Past the last character's end time, return the last character (avoids end-of-clip flicker)
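The lookup rules above can be sketched in Python (the actual helper lives in TypeScript in `captions.ts`; this is an assumed analogue for illustration):

```python
def get_current_word_index(words, t):
    """Return the index of the word to highlight at time t.

    Gap handling: between two words, return the previous word's index so the
    highlight never goes blank; past the last word's end, return the last
    index. `words` is a sorted list of {"start", "end"} dicts (assumed shape).
    """
    if not words:
        return -1
    for i, w in enumerate(words):
        if t < w["start"]:
            return max(i - 1, 0)  # in the gap before word i → keep previous
        if t < w["end"]:
            return i              # inside word i
    return len(words) - 1         # past the end → last word, no flicker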
---
### 2. LatentSync lip-sync parameter tuning
| Parameter | Day 28 value | Day 29 value | Notes |
|------|----------|----------|------|
| `LATENTSYNC_INFERENCE_STEPS` | 16 | 20 | Slightly more steps for better mouth-shape quality |
| `LATENTSYNC_GUIDANCE_SCALE` | (default) | 2.0 | Balance lip fit against naturalness |
| `LATENTSYNC_ENABLE_DEEPCACHE` | (default) | true | DeepCache inference speedup |
| `LATENTSYNC_SEED` | (default) | 1247 | Fixed seed for reproducibility |
| Remotion concurrency | 16 | 4 | Lower concurrency to avoid resource contention |
---
### 3. compose() stream copy replaces redundant re-encoding (high priority)
**File**: `video_service.py`
- **Problem**: `compose()` only muxes the video and audio tracks, yet ran a full `libx264 -preset medium -crf 20` re-encode every time, taking minutes. Across the whole pipeline a single video could be x264-encoded up to 5 times
- **Fix**: when no looping is needed (`loop_count == 1`), use `-c:v copy` stream copy, which finishes almost instantly; keep libx264 only when looping is required
```python
if loop_count > 1:
cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
else:
cmd.extend(["-c:v", "copy"])
```
- compose output is an intermediate artifact (Remotion encodes it again), so stream copy saves one encode with no quality loss
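A runnable sketch of the codec-choice logic (the function name and argument list here are assumptions, not the real `compose()` signature):

```python
def build_compose_cmd(video, audio, output, loop_count=1):
    """Build an FFmpeg mux command: stream copy for mux-only runs,
    re-encode only when the video must be looped (sketch of the Day 29 change)."""
    cmd = ["ffmpeg", "-y", "-i", video, "-i", audio]
    if loop_count > 1:
        cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
    else:
        cmd.extend(["-c:v", "copy"])  # no re-encode: near-instant, lossless
    cmd.extend(["-c:a", "aac", "-b:a", "192k", "-shortest",
                "-map", "0:v", "-map", "1:a", output])
    return cmd
```

Stream copy is safe here precisely because the output is an intermediate file that a later stage re-encodes anyway.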
---
### 4. FFmpeg timeout protection (high priority)
**File**: `video_service.py`
- `_run_ffmpeg()`: added `timeout=600` (10 minutes) and catches `subprocess.TimeoutExpired`
- `_get_duration()`: added `timeout=30`
- Prevents malformed videos from hanging FFmpeg forever and blocking background tasks
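The guard pattern is simple enough to show in full; this is a minimal sketch (not the real `_run_ffmpeg()`, which also logs stderr):

```python
import subprocess
import sys  # used only by the demo below


def run_with_timeout(cmd, timeout):
    """Run a subprocess with a hard timeout; return False instead of
    hanging forever when the process exceeds `timeout` seconds."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

A quick process (`python -c "pass"`) succeeds, while a process that sleeps past the deadline is killed and reported as a failure rather than blocking the worker.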
---
### 5. Global task concurrency limit (high priority)
**File**: `workflow.py`
- Module-level `asyncio.Semaphore(2)`, acquired at the `process_video_generation()` entry point
- Queued tasks show the "排队中..." (queued) status
- Prevents multiple requests from running FFmpeg + Remotion at the same time and exhausting CPU/memory
```python
_generation_semaphore = asyncio.Semaphore(2)
async def process_video_generation(task_id, req, user_id):
_update_task(task_id, message="排队中...")
async with _generation_semaphore:
await _process_video_generation_inner(task_id, req, user_id)
```
---
### 6. Redis task TTL + index cleanup (medium priority)
**File**: `task_store.py`
- `create()`: sets a 24-hour TTL (`ex=86400`)
- `update()`: completed/failed states get a 2-hour TTL (`ex=7200`); all other states keep 24 hours
- `list()`: removes expired index entries (`srem`) while iterating
- Fixes the unbounded accumulation of Redis task keys
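The TTL policy itself is a one-liner worth pinning down (helper name is hypothetical; the real code inlines this in `update()`):

```python
def task_ttl(status):
    """TTL policy from task_store.py: finished tasks expire sooner,
    so Redis task keys never accumulate indefinitely."""
    return 7200 if status in ("completed", "failed") else 86400
```

The value is passed as `ex=` to `redis.set()`, so every write refreshes the expiry according to the task's current state.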
---
### 7. Temp font file cleanup (medium priority)
**File**: `workflow.py`
- `prepare_style_for_remotion()` copied fonts into temp_dir but never added them to the cleanup list
- Now iterates three prefixes (subtitle/title/secondary_title) × four extensions (.ttf/.otf/.woff/.woff2) and appends any existing font files to `temp_files`
---
### 8. Whisper+split logic dedup (low priority)
**File**: `workflow.py`
- The Whisper → _split_equal code in the two branches (custom_assignments mismatch vs. default) was 100% identical (36 duplicated lines)
- Extracted into an inner function `_whisper_and_split()` shared by both branches
---
### 9. LipSync dead-code removal (low priority)
**File**: `lipsync_service.py`
- Deleted the `_preprocess_video()` method (92 lines); nothing in the project called it
---
### 10. Title/subtitle preview background CORS fix
- **Problem**: after the frontend domain moved from `vigent.hbyrkj.top` to `ipagent.ai-labz.cn`, the signed material URL (`api.hbyrkj.top`) no longer shares a root domain with the frontend; the Supabase Kong gateway's CORS config does not cover the new domain → `<video crossOrigin="anonymous">` fails to load → canvas frame capture fails → falls back to the gradient background
- **Root cause**: the Day 28 implementation relied on Supabase returning an `Access-Control-Allow-Origin` header; the domain change broke that assumption
**Fix — same-origin proxy (sidesteps CORS entirely):**
| Component | Change |
|------|------|
| `materials/router.py` | New `GET /api/materials/stream/{material_id}` endpoint; reads directly from local disk via `get_local_file_path()` and returns a `FileResponse` |
| `useHomeController.ts` | Frame-capture URL switched to `/api/materials/stream/${mat.id}` (same origin); the cross-origin signed URL is no longer used |
| `useVideoFrameCapture.ts` | Removed `crossOrigin = "anonymous"`; same-origin requests don't need it |
Chain: `user clicks preview → /api/materials/stream/xxx → Next.js rewrite → FastAPI FileResponse → same-origin <video> → canvas frame capture succeeds`
---
### 11. Alipay callback domain update
**File**: `.env`
```
ALIPAY_NOTIFY_URL=https://ipagent.ai-labz.cn/api/payment/notify
ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay
```
---
### 12. CosyVoice tone control
- **Feature**: voice-clone mode gains a "tone" dropdown (normal/happy/low/serious); CosyVoice 3's `inference_instruct2()` controls tone and emotion via natural-language instructions
- **Default behavior unchanged**: selecting "normal" still uses `inference_zero_shot()`, exactly as before
#### Data flow
```
user selects a tone → setEmotion("happy") → persisted to localStorage
→ generate voiceover → emotion mapped to instruct_text
→ POST /api/generated-audios/generate { instruct_text }
→ voice_clone_service → POST localhost:8010/generate { instruct_text }
→ instruct_text non-empty ? inference_instruct2() : inference_zero_shot()
```
#### CosyVoice server — `cosyvoice_server.py`
- The `/generate` endpoint gains an `instruct_text: str = Form("")` parameter
- Inference branch: empty → `inference_zero_shot()`; non-empty → `inference_instruct2(text, instruct_text, ref_audio_path, ...)`
- `inference_instruct2` needs no `prompt_text`; it takes `instruct_text` + `prompt_wav` directly
#### Backend pass-through
- `schemas.py`: `GenerateAudioRequest` gains `instruct_text: Optional[str] = None`
- `service.py`: the voiceclone branch of `generate_audio_task()` passes `instruct_text=req.instruct_text or ""`
- `voice_clone_service.py`: `_generate_once()` and `generate_audio()` gain an `instruct_text` parameter
#### Frontend
- `useHomeController.ts`: new `emotion` state + `emotionToInstruct` mapping table
- `useHomePersistence.ts`: tone choice persisted to localStorage
- `useGeneratedAudios.ts`: `generateAudio` params gain `instruct_text`
- `GeneratedAudiosPanel.tsx`: tone dropdown (left of the speed button), reusing the speed-dropdown styling; visible only in voiceclone mode
- `HomePage.tsx`: passes `emotion`/`onEmotionChange` through
#### instruct_text format (from the CosyVoice 3 instruct_list)
```
normal: "" (uses inference_zero_shot)
happy: "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>"
low: "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>"
serious: "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>"
```
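A minimal Python mirror of the branching logic (the table below mirrors the frontend `emotionToInstruct` mapping; the helper name `pick_inference_mode` is hypothetical, added only to make the rule testable):

```python
# Mirror of the frontend emotionToInstruct table; only "normal" maps to ""
# and therefore stays on the inference_zero_shot path.
EMOTION_TO_INSTRUCT = {
    "normal": "",
    "happy": "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
    "low": "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
    "serious": "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
}


def pick_inference_mode(emotion):
    """Decide which CosyVoice inference method the given emotion selects."""
    instruct_text = EMOTION_TO_INSTRUCT.get(emotion, "")
    return "inference_instruct2" if instruct_text else "inference_zero_shot"
```

Unknown emotions fall back to the empty instruction, which preserves the "default behavior unchanged" guarantee end to end.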
---
## 📁 Modified files
| File | Change |
|------|------|
| `backend/app/services/whisper_service.py` | Timestamp smoothing + original-text rhythm mapping + per-character duration clamping |
| `remotion/src/utils/captions.ts` | New `getCurrentSegment` / `getCurrentWordIndex` |
| `backend/app/services/video_service.py` | compose stream copy + FFmpeg timeout protection |
| `backend/app/modules/videos/workflow.py` | Semaphore(2) concurrency limit + font cleanup + Whisper logic dedup |
| `backend/app/modules/videos/task_store.py` | Redis TTL + expired-index cleanup |
| `backend/app/services/lipsync_service.py` | Deleted dead `_preprocess_video()` code |
| `backend/app/services/remotion_service.py` | concurrency 16 → 4 |
| `remotion/render.ts` | New concurrency parameter support |
| `backend/app/modules/materials/router.py` | New `/stream/{material_id}` same-origin proxy endpoint |
| `frontend/.../useVideoFrameCapture.ts` | Removed crossOrigin |
| `frontend/.../useHomeController.ts` | Frame-capture URL switched to the same-origin proxy + emotion state + emotionToInstruct mapping |
| `backend/.env` | Lip-sync parameters + Alipay domain update |
| `models/CosyVoice/cosyvoice_server.py` | `/generate` gains `instruct_text`; branches between `inference_instruct2` / `inference_zero_shot` |
| `backend/app/services/voice_clone_service.py` | `_generate_once` / `generate_audio` pass `instruct_text` through |
| `backend/app/modules/generated_audios/schemas.py` | `GenerateAudioRequest` gains an `instruct_text` field |
| `backend/app/modules/generated_audios/service.py` | voiceclone branch passes `instruct_text` |
| `frontend/.../useGeneratedAudios.ts` | `generateAudio` params gain `instruct_text` |
| `frontend/.../useHomePersistence.ts` | emotion persisted (localStorage) |
| `frontend/.../GeneratedAudiosPanel.tsx` | Tone dropdown UI (embedded + standalone) |
| `frontend/.../HomePage.tsx` | Passes emotion / onEmotionChange through |
---
## 🔍 Verification
1. **Subtitle sync**: generate a video and watch the per-character highlight; there should be no lead, lag, or blank gaps
2. **compose stream copy**: the compose step in the FFmpeg log should show `-c:v copy`; runtime drops from minutes to seconds
3. **FFmpeg timeouts**: confirm in code that the timeout arguments are in place
4. **Concurrency limit**: submit 3 tasks in a row; the 3rd should show "排队中" (queued) and start only after the first 2 finish
5. **Redis TTL**: `redis-cli TTL vigent:tasks:<id>` shows an expiry
6. **Font cleanup**: after generating a video, no font files remain in the temp directory
7. **Preview background**: pick a material → click "preview style"; the first video frame should appear (not a gradient)
8. **Alipay**: after initiating a payment, the callback and return URLs use the new domain
9. **Tone control**: in voice-clone mode pick "happy"/"angry" and generate a voiceover; the CosyVoice log shows `🎭 Instruct mode` and the audio tone changes audibly
10. **Tone default**: "normal" behaves exactly as before (uses `inference_zero_shot`)
11. **Tone persistence**: switch the tone and refresh the page; the dropdown restores the last choice
12. **Tone visibility**: the tone dropdown appears only in voiceclone mode, not in edgetts mode

View File

@@ -37,6 +37,7 @@ ViGent2's frontend UI, built with Next.js 16 + TailwindCSS.
- **Re-transcribe**: old reference audio can be re-transcribed and re-trimmed (RotateCw button).
- **One-click cloning**: selecting a reference audio automatically calls the CosyVoice 3.0 service.
- **Speed control**: voice-clone mode offers 5 speed levels (0.8-1.2); the choice is persisted (Day 23).
- **Tone control**: voice-clone mode offers 4 tones (normal/happy/low/serious) based on CosyVoice 3 `inference_instruct2`; the choice is persisted (Day 29).
- **Multilingual**: EdgeTTS 10-language voice list; voice-clone language pass-through (Day 22).
### 4. Voiceover-first + timeline orchestration [added Day 23]

View File

@@ -187,7 +187,7 @@ Remotion render parameters are configured in `backend/app/services/remotion_service.py`:
| Parameter | Default | Notes |
|------|--------|------|
| `fps` | 25 | Output frame rate |
| `concurrency` | 16 | Remotion concurrent render processes (default 16; overridable with the `--concurrency` CLI flag) |
| `concurrency` | 4 | Remotion concurrent render processes (default 4; overridable with the `--concurrency` CLI flag) |
| `title_display_mode` | `short` | Title display mode (`short` = brief display; `persistent` = always shown) |
| `title_duration` | 4.0 | Title display duration in seconds (only in `short` mode) |
@@ -295,3 +295,4 @@ WhisperService(device="cuda:0") # or "cuda:1"
| 2026-02-25 | 1.2.0 | Subtitle timestamps switched from linear interpolation to Whisper rhythm mapping; fixes subtitle drift in long videos |
| 2026-02-27 | 1.3.0 | Architecture diagram updated for MuseTalk hybrid routing; Remotion concurrent rendering raised from 8 to 16; GPU allocation notes updated |
| 2026-02-28 | 1.3.1 | MuseTalk compositing optimization: pure-numpy blending + FFmpeg pipe NVENC GPU hardware encoding replaces double encoding |
| 2026-02-28 | 1.4.0 | compose stream copy replaces re-encoding; FFmpeg timeout protection (600s/30s); Remotion concurrency 16→4; Whisper timestamp smoothing + original-text rhythm mapping; global video-generation Semaphore(2); Redis task TTL |

View File

@@ -1,8 +1,8 @@
# ViGent2 Development Task Log
**Project**: ViGent2 digital-human talking-head video generation system
**Progress**: 100% (Day 28 - CosyVoice FP16 speedup + full documentation update)
**Updated**: 2026-02-27
**Progress**: 100% (Day 29 - video pipeline optimization + CosyVoice tone control)
**Updated**: 2026-02-28
---
@@ -10,7 +10,18 @@
> Each day's core development work and milestones are recorded here.
### Day 28: CosyVoice FP16 speedup + full documentation update (Current)
### Day 29: Video pipeline optimization + CosyVoice tone control (Current)
- [x] **Subtitle sync fix**: three-step Whisper timestamp smoothing (monotonic increase + overlap elimination + gap filling) + original-text rhythm mapping (linear interpolation + per-character duration clamping).
- [x] **LatentSync lip-sync tuning**: inference_steps 16→20, guidance_scale 2.0, DeepCache enabled, Remotion concurrency 16→4.
- [x] **compose stream copy**: `-c:v copy` replaces the libx264 re-encode when not looping; compose time drops from minutes to seconds.
- [x] **FFmpeg timeout protection**: `_run_ffmpeg()` timeout=600, `_get_duration()` timeout=30.
- [x] **Global concurrency limit**: `asyncio.Semaphore(2)` caps the number of concurrently running generation tasks.
- [x] **Redis task TTL**: 24 h on create, 2 h once completed/failed; list auto-cleans expired index entries.
- [x] **Temp font cleanup**: font files added to the temp_files cleanup list.
- [x] **Preview background CORS fix**: same-origin material proxy `/api/materials/stream/{id}` sidesteps cross-origin entirely.
- [x] **CosyVoice tone control**: new tone dropdown (normal/happy/low/serious) in voice-clone mode, driven by `inference_instruct2()` natural-language instructions; instruct_text passed through the whole chain; the "normal" default behaves exactly as before.
### Day 28: CosyVoice FP16 speedup + full documentation update
- [x] **CosyVoice FP16 half-precision speedup**: `AutoModel()` with `fp16=True`; LLM inference and Flow Matching run in automatic mixed precision; estimated 30-40% faster with ~30% less VRAM.
- [x] **Full documentation update**: README.md / DEPLOY_MANUAL.md / SUBTITLE_DEPLOY.md / BACKEND_README.md updated with the MuseTalk hybrid lip-sync scheme, performance optimizations, Remotion concurrent rendering, etc.
@@ -258,7 +269,7 @@
| **Core API** | 100% | ✅ Stable |
| **Web UI** | 100% | ✅ Stable (mobile-adapted) |
| **Lip sync** | 100% | ✅ LatentSync 1.6 |
| **TTS voiceover** | 100% | ✅ EdgeTTS + CosyVoice 3.0 + voiceover-first + timeline orchestration + auto-transcription + speed control |
| **TTS voiceover** | 100% | ✅ EdgeTTS + CosyVoice 3.0 + voiceover-first + timeline orchestration + auto-transcription + speed control + tone control |
| **Auto publishing** | 100% | ✅ Douyin / WeChat Channels / Bilibili / Xiaohongshu |
| **User auth** | 100% | ✅ Phone number + JWT |
| **Paid membership** | 100% | ✅ Alipay desktop web payment + auto activation |

View File

@@ -17,7 +17,7 @@
### Core capabilities
- 🎬 **HD lip sync** - Hybrid scheme: short videos (<120s) use LatentSync 1.6 (high-quality latent diffusion); long videos (>=120s) use MuseTalk 1.5 (real-time single-step inference); automatic routing + fallback.
- 🎙️ **Multi-modal voiceover** - Supports **EdgeTTS** (Microsoft ultra-natural voices, 10 languages) and **CosyVoice 3.0** (3-second rapid voice cloning, 9 languages + 18 dialects, adjustable speed). Uploaded reference audio is Whisper-transcribed and smart-trimmed automatically. Voiceover-first workflow: generate the voiceover → pick materials → generate the video.
- 🎙️ **Multi-modal voiceover** - Supports **EdgeTTS** (Microsoft ultra-natural voices, 10 languages) and **CosyVoice 3.0** (3-second rapid voice cloning, 9 languages + 18 dialects, adjustable speed/tone). Uploaded reference audio is Whisper-transcribed and smart-trimmed automatically. Voiceover-first workflow: generate the voiceover → pick materials → generate the video.
- 📝 **Smart subtitles** - faster-whisper + Remotion integration; auto-generated per-character highlight (karaoke-style) subtitles.
- 🎨 **Style presets** - 12 title + 8 subtitle style presets with preview, font-size adjustment, and a custom font library. Native CSS stroke rendering: crisp, no ghosting.
- 🏷️ **Title display modes** - Opening titles support `brief` / `persistent` display (default: brief, 4 s); user preference auto-persisted.
@@ -37,7 +37,7 @@
- 💳 **Paid membership** - Alipay desktop web payment auto-activates membership; auto-deactivates on expiry with a renewal prompt; admins can also activate manually.
- 🔐 **Auth & isolation** - Supabase-based user isolation with phone registration/login and password management.
- 🛡️ **Service watchdog** - Built-in watchdog monitors and restarts hung services for 7x24 stability.
- 🚀 **Performance** - Video pre-compression, resident model services (near-instant loading), dual-GPU pipelined concurrency, MuseTalk face-detection down-sampling + BiSeNet cache + NVENC GPU hardware encoding, Remotion 16-way concurrent rendering
- 🚀 **Performance** - compose stream copy (no re-encode), FFmpeg timeout protection, global generation concurrency limit (Semaphore(2)), Remotion 4-way concurrent rendering, MuseTalk NVENC GPU hardware encoding + pure-numpy blending, resident model services, dual-GPU pipelined concurrency, Redis task TTL auto-cleanup
---

View File

@@ -25,10 +25,10 @@ LATENTSYNC_USE_SERVER=true
# LATENTSYNC_API_URL=http://localhost:8007
# Inference steps (20-50; higher = better quality, slower)
LATENTSYNC_INFERENCE_STEPS=16
LATENTSYNC_INFERENCE_STEPS=20
# Guidance scale (1.0-3.0; higher = tighter lip sync, but may jitter)
LATENTSYNC_GUIDANCE_SCALE=1.5
LATENTSYNC_GUIDANCE_SCALE=2.0
# Enable DeepCache acceleration (recommended)
LATENTSYNC_ENABLE_DEEPCACHE=true
@@ -94,5 +94,5 @@ SUPABASE_STORAGE_LOCAL_PATH=/home/rongye/ProgramFiles/Supabase/volumes/storage/s
ALIPAY_APP_ID=2021006132600283
ALIPAY_PRIVATE_KEY_PATH=/home/rongye/ProgramFiles/ViGent2/backend/keys/app_private_key.pem
ALIPAY_PUBLIC_KEY_PATH=/home/rongye/ProgramFiles/ViGent2/backend/keys/alipay_public_key.pem
ALIPAY_NOTIFY_URL=https://vigent.hbyrkj.top/api/payment/notify
ALIPAY_RETURN_URL=https://vigent.hbyrkj.top/pay
ALIPAY_NOTIFY_URL=https://ipagent.ai-labz.cn/api/payment/notify
ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay

View File

@@ -10,6 +10,7 @@ class GenerateAudioRequest(BaseModel):
ref_text: Optional[str] = None
language: str = "zh-CN"
speed: float = 1.0
instruct_text: Optional[str] = None
class RenameAudioRequest(BaseModel):

View File

@@ -81,6 +81,7 @@ async def generate_audio_task(task_id: str, req: GenerateAudioRequest, user_id:
output_path=audio_path,
language=_locale_to_tts_lang(req.language),
speed=req.speed,
instruct_text=req.instruct_text or "",
)
finally:
if os.path.exists(ref_local):

View File

@@ -1,14 +1,28 @@
from fastapi import APIRouter, HTTPException, Request, Depends
from fastapi.responses import FileResponse
from loguru import logger
from app.core.deps import get_current_user
from app.core.response import success_response
from app.modules.materials.schemas import RenameMaterialRequest
from app.modules.materials import service
from app.services.storage import storage_service
router = APIRouter()
@router.get("/stream/{material_id:path}")
async def stream_material(material_id: str, current_user: dict = Depends(get_current_user)):
"""Stream the material file directly (same origin; avoids CORS canvas taint)."""
user_id = current_user["id"]
if not material_id.startswith(f"{user_id}/"):
raise HTTPException(403, "无权访问此素材")
local_path = storage_service.get_local_file_path("materials", material_id)
if not local_path:
raise HTTPException(404, "素材文件不存在")
return FileResponse(local_path, media_type="video/mp4")
@router.post("")
async def upload_material(
request: Request,

View File

@@ -54,7 +54,7 @@ class RedisTaskStore:
"progress": 0,
"user_id": user_id,
}
self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False))
self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False), ex=86400)
self._client.sadd(self._index_key, task_id)
return task
@@ -71,12 +71,17 @@ class RedisTaskStore:
keys = [self._key(task_id) for task_id in task_ids]
raw_items = self._client.mget(keys)
tasks = []
for raw in raw_items:
if raw:
try:
tasks.append(json.loads(raw))
except Exception:
continue
expired = []
for task_id, raw in zip(task_ids, raw_items):
if raw is None:
expired.append(task_id)
continue
try:
tasks.append(json.loads(raw))
except Exception:
continue
if expired:
self._client.srem(self._index_key, *expired)
return tasks
def update(self, task_id: str, updates: Dict[str, Any]) -> Dict[str, Any]:
@@ -84,7 +89,8 @@ class RedisTaskStore:
if task.get("status") == "not_found":
task = {"status": "pending", "task_id": task_id}
task.update(updates)
self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False))
ttl = 7200 if task.get("status") in ("completed", "failed") else 86400
self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False), ex=ttl)
self._client.sadd(self._index_key, task_id)
return task

View File

@@ -24,6 +24,9 @@ from app.services.remotion_service import remotion_service
from .schemas import GenerateRequest
from .task_store import task_store
# Global concurrency limit: at most 2 video-generation tasks run at once
_generation_semaphore = asyncio.Semaphore(2)
def _locale_to_whisper_lang(locale: str) -> str:
"""'en-US' → 'en', 'zh-CN' → 'zh'"""
@@ -169,6 +172,12 @@ def _split_equal(segments: List[dict], material_paths: List[str]) -> List[dict]:
async def process_video_generation(task_id: str, req: GenerateRequest, user_id: str):
_update_task(task_id, message="排队中...")
async with _generation_semaphore:
await _process_video_generation_inner(task_id, req, user_id)
async def _process_video_generation_inner(task_id: str, req: GenerateRequest, user_id: str):
temp_files = []
try:
start_time = time.time()
@@ -283,6 +292,42 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
captions_path = None
async def _whisper_and_split():
"""Whisper alignment → _split_equal even material split (shared logic)"""
_update_task(task_id, message="正在生成字幕 (Whisper)...")
_captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(_captions_path)
captions_data = None
try:
captions_data = await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(_captions_path),
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed (multi-material)")
except Exception as e:
logger.warning(f"Whisper alignment failed: {e}")
_captions_path = None
_update_task(task_id, progress=15, message="正在分配素材...")
if captions_data and captions_data.get("segments"):
result = _split_equal(captions_data["segments"], material_paths)
else:
logger.warning("[MultiMat] Whisper 无数据,按时长均分")
audio_dur = video._get_duration(str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0
seg_dur = audio_dur / len(material_paths)
result = [
{"material_path": material_paths[i], "start": i * seg_dur,
"end": (i + 1) * seg_dur, "index": i}
for i in range(len(material_paths))
]
return result, _captions_path
if is_multi:
# ══════════════════════════════════════
# Multi-material pipeline
@@ -327,80 +372,10 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
f" 与素材数量({len(material_paths)})不一致,回退自动分配"
)
# Original logic: Whisper → _split_equal
_update_task(task_id, message="正在生成字幕 (Whisper)...")
captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)
try:
captions_data = await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(captions_path),
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed (multi-material)")
except Exception as e:
logger.warning(f"Whisper alignment failed: {e}")
captions_data = None
captions_path = None
_update_task(task_id, progress=15, message="正在分配素材...")
if captions_data and captions_data.get("segments"):
assignments = _split_equal(captions_data["segments"], material_paths)
else:
# Whisper failed → split evenly by duration (no character alignment needed)
logger.warning("[MultiMat] Whisper 无数据,按时长均分")
audio_dur = video._get_duration(str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0 # safety fallback
seg_dur = audio_dur / len(material_paths)
assignments = [
{"material_path": material_paths[i], "start": i * seg_dur,
"end": (i + 1) * seg_dur, "index": i}
for i in range(len(material_paths))
]
assignments, captions_path = await _whisper_and_split()
else:
# Original logic: Whisper → _split_equal
_update_task(task_id, message="正在生成字幕 (Whisper)...")
captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)
try:
captions_data = await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(captions_path),
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed (multi-material)")
except Exception as e:
logger.warning(f"Whisper alignment failed: {e}")
captions_data = None
captions_path = None
_update_task(task_id, progress=15, message="正在分配素材...")
if captions_data and captions_data.get("segments"):
assignments = _split_equal(captions_data["segments"], material_paths)
else:
# Whisper failed → split evenly by duration (no character alignment needed)
logger.warning("[MultiMat] Whisper 无数据,按时长均分")
audio_dur = video._get_duration(str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0 # safety fallback
seg_dur = audio_dur / len(material_paths)
assignments = [
{"material_path": material_paths[i], "start": i * seg_dur,
"end": (i + 1) * seg_dur, "index": i}
for i in range(len(material_paths))
]
assignments, captions_path = await _whisper_and_split()
# Extend segments to cover the full audio range (first segment starts at 0; last ends at the audio's end)
audio_duration = video._get_duration(str(audio_path))
@@ -721,6 +696,13 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
f"{task_id}_secondary_title_font"
)
# Clean up temp font files
for prefix in [f"{task_id}_subtitle_font", f"{task_id}_title_font", f"{task_id}_secondary_title_font"]:
for ext in [".ttf", ".otf", ".woff", ".woff2"]:
font_tmp = temp_dir / f"{prefix}{ext}"
if font_tmp.exists():
temp_files.append(font_tmp)
final_output_local_path = temp_dir / f"{task_id}_output.mp4"
temp_files.append(final_output_local_path)

View File

@@ -121,98 +121,6 @@ class LipSyncService:
logger.warning(f"⚠️ 视频循环异常: {e}")
return video_path
def _preprocess_video(self, video_path: str, output_path: str, target_height: int = 720) -> str:
"""
Video preprocessing: compress the video to speed up later stages
- Cap height at target_height (default 720p)
- Preserve the aspect ratio
- Use a fast encoding preset
Returns: path to the preprocessed video
"""
import subprocess
import json
# Probe video info (JSON output is more reliable)
probe_cmd = [
"ffprobe", "-v", "error",
"-select_streams", "v:0",
"-show_entries", "stream=height,width",
"-of", "json",
video_path
]
try:
result = subprocess.run(probe_cmd, capture_output=True, text=True, timeout=10)
if result.returncode != 0:
logger.warning(f"⚠️ ffprobe 失败: {result.stderr[:100]}")
return video_path
probe_data = json.loads(result.stdout)
streams = probe_data.get("streams", [])
if not streams:
logger.warning("⚠️ 无法获取视频流信息,跳过预处理")
return video_path
current_height = streams[0].get("height", 0)
current_width = streams[0].get("width", 0)
if current_height == 0:
logger.warning("⚠️ 视频高度为 0跳过预处理")
return video_path
logger.info(f"📹 原始视频分辨率: {current_width}×{current_height}")
except json.JSONDecodeError as e:
logger.warning(f"⚠️ ffprobe 输出解析失败: {e}")
return video_path
except subprocess.TimeoutExpired:
logger.warning("⚠️ ffprobe 超时,跳过预处理")
return video_path
except Exception as e:
logger.warning(f"⚠️ 获取视频信息失败: {e}")
return video_path
# Skip compression if the video is already small enough
if current_height <= target_height:
logger.info(f"📹 视频高度 {current_height}p <= {target_height}p无需压缩")
return video_path
logger.info(f"📹 预处理视频: {current_height}p → {target_height}p")
# Compress with FFmpeg
compress_cmd = [
"ffmpeg", "-y",
"-i", video_path,
"-vf", f"scale=-2:{target_height}", # keep aspect ratio; set height to target_height
"-c:v", "libx264",
"-preset", "ultrafast", # fastest encoding speed
"-crf", "23", # quality factor
"-c:a", "copy", # copy the audio stream as-is
output_path
]
try:
result = subprocess.run(
compress_cmd,
capture_output=True,
text=True,
timeout=120 # extended timeout: 2 minutes
)
if result.returncode == 0 and Path(output_path).exists():
original_size = Path(video_path).stat().st_size / 1024 / 1024
new_size = Path(output_path).stat().st_size / 1024 / 1024
logger.info(f"✅ 视频压缩完成: {original_size:.1f}MB → {new_size:.1f}MB")
return output_path
else:
logger.warning(f"⚠️ 视频压缩失败: {result.stderr[:200]}")
return video_path
except subprocess.TimeoutExpired:
logger.warning("⚠️ 视频压缩超时,使用原始视频")
return video_path
except Exception as e:
logger.warning(f"⚠️ 视频压缩异常: {e}")
return video_path
async def generate(
self,
video_path: str,

View File

@@ -71,7 +71,8 @@ class RemotionService:
"--video", str(video_path),
"--output", str(output_path),
"--fps", str(fps),
"--enableSubtitles", str(enable_subtitles).lower()
"--enableSubtitles", str(enable_subtitles).lower(),
"--concurrency", "4"
])
if captions_path:

View File

@@ -118,18 +118,21 @@ class VideoService:
cmd_str = ' '.join(shlex.quote(str(c)) for c in cmd)
logger.debug(f"FFmpeg CMD: {cmd_str}")
try:
# Synchronous call for BackgroundTasks compatibility
result = subprocess.run(
cmd,
shell=False,
capture_output=True,
text=True,
encoding='utf-8',
timeout=600,
)
if result.returncode != 0:
logger.error(f"FFmpeg Error: {result.stderr}")
return False
return True
except subprocess.TimeoutExpired:
logger.error("FFmpeg timed out after 600s")
return False
except Exception as e:
logger.error(f"FFmpeg Exception: {e}")
return False
@@ -148,6 +151,7 @@ class VideoService:
cmd,
capture_output=True,
text=True,
timeout=30,
)
return float(result.stdout.strip())
except Exception:
@@ -221,17 +225,20 @@ class VideoService:
# Previous state: subtitles disabled due to font issues
# if subtitle_path: ...
# Audio map with high quality encoding
# Use stream copy when no looping is needed (near-instant); re-encode only when looping
if loop_count > 1:
cmd.extend([
"-c:v", "libx264", "-preset", "fast", "-crf", "23",
])
else:
cmd.extend(["-c:v", "copy"])
cmd.extend([
"-c:v", "libx264",
"-preset", "medium", # balance speed and compression efficiency
"-crf", "20", # final output: high quality (visually lossless)
"-c:a", "aac",
"-b:a", "192k", # audio bitrate
"-shortest"
"-b:a", "192k",
"-shortest",
"-map", "0:v", "-map", "1:a",
])
# Use audio from input 1
cmd.extend(["-map", "0:v", "-map", "1:a"])
cmd.append(output_path)

View File

@@ -32,6 +32,7 @@ class VoiceCloneService:
ref_text: str,
language: str,
speed: float = 1.0,
instruct_text: str = "",
max_retries: int = 4,
) -> bytes:
timeout = httpx.Timeout(240.0)
@@ -39,15 +40,18 @@ class VoiceCloneService:
for attempt in range(max_retries):
try:
async with httpx.AsyncClient(timeout=timeout) as client:
data = {
"text": text,
"ref_text": ref_text,
"language": language,
"speed": str(speed),
}
if instruct_text:
data["instruct_text"] = instruct_text
response = await client.post(
f"{self.base_url}/generate",
files={"ref_audio": ("ref.wav", ref_audio_data, "audio/wav")},
data={
"text": text,
"ref_text": ref_text,
"language": language,
"speed": str(speed),
},
data=data,
)
retryable = False
@@ -99,6 +103,7 @@ class VoiceCloneService:
output_path: str,
language: str = "Chinese",
speed: float = 1.0,
instruct_text: str = "",
) -> str:
"""
Generate speech using voice cloning
@@ -132,6 +137,7 @@ class VoiceCloneService:
ref_text=ref_text,
language=language,
speed=speed,
instruct_text=instruct_text,
)
with open(output_path, "wb") as f:
f.write(audio_bytes)

View File

@@ -151,6 +151,46 @@ def split_segment_to_lines(words: List[dict], max_chars: int = MAX_CHARS_PER_LIN
return segments
def smooth_word_timestamps(words: List[dict]) -> List[dict]:
"""
Timestamp post-processing smoothing:
1. Guarantee strictly monotonically increasing timestamps
2. Remove tiny jitter in Whisper output (a character's end > the next character's start)
3. Fill inter-character gaps to keep the subtitle highlight from "skipping"
"""
if len(words) <= 1:
return words
result = [words[0].copy()]
for i in range(1, len(words)):
w = words[i].copy()
prev = result[-1]
# Ensure start is not earlier than the previous character's start (monotonic increase)
if w["start"] < prev["start"]:
w["start"] = prev["start"]
# Ensure start is not earlier than the previous character's end
if w["start"] < prev["end"]:
# The two characters overlap → split at the midpoint
mid = (prev["end"] + w["start"]) / 2
prev["end"] = round(mid, 3)
w["start"] = round(mid, 3)
# Fill inter-character gaps (join directly when the gap < 50 ms, avoiding highlight skips)
gap = w["start"] - prev["end"]
if 0 < gap < 0.05:
prev["end"] = w["start"]
# Ensure end >= start
if w["end"] < w["start"]:
w["end"] = w["start"] + 0.05
result.append(w)
return result
class WhisperService:
"""Subtitle alignment service (based on faster-whisper)"""
@@ -219,6 +259,8 @@ class WhisperService:
language=language,
word_timestamps=True, # enable per-character timestamps
vad_filter=True, # enable VAD silence filtering
beam_size=8, # wider beam search for better timestamp precision
# keep condition_on_previous_text at its default (True) to avoid systematically early timestamps
)
logger.info(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
@@ -244,6 +286,7 @@ class WhisperService:
all_words.extend(chars)
if all_words:
all_words = smooth_word_timestamps(all_words)
line_segments = split_segment_to_lines(all_words, max_chars)
all_segments.extend(line_segments)
@@ -268,6 +311,14 @@ class WhisperService:
w_starts = [c["start"] for c in whisper_chars]
w_final_end = whisper_chars[-1]["end"]
# Character-count ratio anomaly check
ratio = n_o / n_w
if ratio > 1.5 or ratio < 0.67:
logger.warning(
f"original_text 与 Whisper 字数比例异常: {n_o}/{n_w} = {ratio:.2f}, "
f"字幕时间戳精度可能下降"
)
logger.info(
f"Using original_text for subtitles (len={len(original_text)}), "
f"rhythm-mapping {n_o} orig chars onto {n_w} Whisper chars, "
@@ -302,11 +353,21 @@ class WhisperService:
"end": round(t_end, 3),
})
all_segments = split_segment_to_lines(remapped, max_chars)
# Clamp per-character duration to prevent extreme drift when the ratio is abnormal
MIN_CHAR_DURATION = 0.04 # 40 ms (one frame @ 25fps)
MAX_CHAR_DURATION = 0.8 # 800 ms
for r in remapped:
dur = r["end"] - r["start"]
if dur < MIN_CHAR_DURATION:
r["end"] = round(r["start"] + MIN_CHAR_DURATION, 3)
elif dur > MAX_CHAR_DURATION:
r["end"] = round(r["start"] + MAX_CHAR_DURATION, 3)
all_segments = split_segment_to_lines(smooth_word_timestamps(remapped), max_chars)
logger.info(f"Rebuilt {len(all_segments)} subtitle segments (rhythm-mapped)")
elif orig_chars:
# Whisper 字符不足,退回线性插值
all_segments = split_segment_to_lines(orig_chars, max_chars)
all_segments = split_segment_to_lines(smooth_word_timestamps(orig_chars), max_chars)
logger.info(f"Rebuilt {len(all_segments)} subtitle segments (linear fallback)")
logger.info(f"Generated {len(all_segments)} subtitle segments")
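The proportional rhythm mapping can be illustrated with a minimal sketch: the `j → k` index formula below is an assumption (the production mapping may differ in detail), while the 40 ms / 800 ms duration clamp matches the hunk above.

```python
MIN_CHAR_DURATION = 0.04  # one frame @ 25 fps, as in the diff
MAX_CHAR_DURATION = 0.8   # 800 ms cap against drift

def rhythm_map(orig_chars, w_starts, w_final_end):
    """Map n_o original-text chars onto n_w Whisper char timestamps by position ratio.
    Illustrative sketch; only the clamp constants are taken from the diff."""
    n_o, n_w = len(orig_chars), len(w_starts)
    remapped = []
    for j, ch in enumerate(orig_chars):
        # Scale original index j into Whisper char index space
        k = min(int(j * n_w / n_o), n_w - 1)
        k_next = min(int((j + 1) * n_w / n_o), n_w - 1)
        t_start = w_starts[k]
        t_end = w_final_end if j == n_o - 1 else w_starts[k_next]
        # Clamp per-char duration so skewed ratios can't produce extreme spans
        dur = t_end - t_start
        if dur < MIN_CHAR_DURATION:
            t_end = t_start + MIN_CHAR_DURATION
        elif dur > MAX_CHAR_DURATION:
            t_end = t_start + MAX_CHAR_DURATION
        remapped.append({"char": ch, "start": round(t_start, 3), "end": round(t_end, 3)})
    return remapped

# 4 original chars onto 3 Whisper char starts
remapped = rhythm_map(list("欢迎大家"), [0.0, 0.4, 0.8], 1.2)
```

The result would then go through `smooth_word_timestamps()` before line splitting, as the diff does.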

View File

@@ -127,6 +127,7 @@ export const useGeneratedAudios = ({
ref_text?: string;
language: string;
speed?: number;
instruct_text?: string;
}) => {
setIsGeneratingAudio(true);
setAudioTask({ status: "pending", progress: 0, message: "正在提交..." });

View File

@@ -182,6 +182,9 @@ export const useHomeController = () => {
// 语速控制
const [speed, setSpeed] = useState<number>(1.0);
// 语气控制(仅声音克隆模式)
const [emotion, setEmotion] = useState<string>("normal");
// ClipTrimmer 模态框状态
const [clipTrimmerOpen, setClipTrimmerOpen] = useState(false);
const [clipTrimmerSegmentId, setClipTrimmerSegmentId] = useState<string | null>(null);
@@ -400,13 +403,14 @@ export const useHomeController = () => {
});
// 时间轴第一段素材的视频 URL用于帧截取预览
// 有时间轴段时用第一段,没有(如未选配音)回退到 selectedMaterials[0]
// 使用后端代理 URL同源避免 CORS canvas taint
const firstTimelineMaterialUrl = useMemo(() => {
const firstSeg = timelineSegments[0];
const matId = firstSeg?.materialId ?? selectedMaterials[0];
if (!matId) return null;
const mat = materials.find((m) => m.id === matId);
return mat?.path ? resolveMediaUrl(mat.path) : null;
if (!mat) return null;
return `/api/materials/stream/${mat.id}`;
}, [materials, timelineSegments, selectedMaterials]);
const materialPosterUrl = useVideoFrameCapture(showStylePreview ? firstTimelineMaterialUrl : null);
@@ -501,6 +505,8 @@ export const useHomeController = () => {
setSelectedAudioId,
speed,
setSpeed,
emotion,
setEmotion,
});
const { savedScripts, saveScript, deleteScript: deleteSavedScript } = useSavedScripts(storageKey);
@@ -875,6 +881,13 @@ export const useHomeController = () => {
return;
}
const emotionToInstruct: Record<string, string> = {
normal: "",
happy: "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
sad: "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
angry: "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
};
const params = {
text: text.trim(),
tts_mode: ttsMode,
@@ -883,6 +896,7 @@ export const useHomeController = () => {
ref_text: ttsMode === "voiceclone" ? refText : undefined,
language: textLang,
speed: ttsMode === "voiceclone" ? speed : undefined,
instruct_text: ttsMode === "voiceclone" ? emotionToInstruct[emotion] || "" : undefined,
};
await generateAudio(params);
};
@@ -1214,6 +1228,8 @@ export const useHomeController = () => {
selectAudio,
speed,
setSpeed,
emotion,
setEmotion,
timelineSegments,
reorderSegments,
setSourceRange,

View File

@@ -65,6 +65,8 @@ interface UseHomePersistenceOptions {
setSelectedAudioId: React.Dispatch<React.SetStateAction<string | null>>;
speed: number;
setSpeed: React.Dispatch<React.SetStateAction<number>>;
emotion: string;
setEmotion: React.Dispatch<React.SetStateAction<string>>;
}
export const useHomePersistence = ({
@@ -122,6 +124,8 @@ export const useHomePersistence = ({
setSelectedAudioId,
speed,
setSpeed,
emotion,
setEmotion,
}: UseHomePersistenceOptions) => {
const [isRestored, setIsRestored] = useState(false);
@@ -153,6 +157,7 @@ export const useHomePersistence = ({
const savedSubtitleBottomMargin = localStorage.getItem(`vigent_${storageKey}_subtitleBottomMargin`);
const savedOutputAspectRatio = localStorage.getItem(`vigent_${storageKey}_outputAspectRatio`);
const savedSpeed = localStorage.getItem(`vigent_${storageKey}_speed`);
const savedEmotion = localStorage.getItem(`vigent_${storageKey}_emotion`);
setText(savedText || "大家好,欢迎来到我的频道,今天给大家分享一些有趣的内容。");
setVideoTitle(savedTitle ? clampTitle(savedTitle) : "");
@@ -235,6 +240,8 @@ export const useHomePersistence = ({
if (!Number.isNaN(parsed)) setSpeed(parsed);
}
if (savedEmotion) setEmotion(savedEmotion);
// eslint-disable-next-line react-hooks/set-state-in-effect
setIsRestored(true);
}, [
@@ -249,6 +256,7 @@ export const useHomePersistence = ({
setSelectedVideoId,
setSelectedAudioId,
setSpeed,
setEmotion,
setSubtitleFontSize,
setSubtitleSizeLocked,
setText,
@@ -427,5 +435,11 @@ export const useHomePersistence = ({
}
}, [speed, storageKey, isRestored]);
useEffect(() => {
if (isRestored) {
localStorage.setItem(`vigent_${storageKey}_emotion`, emotion);
}
}, [emotion, storageKey, isRestored]);
return { isRestored };
};

View File

@@ -18,7 +18,6 @@ export function useVideoFrameCapture(videoUrl: string | null): string | null {
let isActive = true;
const video = document.createElement("video");
video.crossOrigin = "anonymous";
video.muted = true;
video.preload = "auto";
video.playsInline = true;

View File

@@ -23,6 +23,8 @@ interface GeneratedAudiosPanelProps {
speed: number;
onSpeedChange: (speed: number) => void;
ttsMode: string;
emotion: string;
onEmotionChange: (e: string) => void;
embedded?: boolean;
}
@@ -41,14 +43,18 @@ export function GeneratedAudiosPanel({
speed,
onSpeedChange,
ttsMode,
emotion,
onEmotionChange,
embedded = false,
}: GeneratedAudiosPanelProps) {
const [editingId, setEditingId] = useState<string | null>(null);
const [editName, setEditName] = useState("");
const [playingId, setPlayingId] = useState<string | null>(null);
const [speedOpen, setSpeedOpen] = useState(false);
const [emotionOpen, setEmotionOpen] = useState(false);
const audioRef = useRef<HTMLAudioElement | null>(null);
const speedRef = useRef<HTMLDivElement>(null);
const emotionRef = useRef<HTMLDivElement>(null);
const stopPlaying = useCallback(() => {
if (audioRef.current) {
@@ -80,6 +86,17 @@ export function GeneratedAudiosPanel({
return () => document.removeEventListener("mousedown", handler);
}, [speedOpen]);
// Close emotion dropdown on click outside
useEffect(() => {
const handler = (e: MouseEvent) => {
if (emotionRef.current && !emotionRef.current.contains(e.target as Node)) {
setEmotionOpen(false);
}
};
if (emotionOpen) document.addEventListener("mousedown", handler);
return () => document.removeEventListener("mousedown", handler);
}, [emotionOpen]);
const togglePlay = (audio: GeneratedAudio, e: React.MouseEvent) => {
e.stopPropagation();
if (playingId === audio.id) {
@@ -125,12 +142,48 @@ export function GeneratedAudiosPanel({
] as const;
const currentSpeedLabel = speedOptions.find((o) => o.value === speed)?.label ?? "正常";
const emotionOptions = [
{ value: "normal", label: "正常" },
{ value: "happy", label: "欢快" },
{ value: "sad", label: "低沉" },
{ value: "angry", label: "严肃" },
] as const;
const currentEmotionLabel = emotionOptions.find((o) => o.value === emotion)?.label ?? "正常";
const content = (
<>
{embedded ? (
<>
{/* Row 1: 语速 + 生成配音 (right-aligned) */}
{/* Row 1: 语气 + 语速 + 生成配音 (right-aligned) */}
<div className="flex justify-end items-center gap-1.5 mb-3">
{ttsMode === "voiceclone" && (
<div ref={emotionRef} className="relative">
<button
onClick={() => setEmotionOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
>
语气: {currentEmotionLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${emotionOpen ? "rotate-180" : ""}`} />
</button>
{emotionOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[80px]">
{emotionOptions.map((opt) => (
<button
key={opt.value}
onClick={() => { onEmotionChange(opt.value); setEmotionOpen(false); }}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
emotion === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
)}
{ttsMode === "voiceclone" && (
<div ref={speedRef} className="relative">
<button
@@ -192,6 +245,34 @@ export function GeneratedAudiosPanel({
</h2>
<div className="flex gap-1.5">
{ttsMode === "voiceclone" && (
<div ref={emotionRef} className="relative">
<button
onClick={() => setEmotionOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
>
语气: {currentEmotionLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${emotionOpen ? "rotate-180" : ""}`} />
</button>
{emotionOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[80px]">
{emotionOptions.map((opt) => (
<button
key={opt.value}
onClick={() => { onEmotionChange(opt.value); setEmotionOpen(false); }}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
emotion === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
)}
{ttsMode === "voiceclone" && (
<div ref={speedRef} className="relative">
<button

View File

@@ -168,6 +168,8 @@ export function HomePage() {
selectAudio,
speed,
setSpeed,
emotion,
setEmotion,
timelineSegments,
reorderSegments,
setSourceRange,
@@ -293,6 +295,8 @@ export function HomePage() {
speed={speed}
onSpeedChange={setSpeed}
ttsMode={ttsMode}
emotion={emotion}
onEmotionChange={setEmotion}
/>
</div>

View File

@@ -174,6 +174,7 @@ async def generate(
ref_text: str = Form(...),
language: str = Form("Chinese"),
speed: float = Form(1.0),
instruct_text: str = Form(""),
):
"""
声音克隆生成
@@ -236,16 +237,30 @@ async def generate(
# CosyVoice3 的 prompt_text 格式
prompt_text = f"You are a helpful assistant.<|endofprompt|>{ref_text}"
use_instruct = bool(instruct_text.strip())
if use_instruct:
print(f"🎭 Instruct mode: {instruct_text[:60]}...")
def _do_inference():
"""在线程池中执行推理"""
results = list(_model.inference_zero_shot(
text,
prompt_text,
ref_audio_path,
stream=False,
speed=speed,
text_frontend=True,
))
if use_instruct:
results = list(_model.inference_instruct2(
text,
instruct_text,
ref_audio_path,
stream=False,
speed=speed,
text_frontend=True,
))
else:
results = list(_model.inference_zero_shot(
text,
prompt_text,
ref_audio_path,
stream=False,
speed=speed,
text_frontend=True,
))
if not results:
raise RuntimeError("CosyVoice returned empty results")
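Putting the frontend emotion map together with the new `instruct_text` Form field, a client payload for this endpoint could be assembled as below. The field names follow the handler signature above and the mapping matches the frontend `emotionToInstruct` table; the helper function itself is illustrative, not production code.

```python
def build_clone_payload(text, ref_text, emotion="normal", speed=1.0, language="Chinese"):
    """Assemble Form fields for the voice-clone endpoint.
    emotion -> instruct_text mirrors the frontend mapping in the diff."""
    emotion_to_instruct = {
        "normal": "",
        "happy": "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
        "sad": "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
        "angry": "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
    }
    return {
        "text": text,
        "ref_text": ref_text,
        "language": language,
        "speed": speed,
        # Empty string keeps the server on inference_zero_shot;
        # any non-empty value switches it to inference_instruct2
        "instruct_text": emotion_to_instruct.get(emotion, ""),
    }

payload = build_clone_payload("今天天气真不错!", "参考音频的转写文字。", emotion="happy")
```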

View File

@@ -429,7 +429,7 @@ def _run_inference(req: LipSyncRequest) -> dict:
# ===== Phase 4: VAE 潜空间编码 =====
t0 = time.time()
input_latent_list = []
extra_margin = 10
extra_margin = 15
for bbox, frame in zip(coord_list, frames):
if bbox == coord_placeholder:
continue
@@ -479,91 +479,77 @@ def _run_inference(req: LipSyncRequest) -> dict:
timings["5_unet"] = time.time() - t0
print(f"✅ UNet 推理: {len(res_frame_list)} 帧 [{timings['5_unet']:.1f}s]")
# ===== Phase 6: 合成 (FFmpeg pipe 直写 H.264 + 纯 numpy blending) =====
# ===== Phase 6: 合成 (cv2.VideoWriter + 纯 numpy blending) =====
t0 = time.time()
h, w = frames[0].shape[:2]
temp_raw_path = output_vid_path + ".raw.mp4"
# FFmpeg pipe: rawvideo stdin → NVENC GPU 硬件编码 H.264
ffmpeg_cmd = [
"ffmpeg", "-y", "-v", "warning",
"-f", "rawvideo", "-pix_fmt", "bgr24",
"-s", f"{w}x{h}", "-r", str(fps),
"-i", "pipe:0",
"-c:v", "h264_nvenc", "-preset", "p4", "-cq", "20",
"-pix_fmt", "yuv420p",
temp_raw_path
]
proc = subprocess.Popen(ffmpeg_cmd, stdin=subprocess.PIPE,
stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
writer = cv2.VideoWriter(temp_raw_path, fourcc, fps, (w, h))
if not writer.isOpened():
raise RuntimeError(f"cv2.VideoWriter 打开失败: {temp_raw_path}")
cached_mask = None
cached_crop_box = None
blend_mode = "jaw" if version == "v15" else "raw"
try:
for i in tqdm(range(len(res_frame_list)), desc="合成"):
res_frame = res_frame_list[i]
bbox = coord_list_cycle[i % len(coord_list_cycle)]
ori_frame = frame_list_cycle[i % len(frame_list_cycle)].copy()
for i in tqdm(range(len(res_frame_list)), desc="合成"):
res_frame = res_frame_list[i]
bbox = coord_list_cycle[i % len(coord_list_cycle)]
ori_frame = frame_list_cycle[i % len(frame_list_cycle)].copy()
x1, y1, x2, y2 = bbox
if version == "v15":
y2 = min(y2 + extra_margin, ori_frame.shape[0])
adjusted_bbox = (x1, y1, x2, y2)
x1, y1, x2, y2 = bbox
if version == "v15":
y2 = min(y2 + extra_margin, ori_frame.shape[0])
adjusted_bbox = (x1, y1, x2, y2)
try:
res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
except Exception:
writer.write(ori_frame)
continue
# 每 N 帧更新 BiSeNet 人脸解析 mask, 其余帧复用缓存
if i % BLEND_CACHE_EVERY == 0 or cached_mask is None:
try:
res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
cached_mask, cached_crop_box = get_image_prepare_material(
ori_frame, adjusted_bbox, mode=blend_mode, fp=fp)
except Exception:
proc.stdin.write(ori_frame.tobytes())
# 如果 prepare 失败, 用完整方式
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
writer.write(combine_frame)
continue
# 每 N 帧更新 BiSeNet 人脸解析 mask, 其余帧复用缓存
if i % BLEND_CACHE_EVERY == 0 or cached_mask is None:
try:
cached_mask, cached_crop_box = get_image_prepare_material(
ori_frame, adjusted_bbox, mode=blend_mode, fp=fp)
except Exception:
# 如果 prepare 失败, 用完整方式
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
proc.stdin.write(combine_frame.tobytes())
continue
try:
combine_frame = get_image_blending_fast(
ori_frame, res_frame, adjusted_bbox, cached_mask, cached_crop_box)
except Exception:
# blending_fast 失败时 fallback 到 PIL 方式
try:
combine_frame = get_image_blending_fast(
combine_frame = get_image_blending(
ori_frame, res_frame, adjusted_bbox, cached_mask, cached_crop_box)
except Exception:
# blending_fast 失败时 fallback 到 PIL 方式
try:
combine_frame = get_image_blending(
ori_frame, res_frame, adjusted_bbox, cached_mask, cached_crop_box)
except Exception:
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
combine_frame = get_image(
ori_frame, res_frame, list(adjusted_bbox),
mode=blend_mode, fp=fp)
proc.stdin.write(combine_frame.tobytes())
finally:
proc.stdin.close()
proc.wait()
stderr_out = proc.stderr.read().decode("utf-8", errors="ignore") if proc.stderr else ""
proc.stderr.close()
if proc.returncode != 0:
raise RuntimeError(f"FFmpeg pipe 编码失败 (rc={proc.returncode}): {stderr_out[:500]}")
writer.write(combine_frame)
writer.release()
timings["6_blend"] = time.time() - t0
print(f"🎨 合成 [{timings['6_blend']:.1f}s]")
# ===== Phase 7: 音频合并 (-c:v copy, 不重编码视频) =====
# ===== Phase 7: FFmpeg H.264 编码 + 合并音频 =====
t0 = time.time()
cmd = [
"ffmpeg", "-y", "-v", "warning",
"-i", temp_raw_path, "-i", audio_path,
"-c:v", "copy", "-c:a", "copy", "-shortest",
"-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p",
"-c:a", "copy", "-shortest",
output_vid_path
]
if not run_ffmpeg(cmd):

View File

@@ -155,18 +155,56 @@ async function main() {
console.log(`Public dir: ${publicDir}, Video file: ${videoFileName}`);
// Bundle the Remotion project
console.log('Bundling Remotion project...');
// 修复: 使用 process.cwd() 解析 src/index.ts确保在 dist/render.js 和 ts-node 下都能找到
// 假设脚本总是在 remotion 根目录下运行 (由 python service 保证)
const entryPoint = path.resolve(process.cwd(), 'src/index.ts');
console.log(`Entry point: ${entryPoint}`);
const bundleLocation = await bundle({
entryPoint,
webpackOverride: (config) => config,
publicDir,
});
// Bundle 缓存逻辑:通过 src 目录 mtime hash 判断是否需要重新打包
const BUNDLE_CACHE_DIR = path.resolve(process.cwd(), '.remotion-bundle-cache');
const hashFile = path.join(BUNDLE_CACHE_DIR, '.hash');
function getSourceHash(): string {
// 收集 src 目录下所有文件的 mtime 作为缓存 key
const srcDir = path.resolve(process.cwd(), 'src');
const mtimes: string[] = [];
function walkDir(dir: string) {
for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
const fullPath = path.join(dir, entry.name);
if (entry.isDirectory()) {
walkDir(fullPath);
} else {
mtimes.push(`${fullPath}:${fs.statSync(fullPath).mtimeMs}`);
}
}
}
walkDir(srcDir);
mtimes.sort();
return mtimes.join('|');
}
const currentHash = getSourceHash();
let bundleLocation: string;
if (fs.existsSync(hashFile) && fs.readFileSync(hashFile, 'utf-8') === currentHash) {
bundleLocation = BUNDLE_CACHE_DIR;
console.log('Using cached bundle');
} else {
console.log('Bundling Remotion project...');
console.log(`Entry point: ${entryPoint}`);
const freshBundle = await bundle({
entryPoint,
webpackOverride: (config) => config,
publicDir,
});
// 复制到缓存目录
if (fs.existsSync(BUNDLE_CACHE_DIR)) {
fs.rmSync(BUNDLE_CACHE_DIR, { recursive: true });
}
fs.cpSync(freshBundle, BUNDLE_CACHE_DIR, { recursive: true });
fs.writeFileSync(hashFile, currentHash);
bundleLocation = BUNDLE_CACHE_DIR;
console.log('Bundle cached for future use');
}
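The mtime-based cache key above can be sketched in Python for clarity. `source_hash` is a hypothetical equivalent of `getSourceHash`, except it digests the joined entries with SHA-256 instead of storing the raw string; the cache-invalidation behavior (any file touch changes the key) is the same.

```python
import hashlib
import os
import tempfile

def source_hash(src_dir: str) -> str:
    """Cache key: path + mtime of every file under src_dir, sorted for determinism."""
    entries = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            p = os.path.join(root, name)
            entries.append(f"{p}:{os.stat(p).st_mtime}")
    entries.sort()
    return hashlib.sha256("|".join(entries).encode()).hexdigest()

# Demo on a throwaway directory (stand-in for the Remotion src/ tree)
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "index.ts"), "w") as f:
    f.write("export {};")
h_before = source_hash(demo)
os.utime(os.path.join(demo, "index.ts"), (1, 1))  # simulate an edit (mtime change)
h_after = source_hash(demo)
```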
// 统一 inputProps包含视频尺寸供 calculateMetadata 使用
const inputProps = {
@@ -198,7 +236,7 @@ async function main() {
composition.height = videoHeight;
// Render the video
const concurrency = options.concurrency || 16;
const concurrency = options.concurrency || 4;
console.log(`Rendering video (concurrency=${concurrency})...`);
await renderMedia({
composition,

View File

@@ -27,7 +27,7 @@ export function getCurrentSegment(
currentTimeInSeconds: number
): Segment | null {
for (const segment of captions.segments) {
if (currentTimeInSeconds >= segment.start && currentTimeInSeconds <= segment.end) {
if (currentTimeInSeconds >= segment.start && currentTimeInSeconds < segment.end) {
return segment;
}
}
@@ -43,7 +43,7 @@ export function getCurrentWordIndex(
): number {
for (let i = 0; i < segment.words.length; i++) {
const word = segment.words[i];
if (currentTimeInSeconds >= word.start && currentTimeInSeconds <= word.end) {
if (currentTimeInSeconds >= word.start && currentTimeInSeconds < word.end) {
return i;
}
// 如果当前时间在两个字之间,返回前一个字
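The `<=` → `<` change makes both lookups half-open `[start, end)`: a timestamp that lands exactly on a segment or word boundary now matches only the following item, never two at once. A minimal sketch of the rule:

```python
def find_current_index(items, t):
    """Half-open [start, end) lookup: a boundary time belongs to the
    *next* item, so adjacent items can never both match."""
    for i, it in enumerate(items):
        if it["start"] <= t < it["end"]:
            return i
    return -1

segments = [{"start": 0.0, "end": 1.0}, {"start": 1.0, "end": 2.0}]
```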