Compare commits
9 commits: 190fc2e590, 48bc78fe38, abf005f225, 9de2cb40b4, 29c67f629d, 0e3502c6f0, a1604979f0, 08221e48de, 42b5cc0c02
@@ -2,6 +2,12 @@

This document defines the backend's structural conventions, interface contracts, and implementation habits. The goal is that new features land in a unified pattern and that legacy logic is gradually extracted as it gets fixed.

+## Document scope
+
+- This document only defines backend development conventions and engineering constraints (layer responsibilities, contracts, workflow, code habits).
+- For API descriptions, deployment, and environment-configuration examples, see `Docs/BACKEND_README.md`.
+- Record historical changes in `Docs/DevLogs/` and `Docs/TASK_COMPLETE.md`; do not write them into this spec.

---

## 1. Modularity and layering principles

@@ -43,7 +49,7 @@ backend/
│   │   └── admin/                 # admin features
│   ├── repositories/              # Supabase data access
│   ├── services/                  # external-service integrations
-│   │   ├── uploader/              # platform publishers (douyin/weixin)
+│   │   ├── uploader/              # platform publishers (douyin/weixin/xiaohongshu/bilibili)
│   │   ├── qr_login_service.py
│   │   ├── publish_service.py
│   │   ├── remotion_service.py

@@ -156,6 +162,14 @@ backend/user_data/{user_uuid}/cookies/
- `LATENTSYNC_*`
- `CORS_ORIGINS` (CORS whitelist, default `*`)

+### MuseTalk / hybrid lip sync
+- `MUSETALK_GPU_ID` (GPU index, default 0)
+- `MUSETALK_API_URL` (resident-service address, default http://localhost:8011)
+- `MUSETALK_BATCH_SIZE` (inference batch size, default 32)
+- `MUSETALK_VERSION` (v15)
+- `MUSETALK_USE_FLOAT16` (half precision, default true)
+- `LIPSYNC_DURATION_THRESHOLD` (seconds; audio at or above this uses MuseTalk; code default 120, this repo's current `.env` sets 100)

### WeChat Channels
- `WEIXIN_HEADLESS_MODE` (headful/headless-new)
- `WEIXIN_CHROME_PATH` / `WEIXIN_BROWSER_CHANNEL`

@@ -171,6 +185,14 @@ backend/user_data/{user_uuid}/cookies/
- `DOUYIN_FORCE_SWIFTSHADER`
- `DOUYIN_DEBUG_ARTIFACTS` / `DOUYIN_RECORD_VIDEO` / `DOUYIN_KEEP_SUCCESS_VIDEO`

+### Xiaohongshu
+- `XIAOHONGSHU_HEADLESS_MODE` (headful/headless-new, default headless-new)
+- `XIAOHONGSHU_CHROME_PATH` / `XIAOHONGSHU_BROWSER_CHANNEL`
+- `XIAOHONGSHU_USER_AGENT`
+- `XIAOHONGSHU_LOCALE` / `XIAOHONGSHU_TIMEZONE_ID`
+- `XIAOHONGSHU_FORCE_SWIFTSHADER`
+- `XIAOHONGSHU_DEBUG_ARTIFACTS`

### Alipay
- `ALIPAY_APP_ID` / `ALIPAY_PRIVATE_KEY_PATH` / `ALIPAY_PUBLIC_KEY_PATH`
- `ALIPAY_NOTIFY_URL` / `ALIPAY_RETURN_URL`

@@ -183,8 +205,9 @@ backend/user_data/{user_uuid}/cookies/
## 10. Playwright publish debugging

- Diagnostic logs on disk: `backend/app/debug_screenshots/weixin_network.log` / `douyin_network.log`
-- Screenshots of key failures: `backend/app/debug_screenshots/weixin_*.png` / `douyin_*.png`
+- Screenshots of key failures: `backend/app/debug_screenshots/weixin_*.png` / `douyin_*.png` / `xiaohongshu_*.png`
- For Channels, prefer headful + xvfb-run (avoids headless decoding/fingerprint issues)
+- Publish-specific implementation details (login flow, success detection, troubleshooting) are maintained in `Docs/PUBLISH_DEPLOY.md`

---
@@ -1,6 +1,12 @@

# ViGent2 Backend Development Guide

-This document provides a backend architecture overview and interface conventions. For development conventions and layering rules, see `Docs/BACKEND_DEV.md`.
+This document provides a backend architecture overview, API descriptions, and runtime configuration.

+## 📌 Document scope
+
+- This document describes backend service capabilities, APIs, and deployment/runtime usage (for consumers and joint debugging).
+- For development conventions, layering constraints, and code habits, see `Docs/BACKEND_DEV.md`.
+- For historical changes and milestones, see `Docs/DevLogs/` and `Docs/TASK_COMPLETE.md`.

---

@@ -8,7 +14,7 @@

The backend uses the **FastAPI** framework on Python 3.10+ and is mainly responsible for business logic, AI task scheduling, and interaction with the various microservice components.

-### Directory structure
+### Directory structure (overview)

```
backend/

@@ -36,6 +42,8 @@ backend/
└── requirements.txt          # dependency list
```

+> For detailed layer responsibilities (router/service/workflow/repositories) and development constraints, see `Docs/BACKEND_DEV.md`.

---

## 🔌 API conventions
@@ -56,6 +64,7 @@

2. **Video generation (Videos)**
   * `POST /api/videos/generate`: submit a generation task
+   * `GET/POST /api/videos/voice-preview`: generate a short voice-preview audio clip (returns a binary audio stream)
   * `GET /api/videos/tasks/{task_id}`: query a single task's status
   * `GET /api/videos/tasks`: list all of the user's tasks
   * `GET /api/videos/generated`: list historical videos

@@ -65,14 +74,18 @@ backend/
   * `POST /api/materials`: upload a material
   * `GET /api/materials`: list materials
   * `PUT /api/materials/{material_id}`: rename a material
+   * `GET /api/materials/stream/{material_id}`: stream the material file same-origin (for frontend canvas frame capture, avoiding cross-origin canvas taint)

4. **Social publishing (Publish)**
   * `POST /api/publish`: publish a video to Douyin / WeChat Channels / Bilibili / Xiaohongshu
-   * `POST /api/publish/login`: QR-code login to a platform
-   * `GET /api/publish/login/status`: query login status (including the face-verification QR)
+   * `POST /api/publish/login/{platform}`: fetch the platform QR code and start QR login
+   * `GET /api/publish/login/status/{platform}`: poll login status (including the Douyin face-verification QR)
+   * `POST /api/publish/logout/{platform}`: log out of a platform (deletes the cookie)
+   * `POST /api/publish/cookies/save/{platform}`: save a client-extracted cookie
   * `GET /api/publish/accounts`: list logged-in accounts
   * `GET /api/publish/screenshot/(unknown)`: fetch the publish-success screenshot (login required)

-> Tip: for Channels/Douyin publishing, run the backend with headful + xvfb-run.
+> Tip: for Channels/Douyin publishing, run the backend with headful + xvfb-run. Publish-specific implementation and deployment notes are in `Docs/PUBLISH_DEPLOY.md`.
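A client could consume the polling endpoint above roughly like this. This is an illustrative sketch, not the shipped client code: the `logged_in` and `face_verify_qr` field names are assumptions, and the HTTP transport is injected so you can plug in whatever client you use.

```python
import time

def poll_login_status(fetch_status, platform: str, interval: float = 2.0,
                      timeout: float = 120.0) -> dict:
    """Poll GET /api/publish/login/status/{platform} until login completes.
    `fetch_status` is any callable that takes a URL path and returns the
    decoded JSON dict; the response field names here are assumptions."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(f"/api/publish/login/status/{platform}")
        if status.get("logged_in"):           # assumed field name
            return status
        if status.get("face_verify_qr"):      # e.g. the Douyin face-verification QR
            print("show the face-verification QR to the user")
        time.sleep(interval)
    raise TimeoutError(f"{platform} login did not complete within {timeout}s")
```

In practice you would pass a small wrapper around your HTTP client as `fetch_status` and render any face-verification QR the backend returns.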
5. **Asset library (Assets)**
   * `GET /api/assets/subtitle-styles`: subtitle style list

@@ -101,7 +114,7 @@ backend/
   * `POST /api/tools/extract-script`: extract a script from a video link

10. **Health checks**
-    * `GET /api/lipsync/health`: LatentSync service health
+    * `GET /api/lipsync/health`: lip-sync service health (covers LatentSync + MuseTalk + the hybrid routing threshold)
    * `GET /api/voiceclone/health`: CosyVoice 3.0 service health
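An aggregated health payload like the one described above could be sketched as follows. The field names are illustrative, not the exact API schema; the two checkers are injected so the sketch stays self-contained:

```python
import asyncio

async def lipsync_health(check_latentsync, check_musetalk, threshold: float) -> dict:
    """Probe both lip-sync engines concurrently and report the hybrid
    routing threshold alongside them (field names are assumptions)."""
    latent_ok, muse_ok = await asyncio.gather(check_latentsync(), check_musetalk())
    return {
        "latentsync": "ok" if latent_ok else "down",
        "musetalk": "ok" if muse_ok else "down",
        "duration_threshold": threshold,  # hybrid routing cutoff in seconds
    }
```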
11. **Payment**

@@ -137,7 +150,11 @@ backend/
- `speed`: speech rate (voice-clone mode; default 1.0, range 0.8-1.2)
- `custom_assignments`: custom material-assignment array (each item carries `material_path` / `start` / `end` / `source_start` / `source_end?`); when present, generation follows the visible timeline segments
- `output_aspect_ratio`: output aspect ratio (`9:16` or `16:9`, default `9:16`)
-- `language`: TTS language (auto-detected by default; passed through to CosyVoice 3.0 when voice cloning)
+- `lipsync_model`: lip-sync model routing mode (`default` / `fast` / `advanced`)
+  - `default`: threshold routing (`LIPSYNC_DURATION_THRESHOLD`)
+  - `fast`: force MuseTalk, falling back to LatentSync when it is unavailable
+  - `advanced`: force LatentSync
+- `language`: TTS locale (default `zh-CN`; mapped to Whisper's `zh/en/...` and CosyVoice's `Chinese/English/Auto`)
- `title`: opening title text
- `title_display_mode`: title display mode (`short` / `persistent`, default `short`)
- `title_duration`: title display duration (seconds, default `4.0`; only applies in `short` mode)
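The `lipsync_model` routing rules above can be sketched as a small policy function. This is a minimal sketch, not the actual backend code; the function name and the hard-coded threshold are illustrative:

```python
DURATION_THRESHOLD = 100.0  # LIPSYNC_DURATION_THRESHOLD from .env (code default 120)

def choose_lipsync_engine(mode: str, audio_duration: float,
                          musetalk_available: bool = True) -> str:
    """Return which lip-sync engine a request should use."""
    if mode == "advanced":
        return "latentsync"  # always force LatentSync
    if mode == "fast":
        # force MuseTalk, but fall back when the service is down
        return "musetalk" if musetalk_available else "latentsync"
    # "default": route by audio duration against the threshold
    if audio_duration >= DURATION_THRESHOLD and musetalk_available:
        return "musetalk"
    return "latentsync"
```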
@@ -160,6 +177,18 @@ backend/
- Multi-material segments are uniformly re-encoded before concatenation and forced to `25fps + CFR`, reducing stutter caused by inconsistent time bases at segment boundaries.
- The concat flow enables `+genpts` to rebuild timestamps, improving timeline continuity after splicing.
- MOV materials carrying rotation metadata are orientation-normalized first, before resolution checks and the rest of the pipeline.
- The compose stage (merging the video and audio tracks) uses `-c:v copy` stream copy when **no video looping is needed**; it only re-encodes when looping is required.
- FFmpeg subprocesses have timeout protection: `_run_ffmpeg()` 600 s and `_get_duration()` 30 s, preventing malformed files from hanging the pipeline forever.
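The timeout protection mentioned above can be sketched with `subprocess.run`'s `timeout` argument. A sketch only, not the actual `_run_ffmpeg()` implementation; the wrapper name and the example command are illustrative:

```python
import subprocess

FFMPEG_TIMEOUT = 600   # seconds for ffmpeg runs, per the note above
FFPROBE_TIMEOUT = 30   # seconds for duration probing

def run_with_timeout(cmd: list[str], timeout: float) -> str:
    """Run a subprocess with a hard timeout so a malformed input file
    cannot hang forever; returns captured stdout on success."""
    try:
        proc = subprocess.run(cmd, check=True, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired as e:
        raise RuntimeError(f"command timed out after {timeout}s: {cmd}") from e
    return proc.stdout

# e.g. run_with_timeout(["ffmpeg", "-y", "-i", src, "-c:v", "copy", dst],
#                       FFMPEG_TIMEOUT)
```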
### Global concurrency control

- The video-generation entry point uses `asyncio.Semaphore(2)` to limit execution to at most 2 concurrent tasks; queued tasks show a "queued..." status.
- Redis task keys carry TTLs: 24 hours at creation, 2 hours once completed/failed; `list()` prunes expired index entries automatically.
### Subtitle timestamp optimization

- Whisper output goes through `smooth_word_timestamps()`, a three-step smoothing pass: enforce monotonically increasing times, eliminate overlaps (midpoint split), and fill tiny gaps (<50 ms).
- Supports `original_text` rhythm mapping: characters of the original script are proportionally mapped onto the Whisper timestamps, fixing mismatches between AI-rewritten/multilingual scripts and the transcription.
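The three smoothing steps above can be sketched like this. A sketch under the stated rules, not the actual `smooth_word_timestamps()` source; the word-dict shape is an assumption:

```python
def smooth_word_timestamps(words):
    """Three-pass smoothing of Whisper word timestamps, per the rules above:
    1) enforce monotonically increasing times, 2) split overlaps at the
    midpoint, 3) close tiny gaps (<50 ms). `words` is a list of
    {"word", "start", "end"} dicts; returns a smoothed copy."""
    out = [dict(w) for w in words]
    # 1) monotonicity: a word may not start before the previous one, end >= start
    for i, w in enumerate(out):
        if i > 0:
            w["start"] = max(w["start"], out[i - 1]["start"])
        w["end"] = max(w["end"], w["start"])
    # 2) overlaps: split the overlapping region at its midpoint
    for prev, cur in zip(out, out[1:]):
        if cur["start"] < prev["end"]:
            mid = (cur["start"] + prev["end"]) / 2
            prev["end"] = mid
            cur["start"] = mid
    # 3) tiny gaps (<50 ms): extend the previous word to meet the next
    for prev, cur in zip(out, out[1:]):
        gap = cur["start"] - prev["end"]
        if 0 < gap < 0.05:
            prev["end"] = cur["start"]
    return out
```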
## 📦 Asset library and static resources
@@ -190,7 +219,7 @@ pip install -r requirements.txt

### 3. Environment variables

-Copy `.env.example` to `.env` and configure the required keys:
+This repository uses `backend/.env` as the runtime-configuration baseline; replace the sensitive values for your environment and verify the key entries below (never commit real keys in production):

```ini
# Supabase

@@ -202,6 +231,18 @@ GLM_API_KEY=your_glm_api_key

# LatentSync
LATENTSYNC_GPU_ID=1

# MuseTalk (long-video lip sync)
MUSETALK_GPU_ID=0
MUSETALK_API_URL=http://localhost:8011
MUSETALK_BATCH_SIZE=32
LIPSYNC_DURATION_THRESHOLD=100

# MuseTalk tunables (example)
MUSETALK_DETECT_EVERY=2
MUSETALK_BLEND_CACHE_EVERY=2
MUSETALK_ENCODE_CRF=14
MUSETALK_ENCODE_PRESET=slow
```

### 4. Start the service

@@ -213,43 +254,11 @@ uvicorn app.main:app --host 0.0.0.0 --port 8006 --reload

---

-## 🧩 Service integration guide
+## 🧩 Development conventions and testing

-### Integrating a new model
-
-To integrate a new AI model (e.g. a new TTS engine):
-
-1. Create a new service class under `app/services/` (e.g. `NewTTSService`).
-2. Implement a `generate` method; it can shell out via subprocess or make HTTP requests.
-3. **Important**: if the model occupies a GPU, guard it with an `asyncio.Lock` to prevent OOM.
-4. Create the matching module under `app/modules/`, add router/service/schemas, and register the route in `main.py`.
-
-### Adding scheduled jobs
-
-**APScheduler** or **crontab** are currently the recommended ways to manage scheduled jobs.
-Scheduled social-media publishing currently relies on `playwright` delayed execution; a migration to a Celery queue is planned.
-
----
-
-## 🛡️ Error handling
-
-The whole project uses `Loguru` for logging.
-
-```python
-from loguru import logger
-
-try:
-    ...  # business logic
-except Exception as e:
-    logger.error(f"Operation failed: {str(e)}")
-    raise HTTPException(status_code=500, detail="Internal server error")
-```
-
----
-
-## 🧪 Testing
-
-Run the test suite:
+- For new modules, layer responsibilities, unified responses, error handling, and debugging conventions, see `Docs/BACKEND_DEV.md`.
+- After changes to core flows, run a basic smoke test: login, video generation, publishing.
+- Test command:

```bash
pytest
```
@@ -7,7 +7,8 @@

| Model | Fun-CosyVoice3-0.5B-2512 (0.5B parameters) |
| Port | 8010 |
| GPU | 0 (CUDA_VISIBLE_DEVICES=0) |
-| PM2 name | vigent2-cosyvoice (id=15) |
+| Inference precision | FP16 (automatic mixed precision) |
+| PM2 name | vigent2-cosyvoice |
| Conda env | cosyvoice (Python 3.10) |
| Launch script | `run_cosyvoice.sh` |
| Server script | `models/CosyVoice/cosyvoice_server.py` |

@@ -69,6 +70,18 @@ run_cosyvoice.sh # PM2 launch script

| ref_text | string | yes | transcript of the reference audio |
| language | string | no | language (default "Chinese"; CosyVoice auto-detects) |
| speed | float | no | speech rate (default 1.0, range 0.5-2.0, recommended 0.8-1.2) |
| instruct_text | string | no | tone instruction (default ""; when non-empty, switches to `inference_instruct2` mode) |

**Inference-mode branch:**
- `instruct_text` empty → `inference_zero_shot(text, prompt_text, ref_audio)` — plain voice cloning
- `instruct_text` non-empty → `inference_instruct2(text, instruct_text, ref_audio)` — voice cloning with tone/emotion control

**Example tone instructions:**
```
"You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>"
"You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>"
"You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>"
```

**Returns:** a WAV audio file
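The inference-mode branch above could look like this. A sketch, not the actual `cosyvoice_server.py` code; `model` stands for the loaded CosyVoice model, and the wrapper name is illustrative, but the two method names come from the table above:

```python
def pick_inference(model, text, ref_audio, ref_text, instruct_text=""):
    """Dispatch between the two CosyVoice inference modes described above."""
    if instruct_text:
        # tone/emotion-controlled voice cloning
        return model.inference_instruct2(text, instruct_text, ref_audio)
    # plain zero-shot voice cloning
    return model.inference_zero_shot(text, ref_text, ref_audio)
```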
@@ -7,8 +7,8 @@

| Server | Dell PowerEdge R730 |
| CPU | 2× Intel Xeon E5-2680 v4 (56 threads) |
| RAM | 192GB DDR4 |
-| GPU 0 | NVIDIA RTX 3090 24GB |
-| GPU 1 | NVIDIA RTX 3090 24GB (for LatentSync) |
+| GPU 0 | NVIDIA RTX 3090 24GB (MuseTalk + CosyVoice) |
+| GPU 1 | NVIDIA RTX 3090 24GB (LatentSync) |
| Deploy path | `/home/rongye/ProgramFiles/ViGent2` |

---

@@ -72,7 +72,9 @@ cd /home/rongye/ProgramFiles/ViGent2

---

-## Step 3: Deploy the AI model (LatentSync 1.6)
+## Step 3: Deploy the AI models

+### 3a. LatentSync 1.6 (short-video lip sync, GPU1)

> ⚠️ **Important**: LatentSync needs its own Conda environment and **~18GB VRAM**. Do **not** install it directly into the backend environment.

@@ -93,6 +95,26 @@ conda activate latentsync
python -m scripts.server # test that it starts, Ctrl+C to exit
```

### 3b. MuseTalk 1.5 (long-video lip sync, GPU0)

> MuseTalk is a single-step latent-space inpainting model (not a diffusion model); inference is near real-time, which suits long videos that reach the routing threshold (this repo's current `.env` example is >=100 s). It shares GPU0 with CosyVoice; fp16 inference needs roughly 4-8GB VRAM. The compositing stage now encodes directly through an FFmpeg rawvideo pipe (`libx264` + configurable CRF/preset) while keeping numpy blending, reducing lossy intermediate files.

See the detailed standalone deployment guide:
**[MuseTalk deployment guide](MUSETALK_DEPLOY.md)**

Brief steps:
1. Create a dedicated `musetalk` Conda environment (Python 3.10 + PyTorch 2.0.1 + CUDA 11.8)
2. Install mmcv/mmdet/mmpose and the other dependencies
3. Download the model weights (`download_weights.sh`)
4. Create the required symlinks (`musetalk/config.json`, `musetalk/musetalkV15`)

**Verify the MuseTalk deployment**:
```bash
cd /home/rongye/ProgramFiles/ViGent2/models/MuseTalk
/home/rongye/ProgramFiles/miniconda3/envs/musetalk/bin/python scripts/server.py
# in another terminal: curl http://localhost:8011/health
```

---

## Step 4: Install backend dependencies

@@ -114,17 +136,21 @@ pip install -r requirements.txt
playwright install chromium
```

> Tip: for Channels publishing, use system Chrome + xvfb-run (avoids headless decoding failures).
> Douyin publishing likewise prefers headful mode (`DOUYIN_HEADLESS_MODE=headful`).
+> Publish-specific notes for all four platforms are in `Docs/PUBLISH_DEPLOY.md`.

### QR-login notes

- **Cookies are isolated per user**: each user's cookies live under `backend/user_data/{uuid}/cookies/`, so concurrent multi-user logins don't interfere.
- **Douyin QR-login hard-won lessons**:
  - After scanning, **never reload the QR page**, or the session token is destroyed
  - Use a **new tab** to detect login completion (check that the URL contains `creator-micro` and that session cookies exist)
  - Douyin may pop up a **face verification**; the backend automatically extracts the verification QR and returns it for the frontend to display
+- **Xiaohongshu QR-login key points**:
+  - The creator platform may default to the SMS-login view; switch to QR login first, then capture the QR code
+  - After scanning it may redirect to `creator.xiaohongshu.com/new/home`, which doesn't necessarily match the old `publish` success-indicator URL
- **WeChat Channels publishing**: title, description, and tags are all written into the "video description" field

---
@@ -173,24 +199,23 @@ playwright install chromium

## Step 7: Configure environment variables

-```bash
-cd /home/rongye/ProgramFiles/ViGent2/backend
-
-# copy the config template
-cp .env.example .env
-```
-
-> 💡 **Note**: `.env.example` already ships the correct defaults; copying it is enough.
-> To customize, edit `.env` and adjust the parameters below:
-
-| Setting | Default | Description |
-|--------|--------|------|
-| `SUPABASE_URL` | `http://localhost:8008` | Supabase API internal address |
-| `SUPABASE_PUBLIC_URL` | `https://api.hbyrkj.top` | Supabase API public address (frontend access) |
-| `LATENTSYNC_GPU_ID` | 1 | GPU selection (0 or 1) |
-| `LATENTSYNC_USE_SERVER` | false | set true to enable the resident-service speedup |
-| `LATENTSYNC_INFERENCE_STEPS` | 20 | inference steps (20-50) |
-| `LATENTSYNC_GUIDANCE_SCALE` | 1.5 | guidance scale (1.0-3.0) |
+```bash
+cd /home/rongye/ProgramFiles/ViGent2/backend
+```
+
+> 💡 **Note**: this repository uses `backend/.env` directly. Replace the sensitive values for your environment and confirm the parameters below.
+> To customize, edit `.env` and adjust:
+
+| Setting | Current example value | Description |
+|--------|------------|------|
+| `SUPABASE_URL` | `http://localhost:8008` | Supabase API internal address |
+| `SUPABASE_PUBLIC_URL` | `https://api.hbyrkj.top` | Supabase API public address (frontend access) |
+| `LATENTSYNC_GPU_ID` | 1 | GPU selection (0 or 1) |
+| `LATENTSYNC_USE_SERVER` | true | set true to enable the resident-service speedup |
+| `LATENTSYNC_INFERENCE_STEPS` | 30 | inference steps (16-50) |
+| `LATENTSYNC_GUIDANCE_SCALE` | 1.9 | guidance scale (1.0-3.0) |
+| `LATENTSYNC_ENABLE_DEEPCACHE` | true | DeepCache inference speedup |
+| `LATENTSYNC_SEED` | 1247 | fixed random seed (reproducible output) |
| `DEBUG` | true | set to false in production |
| `REDIS_URL` | `redis://localhost:6379/0` | task-state store (falls back to memory when unavailable) |
| `WEIXIN_HEADLESS_MODE` | headless-new | Channels Playwright mode (headful/headless-new) |
@@ -205,14 +230,26 @@ cp .env.example .env

| `DOUYIN_CHROME_PATH` | `/usr/bin/google-chrome` | Douyin Chrome path |
| `DOUYIN_BROWSER_CHANNEL` | | Douyin Chromium channel (optional) |
| `DOUYIN_USER_AGENT` | Chrome/144 UA | Douyin browser-fingerprint UA |
| `DOUYIN_LOCALE` | zh-CN | Douyin locale |
| `DOUYIN_TIMEZONE_ID` | Asia/Shanghai | Douyin timezone |
| `DOUYIN_FORCE_SWIFTSHADER` | true | force software WebGL |
+| `XIAOHONGSHU_HEADLESS_MODE` | headless-new | Xiaohongshu Playwright mode (headful/headless-new) |
+| `XIAOHONGSHU_CHROME_PATH` | `/usr/bin/google-chrome` | Xiaohongshu Chrome path |
+| `XIAOHONGSHU_BROWSER_CHANNEL` | | Xiaohongshu Chromium channel (optional) |
+| `XIAOHONGSHU_USER_AGENT` | Chrome/144 UA | Xiaohongshu browser-fingerprint UA |
+| `XIAOHONGSHU_LOCALE` | zh-CN | Xiaohongshu locale |
+| `XIAOHONGSHU_TIMEZONE_ID` | Asia/Shanghai | Xiaohongshu timezone |
+| `XIAOHONGSHU_FORCE_SWIFTSHADER` | true | force software WebGL |
| `DOUYIN_DEBUG_ARTIFACTS` | false | keep debug screenshots |
| `DOUYIN_RECORD_VIDEO` | false | record browser-session video |
| `DOUYIN_KEEP_SUCCESS_VIDEO` | false | keep the recording after success |
| `CORS_ORIGINS` | `*` | allowed CORS origins (use a whitelist in production) |
| `DOUYIN_COOKIE` | empty | Douyin video-download cookie (script-extraction feature) |
+| `MUSETALK_GPU_ID` | 0 | MuseTalk GPU index |
+| `MUSETALK_API_URL` | `http://localhost:8011` | MuseTalk resident-service address |
+| `MUSETALK_BATCH_SIZE` | 32 | MuseTalk inference batch size |
+| `MUSETALK_VERSION` | v15 | MuseTalk model version |
+| `MUSETALK_USE_FLOAT16` | true | MuseTalk half-precision speedup |
+| `LIPSYNC_DURATION_THRESHOLD` | 100 | seconds; at or above this uses MuseTalk, below uses LatentSync (code default 120; set it explicitly in `.env`) |
| `ALIPAY_APP_ID` | empty | Alipay app APPID |
| `ALIPAY_PRIVATE_KEY_PATH` | empty | app private-key PEM path |
| `ALIPAY_PUBLIC_KEY_PATH` | empty | Alipay public-key PEM path |

@@ -271,6 +308,13 @@ cd /home/rongye/ProgramFiles/ViGent2/models/LatentSync
conda activate latentsync
python -m scripts.server
```

+### Start MuseTalk (terminal 4, long-video lip sync)
+
+```bash
+cd /home/rongye/ProgramFiles/ViGent2/models/MuseTalk
+/home/rongye/ProgramFiles/miniconda3/envs/musetalk/bin/python scripts/server.py
+```

### Verify
@@ -364,7 +408,27 @@ pm2 save
curl http://localhost:8010/health
```

-### 5. Start the service watchdog (Watchdog)
+### 5. Start the MuseTalk long-video lip-sync service

> Audio at or above the threshold (this repo's current `.env` example is >=100 s) is routed to MuseTalk automatically; when MuseTalk is unavailable it falls back to LatentSync.
> Detailed deployment steps: [MuseTalk deployment guide](MUSETALK_DEPLOY.md).

1. The launch script lives in the project root: `run_musetalk.sh`

2. Start with pm2:
```bash
cd /home/rongye/ProgramFiles/ViGent2
pm2 start ./run_musetalk.sh --name vigent2-musetalk
pm2 save
```

3. Verify the service:
```bash
curl http://localhost:8011/health
# {"status":"ok","model_loaded":true}
```

+### 6. Start the service watchdog (Watchdog)

> 🛡️ **Recommended**: monitors the health of the CosyVoice and LatentSync services and auto-restarts them when they hang.

@@ -381,6 +445,8 @@ pm2 save
pm2 startup
```

+> **Tip**: the full PM2 process list should contain 5-6 services: vigent2-backend, vigent2-frontend, vigent2-latentsync, vigent2-cosyvoice, vigent2-musetalk, vigent2-watchdog.

### Common pm2 commands

```bash
@@ -388,6 +454,7 @@ pm2 status # show the status of all services
pm2 logs # all logs
pm2 logs vigent2-backend # backend logs
pm2 logs vigent2-cosyvoice # CosyVoice logs
+pm2 logs vigent2-musetalk # MuseTalk logs
pm2 restart all # restart everything
pm2 stop vigent2-latentsync # stop the LatentSync service
pm2 delete all # delete all services

@@ -527,6 +594,7 @@ sudo lsof -i :8006
sudo lsof -i :3002
sudo lsof -i :8007
sudo lsof -i :8010 # CosyVoice
+sudo lsof -i :8011 # MuseTalk
```

### View logs

@@ -537,6 +605,7 @@ pm2 logs vigent2-backend
pm2 logs vigent2-frontend
pm2 logs vigent2-latentsync
pm2 logs vigent2-cosyvoice
+pm2 logs vigent2-musetalk
```

### SSH connection lag / slow system response
Docs/DevLogs/Day27.md (new file, 231 lines)
@@ -0,0 +1,231 @@
## Remotion stroke fix + font-style expansion + TypeScript fixes (Day 27)

### Overview

Fixed the title/subtitle stroke rendering (strokes too thick + subtitle ghosting), expanded the font-style options (titles 4→12, subtitles 4→8), and fixed TypeScript type errors in the Remotion project.

---

## ✅ Changes

### 1. Stroke-rendering fix (titles + subtitles)

- **Problem**: the black title stroke was too thick, and subtitles showed ghosting/double images
- **Root cause**: `buildTextShadow` faked the stroke with a 4-direction `textShadow` — the diagonal overlap made the stroke look thicker than the actual `stroke_size`, and the 4 corner directions left gaps and overlaps between them, causing ghosting
- **Fix**: switch to the native CSS stroke, `-webkit-text-stroke` + `paint-order: stroke fill` (Remotion renders with Chromium, which supports both fully)
- **Old approach**:
  ```javascript
  textShadow: `-8px -8px 0 #000, 8px -8px 0 #000, -8px 8px 0 #000, 8px 8px 0 #000, 0 0 16px rgba(0,0,0,0.5), 0 2px 4px rgba(0,0,0,0.3)`
  ```
- **New approach**:
  ```javascript
  WebkitTextStroke: `5px #000000`,
  paintOrder: 'stroke fill',
  textShadow: `0 2px 4px rgba(0,0,0,0.3)`,
  ```
- Also lowered `stroke_size` from 8 to 5 in all preset styles; combined with the native stroke this looks much cleaner

### 2. Font-style expansion

**Title styles**: 4 → 12 (+8)

| ID | Style name | Font | Colors |
|----|--------|------|------|
| title_pangmen | 庞门正道 | 庞门正道标题体3.0 | white text, black stroke |
| title_round | 优设标题圆 | 优设标题圆 | white text, purple stroke |
| title_alibaba | 阿里数黑体 | 阿里巴巴数黑体 | white text, black stroke |
| title_chaohei | 文道潮黑 | 文道潮黑 | cyan-blue text, dark-blue stroke |
| title_wujie | 无界黑 | 标小智无界黑 | white text, dark-gray stroke |
| title_houdi | 厚底黑 | Aa厚底黑 | red text, deep-black stroke |
| title_banyuan | 寒蝉半圆体 | 寒蝉半圆体 | white text, black stroke |
| title_jixiang | 欣意吉祥宋 | 字体圈欣意吉祥宋 | gold text, brown stroke |

**Subtitle styles**: 4 → 8 (+4)

| ID | Style name | Font | Highlight color |
|----|--------|------|--------|
| subtitle_pink | 少女粉 | DingTalk JinBuTi | pink #FF69B4 |
| subtitle_lime | 清新绿 | DingTalk Sans | neon green #76FF03 |
| subtitle_gold | 金色隶书 | 阿里妈妈刀隶体 | gold #FDE68A |
| subtitle_kai | 楷体红字 | SimKai | red #FF4444 |

### 3. TypeScript type-error fixes

- **Root.tsx**: the `Composition` generic type didn't match the `calculateMetadata` parameter type — inlined `calculateMetadata` with explicitly annotated parameter types; `defaultProps` is now constrained with `satisfies VideoProps`
- **Video.tsx**: added a `[key: string]: unknown` index signature to the `VideoProps` interface, to satisfy Remotion's `Record<string, unknown>` constraint
- **VideoLayer.tsx**: the `OffthreadVideo` component doesn't support a `loop` prop — removed it (the prop was being ignored anyway)

### 4. Progress-bar copy reverted

- **Problem**: the progress bar showed the backend's detailed stage messages (e.g. "synthesizing lip sync"); users wanted only "正在AI生成中..." ("AI generating...")
- **Fix**: the `HomePage.tsx` progress-bar copy changed from `{currentTask.message || "正在AI生成中..."}` to the fixed string `正在AI生成中...`

---

## 📁 Modified files

| File | Change |
|------|------|
| `remotion/src/components/Title.tsx` | `buildTextShadow` → `buildStrokeStyle` (native CSS stroke), applied to title and subtitle alike |
| `remotion/src/components/Subtitles.tsx` | `buildTextShadow` → `buildStrokeStyle` (native CSS stroke) |
| `remotion/src/Root.tsx` | fixed the `Composition` generic and `calculateMetadata` parameter types |
| `remotion/src/Video.tsx` | added an index signature to `VideoProps` |
| `remotion/src/components/VideoLayer.tsx` | removed the unsupported `loop` prop from `OffthreadVideo` |
| `backend/assets/styles/title.json` | title styles expanded 4 → 12, `stroke_size` 8→5 |
| `backend/assets/styles/subtitle.json` | subtitle styles expanded 4 → 8 |
| `frontend/.../HomePage.tsx` | progress-bar copy reverted to the fixed "正在AI生成中..." |

---

## 🔍 Verification

- `npx tsc --noEmit` — zero errors
- `npm run build:render` — render script compiles
- `npm run build` (frontend) — zero errors
- Strokes: titles/subtitles/captions use the native CSS stroke; no ghosting, no bloated strokes
- Style selection: the frontend dropdown loads all 12 title + 8 subtitle styles

---
## Video-generation pipeline performance optimization

### Overview

Comprehensive performance optimization across the video-generation pipeline: FFmpeg encoding parameters, LatentSync inference parameters, multi-material parallelization, and post-processing parallelization. Estimated impact: a 15 s single-material video drops from ~280 s to ~190 s (32%); a 30 s two-material video from ~400 s to ~240 s (40%).

**Server**: 2x RTX 3090 (24GB), 2x Xeon E5-2680 v4 (56 cores), 192GB RAM

### Phase 1: FFmpeg encoding

**Final compose preset `slow` → `medium`**
- the compose stage drops from ~50 s to ~25 s with near-identical quality

**Intermediate files CRF 18 → 23**
- intermediates (trim, prepare_segment, concat, loop, normalize_orientation) are not final output and don't need high-quality encoding
- each intermediate step gets 3-8 seconds faster

**Final compose CRF 18 → 20**
- for a 15-second talking-head video, CRF 18 vs 20 is visually indistinguishable

### Phase 2: LatentSync inference tuning

**inference_steps 20 → 16**
- inference time drops linearly by 20% (~180 s → ~144 s)

**guidance_scale 2.0 → 1.5**
- lower classifier-free-guidance weight slightly reduces per-step compute (5-10%)

> ⚠️ Both changes require restarting the LatentSync service and checking lip-sync quality; keep them only if quality is acceptable, otherwise revert the .env parameters.

### Phase 3: Multi-material pipeline parallelization

**Parallel material download + normalization**
- the serial `for` loop became `asyncio.gather()`; `normalize_orientation` runs in a thread pool via `run_in_executor`
- N materials go from a serial N×5 s to ~5 s

**Parallel segment preprocessing**
- one-by-one `prepare_segment` became `asyncio.gather()` + `run_in_executor`
- 2 materials: ~90 s → ~50 s; 4 materials: ~180 s → ~60 s

### Phase 4: Pipeline overlap

**Whisper subtitle alignment in parallel with BGM mixing**
- the two are independent (both depend only on audio_path), so they run together via `asyncio.gather()`
- in single-material mode, Whisper moves from a serial step after LatentSync to running alongside BGM
- behavior is unchanged when BGM or subtitles are disabled; they only parallelize when both are enabled
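The parallelization pattern used in phases 3 and 4 above can be sketched as follows. This is an illustrative sketch, not the actual `workflow.py` code: blocking, CPU/subprocess-bound steps are pushed onto the default thread pool via `run_in_executor`, then `asyncio.gather` awaits them together.

```python
import asyncio
from functools import partial

async def prepare_all_segments(segments, prepare_segment):
    """Run a blocking per-segment preprocessing function for all segments
    concurrently in the default thread pool; results keep input order."""
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, partial(prepare_segment, seg))
             for seg in segments]
    return await asyncio.gather(*tasks)
```

The same `asyncio.gather` call also covers the Whisper + BGM overlap, since those two coroutines share no dependencies beyond the audio path.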
### Modified files

| File | Change |
|------|------|
| `backend/app/services/video_service.py` | compose: preset slow→medium, CRF 18→20; normalize_orientation/prepare_segment/concat: CRF 18→23 |
| `backend/app/services/lipsync_service.py` | _loop_video_to_duration: CRF 18→23 |
| `backend/.env` | LATENTSYNC_INFERENCE_STEPS=16, LATENTSYNC_GUIDANCE_SCALE=1.5 |
| `backend/app/modules/videos/workflow.py` | import asyncio; parallel material download/normalization; parallel segment preprocessing; Whisper+BGM in parallel |

### Rollback plan

- FFmpeg parameters: if quality is unsatisfactory, set the final CRF back to 18 and the preset back to slow
- LatentSync: if lip-sync quality degrades, set `INFERENCE_STEPS` back to 20 and `GUIDANCE_SCALE` back to 2.0 in .env
- Parallelization: pure architecture optimization with no quality impact; no rollback needed

---

## MuseTalk + LatentSync hybrid lip-sync scheme

### Overview

LatentSync 1.6 is high quality but extremely slow at inference (~78% of total runtime); long videos (>=2 min) take an unacceptable 20-60 minutes. MuseTalk 1.5 is single-step latent-space inpainting (not a diffusion model) with near-real-time per-frame inference (30fps+ on a V100), which suits long videos. The hybrid scheme routes automatically by audio duration: short videos use LatentSync for quality, long videos use MuseTalk for speed.

### Architecture

- **Routing threshold**: `LIPSYNC_DURATION_THRESHOLD` (default 120 s)
- **Short videos (<120 s)**: LatentSync 1.6 (GPU1, port 8007)
- **Long videos (>=120 s)**: MuseTalk 1.5 (GPU0, port 8011)
- **Fallback**: when MuseTalk is unavailable, automatically fall back to LatentSync

### Changed files

| File | Change |
|------|------|
| `models/MuseTalk/` | code copied from Temp/MuseTalk + weights downloaded |
| `models/MuseTalk/scripts/server.py` | new resident FastAPI service (port 8011, GPU0) |
| `backend/app/core/config.py` | added MUSETALK_* and LIPSYNC_DURATION_THRESHOLD |
| `backend/.env` | added the matching environment variables |
| `backend/app/services/lipsync_service.py` | added `_call_musetalk_server()` + hybrid routing logic + extended `check_health()` |

---

## MuseTalk inference performance optimization (server.py v2)

### Overview

MuseTalk's first long-video test (136 s, 3404 frames) took 1799 s (~30 minutes). Profiling showed the bottlenecks were face detection (28%), BiSeNet compositing (22%), and I/O (17%), not the UNet inference itself (17%). Six optimizations are estimated to bring this down to 8-10 minutes (~3x speedup).

### Bottleneck profile (before optimization, 1799 s)

| Stage | Time | Share | Cause |
|------|------|------|---------|
| DWPose + face detection | ~510 s | 28% | `batch_size_fa=1`, 2 NNs per frame, fully serial |
| Compositing + BiSeNet face parsing | ~400 s | 22% | BiSeNet runs on every frame + PNG writes to disk |
| UNet inference | ~300 s | 17% | batch_size=8 too small |
| I/O (PNG read/write + FFmpeg) | ~300 s | 17% | slow PNG compression; ffmpeg→PNG→imread chain |
| VAE encoding | ~100 s | 6% | per-frame encoding, not batched |

### The 6 optimizations

| # | Optimization | Details |
|---|--------|------|
| 1 | **batch_size 8→32** | `.env` change; the RTX 3090 has VRAM to spare |
| 2 | **Direct frame reads via cv2.VideoCapture** | skips the ffmpeg→PNG→imread chain, saving 3404 PNG encode/decodes |
| 3 | **Subsampled face detection (every 5th frame)** | run DWPose + FaceAlignment every 5 frames, linearly interpolating the bbox for intermediate frames |
| 4 | **BiSeNet mask caching (every 5th frame)** | run `get_image_prepare_material` every 5 frames; intermediate frames reuse the cached mask via `get_image_blending` |
| 5 | **Direct writes via cv2.VideoWriter** | skips per-frame PNG writes + the ffmpeg re-encode; VideoWriter writes the mp4 directly |
| 6 | **Per-stage timing** | precise timing across 7 stages to guide further tuning |
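The bbox interpolation in optimization #3 can be sketched like this. A sketch under the stated scheme, not the actual `_detect_faces_subsampled()` code; the bbox tuple layout is an assumption:

```python
def interpolate_bboxes(keyframe_bboxes, every: int, total_frames: int):
    """Linearly interpolate face bounding boxes between detection keyframes.
    keyframe_bboxes[i] is the (x1, y1, x2, y2) detected at frame i * every;
    returns one bbox per frame."""
    out = []
    for f in range(total_frames):
        k, r = divmod(f, every)
        a = keyframe_bboxes[min(k, len(keyframe_bboxes) - 1)]
        b = keyframe_bboxes[min(k + 1, len(keyframe_bboxes) - 1)]
        t = r / every  # 0 at a keyframe, approaching 1 just before the next
        out.append(tuple(av + (bv - av) * t for av, bv in zip(a, b)))
    return out
```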
### Modified files

| File | Change |
|------|------|
| `models/MuseTalk/scripts/server.py` | fully rewrote `_run_inference()`, added `_detect_faces_subsampled()` |
| `backend/.env` | `MUSETALK_BATCH_SIZE` 8→32 |

---

## Remotion concurrent-rendering optimization

### Overview

Remotion rendering used only 8 concurrent workers by default on the 56-core server (`min(8, cores/2)`); raising it to 16 should cut rendering from ~5 minutes to ~2-3 minutes.

### Changes

- `remotion/render.ts`: `renderMedia()` gains a `concurrency` parameter (default 16), overridable via a `--concurrency` CLI flag
- `remotion/dist/render.js`: recompiled

### Modified files

| File | Change |
|------|------|
| `remotion/render.ts` | `RenderOptions` gains a `concurrency` field; `renderMedia()` receives `concurrency` |
| `remotion/dist/render.js` | recompiled from TypeScript |
Docs/DevLogs/Day28.md (new file, 263 lines)
@@ -0,0 +1,263 @@
## CosyVoice FP16 speedup + docs update + AI-rewrite UI refactor + title/subtitle panel reorder with video-frame preview (Day 28)

### Overview

Enabled FP16 half-precision inference for the CosyVoice 3.0 voice-cloning service, estimated 30-40% faster, and updated 4 project docs in sync. Refactored the AI script-rewrite UI (a two-step RewriteModal flow + logic extracted from ScriptExtractionModal). On the frontend, moved the "Title & Subtitles" panel from step two to step four (after material editing) and replaced the style-preview background from a purple-pink gradient to a screenshot of the video's opening frame, for true WYSIWYG.

---

## ✅ Changes

### 1. CosyVoice FP16 half-precision speedup

- **Problem**: CosyVoice 3.0 ran at full FP32 precision with an RTF (real-time factor) of about 0.9-1.35x; generating 2 minutes of audio took roughly 2 minutes
- **Root cause**: `AutoModel()` was initialized without `fp16=True`, so both LLM inference and flow matching (DiT) ran in FP32
- **Fix**: a one-line change enables FP16 automatic mixed precision

```python
# old: _model = AutoModel(model_dir=str(MODEL_DIR))
# new:
_model = AutoModel(model_dir=str(MODEL_DIR), fp16=True)
```

- **How it takes effect**: `CosyVoice3Model` wraps `llm_job()` and `token2wav()` in `torch.cuda.amp.autocast(self.fp16)`, automatically casting compute to FP16
- **Expected effects**:
  - 30-40% faster inference
  - ~30% lower VRAM usage
  - essentially lossless audio quality (FP16 precision is plenty for a 0.5B model)
- **Verification**: the self-test passed after restarting the service; the health check reports `ready: true`

### 2. Full documentation update (4 files)

Propagated Day 27's additions (the MuseTalk hybrid lip-sync scheme, performance optimizations, Remotion concurrent rendering) into all related docs.

#### README.md
- project description updated to "LatentSync 1.6 + MuseTalk 1.5 hybrid lip sync"
- lip-sync feature description changed to the hybrid scheme (LatentSync for short videos, MuseTalk for long)
- tech-stack table gains MuseTalk 1.5
- project structure gains `models/MuseTalk/`
- service-architecture table gains MuseTalk (port 8011)
- docs hub gains a link to the MuseTalk deployment guide
- performance notes gain subsampled detection + Remotion 16-way concurrency

#### DEPLOY_MANUAL.md
- GPU assignment updated (GPU0=MuseTalk+CosyVoice, GPU1=LatentSync)
- step 3 split into 3a (LatentSync) + 3b (MuseTalk)
- environment-variable table gains 7 MuseTalk variables; removed the outdated `DOUYIN_COOKIE`
- LatentSync inference-step default 20→16
- test run gains a MuseTalk startup terminal
- PM2 management gains the MuseTalk service (item 5)
- port checks and log commands gain 8011/vigent2-musetalk

#### SUBTITLE_DEPLOY.md
- architecture diagram updated to the LatentSync/MuseTalk hybrid routing
- new lip-sync routing description
- Remotion config table gains the `concurrency` parameter (default 16)
- GPU assignment updated
- changelog gains a v1.3.0 entry

#### BACKEND_README.md
- health-check endpoint description updated to cover LatentSync + MuseTalk + the hybrid routing threshold
- environment-variable section gains the MuseTalk variables
- service-integration guide gains a "hybrid lip-sync routing" chapter

---

### 3. AI script-rewrite UI refactor

#### RewriteModal refactor

The AI-rewrite modal becomes a two-step flow for a better interaction:

**Step 1 — configure and trigger**:
- optional custom-prompt input, auto-persisted to localStorage
- a "Start rewrite" button fires the `/api/ai/rewrite` request

**Step 2 — compare results and choose**:
- top: the AI rewrite + a "Use this result" button (purple-pink gradient, prominent)
- bottom: the original for comparison + a "Keep original" button (muted gray)
- footer: "Rewrite again" (back to step 1, preserving the custom prompt)
- ESC closes the modal

#### ScriptExtractionModal logic extraction

All business logic of the script-extraction modal moved into a standalone hook, `useScriptExtraction`:

- **useScriptExtraction.ts** (new): manages the dual URL/file input modes, drag-and-drop upload, the extraction request, the step state machine (config → processing → result), and clipboard copy
- **ScriptExtractionModal.tsx**: now a pure presentational component consuming the hook; gains ESC/Enter shortcuts

#### ScriptEditor toolbar tweaks

- button group right-aligned (`justify-end`), unified `h-7` height and corner radius
- the "History" button uses gray (bg-gray-600) to mark it as auxiliary
- "Script extraction assistant" uses purple (bg-purple-600) as the primary feature
- "AI multilingual" uses a green gradient (emerald-teal); "AI generate title & tags" a blue gradient (blue-cyan)
- "AI smart rewrite" and "Save script" moved to the status bar under the textarea

---
|
||||
|
||||
### 4. 标题字幕面板重排 + 视频帧背景预览
|
||||
|
||||
#### 面板顺序重排
|
||||
|
||||
将 `<TitleSubtitlePanel>` 从第二步移至第四步(素材编辑之后),使用户在设置标题字幕样式时已经完成了素材选择和时间轴编排。
|
||||
|
||||
新顺序:
|
||||
```
|
||||
一、文案提取与编辑(不变)
|
||||
二、配音(原三)
|
||||
三、素材编辑(原四)
|
||||
四、标题与字幕(原二)→ 移到素材编辑之后
|
||||
```
|
||||
|
||||
#### 新建 useVideoFrameCapture hook
|
||||
|
||||
从视频 URL 截取 0.1s 处帧画面,返回 JPEG data URL:
|
||||
|
||||
- 创建 `<video>` 元素,设置 `crossOrigin="anonymous"`(素材存储在 Supabase Storage 跨域地址)
|
||||
- 先绑定 `loadedmetadata` / `canplay` / `seeked` / `error` 事件监听,再设 src(避免事件丢失)
|
||||
- `loadedmetadata` 或 `canplay` 触发后 seek 到 0.1s,`seeked` 回调中用 canvas `drawImage` 截帧
|
||||
- canvas 缩放到 480px 宽再编码(预览窗口最大 280px,节省内存)
|
||||
- `canvas.toDataURL("image/jpeg", 0.7)` 导出
|
||||
- 防御 `videoWidth/videoHeight` 为 0 的边界情况
|
||||
- try-catch 防 canvas taint,失败返回 null(降级渐变)
|
||||
- `isActive` 标志 + `seeked` 去重标志防止 stale 和重复更新
|
||||
- 截图完成后清理 video 元素释放内存
|
||||
|
||||
#### On-demand capture (performance)

Capture only runs while the style preview window is open:

```typescript
const materialPosterUrl = useVideoFrameCapture(
  showStylePreview ? firstTimelineMaterialUrl : null
);
```

The capture source prefers the **first timeline segment** (the real opening clip after the user's drag-reorder), falling back to `selectedMaterials[0]` (when no voiceover has been generated and the timeline is empty).
#### Preview background replacement

When a video frame is available, `FloatingStylePreview` shows the raw frame directly (no translucency, so colors stay true) and relies on text stroke for readability; without a frame it falls back to the original purple-pink gradient.

#### Pitfalls

1. **CORS tainted canvas**: materials are stored on Supabase Storage (`api.hbyrkj.top`) behind cross-origin signed URLs. `video.crossOrigin = "anonymous"` is required, or canvas `toDataURL` throws a SecurityError
2. **Empty timeline**: `useTimelineEditor` returns an empty array when `audioDuration <= 0` (no voiceover selected), so we must fall back to `selectedMaterials[0]`
3. **Listener order**: listeners must be bound before setting `video.src`, or events can be missed on fast loads

---
## 📁 Modified files

| File | Change |
|------|--------|
| `models/CosyVoice/cosyvoice_server.py` | `AutoModel()` gains the `fp16=True` argument |
| `README.md` | Hybrid lip-sync description, tech stack, service architecture, project structure updates |
| `Docs/DEPLOY_MANUAL.md` | MuseTalk deployment steps, env vars, PM2 management, port checks |
| `Docs/SUBTITLE_DEPLOY.md` | Architecture diagram, Remotion concurrency, GPU allocation, changelog |
| `Docs/BACKEND_README.md` | Health checks, env vars, hybrid-routing chapter |
| `frontend/.../RewriteModal.tsx` | Two-step rewrite flow (custom prompt → result comparison) |
| `frontend/.../script-extraction/useScriptExtraction.ts` | **New** — script-extraction logic hook |
| `frontend/.../ScriptExtractionModal.tsx` | Pure presentational component consuming the hook; new shortcuts |
| `frontend/.../ScriptEditor.tsx` | Toolbar right-aligned + color-coded buttons + rewrite/save moved to the bottom |
| `frontend/.../useVideoFrameCapture.ts` | **New** — video-frame capture hook, crossOrigin + canvas downscaling |
| `frontend/.../useHomeController.ts` | New useMemo computing the material URL; calls the capture hook, gated by showStylePreview |
| `frontend/.../HomePage.tsx` | Panel reorder (steps 2↔4 swapped), renumbering, passes materialPosterUrl through |
| `frontend/.../TitleSubtitlePanel.tsx` | Step label "2" → "4"; new previewBackgroundUrl prop |
| `frontend/.../FloatingStylePreview.tsx` | New previewBackgroundUrl prop; conditionally renders video frame vs gradient |

---
## 🔍 Verification

- CosyVoice restarted cleanly; health check returns `{"ready": true}`
- Self-test inference passed (7.2s for "你好")
- FP16 takes effect in the LLM and Flow Matching stages via `torch.cuda.amp.autocast(self.fp16)`
- `npx tsc --noEmit` — zero errors
- AI rewrite: custom prompt persists → rewrite result + original comparison → "Use this result" / "Keep original"
- Script extraction: URL/file dual mode → processing animation → result filled in
- Panel order: 1 script, 2 voiceover, 3 material editing, 4 title & subtitles
- Style-preview background: real opening video frame when materials exist, purple-pink gradient otherwise
- No capture runs while the preview is closed; no wasted work

---
## 💡 CosyVoice performance notes

### Current baseline (FP32, pre-optimization)

| Text length | Audio length | Inference time | RTF |
|-------------|--------------|----------------|-----|
| 42 chars | 9.8s | 13.2s | 1.35x |
| 89 chars | 18.2s | 20.3s | 1.12x |
| ~530 chars | 115.8s | 107.7s | 0.93x |
| ~670 chars | 143.5s | 131.6s | 0.92x |

### Optional future optimizations (diminishing returns; deferred)

| Optimization | Expected gain | Complexity |
|--------------|---------------|------------|
| TensorRT (DiT module) | +20-30% | Requires compiling a .plan engine |
| torch.compile() | +10-20% | One line of code, but slow first compile |
| vLLM (LLM module) | +10-15% | Extra dependency |

---
## MuseTalk synthesis-phase performance optimization

### Overview

MuseTalk v2 optimizations cut total time from 1799s to 819s (2.2x), but the synthesis phase (Phase 6) still takes 462.2s (56.4%) — the single largest bottleneck. Two directions this round: pure-numpy blending instead of PIL conversions, and an FFmpeg pipe with NVENC GPU hardware encoding instead of double encoding.

### 1. Pure-numpy blending replaces PIL (blending.py)

- **Problem**: `get_image_blending` does three numpy↔PIL conversions plus BGR↔RGB channel flips per frame — pure waste
- **Fix**: new `get_image_blending_fast()` function
  - Stays in BGR numpy arrays throughout; no PIL conversion or channel flips
  - Mask blending uses vectorized numpy broadcasting, `mask * (1/255)`, instead of `PIL.paste` with a mask
  - The original `get_image_blending` is kept as a fallback
- **Fallback chain**: `blending_fast` → `blending` (PIL) → `get_image` (full recompute)
### 2. FFmpeg pipe + NVENC hardware encoding replaces double encoding (server.py)

**Before (double encoding)**:
```
Phase 6: per frame → cv2.VideoWriter (mp4v, CPU software encode) → temp_raw.mp4
Phase 7: FFmpeg reads temp_raw.mp4 → H.264 CPU re-encode + audio mux → output.mp4
```

**After (single GPU encode)**:
```
Phase 6: per frame → FFmpeg stdin pipe (rawvideo → h264_nvenc GPU encode) → temp_raw.mp4
Phase 7: FFmpeg only muxes audio (-c:v copy -c:a copy) → output.mp4 (seconds)
```

- NVENC parameters: `-c:v h264_nvenc -preset p4 -cq 20 -pix_fmt yuv420p`
- The RTX 3090 encodes on its dedicated NVENC chip, leaving CUDA cores free; encode speed >500fps
### 3. FFmpeg process resource hardening

- The frame-writing loop is wrapped in `try/finally` so `proc.stdin.close()` runs even on exceptions
- stderr is read after `proc.wait()`, then closed, avoiding buffer deadlocks
- stderr decode uses `errors="ignore"` so non-UTF-8 output cannot crash the server
### 4. `run_ffmpeg` safety improvements

- Dropped `shell=True` in favor of list arguments, so special characters in paths cannot become command injection
- The Phase 7 FFmpeg command moved from string concatenation to list arguments
### Tuning history

| Version | Phase 6 | Phase 7 | Total | Verdict |
|---------|---------|---------|-------|---------|
| Day27 baseline | 462s | 38s | 819s | — |
| v1: libx264 -preset medium | 548s | 0.3s | 854s | CPU-encoder back-pressure; actually slower |
| v2: h264_nvenc (current) | TBD | TBD | TBD | NVENC has zero back-pressure; Phase 6 estimated < 200s |

### Modified files

| File | Change |
|------|--------|
| `models/MuseTalk/musetalk/utils/blending.py` | New pure-numpy `get_image_blending_fast()` |
| `models/MuseTalk/scripts/server.py` | Phase 6: FFmpeg pipe + NVENC + blending_fast; Phase 7: -c:v copy; `run_ffmpeg` drops shell=True |
**New file**: `Docs/DevLogs/Day29.md` (283 lines)
## Subtitle sync fix + lip-sync parameter tuning + full pipeline audit + preview background fix + CosyVoice emotion control (Day 29)

### Overview

This round audits the whole video-generation pipeline: fixed subtitle/voice desync (Whisper timestamp smoothing + original-text rhythm mapping), tuned LatentSync lip-sync parameters, switched compose to stream copy to drop a redundant re-encode, added FFmpeg timeout protection, a global concurrency limit, Redis task TTLs, temp-file cleanup, and dead-code removal. Fixed the style-preview background CORS breakage caused by the frontend domain migration. Added CosyVoice emotion control: voice-clone mode now supports happy/sad/angry expression (via `inference_instruct2`).

---
## ✅ Changes

### 1. Subtitle sync fix (Whisper timestamps + original-text rhythm mapping)

- **Problem**: subtitle highlighting drifted against the voice — ahead or behind, with highlight gaps
- **Root cause**: Whisper's per-character timestamps jitter slightly (a character's end can exceed the next one's start), and inter-character gaps made the highlight "flicker"

#### whisper_service.py — timestamp post-processing

New `smooth_word_timestamps()` function; three smoothing steps:

1. **Monotonicity**: a character's start never precedes the previous character's start
2. **Overlap removal**: overlapping characters are split at the midpoint
3. **Gap filling**: gaps under 50ms are joined directly, so the highlight never skips

```python
def smooth_word_timestamps(words):
    """In-place smoothing of Whisper word timestamps."""
    for i in range(1, len(words)):
        prev, w = words[i - 1], words[i]
        # monotonicity: never start before the previous word
        w["start"] = max(w["start"], prev["start"])
        # overlap → split at the midpoint
        if w["start"] < prev["end"]:
            mid = (prev["end"] + w["start"]) / 2
            prev["end"] = mid
            w["start"] = mid
        # tiny gap → join directly
        gap = w["start"] - prev["end"]
        if 0 < gap < 0.05:
            prev["end"] = w["start"]
    return words
```
#### whisper_service.py — original-text rhythm mapping

- **Problem**: AI-rewritten/multilingual scripts differ from Whisper's transcription; using Whisper's text directly would garble the subtitles
- **Fix**: when the `original_text` parameter is non-empty, replace Whisper's characters with the original text while keeping Whisper's speech-rhythm timestamps
- Implementation: map N original characters onto M Whisper timestamps proportionally (linear interpolation)
- Warn when the character-count ratio looks abnormal (>1.5x or <0.67x)
- Clamp per-character duration to 40ms–800ms to prevent extreme drift
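A minimal sketch of the proportional mapping: each original character picks its timestamp by proportional index into the Whisper timeline, with the duration clamp applied. The real implementation also handles multi-character Whisper tokens and the ratio warning; the function name is illustrative:

```python
def map_text_to_timestamps(original_text, whisper_words,
                           min_dur=0.04, max_dur=0.8):
    """Map N original characters onto the rhythm of M Whisper word
    timestamps, clamping each character's duration to [min_dur, max_dur]."""
    n, m = len(original_text), len(whisper_words)
    if n == 0 or m == 0:
        return []
    out = []
    for i, ch in enumerate(original_text):
        # proportional index into the Whisper timeline
        j = min(int(i * m / n), m - 1)
        w = whisper_words[j]
        dur = min(max(w["end"] - w["start"], min_dur), max_dur)
        out.append({"word": ch, "start": w["start"], "end": w["start"] + dur})
    return out
```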
#### captions.ts — subtitle lookup on the Remotion side

New `getCurrentSegment()` and `getCurrentWordIndex()` functions:

- Precisely find the subtitle segment and highlighted character index for the current frame time
- Handle inter-character gaps (between two characters, return the previous index so the highlight stays continuous)
- Past the last character's end time, return the last character (no end-of-line flicker)
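The gap-holding lookup can be sketched as follows (a Python analogue of the TypeScript helper; the name mirrors `getCurrentWordIndex`, but details are illustrative):

```python
def get_current_word_index(words, t):
    """Return the highlighted word index at time t.
    Inside a gap between words, keep the previous word highlighted;
    past the last word's end, keep the last word (no end flicker)."""
    if not words:
        return -1
    if t < words[0]["start"]:
        return 0
    for i, w in enumerate(words):
        if w["start"] <= t < w["end"]:
            return i
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt and w["end"] <= t < nxt["start"]:
            return i  # in a gap → hold the previous word
    return len(words) - 1  # past the end → hold the last word
```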
---

### 2. LatentSync lip-sync parameter tuning

| Parameter | Day28 | Day29 | Notes |
|-----------|-------|-------|-------|
| `LATENTSYNC_INFERENCE_STEPS` | 16 | 20 | More steps for better mouth shapes |
| `LATENTSYNC_GUIDANCE_SCALE` | (default) | 2.0 | Balances mouth fit and naturalness |
| `LATENTSYNC_ENABLE_DEEPCACHE` | (default) | true | DeepCache inference speedup |
| `LATENTSYNC_SEED` | (default) | 1247 | Fixed seed for reproducibility |
| Remotion concurrency | 16 | 4 | Lower concurrency to avoid resource contention |

---
### 3. compose() stream copy replaces a redundant re-encode (high priority)

**File**: `video_service.py`

- **Problem**: `compose()` only muxes the video and audio tracks, yet always ran a full `libx264 -preset medium -crf 20` re-encode taking minutes. Across the pipeline one video could be x264-encoded up to 5 times
- **Fix**: when no looping is needed (`loop_count == 1`), use `-c:v copy` stream copy — near-instant; keep libx264 when looping

```python
if loop_count > 1:
    cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "23"])
else:
    cmd.extend(["-c:v", "copy"])
```

- compose output is an intermediate product (Remotion encodes again), so stream copy saves one encode with zero quality loss
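An illustrative full-command builder under these rules; the exact flags in `video_service.py` may differ, and the `-stream_loop` looping mechanism here is an assumption:

```python
def build_compose_cmd(video, audio, out, loop_count=1):
    """Build an FFmpeg mux command: stream-copy the video track unless
    it has to be looped to cover the audio."""
    cmd = ["ffmpeg", "-y"]
    if loop_count > 1:
        cmd += ["-stream_loop", str(loop_count - 1)]  # repeat the input video
    cmd += ["-i", video, "-i", audio, "-map", "0:v", "-map", "1:a"]
    if loop_count > 1:
        cmd += ["-c:v", "libx264", "-preset", "fast", "-crf", "23"]
    else:
        cmd += ["-c:v", "copy"]  # pure mux: near-instant, lossless
    cmd += ["-c:a", "aac", "-shortest", out]
    return cmd
```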
---

### 4. FFmpeg timeout protection (high priority)

**File**: `video_service.py`

- `_run_ffmpeg()`: added `timeout=600` (10 minutes) and catches `subprocess.TimeoutExpired`
- `_get_duration()`: added `timeout=30`
- Prevents a malformed video from hanging FFmpeg forever and blocking background tasks
---

### 5. Global task concurrency limit (high priority)

**File**: `workflow.py`

- Module-level `asyncio.Semaphore(2)`, acquired at the top of `process_video_generation()`
- Queued tasks show a "排队中..." (queued) status
- Prevents multiple requests from running FFmpeg + Remotion at once and exhausting CPU/memory

```python
_generation_semaphore = asyncio.Semaphore(2)

async def process_video_generation(task_id, req, user_id):
    _update_task(task_id, message="排队中...")
    async with _generation_semaphore:
        await _process_video_generation_inner(task_id, req, user_id)
```

---
### 6. Redis task TTL + index cleanup (medium priority)

**File**: `task_store.py`

- `create()`: sets a 24-hour TTL (`ex=86400`)
- `update()`: completed/failed tasks get a 2-hour TTL (`ex=7200`); everything else 24 hours
- `list()`: removes expired index entries while iterating (`srem`)
- Fixes the unbounded accumulation of Redis task keys
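The TTL policy reduces to one small pure function; in `update()` it would back a redis-py call such as `r.set(key, payload, ex=ttl_for(status))` (the function name is illustrative):

```python
TTL_ACTIVE = 86400    # 24 h while a task is pending/running
TTL_TERMINAL = 7200   # 2 h once it is completed or failed

def ttl_for(status):
    """Choose the Redis expiry (seconds) for a task key by its status."""
    return TTL_TERMINAL if status in ("completed", "failed") else TTL_ACTIVE
```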
---

### 7. Temp font-file cleanup (medium priority)

**File**: `workflow.py`

- `prepare_style_for_remotion()` copied fonts into temp_dir but never added them to the cleanup list
- Now iterates three prefixes (subtitle/title/secondary_title) × four extensions (.ttf/.otf/.woff/.woff2) and adds any existing font file to `temp_files`

---

### 8. Whisper+split deduplication (low priority)

**File**: `workflow.py`

- The two branches (custom_assignments mismatch vs default) had 100% identical Whisper→_split_equal code (36 duplicated lines)
- Extracted into an internal `_whisper_and_split()` shared by both branches

---
### 9. LipSync dead-code removal (low priority)

**File**: `lipsync_service.py`

- Deleted the `_preprocess_video()` method (92 lines); nothing in the project called it

---
### 10. Title/subtitle preview background CORS fix

- **Problem**: after the frontend moved from `vigent.hbyrkj.top` to `ipagent.ai-labz.cn`, the material signed URLs (`api.hbyrkj.top`) share no root domain with the new frontend. The Supabase Kong gateway's CORS does not cover the new domain → `<video crossOrigin="anonymous">` fails to load → canvas capture fails → gradient fallback
- **Root cause**: the Day28 implementation relied on Supabase returning an `Access-Control-Allow-Origin` header; the domain change broke that assumption

**Fix — same-origin proxy (bypasses CORS entirely)**:

| Component | Change |
|-----------|--------|
| `materials/router.py` | New `GET /api/materials/stream/{material_id}` endpoint reading straight from local disk via `get_local_file_path()`, returning a `FileResponse` |
| `useHomeController.ts` | Frame-capture URL becomes `/api/materials/stream/${mat.id}` (same-origin); the cross-origin signed URL is no longer used |
| `useVideoFrameCapture.ts` | Removed `crossOrigin = "anonymous"` — unnecessary for same-origin requests |

Flow: `user opens preview → /api/materials/stream/xxx → Next.js rewrite → FastAPI FileResponse → same-origin <video> → canvas capture succeeds`

---
### 11. Alipay callback domain update

**File**: `.env`

```
ALIPAY_NOTIFY_URL=https://ipagent.ai-labz.cn/api/payment/notify
ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay
```

---
### 12. CosyVoice emotion control

- **Feature**: voice-clone mode gains an "emotion" dropdown (normal/cheerful/somber/stern), using CosyVoice3's `inference_instruct2()` to steer tone and emotion through natural-language instructions
- **Default behavior unchanged**: "normal" still uses `inference_zero_shot()`, identical to before the change

#### Data flow

```
user picks an emotion → setEmotion("happy") → persisted to localStorage
→ generate voiceover → emotion mapped to instruct_text
→ POST /api/generated-audios/generate { instruct_text }
→ voice_clone_service → POST localhost:8010/generate { instruct_text }
→ instruct_text non-empty ? inference_instruct2() : inference_zero_shot()
```

#### CosyVoice service — `cosyvoice_server.py`

- The `/generate` endpoint gains an `instruct_text: str = Form("")` parameter
- Inference branch: empty → `inference_zero_shot()`; non-empty → `inference_instruct2(text, instruct_text, ref_audio_path, ...)`
- `inference_instruct2` needs no `prompt_text`; it takes `instruct_text` + `prompt_wav` directly

#### Backend pass-through

- `schemas.py`: `GenerateAudioRequest` gains `instruct_text: Optional[str] = None`
- `service.py`: the voiceclone branch of `generate_audio_task()` passes `instruct_text=req.instruct_text or ""`
- `voice_clone_service.py`: `_generate_once()` and `generate_audio()` gain an `instruct_text` parameter

#### Frontend

- `useHomeController.ts`: new `emotion` state + `emotionToInstruct` mapping table
- `useHomePersistence.ts`: emotion choice persisted to localStorage
- `useGeneratedAudios.ts`: `generateAudio` params gain `instruct_text`
- `GeneratedAudiosPanel.tsx`: emotion dropdown (left of the speed button), reusing the speed-dropdown styling; visible only in voiceclone mode
- `HomePage.tsx`: passes `emotion`/`onEmotionChange` through

#### instruct_text format (from the CosyVoice3 instruct_list)

```
normal: ""  (uses inference_zero_shot)
cheerful: "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>"
somber: "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>"
stern: "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>"
```
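The dispatch boils down to one branch on whether `instruct_text` is empty. A sketch (the mapping keys `"normal"`/`"happy"`/... are assumed names, not necessarily the real `emotionToInstruct` keys):

```python
EMOTION_TO_INSTRUCT = {
    "normal": "",
    "happy": "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
    "sad": "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
    "angry": "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
}

def pick_inference(instruct_text):
    """Empty instruct_text keeps the pre-change zero-shot path."""
    return "inference_instruct2" if instruct_text else "inference_zero_shot"
```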
---

## 📁 Modified files

| File | Change |
|------|--------|
| `backend/app/services/whisper_service.py` | Timestamp smoothing + original-text rhythm mapping + per-character duration clamp |
| `remotion/src/utils/captions.ts` | New `getCurrentSegment` / `getCurrentWordIndex` |
| `backend/app/services/video_service.py` | compose stream copy + FFmpeg timeout protection |
| `backend/app/modules/videos/workflow.py` | Semaphore(2) concurrency limit + font cleanup + Whisper dedup |
| `backend/app/modules/videos/task_store.py` | Redis TTL + index expiry cleanup |
| `backend/app/services/lipsync_service.py` | Removed dead `_preprocess_video()` |
| `backend/app/services/remotion_service.py` | concurrency 16 → 4 |
| `remotion/render.ts` | New concurrency parameter support |
| `backend/app/modules/materials/router.py` | New `/stream/{material_id}` same-origin proxy endpoint |
| `frontend/.../useVideoFrameCapture.ts` | Removed crossOrigin |
| `frontend/.../useHomeController.ts` | Frame capture via same-origin proxy + emotion state + emotionToInstruct mapping |
| `backend/.env` | Lip-sync parameters + Alipay domain update |
| `models/CosyVoice/cosyvoice_server.py` | `/generate` gains `instruct_text`; branches to `inference_instruct2` / `inference_zero_shot` |
| `backend/app/services/voice_clone_service.py` | `_generate_once` / `generate_audio` pass `instruct_text` through |
| `backend/app/modules/generated_audios/schemas.py` | `GenerateAudioRequest` gains `instruct_text` |
| `backend/app/modules/generated_audios/service.py` | voiceclone branch passes `instruct_text` |
| `frontend/.../useGeneratedAudios.ts` | `generateAudio` params gain `instruct_text` |
| `frontend/.../useHomePersistence.ts` | emotion persisted (localStorage) |
| `frontend/.../GeneratedAudiosPanel.tsx` | Emotion dropdown UI (embedded + standalone) |
| `frontend/.../HomePage.tsx` | Passes emotion / onEmotionChange |

---
## 🔍 Verification

1. **Subtitle sync**: generate a video and watch the per-character highlight — no lead, lag, or gaps
2. **compose stream copy**: FFmpeg logs for the compose step show `-c:v copy`; time drops from minutes to seconds
3. **FFmpeg timeout**: confirm the timeout arguments are in the code
4. **Concurrency limit**: submit 3 tasks back to back; the 3rd shows "排队中" and starts only after the first two finish
5. **Redis TTL**: `redis-cli TTL vigent:tasks:<id>` shows an expiry
6. **Font cleanup**: no font files left in the temp directory after generation
7. **Preview background**: pick a material → "preview style" shows the first video frame (not the gradient)
8. **Alipay**: payments call back and redirect to the new domain
9. **Emotion control**: generate voice-clone audio with "happy"/"angry"; CosyVoice logs show `🎭 Instruct mode` and the tone audibly changes
10. **Emotion default**: "normal" behaves exactly as before (`inference_zero_shot`)
11. **Emotion persistence**: switch emotion, refresh — the dropdown restores the last choice
12. **Emotion visibility**: the dropdown shows only in voiceclone mode, not in edgetts mode
**New file**: `Docs/DevLogs/Day30.md` (405 lines)
## Remotion cache fix + encoding-pipeline quality optimization + lip-sync fault tolerance + unified dropdown interaction (Day 30)

### Overview

This round lands five areas: (1) a serious bug where Remotion bundle caching dropped titles and subtitles; (2) a full audit of the LatentSync + MuseTalk dual-engine encoding pipeline, removing redundant lossy encodes; (3) LatentSync robustness — inference continues instead of aborting when some frames have no detectable face; (4) lip-sync model selection passed through end to end (default/fast/advanced); (5) home and publish pages unified on the SelectPopover interaction, fixing occlusion, positioning, and preview z-index issues.

---
## ✅ Changes

### 1. Remotion bundle cache 404 fix (serious bug)

- **Problem**: generated videos had no titles or subtitles; Remotion rendering failed and silently fell back to FFmpeg (which cannot overlay text)
- **Root cause**: Remotion's bundle cache copies `publicDir` (where videos and fonts live) only on the first bundle. Once the code stabilized, the cache kept hitting, and newly generated videos and fonts were absent from the old cache's `public/` → Remotion's HTTP server returned 404 → render failed
- **First attempt**: `fs.symlinkSync` symlinks, but Remotion's internal HTTP server does not follow symlinks
- **Final fix**: `fs.linkSync` hard links (zero-copy on the same filesystem, fully transparent to the app), falling back to `fs.copyFileSync` across filesystems

**File**: `remotion/render.ts`

```typescript
function ensureInCachedPublic(cachedPublicDir, srcAbsPath, fileName) {
  // skip if the target already exists as the same inode
  // prefer a hard link (zero-copy); fall back to a copy across filesystems
  try {
    fs.linkSync(srcAbsPath, cachedPath);
  } catch {
    fs.copyFileSync(srcAbsPath, cachedPath);
  }
}
```

When the cached bundle is used, the files the current render needs (video + fonts) are hard-linked into the cached `public/`:
- the video file (`videoFileName`)
- font files (extracted from the `font_file` fields of `subtitleStyle` / `titleStyle` / `secondaryTitleStyle`)
---

### 2. Video encoding pipeline quality optimization

A full audit found that from material upload to final output a video could go through **5-6 lossy re-encodes**, versus 1-2 in the official LatentSync demo.

#### Encoding chain before

| # | Stage | CRF | Issue |
|---|-------|-----|-------|
| 1 | Orientation normalization | 23 | Conditional |
| 2 | `prepare_segment` scale + trim | 23 | Always runs; quality too low |
| 3 | LatentSync `read_video` FPS conversion | 18 | **Re-encodes even when already 25fps** |
| 4 | LatentSync `imageio` frame write | 13 | Model output |
| 5 | LatentSync final mux | 18 | **CRF 13 output immediately re-encoded at CRF 18** |
| 6 | compose | copy | Optimized on Day29 |
| 7 | Multi-material concat | 23 | **Segments already share parameters; no re-encode needed** |
| 8 | Remotion render | ~18 | Always runs (text overlay) |
#### Optimizations

##### 2a. LatentSync `read_video` skips the redundant FPS re-encode

**File**: `models/LatentSync/latentsync/utils/util.py`

- The old code unconditionally ran `ffmpeg -r 25 -crf 18`, even when the input was already 25fps
- New FPS check: when `abs(current_fps - 25.0) < 0.5`, use the original file directly
- Our `prepare_segment` already outputs 25fps uniformly, so this step was entirely redundant

```python
cap = cv2.VideoCapture(video_path)
current_fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

if abs(current_fps - 25.0) < 0.5:
    print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion")
    target_video_path = video_path
else:
    # re-encode only when the input is not 25fps
    command = f"ffmpeg ... -r 25 -crf 18 ..."
```
##### 2b. LatentSync final mux stream-copies instead of re-encoding

**File**: `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py`

- Old code: `imageio` writes frames at the high-quality CRF 13, then the final mux re-encoded everything with `libx264 -crf 18`
- Fix: switch to `-c:v copy`, muxing only the audio track — zero video loss

```diff
- ffmpeg ... -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0
+ ffmpeg ... -c:v copy -c:a aac -q:a 0
```
##### 2c. `prepare_segment` + `normalize_orientation` CRF 23 → 18

**File**: `backend/app/services/video_service.py`

- `normalize_orientation`: CRF 23 → 18
- `prepare_segment` trim temp file: CRF 23 → 18
- `prepare_segment` main command: CRF 23 → 18
- CRF 18 is the "high quality" tier, matching LatentSync's internal standard
##### 2d. Multi-material concat stream copy

**File**: `backend/app/services/video_service.py`

- Old code re-encoded the concatenation with `libx264 -crf 23`
- All segments are already unified by `prepare_segment` to the same resolution/framerate/encode parameters
- Switched to `-c:v copy`, eliminating one full re-encode

```diff
- -vsync cfr -r 25 -c:v libx264 -preset fast -crf 23 -pix_fmt yuv420p
+ -c:v copy
```
#### Encoding chain after

| # | Stage | CRF | Status |
|---|-------|-----|--------|
| 1 | Orientation normalization | **18** | Higher quality (conditional) |
| 2 | `prepare_segment` | **18** | Higher quality (always runs) |
| 3 | ~~LatentSync FPS conversion~~ | - | **Eliminated** |
| 4 | LatentSync model output | 13 | Unchanged (unavoidable) |
| 5 | ~~LatentSync final mux~~ | - | **Eliminated (copy)** |
| 6 | compose | copy | Unchanged |
| 7 | ~~Multi-material concat~~ | - | **Eliminated (copy)** |
| 8 | Remotion render | ~18 | Unchanged (unavoidable) |

**Total: 5-6 lossy encodes → 3** (prepare_segment → LatentSync model output → Remotion); quality loss roughly halved.
---
### 3. LatentSync no-face-frame tolerance

- **Problem**: if some frames have no detectable face (head turns, occlusion, empty shots), `affine_transform` throws and the whole inference task fails
- **Changes**:
  - `affine_transform_video`: catches per-frame failures and fills in the face/box/affine_matrix from the nearest valid frame (keeping tensor batch dimensions intact); still raises when no frame has a face at all
  - `restore_video`: new `valid_face_flags` parameter; faceless frames keep the original picture (no mouth replacement)
  - `loop_video`: `valid_face_flags` follows looping and flipping
  - `workflow.py`: on the single-material path, if `lipsync.generate()` fails outright, the original video is copied and the flow continues — the task no longer fails

---
### 4. MuseTalk encoding chain optimization

#### 4a. FFmpeg rawvideo pipe direct encode (no lossy intermediate file)

**File**: `models/MuseTalk/scripts/server.py`

- **Old flow**: UNet inference frames → `cv2.VideoWriter(mp4v)` intermediate file (lossy) → FFmpeg re-encode + audio mux (lossy again)
- **New flow**: UNet inference frames → FFmpeg rawvideo stdin pipe → one libx264 encode + audio mux

```python
ffmpeg_cmd = [
    "ffmpeg", "-y", "-v", "warning",
    "-f", "rawvideo", "-pix_fmt", "bgr24",
    "-s", f"{w}x{h}", "-r", str(fps),
    "-i", "-",  # stdin pipe input
    "-i", audio_path,
    "-c:v", "libx264", "-preset", ENCODE_PRESET, "-crf", str(ENCODE_CRF),
    "-pix_fmt", "yuv420p",
    "-c:a", "copy", "-shortest",
    output_vid_path,
]
ffmpeg_proc = subprocess.Popen(ffmpeg_cmd, stdin=subprocess.PIPE, ...)
# each frame is written directly: pipe_in.write(frame.tobytes())
```

Key implementation details:
- `-pix_fmt bgr24` matches OpenCV's native frame format — zero conversion overhead
- `np.ascontiguousarray` guarantees contiguous frame memory
- `BrokenPipeError` handling plus return-code checks cover the failure paths
- `pipe_in.close()` runs before `ffmpeg_proc.wait()`, sending EOF correctly
- All synthesis fallbacks (resize failure, mask failure, blending failure) emit the original frame via `_write_pipe_frame`
#### 4b. MuseTalk parameters via environment variables + quality-first profile

**Files**: `models/MuseTalk/scripts/server.py` + `backend/.env`

All inference and encoding parameters moved from hardcoded values to `.env`; the current profile is "quality first":

| Parameter | Old default | Quality-first | Purpose |
|-----------|-------------|---------------|---------|
| `MUSETALK_DETECT_EVERY` | 5 | **2** | 2.5x face-detection frequency; steadier tracking |
| `MUSETALK_BLEND_CACHE_EVERY` | 5 | **2** | More frequent mask updates; cleaner facial-edge blending |
| `MUSETALK_EXTRA_MARGIN` | 15 | **14** | Chin-region tweak |
| `MUSETALK_BLEND_MODE` | auto | **jaw** | Explicit jaw mode for v1.5 |
| `MUSETALK_ENCODE_CRF` | 18 | **14** | Near visually lossless (the output is re-encoded by Remotion anyway) |
| `MUSETALK_ENCODE_PRESET` | medium | **slow** | Better compression at the same CRF |
| `MUSETALK_AUDIO_PADDING` | 2/2 | 2/2 | Unchanged |
| `MUSETALK_FACEPARSING_CHEEK` | 90/90 | 90/90 | Unchanged |

Full list of newly configurable parameters: `DETECT_EVERY`, `BLEND_CACHE_EVERY`, `AUDIO_PADDING_LEFT/RIGHT`, `EXTRA_MARGIN`, `DELAY_FRAME`, `BLEND_MODE`, `FACEPARSING_LEFT/RIGHT_CHEEK_WIDTH`, `ENCODE_CRF`, `ENCODE_PRESET`.

---
### 5. Workflow async de-blocking + compose skipping

#### 5a. Blocking calls moved to the thread pool

**File**: `backend/app/modules/videos/workflow.py`

Several synchronous FFmpeg calls in the workflow blocked the asyncio event loop, so other API requests (health checks, task-status queries) stopped responding. A new helper `_run_blocking()` routes all blocking calls through the thread pool:

```python
async def _run_blocking(func, *args):
    """Run a blocking function in the thread pool so it cannot stall the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)
```

Blocking call sites converted:

| Call | Location | Notes |
|------|----------|-------|
| `video.normalize_orientation()` | Single-material rotation normalization | FFmpeg rotate/transcode |
| `video.prepare_segment()` | Multi-material segment prep | FFmpeg scale + trim, parallelized with `asyncio.gather` |
| `video.concat_videos()` | Multi-material concatenation | FFmpeg concat |
| `video.prepare_segment()` | Single-material prep | FFmpeg scale + trim |
| `video.mix_audio()` | BGM mixing | FFmpeg audio mix |
| `video._get_duration()` | Audio/video duration probing (3 sites) | ffprobe subprocess |
#### 5b. `prepare_segment` skips scaling at matching resolutions

**File**: `backend/app/modules/videos/workflow.py`

Previously `target_resolution` was always passed to `prepare_segment`, triggering the scale filter plus a libx264 re-encode even when the material already matched the target. Now resolutions are compared per material:

- **Multi-material**: per-segment check; matching segments pass `None` (`prepare_target_res = None if res == base_res else base_res`) and take the `-c:v copy` branch
- **Single-material**: `get_resolution` first; pass `None` on a match

When the resolution matches and there is no trimming, looping, or framerate change, `prepare_segment` internally uses `-c:v copy` — fully lossless.
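The decision logic reduces to two small pure functions (a sketch; the real workflow code inlines these checks):

```python
def prepare_target_res(material_res, base_res):
    """Pass None when the material already matches the target, so
    prepare_segment can take its stream-copy branch."""
    return None if material_res == base_res else base_res

def can_stream_copy(material_res, base_res, needs_trim=False,
                    needs_loop=False, needs_fps_change=False):
    """True when prepare_segment can use -c:v copy (fully lossless)."""
    return (prepare_target_res(material_res, base_res) is None
            and not (needs_trim or needs_loop or needs_fps_change))
```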
#### 5c. `_get_duration()` on the thread pool

**File**: `backend/app/modules/videos/workflow.py`

Three synchronous `video._get_duration()` ffprobe calls became `await _run_blocking(video._get_duration, ...)` to keep the event loop free.

#### 5d. compose loop-path CRF unified

**File**: `backend/app/services/video_service.py`

When `compose()` must loop the video, its encode moved from CRF 23 to CRF 18, matching the pipeline-wide quality standard.

#### 5e. Multi-material segment validation

**File**: `backend/app/modules/videos/workflow.py`

After multi-material `prepare_segment` completes, a segment-count consistency check prevents empty segments from reaching concat and failing.

#### 5f. compose() internal de-blocking

**File**: `backend/app/services/video_service.py`

`compose()` became `async def`; its internal `_get_duration()` and `_run_ffmpeg()` calls run in the thread pool via `loop.run_in_executor`.

#### 5g. Pass-through when no second compose is needed

**File**: `backend/app/modules/videos/workflow.py`

Without BGM (`final_audio_path == audio_path`), the LatentSync/MuseTalk output already carries the correct audio track, so the redundant compose step is skipped:

```python
needs_audio_compose = str(final_audio_path) != str(audio_path)
```

- **Remotion path**: if the audio is unchanged, skip the pre-compose and feed the lipsync output straight to Remotion
- **Non-Remotion path**: if the audio is unchanged, `shutil.copy` passes the lipsync output through without compose

---
### 6. Lip-sync model selection end to end

A model selector was added to the right of the "Generate Video" button; the selected value travels all the way to the backend router and inference service.

#### Model options

| Option | Value | Routing logic |
|--------|-------|---------------|
| Default model | `default` | Keeps threshold routing (`LIPSYNC_DURATION_THRESHOLD`, currently suggested 100s) |
| Fast model | `fast` | Forces MuseTalk; falls back to LatentSync if unavailable |
| Advanced model | `advanced` | Forces LatentSync |

#### Final UI

- The model button upgraded from a native `<select>` to the unified `SelectPopover`
- Trigger copy uses business wording ("Default / Fast / Advanced model" + "smart routing by duration / speed first / quality first")
- The selection persists via `useHomePersistence` (`lipsyncModelMode`)

#### Data flow

```
frontend SelectPopover → setLipsyncModelMode("fast") → localStorage persistence
    ↓
user clicks "Generate Video" → handleGenerate()
    → payload.lipsync_model = lipsyncModelMode
    → POST /api/videos/generate { ..., lipsync_model: "fast" }
    → workflow: req.lipsync_model forwarded to lipsync.generate(model_mode=...)
    → lipsync_service.generate(): routes on model_mode
        → fast: force MuseTalk → fall back to LatentSync
        → advanced: force LatentSync
        → default: threshold policy
```
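The three-way routing branch can be sketched as one function (the name and signature are illustrative, not the actual `lipsync_service.generate()` internals):

```python
def route_lipsync_model(mode, duration_s, threshold_s=100.0,
                        musetalk_available=True):
    """fast forces MuseTalk (falling back to LatentSync if unavailable),
    advanced forces LatentSync, default keeps the duration threshold
    (>= threshold → MuseTalk, per LIPSYNC_DURATION_THRESHOLD)."""
    if mode == "fast":
        return "musetalk" if musetalk_available else "latentsync"
    if mode == "advanced":
        return "latentsync"
    # default: threshold routing — long audio goes to MuseTalk
    return "musetalk" if duration_s >= threshold_s else "latentsync"
```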
---

### 7. Unified dropdown interaction on home/publish pages (SelectPopover)

#### 7a. Scope

Business selectors on the home and publish pages migrated to `SelectPopover`:

- Home: voice, reference audio, voiceover list, material selection, BGM selection, work selection, title display mode, title/secondary-title/subtitle styles, timeline aspect ratio, lip-sync model
- Publish: publish-work selection (search + preview)

Exception: by product decision, `ScriptEditor`'s "Script History / AI Multilingual" kept their original lightweight menus.

#### 7b. Key interaction fixes

- **Occlusion fix**: the desktop panel uses `Portal + fixed`, escaping the local stacking context — no more occlusion by cards
- **Flip up/down**: the menu flips upward automatically when space below is short, so it is never clipped
- **Matched width**: the panel matches the trigger's width
- **Unified look**: solid (high-opacity) panel background; scrollbar hidden but scrollable
- **Selected positioning**: reopening a dropdown auto-scrolls to the selected item (`data-popover-selected="true"`)
- **Preview coordination**:
  - Clicking "preview" inside a dropdown does not force-close it, enabling consecutive previews
  - The video preview modal sits above the dropdown, never occluded
  - While the preview modal is open, outside clicks/Esc do not close the dropdown by mistake; it remains usable after the preview closes

#### 7c. BGM panel consolidation

- BGM now uses the same selector as "publish work" (search + list + audition + selected state)
- The home-page BGM volume slider was removed per product decision
- Generation requests use a fixed `bgm_volume=0.2`

---
## 📁 Full modified-file list

| File | Change |
|------|--------|
| `remotion/render.ts` | Hard-links video + fonts into the cached bundle's public dir |
| `models/LatentSync/latentsync/utils/util.py` | `read_video` FPS check; skips re-encode at 25fps |
| `models/LatentSync/latentsync/pipelines/lipsync_pipeline.py` | Final mux `-c:v copy`; no-face-frame tolerance |
| `backend/app/services/video_service.py` | CRF 23→18; `concat_videos` copy; async `compose()` + loop CRF 18 |
| `backend/app/modules/videos/workflow.py` | Thread-pooling; same-resolution scale skip; compose skip; segment validation; model-selection pass-through |
| `backend/app/modules/videos/schemas.py` | New `lipsync_model` field |
| `backend/app/services/lipsync_service.py` | `generate()` gains a three-way `model_mode` routing branch |
| `models/MuseTalk/scripts/server.py` | FFmpeg rawvideo pipe; parameters via env vars |
| `backend/.env` | MuseTalk inference/blend/encode params configurable; routing threshold and quality-profile tuning |
| `frontend/src/shared/ui/SelectPopover.tsx` | New unified selector: Portal+fixed, anti-occlusion, up/down flip, matched width, hidden scrollbar, selected positioning, preview coordination |
| `frontend/src/features/home/ui/HomePage.tsx` | Voiceover-card z-index fix; passes unified dropdown state |
| `frontend/src/features/home/model/useHomeController.ts` | `lipsyncModelMode` pass-through; fixed `bgm_volume=0.2` |
| `frontend/src/features/home/model/useHomePersistence.ts` | Persists the model mode and other new fields |
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | Model selector moved to SelectPopover (speed/quality wording) |
| `frontend/src/features/home/ui/VoiceSelector.tsx` | Voice selection unified on SelectPopover (voice name + language) |
| `frontend/src/features/home/ui/RefAudioPanel.tsx` | Reference-audio selection unified on SelectPopover (audition/rename/delete/re-recognize) |
| `frontend/src/features/home/ui/GeneratedAudiosPanel.tsx` | Voiceover list, speed, and emotion unified on SelectPopover |
| `frontend/src/features/home/ui/MaterialSelector.tsx` | Material selection uses the publish-page dropdown (search/multi-select/preview/rename/delete) |
| `frontend/src/features/home/ui/BgmPanel.tsx` | BGM selection uses the publish-page dropdown (search + audition); volume slider removed |
| `frontend/src/features/home/ui/HistoryList.tsx` | Home work selection becomes a dropdown (search + delete + selected state) |
| `frontend/src/features/home/ui/TitleSubtitlePanel.tsx` | Title display mode and style selection unified on SelectPopover |
| `frontend/src/features/home/ui/TimelineEditor.tsx` | Aspect-ratio selection unified on SelectPopover (single-row buttons) |
| `frontend/src/features/publish/ui/PublishPage.tsx` | Publish-work selection moved to SelectPopover; dropdown stays open during preview |
| `frontend/src/components/VideoPreviewModal.tsx` | Raised z-index; added a preview marker coordinating with the dropdown |
| `frontend/src/features/home/ui/ScriptEditor.tsx` | Script History / AI Multilingual restored to lightweight menus (product exception) |
| `Docs/FRONTEND_DEV.md` | New SelectPopover conventions, preview z-index rules, persistence-field revisions |

---
## 🔍 Verification

1. **Titles/subtitles restored**: generated videos have titles and per-character highlighted subtitles (Remotion render succeeds; no FFmpeg fallback)
2. **Remotion logs**: show `Hardlinked into cached bundle:` or `Copied into cached bundle:` instead of 404
3. **LatentSync FPS skip**: logs show `Video already at 25.0fps, skipping FPS conversion`
4. **LatentSync mux**: the final mux in FFmpeg logs uses `-c:v copy`
5. **Quality comparison**: with the same material + audio, the mouth region (especially teeth) is visibly sharper than before
6. **Multi-material concat**: the concat step stream-copies; time drops from seconds to milliseconds
7. **No-face tolerance**: materials with head-turn/occluded frames no longer fail the task; faceless frames keep the original picture
8. **MuseTalk pipe encode**: logs show no intermediate mp4v file; synthesis writes straight into the pipe
9. **MuseTalk quality params**: `curl localhost:8011/health` confirms the service is up; mouth edges look cleaner
10. **Event loop unblocked**: during generation, `/api/tasks/{id}` and similar endpoints respond normally with no timeouts
11. **compose skip**: without BGM, logs show `Audio unchanged, skip pre-Remotion compose`
12. **Same-resolution scale skip**: when a material already matches the target resolution, `prepare_segment` uses `-c:v copy` (no scale filter in the logs)
13. **compose loop CRF**: loop-path encodes use CRF 18 (not 23)
14. **Model selection UI**: the default/fast/advanced dropdown appears to the right of the generate button
15. **Model persistence**: switching models and refreshing restores the last choice
16. **Fast-model routing**: choosing "fast" logs `强制快速模型:MuseTalk` on the backend
17. **Advanced-model routing**: choosing "advanced" logs `强制高级模型:LatentSync`
18. **Default unchanged**: "default" behaves exactly as before (threshold routing)
19. **Unified dropdown styling**: all business selectors on home/publish use the same SelectPopover (trigger + panel + selected state)
20. **Upward flip**: dropdowns opened near the page bottom flip upward; nothing is clipped
21. **Selected positioning**: reopening any dropdown lands on the selected item, not the top of the list
22. **Preview z-index**: the video preview modal always covers the dropdown
23. **Consecutive previews**: the menu stays open after an in-dropdown preview; other items remain clickable after closing the preview
24. **BGM behavior**: the home BGM panel shows no volume slider; requests use a fixed `bgm_volume=0.2`
Docs/DevLogs/Day31.md (new file)

@@ -0,0 +1,404 @@
## Doc layering consolidation + voice preview fix + recording modal rework + unified modal system (Day 31)

### Overview

Today's work focused on four things:

1. Cleaned up and consolidated the root docs (README/DEV responsibility boundaries, archiving historical content, aligning parameter descriptions with the code)
2. Finished one-click preview for the EdgeTTS voice list and fixed browser-side preview failures
3. Reworked the voice-clone recording interaction: the recording entry moved to the bottom right of the reference-audio area, and the flow became a modal
4. Extracted a unified modal base `AppModal` and migrated the main modals to the same visual and interaction spec

---

## ✅ 1) Doc system and content consistency

### 1.1 Clear README / DEV boundaries

- Added a "document scope" section to `FRONTEND_README.md`, `BACKEND_README.md`, `FRONTEND_DEV.md`, `BACKEND_DEV.md`
- READMEs keep only stable material (features, APIs, running); DEV docs keep conventions (constraints, layering, checklists)
- Cleaned log-flavored content in READMEs (e.g. Day markers) into stable wording

### 1.2 Deploy and parameter docs aligned with the current code

- Unified the lip-sync routing threshold wording as threshold-driven, using the current `.env` example `100` as the reference
- Fixed the stale encoding description (MuseTalk composition now documented as a rawvideo pipe + `libx264`)
- Removed references to a nonexistent `.env.example`; docs now describe `backend/.env` instead
- Marked the Qwen3-TTS doc as "archived (deprecated)" and pointed it to CosyVoice 3.0

---

## ✅ 2) Voice preview feature and incident fix

### 2.1 Implementation

- Voice dropdown items gained a preview button (play / pause / loading states)
- Added a backend preview endpoint: `/api/videos/voice-preview`
- The preview text is a fixed sample chosen by voice locale (9 languages + Chinese fallback)
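A minimal sketch of the locale routing described above. The names (`PREVIEW_TEXTS`, `preview_text_for_voice`) are illustrative, not the actual helpers in `backend/app/modules/videos/router.py`, and the sample texts here are placeholders:

```python
# Illustrative locale -> sample-text routing for voice previews.
PREVIEW_TEXTS = {
    "zh": "你好,这是一段试听示例。",
    "en": "Hello, this is a short voice preview.",
    "ja": "こんにちは、これは試聴サンプルです。",
    # ... one entry per supported language
}

def preview_text_for_voice(voice: str) -> str:
    """EdgeTTS voice names lead with a locale, e.g. 'en-US-AriaNeural'."""
    locale_prefix = voice.split("-", 1)[0].lower()
    # Fall back to the Chinese sample when the locale has no dedicated text.
    return PREVIEW_TEXTS.get(locale_prefix, PREVIEW_TEXTS["zh"])
```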
### 2.2 Compatibility and stability

- Kept `POST /api/videos/voice-preview` (for compatibility)
- Added `GET /api/videos/voice-preview?voice=...`; the frontend now plays the GET audio stream directly, reducing interference from browser autoplay policies

```python
@router.get("/voice-preview")
async def preview_voice_get(voice: str, current_user: dict = Depends(get_current_user)):
    voice_value = voice.strip()
    if not voice_value:
        raise HTTPException(status_code=400, detail="voice 不能为空")
    # Pick a locale-appropriate sample text, then render and stream the audio
    text = _get_preview_text_for_voice(voice_value)
    return await _render_voice_preview(voice=voice_value, text=text)
```
### 2.3 Conclusion on the production issue (fixed)

- Symptom: browser-side preview requests returned 404
- Root cause: after the GET route was added, the backend process was not restarted, so the running code was still the old version
- Fix: the route took effect after `pm2 restart vigent2-backend`
- Note: `curl` returning 401 (no auth cookie) is expected; same-origin browser requests carry the cookie automatically
---

## ✅ 3) Recording interaction rework (voice clone)

### 3.1 Entry relocation

- Removed the standalone recording block inside the reference-audio panel
- Moved the "upload audio / record" entries to the bottom right of the "My reference audio" area

### 3.2 Recording flow moved into a modal

- The recording modal supports: start recording / stop recording / status timer / playback
- Kept and strengthened "use this recording" and "discard this recording"
- Closing the modal while still recording stops the recording first
- Fixed the mount point: rendered via the `AppModal` Portal onto `document.body` instead of inside a local component, guaranteeing a full-page modal experience
- Updated the reference-audio button copy: `录音` -> `在线录音`

### 3.4 Unified button visuals in the script area

- Unified button size and radius in the script extraction/editing area (`px-3 py-1.5 text-xs rounded-lg`)
- Brought the `AI智能改写` and `保存文案` buttons up to the same visual tier as upload / online recording
- Unified icon sizes and disabled-state styles, removing the "bottom buttons look smaller" issue

### 3.5 Recording playback bar polish

- Replaced the native white `<audio controls>` shown after recording with a custom dark-themed playback bar
- The new bar includes a play/pause button, draggable progress, and current/total duration display
- Matched the page palette (dark background + green accent) to avoid clashing with the overall UI

### 3.6 Faster modal close on upload

- Old behavior: clicking "use this recording" kept the modal open until upload + transcription finished (felt laggy)
- New behavior: the modal closes immediately; upload/transcription continue in the background
- Status feedback stays in the reference-audio area (an uploading/transcribing hint + failure error messages)
---

## ✅ 5) Publish-page Douyin login "cannot fetch QR code" fix

### Diagnosis

- Symptom: clicking Douyin login in publish management, the frontend reports it cannot fetch the QR code
- Backend logs show the root cause:
  - `Page.goto: Timeout 30000ms exceeded`
  - Navigation target: `https://creator.douyin.com/`
  - Wait condition: `wait_until="networkidle"`

### Fix

- The Douyin login page now uses the steadier strategy already used for Weixin: `wait_until="domcontentloaded"`
- Added tolerance for Douyin navigation timeouts: even if `goto` times out, QR extraction continues (so long-lived connections no longer cause false failures)
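A sketch of the tolerant-navigation idea above, as a small wrapper around Playwright's `page.goto`. The wrapper name and logger name are illustrative, not the actual code in `qr_login_service.py`:

```python
import logging

logger = logging.getLogger("qr_login")

async def goto_with_timeout_tolerance(page, url: str, timeout_ms: int = 30_000) -> bool:
    """Navigate with domcontentloaded; swallow timeouts so QR extraction can proceed."""
    try:
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
        return True
    except Exception as exc:
        # Long-polling creator pages can keep the connection open past the
        # timeout; treat that as non-fatal and continue to QR extraction.
        logger.warning("goto timed out, continuing to QR extraction: %s", exc)
        return False
```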
### Verification

- Local smoke test: `POST /api/publish/login/douyin` returns `success=true` with a `qr_code`
- Backend restarted so the fix takes effect: `pm2 restart vigent2-backend`

### 3.3 State logic completion

- Added `discardRecording()`: clears the current recording and timer
- Old recordings are cleared before a new one starts, avoiding stale state
---

## ✅ 4) Unified modal UI/UX (AppModal)

Added a unified modal base: `frontend/src/shared/ui/AppModal.tsx`

- Unified overlay: `bg-black/80 + backdrop-blur-sm`
- Unified container: dark translucent background, `border-white/10`, `rounded-2xl`, heavy shadow
- Unified header: title / subtitle / close button
- Unified behavior: ESC to close, background scroll lock, opt-in overlay-click close
- Unified mounting: rendered via Portal onto `document.body`, avoiding the "it only pops up inside the dubbing area" stacking problem
- Unified accessibility: `role="dialog"` + `aria-modal="true"`
- Unified focus management: auto-focus on open, focus restored to the opener element on close
- Unified scroll-lock counting: multiple modals can coexist without one closing modal restoring page scroll too early

Migrated modals:

- Video preview (`VideoPreviewModal`)
- Script extraction (`ScriptExtractionModal`)
- AI rewrite (`RewriteModal`)
- Clip settings (`ClipTrimmer`)
- Recording modal (inside `RefAudioPanel`)
- Change-password modal (`AccountSettingsDropdown`)
- Publish QR-login modal (inside `PublishPage`)
---

## ✅ 6) Weixin Channels login QR polish ("scannable but looks cropped")

### Symptom

- The Weixin Channels login QR scans fine, but visually its edges look incomplete/cut off

### Fix

- Stronger backend QR extraction (`qr_login_service.py`):
  - Prefer exporting the QR's raw PNG data (`canvas.toDataURL('image/png')` / `img[data:image/png]`), avoiding edge loss from a second screenshot
  - The Weixin fallback screenshot now crops by the QR bbox expanded with padding, avoiding the edge-clipped look of a tight crop
  - Only PNG data URLs are accepted, so non-PNG payloads (e.g. SVG fragments) are never returned as a QR with broken corners
- Frontend QR modal display polish (`PublishPage.tsx`):
  - Removed the rounded-corner clipping on the QR image itself; switched to an outer white container with padding (simulating a quiet zone)
  - Adjusted the QR display width and border for a consistent, complete look
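The "PNG data URL only" rule above can be enforced with a small validator that checks both the data-URL prefix and the PNG magic bytes. This is an illustrative sketch, not the actual check in `qr_login_service.py`:

```python
import base64

# First 8 bytes of every valid PNG file
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def is_png_data_url(data_url: str) -> bool:
    """Accept only real PNG data URLs; reject SVG fragments and other payloads."""
    prefix = "data:image/png;base64,"
    if not data_url.startswith(prefix):
        return False
    try:
        raw = base64.b64decode(data_url[len(prefix):], validate=True)
    except Exception:
        return False
    # The MIME label alone can lie; verify the decoded bytes are a PNG.
    return raw.startswith(PNG_MAGIC)
```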
### Verification

- Local smoke test: `POST /api/publish/login/weixin` returns `success=true` with a `qr_code`
- The decoded image is `1000x1000` and still scans correctly
- Frontend and backend restarted so the fix takes effect:
  - `pm2 restart vigent2-frontend`
  - `pm2 restart vigent2-backend`
---

## ✅ 7) Publish pipeline performance and log readability (dual-platform publishing)

### 7.1 Concurrent publish requests (frontend)

- Old behavior: the publish page iterated platforms with a serial `for...of await`, so total time was the sum of all platform times
- New behavior: limited-concurrency execution (concurrency = 2), so two platforms publish in parallel, significantly shortening the total wait
- The result list is still filled back in the user's selected platform order, avoiding order jitter from out-of-order concurrent returns
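The pattern above — bounded concurrency with order-preserving results — can be sketched as follows. The actual change lives in TypeScript (`usePublishController.ts`); this Python version only illustrates the technique, using the fact that `asyncio.gather` returns results in argument order regardless of completion order:

```python
import asyncio

async def publish_all(platforms, publish_one, limit: int = 2):
    """Run per-platform publishes with bounded concurrency; results keep input order."""
    sem = asyncio.Semaphore(limit)

    async def run(platform):
        async with sem:          # at most `limit` publishes in flight
            return await publish_one(platform)

    # gather preserves argument order, so results line up with `platforms`
    return await asyncio.gather(*(run(p) for p in platforms))
```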
### 7.2 Weixin upload log-noise reduction (backend)

- Old behavior: after `set_input_files`, if `input.files[0]` could not be read immediately, a warning was logged right away: `[weixin][file_input] empty`
- New behavior: poll first to confirm whether uploading has actually started, then decide whether to warn; non-final retries log at info level, and only the last retry warns
- Effect: fewer false alarms (no warning spam when the upload has actually started) and cleaner troubleshooting logs
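The poll-then-warn policy above can be sketched like this. The function name and parameters are illustrative, not the actual helper in `weixin_uploader.py`:

```python
import logging
import time

logger = logging.getLogger("weixin_uploader")

def check_file_input(read_file_count, is_last_attempt: bool,
                     attempts: int = 10, interval: float = 0.3) -> bool:
    """Poll for the upload to start before deciding whether to warn.

    read_file_count: callable returning how many files the input holds.
    """
    for _ in range(attempts):
        if read_file_count() > 0:
            return True          # upload started; nothing to report
        time.sleep(interval)
    # Downgrade to info on non-final retries; warn only on the last one.
    log = logger.warning if is_last_attempt else logger.info
    log("[weixin][file_input] empty after polling")
    return False
```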
### Verification

- `python -m py_compile backend/app/services/uploader/weixin_uploader.py` ✅
- `npm run build` (frontend) ✅
- Services restarted: `pm2 restart vigent2-frontend && pm2 restart vigent2-backend` ✅
---

## ✅ 8) Xiaohongshu publish-chain alignment (launch mode / cookie format / success screenshot)

### 8.1 Launch mode and anti-detection parameters aligned

- Added Xiaohongshu Playwright settings to `config.py`:
  - `XIAOHONGSHU_HEADLESS_MODE` (default `headless-new`)
  - `XIAOHONGSHU_USER_AGENT / LOCALE / TIMEZONE_ID`
  - `XIAOHONGSHU_CHROME_PATH / BROWSER_CHANNEL`
  - `XIAOHONGSHU_FORCE_SWIFTSHADER / DEBUG_ARTIFACTS`
- `xiaohongshu_uploader.py` now uses the same configurable launch strategy as Douyin/Weixin, keeping the basic anti-detection flags (`--disable-blink-features=AutomationControlled`)

### 8.2 Xiaohongshu uploader rework

- Rewrote the uploader main flow (following the Douyin/Weixin pattern):
  - Multi-selector fallback for the upload entry / file input
  - Polling to classify uploading / success / failure states
  - Tolerant title and body/topic filling
  - Multi-selector publish button with clickability checks
- Success detection upgraded from "URL only" to a combination of signals:
  - URL redirect check
  - Page success/failure copy check
  - Publish API response listening (`publish` / note-create style endpoints)
- After a successful publish, a screenshot is taken and `screenshot_url` is returned (same path format as Douyin/Weixin):
  - `/api/publish/screenshot/(unknown)`
### 8.3 Unified cookie storage format

- `publish_service.save_cookie_string()` changes:
  - `bilibili` keeps the original simplified cookie dict (compatible with the existing upload library)
  - All non-`bilibili` platforms are now saved as a Playwright `storage_state`:
    - `{"cookies": [...], "origins": []}`
  - Default domains are filled in per platform (Douyin/Weixin/Xiaohongshu), so the cookie file can be used directly with `browser.new_context(storage_state=...)`
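A minimal sketch of the cookie-string to `storage_state` conversion above. The function name and the default-domain mapping are assumptions for illustration — check `publish_service.py` for the real mapping:

```python
# Assumed per-platform default domains; verify against publish_service.py.
DEFAULT_DOMAINS = {
    "douyin": ".douyin.com",
    "weixin": ".weixin.qq.com",
    "xiaohongshu": ".xiaohongshu.com",
}

def cookies_to_storage_state(platform: str, cookie_string: str) -> dict:
    """Convert 'k=v; k2=v2' into a Playwright storage_state payload."""
    domain = DEFAULT_DOMAINS.get(platform, "")
    cookies = []
    for part in cookie_string.split(";"):
        if "=" not in part:
            continue
        name, value = part.strip().split("=", 1)
        cookies.append({"name": name, "value": value, "domain": domain, "path": "/"})
    # This dict can be passed to browser.new_context(storage_state=...)
    return {"cookies": cookies, "origins": []}
```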
### 8.4 Verification and rollout

- `python -m py_compile backend/app/core/config.py backend/app/services/publish_service.py backend/app/services/uploader/xiaohongshu_uploader.py` ✅
- `pm2 restart vigent2-backend` ✅
---

## ✅ 9) Xiaohongshu login QR fix (SMS login is the default and needs a switch)

### Symptom

- The Xiaohongshu creator platform `https://creator.xiaohongshu.com/` lands on the "SMS login" view by default
- The QR code only appears after clicking the switch icon in the top-right corner, so grabbing it directly by QR selectors failed

### Fix (`qr_login_service.py`)

- Added `_ensure_xiaohongshu_qr_mode()`:
  - Detect SMS-login mode first (`input[placeholder*='手机号']`)
  - Auto-click the switch icon in the login card's top-right corner (stable selectors first, geometric position as fallback)
  - Wait for the QR to render after switching, then run extraction
- Expanded the Xiaohongshu QR selector set:
  - Added QR image selectors inside the login card (matching the current page structure)
  - Kept the generic `img[src*='qr'/'qrcode']` fallback
- Raised the candidate filter threshold for Xiaohongshu (`min_side=120`), avoiding the small top-right switch icon being picked by mistake
- Added Xiaohongshu keywords to the text strategy (e.g. `APP扫一扫登录`)
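The `min_side=120` candidate filter above amounts to dropping any image whose shorter side is below the threshold. An illustrative sketch (the function name is hypothetical; boxes follow the shape returned by Playwright's `bounding_box()`):

```python
def filter_qr_candidates(boxes, min_side: int = 120):
    """Drop small images (e.g. the login-mode switch icon) from QR candidates.

    boxes: iterable of dicts with 'width'/'height', as from element.bounding_box();
    None entries (detached elements) are skipped.
    """
    return [b for b in boxes if b and min(b["width"], b["height"]) >= min_side]
```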
### Verification

- Local smoke test: `POST /api/publish/login/xiaohongshu` returns `success=true` with a non-empty `qr_code`
- Backend logs confirm the fixed path is active:
  - `已点击登录方式切换,等待二维码渲染`
  - `策略1(CSS): 匹配成功`
---

## ✅ 10) Xiaohongshu upload-stage fix ("publish note - upload video" scenario)

### Symptom

- Xiaohongshu publishing failed at the "upload video" stage; the page stayed on the publish page and the frontend reported a failure
- Backend logs showed `set_input_files` succeeded, but no upload state was detected within a short window, causing repeated upload triggers and a false failure verdict
- Further digging: the uploaded file was actually a local Supabase object file (no extension); logs show an empty `file_input type=`, so the platform may fail to infer the video MIME type

### Fix (`xiaohongshu_uploader.py`)

- Added an upload-start probe window `UPLOAD_SIGNAL_TIMEOUT=12s`:
  - After `set_input_files` succeeds, the upload state gets time to appear
  - Any "uploading / processing / transcoding" signal moves the flow into the main upload polling
  - If no clear signal appears within the window, the flow no longer fails immediately; it moves into the main upload monitor and keeps waiting
- Corrected the failure keywords:
  - Removed `重新上传` from the failure keywords (on Xiaohongshu this copy usually marks a normal state/action entry and must not be treated as a failure)
- Added upload-file diagnostics:
  - Log the `file_input` selected filename / size / type, to confirm the file was really injected into the input
  - Log an explicit warning when a failure keyword is hit, for fast triage in production
- Added a fallback for extension-less video files:
  - If the original file has no extension and its parent directory name carries one (e.g. `xxx.mp4/<uuid>`), auto-create a suffixed temp file under `/tmp/vigent_uploads` (hardlink / symlink / copy fallback)
  - Upload the suffixed temp file instead, improving the site's MIME detection
  - Temp upload files are cleaned up automatically when the task ends
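A sketch of the extension-fallback idea, already in its hardened hardlink-then-copy form (the symlink path described above was later removed; see 10.1 below in the log for why). The function name mirrors but is not the actual `_prepare_upload_file()`:

```python
import os
import shutil
import tempfile
import uuid

def prepare_upload_file(src: str, suffix: str = ".mp4") -> str:
    """Give an extension-less video a suffixed temp name (hardlink, then copy)."""
    if os.path.splitext(src)[1]:
        return src  # already has a suffix; use as-is
    tmp_dir = os.path.join(tempfile.gettempdir(), "vigent_uploads")
    os.makedirs(tmp_dir, exist_ok=True)
    dst = os.path.join(tmp_dir, f"{uuid.uuid4().hex}{suffix}")
    try:
        os.link(src, dst)        # hardlink: free, but same-device only
    except OSError:
        shutil.copy2(src, dst)   # cross-device fallback; never symlink,
                                 # so the browser sees a stable .mp4 name
    return dst
```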
### 10.1 Second round of diagnosis and hardening (after the hang reproduced)

- Repro logs showed: even with a suffixed temp path, `file_input` still contained an extension-less filename, and the flow sat at `等待上传状态...` for a long time
- Root cause confirmed: in cross-device scenarios the previous code fell back to `symlink`, so the browser picked up the original target filename (no extension) and the site failed to recognize it
- Hardening:
  - Dropped the `symlink` fallback, keeping only `hardlink -> copy`, so the final upload filename reliably carries `.mp4`
  - Added a suffix-consistency check on the `file_input` filename: on mismatch, retry immediately, and return early on final failure (no more pointless long waits)
  - Added an idle-timeout guard (`UPLOAD_IDLE_TIMEOUT=90s`): with no valid upload signal for too long, fail early and keep a debug screenshot, so the frontend never "looks stuck"
  - Improved the failure copy to "未能触发有效视频上传,请确认发布页状态及视频文件格式"
### 10.2 Live publish verification (after the fix)

- Re-ran `POST /api/publish` (Xiaohongshu); the backend completed upload + publish end to end, and the API returned `200`
- This run took about `45.77s`, within the normal range for upload and publish waits
- The success screenshot is reachable: `GET /api/publish/screenshot/xiaohongshu_success_20260303_115944_633.png` returns `200`
- Key log chain: `正在上传` -> `已设置上传文件` -> `等待发布结果` -> `Cookie 更新完毕`

### Verification

- `python -m py_compile backend/app/services/uploader/xiaohongshu_uploader.py` ✅
- `pm2 restart vigent2-backend` ✅
- `curl http://127.0.0.1:8006/health` returns `{"status":"ok"}` ✅
---

## 📁 Main files changed today

| File | Change |
|------|------|
| `backend/app/modules/videos/router.py` | Added/extended `voice-preview` GET+POST, preview-text locale routing, temp-file cleanup |
| `backend/app/modules/videos/schemas.py` | Added `VoicePreviewRequest` |
| `frontend/src/features/home/ui/VoiceSelector.tsx` | Added preview buttons to the voice dropdown; switched to playing the GET audio stream |
| `frontend/src/features/home/model/useHomeController.ts` | Recording state reset, `discardRecording` |
| `frontend/src/features/home/ui/HomePage.tsx` | Passed the recording-discard action through |
| `frontend/src/features/home/ui/RefAudioPanel.tsx` | Rearranged upload/record entries; recording moved into a modal; use/discard flow |
| `frontend/src/features/home/ui/ScriptEditor.tsx` | Unified script-area button visuals (incl. AI智能改写 / 保存文案) |
| `frontend/src/features/home/ui/RefAudioPanel.tsx` | Replaced the native white audio bar with a custom dark player for recorded audio |
| `frontend/src/features/home/ui/RefAudioPanel.tsx` | Modal closes immediately after "use this recording"; upload/transcription run in the background (smoother interaction) |
| `frontend/src/features/publish/model/usePublishController.ts` | Publishing switched to limited concurrency (2), cutting multi-platform total time |
| `backend/app/core/config.py` | Added Xiaohongshu Playwright settings (headless/UA/locale/timezone/chrome/debug) |
| `backend/app/services/uploader/xiaohongshu_uploader.py` | Reworked on the Douyin/Weixin pattern; upload-start tolerance window, extension-less file fallback (hardlink/copy), suffix-consistency check, idle-timeout guard, upload diagnostics |
| `backend/app/services/publish_service.py` | `save_cookie_string` stores non-bilibili platforms as a Playwright `storage_state`; passes `user_id` to the Xiaohongshu uploader |
| `backend/app/services/qr_login_service.py` | Douyin navigation timeout tolerance + Weixin QR extraction upgrade + Xiaohongshu auto-switch to QR mode and extraction |
| `backend/app/services/uploader/weixin_uploader.py` | `file_input empty` warning policy: detect upload signals first; non-final retries downgraded to info |
| `frontend/src/shared/ui/AppModal.tsx` | Unified modal component + a11y semantics + focus management + multi-modal scroll-lock counting |
| `frontend/src/components/VideoPreviewModal.tsx` | Migrated to `AppModal` |
| `frontend/src/features/home/ui/ScriptExtractionModal.tsx` | Migrated to `AppModal` |
| `frontend/src/features/home/ui/RewriteModal.tsx` | Migrated to `AppModal` |
| `frontend/src/features/home/ui/ClipTrimmer.tsx` | Migrated to `AppModal` |
| `frontend/src/components/AccountSettingsDropdown.tsx` | Change-password modal migrated to `AppModal` |
| `frontend/src/features/publish/ui/PublishPage.tsx` | QR-login modal migrated to `AppModal` + white padded QR container (avoids the cropped-edge look) |
| `Docs/FRONTEND_DEV.md` | Added the unified modal spec (AppModal) and the recording interaction spec |
| `Docs/FRONTEND_README.md` | Added the recording entry and modal interaction notes |
| `Docs/BACKEND_README.md` | Added `voice-preview` API notes; updated publish API paths (`/login/{platform}` etc.) and linked the publish doc |
| `Docs/BACKEND_DEV.md` | Updated uploader coverage and Xiaohongshu config in the backend spec; added the publish-doc pointer |
| `Docs/PUBLISH_DEPLOY.md` | New multi-platform publish doc (login implementation, automated publish flow, deployment notes, troubleshooting) |
| `Docs/DEPLOY_MANUAL.md` | Deployment params and QR notes extended for Xiaohongshu; added the publish-doc entry |
| `README.md` | Doc hub gained a `PUBLISH_DEPLOY.md` entry; publish-result visualization notes now cover Xiaohongshu |
| `Docs/TASK_COMPLETE.md` | Added the Day31 summary; updated the Current tag and timestamp |
| `Docs/DOC_RULES.md` | Added the publish "triple check" (route truth / publish doc / entry write-back) and sensitive-info rules; updated tooling to `Read/Grep/apply_patch`; aligned the TASK_COMPLETE checklist |
| `Docs/SUBTITLE_DEPLOY.md` | Aligned with current threshold/parameter notes |
| `Docs/LATENTSYNC_DEPLOY.md` | Aligned with current threshold/parameter notes |
| `Docs/COSYVOICE3_DEPLOY.md` | TTS deployment notes aligned with the current runtime path |
| `Docs/QWEN3_TTS_DEPLOY.md` | Marked as archived, pointing to CosyVoice 3.0 |
---

## 🔍 Verification record

- `python -m py_compile backend/app/modules/videos/router.py backend/app/modules/videos/schemas.py` ✅
- `python -m py_compile backend/app/services/qr_login_service.py` ✅
- `python -m py_compile backend/app/services/uploader/weixin_uploader.py` ✅
- `python -m py_compile backend/app/core/config.py backend/app/services/publish_service.py backend/app/services/uploader/xiaohongshu_uploader.py` ✅
- `POST /api/publish/login/xiaohongshu` smoke test returns `success=true` + `qr_code` ✅
- `python -m py_compile backend/app/services/uploader/xiaohongshu_uploader.py` (after the upload-stage fix) ✅
- `pm2 restart vigent2-backend` (after the upload-stage fix) ✅
- `curl http://127.0.0.1:8006/health` returns `{"status":"ok"}` ✅
- Local probe via `backend/venv/bin/python` verified `_prepare_upload_file()`: the temp file is not a symlink, ends in `.mp4`, and cleanup succeeds ✅
- Live Xiaohongshu publish: `POST /api/publish` returned `200` (`Duration: 45.77s`) and the success-screenshot endpoint returned `200` ✅
- Added `Docs/PUBLISH_DEPLOY.md` (Douyin/Weixin/Bilibili/Xiaohongshu login and publish implementation notes) ✅
- `npm run build` (frontend) ✅
- `POST /api/publish/login/weixin` smoke test returns `success=true` + `qr_code` ✅
- Targeted `npx eslint` checks passed on:
  - `VoiceSelector.tsx`
  - `RefAudioPanel.tsx`
  - `HomePage.tsx`
  - `useHomeController.ts`
  - `AppModal.tsx`
  - `VideoPreviewModal.tsx`
  - `ScriptExtractionModal.tsx`
  - `RewriteModal.tsx`
  - `AccountSettingsDropdown.tsx`
  - `ClipTrimmer.tsx` still hits a pre-existing repo lint rule (`react-hooks/set-state-in-effect`), unrelated to this modal migration
- The production voice-preview issue is resolved after the backend restart (same-origin browser requests carry the cookie)
---

## ☑️ Day31 coverage check (added later today)

Double-checked against today's changes; the following are all recorded in this log:

- `AppModal` accessibility plus focus/scroll-lock robustness
- Backend extraction fix for the Weixin QR "looks incomplete" issue
- Publish-page QR display polish (white padding, no rounded clipping on the image itself)
- Xiaohongshu uploader alignment rework (launch params, publish detection, success screenshot)
- Second-round diagnosis and hardening of the Xiaohongshu "upload stuck" issue (filename suffix consistency + idle timeout), verified with a live successful publish
- New publish doc `Docs/PUBLISH_DEPLOY.md`, capturing login and automated publishing for all four platforms
- Write-backs to `Docs/BACKEND_README.md` / `Docs/BACKEND_DEV.md` / `Docs/DEPLOY_MANUAL.md`, unifying publish API and deployment wording
- Write-back to `README.md`, adding the publish-doc entry and the Xiaohongshu success-screenshot capability
- Write-back to `Docs/TASK_COMPLETE.md`, completing the Day31 record
- Write-back to `Docs/DOC_RULES.md`, syncing doc rules with the current structure and toolchain
- Non-bilibili cookies saved in `storage_state` format
- Xiaohongshu login QR auto-switch (SMS -> QR) and extraction fix
- The matching build/restart/smoke verification records
- Today's runtime artifacts (`backend/user_data/**/cookies/*.json`, `watchdog.log`) are session byproducts, not code/doc changes
@@ -6,13 +6,14 @@

## ⚡ Core principles

| Rule | Description |
|------|------|
| **Default updates** | Update `DayN.md` and `TASK_COMPLETE.md` |
| **On-demand updates** | Other docs change only when the content they cover changes |
| **Link alignment** | After adding/reworking a doc, write it back into an entry doc (`README.md` or the matching `*_README.md`) |
| **Smart edits** | Errors → replace; improvements → append (see detailed rules below) |
| **Read before write** | Check a file's current content before updating it |
| **Same-day merge** | Multiple small edits on one day merge into a final version |
---

@@ -20,17 +21,19 @@

> **On every significant commit, check whether these files need syncing:**

| Priority | File | Focus |
| :---: | :--- | :--- |
| 🔥 **High** | `Docs/DevLogs/DayN.md` | **(Latest log)** Record changes, fixes, and code snippets in detail |
| 🔥 **High** | `Docs/TASK_COMPLETE.md` | **(Task overview)** Update Day Current, `[x]` items, and the timestamp |
| ⚡ **Med** | `README.md` | **(Project homepage)** Features, tech stack, latest screenshots |
| ⚡ **Med** | `Docs/DEPLOY_MANUAL.md` | **(Deploy manual)** Env vars, dependencies, startup command changes |
| ⚡ **Med** | `Docs/PUBLISH_DEPLOY.md` | **(Publishing)** Four-platform login/publish implementation, troubleshooting, acceptance flow |
| ⚡ **Med** | `Docs/BACKEND_DEV.md` | **(Backend spec)** API contracts, module layout, env vars |
| ⚡ **Med** | `Docs/BACKEND_README.md` | **(Backend doc)** API notes, architecture |
| ⚡ **Med** | `Docs/FRONTEND_DEV.md` | **(Frontend spec)** API wrappers, date formatting, new-page conventions |
| ⚡ **Med** | `Docs/FRONTEND_README.md` | **(Frontend doc)** Feature notes, page changes |
| 🧊 **Low** | `Docs/DOC_RULES.md` | **(Rules doc)** Sync when the doc structure or process changes |
| 🧊 **Low** | `Docs/*_DEPLOY.md` | **(Subsystem deploy)** Standalone deploy docs for LatentSync/CosyVoice/subtitles etc. |
---

@@ -89,7 +92,7 @@

---

## 🔍 Pre-update checklist

> **Core principle**: search before appending, to avoid duplication and omissions

@@ -112,12 +115,20 @@

| **Has a pending-verification state** | Update the status marker |
| **Entirely new standalone content** | Append at the end |

**3. Content that must be updated**

- ✅ **Status markers**: `🔄 待验证` → `✅ 已修复` / `❌ 失败`
- ✅ **Progress percentages**: update to the latest value
- ✅ **Changed-file lists**: add newly modified files
- ❌ **Forbidden**: creating duplicate section headings

### Publish-change triple check (new)

For changes touching Douyin/Weixin/Bilibili/Xiaohongshu publishing or QR login, additionally run:

1. **Route truth check**: verify API paths against `backend/app/modules/publish/router.py`, so docs never carry stale paths (e.g. `/screenshots/`).
2. **Publish-doc alignment**: update the matching platform section in `Docs/PUBLISH_DEPLOY.md` (login, publish detection, troubleshooting).
3. **Entry write-back**: write back at least one entry doc (`README.md`, or `Docs/BACKEND_README.md` / `Docs/DEPLOY_MANUAL.md`).

### Example scenarios
@@ -138,23 +149,23 @@

---

## Tooling rules

> **Core principle**: use the right tools to avoid character-encoding issues

### ✅ Recommended tools: Read / Grep / apply_patch

**When to use**:
- `Read`: inspect the file's current content before updating
- `apply_patch`: precisely replace existing content or append new sections
- `Grep`: search whether a same-topic section already exists
- `Write`: create new files (e.g. Day{N+1}.md)

**Notes**:
```markdown
1. **Read before write**: confirm the content with Read before editing
2. **Exact matching**: the `apply_patch` context must match the file content
3. **Avoid duplication**: Grep for an existing same-topic section before editing
```
### ❌ Forbidden: editing docs with command-line tools

@@ -171,13 +182,14 @@

### 📝 Best practices

**Appending a section**: use `apply_patch`, anchored on stable context at the end of the file.

**Editing existing content**: use `apply_patch` for a precise replacement.
```markdown
@@
-**状态**:🔄 待修复
+**状态**:✅ 已修复
```
---

@@ -191,11 +203,12 @@ ViGent2/Docs/

├── BACKEND_DEV.md      # Backend dev spec
├── BACKEND_README.md   # Backend feature doc
├── FRONTEND_DEV.md     # Frontend dev spec
├── FRONTEND_README.md  # Frontend feature doc
├── DEPLOY_MANUAL.md    # Deploy manual
├── PUBLISH_DEPLOY.md   # Multi-platform publishing doc
├── SUPABASE_DEPLOY.md  # Supabase deploy doc
├── LATENTSYNC_DEPLOY.md # LatentSync deploy doc
├── COSYVOICE3_DEPLOY.md # Voice-clone deploy doc
├── ALIPAY_DEPLOY.md    # Alipay billing deploy doc
├── SUBTITLE_DEPLOY.md  # Subtitle-system deploy doc
└── DevLogs/
@@ -254,16 +267,21 @@ ViGent2/Docs/

---

## 📏 Content brevity rules

### Code-sample length control
- **Principle**: show only the key snippet (within 10-20 lines)
- **Overlong code**: use `// ... 省略 ...` or list only the filename + line numbers
- **Full code**: link to the file instead of pasting it all

### Debug-info handling
- **Temporary debugging**: delete after verification (e.g. debug logs, test screenshots)
- **Valuable info**: keep (e.g. error logs, performance data)

### Sensitive-info handling
- **Never persist**: cookie values, tokens, keys, full phone numbers, payment credentials.
- **Log references**: record only the necessary keywords and conclusions; avoid pasting large raw logs.
- **Path references**: prefer relative paths and filenames; leave unrelated personal directories out.

### Status-marker updates
- **🔄 待验证** → after verification, update to **✅ 已修复** or **❌ 失败**
@@ -280,29 +298,29 @@ ViGent2/Docs/

- **Format consistency**: append by following the existing format of `TASK_COMPLETE.md`.
- **Progress updates**: update percentages only at real milestones.

### 🔍 Completeness checklist (mandatory)

Every update to `TASK_COMPLETE.md` must check each of these sections:

1. **File header**
   - [ ] `更新时间`: must be today's date
   - [ ] `整体进度`: consistent with the current Day state (e.g. Day31)

2. **Current-day block**
   - [ ] Add/update the `Day N (Current)` heading
   - [ ] List key tasks as `[x]` (not just conclusions)
   - [ ] Remove the `(Current)` tag from the previous Day heading

3. **Roadmap and module status**
   - [ ] Move finished long-term items from pending to done promptly
   - [ ] Sync module completion percentages when they change

4. **Related doc links**
   - [ ] New core docs (e.g. `PUBLISH_DEPLOY.md`) must be traceable from the relevant spots
   - [ ] If DayN records a doc write-back, the day's entry in `TASK_COMPLETE.md` must reflect it

> **Mnemonic**: header date, today's Current, module status, traceable links.

---

**Last updated**: 2026-03-03
@@ -1,5 +1,11 @@

# Frontend development spec

## Document scope

- This document defines frontend development conventions and constraints only (structure, interaction, persistence, API usage, checklist).
- For feature notes and startup instructions, see `Docs/FRONTEND_README.md`.
- Record historical changes in `Docs/DevLogs/` and `Docs/TASK_COMPLETE.md`, not in this spec.

## Directory structure

A lightweight FSD (Feature-Sliced Design) structure:

@@ -62,6 +68,8 @@ frontend/src/

│ ├── hooks/
│ │ ├── useTitleInput.ts
│ │ └── usePublishPrefetch.ts
│ ├── ui/
│ │ └── SelectPopover.tsx # Unified dropdown / BottomSheet selector
│ ├── types/
│ │ ├── user.ts # User type definitions
│ │ └── publish.ts # Publish-related types
@@ -180,6 +188,41 @@ body {

---

## Unified dropdown selector spec (SelectPopover)

Business selectors on the home/publish pages (voice, reference audio, dubs, materials, BGM, works, styles, model, aspect ratio) all use `@/shared/ui/SelectPopover`:

- Popover on desktop, automatic BottomSheet on mobile
- Unified trigger and panel styling: `border-white/10 + bg-black/25` (or same-tier variants)
- Unified selected-item styling: `border-purple-500 bg-purple-500/20`
- Selected items must set `data-popover-selected="true"`, so reopening auto-scrolls to the selection
- Flips upward automatically when bottom space runs out; scrollbars hidden but scrolling preserved

### Video preview vs. dropdown z-order

- Dropdown menus must sit below the video preview modal, so previews are never covered
- Clicking "preview" inside a dropdown must not force-close the dropdown (enables continuous previewing)
- After closing a preview, the user keeps working inside the dropdown; clicking outside collapses it normally

### Exceptions

- `ScriptEditor`'s "history scripts / AI multilingual" keeps its original lightweight menu and is not forced onto `SelectPopover`

---

## Unified modal spec (AppModal)

All centered modals (video preview, script extraction, AI rewrite, recording, password change) use `@/shared/ui/AppModal` + `AppModalHeader`:

- Unified overlay and layering: `fixed inset-0` + `bg-black/80` + `backdrop-blur-sm` + an explicit `z-index`
- Unified mounting: Portal onto `document.body`, avoiding local-container stacking contexts and guaranteeing a full-page modal
- Unified container styling: `border-white/10`, dark translucent background, `rounded-2xl`, heavy shadow
- Unified close behavior: `ESC` supported; overlay-click close is configured explicitly via `closeOnOverlay`
- Unified scrolling: background scroll locked while open (`lockBodyScroll`); the content area scrolls itself
- Special layering (e.g. video preview above dropdowns) uses a higher `z-index` (such as `z-[320]`)

---
## API request conventions

### The `api` (axios instance) is mandatory

@@ -346,6 +389,7 @@ useEffect(() => {

- `shared/api`: Axios instance and unified response types
- `shared/lib`: generic utilities (media.ts / auth.ts / title.ts)
- `shared/hooks`: cross-feature hooks
- `shared/ui`: cross-feature UI (e.g. SelectPopover)
- `shared/types`: cross-feature entity types (User / PublishVideo etc.)
- `shared/contexts`: global contexts (AuthContext / TaskContext)
- `components/`: legacy shared components (VideoPreviewModal)
@@ -366,11 +410,14 @@ useEffect(() => {

- Title style ID / subtitle style ID
- Title font size / subtitle font size
- Title display mode (`short` / `persistent`)
- Lip-sync model mode (`default` / `fast` / `advanced`)
- BGM selection / toggle state (the frontend currently has no volume slider; generation uses a fixed volume)
- Output aspect ratio (`9:16` / `16:9`)
- Material selection / history work selection
- Selected dub ID (`selectedAudioId`)
- Selected reference-audio ID (the id behind `selectedRefAudio`)
- Speed (`speed`, voice-clone mode)
- Tone (`emotion`, voice-clone mode)
- Timeline segment info (`useTimelineEditor`'s localStorage)

### History scripts (independently persisted)

@@ -406,6 +453,7 @@ useEffect(() => {

- The publish button is disabled when no platform is selected
- Only "publish now" remains; no scheduled-publish UI/params
- **Work-selection persistence**: use `video.id` (a stable identifier) rather than `video.path` (a signed URL) for selection, comparison, and localStorage. At publish time, look up the matching `path` by `id` for the request.
- **New-work priority**: when a freshly generated video is detected, the first page restore prefers the newest video; after the user changes the selection manually, the persisted value is restored as usual.
---

@@ -457,6 +505,10 @@ await api.post('/api/videos/generate', {

Audio is recorded with the `MediaRecorder` API as `audio/webm`; after upload, the backend converts it to WAV (16kHz mono).

- The recording entry sits at the bottom right of the "My reference audio" area (next to "upload audio").
- Recording uses a modal flow: start/stop -> playback -> use this recording / discard this recording.
- Closing the recording modal while still recording stops the recording first.

```typescript
// Recording requires microphone permission from the user
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
```

@@ -472,5 +524,5 @@ const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });

### UI structure

Dubbing methods use tab switching:
- **EdgeTTS voices** - unified dropdown selection (shows "voice name + language")
- **Voice clone** - reference-audio selector (preview/rename/delete/re-transcribe) + bottom-right upload/record entries (recording modal) + speed/tone dropdowns
@@ -2,44 +2,54 @@

The frontend for ViGent2, built with Next.js 16 + TailwindCSS.

## 📌 Document scope

- This document covers frontend features, how to run it, and a directory overview (for usage and collaboration).
- For development conventions and implementation constraints, see `Docs/FRONTEND_DEV.md`.
- For historical changes and milestones, see `Docs/DevLogs/` and `Docs/TASK_COMPLETE.md`.

## ✨ Core features

### 1. Video generation (`/`)
- **I. Script extraction & editing**: script input / extraction / translation / saving.
- **II. Dubbing**: dubbing methods (EdgeTTS / voice clone) + the dub list (generate/preview/manage) merged into one panel.
- **III. Material editing**: video materials (upload/select/manage) + timeline editing (waveform/color blocks/drag sorting) merged into one panel.
- **IV. Title & subtitles**: intro title/subtitle/caption style configuration; brief vs. persistent display; style preview uses a real intro frame of the video as background.
- **V. Background music**: preview + searchable selection + persisted selection (no volume slider; generation uses a fixed mix coefficient).
- **VI. Works** (right column): the work list + work preview merged into one panel.
- **Progress tracking**: live generation progress (10% -> 100%).
- **Work preview**: play and download directly after generation (work preview + history works).
- **Preview optimization**: preview videos prefetch `metadata`, so first frames load faster.
- **Local saving**: scripts/titles/preferences persisted by `useHomePersistence`, restored after refresh.
- **History scripts**: manually save/load/delete history scripts, independently persisted in localStorage.
- **Selection persistence**: home/publish work selections both persist by stable `id`; a newly generated video is preferred first, then manual choices keep being restored from persistence.
- **Unified dropdown interaction**: home/publish business selectors all use SelectPopover (auto flip-up, selected positioning, mobile BottomSheet); `ScriptEditor`'s "history scripts / AI multilingual" is a product exception keeping its lightweight menu.
- **AI multilingual translation**: translate scripts into 9 target languages + restore the original.
### 2. Fully automated publishing (`/publish`)
- **Multi-platform management**: unified account status for Douyin, Weixin Channels, Bilibili, Xiaohongshu.
- **QR login**:
  - QR codes generated by the backend via Playwright.
  - Live scan-state detection (Wait/Success).
  - Cookies saved automatically with status sync.
- **Publish config**: video title, tags, description.
- **Work selection**: SelectPopover dropdown + search + preview modal (continuous previewing inside the dropdown, never force-collapsed).
- **Selection persistence**: persisted by the stable `video.id`, kept across refreshes; newly generated videos auto-selected.
- **Preview compatibility**: signed URLs and relative paths both preview directly.
- **Publish mode**: "publish now" only.
### 3. Voice clone
- **TTS mode selection**: EdgeTTS / voice-clone switch; voice selection uses the unified dropdown (voice name + language).
- **Voice preview**: the EdgeTTS voice list supports one-click preview; a fixed sample text in the matching language is chosen per voice locale.
- **Reference-audio management**: upload/list/rename/delete reference audio; auto Whisper transcription of ref_text after upload + auto trimming over 10s.
- **Recording entry**: "upload audio / record" entries at the bottom right of the reference-audio area; recording uses a modal flow (record -> playback -> use/discard).
- **Re-transcribe**: older reference audio can be re-transcribed and trimmed (RotateCw button).
- **One-click clone**: selecting a reference audio automatically calls the CosyVoice 3.0 service.
- **Speed control**: 5 speed steps (0.8-1.2) in voice-clone mode, unified dropdown, selection persisted.
- **Tone control**: 4 tones (normal/cheerful/deep/serious) in voice-clone mode, unified dropdown, selection persisted.
- **Multilingual support**: EdgeTTS voice lists for 10 languages; voice-clone `language` passed through.

### 4. Dub-first workflow + timeline arrangement
- **Standalone dub generation**: generate a dub first → select it → then pick materials → generate the video.
- **Dub management panel**: generate/preview/rename/delete/select, async generation + progress polling.
- **Timeline editor**: wavesurfer.js audio waveform + color-block material allocation, draggable dividers to adjust segment lengths.
@@ -49,21 +59,22 @@

- **Timeline semantics alignment**: when longer than the audio, keep only visible segments and trim the last one flush; overflow segments don't participate in generation. When shorter than the audio, the last visible segment loops to fill.
- **Aspect-ratio control**: the timeline header offers `9:16 / 16:9` output selection, persisted and passed to the backend.

### 5. Subtitles & titles
- **Intro title**: optional, 15-char limit; supports brief/persistent display, defaulting to brief (4 seconds), applied to title and subtitle together.
- **Intro subtitle**: optional, 20-char limit; shown under the main title for extra context or a teaser; independent styling (font/size/color/spacing), can be AI-generated alongside the title; shares the title's display-mode setting; appears only in the video frame, not in the publish title.
- **Title sync**: editing the intro title on the home page syncs into the publish-info title.
- **Word-level highlighted subtitles**: karaoke effect, on by default.
- **Auto alignment**: word-level timestamps generated by faster-whisper.
- **Style presets**: title/caption/subtitle style selection + preview + font-size adjustment.
- **Default styles**: title 90px 站酷快乐体; captions 60px classic yellow + DingTalkJinBuTi.
- **Style persistence**: title/caption/subtitle styles and sizes survive refresh.

### 6. Background music
- **Preview**: playable directly inside the dropdown list.
- **Selection UX**: the publish-page style search selector, auto-positioned on the current selection when opened.
- **Mix control**: no volume slider in the current frontend; generation uses a fixed `bgm_volume=0.2`, keeping the dub volume stable.

### 7. Account settings
- **Phone login**: 11-digit Chinese phone-number login.
- **Account dropdown**: phone number (middle four digits masked) + validity period + change password + sign out.
- **Change password**: modal for current and new password; forces re-login afterwards.
@@ -75,10 +86,10 @@ ViGent2 的前端界面,采用 Next.js 16 + TailwindCSS 构建。
|
||||
- **到期续费**: 会员到期后登录自动跳转付费页续费,流程与首次开通一致。
|
||||
- **管理员激活**: 管理员手动激活功能并存,两种方式互不影响。
|
||||
|
||||
### 8. 文案提取助手 (`ScriptExtractionModal`) [Day 15 新增]
|
||||
### 9. 文案提取助手 (`ScriptExtractionModal`)
|
||||
- **多源提取**: 支持文件拖拽上传与 URL 粘贴 (B站/抖音/TikTok)。
|
||||
- **AI 智能改写**: 集成 GLM-4.7-Flash,自动改写为口播文案。
|
||||
- **自定义提示词**: 可自定义改写提示词,留空使用默认;设置持久化到 localStorage (Day 25)。
|
||||
- **自定义提示词**: 可自定义改写提示词,留空使用默认;设置持久化到 localStorage。
|
||||
- **一键填入**: 提取结果直接填充至视频生成输入框。
|
||||
- **智能交互**: 实时进度展示,防误触设计。
|
||||
|
||||
@@ -91,7 +102,7 @@ ViGent2 的前端界面,采用 Next.js 16 + TailwindCSS 构建。
|
||||
- **音频波形**: wavesurfer.js (时间轴编辑器)
|
||||
- **API**: Axios 实例 `@/shared/api/axios` (对接后端 FastAPI :8006)
|
||||
|
||||
## 🚀 开发指南
|
||||
## 🚀 快速开始
|
||||
|
||||
### 安装依赖
|
||||
|
||||
@@ -139,11 +150,10 @@ src/
|
||||
- **URL 统一工具**: `@/shared/lib/media` 提供 `resolveMediaUrl` / `resolveAssetUrl`
|
||||
- **代理配置**: Next.js Rewrites (如需) 或直接 CORS。
|
||||
|
||||
## 🎨 设计规范
|
||||
## 🎨 UI 说明(概览)
|
||||
|
||||
- **主色调**: 深紫/黑色系 (Dark Mode)
|
||||
- **交互**: 悬停微动画 (Hover Effects);操作按钮默认半透明可见 (opacity-40),hover 时全亮,兼顾触屏设备
|
||||
- **响应式**: 适配桌面端与移动端;发布页平台卡片响应式布局(移动端紧凑/桌面端宽松)
|
||||
- **滚动体验**: 列表滚动条统一隐藏 (hide-scrollbar);刷新后自动回到顶部(禁用浏览器滚动恢复 + 列表 scroll 时间门控)
|
||||
- **样式预览**: 浮动预览窗口,桌面端左上角 280px,移动端右下角 160px(不遮挡控件)
|
||||
- **输入辅助**: 标题/副标题输入框实时字数计数器,超限变红
|
||||
- 业务选择器统一使用 `SelectPopover`(桌面 Popover / 移动端 BottomSheet);`ScriptEditor` 的“历史文案 / AI多语言”保留原轻量菜单。
|
||||
- 业务弹窗统一使用 `AppModal`(统一遮罩、头部、关闭行为与滚动策略)。
|
||||
- 视频预览弹窗层级高于下拉菜单;下拉内支持连续预览。
|
||||
- 页面同时适配桌面端与移动端;长列表统一隐藏滚动条。
|
||||
- 详细 UI 规范、持久化规范与交互约束请查看 `Docs/FRONTEND_DEV.md`。
|
||||
|
||||
@@ -137,11 +137,9 @@ CUDA_VISIBLE_DEVICES=1 python -m scripts.inference \
|
||||
└── DEPLOY.md
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
---
|
||||
|
||||
## 步骤 7: 性能优化 (预加载模型服务)
|
||||
---
|
||||
|
||||
## 步骤 6: 性能优化(预加载模型服务)
|
||||
|
||||
为了消除每次生成视频时 30-40秒 的模型加载时间,建议运行常驻服务。
|
||||
|
||||
@@ -201,6 +199,29 @@ LatentSync 1.6 需要 ~18GB VRAM。如果遇到 OOM 错误:
|
||||
- `inference_steps`: 增加到 30-50 可提高质量
|
||||
- `guidance_scale`: 增加可改善唇同步,但过高可能导致抖动
|
||||
|
||||
### 编码流水线优化(当前实现)
|
||||
|
||||
LatentSync 内部默认流程有两处冗余编码已优化:
|
||||
|
||||
1. **`read_video` FPS 转换**: 原代码无条件 `ffmpeg -r 25 -crf 18`,现已改为检测 FPS,25fps 时跳过(我们的 `prepare_segment` 已输出 25fps)
|
||||
2. **final mux 双重编码**: 原代码 `imageio` CRF 13 写帧后又用 `libx264 -crf 18` 重编码做 mux,现已改为 `-c:v copy` 流复制
|
||||
|
||||
这两项优化位于:
|
||||
- `latentsync/utils/util.py` — `read_video()` 函数
|
||||
- `latentsync/pipelines/lipsync_pipeline.py` — final mux 命令
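The probe-and-skip decision can be sketched as follows; helper names are illustrative (the real logic lives inside `read_video()`), and the FPS string is the `r_frame_rate` form ffprobe reports:

```python
def parse_fps(r_frame_rate):
    """ffprobe reports frame rates like '25/1' or '30000/1001'."""
    num, _, den = r_frame_rate.partition("/")
    return float(num) / float(den or 1)

def needs_fps_convert(r_frame_rate, target=25.0):
    """True only when the input is not already at the target rate."""
    return abs(parse_fps(r_frame_rate) - target) > 1e-3

def fps_convert_cmd(src, dst):
    """The re-encode that is now skipped for 25fps inputs."""
    return ["ffmpeg", "-y", "-i", src, "-r", "25", "-crf", "18", dst]
```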
---

### No-Face-Frame Tolerance (current implementation)

When some frames of a clip have no detectable face (head turns, occlusion, empty shots), inference no longer aborts:

- `affine_transform_video`: a frame that fails detection is filled from the nearest valid frame; it still errors if no frame at all has a face
- `restore_video`: faceless frames keep the original picture, with no mouth replacement
- backend `workflow.py`: if LatentSync fails outright, the original video is used as a fallback and the task does not fail

The changes live in `latentsync/pipelines/lipsync_pipeline.py`.
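The per-frame fallback can be sketched as a pure function over detected face boxes (`None` meaning no face); names and the exact fill order are illustrative:

```python
def fill_missing_faces(boxes):
    """Frames with no detected face reuse the nearest preceding valid
    frame's box (leading gaps borrow the first valid box); raises when
    no frame at all has a face, matching the all-frames-faceless error."""
    if all(b is None for b in boxes):
        raise ValueError("no face detected in any frame")
    filled, last = [], None
    for b in boxes:
        if b is not None:
            last = b
        filled.append(last)
    # leading frames before the first detection borrow the first valid box
    first = next(b for b in filled if b is not None)
    return [b if b is not None else first for b in filled]
```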
---

## References

Docs/MUSETALK_DEPLOY.md (new file)
@@ -0,0 +1,285 @@
# MuseTalk Deployment Guide

> **Updated**: 2026-03-02
> **Applies to**: MuseTalk v1.5 (resident-service mode)
> **Architecture**: FastAPI resident service + PM2 process management

---

## Architecture Overview

MuseTalk acts as the long-video engine of the **hybrid lip-sync scheme**:

- **Short videos (<100 s per this repo's current `.env`)** → LatentSync 1.6 (GPU1, port 8007)
- **Long videos (>=100 s per this repo's current `.env`)** → MuseTalk 1.5 (GPU0, port 8011)
- The routing threshold is controlled by `LIPSYNC_DURATION_THRESHOLD`
- When MuseTalk is unavailable, generation falls back to LatentSync automatically
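The routing rule above reduces to a few lines. A hypothetical sketch (the real logic lives in `backend/app/services/lipsync_service.py`):

```python
import os

def pick_lipsync_engine(audio_seconds, musetalk_alive):
    # 120 is the code default; this repo's .env sets 100
    threshold = float(os.getenv("LIPSYNC_DURATION_THRESHOLD", "120"))
    if audio_seconds >= threshold and musetalk_alive:
        return "musetalk"        # long video -> resident service on :8011
    return "latentsync"          # short video, or MuseTalk unavailable
```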
---

## Hardware Requirements

| Item | Minimum | Recommended |
|------|---------|-------------|
| GPU | 8GB VRAM (RTX 3060) | 24GB VRAM (RTX 3090) |
| RAM | 32GB | 64GB |
| CUDA | 11.7+ | 11.8 |

> MuseTalk fp16 inference needs roughly 4-8GB of VRAM and can share GPU0 with CosyVoice.

---

## Installation

### 1. Conda environment

```bash
cd /home/rongye/ProgramFiles/ViGent2/models/MuseTalk
conda create -n musetalk python=3.10 -y
conda activate musetalk
```

### 2. PyTorch 2.0.1 + CUDA 11.8

> This exact version is required: the prebuilt mmcv wheels depend on it.

```bash
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
```

### 3. Dependencies

```bash
pip install -r requirements.txt

# MMLab stack
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
pip install chumpy --no-build-isolation
pip install "mmpose==1.1.0" --no-deps

# FastAPI service dependencies
pip install fastapi uvicorn httpx
```

---

## Model Weights

### Directory layout

```
models/MuseTalk/models/
├── musetalk/                ← v1 base model
│   ├── config.json -> musetalk.json   (symlink)
│   ├── musetalk.json
│   ├── musetalkV15 -> ../musetalkV15  (symlink, critical!)
│   └── pytorch_model.bin (~3.2GB)
├── musetalkV15/             ← v1.5 UNet model
│   ├── musetalk.json
│   └── unet.pth (~3.2GB)
├── sd-vae/                  ← Stable Diffusion VAE
│   ├── config.json
│   └── diffusion_pytorch_model.bin
├── whisper/                 ← OpenAI Whisper Tiny
│   ├── config.json
│   ├── pytorch_model.bin (~151MB)
│   └── preprocessor_config.json
├── dwpose/                  ← DWPose body-pose detection
│   └── dw-ll_ucoco_384.pth (~387MB)
├── syncnet/                 ← SyncNet lip-sync evaluation
│   └── latentsync_syncnet.pt
└── face-parse-bisent/       ← face-parsing model
    ├── 79999_iter.pth (~53MB)
    └── resnet18-5c106cde.pth (~45MB)
```

### Downloading

Using the bundled script:

```bash
cd /home/rongye/ProgramFiles/ViGent2/models/MuseTalk
conda activate musetalk
bash download_weights.sh
```

Or manually via the Python API:

```bash
conda activate musetalk
export HF_ENDPOINT=https://hf-mirror.com
python -c "
from huggingface_hub import snapshot_download
snapshot_download('TMElyralab/MuseTalk', local_dir='models',
                  allow_patterns=['musetalk/*', 'musetalkV15/*'])
snapshot_download('stabilityai/sd-vae-ft-mse', local_dir='models/sd-vae',
                  allow_patterns=['config.json', 'diffusion_pytorch_model.bin'])
snapshot_download('openai/whisper-tiny', local_dir='models/whisper',
                  allow_patterns=['config.json', 'pytorch_model.bin', 'preprocessor_config.json'])
snapshot_download('yzd-v/DWPose', local_dir='models/dwpose',
                  allow_patterns=['dw-ll_ucoco_384.pth'])
"
```

### Create the required symlinks

```bash
cd /home/rongye/ProgramFiles/ViGent2/models/MuseTalk/models/musetalk
ln -sf musetalk.json config.json
ln -sf ../musetalkV15 musetalkV15
```

> **Critical**: a missing `musetalk/musetalkV15` symlink makes weight detection fail (`weights: False`).

---

## Starting the Service

### PM2 process management (recommended)

```bash
# first-time registration
cd /home/rongye/ProgramFiles/ViGent2
pm2 start run_musetalk.sh --name vigent2-musetalk
pm2 save

# day-to-day management
pm2 restart vigent2-musetalk
pm2 logs vigent2-musetalk
pm2 stop vigent2-musetalk
```

### Manual start

```bash
cd /home/rongye/ProgramFiles/ViGent2/models/MuseTalk
/home/rongye/ProgramFiles/miniconda3/envs/musetalk/bin/python scripts/server.py
```

### Health check

```bash
curl http://localhost:8011/health
# {"status":"ok","model_loaded":true}
```

---

## Backend Configuration

Relevant variables in `backend/.env`:

```ini
# MuseTalk basics
MUSETALK_GPU_ID=0                       # GPU index (shared with CosyVoice)
MUSETALK_API_URL=http://localhost:8011  # resident-service address
MUSETALK_BATCH_SIZE=32                  # inference batch size
MUSETALK_VERSION=v15                    # model version
MUSETALK_USE_FLOAT16=true               # half-precision speedup

# Inference-quality parameters
MUSETALK_DETECT_EVERY=2                 # face-detection sampling interval in frames (smaller = more accurate, slower)
MUSETALK_BLEND_CACHE_EVERY=2            # BiSeNet mask cache refresh interval (frames)
MUSETALK_AUDIO_PADDING_LEFT=2           # Whisper temporal context (left)
MUSETALK_AUDIO_PADDING_RIGHT=2          # Whisper temporal context (right)
MUSETALK_EXTRA_MARGIN=14                # v1.5 jaw-region expansion in pixels
MUSETALK_DELAY_FRAME=0                  # audio-to-mouth alignment offset (frames)
MUSETALK_BLEND_MODE=jaw                 # blend mode: auto / jaw / raw
MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH=90   # cheek width (v1.5 only)
MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH=90

# Encoding-quality parameters
MUSETALK_ENCODE_CRF=14                  # lower CRF = sharper (14 ≈ visually lossless)
MUSETALK_ENCODE_PRESET=slow             # x264 preset (slow = better compression efficiency)

# Hybrid lip-sync routing
LIPSYNC_DURATION_THRESHOLD=100          # seconds; >= this value uses MuseTalk
```

> **Preset reference**:
> - Speed-first: `DETECT_EVERY=5, BLEND_CACHE_EVERY=5, ENCODE_CRF=18, ENCODE_PRESET=medium`
> - Quality-first (current): `DETECT_EVERY=2, BLEND_CACHE_EVERY=2, ENCODE_CRF=14, ENCODE_PRESET=slow`

---

## Related Files

| File | Purpose |
|------|---------|
| `models/MuseTalk/scripts/server.py` | FastAPI resident service (port 8011) |
| `run_musetalk.sh` | PM2 launch script |
| `backend/app/services/lipsync_service.py` | hybrid routing + `_call_musetalk_server()` |
| `backend/app/core/config.py` | `MUSETALK_*` settings |

---

## Performance Optimization (server.py v2)

The first long-video test (136 s, 3404 frames) took 30 minutes. Profiling showed the bottlenecks were face detection (28%), BiSeNet compositing (22%), and I/O (17%) rather than UNet inference (17%).

### Implemented optimizations

| Optimization | Notes |
|--------------|-------|
| `MUSETALK_BATCH_SIZE` 8→32 | RTX 3090 has VRAM to spare; ~3x faster UNet inference |
| cv2.VideoCapture direct frame reads | skips the ffmpeg→PNG→imread chain |
| Face-detection sampling (every N frames) | DWPose + FaceAlignment run only on sampled frames; in-between frames get linearly interpolated bboxes |
| BiSeNet mask cache (every N frames) | `get_image_prepare_material` runs every N frames; in-between frames reuse the result |
| FFmpeg rawvideo pipe encode | the lossy `cv2.VideoWriter(mp4v)` intermediate is replaced by writing straight to an stdin pipe, removing one redundant lossy encode |
| Parameters via environment | all inference/encode parameters come from `.env`, so speed-first/quality-first presets switch quickly |
| Per-stage timing | precise timing for 7 stages to guide further tuning |
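The bbox interpolation between sampled detections can be sketched like this; the function shape is illustrative (the real code lives in `scripts/server.py`), assuming detection ran on at least one frame:

```python
def interpolate_boxes(sampled, total_frames):
    """sampled: {frame_index: (x1, y1, x2, y2)} from running detection
    every N frames; in-between frames get linearly interpolated boxes,
    and frames outside the sampled range clamp to the nearest keyframe."""
    keys = sorted(sampled)
    boxes = []
    for i in range(total_frames):
        if i in sampled:
            boxes.append(sampled[i])
            continue
        if i < keys[0]:
            boxes.append(sampled[keys[0]])   # before the first detection
            continue
        prev = max(k for k in keys if k < i)
        later = [k for k in keys if k > i]
        if not later:
            boxes.append(sampled[prev])      # after the last detection
            continue
        nxt = later[0]                        # keys sorted: nearest later keyframe
        w = (i - prev) / (nxt - prev)
        boxes.append(tuple(a + (b - a) * w
                           for a, b in zip(sampled[prev], sampled[nxt])))
    return boxes
```

For a talking-head clip whose face barely moves, the linear blend between keyframes stays within a pixel or two of a per-frame detection, which is why the speed-first preset can afford `DETECT_EVERY=5`.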
### Encoding chain

```
UNet-inferred frames (raw BGR24)
  → FFmpeg rawvideo stdin pipe
  → one libx264 encode (CRF 14, preset slow) + audio mux
  → final .mp4
```

Compared with the old flow: the lossy `cv2.VideoWriter(mp4v)` intermediate file is gone and the number of encodes drops from 2 to 1.
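A sketch of the single-pass command; the argument layout is illustrative (the real command is assembled in `server.py`):

```python
def rawvideo_encode_cmd(width, height, fps, audio_path, out_path,
                        crf=14, preset="slow"):
    """Raw BGR24 frames arrive on stdin; one libx264 encode plus the
    audio mux replaces the old cv2.VideoWriter(mp4v) intermediate."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "bgr24",
        "-s", f"{width}x{height}", "-r", str(fps), "-i", "pipe:0",
        "-i", audio_path,
        "-c:v", "libx264", "-crf", str(crf), "-preset", preset,
        "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",
        out_path,
    ]

# usage sketch (not executed here):
#   proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
#   for frame in frames:              # numpy uint8 HxWx3, BGR order
#       proc.stdin.write(frame.tobytes())
#   proc.stdin.close(); proc.wait()
```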
### Tuning parameters

All parameters are configured in `backend/.env` (restart the MuseTalk service after changes):

```ini
MUSETALK_DETECT_EVERY=2       # face-detection sampling interval (frames); 2 for quality, 5 for speed
MUSETALK_BLEND_CACHE_EVERY=2  # BiSeNet mask cache interval (frames)
MUSETALK_ENCODE_CRF=14        # encode quality (14 ≈ visually lossless, 18 = high quality)
MUSETALK_ENCODE_PRESET=slow   # encode speed (slow = better compression, medium = balanced)
```

> For talking-head footage (nearly static faces), the interpolation error at detect_every=5 is negligible.
> For fast-moving faces or best quality, use detect_every=2.

---

## Troubleshooting

### huggingface-hub version conflict

```
ImportError: huggingface-hub>=0.19.3,<1.0 is required
```

**Fix**: downgrade huggingface-hub

```bash
pip install "huggingface-hub>=0.19.3,<1.0"
```

### mmcv import failure

```bash
pip uninstall mmcv mmcv-full -y
mim install "mmcv==2.0.1"
```

### Audio/video length mismatch

Already fixed in `musetalk/utils/audio_processor.py` (zero-padding logic); no extra handling is needed.

Docs/PUBLISH_DEPLOY.md (new file)
@@ -0,0 +1,206 @@
# Multi-Platform Publishing: Deployment & Implementation (Douyin / WeChat Channels / Bilibili / Xiaohongshu)

## 1. Goals

This document centralizes:

- How platform (QR-code) login is implemented
- How the automated publishing pipeline is implemented
- The runtime environment and configuration a deployment must provide
- How to localize common failures quickly

Code scope: `backend/app/modules/publish`, `backend/app/services/publish_service.py`, `backend/app/services/qr_login_service.py`, `backend/app/services/uploader/*`.

---

## 2. Overall Architecture

### 2.1 API entry points

- `POST /api/publish`: run a publish
- `POST /api/publish/login/{platform}`: fetch a QR code and start a login session
- `GET /api/publish/login/status/{platform}`: poll scan status
- `POST /api/publish/logout/{platform}`: log out and delete the platform's cookies
- `POST /api/publish/cookies/save/{platform}`: manually save a browser `document.cookie`
- `GET /api/publish/accounts`: query per-platform login state
- `GET /api/publish/screenshot/(unknown)`: fetch the publish-success screenshot (login required)

Core router file: `backend/app/modules/publish/router.py`.

### 2.2 Service layering

- `PublishService`: platform routing, account isolation, video-path handling, invoking the concrete uploader
- `QRLoginService`: Playwright QR-code capture, scan-result monitoring, cookie saving
- `*Uploader`: per-platform publish automation (Douyin/WeChat/Xiaohongshu via Playwright; Bilibili via biliup)

---

## 3. Cookies & Account Isolation

### 3.1 Storage paths

- Per-user path: `backend/user_data/{user_uuid}/cookies/{platform}_cookies.json`
- Legacy-compatible path: `backend/app/cookies/{platform}_cookies.json`

Path management: `backend/app/core/paths.py`.

### 3.2 Cookie formats

- `bilibili`: simplified dict (`SESSDATA` / `bili_jct` / `DedeUserID` / `DedeUserID__ckMd5`)
- `douyin` / `weixin` / `xiaohongshu`: Playwright `storage_state` format (`cookies + origins`)

Corresponding logic: `backend/app/services/publish_service.py` and `backend/app/services/qr_login_service.py`.
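The two formats are easy to bridge. A hypothetical helper (not a function in the codebase) that pulls the four simplified bilibili fields out of a Playwright `storage_state` dict:

```python
BILI_KEYS = ("SESSDATA", "bili_jct", "DedeUserID", "DedeUserID__ckMd5")

def storage_state_to_bili(storage_state):
    """Extract the simplified bilibili cookie dict from a Playwright
    storage_state ({'cookies': [...], 'origins': [...]}); raises when
    any required field is absent."""
    found = {c["name"]: c["value"]
             for c in storage_state.get("cookies", [])
             if c["name"] in BILI_KEYS}
    missing = [k for k in BILI_KEYS if k not in found]
    if missing:
        raise ValueError(f"missing bilibili cookies: {missing}")
    return found
```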
---

## 4. Runtime & Deployment Requirements

### 4.1 System dependencies

- Python 3.10+
- Node.js 18+
- Playwright Chromium (`playwright install chromium`)
- System Chrome (recommended)
- Xvfb (recommended, especially for Douyin/WeChat headful mode)

### 4.2 Startup

- Start the backend with the repo-root script: `./run_backend.sh`
- The script wraps the process in `xvfb-run`, which suits servers without a physical desktop

Script: `run_backend.sh`.

### 4.3 Environment variables (core)

All are set in `backend/.env`; the definitions live in `backend/app/core/config.py`.

- Douyin: `DOUYIN_HEADLESS_MODE`, `DOUYIN_CHROME_PATH`, `DOUYIN_USER_AGENT`, `DOUYIN_LOCALE`, `DOUYIN_TIMEZONE_ID`
- WeChat: `WEIXIN_HEADLESS_MODE`, `WEIXIN_CHROME_PATH`, `WEIXIN_USER_AGENT`, `WEIXIN_LOCALE`, `WEIXIN_TIMEZONE_ID`, `WEIXIN_TRANSCODE_MODE`
- Xiaohongshu: `XIAOHONGSHU_HEADLESS_MODE`, `XIAOHONGSHU_CHROME_PATH`, `XIAOHONGSHU_USER_AGENT`, `XIAOHONGSHU_LOCALE`, `XIAOHONGSHU_TIMEZONE_ID`
- Publish screenshot directory: `PUBLISH_SCREENSHOT_DIR`

Note: the Xiaohongshu variables currently apply to the publish uploader. In the QR-login service, Douyin/WeChat use their own configs, while Bilibili/Xiaohongshu login uses the generic default browser parameters.

---

## 5. Login Implementation (QR code)

Handled uniformly by `QRLoginService`:

1. Open the platform's login page and extract the QR code (multiple CSS/text strategies)
2. Show the QR code to the user for scanning
3. Monitor URL and session-cookie changes in the background
4. Save the cookie file once login succeeds

Key file: `backend/app/services/qr_login_service.py`.
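Step 3 is essentially a polling loop over two signals. A minimal sketch with injectable probe/clock/sleep so the loop is testable; all names here are illustrative, not the real service API:

```python
import time

def wait_for_login(probe, success_url_part, session_cookie,
                   timeout=180.0, interval=2.0,
                   clock=time.monotonic, sleep=time.sleep):
    """probe() -> (current_url, {cookie_name: value}). Returns 'success'
    once the URL leaves the login page or the session cookie appears,
    'timeout' otherwise."""
    deadline = clock() + timeout
    while clock() < deadline:
        url, cookies = probe()
        if session_cookie in cookies or success_url_part in url:
            return "success"
        sleep(interval)
    return "timeout"
```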
### 5.1 Douyin

- Login page: `https://creator.douyin.com/`
- Extra capability: listens for the `check_qrconnect` endpoint and recognizes `redirect_url`
- Special case: if face verification is triggered, the verification QR code is extracted and returned to the frontend as `face_verify_qr`

### 5.2 WeChat Channels

- Login page: `https://channels.weixin.qq.com/platform/`
- QR extraction falls back across `img/canvas/svg` selectors

### 5.3 Xiaohongshu

- Login page: `https://creator.xiaohongshu.com/`
- Key fix: the page may land on SMS login by default; the service switches to QR mode before extracting the code
- Success detection also accepts `/new/home`, instead of relying only on the old `success_indicator`

### 5.4 Bilibili

- Login page: `https://passport.bilibili.com/login`
- After a successful scan, the core Bilibili cookie fields are saved

---

## 6. Automated Publishing

### 6.1 Douyin (Playwright)

File: `backend/app/services/uploader/douyin_uploader.py`

- Opens a browser context from `storage_state`
- Navigates to the upload page and triggers the file chooser
- After the upload, fills in title/description/topics and handles the cover when needed
- Success detection: page navigation, API signals, and a management-page cross-check
- On success, writes cookies back and saves a publish-success screenshot

### 6.2 WeChat Channels (Playwright)

File: `backend/app/services/uploader/weixin_uploader.py`

- Opens the Channels creator platform and locates the upload entry automatically
- Title/description/tags are written into the "video description" field per current product rules
- Success detection: the `post_create` API or leaving the creation page
- On success, writes cookies back and saves a publish-success screenshot

### 6.3 Xiaohongshu (Playwright)

File: `backend/app/services/uploader/xiaohongshu_uploader.py`

- Navigates to the publish page and triggers the upload
- Upload-stage hardening:
  - an `UPLOAD_SIGNAL_TIMEOUT` startup-detection window
  - videos without a file extension get a suffixed temporary file (`hardlink/copy`)
  - filename-suffix consistency checks
  - an `UPLOAD_IDLE_TIMEOUT` idle-timeout guard against long "silent hangs"
- Success detection: URL redirect + success copy + publish-API signal
- On success, writes cookies back and returns the success-screenshot URL
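The suffix fallback in the list above can be sketched as follows; the helper name and return shape are illustrative, not the uploader's actual code:

```python
import os
import shutil
import tempfile

def ensure_video_suffix(path, suffix=".mp4"):
    """If the file has no extension, expose it under a suffixed temp name
    via hardlink (cheap) or copy (cross-filesystem fallback), so browser
    upload widgets see a recognizable video file. Returns (path, is_temp)."""
    if os.path.splitext(path)[1]:
        return path, False            # already suffixed: use as-is
    tmp = os.path.join(tempfile.mkdtemp(), os.path.basename(path) + suffix)
    try:
        os.link(path, tmp)            # hardlink: no extra disk usage
    except OSError:
        shutil.copy(path, tmp)        # e.g. different filesystem
    return tmp, True                  # True -> caller cleans up afterwards
```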
### 6.4 Bilibili (biliup)

File: `backend/app/services/uploader/bilibili_uploader.py`

- Uses the biliup SDK; the publish flow does not depend on Playwright
- Reads the Bilibili cookies, then uploads and submits via biliup
- Returns the `bvid/aid` link when the API provides one

---

## 7. Debugging & Troubleshooting

### 7.1 Backend logs

- PM2 stdout log: `~/.pm2/logs/vigent2-backend-out.log`
- PM2 error log: `~/.pm2/logs/vigent2-backend-error.log`

### 7.2 Common issues

- Symptom: the login QR code cannot be fetched
  - First check whether the platform's login page changed (stale selectors)
  - For Xiaohongshu, confirm the page is not stuck on the SMS-login view

- Symptom: publishing appears stuck
  - Check for long waits in "waiting for upload state / waiting for publish result"
  - For Xiaohongshu, check the upload filename suffix and MIME detection first

- Symptom: a sudden re-login request
  - Usually expired cookies or platform risk control; re-scan the QR code

### 7.3 Debug artifacts

- Enabling the matching `*_DEBUG_ARTIFACTS` flag produces debug screenshots and network logs
- Success screenshots are served to the frontend via `/api/publish/screenshot/(unknown)`

---

## 8. Suggested Acceptance Flow (after every deploy)

1. Health check: `curl http://127.0.0.1:8006/health`
2. Login check: trigger QR login on all 4 platforms and confirm status polling reaches success
3. Publish check: one test video per platform (or at minimum the platforms changed that day)
4. Screenshot check: confirm success screenshots are reachable via `/api/publish/screenshot/(unknown)`
5. Log check: no sustained retries, no long idling, no selector-failure storms

---

## 9. Related Documents

- Main deployment manual: `Docs/DEPLOY_MANUAL.md`
- Backend guide: `Docs/BACKEND_README.md`
- Day's change log: `Docs/DevLogs/Day31.md`
@@ -1,6 +1,10 @@
# Qwen3-TTS 1.7B Deployment Guide

> This document describes deploying the Qwen3-TTS 1.7B-Base voice-cloning model on an Ubuntu server.
>
> ⚠️ **Status: archived (retired)**
> Production has switched to CosyVoice 3.0; refer to `Docs/COSYVOICE3_DEPLOY.md` first.
> This document is kept only to trace the old approach; new deployments should not use it.

## System Requirements

@@ -16,14 +16,16 @@
```
text → EdgeTTS → audio → LatentSync → FFmpeg compositing → final video

New flow (single clip):
text → EdgeTTS/CosyVoice/pre-generated dub → audio ─┬→ LatentSync/MuseTalk → lip-synced video ─┐
                                                    └→ faster-whisper → subtitle JSON ─────────┴→ Remotion compositing → final video

New flow (multi clip):
audio → clips concatenated per custom_assignments → LatentSync/MuseTalk (single inference) → lip-synced video ─┐
audio → faster-whisper → subtitle JSON ────────────────────────────────────────────────────────────────────────┴→ Remotion compositing → final video
```

> **Lip-sync routing**: short videos (<100 s per this repo's current `.env`) use LatentSync 1.6 (GPU1); long videos (>=100 s) use MuseTalk 1.5 (GPU0), controlled by `LIPSYNC_DURATION_THRESHOLD`.

## System Requirements

| Component | Requirement |
@@ -144,8 +146,8 @@ remotion/
| Stage | Progress | Notes |
|-------|----------|-------|
| Download materials | 0% → 5% | download input videos from Supabase |
| TTS generation | 5% → 25% | EdgeTTS / CosyVoice / pre-generated dub download |
| Lip sync | 25% → 80% | LatentSync / MuseTalk (threshold routing) |
| Subtitle alignment | 80% → 85% | faster-whisper word-level timestamps |
| Remotion render | 85% → 95% | composite subtitles and titles |
| Upload result | 95% → 100% | upload to Supabase Storage |
@@ -185,6 +187,7 @@ Remotion render parameters are configured in `backend/app/services/remotion_service.py`:
| Parameter | Default | Notes |
|-----------|---------|-------|
| `fps` | 25 | output frame rate |
| `concurrency` | 4 | Remotion concurrent render processes (default 4, overridable via the `--concurrency` CLI flag) |
| `title_display_mode` | `short` | title display mode (`short` = brief; `persistent` = always on) |
| `title_duration` | 4.0 | title display duration (seconds, `short` mode only) |

@@ -238,6 +241,15 @@ const bundleLocation = await bundle({
```
const videoUrl = staticFile(videoSrc); // use staticFile
```

**Issue**: Remotion render fails with a 404 missing video file (bundle-cache problem)

Remotion uses a bundle cache to speed up packaging. On a cache hit, newly generated video/font files must be hard-linked into the cached `public/` directory. If you hit 404 errors, clear the cache and retry:

```bash
rm -rf /home/rongye/ProgramFiles/ViGent2/remotion/.remotion-bundle-cache
pm2 restart vigent2-backend
```

**Issue**: Remotion render fails

Check the backend logs:

@@ -273,7 +285,7 @@ wget https://github.com/googlefonts/noto-cjk/raw/main/Sans/OTF/SimplifiedChinese

### Using GPU 0

faster-whisper defaults to GPU 0 and shares it with MuseTalk; LatentSync uses GPU 1, so they do not conflict. To pin a GPU:

```python
# change in whisper_service.py
WhisperService(device="cuda:0")  # or "cuda:1"
```

@@ -290,3 +302,7 @@
| 2026-02-10 | 1.1.0 | architecture-diagram update: multi-clip concat-then-infer, pre-generated dub option |
| 2026-01-30 | 1.0.1 | subtitle-highlight styling and title-animation polish for clearer visuals |
| 2026-02-25 | 1.2.0 | subtitle timestamps moved from linear interpolation to Whisper rhythm mapping, fixing long-video drift |
| 2026-02-27 | 1.3.0 | MuseTalk hybrid routing added to the diagram; Remotion concurrency 8→16; GPU-allocation notes updated |
| 2026-02-28 | 1.3.1 | MuseTalk compositing: pure-numpy blending + FFmpeg pipe NVENC GPU encoding replaces the double encode |
| 2026-02-28 | 1.4.0 | compose stream copy instead of re-encode; FFmpeg timeout guards (600s/30s); Remotion concurrency 16→4; Whisper timestamp smoothing + source rhythm mapping; global generation Semaphore(2); Redis task TTL |
| 2026-03-02 | 1.5.0 | Remotion bundle-cache fix (hardlink videos/fonts into the cached public dir); prepare_segment/normalize CRF 23→18; multi-clip concat stream copy; MuseTalk compositing via rawvideo pipe + `libx264` (configurable CRF/preset) |

@@ -1,8 +1,8 @@
# ViGent2 Development Task List (Task Log)

**Project**: ViGent2 digital-human talking-head video generation system
**Progress**: 100% (Day 31 - publish-login stability fixes + documentation consolidation)
**Updated**: 2026-03-03

---

@@ -10,7 +10,74 @@

> Each day's core development work and milestones are recorded here.

### Day 31: Documentation Consolidation + Voice Audition + Recording-Modal Rework + Publish-Login Stability Fixes (Current)
- [x] **Documentation consolidation**: clarified README/DEV responsibility boundaries, aligned deployment parameters with the code, archived the Qwen3-TTS document as historical.
- [x] **Voice audition**: added and enabled `GET/POST /api/videos/voice-preview`; the frontend now plays the GET audio stream directly, fixing the production 404 (takes effect after a backend restart).
- [x] **Recording rework**: moved the recording entry to the bottom of the reference-audio area and switched to a modal flow; the modal can close right after recording while upload and transcription continue in the background.
- [x] **Unified modal system**: extracted `AppModal` (shared overlay/focus/scroll-lock/Portal, accessibility gaps filled); major modals migrated (preview, extraction, rewrite, trimming, recording, password change, publish login).
- [x] **Douyin QR fix**: login-page wait strategy switched to `domcontentloaded` with navigation-timeout tolerance, eliminating "cannot fetch QR code" failures.
- [x] **WeChat QR polish**: the backend now prefers exporting the original PNG, and the frontend shows it in a white-padded container, fixing the "clipped QR edge" look.
- [x] **Publish performance**: the publish page now uses bounded concurrency (2), noticeably cutting total multi-platform publish wait time.
- [x] **WeChat upload log noise**: the `file_input empty` warning is now signal-driven; non-final retries demoted to info, reducing false alarms.
- [x] **Xiaohongshu publish rework**: aligned with the Douyin/WeChat uploader architecture; added launch configuration, multi-signal upload/publish detection, success screenshots, and `screenshot_url` return.
- [x] **Unified cookie format**: non-Bilibili platforms now save Playwright `storage_state`, which uploaders can load into a browser context directly.
- [x] **Xiaohongshu QR fix**: auto-switches from SMS login to the QR view before extracting the code; success detection now also accepts `/new/home`.
- [x] **Xiaohongshu "stuck upload" fix**: suffix-less videos get a suffixed temporary-file fallback (hardlink/copy), filename-suffix consistency checks, and a 90 s idle-timeout guard.
- [x] **End-to-end verification**: Xiaohongshu `POST /api/publish` succeeded in a live test (45.77 s) with the success-screenshot endpoint reachable.
- [x] **Documentation additions**: added `Docs/PUBLISH_DEPLOY.md` and updated `README.md`, `BACKEND_README.md`, `BACKEND_DEV.md`, `DEPLOY_MANUAL.md`.
- [x] **Doc-rule alignment**: updated `Docs/DOC_RULES.md` with publish "triple-check" items and sensitive-information handling, added the `PUBLISH_DEPLOY.md` checklist entry, switched tooling guidance to `Read/Grep/apply_patch`, and aligned the TASK_COMPLETE checklist.

### Day 30: Remotion Cache Fix + Encoding-Pipeline Quality + Lip-Sync Tolerance + Unified Dropdowns
- [x] **Remotion cache 404 fix**: on a bundle-cache hit, newly generated video/font files were missing from the old cached `public/` directory → 404 → fallback to FFmpeg (no titles/subtitles). Now hard-links (`fs.linkSync`) the files the current render needs into the cache directory.
- [x] **LatentSync `read_video` skips the redundant FPS re-encode**: probes the input FPS and skips the `ffmpeg -r 25 -crf 18` re-encode when already 25fps.
- [x] **LatentSync final-mux stream copy**: after `imageio` writes frames at CRF 13, the mux step changed from `libx264 -crf 18` to `-c:v copy`, removing a redundant double encode.
- [x] **`prepare_segment` + `normalize_orientation` CRF bump**: CRF 23 → 18, matching LatentSync's internal quality standard.
- [x] **Multi-clip concat stream copy**: with per-segment parameters already unified, `concat_videos` changed from `libx264 -crf 23` to `-c:v copy`.
- [x] **Total encode count**: lossy encodes down from 5-6 to 3 (prepare_segment → LatentSync/MuseTalk model output → Remotion).
- [x] **LatentSync no-face tolerance**: frames without a detectable face no longer abort inference; faceless frames keep the original picture; a failing clip falls back to the original video.
- [x] **MuseTalk pipe encode**: the lossy `cv2.VideoWriter(mp4v)` intermediate replaced by an FFmpeg rawvideo stdin pipe, removing one redundant lossy encode.
- [x] **MuseTalk parameters via env**: inference/encode parameters (detect_every/blend_cache/CRF/preset etc.) moved from hardcoded values to `backend/.env`; currently on the quality-first preset (CRF 14, preset slow, detect_every 2, blend_cache_every 2).
- [x] **Async workflow unblocking**: new `_run_blocking()` thread-pool helper; 5 synchronous FFmpeg calls (orientation normalization / prepare_segment / concat / BGM mixing) now go through `await _run_blocking()`, so the event loop no longer blocks.
- [x] **Compose skip**: with no BGM, `final_audio_path == audio_path` skips the redundant compose step; the Remotion path uses the lipsync output directly, the non-Remotion path passes through via `shutil.copy`.
- [x] **`compose()` made async**: now `async def`, with `_get_duration` and `_run_ffmpeg` on `run_in_executor`.
- [x] **Same-resolution scale skip**: per-segment resolution comparison; matches pass `None` to take the copy branch (single-clip likewise), avoiding pointless re-encodes at the target resolution.
- [x] **`_get_duration()` on the thread pool**: 3 synchronous ffprobe calls in the workflow now use `await _run_blocking()`.
- [x] **Compose-loop CRF unified**: looping scenario CRF 23 → 18, matching the pipeline-wide standard.
- [x] **Multi-clip segment validation**: after prepare, segment counts are validated so empty segments never reach concat.
- [x] **Frontend lip-sync model choice**: new model dropdown next to the generate button (default / fast / advanced); `lipsync_model` passes through the whole chain to backend routing. Default keeps threshold routing, fast forces MuseTalk, advanced forces LatentSync; all three fall back to LatentSync. Choice persisted in localStorage.
- [x] **Unified dropdown component**: new `SelectPopover` (desktop Popover + mobile BottomSheet) covering the main selectors on the home/publish pages (voices, reference audio, dubs, materials, BGM, works, styles, model, aspect ratio).
- [x] **Dropdown UX fixes**: unified occlusion handling (Portal + fixed), auto flip-up, trigger-width matching, opaque background, hidden scrollbars, reopen-to-selection.
- [x] **Preview interplay fixes**: clicking a video preview inside a dropdown no longer collapses the menu; the preview modal stacks above dropdowns; continuous previews work after closing one.
- [x] **BGM interaction convergence**: BGM selection now matches the publish page (search + list + audition); the home-page volume slider was removed per product request, and generation requests use a fixed `bgm_volume=0.2`.
- [x] **Exception rollback**: `ScriptEditor`'s "history scripts / AI multilingual" menus restored to their original lightweight style (not forced onto SelectPopover).
- [x] **Doc sync**: Day30 / TASK_COMPLETE / FRONTEND_DEV / FRONTEND_README / README / BACKEND_README all updated to the final implementation.

### Day 29: Video-Pipeline Optimization + CosyVoice Tone Control
- [x] **Subtitle-sync fix**: three-step Whisper timestamp smoothing (monotonic increase + overlap removal + gap filling) plus source-text rhythm mapping (linear interpolation + per-character duration clamping).
- [x] **LatentSync mouth tuning**: inference_steps 16→20, guidance_scale 2.0, DeepCache enabled, Remotion concurrency 16→4.
- [x] **Compose stream copy**: `-c:v copy` replaces the libx264 re-encode when not looping; compose drops from minutes to seconds.
- [x] **FFmpeg timeout guards**: `_run_ffmpeg()` timeout=600, `_get_duration()` timeout=30.
- [x] **Global concurrency cap**: `asyncio.Semaphore(2)` limits simultaneous generation tasks.
- [x] **Redis task TTL**: create 24 h, completed/failed 2 h; the list endpoint auto-prunes expired indexes.
- [x] **Temp-font cleanup**: font files joined the temp_files cleanup list.
- [x] **Preview-background CORS fix**: same-origin material proxy `/api/materials/stream/{id}` fully bypasses cross-origin issues.
- [x] **CosyVoice tone control**: voice-clone mode gained a tone dropdown (normal/cheerful/deep/serious) driven by `inference_instruct2()` natural-language instructions; instruct_text passes through the whole chain; the default "normal" behaves as before.

### Day 28: CosyVoice FP16 Speedup + Full Documentation Pass
- [x] **CosyVoice FP16 half precision**: `AutoModel()` with `fp16=True`; LLM inference and Flow Matching run in automatic mixed precision, an estimated 30-40% speedup with ~30% less VRAM.
- [x] **Documentation pass**: README.md / DEPLOY_MANUAL.md / SUBTITLE_DEPLOY.md / BACKEND_README.md updated with the MuseTalk hybrid lip-sync scheme, performance optimizations, and Remotion concurrent rendering.

### Day 27: Remotion Stroke Fix + Font-Style Expansion + Hybrid Lip Sync + Performance
- [x] **Stroke-rendering fix**: titles/subtitles/captions switched from a 4-direction `textShadow` hack to native CSS `-webkit-text-stroke` + `paint-order: stroke fill`, fixing over-thick strokes and subtitle ghosting.
- [x] **Font-style expansion**: title styles 4→12 (adding Pangmen Zhengdao / Youshe Title Round / Alibaba ShuHei / Wendao Chaohei / Wujie Hei / Houdi Hei / Hanchan Banyuan / Xinyi Jixiang Song), caption styles 4→8 (adding girly pink / fresh green / golden clerical / red kaiti).
- [x] **Stroke-parameter tuning**: all presets' `stroke_size` reduced from 8 to 4-5 for a cleaner look with native strokes.
- [x] **TypeScript fixes**: Root.tsx `Composition` generics aligned with `calculateMetadata` parameter types; Video.tsx `VideoProps` gained an index signature for `Record<string, unknown>`; VideoLayer.tsx dropped the `loop` prop that `OffthreadVideo` doesn't support.
- [x] **Progress-bar copy reverted**: back to the fixed "AI generating..." text instead of backend-pushed messages.
- [x] **MuseTalk hybrid lip sync**: MuseTalk 1.5 resident service deployed (GPU0, port 8011) with duration-based auto routing (controlled by `LIPSYNC_DURATION_THRESHOLD`; this repo's `.env` sets 100) — short videos use LatentSync, long videos MuseTalk, with automatic fallback when MuseTalk is down.
- [x] **MuseTalk inference optimization**: server.py v2 rewrite — direct cv2 frame reads (skipping ffmpeg→PNG), face detection every 5 frames, BiSeNet mask cache every 5 frames, direct cv2.VideoWriter output (skipping PNG writes), batch_size 8→32; estimated 30 min→8-10 min (~3x).
- [x] **Remotion concurrency**: render.ts gained a concurrency parameter, raised from the default 8 to 16 (56-core CPU); estimated 5 min→2-3 min.

### Day 26: Frontend Polish: Section Merging + Numbered Headings + UI Refinement
- [x] **Section merging**: the home page's 9 standalone sections merged into 5 main sections (dub method + dub list → III. Dubbing; materials + timeline → IV. Material Editing; works history + preview → VI. Works).
- [x] **Chinese numbered headings**: sections numbered 一-十 (home 一-六, publish 七-十); all emoji icons removed.
- [x] **Embedded mode**: 6 components accept an `embedded` prop to skip the outer card/heading when nested.
@@ -244,7 +311,7 @@
| **Core API** | 100% | ✅ stable |
| **Web UI** | 100% | ✅ stable (mobile-adapted) |
| **Lip sync** | 100% | ✅ LatentSync 1.6 |
| **TTS dubbing** | 100% | ✅ EdgeTTS + CosyVoice 3.0 + dub-first workflow + timeline editing + auto transcription + speed control + tone control |
| **Auto publishing** | 100% | ✅ Douyin / WeChat Channels / Bilibili / Xiaohongshu |
| **User auth** | 100% | ✅ phone number + JWT |
| **Paid membership** | 100% | ✅ Alipay desktop payment + auto activation |

43
README.md
43
README.md
@@ -4,7 +4,7 @@
|
||||
|
||||
> 📹 **上传人物** · 🎙️ **输入文案** · 🎬 **一键成片**
|
||||
|
||||
基于 **LatentSync 1.6 + EdgeTTS** 的开源数字人口播视频生成系统。
|
||||
基于 **LatentSync 1.6 + MuseTalk 1.5 混合唇形同步** 的开源数字人口播视频生成系统。
|
||||
集成 **CosyVoice 3.0** 声音克隆与自动社交媒体发布功能。
|
||||
|
||||
[功能特性](#-功能特性) • [技术栈](#-技术栈) • [文档中心](#-文档中心) • [部署指南](Docs/DEPLOY_MANUAL.md)
|
||||
@@ -16,28 +16,29 @@
|
||||
## ✨ 功能特性
|
||||
|
||||
### 核心能力
|
||||
- 🎬 **高清唇形同步** - LatentSync 1.6 驱动,512×512 高分辨率 Latent Diffusion 模型。
|
||||
- 🎙️ **多模态配音** - 支持 **EdgeTTS** (微软超自然语音, 10 语言) 和 **CosyVoice 3.0** (3秒极速声音克隆, 9语言+18方言, 语速可调)。上传参考音频自动 Whisper 转写 + 智能截取。配音前置工作流:先生成配音 → 选素材 → 生成视频。
|
||||
- 🎬 **高清唇形同步** - 混合方案:短视频(本仓库当前 `.env` 阈值 100s,可配)用 LatentSync 1.6(高质量 Latent Diffusion),长视频用 MuseTalk 1.5(实时级单步推理),自动路由 + 回退。前端可选模型:默认模型(阈值自动路由)/ 快速模型(速度优先)/ 高级模型(质量优先)。
|
||||
- 🎙️ **多模态配音** - 支持 **EdgeTTS** (微软超自然语音, 10 语言) 和 **CosyVoice 3.0** (3秒极速声音克隆, 9语言+18方言, 语速/语气可调)。上传参考音频自动 Whisper 转写 + 智能截取。配音前置工作流:先生成配音 → 选素材 → 生成视频。
|
||||
- 📝 **智能字幕** - 集成 faster-whisper + Remotion,自动生成逐字高亮 (卡拉OK效果) 字幕。
|
||||
- 🎨 **样式预设** - 标题/副标题/字幕样式选择 + 预览 + 字号调节,支持自定义字体库。
|
||||
- 🎨 **样式预设** - 12 种标题 + 8 种字幕样式预设,支持预览 + 字号调节 + 自定义字体库。CSS 原生描边渲染,清晰无重影。
|
||||
- 🏷️ **标题显示模式** - 片头标题支持 `短暂显示` / `常驻显示`,默认短暂显示(4秒),用户偏好自动持久化。
|
||||
- 📌 **片头副标题** - 可选副标题显示在主标题下方,独立样式配置,AI 可同时生成,20 字限制。
|
||||
- 🖼️ **作品预览一致性** - 标题/字幕预览与 Remotion 成片统一响应式缩放和自动换行,窄屏画布也稳定显示。
|
||||
- 🎞️ **多素材多机位** - 支持多选素材 + 时间轴编辑器 (wavesurfer.js 波形可视化),拖拽分割线调整时长、拖拽排序切换机位、按 `source_start/source_end` 截取片段。
|
||||
- 📐 **画面比例控制** - 时间轴一键切换 `9:16 / 16:9` 输出比例,生成链路全程按目标比例处理。
|
||||
- 💾 **用户偏好持久化** - 首页状态统一恢复/保存,刷新后延续上次配置。历史文案手动保存与加载。
|
||||
- 🎵 **背景音乐** - 试听 + 音量控制 + 混音,保持配音音量稳定。
|
||||
- 💾 **用户偏好持久化** - 首页状态统一恢复/保存,刷新后延续上次配置;新作品生成后优先选中最新,后续用户手动选择持续持久化。
|
||||
- 🎵 **背景音乐** - 试听 + 搜索选择 + 混音(当前前端固定混音系数,保持配音音量稳定)。
|
||||
- 🧩 **统一选择器交互** - 首页/发布页业务选择项统一 SelectPopover(桌面 Popover / 移动端 BottomSheet),支持自动上拉、已选定位与连续预览。
|
||||
- 🤖 **AI 辅助创作** - 内置 GLM-4.7-Flash,支持 B站/抖音链接文案提取、AI 智能改写(支持自定义提示词)、标题/标签自动生成、9 语言翻译。
|
||||
|
||||
### 平台化功能
|
||||
- 📱 **全自动发布** - 支持抖音/微信视频号/B站/小红书立即发布;扫码登录 + Cookie 持久化。
|
||||
- 🖥️ **发布管理预览** - 支持签名 URL / 相对路径作品预览,确保可直接播放。
|
||||
- 📸 **发布结果可视化** - 抖音/微信视频号发布成功后返回截图,发布页结果卡片可直接查看。
|
||||
- 📸 **发布结果可视化** - 抖音/微信视频号/小红书发布成功后返回截图,发布页结果卡片可直接查看。
|
||||
- 🛡️ **发布防误操作** - 发布进行中自动提示“请勿刷新或关闭网页”,并拦截刷新/关页二次确认。
|
||||
- 💳 **付费会员** - 支付宝电脑网站支付自动开通会员,到期自动停用并引导续费,管理员手动激活并存。
|
||||
- 🔐 **认证与隔离** - 基于 Supabase 的用户隔离,支持手机号注册/登录、密码管理。
|
||||
- 🛡️ **服务守护** - 内置 Watchdog 看门狗机制,自动监控并重启僵死服务,确保 7x24h 稳定运行。
|
||||
- 🚀 **性能优化** - 视频预压缩、模型常驻服务(近实时加载)、双 GPU 流水线并发。
|
||||
- 🚀 **性能优化** - 编码流水线从 5-6 次有损编码精简至 3 次(prepare_segment → 模型输出 → Remotion)、compose 流复制免重编码、同分辨率跳过 scale、FFmpeg 超时保护、全局视频生成并发限制 (Semaphore(2))、Remotion 4 并发渲染、MuseTalk rawvideo 管道直编码(消除中间有损文件)、模型常驻服务、双 GPU 流水线并发、Redis 任务 TTL 自动清理、workflow 阻塞调用线程池化。
|
||||
|
||||
---
|
||||
|
||||
@@ -46,9 +47,9 @@

| Domain | Core technology | Notes |
|------|----------|------|
| **Frontend** | Next.js 16 | TypeScript, TailwindCSS, SWR, wavesurfer.js |
| **Backend** | FastAPI | Python 3.10, AsyncIO, PM2 |
| **Backend** | FastAPI | Python 3.12, AsyncIO, PM2 |
| **Database** | Supabase | PostgreSQL, Storage (local/S3), Auth |
| **Lip sync** | LatentSync 1.6 | PyTorch 2.5, Diffusers, DeepCache |
| **Lip sync** | LatentSync 1.6 + MuseTalk 1.5 | Hybrid routing: diffusion quality for short videos, single-step near-real-time inference for long ones |
| **Voice cloning** | CosyVoice 3.0 | 0.5B parameters, 9 languages + 18 dialects |
| **Automation** | Playwright | Headless-browser automation for social platforms |
| **Deployment** | Docker & PM2 | Hybrid deployment architecture |
@@ -60,16 +61,20 @@

We provide detailed development and deployment documentation:

### Deployment & operations
- **[Deployment manual (DEPLOY_MANUAL.md)](Docs/DEPLOY_MANUAL.md)** - 👈 **Start here for deployment**! Covers the full environment setup.
- [Reference-audio service deployment (COSYVOICE3_DEPLOY.md)](Docs/COSYVOICE3_DEPLOY.md) - Voice-cloning model deployment guide.
- [LatentSync deployment guide](models/LatentSync/DEPLOY.md) - Standalone deployment of the lip-sync model.
- **[Deployment manual (DEPLOY_MANUAL.md)](Docs/DEPLOY_MANUAL.md)** - 👈 **Start here for deployment**! Covers the full environment setup.
- [Multi-platform publish deployment (PUBLISH_DEPLOY.md)](Docs/PUBLISH_DEPLOY.md) - Dedicated doc for Douyin / WeChat Channels / Bilibili / Xiaohongshu login and automated publishing.
- [Reference-audio service deployment (COSYVOICE3_DEPLOY.md)](Docs/COSYVOICE3_DEPLOY.md) - Voice-cloning model deployment guide.
- [LatentSync deployment guide (LATENTSYNC_DEPLOY.md)](Docs/LATENTSYNC_DEPLOY.md) - Standalone deployment of the lip-sync model.
- [MuseTalk deployment guide (MUSETALK_DEPLOY.md)](Docs/MUSETALK_DEPLOY.md) - Long-video lip-sync model deployment.
- [Supabase deployment guide (SUPABASE_DEPLOY.md)](Docs/SUPABASE_DEPLOY.md) - Supabase and auth-system configuration.
- [Alipay deployment guide (ALIPAY_DEPLOY.md)](Docs/ALIPAY_DEPLOY.md) - Alipay paid-membership configuration.

### Development docs
- [Backend development guide](Docs/BACKEND_README.md) - API conventions and development workflow.
- [Backend development standards](Docs/BACKEND_DEV.md) - Layering conventions and coding habits.
- [Frontend development guide](Docs/FRONTEND_DEV.md) - UI components and page conventions.
- [Backend development guide (BACKEND_README.md)](Docs/BACKEND_README.md) - API conventions and development workflow.
- [Backend development standards (BACKEND_DEV.md)](Docs/BACKEND_DEV.md) - Layering conventions and coding habits.
- [Frontend development guide (FRONTEND_DEV.md)](Docs/FRONTEND_DEV.md) - UI components and page conventions.
- [Frontend component docs (FRONTEND_README.md)](Docs/FRONTEND_README.md) - Component structure and section descriptions.
- [Remotion subtitle deployment (SUBTITLE_DEPLOY.md)](Docs/SUBTITLE_DEPLOY.md) - Subtitle rendering service deployment.
- [Dev logs (DevLogs)](Docs/DevLogs/) - Daily progress and technical-decision records.

---
@@ -86,7 +91,8 @@ ViGent2/
├── frontend/        # Next.js frontend app
├── remotion/        # Remotion video rendering (title/subtitle compositing)
├── models/          # AI model repositories
│   ├── LatentSync/  # Lip-sync service
│   ├── LatentSync/  # Lip-sync service (GPU1, short videos)
│   ├── MuseTalk/    # Lip-sync service (GPU0, long videos)
│   └── CosyVoice/   # Voice-cloning service
└── Docs/            # Project documentation
```
@@ -101,7 +107,8 @@ ViGent2/
|----------|------|------|
| **Web UI** | 3002 | User entry point (Next.js) |
| **Backend API** | 8006 | Core business API (FastAPI) |
| **LatentSync** | 8007 | Lip-sync inference service |
| **LatentSync** | 8007 | Lip-sync inference service (GPU1, short videos) |
| **MuseTalk** | 8011 | Lip-sync inference service (GPU0, long videos) |
| **CosyVoice 3.0** | 8010 | Voice-cloning inference service |
| **Supabase** | 8008 | Database and auth gateway |

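The services in the table above can be sanity-checked by probing their TCP ports. A minimal sketch (a plain socket probe only; it assumes nothing about each service's actual HTTP health endpoints, and `check_port` is an illustrative helper, not project code):

```python
import socket

def check_port(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the local service ports listed in the table
for name, port in [("Backend API", 8006), ("LatentSync", 8007), ("MuseTalk", 8011)]:
    print(name, "up" if check_port("127.0.0.1", port) else "down")
```

A real deployment check would hit each service's health route instead, but a port probe is enough to catch a crashed process.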
@@ -25,10 +25,10 @@ LATENTSYNC_USE_SERVER=true
# LATENTSYNC_API_URL=http://localhost:8007

# Inference steps (20-50; higher = better quality, slower)
LATENTSYNC_INFERENCE_STEPS=40
LATENTSYNC_INFERENCE_STEPS=30

# Guidance scale (1.0-3.0; higher = tighter lip sync, but may jitter)
LATENTSYNC_GUIDANCE_SCALE=2.0
LATENTSYNC_GUIDANCE_SCALE=1.9

# Enable DeepCache acceleration (recommended)
LATENTSYNC_ENABLE_DEEPCACHE=true
@@ -36,6 +36,53 @@ LATENTSYNC_ENABLE_DEEPCACHE=true
# Random seed (-1 = random)
LATENTSYNC_SEED=1247

# =============== MuseTalk config ===============
# GPU selection (defaults to GPU0, shared with CosyVoice)
MUSETALK_GPU_ID=0

# Resident service address (port 8011)
MUSETALK_API_URL=http://localhost:8011

# Inference batch size
MUSETALK_BATCH_SIZE=32

# Model version
MUSETALK_VERSION=v15

# Half-precision acceleration
MUSETALK_USE_FLOAT16=true

# Face-detection interval (frames; smaller = more stable quality, but slower)
MUSETALK_DETECT_EVERY=2

# BiSeNet mask cache refresh interval (frames; smaller = more stable, but slower)
MUSETALK_BLEND_CACHE_EVERY=2

# Whisper temporal context (larger = smoother, but mouth response gets duller)
MUSETALK_AUDIO_PADDING_LEFT=2
MUSETALK_AUDIO_PADDING_RIGHT=2

# v1.5 jaw-region expansion in pixels (larger = lower lip/teeth more visible, but edges less stable)
MUSETALK_EXTRA_MARGIN=14

# Audio-to-mouth alignment offset (frames; positive = mouth later, negative = earlier)
MUSETALK_DELAY_FRAME=0

# Blend mode: auto (by version) / jaw / raw
MUSETALK_BLEND_MODE=jaw

# FaceParsing cheek widths (v1.5 only; affects the blend-mask extent)
MUSETALK_FACEPARSING_LEFT_CHEEK_WIDTH=90
MUSETALK_FACEPARSING_RIGHT_CHEEK_WIDTH=90

# Final encode quality (lower CRF = sharper, but larger files)
MUSETALK_ENCODE_CRF=14
MUSETALK_ENCODE_PRESET=slow

# =============== Hybrid lip-sync routing ===============
# Audio durations >= this threshold (seconds) use MuseTalk; shorter ones use LatentSync
LIPSYNC_DURATION_THRESHOLD=100

# =============== Upload config ===============
# Max upload file size (MB)
MAX_UPLOAD_SIZE_MB=500
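The threshold above drives the hybrid routing between the two lip-sync backends. A minimal sketch of that decision (the function name and return values are illustrative, not the repo's actual API):

```python
def pick_lipsync_backend(audio_duration_s: float, threshold_s: float = 100.0) -> str:
    """Route by voiceover duration: clips at or above the threshold go to MuseTalk
    (single-step, near real-time), shorter ones to LatentSync (diffusion, higher quality)."""
    return "musetalk" if audio_duration_s >= threshold_s else "latentsync"

print(pick_lipsync_backend(45.0))   # short clip → latentsync
print(pick_lipsync_backend(180.0))  # long clip → musetalk
```

Note the comparison is `>=`, so a clip exactly at the threshold is treated as "long".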
@@ -70,11 +117,9 @@ GLM_MODEL=glm-4.7-flash
# Make sure the storage volume mapping is correct; avoid hard-coded paths
SUPABASE_STORAGE_LOCAL_PATH=/home/rongye/ProgramFiles/Supabase/volumes/storage/stub/stub

# =============== Douyin video download cookie ===============

# =============== Alipay config ===============
ALIPAY_APP_ID=2021006132600283
ALIPAY_PRIVATE_KEY_PATH=/home/rongye/ProgramFiles/ViGent2/backend/keys/app_private_key.pem
ALIPAY_PUBLIC_KEY_PATH=/home/rongye/ProgramFiles/ViGent2/backend/keys/alipay_public_key.pem
ALIPAY_NOTIFY_URL=https://vigent.hbyrkj.top/api/payment/notify
ALIPAY_RETURN_URL=https://vigent.hbyrkj.top/pay
ALIPAY_NOTIFY_URL=https://ipagent.ai-labz.cn/api/payment/notify
ALIPAY_RETURN_URL=https://ipagent.ai-labz.cn/pay

@@ -37,12 +37,22 @@ class Settings(BaseSettings):
    DOUYIN_BROWSER_CHANNEL: str = ""
    DOUYIN_FORCE_SWIFTSHADER: bool = True

    # Douyin debug recording
    DOUYIN_DEBUG_ARTIFACTS: bool = False
    DOUYIN_RECORD_VIDEO: bool = False
    DOUYIN_KEEP_SUCCESS_VIDEO: bool = False
    DOUYIN_RECORD_VIDEO_WIDTH: int = 1280
    DOUYIN_RECORD_VIDEO_HEIGHT: int = 720
    # Douyin debug recording
    DOUYIN_DEBUG_ARTIFACTS: bool = False
    DOUYIN_RECORD_VIDEO: bool = False
    DOUYIN_KEEP_SUCCESS_VIDEO: bool = False
    DOUYIN_RECORD_VIDEO_WIDTH: int = 1280
    DOUYIN_RECORD_VIDEO_HEIGHT: int = 720

    # Xiaohongshu Playwright configuration
    XIAOHONGSHU_HEADLESS_MODE: str = "headless-new"
    XIAOHONGSHU_USER_AGENT: str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.0.0 Safari/537.36"
    XIAOHONGSHU_LOCALE: str = "zh-CN"
    XIAOHONGSHU_TIMEZONE_ID: str = "Asia/Shanghai"
    XIAOHONGSHU_CHROME_PATH: str = "/usr/bin/google-chrome"
    XIAOHONGSHU_BROWSER_CHANNEL: str = ""
    XIAOHONGSHU_FORCE_SWIFTSHADER: bool = True
    XIAOHONGSHU_DEBUG_ARTIFACTS: bool = False

    # TTS configuration
    DEFAULT_TTS_VOICE: str = "zh-CN-YunxiNeural"
@@ -57,7 +67,17 @@ class Settings(BaseSettings):
    LATENTSYNC_ENABLE_DEEPCACHE: bool = True  # enable DeepCache acceleration
    LATENTSYNC_SEED: int = 1247  # random seed (-1 = random)
    LATENTSYNC_USE_SERVER: bool = True  # use the persistent server for speed


    # MuseTalk configuration
    MUSETALK_GPU_ID: int = 0  # GPU ID (defaults to GPU0)
    MUSETALK_API_URL: str = "http://localhost:8011"  # resident service address
    MUSETALK_BATCH_SIZE: int = 8  # inference batch size
    MUSETALK_VERSION: str = "v15"  # model version
    MUSETALK_USE_FLOAT16: bool = True  # half-precision acceleration

    # Hybrid lip-sync routing
    LIPSYNC_DURATION_THRESHOLD: float = 120.0  # seconds; durations >= this use MuseTalk

    # Supabase configuration
    SUPABASE_URL: str = ""
    SUPABASE_PUBLIC_URL: str = ""  # public address used to build frontend-accessible URLs
@@ -93,6 +113,11 @@ class Settings(BaseSettings):
        """LatentSync directory path (computed dynamically)"""
        return self.BASE_DIR.parent.parent / "models" / "LatentSync"

    @property
    def MUSETALK_DIR(self) -> Path:
        """MuseTalk directory path (computed dynamically)"""
        return self.BASE_DIR.parent.parent / "models" / "MuseTalk"

    class Config:
        env_file = ".env"
        extra = "ignore"  # ignore unknown environment variables

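Note the mismatch above: the code default for `LIPSYNC_DURATION_THRESHOLD` is 120.0, while this repo's `.env` sets 100. With pydantic's `BaseSettings` and `env_file = ".env"`, environment/.env values override code defaults, so 100 wins at runtime. A dependency-free sketch of that precedence (`resolve_setting` is illustrative, not the project's code):

```python
import os

def resolve_setting(name: str, default: float) -> float:
    """Env/.env value wins over the code default, mirroring BaseSettings precedence."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

os.environ["LIPSYNC_DURATION_THRESHOLD"] = "100"  # as set in this repo's .env
print(resolve_setting("LIPSYNC_DURATION_THRESHOLD", 120.0))
```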
@@ -2,6 +2,8 @@
AI-related API routes
"""

from typing import Optional

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from loguru import logger
@@ -25,6 +27,12 @@ class GenerateMetaResponse(BaseModel):
    tags: list[str]


class RewriteRequest(BaseModel):
    """Rewrite request"""
    text: str
    custom_prompt: Optional[str] = None


class TranslateRequest(BaseModel):
    """Translation request"""
    text: str
@@ -73,3 +81,18 @@ async def generate_meta(req: GenerateMetaRequest):
    except Exception as e:
        logger.error(f"Generate meta failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@router.post("/rewrite")
async def rewrite_script(req: RewriteRequest):
    """AI-rewrite a script"""
    if not req.text or not req.text.strip():
        raise HTTPException(status_code=400, detail="文案不能为空")

    try:
        logger.info(f"Rewriting text: {req.text[:50]}...")
        rewritten = await glm_service.rewrite_script(req.text.strip(), req.custom_prompt)
        return success_response({"rewritten_text": rewritten})
    except Exception as e:
        logger.error(f"Rewrite failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@@ -10,6 +10,7 @@ class GenerateAudioRequest(BaseModel):
    ref_text: Optional[str] = None
    language: str = "zh-CN"
    speed: float = 1.0
    instruct_text: Optional[str] = None


class RenameAudioRequest(BaseModel):

@@ -81,6 +81,7 @@ async def generate_audio_task(task_id: str, req: GenerateAudioRequest, user_id:
            output_path=audio_path,
            language=_locale_to_tts_lang(req.language),
            speed=req.speed,
            instruct_text=req.instruct_text or "",
        )
    finally:
        if os.path.exists(ref_local):

@@ -1,14 +1,28 @@
from fastapi import APIRouter, HTTPException, Request, Depends
from fastapi.responses import FileResponse
from loguru import logger

from app.core.deps import get_current_user
from app.core.response import success_response
from app.modules.materials.schemas import RenameMaterialRequest
from app.modules.materials import service
from app.services.storage import storage_service

router = APIRouter()


@router.get("/stream/{material_id:path}")
async def stream_material(material_id: str, current_user: dict = Depends(get_current_user)):
    """Stream the material file directly (same origin, avoids CORS canvas taint)"""
    user_id = current_user["id"]
    if not material_id.startswith(f"{user_id}/"):
        raise HTTPException(403, "无权访问此素材")
    local_path = storage_service.get_local_file_path("materials", material_id)
    if not local_path:
        raise HTTPException(404, "素材文件不存在")
    return FileResponse(local_path, media_type="video/mp4")


@router.post("")
async def upload_material(
    request: Request,

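The ownership guard in `stream_material` hinges on material keys being namespaced as `{user_id}/...`. Isolated for illustration (the helper name is hypothetical; the route inlines this check):

```python
def can_access(material_id: str, user_id: str) -> bool:
    """Allow access only to keys under the caller's own namespace.
    The trailing slash matters: it stops prefix collisions such as
    user 'u1' matching material keys that start with 'u12/'."""
    return material_id.startswith(f"{user_id}/")

print(can_access("u1/clips/a.mp4", "u1"))
print(can_access("u12/clips/a.mp4", "u1"))
```

Without the slash, `startswith(user_id)` alone would let `u1` read `u12/...` keys.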
@@ -63,11 +63,15 @@ async def extract_script(file=None, url: Optional[str] = None, rewrite: bool = T
    # 2. Extract the script (Whisper)
    script = await whisper_service.transcribe(str(audio_path))

    # 3. AI rewrite (GLM)
    # 3. AI rewrite (GLM); degrade to the original text on failure
    rewritten = None
    if rewrite and script and len(script.strip()) > 0:
        logger.info("Rewriting script...")
        rewritten = await glm_service.rewrite_script(script, custom_prompt)
        try:
            rewritten = await glm_service.rewrite_script(script, custom_prompt)
        except Exception as e:
            logger.warning(f"GLM rewrite failed, returning original script: {e}")
            rewritten = None

    return {
        "original_script": script,

@@ -1,10 +1,17 @@
from fastapi import APIRouter, BackgroundTasks, Depends
import os
import tempfile
import uuid

from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException
from fastapi.responses import FileResponse
from loguru import logger
from starlette.background import BackgroundTask

from app.core.deps import get_current_user
from app.core.response import success_response
from app.services.tts_service import TTSService

from .schemas import GenerateRequest
from .schemas import GenerateRequest, VoicePreviewRequest
from .task_store import create_task, get_task, list_tasks
from .workflow import process_video_generation, get_lipsync_health, get_voiceclone_health
from .service import list_generated_videos, delete_generated_video
@@ -12,6 +19,59 @@ from .service import list_generated_videos, delete_generated_video

router = APIRouter()

PREVIEW_TEXTS = {
    "zh-CN": "你好,请选择你喜欢的音色吧。",
    "en-US": "Hello, please choose the voice you like.",
    "ja-JP": "こんにちは。お好きな音声を選んでください。",
    "ko-KR": "안녕하세요, 마음에 드는 음성을 선택해 주세요.",
    "fr-FR": "Bonjour, veuillez choisir la voix que vous preferez.",
    "de-DE": "Hallo, bitte waehlen Sie die Stimme, die Ihnen gefaellt.",
    "es-ES": "Hola, por favor elige la voz que mas te guste.",
    "ru-RU": "Zdravstvuite, pozhaluista, vyberite golos, kotoryi vam nravitsya.",
    "it-IT": "Ciao, scegli la voce che preferisci.",
    "pt-BR": "Ola, escolha a voz de que voce mais gosta.",
}


def _cleanup_temp_file(path: str) -> None:
    try:
        os.unlink(path)
    except Exception:
        pass


def _get_voice_locale(voice: str) -> str:
    parts = voice.split("-")
    if len(parts) >= 2:
        return f"{parts[0]}-{parts[1]}"
    return "zh-CN"


def _get_preview_text_for_voice(voice: str) -> str:
    locale = _get_voice_locale(voice)
    return PREVIEW_TEXTS.get(locale, PREVIEW_TEXTS["zh-CN"])


async def _render_voice_preview(voice: str, text: str) -> FileResponse:
    tmp_file = tempfile.NamedTemporaryFile(prefix="voice_preview_", suffix=".mp3", delete=False)
    output_path = tmp_file.name
    tmp_file.close()

    tts = TTSService()
    try:
        await tts.generate_audio(text=text, voice=voice, output_path=output_path)
    except Exception as e:
        _cleanup_temp_file(output_path)
        logger.error(f"音色试听生成失败: voice={voice}, error={e}")
        raise HTTPException(status_code=500, detail="音色试听生成失败,请稍后重试")

    return FileResponse(
        path=output_path,
        media_type="audio/mpeg",
        filename="voice_preview.mp3",
        background=BackgroundTask(_cleanup_temp_file, output_path),
    )


@router.post("/generate")
async def generate_video(
@@ -62,3 +122,38 @@ async def list_generated(current_user: dict = Depends(get_current_user)):
async def delete_generated(video_id: str, current_user: dict = Depends(get_current_user)):
    result = await delete_generated_video(current_user["id"], video_id)
    return success_response(result, message="视频已删除")


@router.post("/voice-preview")
async def preview_voice_post(
    req: VoicePreviewRequest,
    current_user: dict = Depends(get_current_user),
):
    # Reuse the shared auth dependency; the endpoint itself does not need user_id
    _ = current_user

    voice = req.voice.strip()
    text = req.text.strip()

    if not voice:
        raise HTTPException(status_code=400, detail="voice 不能为空")
    if not text:
        raise HTTPException(status_code=400, detail="text 不能为空")

    return await _render_voice_preview(voice=voice, text=text)


@router.get("/voice-preview")
async def preview_voice_get(
    voice: str,
    current_user: dict = Depends(get_current_user),
):
    # Reuse the shared auth dependency; the endpoint itself does not need user_id
    _ = current_user

    voice_value = voice.strip()
    if not voice_value:
        raise HTTPException(status_code=400, detail="voice 不能为空")

    text = _get_preview_text_for_voice(voice_value)
    return await _render_voice_preview(voice=voice_value, text=text)

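The GET preview route picks its sample sentence from the first two segments of the voice name, falling back to zh-CN for anything unrecognized. The lookup logic extracted on its own (a trimmed two-entry table for illustration):

```python
PREVIEW_TEXTS = {
    "zh-CN": "你好,请选择你喜欢的音色吧。",
    "en-US": "Hello, please choose the voice you like.",
}

def get_voice_locale(voice: str) -> str:
    """'zh-CN-YunxiNeural' → 'zh-CN'; anything unsplittable falls back to zh-CN."""
    parts = voice.split("-")
    if len(parts) >= 2:
        return f"{parts[0]}-{parts[1]}"
    return "zh-CN"

def get_preview_text(voice: str) -> str:
    # Unknown locales also land on the zh-CN default via dict.get
    return PREVIEW_TEXTS.get(get_voice_locale(voice), PREVIEW_TEXTS["zh-CN"])

print(get_preview_text("zh-CN-YunxiNeural"))
```

Two fallback layers exist on purpose: one for malformed voice names, one for locales with no sample text.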
@@ -1,4 +1,4 @@
from pydantic import BaseModel
from pydantic import BaseModel, Field
from typing import Optional, List, Literal


@@ -38,3 +38,9 @@ class GenerateRequest(BaseModel):
    bgm_volume: Optional[float] = 0.2
    custom_assignments: Optional[List[CustomAssignment]] = None
    output_aspect_ratio: Literal["9:16", "16:9"] = "9:16"
    lipsync_model: Literal["default", "fast", "advanced"] = "default"


class VoicePreviewRequest(BaseModel):
    voice: str
    text: str = Field(..., min_length=1, max_length=120)

@@ -54,7 +54,7 @@ class RedisTaskStore:
            "progress": 0,
            "user_id": user_id,
        }
        self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False))
        self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False), ex=86400)
        self._client.sadd(self._index_key, task_id)
        return task

@@ -71,12 +71,17 @@ class RedisTaskStore:
        keys = [self._key(task_id) for task_id in task_ids]
        raw_items = self._client.mget(keys)
        tasks = []
        for raw in raw_items:
            if raw:
                try:
                    tasks.append(json.loads(raw))
                except Exception:
                    continue
        expired = []
        for task_id, raw in zip(task_ids, raw_items):
            if raw is None:
                expired.append(task_id)
                continue
            try:
                tasks.append(json.loads(raw))
            except Exception:
                continue
        if expired:
            self._client.srem(self._index_key, *expired)
        return tasks

    def update(self, task_id: str, updates: Dict[str, Any]) -> Dict[str, Any]:
@@ -84,7 +89,8 @@ class RedisTaskStore:
        if task.get("status") == "not_found":
            task = {"status": "pending", "task_id": task_id}
        task.update(updates)
        self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False))
        ttl = 7200 if task.get("status") in ("completed", "failed") else 86400
        self._client.set(self._key(task_id), json.dumps(task, ensure_ascii=False), ex=ttl)
        self._client.sadd(self._index_key, task_id)
        return task

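The hunks above give every task key a Redis TTL (`ex=...`) and lazily prune index entries whose keys have expired. The TTL policy itself is small enough to state directly (the helper name is illustrative; the diff inlines the expression):

```python
def task_ttl_seconds(status: str) -> int:
    """Finished tasks expire after 2 h; pending/running ones after 24 h,
    so abandoned tasks cannot accumulate in Redis forever."""
    return 7200 if status in ("completed", "failed") else 86400

for s in ("pending", "running", "completed", "failed"):
    print(s, task_ttl_seconds(s))
```

Pairing a key TTL with `SREM` of dead index members is what keeps the `_index_key` set from growing without bound.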
@@ -1,5 +1,6 @@
from typing import Optional, Any, List
from pathlib import Path
import asyncio
import time
import traceback
import httpx
@@ -23,6 +24,9 @@ from app.services.remotion_service import remotion_service
from .schemas import GenerateRequest
from .task_store import task_store

# Global concurrency cap: at most 2 video-generation tasks run at once
_generation_semaphore = asyncio.Semaphore(2)


def _locale_to_whisper_lang(locale: str) -> str:
    """'en-US' → 'en', 'zh-CN' → 'zh'"""
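`asyncio.Semaphore(2)` caps how many generation pipelines run concurrently; excess tasks simply queue at `async with` until a slot frees up. A self-contained demonstration that the cap holds:

```python
import asyncio

async def main() -> int:
    sem = asyncio.Semaphore(2)  # same cap as the workflow module
    running = 0
    peak = 0

    async def job():
        nonlocal running, peak
        async with sem:  # queued jobs wait here, like _process_video_generation_inner
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)
            running -= 1

    await asyncio.gather(*(job() for _ in range(6)))
    return peak  # highest number of jobs observed running simultaneously

print(asyncio.run(main()))  # → 2
```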
@@ -90,6 +94,12 @@ def _update_task(task_id: str, **updates: Any) -> None:
    task_store.update(task_id, updates)


async def _run_blocking(func, *args):
    """Run a blocking function in the thread pool so it does not stall the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)


# ── Multi-material helpers ──

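`_run_blocking` is a thin wrapper over `loop.run_in_executor`: the FFmpeg helpers keep their synchronous signatures while the event loop stays responsive. A runnable miniature:

```python
import asyncio
import time

async def run_blocking(func, *args):
    """Same shape as the workflow's _run_blocking: default thread pool, positional args only."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)

def slow_probe(x: int) -> int:
    time.sleep(0.01)  # stands in for a blocking ffprobe/ffmpeg call
    return x * x

print(asyncio.run(run_blocking(slow_probe, 7)))  # → 49
```

One consequence of the `*args`-only signature: keyword arguments cannot pass through (`run_in_executor` would need `functools.partial`), which plausibly explains why the refactored `prepare_segment` calls in this diff switch from keyword to positional arguments.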
@@ -168,6 +178,12 @@ def _split_equal(segments: List[dict], material_paths: List[str]) -> List[dict]:


async def process_video_generation(task_id: str, req: GenerateRequest, user_id: str):
    _update_task(task_id, message="排队中...")
    async with _generation_semaphore:
        await _process_video_generation_inner(task_id, req, user_id)


async def _process_video_generation_inner(task_id: str, req: GenerateRequest, user_id: str):
    temp_files = []
    try:
        start_time = time.time()
@@ -204,7 +220,8 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:

        # Normalize rotation metadata (e.g. iPhone MOV 1920x1080 with rotation=-90)
        normalized_input_path = temp_dir / f"{task_id}_input_norm.mp4"
        normalized_result = video.normalize_orientation(
        normalized_result = await _run_blocking(
            video.normalize_orientation,
            str(input_material_path),
            str(normalized_input_path),
        )
@@ -282,6 +299,42 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:

        captions_path = None

        async def _whisper_and_split():
            """Whisper alignment → _split_equal across materials (shared logic)"""
            _update_task(task_id, message="正在生成字幕 (Whisper)...")
            _captions_path = temp_dir / f"{task_id}_captions.json"
            temp_files.append(_captions_path)
            captions_data = None
            try:
                captions_data = await whisper_service.align(
                    audio_path=str(audio_path),
                    text=req.text,
                    output_path=str(_captions_path),
                    language=_locale_to_whisper_lang(req.language),
                    original_text=req.text,
                )
                print(f"[Pipeline] Whisper alignment completed (multi-material)")
            except Exception as e:
                logger.warning(f"Whisper alignment failed: {e}")
                _captions_path = None

            _update_task(task_id, progress=15, message="正在分配素材...")

            if captions_data and captions_data.get("segments"):
                result = _split_equal(captions_data["segments"], material_paths)
            else:
                logger.warning("[MultiMat] Whisper 无数据,按时长均分")
                audio_dur = await _run_blocking(video._get_duration, str(audio_path))
                if audio_dur <= 0:
                    audio_dur = 30.0
                seg_dur = audio_dur / len(material_paths)
                result = [
                    {"material_path": material_paths[i], "start": i * seg_dur,
                     "end": (i + 1) * seg_dur, "index": i}
                    for i in range(len(material_paths))
                ]
            return result, _captions_path

        if is_multi:
            # ══════════════════════════════════════
            # Multi-material pipeline
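When Whisper yields no segments, `_whisper_and_split` falls back to dividing the audio timeline evenly across the materials. That fallback in isolation (illustrative function name; 30 s is the same safety default as in the diff):

```python
def split_equal_by_duration(material_paths: list[str], audio_dur: float) -> list[dict]:
    """Evenly divide the audio timeline across materials; 30 s is the safety fallback
    used when the audio duration cannot be probed."""
    if audio_dur <= 0:
        audio_dur = 30.0
    seg = audio_dur / len(material_paths)
    return [
        {"material_path": p, "start": i * seg, "end": (i + 1) * seg, "index": i}
        for i, p in enumerate(material_paths)
    ]

print(split_equal_by_duration(["a.mp4", "b.mp4", "c.mp4"], 9.0))
```

Segments tile the timeline exactly: each segment's `end` equals the next one's `start`, which is what the later "extend first/last segment" step relies on.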
@@ -326,83 +379,13 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
                    f" 与素材数量({len(material_paths)})不一致,回退自动分配"
                )

            # Existing logic: Whisper → _split_equal
            _update_task(task_id, message="正在生成字幕 (Whisper)...")

            captions_path = temp_dir / f"{task_id}_captions.json"
            temp_files.append(captions_path)

            try:
                captions_data = await whisper_service.align(
                    audio_path=str(audio_path),
                    text=req.text,
                    output_path=str(captions_path),
                    language=_locale_to_whisper_lang(req.language),
                    original_text=req.text,
                )
                print(f"[Pipeline] Whisper alignment completed (multi-material)")
            except Exception as e:
                logger.warning(f"Whisper alignment failed: {e}")
                captions_data = None
                captions_path = None

            _update_task(task_id, progress=15, message="正在分配素材...")

            if captions_data and captions_data.get("segments"):
                assignments = _split_equal(captions_data["segments"], material_paths)
            else:
                # Whisper failed → split evenly by duration (no character alignment needed)
                logger.warning("[MultiMat] Whisper 无数据,按时长均分")
                audio_dur = video._get_duration(str(audio_path))
                if audio_dur <= 0:
                    audio_dur = 30.0  # safety fallback
                seg_dur = audio_dur / len(material_paths)
                assignments = [
                    {"material_path": material_paths[i], "start": i * seg_dur,
                     "end": (i + 1) * seg_dur, "index": i}
                    for i in range(len(material_paths))
                ]
            assignments, captions_path = await _whisper_and_split()

        else:
            # Existing logic: Whisper → _split_equal
            _update_task(task_id, message="正在生成字幕 (Whisper)...")

            captions_path = temp_dir / f"{task_id}_captions.json"
            temp_files.append(captions_path)

            try:
                captions_data = await whisper_service.align(
                    audio_path=str(audio_path),
                    text=req.text,
                    output_path=str(captions_path),
                    language=_locale_to_whisper_lang(req.language),
                    original_text=req.text,
                )
                print(f"[Pipeline] Whisper alignment completed (multi-material)")
            except Exception as e:
                logger.warning(f"Whisper alignment failed: {e}")
                captions_data = None
                captions_path = None

            _update_task(task_id, progress=15, message="正在分配素材...")

            if captions_data and captions_data.get("segments"):
                assignments = _split_equal(captions_data["segments"], material_paths)
            else:
                # Whisper failed → split evenly by duration (no character alignment needed)
                logger.warning("[MultiMat] Whisper 无数据,按时长均分")
                audio_dur = video._get_duration(str(audio_path))
                if audio_dur <= 0:
                    audio_dur = 30.0  # safety fallback
                seg_dur = audio_dur / len(material_paths)
                assignments = [
                    {"material_path": material_paths[i], "start": i * seg_dur,
                     "end": (i + 1) * seg_dur, "index": i}
                    for i in range(len(material_paths))
                ]
            assignments, captions_path = await _whisper_and_split()

        # Extend segments to cover the full audio: first starts at 0, last ends at the audio's end
        audio_duration = video._get_duration(str(audio_path))
        audio_duration = await _run_blocking(video._get_duration, str(audio_path))
        if assignments and audio_duration > 0:
            assignments[0]["start"] = 0.0
            assignments[-1]["end"] = audio_duration
@@ -415,18 +398,19 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:

        lipsync_start = time.time()

        # ── Step 1: download all materials and detect resolutions ──
        # ── Step 1: download all materials in parallel and detect resolutions ──
        material_locals: List[Path] = []
        resolutions = []

        for i, assignment in enumerate(assignments):
        async def _download_and_normalize(i: int, assignment: dict):
            """Download one material and normalize its orientation"""
            material_local = temp_dir / f"{task_id}_material_{i}.mp4"
            temp_files.append(material_local)
            await _download_material(assignment["material_path"], material_local)

            # Normalize rotation metadata so resolution checks match later inference
            normalized_material = temp_dir / f"{task_id}_material_{i}_norm.mp4"
            normalized_result = video.normalize_orientation(
            normalized_result = await _run_blocking(
                video.normalize_orientation,
                str(material_local),
                str(normalized_material),
            )
@@ -434,8 +418,17 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
            temp_files.append(normalized_material)
            material_local = normalized_material

            material_locals.append(material_local)
            resolutions.append(video.get_resolution(str(material_local)))
            res = video.get_resolution(str(material_local))
            return material_local, res

        download_tasks = [
            _download_and_normalize(i, assignment)
            for i, assignment in enumerate(assignments)
        ]
        download_results = await asyncio.gather(*download_tasks)
        for local, res in download_results:
            material_locals.append(local)
            resolutions.append(res)

        # Unify resolution to the user-selected aspect ratio
        base_res = target_resolution
@@ -443,38 +436,54 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
        if need_scale:
            logger.info(f"[MultiMat] 素材分辨率不一致,统一到 {base_res[0]}x{base_res[1]}")

        # ── Step 2: trim each material to its segment duration ──
        prepared_segments: List[Path] = []
        # ── Step 2: trim every segment to its duration in parallel ──
        prepared_segments: List[Optional[Path]] = [None] * num_segments

        for i, assignment in enumerate(assignments):
            seg_progress = 15 + int((i / num_segments) * 30)  # 15% → 45%
        async def _prepare_one_segment(i: int, assignment: dict):
            """Trim/loop one material to its segment duration"""
            seg_dur = assignment["end"] - assignment["start"]
            _update_task(
                task_id,
                progress=seg_progress,
                message=f"正在准备素材 {i+1}/{num_segments}..."
            )

            prepared_path = temp_dir / f"{task_id}_prepared_{i}.mp4"
            temp_files.append(prepared_path)
            video.prepare_segment(
                str(material_locals[i]), seg_dur, str(prepared_path),
                # Re-encode to one resolution/codec before concat, otherwise concat keeps only the first segment
                target_resolution=base_res,
                source_start=assignment.get("source_start", 0.0),
                source_end=assignment.get("source_end"),
                target_fps=25,
            prepare_target_res = None if resolutions[i] == base_res else base_res

            await _run_blocking(
                video.prepare_segment,
                str(material_locals[i]),
                seg_dur,
                str(prepared_path),
                prepare_target_res,
                assignment.get("source_start", 0.0),
                assignment.get("source_end"),
                25,
            )
            prepared_segments.append(prepared_path)
            return i, prepared_path

        _update_task(
            task_id,
            progress=15,
            message=f"正在并行准备 {num_segments} 个素材片段..."
        )

        prepare_tasks = [
            _prepare_one_segment(i, assignment)
            for i, assignment in enumerate(assignments)
        ]
        prepare_results = await asyncio.gather(*prepare_tasks)
        for i, path in prepare_results:
            prepared_segments[i] = path

        # ── Step 2: concatenate all segments ──
        _update_task(task_id, progress=50, message="正在拼接素材片段...")
        concat_path = temp_dir / f"{task_id}_concat.mp4"
        temp_files.append(concat_path)
        video.concat_videos(
            [str(p) for p in prepared_segments],
        prepared_segment_paths = [str(p) for p in prepared_segments if p is not None]
        if len(prepared_segment_paths) != num_segments:
            raise RuntimeError("Multi-material: prepared segments mismatch")
        await _run_blocking(
            video.concat_videos,
            prepared_segment_paths,
            str(concat_path),
            target_fps=25,
            25,
        )

        # ── Step 3: single LatentSync inference ──
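The parallel download/prepare refactor above relies on `asyncio.gather` returning results in submission order, plus each coroutine returning its own index `i`, so segments land back in timeline order regardless of which finishes first. A compact check of that property:

```python
import asyncio

async def prepare(i: int) -> tuple[int, str]:
    await asyncio.sleep(0.01 * (3 - i))  # later segments deliberately finish first
    return i, f"segment_{i}.mp4"

async def main() -> list[tuple[int, str]]:
    # gather preserves the order of the awaitables passed in, not completion order
    return await asyncio.gather(*(prepare(i) for i in range(3)))

print(asyncio.run(main()))
```

Returning `(i, path)` is belt-and-braces: even if results were consumed out of order, the explicit index still places each prepared segment correctly in `prepared_segments`.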
@@ -484,7 +493,12 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
        _update_task(task_id, progress=55, message="正在合成唇形 (LatentSync)...")
        print(f"[LipSync] Multi-material: single LatentSync on concatenated video")
        try:
            await lipsync.generate(str(concat_path), str(audio_path), str(lipsync_video_path))
            await lipsync.generate(
                str(concat_path),
                str(audio_path),
                str(lipsync_video_path),
                model_mode=req.lipsync_model,
            )
        except Exception as e:
            logger.warning(f"[LipSync] Failed, fallback to concat without lipsync: {e}")
            import shutil
@@ -518,18 +532,22 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
            single_source_end = req.custom_assignments[0].source_end

        _update_task(task_id, progress=20, message="正在准备素材片段...")
        audio_dur = video._get_duration(str(audio_path))
        audio_dur = await _run_blocking(video._get_duration, str(audio_path))
        if audio_dur <= 0:
            audio_dur = 30.0
        single_res = await _run_blocking(video.get_resolution, str(input_material_path))
        single_target_res = None if single_res == target_resolution else target_resolution
        prepared_single_path = temp_dir / f"{task_id}_prepared_single.mp4"
        temp_files.append(prepared_single_path)
        video.prepare_segment(
        await _run_blocking(
            video.prepare_segment,
            str(input_material_path),
            audio_dur,
            str(prepared_single_path),
            target_resolution=target_resolution,
            source_start=single_source_start,
            source_end=single_source_end,
            single_target_res,
            single_source_start,
            single_source_end,
            None,
        )
        input_material_path = prepared_single_path

@@ -542,7 +560,18 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
|
||||
if is_ready:
|
||||
print(f"[LipSync] Starting LatentSync inference...")
|
||||
_update_task(task_id, progress=35, message="正在运行 LatentSync 推理...")
|
||||
await lipsync.generate(str(input_material_path), str(audio_path), str(lipsync_video_path))
|
||||
try:
|
||||
await lipsync.generate(
|
||||
str(input_material_path),
|
||||
str(audio_path),
|
||||
str(lipsync_video_path),
|
||||
model_mode=req.lipsync_model,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(f"[LipSync] Failed on single-material, fallback to prepared video: {e}")
|
||||
_update_task(task_id, message="唇形同步失败,使用原始视频...")
|
||||
import shutil
|
||||
shutil.copy(str(input_material_path), str(lipsync_video_path))
|
||||
else:
|
||||
print(f"[LipSync] LatentSync not ready, copying original video")
|
||||
_update_task(task_id, message="唇形同步不可用,使用原始视频...")
|
||||
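The hunks above move blocking ffmpeg helpers (`video._get_duration`, `video.prepare_segment`) behind an `await _run_blocking(...)` wrapper so they no longer stall the event loop. `_run_blocking` itself is not shown in this diff; a minimal sketch of such a helper (an assumption, built on `asyncio.to_thread`) could look like:

```python
import asyncio


async def _run_blocking(func, *args, **kwargs):
    # Run a blocking callable (e.g. an ffmpeg/ffprobe wrapper) in a
    # worker thread so the event loop stays responsive.
    return await asyncio.to_thread(func, *args, **kwargs)


def _get_duration(path: str) -> float:
    # Stand-in for the real ffprobe-based helper; returns a fixed value.
    return 30.0


async def main() -> float:
    return await _run_blocking(_get_duration, "clip.mp4")
```

With this shape, `audio_dur = await _run_blocking(video._get_duration, str(audio_path))` keeps the exact call sites shown in the hunk.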
@@ -553,51 +582,88 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
print(f"[Pipeline] LipSync completed in {lipsync_time:.1f}s")
_update_task(task_id, progress=80)

# Single-material mode: Whisper runs after LatentSync
if req.enable_subtitles:
# Single-material mode: Whisper is deferred below to run in parallel with BGM
if not req.enable_subtitles:
captions_path = None

_update_task(task_id, progress=85)

# ── Whisper subtitles + BGM mixing in parallel (both depend only on audio_path) ──
final_audio_path = audio_path
_whisper_task = None
_bgm_task = None
mix_output_path: Optional[Path] = None

# In single-material mode Whisper has not run yet; start it here in parallel with BGM
need_whisper = not is_multi and req.enable_subtitles and captions_path is None
if need_whisper:
captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)
_captions_path_str = str(captions_path)

async def _run_whisper():
_update_task(task_id, message="正在生成字幕 (Whisper)...", progress=82)

captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)

try:
await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(captions_path),
output_path=_captions_path_str,
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed")
return True
except Exception as e:
logger.warning(f"Whisper alignment failed, skipping subtitles: {e}")
captions_path = None
return False

_update_task(task_id, progress=85)
_whisper_task = _run_whisper()

final_audio_path = audio_path
if req.bgm_id:
_update_task(task_id, message="正在合成背景音乐...", progress=86)

bgm_path = resolve_bgm_path(req.bgm_id)
if bgm_path:
mix_output_path = temp_dir / f"{task_id}_audio_mix.wav"
temp_files.append(mix_output_path)
volume = req.bgm_volume if req.bgm_volume is not None else 0.2
volume = max(0.0, min(float(volume), 1.0))
try:
video.mix_audio(
voice_path=str(audio_path),
bgm_path=str(bgm_path),
output_path=str(mix_output_path),
bgm_volume=volume
)
final_audio_path = mix_output_path
except Exception as e:
logger.warning(f"BGM mix failed, fallback to voice only: {e}")
_mix_output = str(mix_output_path)
_bgm_path = str(bgm_path)
_voice_path = str(audio_path)
_volume = volume

async def _run_bgm():
_update_task(task_id, message="正在合成背景音乐...", progress=86)
try:
await _run_blocking(
video.mix_audio,
_voice_path,
_bgm_path,
_mix_output,
_volume,
)
return True
except Exception as e:
logger.warning(f"BGM mix failed, fallback to voice only: {e}")
return False

_bgm_task = _run_bgm()
else:
logger.warning(f"BGM not found: {req.bgm_id}")

# Await Whisper + BGM in parallel
parallel_tasks = [t for t in (_whisper_task, _bgm_task) if t is not None]
if parallel_tasks:
results = await asyncio.gather(*parallel_tasks)
result_idx = 0
if _whisper_task is not None:
if not results[result_idx]:
captions_path = None
result_idx += 1
if _bgm_task is not None:
if results[result_idx] and mix_output_path is not None:
final_audio_path = mix_output_path


use_remotion = (captions_path and captions_path.exists()) or req.title or req.secondary_title

subtitle_style = None
@@ -657,16 +723,28 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
f"{task_id}_secondary_title_font"
)

# Clean up temporary font files
for prefix in [f"{task_id}_subtitle_font", f"{task_id}_title_font", f"{task_id}_secondary_title_font"]:
for ext in [".ttf", ".otf", ".woff", ".woff2"]:
font_tmp = temp_dir / f"{prefix}{ext}"
if font_tmp.exists():
temp_files.append(font_tmp)

final_output_local_path = temp_dir / f"{task_id}_output.mp4"
temp_files.append(final_output_local_path)
needs_audio_compose = str(final_audio_path) != str(audio_path)

if use_remotion:
_update_task(task_id, message="正在合成视频 (Remotion)...", progress=87)
remotion_input_path = lipsync_video_path

composed_video_path = temp_dir / f"{task_id}_composed.mp4"
temp_files.append(composed_video_path)

await video.compose(str(lipsync_video_path), str(final_audio_path), str(composed_video_path))
if needs_audio_compose:
composed_video_path = temp_dir / f"{task_id}_composed.mp4"
temp_files.append(composed_video_path)
await video.compose(str(lipsync_video_path), str(final_audio_path), str(composed_video_path))
remotion_input_path = composed_video_path
else:
logger.info("[Pipeline] Audio unchanged, skip pre-Remotion compose")

remotion_health = await remotion_service.check_health()
if remotion_health.get("ready"):
@@ -683,7 +761,7 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
title_duration = max(0.5, min(float(req.title_duration or 4.0), 30.0))

await remotion_service.render(
video_path=str(composed_video_path),
video_path=str(remotion_input_path),
output_path=str(final_output_local_path),
captions_path=str(captions_path) if captions_path else None,
title=req.title,
@@ -701,15 +779,18 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
except Exception as e:
logger.warning(f"Remotion render failed, using FFmpeg fallback: {e}")
import shutil
shutil.copy(str(composed_video_path), final_output_local_path)
shutil.copy(str(remotion_input_path), str(final_output_local_path))
else:
logger.warning(f"Remotion not ready: {remotion_health.get('error')}, using FFmpeg")
import shutil
shutil.copy(str(composed_video_path), final_output_local_path)
shutil.copy(str(remotion_input_path), str(final_output_local_path))
else:
_update_task(task_id, message="正在合成最终视频...", progress=90)

await video.compose(str(lipsync_video_path), str(final_audio_path), str(final_output_local_path))
if needs_audio_compose:
await video.compose(str(lipsync_video_path), str(final_audio_path), str(final_output_local_path))
else:
import shutil
shutil.copy(str(lipsync_video_path), str(final_output_local_path))

total_time = time.time() - start_time

@@ -1,7 +1,7 @@
"""
Lip-sync service
Runs inference by calling the LatentSync conda environment via subprocess
Configured to use GPU1 (CUDA:1)
Hybrid scheme: short videos use LatentSync (high quality), long videos use MuseTalk (high speed)
Routing threshold: LIPSYNC_DURATION_THRESHOLD (default 120s)
"""
import os
import shutil
@@ -11,21 +11,24 @@ import asyncio
import httpx
from pathlib import Path
from loguru import logger
from typing import Optional
from typing import Optional, Literal

from app.core.config import settings


class LipSyncService:
"""Lip-sync service - LatentSync 1.6 integration (subprocess mode)"""

class LipSyncService:
"""Lip-sync service - LatentSync 1.6 + MuseTalk 1.5 hybrid scheme"""

def __init__(self):
self.use_local = settings.LATENTSYNC_LOCAL
self.api_url = settings.LATENTSYNC_API_URL
self.latentsync_dir = settings.LATENTSYNC_DIR
self.gpu_id = settings.LATENTSYNC_GPU_ID
self.use_server = settings.LATENTSYNC_USE_SERVER


# MuseTalk configuration
self.musetalk_api_url = settings.MUSETALK_API_URL

# GPU concurrency lock (serial queue)
self._lock = asyncio.Lock()

@@ -103,7 +106,7 @@ class LipSyncService:
"-t", str(target_duration), # trim to the target duration
"-c:v", "libx264",
"-preset", "fast",
"-crf", "18",
"-crf", "23",
"-an", # strip the original audio
output_path
]
@@ -118,139 +121,43 @@ class LipSyncService:
logger.warning(f"⚠️ 视频循环异常: {e}")
return video_path

def _preprocess_video(self, video_path: str, output_path: str, target_height: int = 720) -> str:
"""
Video preprocessing: compress the video to speed up subsequent steps
- Cap the height at target_height (default 720p)
- Preserve the aspect ratio
- Use a fast encoding preset

Returns: path of the preprocessed video
"""
import subprocess
import json

# Probe video info (JSON output is more reliable)
probe_cmd = [
"ffprobe", "-v", "error",
"-select_streams", "v:0",
"-show_entries", "stream=height,width",
"-of", "json",
video_path
]
try:
result = subprocess.run(probe_cmd, capture_output=True, text=True, timeout=10)
if result.returncode != 0:
logger.warning(f"⚠️ ffprobe 失败: {result.stderr[:100]}")
return video_path

probe_data = json.loads(result.stdout)
streams = probe_data.get("streams", [])
if not streams:
logger.warning("⚠️ 无法获取视频流信息,跳过预处理")
return video_path

current_height = streams[0].get("height", 0)
current_width = streams[0].get("width", 0)

if current_height == 0:
logger.warning("⚠️ 视频高度为 0,跳过预处理")
return video_path

logger.info(f"📹 原始视频分辨率: {current_width}×{current_height}")

except json.JSONDecodeError as e:
logger.warning(f"⚠️ ffprobe 输出解析失败: {e}")
return video_path
except subprocess.TimeoutExpired:
logger.warning("⚠️ ffprobe 超时,跳过预处理")
return video_path
except Exception as e:
logger.warning(f"⚠️ 获取视频信息失败: {e}")
return video_path

# Skip compression if the video is already small enough
if current_height <= target_height:
logger.info(f"📹 视频高度 {current_height}p <= {target_height}p,无需压缩")
return video_path

logger.info(f"📹 预处理视频: {current_height}p → {target_height}p")

# Compress with FFmpeg
compress_cmd = [
"ffmpeg", "-y",
"-i", video_path,
"-vf", f"scale=-2:{target_height}", # keep aspect ratio, set height to target_height
"-c:v", "libx264",
"-preset", "ultrafast", # fastest encoding preset
"-crf", "23", # quality factor
"-c:a", "copy", # copy audio as-is
output_path
]

try:
result = subprocess.run(
compress_cmd,
capture_output=True,
text=True,
timeout=120 # timeout raised to 2 minutes
)
if result.returncode == 0 and Path(output_path).exists():
original_size = Path(video_path).stat().st_size / 1024 / 1024
new_size = Path(output_path).stat().st_size / 1024 / 1024
logger.info(f"✅ 视频压缩完成: {original_size:.1f}MB → {new_size:.1f}MB")
return output_path
else:
logger.warning(f"⚠️ 视频压缩失败: {result.stderr[:200]}")
return video_path
except subprocess.TimeoutExpired:
logger.warning("⚠️ 视频压缩超时,使用原始视频")
return video_path
except Exception as e:
logger.warning(f"⚠️ 视频压缩异常: {e}")
return video_path
async def generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int = 25,
model_mode: Literal["default", "fast", "advanced"] = "default",
) -> str:
"""Generate a lip-synced video"""
logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)

normalized_mode: Literal["default", "fast", "advanced"] = model_mode
if normalized_mode not in ("default", "fast", "advanced"):
normalized_mode = "default"
logger.info(f"🧠 Lipsync 模式: {normalized_mode}")

if self.use_local:
return await self._local_generate(video_path, audio_path, output_path, fps, normalized_mode)
else:
return await self._remote_generate(video_path, audio_path, output_path, fps, normalized_mode)

async def generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int = 25
) -> str:
"""Generate a lip-synced video"""
logger.info(f"🎬 唇形同步任务: {Path(video_path).name} + {Path(audio_path).name}")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)

if self.use_local:
return await self._local_generate(video_path, audio_path, output_path, fps)
else:
return await self._remote_generate(video_path, audio_path, output_path, fps)

async def _local_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int
) -> str:
"""Invoke the LatentSync conda environment via subprocess"""

# Check preconditions
if not self._check_conda_env():
logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path

if not self._check_weights():
logger.warning("⚠️ 模型权重不存在,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path

logger.info("⏳ 等待 GPU 资源 (排队中)...")
async with self._lock:
# Use a temporary directory for intermediate files
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
async def _local_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int,
model_mode: Literal["default", "fast", "advanced"],
) -> str:
"""Invoke the LatentSync conda environment via subprocess"""

logger.info("⏳ 等待 GPU 资源 (排队中)...")
async with self._lock:
# Use a temporary directory for intermediate files
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)

# Get audio and video durations
audio_duration = self._get_media_duration(audio_path)
@@ -265,12 +172,53 @@ class LipSyncService:
str(looped_video),
audio_duration
)
else:
actual_video_path = video_path

if self.use_server:
# Mode A: call the resident server (accelerated mode)
return await self._call_persistent_server(actual_video_path, audio_path, output_path)
else:
actual_video_path = video_path

# Model routing
force_musetalk = model_mode == "fast"
force_latentsync = model_mode == "advanced"
auto_to_musetalk = (
model_mode == "default"
and audio_duration is not None
and audio_duration >= settings.LIPSYNC_DURATION_THRESHOLD
)

if force_musetalk:
logger.info("⚡ 强制快速模型:MuseTalk")
musetalk_result = await self._call_musetalk_server(
actual_video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,快速模型回退到 LatentSync")
elif auto_to_musetalk:
logger.info(
f"🔄 音频 {audio_duration:.1f}s >= {settings.LIPSYNC_DURATION_THRESHOLD}s,路由到 MuseTalk"
)
musetalk_result = await self._call_musetalk_server(
actual_video_path, audio_path, output_path
)
if musetalk_result:
return musetalk_result
logger.warning("⚠️ MuseTalk 不可用,回退到 LatentSync(长视频,会较慢)")
elif force_latentsync:
logger.info("🎯 强制高级模型:LatentSync")

# Check LatentSync preconditions (only when falling back to or using LatentSync)
if not self._check_conda_env():
logger.warning("⚠️ Conda 环境不可用,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path

if not self._check_weights():
logger.warning("⚠️ 模型权重不存在,使用 Fallback")
shutil.copy(video_path, output_path)
return output_path

if self.use_server:
# Mode A: call the resident server (accelerated mode)
return await self._call_persistent_server(actual_video_path, audio_path, output_path)

logger.info("🔄 调用 LatentSync 推理 (subprocess)...")

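The routing branch above (`force_musetalk` / `auto_to_musetalk` / `force_latentsync`) reduces to a small pure decision function. A sketch, under the assumption that the threshold mirrors `LIPSYNC_DURATION_THRESHOLD` (120 s by default; this function name is illustrative, not from the codebase):

```python
from typing import Optional


def choose_lipsync_model(mode: str,
                         audio_duration: Optional[float],
                         threshold: float = 120.0) -> str:
    # "fast" forces MuseTalk, "advanced" forces LatentSync;
    # "default" routes audio at or above the threshold to MuseTalk.
    if mode == "fast":
        return "musetalk"
    if mode == "advanced":
        return "latentsync"
    if audio_duration is not None and audio_duration >= threshold:
        return "musetalk"
    return "latentsync"
```

Note that in the actual diff a MuseTalk choice can still fall back to LatentSync at runtime when `_call_musetalk_server` returns `None`; the sketch captures only the initial routing decision.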
@@ -352,6 +300,55 @@ class LipSyncService:
shutil.copy(video_path, output_path)
return output_path

async def _call_musetalk_server(
self, video_path: str, audio_path: str, output_path: str
) -> Optional[str]:
"""
Call the resident MuseTalk server.
Returns output_path on success, or None when unavailable (signals the caller to fall back to LatentSync).
"""
server_url = self.musetalk_api_url
logger.info(f"⚡ 调用 MuseTalk 服务: {server_url}")

try:
async with httpx.AsyncClient(timeout=3600.0) as client:
# Health check
try:
resp = await client.get(f"{server_url}/health", timeout=5.0)
if resp.status_code != 200:
logger.warning("⚠️ MuseTalk 健康检查失败")
return None
health = resp.json()
if not health.get("model_loaded"):
logger.warning("⚠️ MuseTalk 模型未加载")
return None
except Exception:
logger.warning("⚠️ 无法连接 MuseTalk 服务")
return None

# Send the inference request
payload = {
"video_path": str(Path(video_path).resolve()),
"audio_path": str(Path(audio_path).resolve()),
"video_out_path": str(Path(output_path).resolve()),
"batch_size": settings.MUSETALK_BATCH_SIZE,
}

response = await client.post(f"{server_url}/lipsync", json=payload)

if response.status_code == 200:
result = response.json()
if Path(result["output_path"]).exists():
logger.info(f"✅ MuseTalk 推理完成: {output_path}")
return output_path

logger.error(f"❌ MuseTalk 服务报错: {response.text}")
return None

except Exception as e:
logger.error(f"❌ MuseTalk 调用失败: {e}")
return None

async def _call_persistent_server(self, video_path: str, audio_path: str, output_path: str) -> str:
"""Call the local resident server (server.py)"""
server_url = "http://localhost:8007"
@@ -416,15 +413,18 @@ class LipSyncService:
"请确保 LatentSync 服务已启动 (cd models/LatentSync && python scripts/server.py)"
)

async def _remote_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int
) -> str:
"""Call the remote LatentSync API service"""
logger.info(f"📡 调用远程 API: {self.api_url}")
async def _remote_generate(
self,
video_path: str,
audio_path: str,
output_path: str,
fps: int,
model_mode: Literal["default", "fast", "advanced"],
) -> str:
"""Call the remote LatentSync API service"""
if model_mode == "fast":
logger.warning("⚠️ 远程模式未接入 MuseTalk,快速模型将使用远程 LatentSync")
logger.info(f"📡 调用远程 API: {self.api_url}")

try:
async with httpx.AsyncClient(timeout=600.0) as client:
@@ -477,8 +477,18 @@ class LipSyncService:
except:
pass

# Check the MuseTalk server
musetalk_ready = False
try:
async with httpx.AsyncClient(timeout=5.0) as client:
resp = await client.get(f"{self.musetalk_api_url}/health")
if resp.status_code == 200:
musetalk_ready = resp.json().get("model_loaded", False)
except Exception:
pass

return {
"model": "LatentSync 1.6",
"model": "LatentSync 1.6 + MuseTalk 1.5",
"conda_env": conda_ok,
"weights": weights_ok,
"gpu": gpu_ok,
@@ -486,5 +496,7 @@ class LipSyncService:
"gpu_id": self.gpu_id,
"inference_steps": settings.LATENTSYNC_INFERENCE_STEPS,
"guidance_scale": settings.LATENTSYNC_GUIDANCE_SCALE,
"ready": conda_ok and weights_ok and gpu_ok
"ready": conda_ok and weights_ok and gpu_ok,
"musetalk_ready": musetalk_ready,
"lipsync_threshold": settings.LIPSYNC_DURATION_THRESHOLD,
}

@@ -21,16 +21,22 @@ from .uploader.xiaohongshu_uploader import XiaohongshuUploader
from .uploader.weixin_uploader import WeixinUploader


class PublishService:
"""Social media publishing service (with user isolation)"""
class PublishService:
"""Social media publishing service (with user isolation)"""

# Supported platform configuration
PLATFORMS: Dict[str, Dict[str, Any]] = {
"douyin": {"name": "抖音", "url": "https://creator.douyin.com/", "enabled": True},
"weixin": {"name": "微信视频号", "url": "https://channels.weixin.qq.com/", "enabled": True},
"bilibili": {"name": "B站", "url": "https://member.bilibili.com/platform/upload/video/frame", "enabled": True},
"xiaohongshu": {"name": "小红书", "url": "https://creator.xiaohongshu.com/", "enabled": True},
}
PLATFORMS: Dict[str, Dict[str, Any]] = {
"douyin": {"name": "抖音", "url": "https://creator.douyin.com/", "enabled": True},
"weixin": {"name": "微信视频号", "url": "https://channels.weixin.qq.com/", "enabled": True},
"bilibili": {"name": "B站", "url": "https://member.bilibili.com/platform/upload/video/frame", "enabled": True},
"xiaohongshu": {"name": "小红书", "url": "https://creator.xiaohongshu.com/", "enabled": True},
}

COOKIE_DOMAINS: Dict[str, str] = {
"douyin": ".douyin.com",
"weixin": ".weixin.qq.com",
"xiaohongshu": ".xiaohongshu.com",
}

def __init__(self) -> None:
# Store active login sessions to track login state
@@ -185,15 +191,16 @@ class PublishService:
description=description,
user_id=user_id,
)
elif platform == "xiaohongshu":
uploader = XiaohongshuUploader(
title=title,
file_path=local_video_path,
tags=tags,
publish_date=publish_time,
account_file=str(account_file),
description=description
)
elif platform == "xiaohongshu":
uploader = XiaohongshuUploader(
title=title,
file_path=local_video_path,
tags=tags,
publish_date=publish_time,
account_file=str(account_file),
description=description,
user_id=user_id,
)
elif platform == "weixin":
uploader = WeixinUploader(
title=title,
@@ -330,48 +337,88 @@ class PublishService:
logger.exception(f"[登出] 失败: {e}")
return {"success": False, "message": f"注销失败: {str(e)}"}

async def save_cookie_string(self, platform: str, cookie_string: str, user_id: Optional[str] = None) -> Dict[str, Any]:
"""
Save a cookie string extracted from the client browser
async def save_cookie_string(self, platform: str, cookie_string: str, user_id: Optional[str] = None) -> Dict[str, Any]:
"""
Save a cookie string extracted from the client browser

Args:
platform: platform ID
cookie_string: cookie string in document.cookie format
user_id: user ID (for cookie isolation)
"""
try:
account_file = self._get_cookie_path(platform, user_id)

# Parse the cookie string
cookie_dict = {}
for item in cookie_string.split('; '):
if '=' in item:
name, value = item.split('=', 1)
cookie_dict[name] = value

# Special handling for Bilibili
if platform == "bilibili":
bilibili_cookies = {}
required_fields = ['SESSDATA', 'bili_jct', 'DedeUserID', 'DedeUserID__ckMd5']
"""
try:
if platform not in self.PLATFORMS:
return {
"success": False,
"message": f"不支持的平台: {platform}",
}

account_file = self._get_cookie_path(platform, user_id)

# Parse the cookie string
cookie_dict: Dict[str, str] = {}
for item in cookie_string.split(';'):
item = item.strip()
if not item:
continue
if '=' in item:
name, value = item.split('=', 1)
cookie_dict[name.strip()] = value.strip()

if not cookie_dict:
return {
"success": False,
"message": "Cookie 为空,请确认已完成登录",
}

# Special handling for Bilibili
if platform == "bilibili":
bilibili_cookies = {}
required_fields = ['SESSDATA', 'bili_jct', 'DedeUserID', 'DedeUserID__ckMd5']

for field in required_fields:
if field in cookie_dict:
bilibili_cookies[field] = cookie_dict[field]

if len(bilibili_cookies) < 3:
return {
"success": False,
"message": "Cookie不完整,请确保已登录"
}

cookie_dict = bilibili_cookies

# Ensure the directory exists
account_file.parent.mkdir(parents=True, exist_ok=True)

# Save cookies
with open(account_file, 'w', encoding='utf-8') as f:
json.dump(cookie_dict, f, indent=2)
if len(bilibili_cookies) < 3:
return {
"success": False,
"message": "Cookie不完整,请确保已登录"
}
payload: Any = bilibili_cookies
else:
cookie_domain = self.COOKIE_DOMAINS.get(platform, "")
if not cookie_domain:
platform_url = self.PLATFORMS.get(platform, {}).get("url", "")
host = re.sub(r"^https?://", "", platform_url).strip("/")
cookie_domain = f".{host}" if host else ""

storage_cookies = []
for name, value in cookie_dict.items():
if not name:
continue
storage_cookies.append({
"name": name,
"value": value,
"domain": cookie_domain,
"path": "/",
"httpOnly": False,
"secure": True,
"sameSite": "Lax",
"expires": -1,
})

payload = {
"cookies": storage_cookies,
"origins": [],
}

# Ensure the directory exists
account_file.parent.mkdir(parents=True, exist_ok=True)

# Save cookies
with open(account_file, 'w', encoding='utf-8') as f:
json.dump(payload, f, indent=2)

logger.success(f"[登录] {platform} Cookie已保存 (user: {user_id or 'legacy'})")

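The reworked parser above splits on `';'` and then only once on `'='`, so cookie values containing `'='` survive intact and stray whitespace or empty segments are tolerated. The same loop as a standalone function (the name is illustrative, not from the codebase):

```python
from typing import Dict


def parse_cookie_string(cookie_string: str) -> Dict[str, str]:
    # Mirrors the loop in save_cookie_string: split on ';', trim each
    # piece, skip empties, and split name=value only at the first '='.
    cookies: Dict[str, str] = {}
    for item in cookie_string.split(';'):
        item = item.strip()
        if not item or '=' not in item:
            continue
        name, value = item.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies
```

For instance, `parse_cookie_string("a=1; b=x=y;  ; c=3")` yields `{"a": "1", "b": "x=y", "c": "3"}` — the old `split('; ')` variant would have mishandled the segment without a trailing space.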
@@ -8,7 +8,8 @@ import base64
import json
from pathlib import Path
from typing import Optional, Dict, Any, List, Sequence, Mapping, Union
from playwright.async_api import async_playwright, Page, Frame, BrowserContext, Browser, Playwright as PW
from urllib.parse import unquote_to_bytes
from playwright.async_api import async_playwright, Page, Frame, BrowserContext, Browser, Playwright as PW, TimeoutError as PlaywrightTimeoutError
from loguru import logger
from app.core.config import settings

@@ -65,10 +66,16 @@ class QRLoginService:
"xiaohongshu": {
"url": "https://creator.xiaohongshu.com/",
"qr_selectors": [
".login-box-container img.css-1lhmg90",
".login-box-container .css-dvxtzn img",
".login-box-container img",
"div[class*='login-box'] img",
".qrcode img",
"img[alt*='二维码']",
"canvas.qr-code",
"img[class*='qr']"
"img[class*='qr']",
"img[src*='qrcode']",
"img[src*='qr']"
],
"success_indicator": "https://creator.xiaohongshu.com/publish"
},
@@ -109,6 +116,103 @@ class QRLoginService:
ratio = width / height
return 0.75 <= ratio <= 1.33

def _data_url_to_base64(self, data_url: str) -> Optional[str]:
if not data_url or "," not in data_url:
return None
header, payload = data_url.split(",", 1)
header_lower = header.lower()
if not header_lower.startswith("data:image/png"):
return None
if ";base64" in header:
return payload
try:
raw = unquote_to_bytes(payload)
return base64.b64encode(raw).decode()
except Exception:
return None

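`_data_url_to_base64` above accepts PNG data URLs in either base64 or percent-encoded form: base64 payloads pass through unchanged, percent-encoded payloads are decoded and re-encoded. The same logic as a free function, for reference:

```python
import base64
from typing import Optional
from urllib.parse import unquote_to_bytes


def data_url_to_base64(data_url: str) -> Optional[str]:
    # Only PNG data URLs are accepted; anything else yields None.
    if not data_url or "," not in data_url:
        return None
    header, payload = data_url.split(",", 1)
    if not header.lower().startswith("data:image/png"):
        return None
    if ";base64" in header:
        # Payload is already base64 — return it as-is.
        return payload
    try:
        # Percent-encoded payload: decode to bytes, re-encode as base64.
        return base64.b64encode(unquote_to_bytes(payload)).decode()
    except Exception:
        return None
```

So `data_url_to_base64("data:image/png;base64,iVBORw0KGgo=")` returns the payload unchanged, while a `data:image/png,%89PNG` URL is converted to the equivalent base64 string.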
async def _try_export_qr_data_url(self, qr_element) -> Optional[str]:
"""Prefer exporting the element's original image, avoiding the scaling/cropping loss of a screenshot."""
try:
data_url = await qr_element.evaluate("""async (el) => {
const tag = (el.tagName || '').toLowerCase();

if (tag === 'canvas') {
try {
return el.toDataURL('image/png');
} catch {
return null;
}
}

if (tag === 'img') {
const src = el.currentSrc || el.src || '';
if (!src) return null;

if (src.startsWith('data:image/png')) {
return src;
}

if (src.startsWith('blob:')) {
try {
const resp = await fetch(src);
const blob = await resp.blob();
return await new Promise((resolve) => {
const reader = new FileReader();
reader.onloadend = () => resolve(typeof reader.result === 'string' ? reader.result : null);
reader.onerror = () => resolve(null);
reader.readAsDataURL(blob);
});
} catch {
return null;
}
}

return null;
}

return null;
}""")

if not data_url:
return None

return self._data_url_to_base64(data_url)
except Exception:
return None

async def _screenshot_qr_base64(self, page: Page, qr_element) -> Optional[str]:
try:
if self.platform == "weixin":
bbox = await qr_element.bounding_box()
viewport = page.viewport_size or {"width": 1920, "height": 1080}
if bbox:
pad = max(16, int(min(bbox.get("width", 0), bbox.get("height", 0)) * 0.08))
x = max(0.0, bbox.get("x", 0.0) - pad)
y = max(0.0, bbox.get("y", 0.0) - pad)
max_width = float(viewport.get("width", 1920))
max_height = float(viewport.get("height", 1080))
width = min(max_width - x, bbox.get("width", 0.0) + pad * 2)
height = min(max_height - y, bbox.get("height", 0.0) + pad * 2)
if width > 8 and height > 8:
clipped = await page.screenshot(
clip={"x": x, "y": y, "width": width, "height": height},
type="png",
)
return base64.b64encode(clipped).decode()

screenshot = await qr_element.screenshot(type="png")
return base64.b64encode(screenshot).decode()
except Exception as e:
logger.warning(f"[{self.platform}] QR截图失败: {e}")
return None

async def _capture_qr_base64(self, page: Page, qr_element) -> Optional[str]:
data_url_base64 = await self._try_export_qr_data_url(qr_element)
if data_url_base64:
return data_url_base64
return await self._screenshot_qr_base64(page, qr_element)

async def _pick_best_candidate(self, locator, min_side: int = 100):
best = None
best_area = 0
@@ -160,6 +264,88 @@ class QRLoginService:

return await self._find_qr_in_frames(page, selectors, min_side=min_side)
|
||||
|
||||
async def _ensure_xiaohongshu_qr_mode(self, page: Page) -> None:
|
||||
"""小红书登录页默认短信登录,需要先切到扫码登录。"""
|
||||
if self.platform != "xiaohongshu":
|
||||
return
|
||||
|
||||
try:
|
||||
for _ in range(3):
|
||||
sms_mode = False
|
||||
try:
|
||||
sms_mode = await page.locator("input[placeholder*='手机号']").first.is_visible(timeout=800)
|
||||
except Exception:
|
||||
sms_mode = False
|
||||
|
||||
if not sms_mode:
|
||||
return
|
||||
|
||||
clicked = False
|
||||
|
||||
# 先尝试稳定选择器
|
||||
switch_selectors = [
|
||||
"img.css-wemwzq",
|
||||
".login-box-container img[style*='cursor: pointer']",
|
||||
]
|
||||
|
||||
for selector in switch_selectors:
|
||||
try:
|
||||
locator = page.locator(selector)
|
||||
count = await locator.count()
|
||||
for i in range(count):
|
||||
candidate = locator.nth(i)
|
||||
if not await candidate.is_visible():
|
||||
continue
|
||||
bbox = await candidate.bounding_box()
|
||||
if not bbox:
|
||||
continue
|
||||
if bbox.get("width", 0) < 24 or bbox.get("width", 0) > 96:
|
||||
continue
|
||||
if bbox.get("height", 0) < 24 or bbox.get("height", 0) > 96:
|
||||
continue
|
||||
try:
|
||||
await candidate.click(timeout=1200)
|
||||
except Exception:
|
||||
await candidate.evaluate("el => el.click()")
|
||||
clicked = True
|
||||
break
|
||||
if clicked:
|
||||
break
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
if not clicked:
|
||||
# 兜底:在登录卡片右上角找可点击小图标
|
||||
clicked = bool(await page.evaluate("""() => {
|
||||
const phoneInput = Array.from(document.querySelectorAll('input'))
|
||||
.find((el) => (el.placeholder || '').includes('手机号'));
|
||||
const card = document.querySelector('.login-box-container') || phoneInput?.closest('div');
|
||||
if (!card) return false;
|
||||
|
||||
const cardRect = card.getBoundingClientRect();
|
||||
const imgs = Array.from(card.querySelectorAll('img'));
|
||||
for (const img of imgs) {
|
||||
const r = img.getBoundingClientRect();
|
||||
if (r.width < 24 || r.width > 96 || r.height < 24 || r.height > 96) continue;
|
||||
if (r.right < cardRect.right - 90) continue;
|
||||
if (r.top > cardRect.top + 90) continue;
|
||||
const style = getComputedStyle(img);
|
||||
if (style.cursor !== 'pointer') continue;
|
||||
img.click();
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
}"""))
|
||||
|
||||
if not clicked:
|
||||
logger.warning("[xiaohongshu] 未找到登录方式切换按钮,继续尝试二维码提取")
|
||||
return
|
||||
|
||||
logger.info("[xiaohongshu] 已点击登录方式切换,等待二维码渲染")
|
||||
await asyncio.sleep(1.5)
|
||||
except Exception as e:
|
||||
logger.warning(f"[xiaohongshu] 切换扫码登录模式失败: {e}")
|
||||
|
||||
async def _try_text_strategy_in_frames(self, page: Page):
|
||||
for frame in page.frames:
|
||||
if frame == page.main_frame:
|
||||
@@ -317,12 +503,22 @@ class QRLoginService:

        for url in urls_to_try:
            logger.info(f"[{self.platform}] 打开登录页: {url}")
            wait_until = "domcontentloaded" if self.platform == "weixin" else "networkidle"
            await page.goto(url, wait_until=wait_until)
            wait_until = "domcontentloaded" if self.platform in ("weixin", "douyin") else "networkidle"
            try:
                await page.goto(url, wait_until=wait_until, timeout=30000)
            except PlaywrightTimeoutError as nav_err:
                # The Douyin page holds long-lived connections and occasionally never satisfies the wait condition; on timeout, keep trying to extract the QR code
                if self.platform == "douyin":
                    logger.warning(f"[douyin] 页面加载超时,继续尝试提取二维码: {nav_err}")
                else:
                    raise

            # Wait for the page to load
            await asyncio.sleep(1 if self.platform == "weixin" else 2)

            if self.platform == "xiaohongshu":
                await self._ensure_xiaohongshu_qr_mode(page)

            # Extract the QR code (parallel strategies)
            qr_image = await self._extract_qr_code(page, config["qr_selectors"])
            if qr_image:
@@ -373,8 +569,9 @@ class QRLoginService:
                el = await page.wait_for_selector(combined_selector, state="visible", timeout=5000)
                if el:
                    logger.info(f"[{self.platform}] 策略CSS: 匹配成功")
                    screenshot = await el.screenshot()
                    return base64.b64encode(screenshot).decode()
                    qr_base64 = await self._capture_qr_base64(page, el)
                    if qr_base64:
                        return qr_base64
            except Exception as e:
                logger.warning(f"[{self.platform}] 策略CSS 失败: {e}")

@@ -382,8 +579,9 @@ class QRLoginService:
            qr_element = await self._try_text_strategy(page)
            if qr_element:
                try:
                    screenshot = await qr_element.screenshot()
                    return base64.b64encode(screenshot).decode()
                    qr_base64 = await self._capture_qr_base64(page, qr_element)
                    if qr_base64:
                        return qr_base64
                except Exception as e:
                    logger.warning(f"[{self.platform}] Text策略截图失败: {e}")

@@ -397,8 +595,9 @@ class QRLoginService:
            qr_element = await self._try_text_strategy(page)
            if qr_element:
                try:
                    screenshot = await qr_element.screenshot()
                    return base64.b64encode(screenshot).decode()
                    qr_base64 = await self._capture_qr_base64(page, qr_element)
                    if qr_base64:
                        return qr_base64
                except Exception as e:
                    logger.warning(f"[{self.platform}] Text策略截图失败: {e}")
                    qr_element = None
@@ -410,12 +609,16 @@ class QRLoginService:
                el = await page.wait_for_selector(combined_selector, state="visible", timeout=5000)
                if el:
                    logger.info(f"[{self.platform}] 策略CSS: 匹配成功")
                    screenshot = await el.screenshot()
                    return base64.b64encode(screenshot).decode()
                    qr_base64 = await self._capture_qr_base64(page, el)
                    if qr_base64:
                        return qr_base64
            except Exception as e:
                logger.warning(f"[{self.platform}] 策略CSS 失败: {e}")
        else:
            # Other platforms (Xiaohongshu/Weixin etc.): keep the original order CSS -> Text
            if self.platform == "xiaohongshu":
                await self._ensure_xiaohongshu_qr_mode(page)

            # Strategy 1: CSS selectors
            try:
                combined_selector = ", ".join(selectors)
@@ -432,7 +635,8 @@ class QRLoginService:
            else:
                await page.wait_for_selector(combined_selector, state="visible", timeout=5000)
            locator = page.locator(combined_selector)
            qr_element = await self._pick_best_candidate(locator, min_side=100)
            min_side = 120 if self.platform == "xiaohongshu" else 100
            qr_element = await self._pick_best_candidate(locator, min_side=min_side)
            if qr_element:
                logger.info(f"[{self.platform}] 策略1(CSS): 匹配成功")
        except Exception as e:
@@ -448,8 +652,9 @@ class QRLoginService:
        # If an element was found, screenshot it and return
        if qr_element:
            try:
                screenshot = await qr_element.screenshot()
                return base64.b64encode(screenshot).decode()
                qr_base64 = await self._capture_qr_base64(page, qr_element)
                if qr_base64:
                    return qr_base64
            except Exception as e:
                logger.error(f"[{self.platform}] 截图失败: {e}")
@@ -465,6 +670,8 @@ class QRLoginService:
        keywords = [
            "扫码登录",
            "二维码",
            "APP扫一扫登录",
            "可用小红书扫码",
            "打开抖音",
            "抖音APP",
            "使用APP扫码",
@@ -483,7 +690,7 @@ class QRLoginService:
        for _ in range(5):
            parent = parent.locator("..")
            candidates = parent.locator("img, canvas")
            min_side = 120 if self.platform == "weixin" else 100
            min_side = 120 if self.platform in ("weixin", "xiaohongshu") else 100
            best = await self._pick_best_candidate(candidates, min_side=min_side)
            if best:
                logger.info(f"[{self.platform}] 策略Text: 成功")
@@ -554,6 +761,22 @@ class QRLoginService:
                await self._save_cookies(final)
                break

            # ── Xiaohongshu special case: after scanning, the page often redirects to /new/home and may not hit success_indicator ──
            if self.platform == "xiaohongshu":
                lowered_url = current_url.lower()
                xhs_logged_in = (
                    lowered_url.startswith("https://creator.xiaohongshu.com/new/")
                    or "/publish/publish" in lowered_url
                    or "/publish/success" in lowered_url
                ) and "/login" not in lowered_url
                if xhs_logged_in:
                    logger.success(f"[xiaohongshu] 登录成功!URL={current_url[:120]}")
                    self.login_success = True
                    await asyncio.sleep(2)
                    final = [dict(c) for c in await self.context.cookies()]
                    await self._save_cookies(final)
                    break

            # ── Douyin: redirect_url intercepted from the API → navigate directly ──
            if self.platform == "douyin" and self._qr_api_confirmed and self._qr_redirect_url:
                logger.info(f"[douyin] 导航到 redirect_url...")
@@ -71,7 +71,8 @@ class RemotionService:
            "--video", str(video_path),
            "--output", str(output_path),
            "--fps", str(fps),
            "--enableSubtitles", str(enable_subtitles).lower()
            "--enableSubtitles", str(enable_subtitles).lower(),
            "--concurrency", "4"
        ])

        if captions_path:
@@ -847,13 +847,22 @@ class WeixinUploader(BaseUploader):
                        logger.info(text)
                        self._append_debug_log(text)
                        return True
                    text = "[weixin][file_input] empty"
                    logger.warning(text)
                    self._append_debug_log(text)
                    await asyncio.sleep(0.5)
                    if await self._is_upload_in_progress(page):
                    upload_started = False
                    for _ in range(3):
                        await asyncio.sleep(0.4)
                        if await self._is_upload_in_progress(page):
                            upload_started = True
                            break
                    if upload_started:
                        logger.info("[weixin] upload started after file input set")
                        return True

                    text = "[weixin][file_input] empty after set_input_files and no upload signal"
                    if attempt + 1 >= self.MAX_CLICK_RETRIES:
                        logger.warning(text)
                    else:
                        logger.info(text)
                    self._append_debug_log(text)
                except Exception as e:
                    logger.warning(f"[weixin] failed to read file input info: {e}")
            except Exception as e:
@@ -1,201 +1,775 @@
"""
Xiaohongshu (小红书) uploader using Playwright
Based on social-auto-upload implementation
"""
from datetime import datetime
from pathlib import Path
from typing import Optional, List, Dict, Any
import asyncio

from playwright.async_api import Playwright, async_playwright
from loguru import logger

from .base_uploader import BaseUploader
from .cookie_utils import set_init_script


class XiaohongshuUploader(BaseUploader):
    """Xiaohongshu video uploader using Playwright"""

    # Timeout settings (seconds)
    UPLOAD_TIMEOUT = 300  # video upload timeout
    PUBLISH_TIMEOUT = 120  # publish detection timeout
    POLL_INTERVAL = 1  # polling interval

    def __init__(
        self,
        title: str,
        file_path: str,
        tags: List[str],
        publish_date: Optional[datetime] = None,
        account_file: Optional[str] = None,
        description: str = ""
    ):
        super().__init__(title, file_path, tags, publish_date, account_file, description)
        self.upload_url = "https://creator.xiaohongshu.com/publish/publish?from=homepage&target=video"

    async def set_schedule_time(self, page, publish_date):
        """Set scheduled publish time"""
        try:
            logger.info("[小红书] 正在设置定时发布时间...")

            # Click "定时发布" label
            label_element = page.locator("label:has-text('定时发布')")
            await label_element.click()
            await asyncio.sleep(1)

            # Format time
            publish_date_hour = publish_date.strftime("%Y-%m-%d %H:%M")

            # Fill datetime input
            await page.locator('.el-input__inner[placeholder="选择日期和时间"]').click()
            await page.keyboard.press("Control+KeyA")
            await page.keyboard.type(str(publish_date_hour))
            await page.keyboard.press("Enter")

            await asyncio.sleep(1)
            logger.info(f"[小红书] 已设置定时发布: {publish_date_hour}")

        except Exception as e:
            logger.error(f"[小红书] 设置定时发布失败: {e}")
    async def upload(self, playwright: Playwright) -> dict:
        """Main upload logic with guaranteed resource cleanup"""
        browser = None
        context = None
        try:
            # Launch browser (headless for server deployment)
            browser = await playwright.chromium.launch(headless=True)
            context = await browser.new_context(
                viewport={"width": 1600, "height": 900},
                storage_state=self.account_file
            )
            context = await set_init_script(context)

            page = await context.new_page()

            # Go to upload page
            await page.goto(self.upload_url)
            logger.info(f"[小红书] 正在上传: {self.file_path.name}")

            # Upload video file
            await page.locator("div[class^='upload-content'] input[class='upload-input']").set_input_files(str(self.file_path))

            # Wait for upload to complete (with timeout)
            import time
            upload_start = time.time()
            while time.time() - upload_start < self.UPLOAD_TIMEOUT:
                try:
                    upload_input = await page.wait_for_selector('input.upload-input', timeout=3000)
                    preview_new = await upload_input.query_selector(
                        'xpath=following-sibling::div[contains(@class, "preview-new")]'
                    )

                    if preview_new:
                        stage_elements = await preview_new.query_selector_all('div.stage')
                        upload_success = False

                        for stage in stage_elements:
                            text_content = await page.evaluate('(element) => element.textContent', stage)
                            if '上传成功' in text_content:
                                upload_success = True
                                break

                        if upload_success:
                            logger.info("[小红书] 检测到上传成功标识")
                            break
                        else:
                            logger.info("[小红书] 未找到上传成功标识,继续等待...")
                    else:
                        logger.info("[小红书] 未找到预览元素,继续等待...")

                    await asyncio.sleep(self.POLL_INTERVAL)

                except Exception as e:
                    logger.info(f"[小红书] 检测过程: {str(e)},重新尝试...")
                    await asyncio.sleep(0.5)
            else:
                logger.error("[小红书] 视频上传超时")
                return {
                    "success": False,
                    "message": "视频上传超时",
                    "url": None
                }

            # Fill title and tags
            await asyncio.sleep(1)
            logger.info("[小红书] 正在填充标题和话题...")

            title_container = page.locator('div.plugin.title-container').locator('input.d-text')
            if await title_container.count():
                await title_container.fill(self.title[:30])

            # Add tags
            css_selector = ".tiptap"
            for tag in self.tags:
                await page.type(css_selector, "#" + tag)
                await page.press(css_selector, "Space")

            logger.info(f"[小红书] 总共添加 {len(self.tags)} 个话题")

            # Set scheduled publish time if needed
            if self.publish_date != 0:
                await self.set_schedule_time(page, self.publish_date)

            # Click publish button (with timeout)
            publish_start = time.time()
            while time.time() - publish_start < self.PUBLISH_TIMEOUT:
                try:
                    if self.publish_date != 0:
                        await page.locator('button:has-text("定时发布")').click()
                    else:
                        await page.locator('button:has-text("发布")').click()

                    await page.wait_for_url(
                        "https://creator.xiaohongshu.com/publish/success?**",
                        timeout=3000
                    )
                    logger.success("[小红书] 视频发布成功")
                    break
                except Exception:
                    logger.info("[小红书] 视频正在发布中...")
                    await asyncio.sleep(0.5)
            else:
                logger.warning("[小红书] 发布检测超时,请手动确认")

            # Save updated cookies
            await context.storage_state(path=self.account_file)
            logger.success("[小红书] Cookie 更新完毕")

            await asyncio.sleep(2)

            return {
                "success": True,
                "message": "发布成功,待审核" if self.publish_date == 0 else "已设置定时发布",
                "url": None
            }

        except Exception as e:
            logger.exception(f"[小红书] 上传失败: {e}")
            return {
                "success": False,
                "message": f"上传失败: {str(e)}",
                "url": None
            }
        finally:
            # Ensure resources are released
            if context:
                try:
                    await context.close()
                except Exception:
                    pass
            if browser:
                try:
                    await browser.close()
                except Exception:
                    pass

    async def main(self) -> Dict[str, Any]:
        """Execute upload"""
        async with async_playwright() as playwright:
            return await self.upload(playwright)
"""
Xiaohongshu (小红书) uploader using Playwright.
"""
from datetime import datetime
from pathlib import Path
from typing import Optional, List, Dict, Any
import asyncio
import os
import re
import shutil
import time

from playwright.async_api import Playwright, async_playwright
from loguru import logger

from .base_uploader import BaseUploader
from .cookie_utils import set_init_script
from app.core.config import settings


class XiaohongshuUploader(BaseUploader):
    """Xiaohongshu video uploader using Playwright"""

    UPLOAD_TIMEOUT = 420
    UPLOAD_IDLE_TIMEOUT = 90
    UPLOAD_SIGNAL_TIMEOUT = 12
    PUBLISH_TIMEOUT = 120
    PAGE_READY_TIMEOUT = 60
    POLL_INTERVAL = 2
    MAX_CLICK_RETRIES = 3

    def __init__(
        self,
        title: str,
        file_path: str,
        tags: List[str],
        publish_date: Optional[datetime] = None,
        account_file: Optional[str] = None,
        description: str = "",
        user_id: Optional[str] = None,
    ):
        super().__init__(title, file_path, tags, publish_date, account_file, description)
        self.user_id = user_id
        self.upload_url = "https://creator.xiaohongshu.com/publish/publish?from=homepage&target=video"
        self._publish_api_submitted = False
        self._publish_api_error: Optional[str] = None
        self._temp_upload_paths: List[Path] = []

    def _track_temp_upload_path(self, path: Path) -> None:
        self._temp_upload_paths.append(path)

    def _prepare_upload_file(self) -> Path:
        src = self.file_path
        if src.suffix:
            return src

        parent_suffix = Path(src.parent.name).suffix
        if not parent_suffix:
            return src

        temp_dir = Path("/tmp/vigent_uploads")
        temp_dir.mkdir(parents=True, exist_ok=True)
        target = temp_dir / src.parent.name

        try:
            if target.exists():
                target.unlink()
        except Exception:
            pass

        try:
            os.link(src, target)
            logger.info(f"[小红书] using hardlink upload file: {target}")
        except Exception:
            try:
                shutil.copy2(src, target)
                logger.info(f"[小红书] using copied upload file: {target}")
            except Exception as e:
                logger.warning(f"[小红书] 构建带后缀上传文件失败,回退原文件: {e}")
                return src

        self._track_temp_upload_path(target)
        return target

    def _cleanup_upload_file(self) -> None:
        if not self._temp_upload_paths:
            return

        paths = list(self._temp_upload_paths)
        self._temp_upload_paths = []
        for path in paths:
            try:
                if path.exists():
                    path.unlink()
            except Exception as e:
                logger.warning(f"[小红书] 清理临时上传文件失败: {e}")

    def _resolve_headless_mode(self) -> str:
        mode = (settings.XIAOHONGSHU_HEADLESS_MODE or "").strip().lower()
        return mode or "headless-new"

    def _build_launch_options(self) -> Dict[str, Any]:
        mode = self._resolve_headless_mode()
        args = [
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-blink-features=AutomationControlled",
        ]

        headless = mode not in ("headful", "false", "0", "no")
        if headless and mode in ("new", "headless-new", "headless_new"):
            args.append("--headless=new")

        if settings.XIAOHONGSHU_FORCE_SWIFTSHADER or headless:
            args.extend([
                "--enable-unsafe-swiftshader",
                "--use-gl=swiftshader",
            ])

        options: Dict[str, Any] = {"headless": headless, "args": args}
        chrome_path = (settings.XIAOHONGSHU_CHROME_PATH or "").strip()
        if chrome_path:
            if Path(chrome_path).exists():
                options["executable_path"] = chrome_path
            else:
                logger.warning(f"[小红书] XIAOHONGSHU_CHROME_PATH 不存在: {chrome_path}")
        else:
            channel = (settings.XIAOHONGSHU_BROWSER_CHANNEL or "").strip()
            if channel:
                options["channel"] = channel

        return options

    def _debug_artifacts_enabled(self) -> bool:
        return bool(settings.DEBUG and settings.XIAOHONGSHU_DEBUG_ARTIFACTS)

    async def _save_debug_screenshot(self, page, name: str) -> None:
        if not self._debug_artifacts_enabled():
            return
        try:
            debug_dir = Path(__file__).parent.parent.parent / "debug_screenshots"
            debug_dir.mkdir(exist_ok=True)
            safe_name = name.replace("/", "_").replace(" ", "_")
            file_path = debug_dir / f"xiaohongshu_{safe_name}.png"
            await page.screenshot(path=str(file_path), full_page=True)
            logger.info(f"[小红书] saved debug screenshot: {file_path}")
        except Exception as e:
            logger.warning(f"[小红书] 保存调试截图失败: {e}")

    def _publish_screenshot_dir(self) -> Path:
        user_key = re.sub(r"[^A-Za-z0-9_-]", "_", self.user_id or "legacy")[:64] or "legacy"
        target = settings.PUBLISH_SCREENSHOT_DIR / user_key
        target.mkdir(parents=True, exist_ok=True)
        return target

    async def _save_publish_success_screenshot(self, page) -> Optional[str]:
        try:
            timestamp = time.strftime("%Y%m%d_%H%M%S", time.localtime())
            filename = f"xiaohongshu_success_{timestamp}_{int(time.time() * 1000) % 1000:03d}.png"
            file_path = self._publish_screenshot_dir() / filename
            await page.screenshot(path=str(file_path), full_page=False)
            return f"/api/publish/screenshot/{filename}"
        except Exception as e:
            logger.warning(f"[小红书] 保存发布成功截图失败: {e}")
            return None
    def _attach_publish_listener(self, page) -> None:
        ignore_tokens = ("report", "collect", "analytics", "monitor", "perf")

        def on_response(response):
            try:
                request = response.request
                if request.method not in ("POST", "PUT"):
                    return

                url = (response.url or "").lower()
                if "xiaohongshu.com" not in url or "api" not in url:
                    return
                if not any(token in url for token in ("publish", "note/create", "note/publish", "note/save")):
                    return
                if any(token in url for token in ignore_tokens):
                    return

                if response.status < 400:
                    self._publish_api_submitted = True
                    logger.info("[小红书][publish] publish API ok")
                else:
                    self._publish_api_error = f"发布请求失败(HTTP {response.status})"
                    logger.warning(f"[小红书][publish] publish API failed status={response.status}")
            except Exception:
                pass

        page.on("response", on_response)

    async def _is_text_visible(self, page, text: str, exact: bool = False) -> bool:
        try:
            return await page.get_by_text(text, exact=exact).first.is_visible()
        except Exception:
            return False

    async def _first_existing_locator(self, page, selectors: List[str], require_visible: bool = True):
        for selector in selectors:
            locator = page.locator(selector)
            try:
                if await locator.count() == 0:
                    continue
                candidate = locator.first
                if require_visible and not await candidate.is_visible():
                    continue
                return candidate
            except Exception:
                continue
        return None

    async def _is_login_page(self, page) -> bool:
        url = page.url.lower()
        if "login" in url or "signin" in url:
            return True
        if await self._is_text_visible(page, "扫码登录", exact=False):
            return True
        if await self._is_text_visible(page, "立即登录", exact=False):
            return True
        return False

    async def _go_to_publish_page(self, page):
        await page.goto(self.upload_url, wait_until="domcontentloaded", timeout=self.PAGE_READY_TIMEOUT * 1000)
        await asyncio.sleep(2)
        return page

    async def _find_file_input(self, page):
        selectors = [
            "input[type='file'][accept*='video']",
            "div[class*='upload'] input[type='file']",
            "input.upload-input",
            "input[type='file']",
        ]
        return await self._first_existing_locator(page, selectors, require_visible=False)

    async def _open_upload_entry(self, page) -> None:
        selectors = [
            "button:has-text('上传视频')",
            "button:has-text('上传')",
            "div[role='button']:has-text('上传视频')",
            "div[role='button']:has-text('上传')",
            "span:has-text('上传视频')",
        ]
        target = await self._first_existing_locator(page, selectors)
        if not target:
            return
        try:
            await target.scroll_into_view_if_needed()
        except Exception:
            pass
        try:
            await target.click(timeout=2000)
        except Exception:
            try:
                await target.evaluate("el => el.click()")
            except Exception:
                pass

    async def _is_upload_in_progress(self, page) -> bool:
        in_progress_texts = [
            "上传中",
            "正在上传",
            "处理中",
            "视频处理中",
            "转码中",
            "请稍候",
            "上传进度",
            "校验中",
            "准备中",
        ]
        for text in in_progress_texts:
            if await self._is_text_visible(page, text, exact=False):
                return True
        return False

    async def _is_upload_success(self, page) -> bool:
        success_texts = [
            "上传成功",
            "上传完成",
            "处理完成",
            "转码完成",
            "可发布",
        ]
        for text in success_texts:
            if await self._is_text_visible(page, text, exact=False):
                return True
        return await self._is_publish_button_enabled(page)

    async def _upload_failed_reason(self, page) -> Optional[str]:
        failure_texts = [
            "上传失败",
            "上传异常",
            "上传出错",
            "上传超时",
            "网络异常",
        ]
        for text in failure_texts:
            if await self._is_text_visible(page, text, exact=False):
                return f"上传失败:{text}"
        return None
    async def _upload_video(self, page) -> bool:
        page = await self._go_to_publish_page(page)
        await self._save_debug_screenshot(page, "publish_page")

        upload_path = self._prepare_upload_file()
        try:
            upload_size = upload_path.stat().st_size
            logger.info(
                f"[小红书][upload_file] path={upload_path} "
                f"size={upload_size} suffix={upload_path.suffix}"
            )
        except Exception as e:
            logger.warning(f"[小红书] 读取上传文件信息失败: {e}")

        for attempt in range(self.MAX_CLICK_RETRIES):
            file_input = await self._find_file_input(page)
            if not file_input:
                await self._open_upload_entry(page)
                await asyncio.sleep(1)
                file_input = await self._find_file_input(page)

            if not file_input:
                logger.info(f"[小红书] 未找到上传文件 input,准备重试 ({attempt + 1}/{self.MAX_CLICK_RETRIES})")
                await asyncio.sleep(1)
                continue

            try:
                await file_input.set_input_files(str(upload_path))
                logger.info(f"[小红书] 已设置上传文件: {upload_path.name}")

                try:
                    file_info = await file_input.evaluate(
                        """
                        (input) => {
                            const file = input && input.files ? input.files[0] : null;
                            if (!file) return null;
                            return { name: file.name, size: file.size, type: file.type };
                        }
                        """
                    )
                    if file_info:
                        selected_name = str(file_info.get("name") or "")
                        logger.info(
                            "[小红书][file_input] "
                            f"name={selected_name} "
                            f"size={file_info.get('size')} "
                            f"type={file_info.get('type')}"
                        )
                        if upload_path.suffix and selected_name and not selected_name.lower().endswith(upload_path.suffix.lower()):
                            logger.warning(
                                "[小红书] file input 文件名后缀与上传文件不一致,"
                                f"expect=*{upload_path.suffix} actual={selected_name}"
                            )
                            if attempt + 1 < self.MAX_CLICK_RETRIES:
                                await asyncio.sleep(1)
                                continue
                            await self._save_debug_screenshot(page, "upload_input_name_mismatch")
                            return False

                        if not str(file_info.get("type") or "").strip():
                            logger.warning("[小红书] file input MIME 为空,可能影响站点识别")
                except Exception:
                    pass

                signal_detected = False
                bootstrap_error: Optional[str] = None
                deadline = time.time() + self.UPLOAD_SIGNAL_TIMEOUT
                while time.time() < deadline:
                    bootstrap_error = await self._upload_failed_reason(page)
                    if bootstrap_error:
                        break
                    if await self._is_upload_in_progress(page) or await self._is_upload_success(page):
                        signal_detected = True
                        break
                    await asyncio.sleep(0.6)

                if bootstrap_error:
                    logger.warning(f"[小红书] 上传启动阶段失败: {bootstrap_error}")
                    if attempt + 1 < self.MAX_CLICK_RETRIES:
                        await asyncio.sleep(1)
                        continue
                    return False

                if signal_detected:
                    return True

                logger.info("[小红书] 未立即检测到上传状态,进入后续上传监控")
                return True
            except Exception as e:
                logger.warning(f"[小红书] set_input_files 失败: {e}")

            await asyncio.sleep(1)

        await self._save_debug_screenshot(page, "upload_input_missing")
        return False

    async def _wait_for_upload_complete(self, page) -> tuple[bool, str]:
        start = time.time()
        idle_start = start
        while time.time() - start < self.UPLOAD_TIMEOUT:
            reason = await self._upload_failed_reason(page)
            if reason:
                logger.warning(f"[小红书] 上传失败检测: {reason}")
                return False, reason

            if await self._is_upload_success(page):
                return True, "上传完成"

            if await self._is_upload_in_progress(page):
                idle_start = time.time()
                logger.info("[小红书] 视频上传进行中...")
            else:
                if time.time() - idle_start > self.UPLOAD_IDLE_TIMEOUT:
                    await self._save_debug_screenshot(page, "upload_idle_timeout")
                    return False, "未检测到有效上传进度(疑似上传控件未生效)"
                logger.info("[小红书] 等待上传状态...")

            await asyncio.sleep(self.POLL_INTERVAL)

        return False, "视频上传超时"
    def _normalize_tags(self, tags: List[str]) -> List[str]:
        normalized: List[str] = []
        seen = set()
        for raw in tags:
            item = (raw or "").strip().lstrip("#")
            if not item:
                continue
            lowered = item.lower()
            if lowered in seen:
                continue
            seen.add(lowered)
            normalized.append(item)
        return normalized

    async def _fill_title(self, page) -> bool:
        selectors = [
            "input[placeholder*='标题']",
            "div.plugin.title-container input",
            "input.d-text",
        ]
        target = await self._first_existing_locator(page, selectors)
        if not target:
            return False

        try:
            await target.click(timeout=1500)
            await target.fill((self.title or "")[:30])
            return True
        except Exception:
            return False

    async def _fill_description(self, page, text: str) -> bool:
        selectors = [
            ".tiptap[contenteditable='true']",
            "[contenteditable='true'][data-placeholder*='描述']",
            "[contenteditable='true'][role='textbox']",
            "textarea[placeholder*='描述']",
            "textarea[placeholder*='正文']",
        ]
        target = await self._first_existing_locator(page, selectors)
        if not target:
            return False

        try:
            await target.click(timeout=1500)
            await page.keyboard.press("Control+KeyA")
            await page.keyboard.type(text)
            return True
        except Exception:
            return False

    async def set_schedule_time(self, page, publish_date: datetime) -> bool:
        try:
            toggle = await self._first_existing_locator(
                page,
                [
                    "label:has-text('定时发布')",
                    "span:has-text('定时发布')",
                    "div:has-text('定时发布')",
                ],
            )
            if not toggle:
                return False

            try:
                await toggle.click(timeout=2000)
            except Exception:
                await toggle.evaluate("el => el.click()")

            await asyncio.sleep(0.5)
            date_input = await self._first_existing_locator(
                page,
                [
                    "input[placeholder*='日期和时间']",
                    "input[placeholder*='发布时间']",
                    "input[placeholder*='选择日期']",
                ],
            )
            if not date_input:
                return False

            value = publish_date.strftime("%Y-%m-%d %H:%M")
            await date_input.click(timeout=2000)
            await page.keyboard.press("Control+KeyA")
            await page.keyboard.type(value)
            await page.keyboard.press("Enter")
            logger.info(f"[小红书] 已设置定时发布: {value}")
            return True
        except Exception as e:
            logger.warning(f"[小红书] 设置定时发布时间失败: {e}")
            return False

    async def _find_publish_button(self, page, scheduled: bool):
        selectors = [
            "button:has-text('定时发布')",
            "div[role='button']:has-text('定时发布')",
        ] if scheduled else [
            "button:has-text('发布')",
            "button:has-text('立即发布')",
            "div[role='button']:has-text('发布')",
        ]

        for selector in selectors:
            locator = page.locator(selector)
            try:
                if await locator.count() == 0:
                    continue
                candidate = locator.first
                if not await candidate.is_visible():
                    continue
                return candidate
            except Exception:
                continue
        return None

    async def _is_publish_button_enabled(self, page) -> bool:
        buttons = [
            await self._find_publish_button(page, scheduled=False),
            await self._find_publish_button(page, scheduled=True),
        ]
        for button in buttons:
            if not button:
                continue
            try:
                if await button.is_enabled():
                    return True
            except Exception:
                continue
        return False

    async def _click_publish(self, page, scheduled: bool) -> tuple[bool, str]:
        for _ in range(self.MAX_CLICK_RETRIES):
            button = await self._find_publish_button(page, scheduled)
            if not button:
                await asyncio.sleep(0.8)
                continue

            try:
                if not await button.is_enabled():
                    await asyncio.sleep(0.8)
                    continue
            except Exception:
                pass

            try:
                await button.click(timeout=2000)
                return True, "发布按钮点击成功"
            except Exception:
                try:
                    await button.evaluate("el => el.click()")
                    return True, "发布按钮 JS 点击成功"
                except Exception:
                    await asyncio.sleep(0.8)

        return False, "未找到可点击的发布按钮"

    async def _wait_for_publish_result(self, page) -> tuple[bool, str, bool]:
        create_url = page.url
        success_url_tokens = [
            "/publish/success",
            "/publish/result",
            "/publish/published",
        ]
        success_texts = [
            "发布成功",
            "发布完成",
            "审核中",
            "查看笔记",
            "去查看",
        ]
        failure_texts = [
            "发布失败",
            "发布异常",
            "发布出错",
            "网络异常",
            "请完善",
            "请补充",
        ]

        start_time = time.time()
        while time.time() - start_time < self.PUBLISH_TIMEOUT:
||||
if self._publish_api_error:
|
||||
return False, self._publish_api_error, False
|
||||
|
||||
current_url = page.url
|
||||
lowered_url = current_url.lower()
|
||||
if any(token in lowered_url for token in success_url_tokens):
|
||||
return True, f"发布成功:跳转到 {current_url}", False
|
||||
|
||||
if current_url != create_url and "/publish/publish" not in lowered_url:
|
||||
return True, f"发布成功:页面已跳转 {current_url}", False
|
||||
|
||||
if self._publish_api_submitted:
|
||||
return True, "发布成功:API 已确认", False
|
||||
|
||||
for text in failure_texts:
|
||||
if await self._is_text_visible(page, text, exact=False):
|
||||
return False, f"发布失败:{text}", False
|
||||
|
||||
for text in success_texts:
|
||||
if await self._is_text_visible(page, text, exact=False):
|
||||
return True, f"发布成功:检测到文案 {text}", False
|
||||
|
||||
logger.info("[小红书] 等待发布结果...")
|
||||
await asyncio.sleep(self.POLL_INTERVAL)
|
||||
|
||||
return False, "发布超时", True
|
||||
|
||||
async def upload(self, playwright: Playwright) -> Dict[str, Any]:
|
||||
browser = None
|
||||
context = None
|
||||
page = None
|
||||
try:
|
||||
launch_options = self._build_launch_options()
|
||||
browser = await playwright.chromium.launch(**launch_options)
|
||||
context = await browser.new_context(
|
||||
storage_state=self.account_file,
|
||||
viewport={"width": 1600, "height": 900},
|
||||
device_scale_factor=1,
|
||||
user_agent=settings.XIAOHONGSHU_USER_AGENT,
|
||||
locale=settings.XIAOHONGSHU_LOCALE,
|
||||
timezone_id=settings.XIAOHONGSHU_TIMEZONE_ID,
|
||||
)
|
||||
context = await set_init_script(context)
|
||||
|
||||
page = await context.new_page()
|
||||
self._attach_publish_listener(page)
|
||||
|
||||
await self._go_to_publish_page(page)
|
||||
if await self._is_login_page(page):
|
||||
return {
|
||||
"success": False,
|
||||
"message": "登录失效,请重新扫码登录小红书",
|
||||
"url": None,
|
||||
}
|
||||
|
||||
logger.info(f"[小红书] 正在上传: {self.file_path.name}")
|
||||
if not await self._upload_video(page):
|
||||
return {
|
||||
"success": False,
|
||||
"message": "未能触发有效视频上传,请确认发布页状态及视频文件格式",
|
||||
"url": None,
|
||||
}
|
||||
|
||||
upload_success, upload_reason = await self._wait_for_upload_complete(page)
|
||||
if not upload_success:
|
||||
await self._save_debug_screenshot(page, "upload_failed")
|
||||
return {
|
||||
"success": False,
|
||||
"message": upload_reason,
|
||||
"url": None,
|
||||
}
|
||||
|
||||
await asyncio.sleep(1)
|
||||
title_filled = await self._fill_title(page)
|
||||
if not title_filled:
|
||||
logger.warning("[小红书] 未找到标题输入框,尝试在正文中补充标题")
|
||||
|
||||
normalized_tags = self._normalize_tags(self.tags)
|
||||
body_parts: List[str] = []
|
||||
if self.description:
|
||||
body_parts.append(self.description.strip())
|
||||
if not title_filled and self.title:
|
||||
body_parts.insert(0, self.title.strip())
|
||||
if normalized_tags:
|
||||
body_parts.append(" ".join([f"#{tag}" for tag in normalized_tags]))
|
||||
body_text = "\n".join([part for part in body_parts if part]).strip()
|
||||
|
||||
if body_text:
|
||||
body_ok = await self._fill_description(page, body_text)
|
||||
if not body_ok:
|
||||
logger.warning("[小红书] 未找到正文输入框,跳过正文/话题填充")
|
||||
|
||||
if self.publish_date != 0 and isinstance(self.publish_date, datetime):
|
||||
if not await self.set_schedule_time(page, self.publish_date):
|
||||
return {
|
||||
"success": False,
|
||||
"message": "未找到定时发布控件,请检查小红书发布页结构",
|
||||
"url": None,
|
||||
}
|
||||
|
||||
clicked, click_reason = await self._click_publish(page, scheduled=self.publish_date != 0)
|
||||
if not clicked:
|
||||
await self._save_debug_screenshot(page, "publish_button_not_clickable")
|
||||
return {
|
||||
"success": False,
|
||||
"message": click_reason,
|
||||
"url": None,
|
||||
}
|
||||
|
||||
publish_success, publish_reason, is_timeout = await self._wait_for_publish_result(page)
|
||||
|
||||
await context.storage_state(path=self.account_file)
|
||||
logger.success("[小红书] Cookie 更新完毕")
|
||||
|
||||
if publish_success:
|
||||
await asyncio.sleep(2)
|
||||
screenshot_url = await self._save_publish_success_screenshot(page)
|
||||
return {
|
||||
"success": True,
|
||||
"message": "发布成功,待审核" if self.publish_date == 0 else "已设置定时发布",
|
||||
"url": None,
|
||||
"screenshot_url": screenshot_url,
|
||||
}
|
||||
|
||||
if is_timeout:
|
||||
return {
|
||||
"success": False,
|
||||
"message": f"发布状态未知(检测超时),请到小红书创作中心确认: {publish_reason}",
|
||||
"url": None,
|
||||
}
|
||||
|
||||
return {
|
||||
"success": False,
|
||||
"message": publish_reason,
|
||||
"url": None,
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.exception(f"[小红书] 上传失败: {e}")
|
||||
return {
|
||||
"success": False,
|
||||
"message": f"上传失败: {str(e)}",
|
||||
"url": None,
|
||||
}
|
||||
finally:
|
||||
self._cleanup_upload_file()
|
||||
|
||||
if page:
|
||||
try:
|
||||
if not page.is_closed():
|
||||
await page.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if context:
|
||||
try:
|
||||
await context.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if browser:
|
||||
try:
|
||||
await browser.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
async def main(self) -> Dict[str, Any]:
|
||||
async with async_playwright() as playwright:
|
||||
return await self.upload(playwright)
|
||||
|
||||
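The publish-result detection above polls the page URL and visible text in priority order (API error, URL tokens, API confirmation, failure texts, success texts) until a deadline, and reports a timeout as a separate "state unknown" flag. Stripped of the Playwright specifics, the pattern is a deadline-bounded poll over ordered predicates; a minimal sketch (the `poll_result` name and signature are illustrative, not part of the uploader):

```python
import time

def poll_result(checks, timeout=10.0, interval=0.5, now=time.monotonic, sleep=time.sleep):
    """Poll ordered (predicate, success, message) checks until one fires or the deadline passes.

    Mirrors the shape of _wait_for_publish_result: each loop iteration evaluates
    the conditions in priority order, and a timeout returns a third flag so the
    caller can report "state unknown" instead of a hard failure.
    """
    deadline = now() + timeout
    while now() < deadline:
        for predicate, success, message in checks:
            if predicate():
                return success, message, False
        sleep(interval)
    return False, "timed out", True

# Example: the second check fires on the first iteration.
ok, msg, timed_out = poll_result(
    [(lambda: False, False, "failed"), (lambda: True, True, "published")],
    timeout=1.0, interval=0.1,
)
```

Injecting `now` and `sleep` keeps the loop testable without real waiting, which the browser-bound original cannot offer.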
@@ -1,14 +1,15 @@
 """
 视频合成服务
 """
+import asyncio
 import os
 import subprocess
 import json
 import shlex
 from pathlib import Path
 from loguru import logger
 from typing import Optional

 class VideoService:
     def __init__(self):
         pass
@@ -113,146 +114,155 @@ class VideoService:

         logger.warning("视频方向归一化失败,回退使用原视频")
         return video_path

     def _run_ffmpeg(self, cmd: list) -> bool:
         cmd_str = ' '.join(shlex.quote(str(c)) for c in cmd)
         logger.debug(f"FFmpeg CMD: {cmd_str}")
         try:
-            # Synchronous call for BackgroundTasks compatibility
             result = subprocess.run(
                 cmd,
                 shell=False,
                 capture_output=True,
                 text=True,
                 encoding='utf-8',
+                timeout=600,
             )
             if result.returncode != 0:
                 logger.error(f"FFmpeg Error: {result.stderr}")
                 return False
             return True
+        except subprocess.TimeoutExpired:
+            logger.error("FFmpeg timed out after 600s")
+            return False
         except Exception as e:
             logger.error(f"FFmpeg Exception: {e}")
             return False

     def _get_duration(self, file_path: str) -> float:
         # Synchronous call for BackgroundTasks compatibility
         # 使用参数列表形式避免 shell=True 的命令注入风险
         cmd = [
             'ffprobe', '-v', 'error',
             '-show_entries', 'format=duration',
             '-of', 'default=noprint_wrappers=1:nokey=1',
             file_path
         ]
         try:
             result = subprocess.run(
                 cmd,
                 capture_output=True,
                 text=True,
+                timeout=30,
             )
             return float(result.stdout.strip())
         except Exception:
             return 0.0

     def mix_audio(
         self,
         voice_path: str,
         bgm_path: str,
         output_path: str,
         bgm_volume: float = 0.2
     ) -> str:
         """混合人声与背景音乐"""
         Path(output_path).parent.mkdir(parents=True, exist_ok=True)

         volume = max(0.0, min(float(bgm_volume), 1.0))
         filter_complex = (
             f"[0:a]volume=1.0[a0];"
             f"[1:a]volume={volume}[a1];"
             f"[a0][a1]amix=inputs=2:duration=first:dropout_transition=2:normalize=0[aout]"
         )

         cmd = [
             "ffmpeg", "-y",
             "-i", voice_path,
             "-stream_loop", "-1", "-i", bgm_path,
             "-filter_complex", filter_complex,
             "-map", "[aout]",
             "-c:a", "pcm_s16le",
             "-shortest",
             output_path,
         ]

         if self._run_ffmpeg(cmd):
             return output_path
         raise RuntimeError("FFmpeg audio mix failed")

     async def compose(
         self,
         video_path: str,
         audio_path: str,
         output_path: str,
         subtitle_path: Optional[str] = None
     ) -> str:
         """合成视频"""
         # Ensure output dir
         Path(output_path).parent.mkdir(parents=True, exist_ok=True)

-        video_duration = self._get_duration(video_path)
-        audio_duration = self._get_duration(audio_path)
+        loop = asyncio.get_running_loop()
+        video_duration = await loop.run_in_executor(None, self._get_duration, video_path)
+        audio_duration = await loop.run_in_executor(None, self._get_duration, audio_path)

         # Audio loop if needed
         loop_count = 1
         if audio_duration > video_duration and video_duration > 0:
             loop_count = int(audio_duration / video_duration) + 1

         cmd = ["ffmpeg", "-y"]

         # Input video (stream_loop must be before -i)
         if loop_count > 1:
             cmd.extend(["-stream_loop", str(loop_count)])
         cmd.extend(["-i", video_path])

         # Input audio
         cmd.extend(["-i", audio_path])

         # Filter complex
         filter_complex = []

         # Subtitles (skip for now to mimic previous state or implement basic)
         # Previous state: subtitles disabled due to font issues
         # if subtitle_path: ...

-        # Audio map with high quality encoding
-        cmd.extend([
-            "-c:v", "libx264",
-            "-preset", "slow",  # 慢速预设,更好的压缩效率
-            "-crf", "18",  # 高质量(与 LatentSync 一致)
-            "-c:a", "aac",
-            "-b:a", "192k",  # 音频比特率
-            "-shortest"
-        ])
-        # Use audio from input 1
-        cmd.extend(["-map", "0:v", "-map", "1:a"])
+        # 不需要循环时用流复制(几乎瞬间完成),需要循环时才重编码
+        if loop_count > 1:
+            cmd.extend([
+                "-c:v", "libx264", "-preset", "fast", "-crf", "18",
+            ])
+        else:
+            cmd.extend(["-c:v", "copy"])
+
+        cmd.extend([
+            "-c:a", "aac",
+            "-b:a", "192k",
+            "-shortest",
+            "-map", "0:v", "-map", "1:a",
+        ])

         cmd.append(output_path)

-        if self._run_ffmpeg(cmd):
+        ok = await loop.run_in_executor(None, self._run_ffmpeg, cmd)
+        if ok:
             return output_path
         else:
             raise RuntimeError("FFmpeg composition failed")

     def concat_videos(self, video_paths: list, output_path: str, target_fps: int = 25) -> str:
         """使用 FFmpeg concat demuxer 拼接多个视频片段"""
         if not video_paths:
             raise ValueError("No video segments to concat")

         Path(output_path).parent.mkdir(parents=True, exist_ok=True)

         # 生成 concat list 文件
         list_path = Path(output_path).parent / f"{Path(output_path).stem}_concat.txt"
         with open(list_path, "w", encoding="utf-8") as f:
             for vp in video_paths:
                 f.write(f"file '{vp}'\n")

         cmd = [
             "ffmpeg", "-y",
             "-f", "concat",
@@ -260,48 +270,43 @@ class VideoService:
             "-fflags", "+genpts",
             "-i", str(list_path),
             "-an",
-            "-vsync", "cfr",
-            "-r", str(target_fps),
-            "-c:v", "libx264",
-            "-preset", "fast",
-            "-crf", "18",
-            "-pix_fmt", "yuv420p",
+            "-c:v", "copy",
             "-movflags", "+faststart",
             output_path,
         ]

         try:
             if self._run_ffmpeg(cmd):
                 return output_path
             else:
                 raise RuntimeError("FFmpeg concat failed")
         finally:
             try:
                 list_path.unlink(missing_ok=True)
             except Exception:
                 pass

     def split_audio(self, audio_path: str, start: float, end: float, output_path: str) -> str:
         """用 FFmpeg 按时间范围切分音频"""
         Path(output_path).parent.mkdir(parents=True, exist_ok=True)

         duration = end - start
         if duration <= 0:
             raise ValueError(f"Invalid audio split range: start={start}, end={end}, duration={duration}")

         cmd = [
             "ffmpeg", "-y",
             "-ss", str(start),
             "-t", str(duration),
             "-i", audio_path,
             "-c", "copy",
             output_path,
         ]

         if self._run_ffmpeg(cmd):
             return output_path
         raise RuntimeError(f"FFmpeg audio split failed: {start}-{end}")

     def get_resolution(self, file_path: str) -> tuple[int, int]:
         """获取视频有效显示分辨率(考虑旋转元数据)。"""
         info = self.get_video_metadata(file_path)
@@ -309,7 +314,7 @@ class VideoService:
             int(info.get("effective_width") or 0),
             int(info.get("effective_height") or 0),
         )

     def prepare_segment(self, video_path: str, target_duration: float, output_path: str,
                         target_resolution: Optional[tuple] = None, source_start: float = 0.0,
                         source_end: Optional[float] = None, target_fps: Optional[int] = None) -> str:
@@ -339,6 +344,7 @@ class VideoService:
         needs_loop = target_duration > available
         needs_scale = target_resolution is not None
         needs_fps = bool(target_fps and target_fps > 0)
+        target_fps_value = int(target_fps) if needs_fps and target_fps is not None else None
        has_source_end = clip_end < video_dur

         # 当需要循环且存在截取范围时,先裁剪出片段,再循环裁剪后的文件
@@ -356,50 +362,50 @@ class VideoService:
                 "-c:v", "libx264", "-preset", "fast", "-crf", "18",
                 trim_temp,
             ]
             if not self._run_ffmpeg(trim_cmd):
                 raise RuntimeError(f"FFmpeg trim for loop failed: {video_path}")
             actual_input = trim_temp
             source_start = 0.0  # 已裁剪,不需要再 seek
             # 重新计算循环次数(基于裁剪后文件)
             available = self._get_duration(trim_temp) or available

         loop_count = int(target_duration / available) + 1 if needs_loop else 0

         cmd = ["ffmpeg", "-y"]
         if needs_loop:
             cmd.extend(["-stream_loop", str(loop_count)])
         if source_start > 0:
             cmd.extend(["-ss", str(source_start)])
         cmd.extend(["-i", actual_input, "-t", str(target_duration), "-an"])

         filters = []
-        if needs_fps:
-            filters.append(f"fps={int(target_fps)}")
+        if target_fps_value is not None:
+            filters.append(f"fps={target_fps_value}")
         if needs_scale:
             w, h = target_resolution
             filters.append(f"scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:(ow-iw)/2:(oh-ih)/2")

         if filters:
             cmd.extend(["-vf", ",".join(filters)])
-        if needs_fps:
-            cmd.extend(["-vsync", "cfr", "-r", str(int(target_fps))])
+        if target_fps_value is not None:
+            cmd.extend(["-vsync", "cfr", "-r", str(target_fps_value)])

         # 需要循环、缩放或指定起点时必须重编码,否则用 stream copy 保持原画质
         if needs_loop or needs_scale or source_start > 0 or has_source_end or needs_fps:
             cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "18"])
         else:
             cmd.extend(["-c:v", "copy"])

         cmd.append(output_path)

         try:
             if self._run_ffmpeg(cmd):
                 return output_path
             raise RuntimeError(f"FFmpeg prepare_segment failed: {video_path}")
         finally:
             # 清理裁剪临时文件
             if trim_temp:
                 try:
                     Path(trim_temp).unlink(missing_ok=True)
                 except Exception:
                     pass
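The VideoService changes above consistently build ffmpeg invocations as argument lists run with `shell=False`, and order options carefully: `-stream_loop` must precede the `-i` it applies to, and the video track is stream-copied unless looping forces a re-encode. A minimal sketch of that list-building approach (`build_compose_cmd` is an illustrative helper, not a function in the service):

```python
import shlex

def build_compose_cmd(video_path, audio_path, output_path, loop_count=1):
    """Build an ffmpeg argument list in the style of the new compose() path.

    A list with shell=False sidesteps quoting/injection issues; the command is
    only quoted (via shlex) for logging. Re-encode the video only when it must
    be looped; otherwise copy the video stream untouched.
    """
    cmd = ["ffmpeg", "-y"]
    if loop_count > 1:
        cmd += ["-stream_loop", str(loop_count)]  # applies to the next -i
    cmd += ["-i", video_path, "-i", audio_path]
    if loop_count > 1:
        cmd += ["-c:v", "libx264", "-preset", "fast", "-crf", "18"]
    else:
        cmd += ["-c:v", "copy"]
    cmd += ["-c:a", "aac", "-b:a", "192k", "-shortest",
            "-map", "0:v", "-map", "1:a", output_path]
    return cmd

cmd = build_compose_cmd("in video.mp4", "voice.wav", "out.mp4")
print(shlex.join(cmd))  # spaces in paths stay quoted in the logged command
```

Keeping command construction pure (no subprocess call inside) makes the option ordering assertable in unit tests without ffmpeg installed.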
@@ -32,6 +32,7 @@ class VoiceCloneService:
         ref_text: str,
         language: str,
         speed: float = 1.0,
+        instruct_text: str = "",
         max_retries: int = 4,
     ) -> bytes:
         timeout = httpx.Timeout(240.0)
@@ -39,15 +40,18 @@
         for attempt in range(max_retries):
             try:
                 async with httpx.AsyncClient(timeout=timeout) as client:
+                    data = {
+                        "text": text,
+                        "ref_text": ref_text,
+                        "language": language,
+                        "speed": str(speed),
+                    }
+                    if instruct_text:
+                        data["instruct_text"] = instruct_text
                     response = await client.post(
                         f"{self.base_url}/generate",
                         files={"ref_audio": ("ref.wav", ref_audio_data, "audio/wav")},
-                        data={
-                            "text": text,
-                            "ref_text": ref_text,
-                            "language": language,
-                            "speed": str(speed),
-                        },
+                        data=data,
                     )

                 retryable = False
@@ -99,6 +103,7 @@ class VoiceCloneService:
         output_path: str,
         language: str = "Chinese",
         speed: float = 1.0,
+        instruct_text: str = "",
     ) -> str:
         """
         使用声音克隆生成语音
@@ -132,6 +137,7 @@
             ref_text=ref_text,
             language=language,
             speed=speed,
+            instruct_text=instruct_text,
         )
         with open(output_path, "wb") as f:
             f.write(audio_bytes)
@@ -151,6 +151,46 @@ def split_segment_to_lines(words: List[dict], max_chars: int = MAX_CHARS_PER_LIN
     return segments


+def smooth_word_timestamps(words: List[dict]) -> List[dict]:
+    """
+    时间戳后处理平滑:
+    1. 保证时间戳严格单调递增
+    2. 消除 Whisper 输出中的微小抖动(字的 end > 下一字的 start)
+    3. 填补字间间隙,避免字幕高亮"跳空"
+    """
+    if len(words) <= 1:
+        return words
+
+    result = [words[0].copy()]
+    for i in range(1, len(words)):
+        w = words[i].copy()
+        prev = result[-1]
+
+        # 保证 start 不早于前一字的 start(单调递增)
+        if w["start"] < prev["start"]:
+            w["start"] = prev["start"]
+
+        # 保证 start 不早于前一字的 end
+        if w["start"] < prev["end"]:
+            # 两字重叠,取中点分割
+            mid = (prev["end"] + w["start"]) / 2
+            prev["end"] = round(mid, 3)
+            w["start"] = round(mid, 3)
+
+        # 填补字间间隙(间隙 < 50ms 时直接连接,避免高亮跳空)
+        gap = w["start"] - prev["end"]
+        if 0 < gap < 0.05:
+            prev["end"] = w["start"]
+
+        # 保证 end >= start
+        if w["end"] < w["start"]:
+            w["end"] = w["start"] + 0.05
+
+        result.append(w)
+
+    return result
+
+
 class WhisperService:
     """字幕对齐服务(基于 faster-whisper)"""

@@ -219,6 +259,8 @@ class WhisperService:
             language=language,
             word_timestamps=True,  # 启用字级别时间戳
             vad_filter=True,  # 启用 VAD 过滤静音
+            beam_size=8,  # 增大搜索宽度,提升时间戳精度
+            # condition_on_previous_text 保持默认 True,避免时间戳系统性超前
         )

         logger.info(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
@@ -244,6 +286,7 @@ class WhisperService:
                 all_words.extend(chars)

             if all_words:
+                all_words = smooth_word_timestamps(all_words)
                 line_segments = split_segment_to_lines(all_words, max_chars)
                 all_segments.extend(line_segments)

@@ -268,6 +311,14 @@ class WhisperService:
             w_starts = [c["start"] for c in whisper_chars]
             w_final_end = whisper_chars[-1]["end"]

+            # 字数比例异常检测
+            ratio = n_o / n_w
+            if ratio > 1.5 or ratio < 0.67:
+                logger.warning(
+                    f"original_text 与 Whisper 字数比例异常: {n_o}/{n_w} = {ratio:.2f}, "
+                    f"字幕时间戳精度可能下降"
+                )
+
             logger.info(
                 f"Using original_text for subtitles (len={len(original_text)}), "
                 f"rhythm-mapping {n_o} orig chars onto {n_w} Whisper chars, "
@@ -302,11 +353,21 @@ class WhisperService:
                     "end": round(t_end, 3),
                 })

-            all_segments = split_segment_to_lines(remapped, max_chars)
+            # 限制单字时长范围,防止比例异常时极端漂移
+            MIN_CHAR_DURATION = 0.04  # 40ms(一帧@25fps)
+            MAX_CHAR_DURATION = 0.8  # 800ms
+            for r in remapped:
+                dur = r["end"] - r["start"]
+                if dur < MIN_CHAR_DURATION:
+                    r["end"] = round(r["start"] + MIN_CHAR_DURATION, 3)
+                elif dur > MAX_CHAR_DURATION:
+                    r["end"] = round(r["start"] + MAX_CHAR_DURATION, 3)
+
+            all_segments = split_segment_to_lines(smooth_word_timestamps(remapped), max_chars)
             logger.info(f"Rebuilt {len(all_segments)} subtitle segments (rhythm-mapped)")
         elif orig_chars:
             # Whisper 字符不足,退回线性插值
-            all_segments = split_segment_to_lines(orig_chars, max_chars)
+            all_segments = split_segment_to_lines(smooth_word_timestamps(orig_chars), max_chars)
             logger.info(f"Rebuilt {len(all_segments)} subtitle segments (linear fallback)")

         logger.info(f"Generated {len(all_segments)} subtitle segments")
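The `smooth_word_timestamps` helper added above is pure data-in/data-out, so it can be exercised standalone. Copying its body (type hints dropped for brevity), an overlapping pair gets split at the midpoint and a sub-50ms gap gets closed, leaving the timeline monotonic:

```python
def smooth_word_timestamps(words):
    # Copy of the helper introduced in the diff above.
    if len(words) <= 1:
        return words
    result = [words[0].copy()]
    for i in range(1, len(words)):
        w = words[i].copy()
        prev = result[-1]
        if w["start"] < prev["start"]:          # enforce monotonic starts
            w["start"] = prev["start"]
        if w["start"] < prev["end"]:            # overlap: split at midpoint
            mid = (prev["end"] + w["start"]) / 2
            prev["end"] = round(mid, 3)
            w["start"] = round(mid, 3)
        gap = w["start"] - prev["end"]
        if 0 < gap < 0.05:                      # close sub-50ms gaps
            prev["end"] = w["start"]
        if w["end"] < w["start"]:               # never negative duration
            w["end"] = w["start"] + 0.05
        result.append(w)
    return result

words = [
    {"word": "你", "start": 0.00, "end": 0.30},
    {"word": "好", "start": 0.25, "end": 0.50},  # overlaps previous end
    {"word": "吗", "start": 0.52, "end": 0.80},  # 20ms gap, gets closed
]
smoothed = smooth_word_timestamps(words)
```

Because each element is `.copy()`-ed before mutation, the caller's original list stays untouched.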
@@ -54,5 +54,61 @@
     "letter_spacing": 1,
     "bottom_margin": 72,
     "is_default": false
-  }
+  },
+  {
+    "id": "subtitle_pink",
+    "label": "少女粉",
+    "font_file": "DingTalk JinBuTi.ttf",
+    "font_family": "DingTalkJinBuTi",
+    "font_size": 56,
+    "highlight_color": "#FF69B4",
+    "normal_color": "#FFFFFF",
+    "stroke_color": "#1A0010",
+    "stroke_size": 3,
+    "letter_spacing": 2,
+    "bottom_margin": 80,
+    "is_default": false
+  },
+  {
+    "id": "subtitle_lime",
+    "label": "清新绿",
+    "font_file": "DingTalk Sans.ttf",
+    "font_family": "DingTalkSans",
+    "font_size": 50,
+    "highlight_color": "#76FF03",
+    "normal_color": "#FFFFFF",
+    "stroke_color": "#001A00",
+    "stroke_size": 3,
+    "letter_spacing": 1,
+    "bottom_margin": 78,
+    "is_default": false
+  },
+  {
+    "id": "subtitle_gold",
+    "label": "金色隶书",
+    "font_file": "阿里妈妈刀隶体.ttf",
+    "font_family": "AliMamaDaoLiTi",
+    "font_size": 56,
+    "highlight_color": "#FDE68A",
+    "normal_color": "#E8D5B0",
+    "stroke_color": "#2B1B00",
+    "stroke_size": 3,
+    "letter_spacing": 3,
+    "bottom_margin": 80,
+    "is_default": false
+  },
+  {
+    "id": "subtitle_kai",
+    "label": "楷体红字",
+    "font_file": "simkai.ttf",
+    "font_family": "SimKai",
+    "font_size": 54,
+    "highlight_color": "#FF4444",
+    "normal_color": "#FFFFFF",
+    "stroke_color": "#000000",
+    "stroke_size": 3,
+    "letter_spacing": 2,
+    "bottom_margin": 80,
+    "is_default": false
+  }
 ]
@@ -7,7 +7,7 @@
|
||||
"font_size": 90,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#000000",
|
||||
"stroke_size": 8,
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 5,
|
||||
"top_margin": 62,
|
||||
"font_weight": 900,
|
||||
@@ -21,7 +21,7 @@
|
||||
"font_size": 72,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#000000",
|
||||
"stroke_size": 8,
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 4,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
@@ -35,7 +35,7 @@
|
||||
"font_size": 70,
|
||||
"color": "#FDE68A",
|
||||
"stroke_color": "#2B1B00",
|
||||
"stroke_size": 8,
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 3,
|
||||
"top_margin": 58,
|
||||
"font_weight": 800,
|
||||
@@ -49,10 +49,122 @@
|
||||
"font_size": 72,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#1F0A00",
|
||||
"stroke_size": 8,
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 4,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_pangmen",
|
||||
"label": "庞门正道",
|
||||
"font_file": "title/庞门正道标题体3.0.ttf",
|
||||
"font_family": "PangMenZhengDao",
|
||||
"font_size": 80,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#000000",
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 5,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_round",
|
||||
"label": "优设标题圆",
|
||||
"font_file": "title/优设标题圆.otf",
|
||||
"font_family": "YouSheBiaoTiYuan",
|
||||
"font_size": 78,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#4A1A6B",
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 4,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_alibaba",
|
||||
"label": "阿里数黑体",
|
||||
"font_file": "title/阿里巴巴数黑体.ttf",
|
||||
"font_family": "AlibabaShuHeiTi",
|
||||
"font_size": 72,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#000000",
|
||||
"stroke_size": 4,
|
||||
"letter_spacing": 3,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_chaohei",
|
||||
"label": "文道潮黑",
|
||||
"font_file": "title/文道潮黑.ttf",
|
||||
"font_family": "WenDaoChaoHei",
|
||||
"font_size": 76,
|
||||
"color": "#00E5FF",
|
||||
"stroke_color": "#001A33",
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 4,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_wujie",
|
||||
"label": "无界黑",
|
||||
"font_file": "title/标小智无界黑.otf",
|
||||
"font_family": "BiaoXiaoZhiWuJieHei",
|
||||
"font_size": 74,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#1A1A1A",
|
||||
"stroke_size": 4,
|
||||
"letter_spacing": 3,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_houdi",
|
||||
"label": "厚底黑",
|
||||
"font_file": "title/Aa厚底黑.ttf",
|
||||
"font_family": "AaHouDiHei",
|
||||
"font_size": 76,
|
||||
"color": "#FF6B6B",
|
||||
"stroke_color": "#1A0000",
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 4,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_banyuan",
|
||||
"label": "寒蝉半圆体",
|
||||
"font_file": "title/寒蝉半圆体.otf",
|
||||
"font_family": "HanChanBanYuan",
|
||||
"font_size": 78,
|
||||
"color": "#FFFFFF",
|
||||
"stroke_color": "#000000",
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 4,
|
||||
"top_margin": 60,
|
||||
"font_weight": 900,
|
||||
"is_default": false
|
||||
},
|
||||
{
|
||||
"id": "title_jixiang",
|
||||
"label": "欣意吉祥宋",
|
||||
"font_file": "title/字体圈欣意吉祥宋.ttf",
|
||||
"font_family": "XinYiJiXiangSong",
|
||||
"font_size": 70,
|
||||
"color": "#FDE68A",
|
||||
"stroke_color": "#2B1B00",
|
||||
"stroke_size": 5,
|
||||
"letter_spacing": 3,
|
||||
"top_margin": 58,
|
||||
"font_weight": 800,
|
||||
"is_default": false
|
||||
}
|
||||
]
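Every entry in the title-style config above shares the same shape. A minimal TypeScript sketch of that shape, plus a lookup helper that falls back to the entry marked `is_default` (or the first entry) — note `resolveTitleStyle` is an illustrative name, not a function from this repo:

```typescript
// Shape of one entry in the title-style config array above.
interface TitleStyle {
  id: string;
  label: string;
  font_file: string;
  font_family: string;
  font_size: number;
  color: string;          // fill color, e.g. "#FFFFFF"
  stroke_color: string;   // outline color
  stroke_size: number;
  letter_spacing: number;
  top_margin: number;
  font_weight: number;
  is_default: boolean;
}

// Hypothetical helper: resolve a style by id, falling back to the
// default-flagged entry, then to the first entry.
function resolveTitleStyle(styles: TitleStyle[], id?: string): TitleStyle | null {
  if (styles.length === 0) return null;
  if (id) {
    const found = styles.find((s) => s.id === id);
    if (found) return found;
  }
  return styles.find((s) => s.is_default) ?? styles[0];
}
```

Since all entries in the config above set `is_default: false`, a fallback to the first entry is what would actually fire under this sketch.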
@@ -4,6 +4,7 @@ import { useState, useEffect, useRef } from "react";
import { useAuth } from "@/shared/contexts/AuthContext";
import api from "@/shared/api/axios";
import { ApiResponse } from "@/shared/api/types";
import { AppModal, AppModalHeader } from "@/shared/ui/AppModal";

// Account settings dropdown component
export default function AccountSettingsDropdown() {
@@ -90,6 +91,15 @@ export default function AccountSettingsDropdown() {
    }
  };

  const closePasswordModal = () => {
    setShowPasswordModal(false);
    setError('');
    setSuccess('');
    setOldPassword('');
    setNewPassword('');
    setConfirmPassword('');
  };

  return (
    <div className="relative" ref={dropdownRef}>
      <button
@@ -137,81 +147,83 @@ export default function AccountSettingsDropdown() {

      {/* Change-password modal */}
      {showPasswordModal && (
        <div className="fixed inset-0 z-[200] flex items-start justify-center pt-20 bg-black/60 backdrop-blur-sm p-4">
          <div className="w-full max-w-md p-6 bg-gray-900 border border-white/10 rounded-2xl shadow-2xl mx-4">
            <h3 className="text-xl font-bold text-white mb-4">修改密码</h3>
            <form onSubmit={handleChangePassword} className="space-y-4">
              <div>
                <label className="block text-sm text-gray-300 mb-1">当前密码</label>
                <input
                  type="password"
                  value={oldPassword}
                  onChange={(e) => setOldPassword(e.target.value)}
                  required
                  className="w-full px-3 py-2 bg-white/5 border border-white/10 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-purple-500"
                  placeholder="输入当前密码"
                />
              </div>
              <div>
                <label className="block text-sm text-gray-300 mb-1">新密码</label>
                <input
                  type="password"
                  value={newPassword}
                  onChange={(e) => setNewPassword(e.target.value)}
                  required
                  className="w-full px-3 py-2 bg-white/5 border border-white/10 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-purple-500"
                  placeholder="至少6位"
                />
              </div>
              <div>
                <label className="block text-sm text-gray-300 mb-1">确认新密码</label>
                <input
                  type="password"
                  value={confirmPassword}
                  onChange={(e) => setConfirmPassword(e.target.value)}
                  required
                  className="w-full px-3 py-2 bg-white/5 border border-white/10 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-purple-500"
                  placeholder="再次输入新密码"
                />
              </div>
        <AppModal
          isOpen={showPasswordModal}
          onClose={closePasswordModal}
          zIndexClassName="z-[200]"
          panelClassName="w-full max-w-md rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden"
          closeOnOverlay={false}
        >
          <AppModalHeader
            title="修改密码"
            subtitle="修改后将自动退出并重新登录"
            onClose={closePasswordModal}
          />

          {error && (
            <div className="p-2 bg-red-500/20 border border-red-500/50 rounded text-red-200 text-sm">
              {error}
            </div>
          )}
          {success && (
            <div className="p-2 bg-green-500/20 border border-green-500/50 rounded text-green-200 text-sm">
              {success}
            </div>
          )}
          <form onSubmit={handleChangePassword} className="space-y-4 p-5">
            <div>
              <label className="block text-sm text-gray-300 mb-1">当前密码</label>
              <input
                type="password"
                value={oldPassword}
                onChange={(e) => setOldPassword(e.target.value)}
                required
                className="w-full px-3 py-2 bg-white/5 border border-white/10 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-purple-500"
                placeholder="输入当前密码"
              />
            </div>
            <div>
              <label className="block text-sm text-gray-300 mb-1">新密码</label>
              <input
                type="password"
                value={newPassword}
                onChange={(e) => setNewPassword(e.target.value)}
                required
                className="w-full px-3 py-2 bg-white/5 border border-white/10 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-purple-500"
                placeholder="至少6位"
              />
            </div>
            <div>
              <label className="block text-sm text-gray-300 mb-1">确认新密码</label>
              <input
                type="password"
                value={confirmPassword}
                onChange={(e) => setConfirmPassword(e.target.value)}
                required
                className="w-full px-3 py-2 bg-white/5 border border-white/10 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-purple-500"
                placeholder="再次输入新密码"
              />
            </div>

            <div className="flex gap-3 pt-2">
              <button
                type="button"
                onClick={() => {
                  setShowPasswordModal(false);
                  setError('');
                  setSuccess('');
                  setOldPassword('');
                  setNewPassword('');
                  setConfirmPassword('');
                }}
                className="flex-1 py-2 bg-white/10 hover:bg-white/20 text-white rounded-lg transition-colors"
              >
                取消
              </button>
              <button
                type="submit"
                disabled={loading}
                className="flex-1 py-2 bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white rounded-lg transition-colors disabled:opacity-50"
              >
                {loading ? '修改中...' : '确认修改'}
              </button>
            {error && (
              <div className="p-2 bg-red-500/20 border border-red-500/50 rounded text-red-200 text-sm">
                {error}
              </div>
            </form>
          </div>
        </div>
      )}
            {success && (
              <div className="p-2 bg-green-500/20 border border-green-500/50 rounded text-green-200 text-sm">
                {success}
              </div>
            )}

            <div className="flex gap-3 pt-2">
              <button
                type="button"
                onClick={closePasswordModal}
                className="flex-1 py-2 bg-white/10 hover:bg-white/20 text-white rounded-lg transition-colors"
              >
                取消
              </button>
              <button
                type="submit"
                disabled={loading}
                className="flex-1 py-2 bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white rounded-lg transition-colors disabled:opacity-50"
              >
                {loading ? '修改中...' : '确认修改'}
              </button>
            </div>
          </form>
        </AppModal>
      )}
    </div>
  );
@@ -1,7 +1,7 @@
"use client";

import { useEffect } from "react";
import { X, Video } from "lucide-react";
import { Video } from "lucide-react";
import { AppModal, AppModalHeader } from "@/shared/ui/AppModal";

interface VideoPreviewModalProps {
  videoUrl: string | null;
@@ -16,66 +16,34 @@ export default function VideoPreviewModal({
  title = "视频预览",
  subtitle = "ESC 关闭 · 点击空白关闭",
}: VideoPreviewModalProps) {
  useEffect(() => {
    if (!videoUrl) return;
    // Close on ESC
    const handleEsc = (e: KeyboardEvent) => {
      if (e.key === 'Escape') onClose();
    };
    const prevOverflow = document.body.style.overflow;
    document.addEventListener('keydown', handleEsc);
    // Prevent background scrolling
    document.body.style.overflow = 'hidden';
  if (!videoUrl) return null;

    return () => {
      document.removeEventListener('keydown', handleEsc);
      document.body.style.overflow = prevOverflow;
    };
  }, [videoUrl, onClose]);
  return (
    <AppModal
      isOpen={Boolean(videoUrl)}
      onClose={onClose}
      zIndexClassName="z-[320]"
      panelClassName="relative w-full max-w-4xl rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden flex flex-col"
      closeOnOverlay
    >
      <div data-video-preview-open="true" className="flex flex-col">
        <AppModalHeader
          title={title}
          subtitle={subtitle}
          icon={<Video className="h-5 w-5" />}
          onClose={onClose}
        />

  if (!videoUrl) return null;

  return (
    <div
      className="fixed inset-0 z-[200] flex items-center justify-center bg-black/80 backdrop-blur-sm p-4 animate-in fade-in duration-200"
      onClick={onClose}
    >
      <div
        className="relative w-full max-w-4xl bg-gray-900 border border-white/10 rounded-2xl shadow-2xl overflow-hidden flex flex-col"
        onClick={(e) => e.stopPropagation()}
      >
        <div className="flex items-center justify-between px-6 py-3 border-b border-white/10 bg-gradient-to-r from-white/5 via-white/0 to-white/5">
          <div className="flex items-center gap-3">
            <div className="h-9 w-9 rounded-lg bg-white/10 flex items-center justify-center text-white">
              <Video className="h-5 w-5" />
            </div>
            <div>
              <h3 className="text-lg font-semibold text-white">
                {title}
              </h3>
              <p className="text-xs text-gray-400">
                {subtitle}
              </p>
            </div>
          </div>
          <button
            onClick={onClose}
            className="p-2 text-gray-400 hover:text-white hover:bg-white/10 rounded-lg transition-colors"
          >
            <X className="h-5 w-5" />
          </button>
        </div>

        <div className="bg-black flex items-center justify-center min-h-[50vh] max-h-[80vh]">
          <video
            src={videoUrl}
            controls
            autoPlay
            preload="metadata"
            className="w-full h-full max-h-[80vh] object-contain"
          />
        </div>
      </div>
        <div className="bg-black flex items-center justify-center min-h-[50vh] max-h-[80vh]">
          <video
            src={videoUrl}
            controls
            autoPlay
            preload="metadata"
            className="w-full h-full max-h-[80vh] object-contain"
          />
        </div>
  );
      </div>
    </AppModal>
  );
}
@@ -127,6 +127,7 @@ export const useGeneratedAudios = ({
    ref_text?: string;
    language: string;
    speed?: number;
    instruct_text?: string;
  }) => {
    setIsGeneratingAudio(true);
    setAudioTask({ status: "pending", progress: 0, message: "正在提交..." });

@@ -12,7 +12,7 @@ interface GeneratedVideo {
}

interface UseGeneratedVideosOptions {
  storageKey: string;
  selectedVideoId: string | null;
  setSelectedVideoId: React.Dispatch<React.SetStateAction<string | null>>;
  setGeneratedVideo: React.Dispatch<React.SetStateAction<string | null>>;
@@ -20,7 +20,7 @@ interface UseGeneratedVideosOptions {
}

export const useGeneratedVideos = ({
  storageKey,
  selectedVideoId,
  setSelectedVideoId,
  setGeneratedVideo,
@@ -45,6 +45,8 @@ export const useGeneratedVideos = ({
      if (preferVideoId === "__latest__") {
        setSelectedVideoId(videos[0].id);
        setGeneratedVideo(resolveMediaUrl(videos[0].path));
        // Write a cross-page marker so the other page can also pick up the newest generated video
        localStorage.setItem(`vigent_${storageKey}_latestGeneratedVideoId`, videos[0].id);
      } else {
        const found = videos.find(v => v.id === preferVideoId);
        if (found) {
@@ -1,4 +1,4 @@
import { useEffect, useRef, useState } from "react";
import { useEffect, useMemo, useRef, useState } from "react";
import api from "@/shared/api/axios";
import {
  buildTextShadow,
@@ -26,6 +26,7 @@ import { useRefAudios } from "@/features/home/model/useRefAudios";
import { useTitleSubtitleStyles } from "@/features/home/model/useTitleSubtitleStyles";
import { useTimelineEditor } from "@/features/home/model/useTimelineEditor";
import { useSavedScripts } from "@/features/home/model/useSavedScripts";
import { useVideoFrameCapture } from "@/features/home/model/useVideoFrameCapture";
import { ApiResponse, unwrap } from "@/shared/api/types";

const VOICES: Record<string, { id: string; name: string }[]> = {
@@ -123,6 +124,8 @@ interface RefAudio {
  created_at: number;
}

type LipsyncModelMode = "default" | "fast" | "advanced";

import type { Material } from "@/shared/types/material";

export const useHomeController = () => {
@@ -154,6 +157,7 @@ export const useHomeController = () => {
  const [titleDisplayMode, setTitleDisplayMode] = useState<"short" | "persistent">("short");
  const [subtitleBottomMargin, setSubtitleBottomMargin] = useState<number>(80);
  const [outputAspectRatio, setOutputAspectRatio] = useState<"9:16" | "16:9">("9:16");
  const [lipsyncModelMode, setLipsyncModelMode] = useState<LipsyncModelMode>("default");
  const [showStylePreview, setShowStylePreview] = useState<boolean>(false);
  const [materialDimensions, setMaterialDimensions] = useState<{ width: number; height: number } | null>(null);

@@ -181,6 +185,9 @@ export const useHomeController = () => {
  // Speech-rate control
  const [speed, setSpeed] = useState<number>(1.0);

  // Emotion control (voice-clone mode only)
  const [emotion, setEmotion] = useState<string>("normal");

  // ClipTrimmer modal state
  const [clipTrimmerOpen, setClipTrimmerOpen] = useState(false);
  const [clipTrimmerSegmentId, setClipTrimmerSegmentId] = useState<string | null>(null);
@@ -280,6 +287,9 @@ export const useHomeController = () => {
  // Script-extraction modal
  const [extractModalOpen, setExtractModalOpen] = useState(false);

  // AI rewrite modal
  const [rewriteModalOpen, setRewriteModalOpen] = useState(false);

  // Storage-key prefix (userId when logged in, "guest" otherwise)
  const storageKey = userId || "guest";

@@ -361,7 +371,7 @@ export const useHomeController = () => {
    fetchGeneratedVideos,
    deleteVideo,
  } = useGeneratedVideos({
    storageKey,
    selectedVideoId,
    setSelectedVideoId,
    setGeneratedVideo,
@@ -395,6 +405,19 @@ export const useHomeController = () => {
    storageKey,
  });

  // Video URL of the first timeline segment's material (for frame-capture preview).
  // Use the backend proxy URL (same-origin) to avoid CORS canvas taint.
  const firstTimelineMaterialUrl = useMemo(() => {
    const firstSeg = timelineSegments[0];
    const matId = firstSeg?.materialId ?? selectedMaterials[0];
    if (!matId) return null;
    const mat = materials.find((m) => m.id === matId);
    if (!mat) return null;
    return `/api/materials/stream/${mat.id}`;
  }, [materials, timelineSegments, selectedMaterials]);

  const materialPosterUrl = useVideoFrameCapture(showStylePreview ? firstTimelineMaterialUrl : null);

  useEffect(() => {
    if (isAuthLoading || !userId) return;
    let active = true;
@@ -472,6 +495,8 @@ export const useHomeController = () => {
    setSubtitleBottomMargin,
    outputAspectRatio,
    setOutputAspectRatio,
    lipsyncModelMode,
    setLipsyncModelMode,
    selectedBgmId,
    setSelectedBgmId,
    bgmVolume,
@@ -485,6 +510,8 @@ export const useHomeController = () => {
    setSelectedAudioId,
    speed,
    setSpeed,
    emotion,
    setEmotion,
  });

  const { savedScripts, saveScript, deleteScript: deleteSavedScript } = useSavedScripts(storageKey);
@@ -708,6 +735,9 @@ export const useHomeController = () => {
  // Start recording
  const startRecording = async () => {
    try {
      setRecordedBlob(null);
      setRecordingTime(0);

      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const mediaRecorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
      const chunks: BlobPart[] = [];
@@ -721,7 +751,6 @@ export const useHomeController = () => {

      mediaRecorder.start();
      setIsRecording(true);
      setRecordingTime(0);
      mediaRecorderRef.current = mediaRecorder;

      // Timer
@@ -757,6 +786,11 @@ export const useHomeController = () => {
    setRecordingTime(0);
  };

  const discardRecording = () => {
    setRecordedBlob(null);
    setRecordingTime(0);
  };

  // Format recording duration
  const formatRecordingTime = (seconds: number) => {
    const mins = Math.floor(seconds / 60);
@@ -859,6 +893,13 @@ export const useHomeController = () => {
      return;
    }

    const emotionToInstruct: Record<string, string> = {
      normal: "",
      happy: "You are a helpful assistant. 请非常开心地说一句话。<|endofprompt|>",
      sad: "You are a helpful assistant. 请非常伤心地说一句话。<|endofprompt|>",
      angry: "You are a helpful assistant. 请非常生气地说一句话。<|endofprompt|>",
    };

    const params = {
      text: text.trim(),
      tts_mode: ttsMode,
@@ -867,6 +908,7 @@ export const useHomeController = () => {
      ref_text: ttsMode === "voiceclone" ? refText : undefined,
      language: textLang,
      speed: ttsMode === "voiceclone" ? speed : undefined,
      instruct_text: ttsMode === "voiceclone" ? emotionToInstruct[emotion] || "" : undefined,
    };
    await generateAudio(params);
  };
@@ -904,6 +946,7 @@ export const useHomeController = () => {
      text: selectedAudio.text || text,
      generated_audio_id: selectedAudio.id,
      language: selectedAudio.language || textLang,
      lipsync_model: lipsyncModelMode,
      title: videoTitle.trim() || undefined,
      enable_subtitles: true,
      output_aspect_ratio: outputAspectRatio,
@@ -1004,7 +1047,7 @@ export const useHomeController = () => {

    if (enableBgm && selectedBgmId) {
      payload.bgm_id = selectedBgmId;
      payload.bgm_volume = bgmVolume;
      payload.bgm_volume = 0.2;
    }

    // Create the generation task
@@ -1080,6 +1123,8 @@ export const useHomeController = () => {
    setText,
    extractModalOpen,
    setExtractModalOpen,
    rewriteModalOpen,
    setRewriteModalOpen,
    handleGenerateMeta,
    isGeneratingMeta,
    handleTranslate,
@@ -1122,10 +1167,13 @@ export const useHomeController = () => {
    setSubtitleBottomMargin,
    outputAspectRatio,
    setOutputAspectRatio,
    lipsyncModelMode,
    setLipsyncModelMode,
    resolveAssetUrl,
    getFontFormat,
    buildTextShadow,
    materialDimensions,
    materialPosterUrl,
    ttsMode,
    setTtsMode,
    voices: VOICES[textLang] || VOICES["zh-CN"],
@@ -1157,6 +1205,7 @@ export const useHomeController = () => {
    startRecording,
    stopRecording,
    useRecording,
    discardRecording,
    formatRecordingTime,
    bgmList,
    bgmLoading,
@@ -1195,6 +1244,8 @@ export const useHomeController = () => {
    selectAudio,
    speed,
    setSpeed,
    emotion,
    setEmotion,
    timelineSegments,
    reorderSegments,
    setSourceRange,
@@ -52,6 +52,8 @@ interface UseHomePersistenceOptions {
  setSubtitleBottomMargin: React.Dispatch<React.SetStateAction<number>>;
  outputAspectRatio: '9:16' | '16:9';
  setOutputAspectRatio: React.Dispatch<React.SetStateAction<'9:16' | '16:9'>>;
  lipsyncModelMode: 'default' | 'fast' | 'advanced';
  setLipsyncModelMode: React.Dispatch<React.SetStateAction<'default' | 'fast' | 'advanced'>>;
  selectedBgmId: string;
  setSelectedBgmId: React.Dispatch<React.SetStateAction<string>>;
  bgmVolume: number;
@@ -65,6 +67,8 @@ interface UseHomePersistenceOptions {
  setSelectedAudioId: React.Dispatch<React.SetStateAction<string | null>>;
  speed: number;
  setSpeed: React.Dispatch<React.SetStateAction<number>>;
  emotion: string;
  setEmotion: React.Dispatch<React.SetStateAction<string>>;
}

export const useHomePersistence = ({
@@ -109,6 +113,8 @@ export const useHomePersistence = ({
  setSubtitleBottomMargin,
  outputAspectRatio,
  setOutputAspectRatio,
  lipsyncModelMode,
  setLipsyncModelMode,
  selectedBgmId,
  setSelectedBgmId,
  bgmVolume,
@@ -122,6 +128,8 @@ export const useHomePersistence = ({
  setSelectedAudioId,
  speed,
  setSpeed,
  emotion,
  setEmotion,
}: UseHomePersistenceOptions) => {
  const [isRestored, setIsRestored] = useState(false);

@@ -142,7 +150,8 @@ export const useHomePersistence = ({
    const savedTitleFontSize = localStorage.getItem(`vigent_${storageKey}_titleFontSize`);
    const savedSecondaryTitleFontSize = localStorage.getItem(`vigent_${storageKey}_secondaryTitleFontSize`);
    const savedBgmId = localStorage.getItem(`vigent_${storageKey}_bgmId`);
    const savedSelectedVideoId = localStorage.getItem(`vigent_${storageKey}_selectedVideoId`);
    const savedSelectedVideoId = localStorage.getItem(`vigent_${storageKey}_latestGeneratedVideoId`)
      || localStorage.getItem(`vigent_${storageKey}_selectedVideoId`);
    const savedSelectedAudioId = localStorage.getItem(`vigent_${storageKey}_selectedAudioId`);
    const savedBgmVolume = localStorage.getItem(`vigent_${storageKey}_bgmVolume`);
    const savedEnableBgm = localStorage.getItem(`vigent_${storageKey}_enableBgm`);
@@ -151,7 +160,9 @@ export const useHomePersistence = ({
    const savedTitleDisplayMode = localStorage.getItem(`vigent_${storageKey}_titleDisplayMode`);
    const savedSubtitleBottomMargin = localStorage.getItem(`vigent_${storageKey}_subtitleBottomMargin`);
    const savedOutputAspectRatio = localStorage.getItem(`vigent_${storageKey}_outputAspectRatio`);
    const savedLipsyncModelMode = localStorage.getItem(`vigent_${storageKey}_lipsyncModelMode`);
    const savedSpeed = localStorage.getItem(`vigent_${storageKey}_speed`);
    const savedEmotion = localStorage.getItem(`vigent_${storageKey}_emotion`);

    setText(savedText || "大家好,欢迎来到我的频道,今天给大家分享一些有趣的内容。");
    setVideoTitle(savedTitle ? clampTitle(savedTitle) : "");
@@ -205,6 +216,8 @@ export const useHomePersistence = ({
    if (savedBgmVolume) setBgmVolume(parseFloat(savedBgmVolume));
    if (savedEnableBgm !== null) setEnableBgm(savedEnableBgm === 'true');
    if (savedSelectedVideoId) setSelectedVideoId(savedSelectedVideoId);
    // Clear the cross-page marker after consuming it, to avoid repeated overrides
    localStorage.removeItem(`vigent_${storageKey}_latestGeneratedVideoId`);
    if (savedSelectedAudioId) setSelectedAudioId(savedSelectedAudioId);

    if (savedTitleTopMargin) {
@@ -227,11 +240,21 @@ export const useHomePersistence = ({
      setOutputAspectRatio(savedOutputAspectRatio);
    }

    if (
      savedLipsyncModelMode === 'default'
      || savedLipsyncModelMode === 'fast'
      || savedLipsyncModelMode === 'advanced'
    ) {
      setLipsyncModelMode(savedLipsyncModelMode);
    }

    if (savedSpeed) {
      const parsed = parseFloat(savedSpeed);
      if (!Number.isNaN(parsed)) setSpeed(parsed);
    }

    if (savedEmotion) setEmotion(savedEmotion);

    // eslint-disable-next-line react-hooks/set-state-in-effect
    setIsRestored(true);
  }, [
@@ -246,6 +269,7 @@ export const useHomePersistence = ({
    setSelectedVideoId,
    setSelectedAudioId,
    setSpeed,
    setEmotion,
    setSubtitleFontSize,
    setSubtitleSizeLocked,
    setText,
@@ -259,6 +283,7 @@ export const useHomePersistence = ({
    setTitleDisplayMode,
    setSubtitleBottomMargin,
    setOutputAspectRatio,
    setLipsyncModelMode,
    setTtsMode,
    setVideoTitle,
    setVideoSecondaryTitle,
@@ -374,6 +399,12 @@ export const useHomePersistence = ({
    }
  }, [outputAspectRatio, storageKey, isRestored]);

  useEffect(() => {
    if (isRestored) {
      localStorage.setItem(`vigent_${storageKey}_lipsyncModelMode`, lipsyncModelMode);
    }
  }, [lipsyncModelMode, storageKey, isRestored]);

  useEffect(() => {
    if (isRestored) {
      localStorage.setItem(`vigent_${storageKey}_bgmId`, selectedBgmId);
@@ -424,5 +455,11 @@ export const useHomePersistence = ({
    }
  }, [speed, storageKey, isRestored]);

  useEffect(() => {
    if (isRestored) {
      localStorage.setItem(`vigent_${storageKey}_emotion`, emotion);
    }
  }, [emotion, storageKey, isRestored]);

  return { isRestored };
};
93  frontend/src/features/home/model/useVideoFrameCapture.ts  Normal file
@@ -0,0 +1,93 @@
import { useEffect, useState } from "react";

/** The preview window is at most 280px wide; the capture never needs to exceed this size */
const MAX_CAPTURE_WIDTH = 480;

/**
 * Capture the frame at 0.1s from a video URL and return it as a JPEG data URL.
 * Returns null on failure (falls back to the gradient background).
 */
export function useVideoFrameCapture(videoUrl: string | null): string | null {
  const [frameUrl, setFrameUrl] = useState<string | null>(null);

  useEffect(() => {
    if (!videoUrl) {
      setFrameUrl(null);
      return;
    }

    let isActive = true;
    const video = document.createElement("video");
    video.muted = true;
    video.preload = "auto";
    video.playsInline = true;

    const cleanup = () => {
      video.removeEventListener("loadedmetadata", onLoaded);
      video.removeEventListener("canplay", onLoaded);
      video.removeEventListener("seeked", onSeeked);
      video.removeEventListener("error", onError);
      video.src = "";
      video.load();
    };

    const onSeeked = () => {
      if (!isActive) return;
      try {
        const vw = video.videoWidth;
        const vh = video.videoHeight;
        if (!vw || !vh) {
          if (isActive) setFrameUrl(null);
          cleanup();
          return;
        }

        const scale = Math.min(1, MAX_CAPTURE_WIDTH / vw);
        const cw = Math.round(vw * scale);
        const ch = Math.round(vh * scale);

        const canvas = document.createElement("canvas");
        canvas.width = cw;
        canvas.height = ch;
        const ctx = canvas.getContext("2d");
        if (!ctx) {
          if (isActive) setFrameUrl(null);
          cleanup();
          return;
        }
        ctx.drawImage(video, 0, 0, cw, ch);
        const dataUrl = canvas.toDataURL("image/jpeg", 0.7);
        if (isActive) setFrameUrl(dataUrl);
      } catch {
        if (isActive) setFrameUrl(null);
      }
      cleanup();
    };

    let seeked = false;
    const onLoaded = () => {
      if (!isActive || seeked) return;
      seeked = true;
      video.currentTime = 0.1;
    };

    const onError = () => {
      if (isActive) setFrameUrl(null);
      cleanup();
    };

    // Attach listeners before setting src
    video.addEventListener("loadedmetadata", onLoaded);
    video.addEventListener("canplay", onLoaded);
    video.addEventListener("seeked", onSeeked);
    video.addEventListener("error", onError);
    video.src = videoUrl;

    return () => {
      isActive = false;
      cleanup();
    };
  }, [videoUrl]);

  return frameUrl;
}
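The new `useVideoFrameCapture` hook downscales the captured frame to at most `MAX_CAPTURE_WIDTH` pixels wide while preserving the aspect ratio and never upscaling. That sizing rule can be isolated as a pure function for testing (a sketch for illustration; `computeCaptureSize` is not a name from this repo):

```typescript
const MAX_CAPTURE_WIDTH = 480;

// Mirror of the hook's canvas sizing: clamp the width to MAX_CAPTURE_WIDTH,
// keep the aspect ratio, and cap the scale at 1 so small videos are never upscaled.
function computeCaptureSize(vw: number, vh: number): { width: number; height: number } {
  const scale = Math.min(1, MAX_CAPTURE_WIDTH / vw);
  return { width: Math.round(vw * scale), height: Math.round(vh * scale) };
}
```

Under this rule a 1920×1080 source is captured at 480×270, while a 320×240 source is captured at its native size.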
@@ -1,5 +1,6 @@
|
||||
import type { RefObject, MouseEvent } from "react";
|
||||
import { RefreshCw, Play, Pause } from "lucide-react";
|
||||
import { type RefObject, type MouseEvent, useCallback, useMemo, useState } from "react";
|
||||
import { RefreshCw, Play, Pause, ChevronDown, Check, Search } from "lucide-react";
|
||||
import { SelectPopover } from "@/shared/ui/SelectPopover";
|
||||
|
||||
interface BgmItem {
|
||||
id: string;
|
||||
@@ -18,8 +19,6 @@ interface BgmPanelProps {
|
||||
onSelectBgm: (id: string) => void;
|
||||
playingBgmId: string | null;
|
||||
onTogglePreview: (bgm: BgmItem, event: MouseEvent) => void;
|
||||
bgmVolume: number;
|
||||
onVolumeChange: (value: number) => void;
|
||||
bgmListContainerRef: RefObject<HTMLDivElement | null>;
|
||||
registerBgmItemRef: (id: string, element: HTMLDivElement | null) => void;
|
||||
}
|
||||
@@ -35,11 +34,31 @@ export function BgmPanel({
|
||||
onSelectBgm,
|
||||
playingBgmId,
|
||||
onTogglePreview,
|
||||
bgmVolume,
|
||||
onVolumeChange,
|
||||
bgmListContainerRef,
|
||||
registerBgmItemRef,
|
||||
}: BgmPanelProps) {
|
||||
const [bgmFilter, setBgmFilter] = useState("");
|
||||
const selectedBgm = bgmList.find((item) => item.id === selectedBgmId) || null;
|
||||
const canSelectBgm = enableBgm && !bgmLoading && !bgmError && bgmList.length > 0;
|
||||
const filteredBgmList = useMemo(() => {
|
||||
const query = bgmFilter.trim().toLowerCase();
|
||||
if (!query) return bgmList;
|
||||
return bgmList.filter((bgm) => bgm.name.toLowerCase().includes(query));
|
||||
}, [bgmFilter, bgmList]);
|
||||
|
||||
const handleOpenBgmPopover = useCallback(() => {
|
||||
setBgmFilter("");
|
||||
|
||||
requestAnimationFrame(() => {
|
||||
requestAnimationFrame(() => {
|
||||
const container = bgmListContainerRef.current;
|
||||
if (!container) return;
|
||||
const selectedRow = container.querySelector<HTMLElement>("[data-bgm-selected='true']");
|
||||
selectedRow?.scrollIntoView({ block: "nearest", behavior: "auto" });
|
||||
});
|
||||
});
|
||||
}, [bgmListContainerRef]);
|
||||
|
||||
return (
<div className="bg-white/5 rounded-2xl p-6 border border-white/10 backdrop-blur-sm">
<div className="flex items-center justify-between mb-4">
@@ -79,57 +98,108 @@ export function BgmPanel({
) : bgmList.length === 0 ? (
<div className="text-center py-4 text-gray-500 text-sm">暂无背景音乐,请先导入素材</div>
) : (
<div
ref={bgmListContainerRef}
className={`space-y-2 max-h-64 overflow-y-auto hide-scrollbar ${enableBgm ? '' : 'opacity-70'}`}
>
{bgmList.map((bgm) => (
<div
key={bgm.id}
ref={(el) => registerBgmItemRef(bgm.id, el)}
className={`p-3 rounded-lg border transition-all flex items-center justify-between group ${selectedBgmId === bgm.id
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
<button onClick={() => onSelectBgm(bgm.id)} className="flex-1 text-left">
<div className="text-white text-sm truncate">{bgm.name}</div>
<div className="text-xs text-gray-400">.{bgm.ext || 'audio'}</div>
<div className={!enableBgm ? "opacity-70" : ""}>
<p className="mb-2 text-xs text-gray-400">曲目选择</p>
<SelectPopover
sheetTitle="选择背景音乐"
disabled={!canSelectBgm}
onOpen={handleOpenBgmPopover}
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
disabled={!canSelectBgm}
className={`w-full rounded-xl border px-3 py-2.5 text-left transition-colors ${canSelectBgm
? "border-white/10 bg-black/25 hover:border-white/30"
: "border-white/10 bg-black/20 text-gray-500 cursor-not-allowed"
}`}
>
<span className="flex items-center justify-between gap-3">
<span className="min-w-0">
<span className="block truncate text-sm text-white">
{selectedBgm?.name || "请选择背景音乐"}
</span>
<span className="mt-0.5 block text-xs text-gray-400">
{selectedBgm ? `.${selectedBgm.ext || "audio"}` : "未选择"}
</span>
</span>
<ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
</span>
</button>
<div className="flex items-center gap-2 pl-2">
<button
onClick={(e) => onTogglePreview(bgm, e)}
className="p-1 text-gray-500 hover:text-purple-400 transition-colors"
title="试听"
>
{playingBgmId === bgm.id ? (
<Pause className="h-4 w-4" />
) : (
<Play className="h-4 w-4" />
)}
</button>
{selectedBgmId === bgm.id && (
<span className="text-xs text-purple-300">已选</span>
)}
>
{({ close }) => (
<div className="space-y-2">
<div className="rounded-lg border border-white/10 bg-black/30 px-3 py-2">
<div className="flex items-center gap-2">
<Search className="h-4 w-4 text-gray-400" />
<input
type="text"
value={bgmFilter}
onChange={(e) => setBgmFilter(e.target.value)}
placeholder="搜索背景音乐..."
className="w-full bg-transparent text-sm text-white placeholder-gray-500 outline-none"
/>
</div>
</div>

{filteredBgmList.length === 0 ? (
<div className="py-6 text-center text-sm text-gray-400">没有匹配的背景音乐</div>
) : (
<div
ref={bgmListContainerRef}
className="space-y-1"
style={{ contentVisibility: "auto" }}
>
{filteredBgmList.map((bgm) => {
const isSelected = selectedBgmId === bgm.id;

return (
<div
key={bgm.id}
ref={(el) => registerBgmItemRef(bgm.id, el)}
data-popover-selected={isSelected ? "true" : undefined}
data-bgm-selected={isSelected ? "true" : "false"}
className={`flex items-center justify-between gap-2 rounded-lg border px-3 py-2 transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
<button
type="button"
onClick={() => {
onSelectBgm(bgm.id);
close();
}}
className="min-w-0 flex-1 text-left"
>
<span className="block truncate text-sm text-white">{bgm.name}</span>
<span className="mt-0.5 block text-xs text-gray-400">.{bgm.ext || "audio"}</span>
</button>

<div className="flex items-center gap-2 pl-2">
<button
type="button"
onClick={(e) => onTogglePreview(bgm, e)}
className="p-1 text-gray-400 hover:text-purple-300 transition-colors"
title="试听"
>
{playingBgmId === bgm.id ? (
<Pause className="h-4 w-4" />
) : (
<Play className="h-4 w-4" />
)}
</button>
{isSelected && <Check className="h-4 w-4 text-purple-300" />}
</div>
</div>
);
})}
</div>
)}
</div>
</div>
))}
</div>
)}

{enableBgm && (
<div className="mt-4">
<label className="text-sm text-gray-300 mb-2 block">音量</label>
<input
type="range"
min="0"
max="1"
step="0.05"
value={bgmVolume}
onChange={(e) => onVolumeChange(parseFloat(e.target.value))}
className="w-full accent-purple-500"
/>
<div className="text-xs text-gray-400 mt-1">当前: {Math.round(bgmVolume * 100)}%</div>
)}
</SelectPopover>
</div>
)}
</div>

@@ -1,6 +1,7 @@
import { useCallback, useEffect, useRef, useState } from "react";
import { X, Play, Pause } from "lucide-react";
import type { TimelineSegment } from "@/features/home/model/useTimelineEditor";
import { useCallback, useEffect, useRef, useState } from "react";
import { Play, Pause } from "lucide-react";
import type { TimelineSegment } from "@/features/home/model/useTimelineEditor";
import { AppModal, AppModalHeader } from "@/shared/ui/AppModal";

interface ClipTrimmerProps {
isOpen: boolean;
@@ -153,21 +154,18 @@ export function ClipTrimmer({
const endPct = duration > 0 ? (effectiveEnd / duration) * 100 : 100;
const playheadPct = duration > 0 ? (currentTime / duration) * 100 : 0;

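The `endPct` and `playheadPct` ternaries above guard against a zero duration with different fallbacks (100% for the end marker, 0% for the playhead), avoiding `NaN` from division by zero. A sketch of that computation as one helper (the `toPct` name is illustrative):

```typescript
// Guarded percentage computation mirroring endPct / playheadPct: when the
// clip duration is zero, return the caller's fallback instead of NaN.
function toPct(value: number, duration: number, zeroFallback: number): number {
  return duration > 0 ? (value / duration) * 100 : zeroFallback;
}
```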
return (
<div className="fixed inset-0 z-50 flex items-center justify-center bg-black/60 backdrop-blur-sm" onClick={onClose}>
<div
className="bg-gray-900 border border-white/10 rounded-2xl w-full max-w-lg mx-4 overflow-hidden"
onClick={(e) => e.stopPropagation()}
>
{/* Header */}
<div className="flex items-center justify-between px-5 py-3 border-b border-white/10">
<h3 className="text-white font-semibold text-sm">
截取设置 - {segment.materialName}
</h3>
<button onClick={onClose} className="text-gray-400 hover:text-white">
<X className="h-4 w-4" />
</button>
</div>
return (
<AppModal
isOpen={isOpen}
onClose={onClose}
panelClassName="w-full max-w-lg mx-4 rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden"
closeOnOverlay
>
<AppModalHeader
title={`截取设置 - ${segment.materialName}`}
subtitle="拖拽起止点,精确控制素材片段"
onClose={onClose}
/>

{/* Video preview */}
<div className="px-5 pt-4">
@@ -287,7 +285,6 @@ export function ClipTrimmer({
确定
</button>
</div>
</div>
</div>
);
}
</AppModal>
);
}

@@ -53,6 +53,7 @@ interface FloatingStylePreviewProps {
buildTextShadow: (color: string, size: number) => string;
previewBaseWidth: number;
previewBaseHeight: number;
previewBackgroundUrl?: string | null;
}

const DESKTOP_WIDTH = 280;
@@ -79,6 +80,7 @@ export function FloatingStylePreview({
buildTextShadow,
previewBaseWidth,
previewBaseHeight,
previewBackgroundUrl,
}: FloatingStylePreviewProps) {
const isMobile = typeof window !== "undefined" && window.innerWidth < 640;
const windowWidth = isMobile ? MOBILE_WIDTH : DESKTOP_WIDTH;
@@ -190,7 +192,11 @@ export function FloatingStylePreview({
${subtitleFontUrl ? `@font-face { font-family: '${subtitleFontFamilyName}'; src: url('${subtitleFontUrl}') format('${getFontFormat(activeSubtitleStyle?.font_file)}'); font-weight: 400; font-style: normal; }` : ''}
`}</style>
)}
<div className="absolute inset-0 opacity-20 bg-gradient-to-br from-purple-500/40 via-transparent to-pink-500/30" />
{previewBackgroundUrl ? (
<img src={previewBackgroundUrl} alt="" className="absolute inset-0 w-full h-full object-cover" />
) : (
<div className="absolute inset-0 opacity-20 bg-gradient-to-br from-purple-500/40 via-transparent to-pink-500/30" />
)}
<div
className="absolute top-0 left-0"
style={{

@@ -1,10 +1,21 @@
import { Rocket } from "lucide-react";
import { Rocket, ChevronDown, Check } from "lucide-react";
import { SelectPopover } from "@/shared/ui/SelectPopover";

type LipsyncModelMode = "default" | "fast" | "advanced";

const MODEL_OPTIONS: Array<{ value: LipsyncModelMode; label: string; desc: string }> = [
{ value: "default", label: "默认模型", desc: "按时长智能路由" },
{ value: "fast", label: "快速模型", desc: "速度优先" },
{ value: "advanced", label: "高级模型", desc: "质量优先" },
];

interface GenerateActionBarProps {
isGenerating: boolean;
progress: number;
disabled: boolean;
materialCount?: number;
modelMode: LipsyncModelMode;
onModelModeChange: (value: LipsyncModelMode) => void;
onGenerate: () => void;
}

@@ -13,45 +24,102 @@ export function GenerateActionBar({
progress,
disabled,
materialCount = 1,
modelMode,
onModelModeChange,
onGenerate,
}: GenerateActionBarProps) {
const currentModel = MODEL_OPTIONS.find((opt) => opt.value === modelMode) || MODEL_OPTIONS[0];

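The `currentModel` line above resolves the selected entry from `MODEL_OPTIONS` and falls back to the first option, so the trigger button always has a label even if `modelMode` somehow holds an unknown value. A minimal sketch of that lookup-with-fallback (the `Option` shape and values are illustrative):

```typescript
// Sketch of the currentModel resolution: find the matching option, else
// default to the first entry so the UI never renders an undefined label.
interface Option {
  value: string;
  label: string;
}

function resolveOption(options: Option[], value: string): Option {
  return options.find((opt) => opt.value === value) || options[0];
}
```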
return (
<div>
<button
onClick={onGenerate}
disabled={disabled}
className={`w-full py-4 rounded-xl font-bold text-lg transition-all ${disabled
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
}`}
>
{isGenerating ? (
<span className="flex items-center justify-center gap-3">
<svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
<circle
className="opacity-25"
cx="12"
cy="12"
r="10"
stroke="currentColor"
strokeWidth="4"
fill="none"
/>
<path
className="opacity-75"
fill="currentColor"
d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z"
/>
</svg>
生成中... {progress}%
</span>
) : (
<span className="flex items-center justify-center gap-2">
<Rocket className="h-5 w-5" />
生成视频
</span>
)}
</button>
<div className="flex items-center gap-2">
<button
onClick={onGenerate}
disabled={disabled}
className={`flex-1 py-4 rounded-xl font-bold text-lg transition-all ${disabled
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
}`}
>
{isGenerating ? (
<span className="flex items-center justify-center gap-3">
<svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
<circle
className="opacity-25"
cx="12"
cy="12"
r="10"
stroke="currentColor"
strokeWidth="4"
fill="none"
/>
<path
className="opacity-75"
fill="currentColor"
d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z"
/>
</svg>
生成中... {progress}%
</span>
) : (
<span className="flex items-center justify-center gap-2">
<Rocket className="h-5 w-5" />
生成视频
</span>
)}
</button>

<SelectPopover
sheetTitle="选择唇形模型"
disabled={isGenerating}
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
disabled={isGenerating}
className="h-[58px] min-w-[152px] rounded-xl border border-white/15 bg-black/30 px-3 text-left text-sm text-gray-200 transition-colors hover:border-white/30 disabled:cursor-not-allowed disabled:opacity-50"
title="选择唇形模型"
>
<span className="flex items-center justify-between gap-2">
<span className="min-w-0">
<span className="block truncate text-sm text-white">{currentModel.label}</span>
<span className="mt-0.5 block text-xs text-gray-400">{currentModel.desc}</span>
</span>
<ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
</span>
</button>
)}
>
{({ close }) => (
<div className="space-y-1">
{MODEL_OPTIONS.map((opt) => {
const isSelected = opt.value === modelMode;
return (
<button
key={opt.value}
type="button"
data-popover-selected={isSelected ? "true" : undefined}
onClick={() => {
onModelModeChange(opt.value);
close();
}}
className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
<span>
<span className="block text-sm text-white">{opt.label}</span>
<span className="mt-0.5 block text-xs text-gray-400">{opt.desc}</span>
</span>
{isSelected && <Check className="h-4 w-4 text-purple-300" />}
</button>
);
})}
</div>
)}
</SelectPopover>
</div>
{!isGenerating && materialCount >= 2 && (
<p className="text-xs text-gray-400 text-center mt-1.5">
多素材模式 ({materialCount} 个机位),生成耗时较长

@@ -1,6 +1,7 @@
import { useState, useRef, useCallback, useEffect } from "react";
import { Play, Pause, Pencil, Trash2, Check, X, RefreshCw, Mic, ChevronDown } from "lucide-react";
import type { GeneratedAudio } from "@/features/home/model/useGeneratedAudios";
import { useState, useRef, useCallback, useEffect, useMemo } from "react";
import { Play, Pause, Pencil, Trash2, Check, X, RefreshCw, Mic, ChevronDown, Search } from "lucide-react";
import type { GeneratedAudio } from "@/features/home/model/useGeneratedAudios";
import { SelectPopover } from "@/shared/ui/SelectPopover";

interface AudioTask {
status: string;
@@ -23,6 +24,8 @@ interface GeneratedAudiosPanelProps {
speed: number;
onSpeedChange: (speed: number) => void;
ttsMode: string;
emotion: string;
onEmotionChange: (e: string) => void;
embedded?: boolean;
}

@@ -41,14 +44,16 @@ export function GeneratedAudiosPanel({
speed,
onSpeedChange,
ttsMode,
emotion,
onEmotionChange,
embedded = false,
}: GeneratedAudiosPanelProps) {
const [editingId, setEditingId] = useState<string | null>(null);
const [editName, setEditName] = useState("");
const [playingId, setPlayingId] = useState<string | null>(null);
const [speedOpen, setSpeedOpen] = useState(false);
const audioRef = useRef<HTMLAudioElement | null>(null);
const speedRef = useRef<HTMLDivElement>(null);
const [editingId, setEditingId] = useState<string | null>(null);
const [editName, setEditName] = useState("");
const [playingId, setPlayingId] = useState<string | null>(null);
const [audioFilter, setAudioFilter] = useState("");
const audioRef = useRef<HTMLAudioElement | null>(null);
const audioListContainerRef = useRef<HTMLDivElement | null>(null);

const stopPlaying = useCallback(() => {
if (audioRef.current) {
@@ -69,17 +74,6 @@ export function GeneratedAudiosPanel({
};
}, []);

// Close speed dropdown on click outside
useEffect(() => {
const handler = (e: MouseEvent) => {
if (speedRef.current && !speedRef.current.contains(e.target as Node)) {
setSpeedOpen(false);
}
};
if (speedOpen) document.addEventListener("mousedown", handler);
return () => document.removeEventListener("mousedown", handler);
}, [speedOpen]);

const togglePlay = (audio: GeneratedAudio, e: React.MouseEvent) => {
e.stopPropagation();
if (playingId === audio.id) {
@@ -125,40 +119,121 @@ export function GeneratedAudiosPanel({
] as const;
const currentSpeedLabel = speedOptions.find((o) => o.value === speed)?.label ?? "正常";

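The `currentSpeedLabel` line above (and the `currentEmotionLabel` line that follows) uses optional chaining plus a `??` default so the trigger text stays defined even when the stored value matches no option. A minimal sketch of that pattern (option values and labels here are illustrative, not the component's real lists):

```typescript
// Sketch of the `find(...)?.label ?? default` label lookup: unknown values
// render the default label instead of "undefined".
const speedChoices = [
  { value: 0.9, label: "slow" },
  { value: 1, label: "normal" },
  { value: 1.1, label: "fast" },
] as const;

function speedLabel(speed: number): string {
  return speedChoices.find((o) => o.value === speed)?.label ?? "normal";
}
```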
const emotionOptions = [
{ value: "normal", label: "正常" },
{ value: "happy", label: "欢快" },
{ value: "sad", label: "低沉" },
{ value: "angry", label: "严肃" },
] as const;
const currentEmotionLabel = emotionOptions.find((o) => o.value === emotion)?.label ?? "正常";
const selectedAudio = generatedAudios.find((audio) => audio.id === selectedAudioId) || null;
const filteredAudios = useMemo(() => {
const query = audioFilter.trim().toLowerCase();
if (!query) return generatedAudios;
return generatedAudios.filter((audio) => audio.name.toLowerCase().includes(query));
}, [audioFilter, generatedAudios]);

const handleOpenAudioPopover = useCallback(() => {
setAudioFilter("");

requestAnimationFrame(() => {
requestAnimationFrame(() => {
const container = audioListContainerRef.current;
if (!container) return;
const selectedRow = container.querySelector<HTMLElement>("[data-audio-selected='true']");
selectedRow?.scrollIntoView({ block: "nearest", behavior: "auto" });
});
});
}, []);

const content = (
<>
{embedded ? (
<>
{/* Row 1: 语速 + 生成配音 (right-aligned) */}
{/* Row 1: 语气 + 语速 + 生成配音 (right-aligned) */}
<div className="flex justify-end items-center gap-1.5 mb-3">
{ttsMode === "voiceclone" && (
<div ref={speedRef} className="relative">
<button
onClick={() => setSpeedOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
>
语速: {currentSpeedLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${speedOpen ? "rotate-180" : ""}`} />
</button>
{speedOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[80px]">
{speedOptions.map((opt) => (
<button
key={opt.value}
onClick={() => { onSpeedChange(opt.value); setSpeedOpen(false); }}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
speed === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
)}
{ttsMode === "voiceclone" && (
<SelectPopover
sheetTitle="选择语气"
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="rounded-lg border border-white/10 bg-black/25 px-2.5 py-1.5 text-xs text-gray-200 whitespace-nowrap flex items-center gap-1 transition-colors hover:border-white/30"
>
语气: {currentEmotionLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${open ? "rotate-180" : ""}`} />
</button>
)}
>
{({ close }) => (
<div className="space-y-1">
{emotionOptions.map((opt) => {
const isSelected = emotion === opt.value;
return (
<button
key={opt.value}
type="button"
data-popover-selected={isSelected ? "true" : undefined}
onClick={() => {
onEmotionChange(opt.value);
close();
}}
className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left text-xs transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20 text-purple-200"
: "border-white/10 bg-white/5 text-gray-300 hover:border-white/30"
}`}
>
{opt.label}
{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
</button>
);
})}
</div>
)}
</SelectPopover>
)}
{ttsMode === "voiceclone" && (
<SelectPopover
sheetTitle="选择语速"
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="rounded-lg border border-white/10 bg-black/25 px-2.5 py-1.5 text-xs text-gray-200 whitespace-nowrap flex items-center gap-1 transition-colors hover:border-white/30"
>
语速: {currentSpeedLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${open ? "rotate-180" : ""}`} />
</button>
)}
>
{({ close }) => (
<div className="space-y-1">
{speedOptions.map((opt) => {
const isSelected = speed === opt.value;
return (
<button
key={opt.value}
type="button"
data-popover-selected={isSelected ? "true" : undefined}
onClick={() => {
onSpeedChange(opt.value);
close();
}}
className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left text-xs transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20 text-purple-200"
: "border-white/10 bg-white/5 text-gray-300 hover:border-white/30"
}`}
>
{opt.label}
{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
</button>
);
})}
</div>
)}
</SelectPopover>
)}
<button
onClick={onGenerateAudio}
disabled={isGeneratingAudio || !canGenerate}
@@ -192,34 +267,88 @@ export function GeneratedAudiosPanel({
配音列表
</h2>
<div className="flex gap-1.5">
{ttsMode === "voiceclone" && (
<div ref={speedRef} className="relative">
<button
onClick={() => setSpeedOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
>
语速: {currentSpeedLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${speedOpen ? "rotate-180" : ""}`} />
</button>
{speedOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[80px]">
{speedOptions.map((opt) => (
<button
key={opt.value}
onClick={() => { onSpeedChange(opt.value); setSpeedOpen(false); }}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
speed === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
)}
{ttsMode === "voiceclone" && (
<SelectPopover
sheetTitle="选择语气"
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="rounded-lg border border-white/10 bg-black/25 px-2.5 py-1.5 text-xs text-gray-200 whitespace-nowrap flex items-center gap-1 transition-colors hover:border-white/30"
>
语气: {currentEmotionLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${open ? "rotate-180" : ""}`} />
</button>
)}
>
{({ close }) => (
<div className="space-y-1">
{emotionOptions.map((opt) => {
const isSelected = emotion === opt.value;
return (
<button
key={opt.value}
type="button"
data-popover-selected={isSelected ? "true" : undefined}
onClick={() => {
onEmotionChange(opt.value);
close();
}}
className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left text-xs transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20 text-purple-200"
: "border-white/10 bg-white/5 text-gray-300 hover:border-white/30"
}`}
>
{opt.label}
{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
</button>
);
})}
</div>
)}
</SelectPopover>
)}
{ttsMode === "voiceclone" && (
<SelectPopover
sheetTitle="选择语速"
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="rounded-lg border border-white/10 bg-black/25 px-2.5 py-1.5 text-xs text-gray-200 whitespace-nowrap flex items-center gap-1 transition-colors hover:border-white/30"
>
语速: {currentSpeedLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${open ? "rotate-180" : ""}`} />
</button>
)}
>
{({ close }) => (
<div className="space-y-1">
{speedOptions.map((opt) => {
const isSelected = speed === opt.value;
return (
<button
key={opt.value}
type="button"
data-popover-selected={isSelected ? "true" : undefined}
onClick={() => {
onSpeedChange(opt.value);
close();
}}
className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left text-xs transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20 text-purple-200"
: "border-white/10 bg-white/5 text-gray-300 hover:border-white/30"
}`}
>
{opt.label}
{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
</button>
);
})}
</div>
)}
</SelectPopover>
)}
<button
onClick={onGenerateAudio}
disabled={isGeneratingAudio || !canGenerate}
@@ -268,87 +397,142 @@ export function GeneratedAudiosPanel({
)}

{/* 配音列表 */}
{generatedAudios.length === 0 ? (
<div className="text-center py-6 text-gray-400">
<p className="text-sm">暂无配音</p>
<p className="text-xs mt-1 text-gray-500">点击「生成配音」创建</p>
</div>
) : (
<div className="space-y-2 max-h-48 sm:max-h-56 overflow-y-auto hide-scrollbar">
{generatedAudios.map((audio) => {
const isSelected = selectedAudioId === audio.id;
return (
<div
key={audio.id}
onClick={() => onSelectAudio(audio)}
className={`p-3 rounded-lg border transition-all cursor-pointer flex items-center justify-between group ${
isSelected
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{editingId === audio.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
value={editName}
onChange={(e) => setEditName(e.target.value)}
className="flex-1 bg-black/40 border border-white/20 rounded-md px-2 py-1 text-xs text-white"
autoFocus
onKeyDown={(e) => {
if (e.key === "Enter") saveEditing(audio.id, e as unknown as React.MouseEvent);
if (e.key === "Escape") cancelEditing(e as unknown as React.MouseEvent);
}}
/>
<button onClick={(e) => saveEditing(audio.id, e)} className="p-1 text-green-400 hover:text-green-300" title="保存">
<Check className="h-4 w-4" />
</button>
<button onClick={cancelEditing} className="p-1 text-gray-400 hover:text-white" title="取消">
<X className="h-4 w-4" />
</button>
</div>
) : (
<>
<div className="min-w-0 flex-1">
<div className="text-white text-sm truncate">{audio.name}</div>
<div className="text-gray-400 text-xs">{audio.duration_sec.toFixed(1)}s</div>
</div>
<div className="flex items-center gap-1 pl-2 opacity-40 group-hover:opacity-100 transition-opacity">
<button
onClick={(e) => togglePlay(audio, e)}
className="p-1 text-gray-500 hover:text-purple-400 transition-colors"
title={playingId === audio.id ? "暂停" : "播放"}
>
{playingId === audio.id ? (
<Pause className="h-3.5 w-3.5" />
) : (
<Play className="h-3.5 w-3.5" />
)}
</button>
<button
onClick={(e) => startEditing(audio, e)}
className="p-1 text-gray-500 hover:text-white transition-colors"
title="重命名"
>
<Pencil className="h-3.5 w-3.5" />
</button>
<button
onClick={(e) => {
e.stopPropagation();
onDeleteAudio(audio.id);
}}
className="p-1 text-gray-500 hover:text-red-400 transition-colors"
title="删除"
>
<Trash2 className="h-3.5 w-3.5" />
</button>
</div>
</>
)}
</div>
);
})}
</div>
)}
{generatedAudios.length === 0 ? (
<div className="text-center py-6 text-gray-400">
<p className="text-sm">暂无配音</p>
<p className="text-xs mt-1 text-gray-500">点击「生成配音」创建</p>
</div>
) : (
<SelectPopover
sheetTitle="选择配音"
onOpen={handleOpenAudioPopover}
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="w-full rounded-xl border border-white/10 bg-black/25 px-3 py-2.5 text-left transition-colors hover:border-white/30"
>
<span className="flex items-center justify-between gap-3">
<span className="min-w-0">
<span className="block text-xs text-gray-400">当前配音</span>
<span className="mt-0.5 block truncate text-sm text-white">
{selectedAudio ? selectedAudio.name : "请选择配音"}
</span>
</span>
<ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
</span>
</button>
)}
>
{({ close }) => (
<div className="space-y-2">
<div className="rounded-lg border border-white/10 bg-black/30 px-3 py-2">
<div className="flex items-center gap-2">
<Search className="h-4 w-4 text-gray-400" />
<input
type="text"
value={audioFilter}
onChange={(e) => setAudioFilter(e.target.value)}
placeholder="搜索配音..."
className="w-full bg-transparent text-sm text-white placeholder-gray-500 outline-none"
/>
</div>
</div>

{filteredAudios.length === 0 ? (
<div className="py-6 text-center text-sm text-gray-400">没有匹配的配音</div>
) : (
<div ref={audioListContainerRef} className="space-y-1" style={{ contentVisibility: "auto" }}>
{filteredAudios.map((audio) => {
const isSelected = selectedAudioId === audio.id;
return (
<div
key={audio.id}
data-popover-selected={isSelected ? "true" : undefined}
data-audio-selected={isSelected ? "true" : "false"}
className={`flex items-center justify-between gap-2 rounded-lg border px-3 py-2 transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{editingId === audio.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
value={editName}
onChange={(e) => setEditName(e.target.value)}
className="flex-1 bg-black/40 border border-white/20 rounded-md px-2 py-1 text-xs text-white"
autoFocus
onKeyDown={(e) => {
if (e.key === "Enter") saveEditing(audio.id, e as unknown as React.MouseEvent);
if (e.key === "Escape") cancelEditing(e as unknown as React.MouseEvent);
}}
/>
<button type="button" onClick={(e) => saveEditing(audio.id, e)} className="p-1 text-green-400 hover:text-green-300" title="保存">
<Check className="h-4 w-4" />
</button>
<button type="button" onClick={cancelEditing} className="p-1 text-gray-400 hover:text-white" title="取消">
<X className="h-4 w-4" />
</button>
</div>
) : (
<button
type="button"
onClick={() => {
onSelectAudio(audio);
close();
}}
className="min-w-0 flex-1 text-left"
>
<span className="block truncate text-sm text-white">{audio.name}</span>
<span className="mt-0.5 block text-xs text-gray-400">{audio.duration_sec.toFixed(1)}s</span>
</button>
)}

{editingId !== audio.id && (
|
||||
<div className="flex items-center gap-1 pl-2">
|
||||
<button
|
||||
type="button"
|
||||
onClick={(e) => togglePlay(audio, e)}
|
||||
className="p-1 text-gray-400 hover:text-purple-300 transition-colors"
|
||||
title={playingId === audio.id ? "暂停" : "播放"}
|
||||
>
|
||||
{playingId === audio.id ? (
|
||||
<Pause className="h-3.5 w-3.5" />
|
||||
) : (
|
||||
<Play className="h-3.5 w-3.5" />
|
||||
)}
|
||||
</button>
|
||||
<button
|
||||
type="button"
|
||||
onClick={(e) => startEditing(audio, e)}
|
||||
className="p-1 text-gray-400 hover:text-white transition-colors"
|
||||
title="重命名"
|
||||
>
|
||||
<Pencil className="h-3.5 w-3.5" />
|
||||
</button>
|
||||
<button
|
||||
type="button"
|
||||
onClick={(e) => {
|
||||
e.stopPropagation();
|
||||
onDeleteAudio(audio.id);
|
||||
}}
|
||||
className="p-1 text-gray-400 hover:text-red-400 transition-colors"
|
||||
title="删除"
|
||||
>
|
||||
<Trash2 className="h-3.5 w-3.5" />
|
||||
</button>
|
||||
{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
);
|
||||
})}
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
)}
|
||||
</SelectPopover>
|
||||
)}
|
||||
</>
|
||||
);
|
||||
|
||||
|
||||
@@ -1,4 +1,6 @@
import { RefreshCw, Trash2 } from "lucide-react";
import { useCallback, useMemo, useRef, useState } from "react";
import { RefreshCw, Trash2, Search, ChevronDown, Check } from "lucide-react";
import { SelectPopover } from "@/shared/ui/SelectPopover";

interface GeneratedVideo {
id: string;
@@ -29,6 +31,29 @@ export function HistoryList({
formatDate,
embedded = false,
}: HistoryListProps) {
const [videoFilter, setVideoFilter] = useState("");
const videoListContainerRef = useRef<HTMLDivElement | null>(null);

const selectedVideo = generatedVideos.find((v) => v.id === selectedVideoId) || null;
const filteredVideos = useMemo(() => {
const query = videoFilter.trim().toLowerCase();
if (!query) return generatedVideos;
return generatedVideos.filter((v) => formatDate(v.created_at).toLowerCase().includes(query));
}, [generatedVideos, videoFilter, formatDate]);

const handleOpenVideoPopover = useCallback(() => {
setVideoFilter("");

requestAnimationFrame(() => {
requestAnimationFrame(() => {
const container = videoListContainerRef.current;
if (!container) return;
const selectedRow = container.querySelector<HTMLElement>("[data-video-selected='true']");
selectedRow?.scrollIntoView({ block: "nearest", behavior: "auto" });
});
});
}, []);

const content = (
<>
{!embedded && (
@@ -48,36 +73,98 @@ export function HistoryList({
<p>暂无生成的作品</p>
</div>
) : (
<div
className="space-y-2 max-h-64 overflow-y-auto hide-scrollbar"
style={{ contentVisibility: 'auto' }}
>
{generatedVideos.map((v) => (
<div
key={v.id}
ref={(el) => registerVideoRef(v.id, el)}
className={`p-3 rounded-lg border transition-all flex items-center justify-between group ${selectedVideoId === v.id
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
<SelectPopover
sheetTitle="选择作品"
onOpen={handleOpenVideoPopover}
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="w-full rounded-xl border border-white/10 bg-black/25 px-3 py-2.5 text-left transition-colors hover:border-white/30"
>
<button onClick={() => onSelectVideo(v)} className="flex-1 text-left">
<div className="text-white text-sm truncate">{formatDate(v.created_at)}</div>
<div className="text-gray-400 text-xs">{v.size_mb.toFixed(1)} MB</div>
</button>
<button
onClick={(e) => {
e.stopPropagation();
onDeleteVideo(v.id);
}}
className="p-1 text-gray-500 hover:text-red-400 opacity-40 group-hover:opacity-100 transition-opacity"
title="删除视频"
>
<Trash2 className="h-4 w-4" />
</button>
<span className="flex items-center justify-between gap-3">
<span className="min-w-0">
<span className="block text-xs text-gray-400">当前作品</span>
<span className="mt-0.5 block truncate text-sm text-white">
{selectedVideo ? formatDate(selectedVideo.created_at) : "请选择作品"}
</span>
</span>
<ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
</span>
</button>
)}
>
{({ close }) => (
<div className="space-y-2">
<div className="rounded-lg border border-white/10 bg-black/30 px-3 py-2">
<div className="flex items-center gap-2">
<Search className="h-4 w-4 text-gray-400" />
<input
type="text"
value={videoFilter}
onChange={(e) => setVideoFilter(e.target.value)}
placeholder="搜索作品..."
className="w-full bg-transparent text-sm text-white placeholder-gray-500 outline-none"
/>
</div>
</div>

{filteredVideos.length === 0 ? (
<div className="py-6 text-center text-sm text-gray-400">没有匹配的作品</div>
) : (
<div
ref={videoListContainerRef}
className="space-y-1"
style={{ contentVisibility: "auto" }}
>
{filteredVideos.map((v) => {
const isSelected = selectedVideoId === v.id;

return (
<div
key={v.id}
ref={(el) => registerVideoRef(v.id, el)}
data-popover-selected={isSelected ? "true" : undefined}
data-video-selected={isSelected ? "true" : "false"}
className={`flex items-center justify-between gap-2 rounded-lg border px-3 py-2 transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
<button
type="button"
onClick={() => {
onSelectVideo(v);
close();
}}
className="min-w-0 flex-1 text-left"
>
<span className="block truncate text-sm text-white">{formatDate(v.created_at)}</span>
<span className="mt-0.5 block text-xs text-gray-400">{v.size_mb.toFixed(1)} MB</span>
</button>

<div className="flex items-center gap-2 pl-2">
<button
type="button"
onClick={(e) => {
e.stopPropagation();
onDeleteVideo(v.id);
}}
className="p-1 text-gray-400 hover:text-red-400"
title="删除视频"
>
<Trash2 className="h-4 w-4" />
</button>
{isSelected && <Check className="h-4 w-4 text-purple-300" />}
</div>
</div>
);
})}
</div>
)}
</div>
))}
</div>
)}
</SelectPopover>
)}
</>
);

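The `filteredVideos` memo in the HistoryList diff above does a case-insensitive substring match on the formatted date and falls back to the full list on an empty query. A minimal standalone sketch of that logic (the `VideoItem` shape and `filterByQuery` name are hypothetical, introduced only for illustration):

```typescript
interface VideoItem {
  id: string;
  created_at: string;
}

// Mirrors the filteredVideos useMemo: trim and lowercase the query,
// return everything on an empty query, otherwise substring-match the
// formatted date (case-insensitively).
function filterByQuery(
  items: VideoItem[],
  query: string,
  format: (iso: string) => string,
): VideoItem[] {
  const q = query.trim().toLowerCase();
  if (!q) return items;
  return items.filter((v) => format(v.created_at).toLowerCase().includes(q));
}

const items: VideoItem[] = [
  { id: "a", created_at: "2024-05-01" },
  { id: "b", created_at: "2024-06-02" },
];

// With an identity formatter, only "b" matches the query "06".
console.log(filterByQuery(items, "06", (iso) => iso).map((v) => v.id));
```

Note that filtering depends on `formatDate`, so what the user sees is exactly what the query is matched against.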
@@ -5,6 +5,7 @@ import { useRouter } from "next/navigation";
import { RefreshCw } from "lucide-react";
import VideoPreviewModal from "@/components/VideoPreviewModal";
import ScriptExtractionModal from "./ScriptExtractionModal";
import RewriteModal from "./RewriteModal";
import { useHomeController } from "@/features/home/model/useHomeController";
import { resolveMediaUrl } from "@/shared/lib/media";
import { BgmPanel } from "@/features/home/ui/BgmPanel";
@@ -52,6 +53,8 @@ export function HomePage() {
setText,
extractModalOpen,
setExtractModalOpen,
rewriteModalOpen,
setRewriteModalOpen,
handleGenerateMeta,
isGeneratingMeta,
handleTranslate,
@@ -94,6 +97,8 @@ export function HomePage() {
setTitleDisplayMode,
outputAspectRatio,
setOutputAspectRatio,
lipsyncModelMode,
setLipsyncModelMode,
resolveAssetUrl,
getFontFormat,
buildTextShadow,
@@ -127,6 +132,7 @@ export function HomePage() {
startRecording,
stopRecording,
useRecording,
discardRecording,
formatRecordingTime,
bgmList,
bgmLoading,
@@ -138,8 +144,6 @@ export function HomePage() {
setSelectedBgmId,
playingBgmId,
toggleBgmPreview,
bgmVolume,
setBgmVolume,
bgmListContainerRef,
registerBgmItemRef,
currentTask,
@@ -165,6 +169,8 @@ export function HomePage() {
selectAudio,
speed,
setSpeed,
emotion,
setEmotion,
timelineSegments,
reorderSegments,
setSourceRange,
@@ -172,6 +178,7 @@ export function HomePage() {
setClipTrimmerOpen,
clipTrimmerSegmentId,
setClipTrimmerSegmentId,
materialPosterUrl,
} = useHomeController();

useEffect(() => {
@@ -215,6 +222,7 @@ export function HomePage() {
text={text}
onChangeText={setText}
onOpenExtractModal={() => setExtractModalOpen(true)}
onOpenRewriteModal={() => setRewriteModalOpen(true)}
onGenerateMeta={handleGenerateMeta}
isGeneratingMeta={isGeneratingMeta}
onTranslate={handleTranslate}
@@ -227,60 +235,10 @@ export function HomePage() {
onDeleteScript={deleteSavedScript}
/>

{/* 二、标题与字幕 */}
<TitleSubtitlePanel
showStylePreview={showStylePreview}
onTogglePreview={() => setShowStylePreview((prev) => !prev)}
videoTitle={videoTitle}
onTitleChange={titleInput.handleChange}
onTitleCompositionStart={titleInput.handleCompositionStart}
onTitleCompositionEnd={titleInput.handleCompositionEnd}
videoSecondaryTitle={videoSecondaryTitle}
onSecondaryTitleChange={secondaryTitleInput.handleChange}
onSecondaryTitleCompositionStart={secondaryTitleInput.handleCompositionStart}
onSecondaryTitleCompositionEnd={secondaryTitleInput.handleCompositionEnd}
titleStyles={titleStyles}
selectedTitleStyleId={selectedTitleStyleId}
onSelectTitleStyle={setSelectedTitleStyleId}
titleFontSize={titleFontSize}
onTitleFontSizeChange={(value) => {
setTitleFontSize(value);
setTitleSizeLocked(true);
}}
selectedSecondaryTitleStyleId={selectedSecondaryTitleStyleId}
onSelectSecondaryTitleStyle={setSelectedSecondaryTitleStyleId}
secondaryTitleFontSize={secondaryTitleFontSize}
onSecondaryTitleFontSizeChange={(value) => {
setSecondaryTitleFontSize(value);
setSecondaryTitleSizeLocked(true);
}}
secondaryTitleTopMargin={secondaryTitleTopMargin}
onSecondaryTitleTopMarginChange={setSecondaryTitleTopMargin}
subtitleStyles={subtitleStyles}
selectedSubtitleStyleId={selectedSubtitleStyleId}
onSelectSubtitleStyle={setSelectedSubtitleStyleId}
subtitleFontSize={subtitleFontSize}
onSubtitleFontSizeChange={(value) => {
setSubtitleFontSize(value);
setSubtitleSizeLocked(true);
}}
titleTopMargin={titleTopMargin}
onTitleTopMarginChange={setTitleTopMargin}
subtitleBottomMargin={subtitleBottomMargin}
onSubtitleBottomMarginChange={setSubtitleBottomMargin}
titleDisplayMode={titleDisplayMode}
onTitleDisplayModeChange={setTitleDisplayMode}
resolveAssetUrl={resolveAssetUrl}
getFontFormat={getFontFormat}
buildTextShadow={buildTextShadow}
previewBaseWidth={outputAspectRatio === "16:9" ? 1920 : 1080}
previewBaseHeight={outputAspectRatio === "16:9" ? 1080 : 1920}
/>

{/* 三、配音 */}
<div className="bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
{/* 二、配音 */}
<div className="relative z-20 bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
<h2 className="text-base sm:text-lg font-semibold text-white mb-4">
三、配音
二、配音
</h2>
<h3 className="text-sm font-medium text-gray-400 mb-3">配音方式</h3>
<VoiceSelector
@@ -317,6 +275,7 @@ export function HomePage() {
onStartRecording={startRecording}
onStopRecording={stopRecording}
onUseRecording={useRecording}
onDiscardRecording={discardRecording}
formatRecordingTime={formatRecordingTime}
/>
)}
@@ -338,13 +297,15 @@ export function HomePage() {
speed={speed}
onSpeedChange={setSpeed}
ttsMode={ttsMode}
emotion={emotion}
onEmotionChange={setEmotion}
/>
</div>

{/* 四、素材编辑 */}
{/* 三、素材编辑 */}
<div className="bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
<h2 className="text-base sm:text-lg font-semibold text-white mb-4">
四、素材编辑
三、素材编辑
</h2>
<MaterialSelector
embedded
@@ -397,6 +358,57 @@ export function HomePage() {
</div>
</div>

{/* 四、标题与字幕 */}
<TitleSubtitlePanel
showStylePreview={showStylePreview}
onTogglePreview={() => setShowStylePreview((prev) => !prev)}
videoTitle={videoTitle}
onTitleChange={titleInput.handleChange}
onTitleCompositionStart={titleInput.handleCompositionStart}
onTitleCompositionEnd={titleInput.handleCompositionEnd}
videoSecondaryTitle={videoSecondaryTitle}
onSecondaryTitleChange={secondaryTitleInput.handleChange}
onSecondaryTitleCompositionStart={secondaryTitleInput.handleCompositionStart}
onSecondaryTitleCompositionEnd={secondaryTitleInput.handleCompositionEnd}
titleStyles={titleStyles}
selectedTitleStyleId={selectedTitleStyleId}
onSelectTitleStyle={setSelectedTitleStyleId}
titleFontSize={titleFontSize}
onTitleFontSizeChange={(value) => {
setTitleFontSize(value);
setTitleSizeLocked(true);
}}
selectedSecondaryTitleStyleId={selectedSecondaryTitleStyleId}
onSelectSecondaryTitleStyle={setSelectedSecondaryTitleStyleId}
secondaryTitleFontSize={secondaryTitleFontSize}
onSecondaryTitleFontSizeChange={(value) => {
setSecondaryTitleFontSize(value);
setSecondaryTitleSizeLocked(true);
}}
secondaryTitleTopMargin={secondaryTitleTopMargin}
onSecondaryTitleTopMarginChange={setSecondaryTitleTopMargin}
subtitleStyles={subtitleStyles}
selectedSubtitleStyleId={selectedSubtitleStyleId}
onSelectSubtitleStyle={setSelectedSubtitleStyleId}
subtitleFontSize={subtitleFontSize}
onSubtitleFontSizeChange={(value) => {
setSubtitleFontSize(value);
setSubtitleSizeLocked(true);
}}
titleTopMargin={titleTopMargin}
onTitleTopMarginChange={setTitleTopMargin}
subtitleBottomMargin={subtitleBottomMargin}
onSubtitleBottomMarginChange={setSubtitleBottomMargin}
titleDisplayMode={titleDisplayMode}
onTitleDisplayModeChange={setTitleDisplayMode}
resolveAssetUrl={resolveAssetUrl}
getFontFormat={getFontFormat}
buildTextShadow={buildTextShadow}
previewBaseWidth={outputAspectRatio === "16:9" ? 1920 : 1080}
previewBaseHeight={outputAspectRatio === "16:9" ? 1080 : 1920}
previewBackgroundUrl={materialPosterUrl}
/>

{/* 背景音乐 (不编号) */}
<BgmPanel
bgmList={bgmList}
@@ -409,8 +421,6 @@ export function HomePage() {
onSelectBgm={setSelectedBgmId}
playingBgmId={playingBgmId}
onTogglePreview={toggleBgmPreview}
bgmVolume={bgmVolume}
onVolumeChange={setBgmVolume}
bgmListContainerRef={bgmListContainerRef}
registerBgmItemRef={registerBgmItemRef}
/>
@@ -421,6 +431,8 @@ export function HomePage() {
progress={currentTask?.progress || 0}
materialCount={selectedMaterials.length}
disabled={isGenerating || selectedMaterials.length === 0 || !selectedAudio}
modelMode={lipsyncModelMode}
onModelModeChange={setLipsyncModelMode}
onGenerate={handleGenerate}
/>
</div>
@@ -493,6 +505,13 @@ export function HomePage() {
onApply={(nextText) => setText(nextText)}
/>

<RewriteModal
isOpen={rewriteModalOpen}
onClose={() => setRewriteModalOpen(false)}
originalText={text}
onApply={(newText) => setText(newText)}
/>

<ClipTrimmer
isOpen={clipTrimmerOpen}
segment={clipTrimmerSegment}

@@ -1,6 +1,7 @@
import { type ChangeEvent, type MouseEvent, useMemo } from "react";
import { Upload, RefreshCw, Eye, Trash2, X, Pencil, Check } from "lucide-react";
import { type ChangeEvent, type MouseEvent, useCallback, useMemo, useRef, useState } from "react";
import { Upload, RefreshCw, Eye, Trash2, X, Pencil, Check, Search, ChevronDown } from "lucide-react";
import type { Material } from "@/shared/types/material";
import { SelectPopover } from "@/shared/ui/SelectPopover";

interface MaterialSelectorProps {
materials: Material[];
@@ -53,8 +54,46 @@ export function MaterialSelector({
registerMaterialRef,
embedded = false,
}: MaterialSelectorProps) {
const [materialFilter, setMaterialFilter] = useState("");
const materialListContainerRef = useRef<HTMLDivElement | null>(null);
const selectedSet = useMemo(() => new Set(selectedMaterials), [selectedMaterials]);
const isFull = selectedMaterials.length >= 4;
const selectedMaterialItems = useMemo(
() => selectedMaterials.map((id) => materials.find((m) => m.id === id)).filter((m): m is Material => Boolean(m)),
[materials, selectedMaterials],
);
const filteredMaterials = useMemo(() => {
const query = materialFilter.trim().toLowerCase();
if (!query) return materials;
return materials.filter((m) => (m.scene || m.name).toLowerCase().includes(query));
}, [materialFilter, materials]);

const selectedSummary = useMemo(() => {
if (selectedMaterialItems.length === 0) {
return "请选择素材(最多4个)";
}
const names = selectedMaterialItems
.slice(0, 2)
.map((m) => m.scene || m.name)
.join("、");
if (selectedMaterialItems.length > 2) {
return `${names} +${selectedMaterialItems.length - 2}`;
}
return names;
}, [selectedMaterialItems]);

const handleOpenMaterialPopover = useCallback(() => {
setMaterialFilter("");

requestAnimationFrame(() => {
requestAnimationFrame(() => {
const container = materialListContainerRef.current;
if (!container) return;
const selectedRow = container.querySelector<HTMLElement>("[data-material-selected='true']");
selectedRow?.scrollIntoView({ block: "nearest", behavior: "auto" });
});
});
}, []);

const content = (
<>
@@ -151,100 +190,146 @@ export function MaterialSelector({
</p>
</div>
) : (
<div
className="space-y-2 max-h-48 sm:max-h-64 overflow-y-auto hide-scrollbar"
style={{ contentVisibility: 'auto' }}
<SelectPopover
sheetTitle="选择视频素材"
onOpen={handleOpenMaterialPopover}
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="w-full rounded-xl border border-white/10 bg-black/25 px-3 py-2.5 text-left transition-colors hover:border-white/30"
>
<span className="flex items-center justify-between gap-3">
<span className="min-w-0">
<span className="block text-xs text-gray-400">已选 {selectedMaterials.length}/4 个素材</span>
<span className="mt-0.5 block truncate text-sm text-white">{selectedSummary}</span>
</span>
<ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
</span>
</button>
)}
>
{materials.map((m) => {
const isSelected = selectedSet.has(m.id);
return (
<div
key={m.id}
ref={(el) => registerMaterialRef(m.id, el)}
className={`p-3 rounded-lg border transition-all flex items-center justify-between group ${isSelected
? "border-purple-500 bg-purple-500/20"
: isFull
? "border-white/5 bg-white/[0.02] opacity-50 cursor-not-allowed"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{editingMaterialId === m.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
value={editMaterialName}
onChange={(e) => onEditNameChange(e.target.value)}
className="flex-1 bg-black/40 border border-white/20 rounded-md px-2 py-1 text-xs text-white"
autoFocus
/>
<button
onClick={(e) => onSaveEditing(m.id, e)}
className="p-1 text-green-400 hover:text-green-300"
title="保存"
>
<Check className="h-4 w-4" />
</button>
<button
onClick={onCancelEditing}
className="p-1 text-gray-400 hover:text-white"
title="取消"
>
<X className="h-4 w-4" />
</button>
</div>
) : (
<button onClick={() => onToggleMaterial(m.id)} disabled={isFull && !isSelected} className="flex-1 text-left flex items-center gap-2">
{/* 复选框 */}
<span
className={`flex-shrink-0 w-4 h-4 rounded border flex items-center justify-center text-[10px] ${isSelected
? "border-purple-500 bg-purple-500 text-white"
: "border-white/30 text-transparent"
}`}
>
{isSelected ? "✓" : ""}
</span>
<div className="min-w-0">
<div className="text-white text-sm truncate">{m.scene || m.name}</div>
<div className="text-gray-400 text-xs">{m.size_mb.toFixed(1)} MB</div>
</div>
</button>
)}
<div className="flex items-center gap-2 pl-2">
<button
onClick={(e) => {
e.stopPropagation();
if (m.path) {
onPreviewMaterial(m.path);
}
}}
className="p-1 text-gray-500 hover:text-white opacity-40 group-hover:opacity-100 transition-opacity"
title="预览视频"
>
<Eye className="h-4 w-4" />
</button>
{editingMaterialId !== m.id && (
<button
onClick={(e) => onStartEditing(m, e)}
className="p-1 text-gray-500 hover:text-white opacity-40 group-hover:opacity-100 transition-opacity"
title="重命名"
>
<Pencil className="h-4 w-4" />
</button>
)}
<button
onClick={(e) => {
e.stopPropagation();
onDeleteMaterial(m.id);
}}
className="p-1 text-gray-500 hover:text-red-400 opacity-40 group-hover:opacity-100 transition-opacity"
title="删除素材"
>
<Trash2 className="h-4 w-4" />
</button>
{() => (
<div className="space-y-2">
<div className="rounded-lg border border-white/10 bg-black/30 px-3 py-2">
<div className="flex items-center gap-2">
<Search className="h-4 w-4 text-gray-400" />
<input
type="text"
value={materialFilter}
onChange={(e) => setMaterialFilter(e.target.value)}
placeholder="搜索素材名称..."
className="w-full bg-transparent text-sm text-white placeholder-gray-500 outline-none"
/>
</div>
</div>
);
})}
</div>

{filteredMaterials.length === 0 ? (
<div className="py-6 text-center text-sm text-gray-400">没有匹配的素材</div>
) : (
<div
ref={materialListContainerRef}
className="space-y-1"
style={{ contentVisibility: "auto" }}
>
{filteredMaterials.map((m) => {
const isSelected = selectedSet.has(m.id);

return (
<div
key={m.id}
ref={(el) => registerMaterialRef(m.id, el)}
data-popover-selected={isSelected ? "true" : undefined}
data-material-selected={isSelected ? "true" : "false"}
className={`flex items-center justify-between gap-2 rounded-lg border px-3 py-2 transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20"
: isFull
? "border-white/5 bg-white/[0.02] opacity-50"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{editingMaterialId === m.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
value={editMaterialName}
onChange={(e) => onEditNameChange(e.target.value)}
className="flex-1 rounded-md border border-white/20 bg-black/40 px-2 py-1 text-xs text-white"
autoFocus
/>
<button
type="button"
onClick={(e) => onSaveEditing(m.id, e)}
className="p-1 text-green-400 hover:text-green-300"
title="保存"
>
<Check className="h-4 w-4" />
</button>
<button
type="button"
onClick={onCancelEditing}
className="p-1 text-gray-400 hover:text-white"
title="取消"
>
<X className="h-4 w-4" />
</button>
</div>
) : (
<button
type="button"
onClick={() => onToggleMaterial(m.id)}
disabled={isFull && !isSelected}
className="min-w-0 flex-1 text-left"
>
<span className="block truncate text-sm text-white">{m.scene || m.name}</span>
<span className="mt-0.5 block text-xs text-gray-400">{m.size_mb.toFixed(1)} MB</span>
</button>
)}

<div className="flex items-center gap-2 pl-2">
<button
type="button"
onClick={(e) => {
e.stopPropagation();
if (m.path) {
onPreviewMaterial(m.path);
}
}}
className="p-1 text-gray-400 hover:text-purple-300"
title="预览视频"
>
<Eye className="h-4 w-4" />
</button>
{editingMaterialId !== m.id && (
<button
type="button"
onClick={(e) => onStartEditing(m, e)}
className="p-1 text-gray-400 hover:text-white"
title="重命名"
>
<Pencil className="h-4 w-4" />
</button>
)}
<button
type="button"
onClick={(e) => {
e.stopPropagation();
onDeleteMaterial(m.id);
}}
className="p-1 text-gray-400 hover:text-red-400"
title="删除素材"
>
<Trash2 className="h-4 w-4" />
</button>
{isSelected && <Check className="h-4 w-4 text-purple-300" />}
</div>
</div>
);
})}
</div>
)}
</div>
)}
</SelectPopover>
)}
</>
);

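The `selectedSummary` memo in the MaterialSelector diff above compresses the selection into at most two names plus a "+N" overflow counter. A standalone sketch of that logic (the `summarizeSelection` name and its placeholder argument are hypothetical, introduced only for illustration):

```typescript
// Mirrors the selectedSummary useMemo: a placeholder when nothing is
// selected, up to two names joined with "、", then "+N" for the rest.
function summarizeSelection(names: string[], placeholder: string): string {
  if (names.length === 0) return placeholder;
  const shown = names.slice(0, 2).join("、");
  return names.length > 2 ? `${shown} +${names.length - 2}` : shown;
}

console.log(summarizeSelection([], "请选择素材(最多4个)"));
console.log(summarizeSelection(["海边", "城市", "夜景"], "-")); // "海边、城市 +1"
```

Truncating to two names keeps the trigger button to a single line, so the popover trigger never wraps regardless of how many of the four slots are filled.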
@@ -1,6 +1,8 @@
|
||||
import { useEffect, useState } from "react";
|
||||
import type { MouseEvent } from "react";
|
||||
import { Upload, RefreshCw, Play, Pause, Pencil, Trash2, Check, X, Mic, Square, RotateCw } from "lucide-react";
|
||||
import { useCallback, useEffect, useMemo, useRef, useState } from "react";
|
||||
import type { ChangeEvent, MouseEvent } from "react";
|
||||
import { Upload, RefreshCw, Play, Pause, Pencil, Trash2, Check, X, Mic, Square, RotateCw, Search, ChevronDown } from "lucide-react";
|
||||
import { SelectPopover } from "@/shared/ui/SelectPopover";
|
||||
import { AppModal, AppModalHeader } from "@/shared/ui/AppModal";
|
||||
|
||||
interface RefAudio {
|
||||
id: string;
|
||||
@@ -36,7 +38,8 @@ interface RefAudioPanelProps {
|
||||
recordingTime: number;
|
||||
onStartRecording: () => void;
|
||||
onStopRecording: () => void;
|
||||
onUseRecording: () => void;
|
||||
onUseRecording: () => void | Promise<void>;
|
||||
onDiscardRecording: () => void;
|
||||
formatRecordingTime: (seconds: number) => string;
|
||||
}
|
||||
|
||||
@@ -68,9 +71,26 @@ export function RefAudioPanel({
|
||||
onStartRecording,
|
||||
onStopRecording,
|
||||
onUseRecording,
|
||||
onDiscardRecording,
|
||||
formatRecordingTime,
|
||||
}: RefAudioPanelProps) {
|
||||
const [recordedUrl, setRecordedUrl] = useState<string | null>(null);
|
||||
const [refAudioFilter, setRefAudioFilter] = useState("");
|
||||
const [recordingModalOpen, setRecordingModalOpen] = useState(false);
|
||||
const [recordedPreviewPlaying, setRecordedPreviewPlaying] = useState(false);
|
||||
const [recordedPreviewCurrentTime, setRecordedPreviewCurrentTime] = useState(0);
|
||||
const [recordedPreviewDuration, setRecordedPreviewDuration] = useState(0);
|
||||
const refAudioListContainerRef = useRef<HTMLDivElement | null>(null);
|
||||
const recordedAudioRef = useRef<HTMLAudioElement | null>(null);
|
||||
|
||||
const stopRecordedPreview = useCallback(() => {
|
||||
const player = recordedAudioRef.current;
|
||||
if (!player) return;
|
||||
player.pause();
|
||||
player.currentTime = 0;
|
||||
setRecordedPreviewPlaying(false);
|
||||
setRecordedPreviewCurrentTime(0);
|
||||
}, []);
|
||||
|
||||
useEffect(() => {
|
||||
if (!recordedBlob) {
|
||||
@@ -88,45 +108,95 @@ export function RefAudioPanel({
|
||||
const needsRetranscribe = (audio: RefAudio) =>
|
||||
audio.ref_text.startsWith(OLD_FIXED_REF_TEXT);
|
||||
|
||||
const selectedRefAudioLabel = selectedRefAudio?.name || "请选择参考音频";
|
||||
const filteredRefAudios = useMemo(() => {
|
||||
const query = refAudioFilter.trim().toLowerCase();
|
||||
if (!query) return refAudios;
|
||||
return refAudios.filter((audio) => audio.name.toLowerCase().includes(query));
|
||||
}, [refAudioFilter, refAudios]);
|
||||
|
||||
const handleOpenRefAudioPopover = useCallback(() => {
|
||||
setRefAudioFilter("");
|
||||
|
||||
requestAnimationFrame(() => {
|
||||
requestAnimationFrame(() => {
|
||||
const container = refAudioListContainerRef.current;
|
||||
if (!container) return;
|
||||
const selectedRow = container.querySelector<HTMLElement>("[data-ref-selected='true']");
|
||||
selectedRow?.scrollIntoView({ block: "nearest", behavior: "auto" });
|
||||
});
|
||||
});
|
||||
}, []);
|
||||
|
||||
const closeRecordingModal = () => {
|
||||
stopRecordedPreview();
|
||||
if (isRecording) {
|
||||
onStopRecording();
|
||||
}
|
||||
setRecordingModalOpen(false);
|
||||
};
|
||||
|
||||
const handleUseRecordingAndClose = () => {
|
||||
stopRecordedPreview();
|
||||
setRecordingModalOpen(false);
|
||||
void onUseRecording();
|
||||
};
|
||||
|
||||
const handleToggleRecordedPreview = () => {
|
||||
const player = recordedAudioRef.current;
|
||||
if (!player) return;
|
||||
|
||||
if (player.paused) {
|
||||
player.play().catch(() => {
|
||||
setRecordedPreviewPlaying(false);
|
||||
});
|
||||
return;
|
||||
}
|
||||
|
||||
player.pause();
|
||||
};
|
||||
|
||||
const handleRecordedSeek = (event: ChangeEvent<HTMLInputElement>) => {
|
||||
const player = recordedAudioRef.current;
|
||||
if (!player) return;
|
||||
const nextTime = Number(event.target.value);
|
||||
player.currentTime = Number.isFinite(nextTime) ? nextTime : 0;
|
||||
setRecordedPreviewCurrentTime(Number.isFinite(nextTime) ? nextTime : 0);
|
||||
};
|
||||
|
||||
const totalRecordedPreviewTime =
|
||||
Number.isFinite(recordedPreviewDuration) && recordedPreviewDuration > 0
|
||||
? recordedPreviewDuration
|
||||
: recordingTime;
|
||||
|
||||
return (
<div className="space-y-4">
<div>
<div className="flex justify-between items-center mb-2">
<span className="text-sm text-gray-300">📁 我的参考音频 <span className="text-xs text-gray-500 font-normal">(上传3-10秒语音样本)</span></span>
<div className="flex gap-2">
<input
type="file"
id="ref-audio-upload"
accept=".wav,.mp3,.m4a,.webm,.ogg,.flac,.aac"
onChange={(e) => {
const file = e.target.files?.[0];
if (file) {
onUploadRefAudio(file);
}
e.target.value = '';
}}
className="hidden"
/>
<label
htmlFor="ref-audio-upload"
className={`px-2 py-1 text-xs rounded cursor-pointer transition-all flex items-center gap-1 ${isUploadingRef
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-purple-600 hover:bg-purple-700 text-white"
}`}
>
<Upload className="h-3.5 w-3.5" />
上传
</label>
<button
onClick={onFetchRefAudios}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 flex items-center gap-1"
>
<RefreshCw className="h-3.5 w-3.5" />
刷新
</button>
</div>
<button
onClick={onFetchRefAudios}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 flex items-center gap-1"
>
<RefreshCw className="h-3.5 w-3.5" />
刷新
</button>
</div>

<input
type="file"
id="ref-audio-upload"
accept=".wav,.mp3,.m4a,.webm,.ogg,.flac,.aac"
onChange={(e) => {
const file = e.target.files?.[0];
if (file) {
onUploadRefAudio(file);
}
e.target.value = "";
}}
className="hidden"
/>

{isUploadingRef && (
<div className="mb-2 p-2 bg-purple-500/10 rounded text-sm text-purple-300">
⏳ 上传并识别中...
@@ -147,146 +217,316 @@ export function RefAudioPanel({
暂无参考音频,请上传或录制
</div>
) : (
<div className="grid grid-cols-2 gap-2" style={{ contentVisibility: 'auto' }}>
{refAudios.map((audio) => (
<div
key={audio.id}
className={`p-2 rounded-lg border transition-all relative group cursor-pointer ${selectedRefAudio?.id === audio.id
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
onClick={() => {
if (editingAudioId !== audio.id) {
onSelectRefAudio(audio);
}
}}
<SelectPopover
sheetTitle="选择参考音频"
onOpen={handleOpenRefAudioPopover}
trigger={({ open, toggle }) => (
<button
type="button"
onClick={toggle}
className="w-full rounded-xl border border-white/10 bg-black/25 px-3 py-2.5 text-left transition-colors hover:border-white/30"
>
{editingAudioId === audio.id ? (
<div className="flex items-center gap-1" onClick={(e) => e.stopPropagation()}>
<span className="flex items-center justify-between gap-3">
<span className="min-w-0">
<span className="block text-xs text-gray-400">当前参考音频</span>
<span className="mt-0.5 block truncate text-sm text-white">{selectedRefAudioLabel}</span>
</span>
<ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
</span>
</button>
)}
>
{({ close }) => (
<div className="space-y-2">
<div className="rounded-lg border border-white/10 bg-black/30 px-3 py-2">
<div className="flex items-center gap-2">
<Search className="h-4 w-4 text-gray-400" />
<input
type="text"
value={editName}
onChange={(e) => onEditNameChange(e.target.value)}
className="w-full bg-black/50 text-white text-xs px-1 py-0.5 rounded border border-purple-500 focus:outline-none"
autoFocus
onKeyDown={(e) => {
if (e.key === 'Enter') onSaveEditing(audio.id, e as unknown as MouseEvent);
if (e.key === 'Escape') onCancelEditing(e as unknown as MouseEvent);
}}
value={refAudioFilter}
onChange={(e) => setRefAudioFilter(e.target.value)}
placeholder="搜索参考音频..."
className="w-full bg-transparent text-sm text-white placeholder-gray-500 outline-none"
/>
<button onClick={(e) => onSaveEditing(audio.id, e)} className="text-green-400 hover:text-green-300 text-xs">
<Check className="h-3 w-3" />
</button>
<button onClick={(e) => onCancelEditing(e)} className="text-gray-400 hover:text-gray-300 text-xs">
<X className="h-3 w-3" />
</button>
</div>
</div>

{filteredRefAudios.length === 0 ? (
<div className="py-6 text-center text-sm text-gray-400">没有匹配的参考音频</div>
) : (
<>
<div className="flex justify-between items-start mb-1">
<div className="text-white text-xs truncate pr-1 flex-1" title={audio.name}>
{audio.name}
</div>
<div className="flex gap-1 opacity-40 group-hover:opacity-100 transition-opacity">
<button
onClick={(e) => onTogglePlayPreview(audio, e)}
className="text-gray-400 hover:text-purple-400 text-xs"
title="试听"
<div ref={refAudioListContainerRef} className="space-y-1" style={{ contentVisibility: "auto" }}>
{filteredRefAudios.map((audio) => {
const isSelected = selectedRefAudio?.id === audio.id;

return (
<div
key={audio.id}
data-popover-selected={isSelected ? "true" : undefined}
data-ref-selected={isSelected ? "true" : "false"}
className={`flex items-center justify-between gap-2 rounded-lg border px-3 py-2 transition-colors ${isSelected
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{playingAudioId === audio.id ? (
<Pause className="h-3.5 w-3.5" />
{editingAudioId === audio.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
type="text"
value={editName}
onChange={(e) => onEditNameChange(e.target.value)}
className="w-full rounded border border-purple-500 bg-black/50 px-2 py-1 text-xs text-white focus:outline-none"
autoFocus
onKeyDown={(e) => {
if (e.key === "Enter") onSaveEditing(audio.id, e as unknown as MouseEvent);
if (e.key === "Escape") onCancelEditing(e as unknown as MouseEvent);
}}
/>
<button type="button" onClick={(e) => onSaveEditing(audio.id, e)} className="text-green-400 hover:text-green-300">
<Check className="h-3.5 w-3.5" />
</button>
<button type="button" onClick={(e) => onCancelEditing(e)} className="text-gray-400 hover:text-gray-300">
<X className="h-3.5 w-3.5" />
</button>
</div>
) : (
<Play className="h-3.5 w-3.5" />
<button
type="button"
onClick={() => {
onSelectRefAudio(audio);
close();
}}
className="min-w-0 flex-1 text-left"
>
<span className="block truncate text-sm text-white" title={audio.name}>{audio.name}</span>
<span className="mt-0.5 block text-xs text-gray-400">
{audio.duration_sec.toFixed(1)}s
{needsRetranscribe(audio) && (
<span className="ml-1 text-yellow-500" title="需要重新识别文字">⚠</span>
)}
</span>
</button>
)}
</button>
<button
onClick={(e) => {
e.stopPropagation();
onRetranscribe(audio.id);
}}
disabled={retranscribingId === audio.id}
className="text-gray-400 hover:text-cyan-400 text-xs disabled:opacity-50"
title="重新识别文字"
>
<RotateCw className={`h-3.5 w-3.5 ${retranscribingId === audio.id ? 'animate-spin' : ''}`} />
</button>
<button
onClick={(e) => onStartEditing(audio, e)}
className="text-gray-400 hover:text-blue-400 text-xs"
title="重命名"
>
<Pencil className="h-3.5 w-3.5" />
</button>
<button
onClick={(e) => {
e.stopPropagation();
onDeleteRefAudio(audio.id);
}}
className="text-gray-400 hover:text-red-400 text-xs"
title="删除"
>
<Trash2 className="h-3.5 w-3.5" />
</button>
</div>
</div>
<div className="text-gray-400 text-xs">
{audio.duration_sec.toFixed(1)}s
{needsRetranscribe(audio) && (
<span className="text-yellow-500 ml-1" title="需要重新识别文字">⚠</span>
)}
</div>
</>

{editingAudioId !== audio.id && (
<div className="flex items-center gap-1 pl-2">
<button
type="button"
onClick={(e) => onTogglePlayPreview(audio, e)}
className="text-gray-400 hover:text-purple-300"
title="试听"
>
{playingAudioId === audio.id ? (
<Pause className="h-3.5 w-3.5" />
) : (
<Play className="h-3.5 w-3.5" />
)}
</button>
<button
type="button"
onClick={(e) => {
e.stopPropagation();
onRetranscribe(audio.id);
}}
disabled={retranscribingId === audio.id}
className="text-gray-400 hover:text-cyan-400 disabled:opacity-50"
title="重新识别文字"
>
<RotateCw className={`h-3.5 w-3.5 ${retranscribingId === audio.id ? "animate-spin" : ""}`} />
</button>
<button
type="button"
onClick={(e) => onStartEditing(audio, e)}
className="text-gray-400 hover:text-blue-400"
title="重命名"
>
<Pencil className="h-3.5 w-3.5" />
</button>
<button
type="button"
onClick={(e) => {
e.stopPropagation();
onDeleteRefAudio(audio.id);
}}
className="text-gray-400 hover:text-red-400"
title="删除"
>
<Trash2 className="h-3.5 w-3.5" />
</button>
{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
</div>
)}
</div>
);
})}
</div>
)}
</div>
))}
</div>
)}
</SelectPopover>
)}
</div>

<div className="border-t border-white/10 pt-4">
<span className="text-sm text-gray-300 mb-2 block">🎤 或在线录音 <span className="text-xs text-gray-500">(建议 3-10 秒,超出将自动截取)</span></span>
<div className="flex gap-2 items-center">
{!isRecording ? (
<button
onClick={onStartRecording}
className="px-4 py-2 bg-red-600 hover:bg-red-700 text-white rounded-lg text-sm font-medium transition-colors flex items-center gap-2"
>
<Mic className="h-4 w-4" />
开始录音
</button>
) : (
<button
onClick={onStopRecording}
className="px-4 py-2 bg-gray-600 hover:bg-gray-700 text-white rounded-lg text-sm font-medium transition-colors flex items-center gap-2"
>
<Square className="h-4 w-4" />
停止
</button>
)}
{isRecording && (
<span className="text-red-400 text-sm animate-pulse">
🔴 录音中 {formatRecordingTime(recordingTime)}
<div className="mt-3 flex flex-wrap items-center justify-end gap-2">
{recordedBlob && !isRecording && (
<span className="mr-auto text-xs text-emerald-300/90">
已录制 {formatRecordingTime(recordingTime)},可点击“在线录音”处理
</span>
)}
<label
htmlFor="ref-audio-upload"
className={`px-3 py-1.5 text-xs rounded-lg cursor-pointer transition-all inline-flex items-center gap-1.5 ${isUploadingRef
? "bg-gray-600 cursor-not-allowed text-gray-400 pointer-events-none"
: "bg-purple-600 hover:bg-purple-700 text-white"
}`}
>
<Upload className="h-3.5 w-3.5" />
上传音频
</label>
<button
type="button"
onClick={() => setRecordingModalOpen(true)}
disabled={isUploadingRef}
className="px-3 py-1.5 text-xs rounded-lg transition-colors bg-red-600 hover:bg-red-700 text-white disabled:bg-gray-600 disabled:text-gray-400 inline-flex items-center gap-1.5"
>
<Mic className="h-3.5 w-3.5" />
在线录音
</button>
</div>

{recordedBlob && !isRecording && (
<div className="mt-3 p-3 bg-green-500/10 border border-green-500/30 rounded-lg">
<div className="flex items-center gap-2 mb-2">
<span className="text-green-300 text-sm">✅ 录音完成 ({formatRecordingTime(recordingTime)})</span>
<audio src={recordedUrl || ''} controls className="h-8" />
</div>
<button
onClick={onUseRecording}
disabled={isUploadingRef}
className="px-3 py-1 bg-green-600 hover:bg-green-700 text-white rounded text-sm disabled:bg-gray-600"
>
使用此录音
</button>
</div>
)}
</div>

{recordingModalOpen && (
<AppModal
isOpen={recordingModalOpen}
onClose={closeRecordingModal}
panelClassName="w-full max-w-lg rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden"
closeOnOverlay={false}
>
<AppModalHeader
title="🎤 在线录音"
subtitle="建议录制 3-10 秒,超出会自动截取到可用长度"
onClose={closeRecordingModal}
/>

<div className="space-y-4 p-4 sm:p-5">
<div className="rounded-xl border border-white/10 bg-black/25 p-3 sm:p-4">
<div className="flex flex-wrap items-center gap-2">
{!isRecording ? (
<button
type="button"
onClick={onStartRecording}
disabled={isUploadingRef}
className="px-4 py-2 rounded-lg text-sm font-medium bg-red-600 hover:bg-red-700 text-white transition-colors disabled:bg-gray-600 disabled:text-gray-400 inline-flex items-center gap-2"
>
<Mic className="h-4 w-4" />
{recordedBlob ? "重新录音" : "开始录音"}
</button>
) : (
<button
type="button"
onClick={onStopRecording}
className="px-4 py-2 rounded-lg text-sm font-medium bg-gray-600 hover:bg-gray-700 text-white transition-colors inline-flex items-center gap-2"
>
<Square className="h-4 w-4" />
停止录音
</button>
)}

{isRecording ? (
<span className="inline-flex items-center gap-1 rounded-full border border-red-400/40 bg-red-500/10 px-3 py-1 text-xs text-red-300 animate-pulse">
<span className="h-1.5 w-1.5 rounded-full bg-red-400" />
录音中 {formatRecordingTime(recordingTime)}
</span>
) : recordedBlob ? (
<span className="inline-flex items-center gap-1 rounded-full border border-emerald-400/30 bg-emerald-500/10 px-3 py-1 text-xs text-emerald-300">
已录制 {formatRecordingTime(recordingTime)}
</span>
) : null}
</div>

{!recordedBlob && !isRecording && (
<p className="mt-3 text-xs text-gray-500">点击“开始录音”后允许麦克风权限,结束后可试听并确认上传</p>
)}
</div>

{recordedBlob && !isRecording && (
<div className="space-y-3 rounded-xl border border-emerald-500/30 bg-emerald-500/10 p-3">
<div className="flex items-center justify-between gap-2">
<span className="text-sm text-emerald-200">✅ 录音完成,可先试听再使用</span>
<span className="text-xs text-emerald-300/80">{formatRecordingTime(recordingTime)}</span>
</div>

<div className="rounded-lg border border-white/10 bg-black/35 px-3 py-2.5">
<audio
key={recordedUrl || "recorded-preview"}
ref={recordedAudioRef}
src={recordedUrl || ""}
className="hidden"
onPlay={() => setRecordedPreviewPlaying(true)}
onPause={() => setRecordedPreviewPlaying(false)}
onEnded={() => {
setRecordedPreviewPlaying(false);
setRecordedPreviewCurrentTime(0);
}}
onTimeUpdate={(event) => setRecordedPreviewCurrentTime(event.currentTarget.currentTime || 0)}
onLoadedMetadata={(event) => setRecordedPreviewDuration(event.currentTarget.duration || 0)}
/>

<div className="flex items-center gap-3">
<button
type="button"
onClick={handleToggleRecordedPreview}
disabled={!recordedUrl}
className="h-8 w-8 shrink-0 rounded-full bg-white/10 hover:bg-white/20 text-emerald-200 disabled:text-gray-500 disabled:bg-white/5 inline-flex items-center justify-center transition-colors"
title={recordedPreviewPlaying ? "暂停试听" : "播放试听"}
>
{recordedPreviewPlaying ? (
<Pause className="h-4 w-4" />
) : (
<Play className="h-4 w-4 translate-x-[1px]" />
)}
</button>

<div className="min-w-0 flex-1">
<input
type="range"
min={0}
max={Math.max(totalRecordedPreviewTime, 0.1)}
step={0.01}
value={Math.min(recordedPreviewCurrentTime, totalRecordedPreviewTime || 0)}
onChange={handleRecordedSeek}
className="w-full h-1.5 cursor-pointer appearance-none rounded-full bg-white/15 accent-emerald-400"
/>
<div className="mt-1 flex items-center justify-between text-[11px] text-emerald-200/80">
<span>{formatRecordingTime(Math.floor(recordedPreviewCurrentTime))}</span>
<span>{formatRecordingTime(Math.floor(totalRecordedPreviewTime))}</span>
</div>
</div>
</div>
</div>

<div className="flex flex-wrap items-center justify-end gap-2">
<button
type="button"
onClick={onDiscardRecording}
disabled={isUploadingRef}
className="px-3 py-1.5 rounded-lg text-sm bg-white/10 hover:bg-white/20 text-gray-200 transition-colors disabled:bg-white/5 disabled:text-gray-500"
>
弃用本次录音
</button>
<button
type="button"
onClick={handleUseRecordingAndClose}
disabled={isUploadingRef}
className="px-3 py-1.5 rounded-lg text-sm bg-green-600 hover:bg-green-700 text-white transition-colors disabled:bg-gray-600 disabled:text-gray-400"
>
使用此录音
</button>
</div>
</div>
)}
</div>
</AppModal>
)}

</div>
);
}

199 frontend/src/features/home/ui/RewriteModal.tsx Normal file
@@ -0,0 +1,199 @@
import { useState, useEffect, useRef, useCallback } from "react";
import { Loader2, Sparkles } from "lucide-react";
import api from "@/shared/api/axios";
import { ApiResponse, unwrap } from "@/shared/api/types";
import { AppModal, AppModalHeader } from "@/shared/ui/AppModal";

const CUSTOM_PROMPT_KEY = "vigent_rewriteCustomPrompt";

interface RewriteModalProps {
isOpen: boolean;
onClose: () => void;
originalText: string;
onApply: (text: string) => void;
}

export default function RewriteModal({
isOpen,
onClose,
originalText,
onApply,
}: RewriteModalProps) {
const [customPrompt, setCustomPrompt] = useState(
() => (typeof window !== "undefined" ? localStorage.getItem(CUSTOM_PROMPT_KEY) || "" : "")
);
const [rewrittenText, setRewrittenText] = useState("");
const [isLoading, setIsLoading] = useState(false);
const [error, setError] = useState<string | null>(null);

// Debounced save customPrompt to localStorage
const debounceRef = useRef<ReturnType<typeof setTimeout>>(undefined);
useEffect(() => {
debounceRef.current = setTimeout(() => {
localStorage.setItem(CUSTOM_PROMPT_KEY, customPrompt);
}, 300);
return () => clearTimeout(debounceRef.current);
}, [customPrompt]);

// Reset state when modal opens
useEffect(() => {
if (isOpen) {
setRewrittenText("");
setError(null);
setIsLoading(false);
}
}, [isOpen]);

const handleRewrite = useCallback(async () => {
if (!originalText.trim()) return;

setIsLoading(true);
setError(null);

try {
const { data: res } = await api.post<
ApiResponse<{ rewritten_text: string }>
>("/api/ai/rewrite", {
text: originalText,
custom_prompt: customPrompt.trim() || null,
});
const payload = unwrap(res);
setRewrittenText(payload.rewritten_text || "");
} catch (err: unknown) {
console.error("AI rewrite failed:", err);
const axiosErr = err as {
response?: { data?: { message?: string } };
message?: string;
};
const msg =
axiosErr.response?.data?.message || axiosErr.message || "改写失败,请重试";
setError(msg);
} finally {
setIsLoading(false);
}
}, [originalText, customPrompt]);

const handleApply = () => {
onApply(rewrittenText);
onClose();
};

const handleRetry = () => {
setRewrittenText("");
setError(null);
};

if (!isOpen) return null;

return (
<AppModal
isOpen={isOpen}
onClose={onClose}
panelClassName="w-full max-w-2xl max-h-[90vh] rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden flex flex-col"
closeOnOverlay={false}
>
<AppModalHeader
title="AI 智能改写"
icon={<Sparkles className="h-5 w-5 text-purple-300" />}
onClose={onClose}
/>

{/* Content */}
<div className="flex-1 overflow-y-auto p-6 space-y-5">
{/* Custom Prompt */}
<div className="space-y-2">
<label className="text-sm text-gray-300">
自定义提示词 (可选)
</label>
<textarea
value={customPrompt}
onChange={(e) => setCustomPrompt(e.target.value)}
placeholder="输入改写要求..."
rows={3}
className="w-full bg-black/20 border border-white/10 rounded-xl px-3 py-2 text-sm text-white placeholder-gray-500 focus:outline-none focus:border-purple-500 transition-colors resize-none"
/>
<p className="text-xs text-gray-500">留空则使用默认提示词</p>
</div>

{/* Action button (before result) */}
{!rewrittenText && (
<button
onClick={handleRewrite}
disabled={isLoading || !originalText.trim()}
className="w-full py-3 px-4 bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-500 hover:to-pink-500 disabled:opacity-50 disabled:cursor-not-allowed text-white rounded-xl transition-all font-medium shadow-lg flex items-center justify-center gap-2"
>
{isLoading ? (
<>
<Loader2 className="w-5 h-5 animate-spin" />
改写中...
</>
) : (
<>
<Sparkles className="w-5 h-5" />
开始改写
</>
)}
</button>
)}

{/* Error */}
{error && (
<div className="bg-red-500/10 border border-red-500/30 rounded-xl p-4">
<p className="text-red-400 text-sm">{error}</p>
</div>
)}

{/* Rewritten result */}
{rewrittenText && (
<>
<div className="space-y-2">
<div className="flex justify-between items-center">
<h4 className="font-semibold text-purple-300 flex items-center gap-2">
<Sparkles className="h-4 w-4" />
AI 改写结果
</h4>
<button
onClick={handleApply}
className="text-xs bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-500 hover:to-pink-500 text-white px-3 py-1.5 rounded-lg transition-colors shadow-sm"
>
使用此结果
</button>
</div>
<div className="bg-purple-900/10 border border-purple-500/20 rounded-xl p-4 max-h-60 overflow-y-auto hide-scrollbar">
<p className="text-gray-200 text-sm leading-relaxed whitespace-pre-wrap">
{rewrittenText}
</p>
</div>
</div>

<div className="space-y-2">
<div className="flex justify-between items-center">
<h4 className="font-semibold text-gray-400 flex items-center gap-2">
📝 原文对比
</h4>
<button
onClick={onClose}
className="text-xs bg-white/10 hover:bg-white/20 text-white px-3 py-1.5 rounded-lg transition-colors"
>
保留原文
</button>
</div>
<div className="bg-white/5 border border-white/10 rounded-xl p-4 max-h-40 overflow-y-auto hide-scrollbar">
<p className="text-gray-400 text-sm leading-relaxed whitespace-pre-wrap">
{originalText}
</p>
</div>
</div>

<button
onClick={handleRetry}
className="w-full py-2.5 px-4 bg-white/10 hover:bg-white/20 text-white rounded-xl transition-colors"
>
重新改写
</button>
</>
)}
</div>
</AppModal>
);
}
@@ -18,6 +18,7 @@ interface ScriptEditorProps {
text: string;
onChangeText: (value: string) => void;
onOpenExtractModal: () => void;
onOpenRewriteModal: () => void;
onGenerateMeta: () => void;
isGeneratingMeta: boolean;
onTranslate: (targetLang: string) => void;
@@ -34,6 +35,7 @@ export function ScriptEditor({
text,
onChangeText,
onOpenExtractModal,
onOpenRewriteModal,
onGenerateMeta,
isGeneratingMeta,
onTranslate,
@@ -45,6 +47,9 @@ export function ScriptEditor({
onLoadScript,
onDeleteScript,
}: ScriptEditorProps) {
const actionBtnBase = "px-3 py-1.5 text-xs rounded-lg transition-colors whitespace-nowrap inline-flex items-center gap-1.5";
const actionBtnDisabled = "bg-gray-600 cursor-not-allowed text-gray-400";

const [showLangMenu, setShowLangMenu] = useState(false);
const langMenuRef = useRef<HTMLDivElement>(null);
const [showHistoryMenu, setShowHistoryMenu] = useState(false);
@@ -93,7 +98,7 @@ export function ScriptEditor({
<div className="relative" ref={historyMenuRef}>
<button
onClick={() => setShowHistoryMenu((prev) => !prev)}
className="h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap bg-gray-600 hover:bg-gray-500 text-white inline-flex items-center gap-1"
className={`${actionBtnBase} bg-gray-600 hover:bg-gray-500 text-white`}
>
<History className="h-3.5 w-3.5" />
历史文案
@@ -135,7 +140,7 @@
</div>
<button
onClick={onOpenExtractModal}
className="h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap bg-purple-600 hover:bg-purple-700 text-white inline-flex items-center gap-1"
className={`${actionBtnBase} bg-purple-600 hover:bg-purple-700 text-white`}
>
<FileText className="h-3.5 w-3.5" />
文案提取助手
@@ -144,9 +149,9 @@
<button
onClick={() => setShowLangMenu((prev) => !prev)}
disabled={isTranslating || !text.trim()}
className={`h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap inline-flex items-center gap-1 ${
className={`${actionBtnBase} ${
isTranslating || !text.trim()
? "bg-gray-600 cursor-not-allowed text-gray-400"
? actionBtnDisabled
: "bg-gradient-to-r from-emerald-600 to-teal-600 hover:from-emerald-700 hover:to-teal-700 text-white"
}`}
>
@@ -191,8 +196,8 @@
<button
onClick={onGenerateMeta}
disabled={isGeneratingMeta || !text.trim()}
className={`h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap inline-flex items-center gap-1 ${isGeneratingMeta || !text.trim()
? "bg-gray-600 cursor-not-allowed text-gray-400"
className={`${actionBtnBase} ${isGeneratingMeta || !text.trim()
? actionBtnDisabled
: "bg-gradient-to-r from-blue-600 to-cyan-600 hover:from-blue-700 hover:to-cyan-700 text-white"
}`}
>
@@ -218,18 +223,32 @@
/>
<div className="flex items-center justify-between mt-2 text-sm text-gray-400">
<span>{text.length} 字</span>
<button
onClick={onSaveScript}
disabled={!text.trim()}
className={`px-2.5 py-1 text-xs rounded transition-all flex items-center gap-1 ${
!text.trim()
? "bg-gray-700 cursor-not-allowed text-gray-500"
: "bg-amber-600/80 hover:bg-amber-600 text-white"
}`}
>
<Save className="h-3 w-3" />
保存文案
</button>
<div className="flex items-center gap-2">
<button
onClick={onOpenRewriteModal}
disabled={!text.trim()}
className={`${actionBtnBase} ${
!text.trim()
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-purple-600 hover:bg-purple-700 text-white"
}`}
>
<Sparkles className="h-3.5 w-3.5" />
AI智能改写
</button>
<button
onClick={onSaveScript}
disabled={!text.trim()}
className={`${actionBtnBase} ${
!text.trim()
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-amber-600 hover:bg-amber-700 text-white"
}`}
>
<Save className="h-3.5 w-3.5" />
保存文案
</button>
</div>
</div>
</div>
);

@@ -3,6 +3,7 @@
import { useEffect, useCallback } from "react";
import { Loader2 } from "lucide-react";
import { useScriptExtraction } from "./script-extraction/useScriptExtraction";
import { AppModal, AppModalHeader } from "@/shared/ui/AppModal";

interface ScriptExtractionModalProps {
isOpen: boolean;
@@ -18,21 +19,14 @@ export default function ScriptExtractionModal({
const {
isLoading,
script,
rewrittenScript,
error,
doRewrite,
step,
dragActive,
selectedFile,
activeTab,
inputUrl,
customPrompt,
showCustomPrompt,
setDoRewrite,
setActiveTab,
setInputUrl,
setCustomPrompt,
setShowCustomPrompt,
handleDrag,
handleDrop,
handleFileChange,
@@ -43,17 +37,15 @@ export default function ScriptExtractionModal({
clearInputUrl,
} = useScriptExtraction({ isOpen });

// 快捷键:ESC 关闭,Enter 提交(仅在 config 步骤)
// 快捷键:Enter 提交(仅在 config 步骤)
const canExtract = (activeTab === "file" && selectedFile) || (activeTab === "url" && inputUrl.trim());

const handleKeyDown = useCallback((e: KeyboardEvent) => {
if (e.key === "Escape") {
onClose();
} else if (e.key === "Enter" && !e.shiftKey && step === "config" && canExtract && !isLoading) {
if (e.key === "Enter" && !e.shiftKey && step === "config" && canExtract && !isLoading) {
e.preventDefault();
handleExtract();
}
}, [onClose, step, canExtract, isLoading, handleExtract]);
}, [step, canExtract, isLoading, handleExtract]);

useEffect(() => {
if (!isOpen) return;
@@ -75,20 +67,13 @@ export default function ScriptExtractionModal({
};

return (
<div className="fixed inset-0 z-50 flex items-center justify-center bg-black/80 backdrop-blur-sm p-4 animate-in fade-in duration-200">
<div className="bg-[#1a1a1a] border border-white/10 rounded-2xl w-full max-w-2xl max-h-[90vh] overflow-hidden flex flex-col shadow-2xl">
{/* Header */}
<div className="flex items-center justify-between p-4 border-b border-white/10 bg-white/5">
<h3 className="text-lg font-semibold text-white flex items-center gap-2">
📜 文案提取助手
</h3>
<button
onClick={onClose}
className="text-gray-400 hover:text-white transition-colors text-2xl leading-none"
>
×
</button>
</div>
<AppModal
isOpen={isOpen}
onClose={onClose}
panelClassName="w-full max-w-2xl max-h-[90vh] rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden flex flex-col"
closeOnOverlay={false}
>
<AppModalHeader title="📜 文案提取助手" onClose={onClose} />

{/* Content */}
<div className="flex-1 overflow-y-auto p-6">
@@ -190,46 +175,6 @@ export default function ScriptExtractionModal({
</div>
)}

{/* Options */}
<div className="bg-white/5 rounded-xl border border-white/10 overflow-hidden">
<div className="flex items-center justify-between p-4">
<label className="flex items-center gap-2 cursor-pointer">
<input
type="checkbox"
checked={doRewrite}
onChange={(e) => setDoRewrite(e.target.checked)}
className="w-4 h-4 rounded bg-white/10 border-white/20 text-purple-500 focus:ring-purple-500"
/>
<span className="text-sm text-gray-300">
AI 智能改写
</span>
</label>
{doRewrite && (
<button
type="button"
onClick={() => setShowCustomPrompt(!showCustomPrompt)}
className="text-xs text-purple-400 hover:text-purple-300 transition-colors flex items-center gap-1"
>
自定义提示词 {showCustomPrompt ? "▲" : "▼"}
</button>
)}
</div>
{doRewrite && showCustomPrompt && (
<div className="px-4 pb-4 space-y-2">
<textarea
value={customPrompt}
onChange={(e) => setCustomPrompt(e.target.value)}
placeholder="输入自定义改写提示词..."
rows={3}
className="w-full bg-black/20 border border-white/10 rounded-lg px-3 py-2 text-sm text-white placeholder-gray-500 focus:outline-none focus:border-purple-500 transition-colors resize-none"
/>
<p className="text-xs text-gray-500">
留空则使用默认提示词
</p>
</div>
)}
</div>

{/* Error */}
{error && (
<div className="bg-red-500/10 border border-red-500/30 rounded-xl p-4">
@@ -273,9 +218,7 @@ export default function ScriptExtractionModal({
<p className="text-sm text-gray-400 text-center max-w-sm px-4">
{activeTab === "url" && "正在下载视频..."}
<br />
{doRewrite
? "正在进行语音识别和 AI 智能改写..."
: "正在进行语音识别..."}
正在进行语音识别...
<br />
<span className="opacity-75">
大文件可能需要几分钟,请不要关闭窗口
@@ -286,60 +229,30 @@
|
||||
|
||||
{step === "result" && (
|
||||
<div className="space-y-6">
|
||||
{rewrittenScript && (
|
||||
<div className="space-y-2">
|
||||
<div className="flex justify-between items-center">
|
||||
<h4 className="font-semibold text-purple-300 flex items-center gap-2">
|
||||
✨ AI 改写结果{" "}
|
||||
<span className="text-xs font-normal text-purple-400/70">
|
||||
(推荐)
|
||||
</span>
|
||||
</h4>
|
||||
<div className="space-y-2">
|
||||
<div className="flex justify-between items-center">
|
||||
<h4 className="font-semibold text-gray-300 flex items-center gap-2">
|
||||
🎙️ 识别结果
|
||||
</h4>
|
||||
<div className="flex items-center gap-2">
|
||||
{onApply && (
|
||||
<button
|
||||
onClick={() => handleApplyAndClose(rewrittenScript)}
|
||||
onClick={() => handleApplyAndClose(script)}
|
||||
className="text-xs bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-500 hover:to-pink-500 text-white px-3 py-1.5 rounded-lg transition-colors flex items-center gap-1 shadow-sm"
|
||||
>
|
||||
📥 填入
|
||||
</button>
|
||||
)}
|
||||
<button
|
||||
onClick={() => copyToClipboard(rewrittenScript)}
|
||||
className="text-xs bg-purple-600 hover:bg-purple-500 text-white px-3 py-1.5 rounded-lg transition-colors flex items-center gap-1"
|
||||
onClick={() => copyToClipboard(script)}
|
||||
className="text-xs bg-white/10 hover:bg-white/20 text-white px-3 py-1.5 rounded-lg transition-colors"
|
||||
>
|
||||
📋 复制内容
|
||||
复制
|
||||
</button>
|
||||
</div>
|
||||
<div className="bg-purple-900/10 border border-purple-500/20 rounded-xl p-4 max-h-60 overflow-y-auto hide-scrollbar">
|
||||
<p className="text-gray-200 text-sm leading-relaxed whitespace-pre-wrap">
|
||||
{rewrittenScript}
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
|
||||
<div className="space-y-2">
|
||||
<div className="flex justify-between items-center">
|
||||
<h4 className="font-semibold text-gray-400 flex items-center gap-2">
|
||||
🎙️ 原始识别结果
|
||||
</h4>
|
||||
{onApply && (
|
||||
<button
|
||||
onClick={() => handleApplyAndClose(script)}
|
||||
className="text-xs bg-white/10 hover:bg-white/20 text-white px-3 py-1.5 rounded-lg transition-colors flex items-center gap-1"
|
||||
>
|
||||
📥 填入
|
||||
</button>
|
||||
)}
|
||||
<button
|
||||
onClick={() => copyToClipboard(script)}
|
||||
className="text-xs bg-white/10 hover:bg-white/20 text-white px-3 py-1.5 rounded-lg transition-colors"
|
||||
>
|
||||
复制
|
||||
</button>
|
||||
</div>
|
||||
<div className="bg-white/5 border border-white/10 rounded-xl p-4 max-h-40 overflow-y-auto hide-scrollbar">
|
||||
<p className="text-gray-400 text-sm leading-relaxed whitespace-pre-wrap">
|
||||
<div className="bg-white/5 border border-white/10 rounded-xl p-4 max-h-60 overflow-y-auto hide-scrollbar">
|
||||
<p className="text-gray-200 text-sm leading-relaxed whitespace-pre-wrap">
|
||||
{script}
|
||||
</p>
|
||||
</div>
|
||||
@@ -356,7 +269,6 @@ export default function ScriptExtractionModal({
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</AppModal>
|
||||
);
|
||||
}
|
||||
|
||||
```diff
@@ -1,8 +1,9 @@
 import { useEffect, useRef, useCallback, useState, useMemo } from "react";
 import WaveSurfer from "wavesurfer.js";
-import { ChevronDown, GripVertical } from "lucide-react";
+import { ChevronDown, GripVertical, Check } from "lucide-react";
 import type { TimelineSegment } from "@/features/home/model/useTimelineEditor";
 import type { Material } from "@/shared/types/material";
+import { SelectPopover } from "@/shared/ui/SelectPopover";

 interface TimelineEditorProps {
 audioDuration: number;
@@ -51,9 +52,7 @@ export function TimelineEditor({
 const [dragFromIdx, setDragFromIdx] = useState<number | null>(null);
 const [dragOverIdx, setDragOverIdx] = useState<number | null>(null);

-// Aspect ratio dropdown
-const [ratioOpen, setRatioOpen] = useState(false);
-const ratioRef = useRef<HTMLDivElement>(null);
+// Aspect ratio options
 const ratioOptions = [
 { value: "9:16" as const, label: "竖屏 9:16" },
 { value: "16:9" as const, label: "横屏 16:9" },
@@ -61,16 +60,6 @@ export function TimelineEditor({
 const currentRatioLabel =
 ratioOptions.find((opt) => opt.value === outputAspectRatio)?.label ?? "竖屏 9:16";

-useEffect(() => {
-const handler = (e: MouseEvent) => {
-if (ratioRef.current && !ratioRef.current.contains(e.target as Node)) {
-setRatioOpen(false);
-}
-};
-if (ratioOpen) document.addEventListener("mousedown", handler);
-return () => document.removeEventListener("mousedown", handler);
-}, [ratioOpen]);
-
 // Create / recreate wavesurfer when audioUrl changes
 useEffect(() => {
 if (!waveRef.current || !audioUrl) return;
@@ -188,37 +177,49 @@ export function TimelineEditor({
 <h3 className="text-sm font-medium text-gray-400">时间轴编辑</h3>
 )}
 <div className="flex items-center gap-2 text-xs text-gray-400">
-<div ref={ratioRef} className="relative">
-<button
-type="button"
-onClick={() => setRatioOpen((v) => !v)}
-className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
-title="设置输出画面比例"
+<div className="shrink-0">
+<SelectPopover
+sheetTitle="设置输出画面比例"
+trigger={({ open, toggle }) => (
+<button
+type="button"
+onClick={toggle}
+className="rounded-lg border border-white/10 bg-black/25 px-2.5 py-1.5 text-left transition-colors hover:border-white/30"
+title="设置输出画面比例"
 >
+<span className="flex items-center justify-between gap-2">
+<span className="truncate text-xs text-white">画面: {currentRatioLabel}</span>
+<ChevronDown className={`h-3.5 w-3.5 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
+</span>
+</button>
+)}
+>
-画面: {currentRatioLabel}
-<ChevronDown className={`h-3 w-3 transition-transform ${ratioOpen ? "rotate-180" : ""}`} />
-</button>
-{ratioOpen && (
-<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[106px]">
-{ratioOptions.map((opt) => (
-<button
-key={opt.value}
-type="button"
-onClick={() => {
-onOutputAspectRatioChange(opt.value);
-setRatioOpen(false);
-}}
-className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
-outputAspectRatio === opt.value
-? "bg-purple-600/40 text-purple-200"
-: "text-gray-300 hover:bg-white/10"
-}`}
->
-{opt.label}
-</button>
-))}
-</div>
-)}
+{({ close }) => (
+<div className="space-y-1">
+{ratioOptions.map((opt) => {
+const isSelected = outputAspectRatio === opt.value;
+return (
+<button
+key={opt.value}
+type="button"
+data-popover-selected={isSelected ? "true" : undefined}
+onClick={() => {
+onOutputAspectRatioChange(opt.value);
+close();
+}}
+className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left transition-colors ${isSelected
+? "border-purple-500 bg-purple-500/20"
+: "border-white/10 bg-white/5 hover:border-white/30"
+}`}
+>
+<span className="text-xs text-white">{opt.label}</span>
+{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
+</button>
+);
+})}
+</div>
+)}
+</SelectPopover>
 </div>

 {audioUrl && (
```
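The hunks above (and the `TitleSubtitlePanel` and `VoiceSelector` changes that follow) replace hand-rolled dropdowns with the shared `SelectPopover` from `@/shared/ui/SelectPopover`. Its implementation is not part of this diff; the sketch below is only an inferred model of the render-prop contract these call sites rely on — a `trigger` prop receiving `{ open, toggle }` and a children function receiving `{ close }` — with illustrative names, not the real component.

```typescript
// Hypothetical plain-TypeScript model of the SelectPopover contract.
// (Assumption: TriggerProps/BodyProps and PopoverModel are illustrative;
// the real component renders an actual popover or bottom sheet.)
type TriggerProps = { open: boolean; toggle: () => void };
type BodyProps = { close: () => void };

class PopoverModel {
  open = false;
  toggle = (): void => { this.open = !this.open; };
  close = (): void => { this.open = false; };

  // Mirrors `trigger={({ open, toggle }) => ...}` at the call sites.
  renderTrigger(render: (p: TriggerProps) => string): string {
    return render({ open: this.open, toggle: this.toggle });
  }

  // Mirrors the `{({ close }) => ...}` children: rendered only while open;
  // an option's onClick commits the selection, then calls close().
  renderBody(render: (p: BodyProps) => string): string | null {
    return this.open ? render({ close: this.close }) : null;
  }
}

const model = new PopoverModel();
model.renderTrigger(({ open }) => `画面: 9:16 ${open ? "▲" : "▼"}`); // closed state
model.toggle(); // user clicks the trigger
model.renderBody(({ close }) => {
  close(); // selecting an option closes the popover
  return "9:16 | 16:9";
});
```

Under this model, the `data-popover-selected` attribute seen in the hunks would be a hook for the real component to scroll the selected option into view — that, too, is an assumption.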
```diff
@@ -1,5 +1,6 @@
-import { ChevronDown, Eye } from "lucide-react";
+import { ChevronDown, Eye, Check } from "lucide-react";
 import { FloatingStylePreview } from "@/features/home/ui/FloatingStylePreview";
+import { SelectPopover } from "@/shared/ui/SelectPopover";

 interface SubtitleStyleOption {
 id: string;
@@ -69,6 +70,7 @@ interface TitleSubtitlePanelProps {
 buildTextShadow: (color: string, size: number) => string;
 previewBaseWidth?: number;
 previewBaseHeight?: number;
+previewBackgroundUrl?: string | null;
 }

 export function TitleSubtitlePanel({
@@ -109,25 +111,68 @@ export function TitleSubtitlePanel({
 buildTextShadow,
 previewBaseWidth = 1080,
 previewBaseHeight = 1920,
+previewBackgroundUrl,
 }: TitleSubtitlePanelProps) {
+const titleDisplayOptions: Array<{ value: "short" | "persistent"; label: string }> = [
+{ value: "short", label: "标题短暂显示" },
+{ value: "persistent", label: "标题常驻显示" },
+];
+const currentTitleDisplay = titleDisplayOptions.find((opt) => opt.value === titleDisplayMode) || titleDisplayOptions[0];
+
+const currentTitleStyle = titleStyles.find((style) => style.id === selectedTitleStyleId) || titleStyles[0] || null;
+const currentSecondaryTitleStyle = titleStyles.find((style) => style.id === selectedSecondaryTitleStyleId) || titleStyles[0] || null;
+const currentSubtitleStyle = subtitleStyles.find((style) => style.id === selectedSubtitleStyleId) || subtitleStyles[0] || null;
+
 return (
 <div className="bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
 <div className="flex items-center justify-between mb-4 gap-2">
 <h2 className="text-base sm:text-lg font-semibold text-white flex items-center gap-2">
-二、标题与字幕
+四、标题与字幕
 </h2>
 <div className="flex items-center gap-1.5">
-<div className="relative shrink-0">
-<select
-value={titleDisplayMode}
-onChange={(e) => onTitleDisplayModeChange(e.target.value as "short" | "persistent")}
-className="appearance-none rounded-lg border border-white/15 bg-black/35 px-2.5 py-1.5 pr-7 text-xs text-gray-200 outline-none transition-colors hover:border-white/25 focus:border-purple-500"
-aria-label="标题显示方式"
+<div className="shrink-0">
+<SelectPopover
+sheetTitle="标题显示方式"
+trigger={({ open, toggle }) => (
+<button
+type="button"
+onClick={toggle}
+className="min-w-[146px] rounded-lg border border-white/10 bg-black/25 px-2.5 py-1.5 text-left text-xs text-gray-200 transition-colors hover:border-white/30"
+aria-label="标题显示方式"
 >
+<span className="flex items-center justify-between gap-2">
+<span className="whitespace-nowrap">{currentTitleDisplay.label}</span>
+<ChevronDown className={`h-3.5 w-3.5 text-gray-400 transition-transform ${open ? "rotate-180" : ""}`} />
+</span>
+</button>
+)}
+>
-<option value="short">标题短暂显示</option>
-<option value="persistent">标题常驻显示</option>
-</select>
-<ChevronDown className="pointer-events-none absolute right-2 top-1/2 h-3.5 w-3.5 -translate-y-1/2 text-gray-400" />
+{({ close }) => (
+<div className="space-y-1">
+{titleDisplayOptions.map((opt) => {
+const isSelected = opt.value === titleDisplayMode;
+return (
+<button
+key={opt.value}
+type="button"
+data-popover-selected={isSelected ? "true" : undefined}
+onClick={() => {
+onTitleDisplayModeChange(opt.value);
+close();
+}}
+className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left transition-colors ${isSelected
+? "border-purple-500 bg-purple-500/20"
+: "border-white/10 bg-white/5 hover:border-white/30"
+}`}
+>
+<span className="text-xs text-white whitespace-nowrap">{opt.label}</span>
+{isSelected && <Check className="h-3.5 w-3.5 text-purple-300" />}
+</button>
+);
+})}
+</div>
+)}
+</SelectPopover>
 </div>
 <button
 onClick={onTogglePreview}
@@ -161,6 +206,7 @@ export function TitleSubtitlePanel({
 buildTextShadow={buildTextShadow}
 previewBaseWidth={previewBaseWidth}
 previewBaseHeight={previewBaseHeight}
+previewBackgroundUrl={previewBackgroundUrl}
 />
 )}

@@ -200,17 +246,48 @@ export function TitleSubtitlePanel({
 <div className="mb-4 space-y-3">
 <div className="flex items-center gap-3">
 <label className="text-sm text-gray-300 shrink-0 w-20">标题样式</label>
-<div className="relative w-1/3 min-w-[100px]">
-<select
-value={selectedTitleStyleId}
-onChange={(e) => onSelectTitleStyle(e.target.value)}
-className="w-full appearance-none rounded-lg border border-white/15 bg-black/35 px-3 py-2 pr-8 text-sm text-gray-200 outline-none transition-colors hover:border-white/25 focus:border-purple-500"
+<div className="w-1/3 min-w-[130px]">
+<SelectPopover
+sheetTitle="标题样式"
+trigger={({ open, toggle }) => (
+<button
+type="button"
+onClick={toggle}
+className="w-full rounded-lg border border-white/15 bg-black/35 px-3 py-2 text-left text-sm text-gray-200 transition-colors hover:border-white/25"
+>
+<span className="flex items-center justify-between gap-2">
+<span className="truncate">{currentTitleStyle?.label || "请选择"}</span>
+<ChevronDown className={`h-3.5 w-3.5 text-gray-400 transition-transform ${open ? "rotate-180" : ""}`} />
+</span>
+</button>
+)}
 >
-{titleStyles.map((style) => (
-<option key={style.id} value={style.id}>{style.label}</option>
-))}
-</select>
-<ChevronDown className="pointer-events-none absolute right-2.5 top-1/2 h-3.5 w-3.5 -translate-y-1/2 text-gray-400" />
+{({ close }) => (
+<div className="space-y-1">
+{titleStyles.map((style) => {
+const isSelected = selectedTitleStyleId === style.id;
+return (
+<button
+key={style.id}
+type="button"
+data-popover-selected={isSelected ? "true" : undefined}
+onClick={() => {
+onSelectTitleStyle(style.id);
+close();
+}}
+className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left transition-colors ${isSelected
+? "border-purple-500 bg-purple-500/20"
+: "border-white/10 bg-white/5 hover:border-white/30"
+}`}
+>
+<span className="text-sm text-white">{style.label}</span>
+{isSelected && <Check className="h-4 w-4 text-purple-300" />}
+</button>
+);
+})}
+</div>
+)}
+</SelectPopover>
 </div>
 </div>
 <div className="flex items-center gap-3">
@@ -228,17 +305,48 @@ export function TitleSubtitlePanel({
 <div className="mb-4 space-y-3">
 <div className="flex items-center gap-3">
 <label className="text-sm text-gray-300 shrink-0 w-20">副标题样式</label>
-<div className="relative w-1/3 min-w-[100px]">
-<select
-value={selectedSecondaryTitleStyleId}
-onChange={(e) => onSelectSecondaryTitleStyle(e.target.value)}
-className="w-full appearance-none rounded-lg border border-white/15 bg-black/35 px-3 py-2 pr-8 text-sm text-gray-200 outline-none transition-colors hover:border-white/25 focus:border-purple-500"
+<div className="w-1/3 min-w-[130px]">
+<SelectPopover
+sheetTitle="副标题样式"
+trigger={({ open, toggle }) => (
+<button
+type="button"
+onClick={toggle}
+className="w-full rounded-lg border border-white/15 bg-black/35 px-3 py-2 text-left text-sm text-gray-200 transition-colors hover:border-white/25"
+>
+<span className="flex items-center justify-between gap-2">
+<span className="truncate">{currentSecondaryTitleStyle?.label || "请选择"}</span>
+<ChevronDown className={`h-3.5 w-3.5 text-gray-400 transition-transform ${open ? "rotate-180" : ""}`} />
+</span>
+</button>
+)}
 >
-{titleStyles.map((style) => (
-<option key={style.id} value={style.id}>{style.label}</option>
-))}
-</select>
-<ChevronDown className="pointer-events-none absolute right-2.5 top-1/2 h-3.5 w-3.5 -translate-y-1/2 text-gray-400" />
+{({ close }) => (
+<div className="space-y-1">
+{titleStyles.map((style) => {
+const isSelected = selectedSecondaryTitleStyleId === style.id;
+return (
+<button
+key={style.id}
+type="button"
+data-popover-selected={isSelected ? "true" : undefined}
+onClick={() => {
+onSelectSecondaryTitleStyle(style.id);
+close();
+}}
+className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left transition-colors ${isSelected
+? "border-purple-500 bg-purple-500/20"
+: "border-white/10 bg-white/5 hover:border-white/30"
+}`}
+>
+<span className="text-sm text-white">{style.label}</span>
+{isSelected && <Check className="h-4 w-4 text-purple-300" />}
+</button>
+);
+})}
+</div>
+)}
+</SelectPopover>
 </div>
 </div>
 <div className="flex items-center gap-3">
@@ -256,17 +364,48 @@ export function TitleSubtitlePanel({
 <div className="mt-4 space-y-3">
 <div className="flex items-center gap-3">
 <label className="text-sm text-gray-300 shrink-0 w-20">字幕样式</label>
-<div className="relative w-1/3 min-w-[100px]">
-<select
-value={selectedSubtitleStyleId}
-onChange={(e) => onSelectSubtitleStyle(e.target.value)}
-className="w-full appearance-none rounded-lg border border-white/15 bg-black/35 px-3 py-2 pr-8 text-sm text-gray-200 outline-none transition-colors hover:border-white/25 focus:border-purple-500"
+<div className="w-1/3 min-w-[130px]">
+<SelectPopover
+sheetTitle="字幕样式"
+trigger={({ open, toggle }) => (
+<button
+type="button"
+onClick={toggle}
+className="w-full rounded-lg border border-white/15 bg-black/35 px-3 py-2 text-left text-sm text-gray-200 transition-colors hover:border-white/25"
+>
+<span className="flex items-center justify-between gap-2">
+<span className="truncate">{currentSubtitleStyle?.label || "请选择"}</span>
+<ChevronDown className={`h-3.5 w-3.5 text-gray-400 transition-transform ${open ? "rotate-180" : ""}`} />
+</span>
+</button>
+)}
 >
-{subtitleStyles.map((style) => (
-<option key={style.id} value={style.id}>{style.label}</option>
-))}
-</select>
-<ChevronDown className="pointer-events-none absolute right-2.5 top-1/2 h-3.5 w-3.5 -translate-y-1/2 text-gray-400" />
+{({ close }) => (
+<div className="space-y-1">
+{subtitleStyles.map((style) => {
+const isSelected = selectedSubtitleStyleId === style.id;
+return (
+<button
+key={style.id}
+type="button"
+data-popover-selected={isSelected ? "true" : undefined}
+onClick={() => {
+onSelectSubtitleStyle(style.id);
+close();
+}}
+className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left transition-colors ${isSelected
+? "border-purple-500 bg-purple-500/20"
+: "border-white/10 bg-white/5 hover:border-white/30"
+}`}
+>
+<span className="text-sm text-white">{style.label}</span>
+{isSelected && <Check className="h-4 w-4 text-purple-300" />}
+</button>
+);
+})}
+</div>
+)}
+</SelectPopover>
 </div>
 </div>
 <div className="flex items-center gap-3">
```
```diff
@@ -1,11 +1,34 @@
-import type { ReactNode } from "react";
-import { Mic, Volume2 } from "lucide-react";
+import { useCallback, useEffect, useRef, useState, type MouseEvent, type ReactNode } from "react";
+import { Check, ChevronDown, Loader2, Mic, Pause, Play, Volume2 } from "lucide-react";
+import { toast } from "sonner";
+import { SelectPopover } from "@/shared/ui/SelectPopover";

 interface VoiceOption {
 id: string;
 name: string;
 }

+const LOCALE_LABELS: Record<string, string> = {
+"zh-CN": "中文",
+"en-US": "English",
+"ja-JP": "日本語",
+"ko-KR": "한국어",
+"fr-FR": "Français",
+"de-DE": "Deutsch",
+"es-ES": "Español",
+"ru-RU": "Русский",
+"it-IT": "Italiano",
+"pt-BR": "Português",
+};
+
+const getLocaleFromVoiceId = (voiceId: string) => {
+const parts = voiceId.split("-");
+if (parts.length >= 2) {
+return `${parts[0]}-${parts[1]}`;
+}
+return voiceId;
+};
+
 interface VoiceSelectorProps {
 ttsMode: "edgetts" | "voiceclone";
 onSelectTtsMode: (mode: "edgetts" | "voiceclone") => void;
@@ -25,6 +48,102 @@ export function VoiceSelector({
 voiceCloneSlot,
 embedded = false,
 }: VoiceSelectorProps) {
+const selectedVoice = voices.find((v) => v.id === voice) ?? voices[0];
+const selectedLocale = selectedVoice ? getLocaleFromVoiceId(selectedVoice.id) : "";
+const selectedLangLabel = LOCALE_LABELS[selectedLocale] ?? selectedLocale;
+
+const [previewingVoiceId, setPreviewingVoiceId] = useState<string | null>(null);
+const [previewLoadingVoiceId, setPreviewLoadingVoiceId] = useState<string | null>(null);
+const previewPlayerRef = useRef<HTMLAudioElement | null>(null);
+const previewRequestIdRef = useRef(0);
+
+const stopVoicePreview = useCallback(() => {
+previewRequestIdRef.current += 1;
+
+if (previewPlayerRef.current) {
+previewPlayerRef.current.pause();
+previewPlayerRef.current.src = "";
+previewPlayerRef.current.currentTime = 0;
+previewPlayerRef.current = null;
+}
+setPreviewingVoiceId(null);
+setPreviewLoadingVoiceId(null);
+}, []);
+
+useEffect(() => () => {
+stopVoicePreview();
+}, [stopVoicePreview]);
+
+useEffect(() => {
+if (ttsMode !== "edgetts") {
+stopVoicePreview();
+}
+}, [ttsMode, stopVoicePreview]);
+
+const handleVoicePreview = useCallback(async (voiceId: string, e: MouseEvent<HTMLButtonElement>) => {
+e.stopPropagation();
+
+if (previewingVoiceId === voiceId) {
+stopVoicePreview();
+return;
+}
+
+stopVoicePreview();
+setPreviewLoadingVoiceId(voiceId);
+const requestId = ++previewRequestIdRef.current;
+
+try {
+const audioUrl = `/api/videos/voice-preview?voice=${encodeURIComponent(voiceId)}`;
+const player = new Audio(audioUrl);
+previewPlayerRef.current = player;
+let errorNotified = false;
+
+const notifyPreviewError = () => {
+if (errorNotified) return;
+errorNotified = true;
+toast.error("音色试听失败,请稍后重试");
+};
+
+player.onplaying = () => {
+if (requestId === previewRequestIdRef.current) {
+setPreviewLoadingVoiceId(null);
+setPreviewingVoiceId(voiceId);
+}
+};
+
+player.onended = () => {
+if (previewPlayerRef.current === player) {
+previewPlayerRef.current = null;
+setPreviewingVoiceId(null);
+setPreviewLoadingVoiceId(null);
+}
+};
+
+player.onerror = () => {
+if (previewPlayerRef.current === player) {
+previewPlayerRef.current = null;
+setPreviewingVoiceId(null);
+setPreviewLoadingVoiceId(null);
+notifyPreviewError();
+}
+};
+
+await player.play();
+
+if (requestId !== previewRequestIdRef.current) {
+player.pause();
+player.src = "";
+player.currentTime = 0;
+}
+} catch {
+toast.error("音色试听失败,请稍后重试");
+} finally {
+if (requestId === previewRequestIdRef.current) {
+setPreviewLoadingVoiceId(null);
+}
+}
+}, [previewingVoiceId, stopVoicePreview]);
+
 const content = (
 <>
 <div className="flex gap-2 mb-4">
@@ -51,19 +170,86 @@ export function VoiceSelector({
 </div>

 {ttsMode === "edgetts" && (
-<div className="grid grid-cols-2 gap-3">
-{voices.map((v) => (
-<button
-key={v.id}
-onClick={() => onSelectVoice(v.id)}
-className={`p-3 rounded-xl border-2 transition-all text-left ${voice === v.id
-? "border-purple-500 bg-purple-500/20"
-: "border-white/10 bg-white/5 hover:border-white/30"
-}`}
->
-<span className="text-white text-sm">{v.name}</span>
-</button>
-))}
+<div className="space-y-2">
+<p className="text-xs text-gray-400">音色选择</p>
+<SelectPopover
+sheetTitle="选择声音"
+trigger={({ open, toggle }) => (
+<button
+type="button"
+onClick={toggle}
+className="w-full rounded-xl border border-white/10 bg-black/25 px-3 py-2.5 text-left hover:border-white/30 transition-colors"
+>
+<span className="flex items-center justify-between gap-3">
+<span className="min-w-0">
+<span className="block truncate text-sm text-white">
+{selectedVoice?.name || "请选择声音"}
+</span>
+<span className="block text-xs text-gray-400">
+{selectedLangLabel || "未识别语言"}
+</span>
+</span>
+<ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
+</span>
+</button>
+)}
+>
+{({ close }) => (
+<div className="space-y-1">
+{voices.map((v) => {
+const isSelected = voice === v.id;
+const isPreviewing = previewingVoiceId === v.id;
+const isPreviewLoading = previewLoadingVoiceId === v.id;
+const locale = getLocaleFromVoiceId(v.id);
+const langLabel = LOCALE_LABELS[locale] ?? locale;
+
+return (
+<div
+key={v.id}
+data-popover-selected={isSelected ? "true" : undefined}
+className={`flex w-full items-center justify-between rounded-lg border px-3 py-2 text-left transition-colors ${isSelected
+? "border-purple-500 bg-purple-500/20"
+: "border-white/10 bg-white/5 hover:border-white/30"
+}`}
+>
+<button
+type="button"
+onClick={() => {
+stopVoicePreview();
+onSelectVoice(v.id);
+close();
+}}
+className="min-w-0 flex-1 text-left"
+>
+<span className="block truncate text-sm text-white">{v.name}</span>
+<span className="mt-0.5 block text-xs text-gray-400">{langLabel}</span>
+</button>
+
+<div className="flex items-center gap-2 pl-2">
+<button
+type="button"
+onClick={(e) => {
+void handleVoicePreview(v.id, e);
+}}
+className="p-1 text-gray-400 hover:text-purple-300 transition-colors"
+title={isPreviewing ? "停止试听" : "试听"}
+>
+{isPreviewLoading ? (
+<Loader2 className="h-4 w-4 animate-spin" />
+) : isPreviewing ? (
+<Pause className="h-4 w-4" />
+) : (
+<Play className="h-4 w-4" />
+)}
+</button>
+{isSelected && <Check className="h-4 w-4 text-purple-300" />}
+</div>
+</div>
+);
+})}
+</div>
+)}
+</SelectPopover>
+</div>
 )}
```
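The `getLocaleFromVoiceId` helper added above is pure and easy to check in isolation. The restatement below copies the helper and two `LOCALE_LABELS` entries verbatim from the hunk; the voice IDs used in the examples are assumed to follow the `<lang>-<REGION>-<Name>` shape (Edge TTS-style) that the helper expects.

```typescript
// Copied from the VoiceSelector hunk: maps a voice ID like
// "zh-CN-XiaoxiaoNeural" to its locale prefix, falling back to the raw ID.
const LOCALE_LABELS: Record<string, string> = {
  "zh-CN": "中文",
  "ja-JP": "日本語",
};

const getLocaleFromVoiceId = (voiceId: string): string => {
  const parts = voiceId.split("-");
  if (parts.length >= 2) {
    // Keep only the first two segments, e.g. "zh" + "CN".
    return `${parts[0]}-${parts[1]}`;
  }
  return voiceId;
};

// A full ID keeps only its locale prefix; an ID with no "-" passes through.
getLocaleFromVoiceId("zh-CN-XiaoxiaoNeural"); // → "zh-CN"
getLocaleFromVoiceId("custom");               // → "custom"
LOCALE_LABELS[getLocaleFromVoiceId("ja-JP-NanamiNeural")]; // → "日本語"
```

The `?? locale` fallback at the call site means an unlisted locale (say, `th-TH`) is shown as its raw locale code rather than an empty label.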
```diff
@@ -1,4 +1,4 @@
-import { useState, useEffect, useCallback, useRef } from "react";
+import { useState, useEffect, useCallback } from "react";
 import api from "@/shared/api/axios";
 import { ApiResponse, unwrap } from "@/shared/api/types";
 import { toast } from "sonner";
@@ -7,7 +7,6 @@ export type ExtractionStep = "config" | "processing" | "result";
 export type InputTab = "file" | "url";

 const VALID_FILE_TYPES = [".mp4", ".mov", ".avi", ".mp3", ".wav", ".m4a"];
-const CUSTOM_PROMPT_KEY = "vigent_rewriteCustomPrompt";

 interface UseScriptExtractionOptions {
 isOpen: boolean;
@@ -16,32 +15,18 @@ interface UseScriptExtractionOptions {
 export const useScriptExtraction = ({ isOpen }: UseScriptExtractionOptions) => {
 const [isLoading, setIsLoading] = useState(false);
 const [script, setScript] = useState("");
-const [rewrittenScript, setRewrittenScript] = useState("");
 const [error, setError] = useState<string | null>(null);
-const [doRewrite, setDoRewrite] = useState(true);
 const [step, setStep] = useState<ExtractionStep>("config");
 const [dragActive, setDragActive] = useState(false);
 const [selectedFile, setSelectedFile] = useState<File | null>(null);
 const [activeTab, setActiveTab] = useState<InputTab>("url");
 const [inputUrl, setInputUrl] = useState("");
-const [customPrompt, setCustomPrompt] = useState(() => typeof window !== "undefined" ? localStorage.getItem(CUSTOM_PROMPT_KEY) || "" : "");
-const [showCustomPrompt, setShowCustomPrompt] = useState(false);
-
-// Debounced save customPrompt to localStorage
-const debounceRef = useRef<ReturnType<typeof setTimeout>>(undefined);
-useEffect(() => {
-debounceRef.current = setTimeout(() => {
-localStorage.setItem(CUSTOM_PROMPT_KEY, customPrompt);
-}, 300);
-return () => clearTimeout(debounceRef.current);
-}, [customPrompt]);
-
-// Reset state when modal opens (customPrompt is persistent, not reset)
+// Reset state when modal opens
 useEffect(() => {
 if (isOpen) {
 setStep("config");
 setScript("");
-setRewrittenScript("");
 setError(null);
 setIsLoading(false);
 setSelectedFile(null);
@@ -112,13 +97,10 @@ export const useScriptExtraction = ({ isOpen }: UseScriptExtractionOptions) => {
 } else if (activeTab === "url") {
 formData.append("url", inputUrl.trim());
 }
-formData.append("rewrite", doRewrite ? "true" : "false");
-if (doRewrite && customPrompt.trim()) {
-formData.append("custom_prompt", customPrompt.trim());
-}
+formData.append("rewrite", "false");

 const { data: res } = await api.post<
-ApiResponse<{ original_script: string; rewritten_script?: string }>
+ApiResponse<{ original_script: string }>
 >("/api/tools/extract-script", formData, {
 headers: { "Content-Type": "multipart/form-data" },
 timeout: 180000, // 3 minutes timeout
@@ -126,7 +108,6 @@ export const useScriptExtraction = ({ isOpen }: UseScriptExtractionOptions) => {

 const payload = unwrap(res);
 setScript(payload.original_script);
-setRewrittenScript(payload.rewritten_script || "");
 setStep("result");
 } catch (err: unknown) {
 console.error(err);
@@ -141,7 +122,7 @@ export const useScriptExtraction = ({ isOpen }: UseScriptExtractionOptions) => {
 } finally {
 setIsLoading(false);
 }
-}, [activeTab, selectedFile, inputUrl, doRewrite, customPrompt]);
+}, [activeTab, selectedFile, inputUrl]);

 const copyToClipboard = useCallback((text: string) => {
 if (navigator.clipboard && window.isSecureContext) {
@@ -200,22 +181,15 @@ export const useScriptExtraction = ({ isOpen }: UseScriptExtractionOptions) => {
 // State
 isLoading,
 script,
-rewrittenScript,
 error,
-doRewrite,
 step,
 dragActive,
 selectedFile,
 activeTab,
 inputUrl,
-customPrompt,
-showCustomPrompt,
 // Setters
-setDoRewrite,
 setActiveTab,
 setInputUrl,
-setCustomPrompt,
-setShowCustomPrompt,
 // Handlers
 handleDrag,
 handleDrop,
```
@@ -83,6 +83,8 @@ export const usePublishController = () => {
setVideos(nextVideos);
if (nextVideos.length > 0 && autoSelectLatest) {
  setSelectedVideo(nextVideos[0].id);
  // Write a cross-page shared marker so the home page can also detect the latest generated video
  localStorage.setItem(`vigent_${getStorageKey()}_latestGeneratedVideoId`, nextVideos[0].id);
}
updatePrefetch({ videos: nextVideos });
} catch (error) {

@@ -109,16 +111,23 @@ export const usePublishController = () => {

// ---- Video selection restore (a single effect with minimal conditions) ----
// Wait for auth to finish and videos to load, restore once, then never run again
// Prefer the cross-page shared marker (latest generated video), then fall back to the last selection
useEffect(() => {
  if (isAuthLoading || videos.length === 0 || videoRestoredRef.current) return;
  videoRestoredRef.current = true;

  const key = getStorageKey();
  const saved = localStorage.getItem(`vigent_${key}_publish_selected_video`);
  if (saved && videos.some(v => v.id === saved)) {
    setSelectedVideo(saved);
  const latestId = localStorage.getItem(`vigent_${key}_latestGeneratedVideoId`);
  if (latestId && videos.some(v => v.id === latestId)) {
    setSelectedVideo(latestId);
    localStorage.removeItem(`vigent_${key}_latestGeneratedVideoId`);
  } else {
    setSelectedVideo(videos[0].id);
    const saved = localStorage.getItem(`vigent_${key}_publish_selected_video`);
    if (saved && videos.some(v => v.id === saved)) {
      setSelectedVideo(saved);
    } else {
      setSelectedVideo(videos[0].id);
    }
  }
}, [isAuthLoading, videos, getStorageKey]);
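The restore effect above resolves the initial selection by a priority rule: the cross-page "latest generated" marker wins if it still matches a video in the list, then the previously saved selection, then the first video. A minimal sketch of that rule as a pure function (the names `resolveInitialSelection` and `VideoItem` are illustrative, not part of the codebase):

```typescript
interface VideoItem {
  id: string;
}

// Priority: 1) cross-page "latest generated" marker if still present,
// 2) previously saved selection if still present, 3) first video.
function resolveInitialSelection(
  videos: VideoItem[],
  latestGeneratedId: string | null,
  savedId: string | null,
): string | null {
  if (videos.length === 0) return null;
  if (latestGeneratedId && videos.some((v) => v.id === latestGeneratedId)) {
    return latestGeneratedId;
  }
  if (savedId && videos.some((v) => v.id === savedId)) {
    return savedId;
  }
  return videos[0].id;
}

const demoVideos: VideoItem[] = [{ id: "a" }, { id: "b" }, { id: "c" }];
```

Stale markers (a `latestGeneratedId` that no longer matches any video) fall through to the saved selection, which mirrors the `videos.some(...)` guards in the effect.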
@@ -222,6 +231,29 @@ export const usePublishController = () => {

// ---- Action functions ----

const runWithConcurrency = async <T,>(
  taskFactories: Array<() => Promise<T>>,
  concurrency: number
): Promise<T[]> => {
  if (taskFactories.length === 0) return [];

  const results: T[] = new Array(taskFactories.length);
  let nextIndex = 0;

  const worker = async () => {
    while (true) {
      const currentIndex = nextIndex;
      nextIndex += 1;
      if (currentIndex >= taskFactories.length) return;
      results[currentIndex] = await taskFactories[currentIndex]();
    }
  };

  const workerCount = Math.min(Math.max(concurrency, 1), taskFactories.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
};
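The helper above is a worker-pool: N workers pull task factories from a shared cursor, so at most `concurrency` tasks are in flight while results keep their original order (safe in single-threaded JS because there is no `await` between reading and advancing the cursor). A self-contained sketch of the same pattern with a usage example (`runLimited` is an illustrative name, not the codebase's):

```typescript
// Worker-pool sketch: `concurrency` workers share one index cursor.
async function runLimited<T>(
  taskFactories: Array<() => Promise<T>>,
  concurrency: number,
): Promise<T[]> {
  if (taskFactories.length === 0) return [];
  const results: T[] = new Array(taskFactories.length);
  let nextIndex = 0;

  const worker = async () => {
    while (nextIndex < taskFactories.length) {
      const current = nextIndex; // claim an index before awaiting
      nextIndex += 1;
      results[current] = await taskFactories[current]();
    }
  };

  const workerCount = Math.min(Math.max(concurrency, 1), taskFactories.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Example: square 1..4 with at most two tasks in flight;
// results stay in submission order regardless of completion order.
const squares = runLimited(
  [1, 2, 3, 4].map((n) => () => Promise.resolve(n * n)),
  2,
);
```

This is why the publish flow below can fire two platform uploads at once without scrambling the results array.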
const togglePlatform = (platform: string) => {
  if (selectedPlatforms.includes(platform)) {
    setSelectedPlatforms(selectedPlatforms.filter((p) => p !== platform));

@@ -243,7 +275,8 @@ export const usePublishController = () => {
setIsPublishing(true);
setPublishResults([]);
const tagList = tags.split(/[,,\s]+/).filter((t) => t.trim());
for (const platform of selectedPlatforms) {

const publishOnePlatform = async (platform: string): Promise<PublishResult> => {
  try {
    const { data: res } = await api.post<ApiResponse<any>>("/api/publish", {
      video_path: video.path, platform, title, tags: tagList, description: "",

@@ -251,19 +284,26 @@ export const usePublishController = () => {
    const result = unwrap(res);
    const screenshotUrl = typeof result.screenshot_url === "string"
      ? resolveMediaUrl(result.screenshot_url) || result.screenshot_url : undefined;
    setPublishResults((prev) => [...prev, {
    return {
      platform: result.platform || platform,
      success: Boolean(result.success),
      message: result.message || "",
      url: result.url,
      screenshot_url: screenshotUrl,
    }]);
    };
  } catch (error: any) {
    const message = error.response?.data?.message || String(error);
    setPublishResults((prev) => [...prev, { platform, success: false, message }]);
    return { platform, success: false, message };
  }
};

try {
  const taskFactories = selectedPlatforms.map((platform) => () => publishOnePlatform(platform));
  const results = await runWithConcurrency(taskFactories, 2);
  setPublishResults(results);
} finally {
  setIsPublishing(false);
}
setIsPublishing(false);
};

const handleLogin = async (platform: string) => {
@@ -4,9 +4,13 @@ import Link from "next/link";
import Image from "next/image";
import VideoPreviewModal from "@/components/VideoPreviewModal";
import AccountSettingsDropdown from "@/components/AccountSettingsDropdown";
import { SelectPopover } from "@/shared/ui/SelectPopover";
import { AppModal, AppModalHeader } from "@/shared/ui/AppModal";
import { usePublishController } from "@/features/publish/model/usePublishController";
import {
  ArrowLeft,
  Check,
  ChevronDown,
  RotateCcw,
  LogOut,
  QrCode,

@@ -18,6 +22,7 @@ import {
export function PublishPage() {
  const {
    accounts,
    videos,
    isAccountsLoading,
    isVideosLoading,
    selectedVideo,

@@ -47,6 +52,8 @@ export function PublishPage() {
    closeQrModal,
  } = usePublishController();

  const selectedVideoItem = videos.find((v) => v.id === selectedVideo) || null;

  return (
    <div className="min-h-dvh">
      <VideoPreviewModal

@@ -56,51 +63,69 @@ export function PublishPage() {
      />
      {/* QR code login modal */}
      {qrPlatform && (
        <div className="fixed inset-0 bg-black/80 flex items-center justify-center z-50">
        <div className="bg-white rounded-2xl p-8 max-w-md min-w-[320px]">
        <h2 className="text-2xl font-bold mb-4 text-center">🔐 扫码登录 {qrPlatform}</h2>
        <AppModal
          isOpen={Boolean(qrPlatform)}
          onClose={closeQrModal}
          panelClassName="w-full max-w-md rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden"
          closeOnOverlay={false}
        >
          <AppModalHeader
            title={`🔐 扫码登录 ${qrPlatform}`}
            subtitle="请使用手机扫码完成登录验证"
            icon={<QrCode className="h-5 w-5 text-purple-300" />}
            onClose={closeQrModal}
          />

          <div className="p-5 space-y-4">
            {isLoadingQR ? (
              <div className="flex flex-col items-center py-8">
                <div className="animate-spin w-16 h-16 border-4 border-purple-500 border-t-transparent rounded-full" />
                <p className="text-gray-600 mt-4">正在获取二维码...</p>
                <Loader2 className="h-14 w-14 animate-spin text-purple-400" />
                <p className="text-gray-300 mt-4">正在获取二维码...</p>
              </div>
            ) : faceVerifyQr ? (
              <>
              <Image
                src={`data:image/png;base64,${faceVerifyQr}`}
                alt="Face Verify QR"
                width={400}
                height={300}
                className="w-full h-auto rounded-lg"
                unoptimized
              />
              <p className="text-center text-orange-600 font-medium mt-4">
                需要身份验证,请用抖音APP扫描上方二维码完成刷脸验证
              <div className="space-y-3">
                <div className="mx-auto w-fit rounded-xl border border-white/10 bg-white p-2 shadow-[0_10px_30px_rgba(0,0,0,0.35)]">
                  <Image
                    src={`data:image/png;base64,${faceVerifyQr}`}
                    alt="Face Verify QR"
                    width={400}
                    height={300}
                    className="h-auto w-[min(82vw,400px)] border border-black/5"
                    unoptimized
                  />
                </div>
                <p className="text-center text-amber-300 text-sm font-medium">
                  需要身份验证,请用抖音 APP 扫描上方二维码完成刷脸验证
                </p>
              </>
              </div>
            ) : qrCodeImage ? (
              <>
              <Image
                src={`data:image/png;base64,${qrCodeImage}`}
                alt="QR Code"
                width={280}
                height={280}
                className="w-full h-auto"
                unoptimized
              />
              <p className="text-center text-gray-600 mt-4">
                请使用手机扫码登录
              </p>
              </>
            ) : null}
              <div className="space-y-3">
                <div className="mx-auto w-fit rounded-xl border border-white/10 bg-white p-3 shadow-[0_10px_30px_rgba(0,0,0,0.35)]">
                  <Image
                    src={`data:image/png;base64,${qrCodeImage}`}
                    alt="QR Code"
                    width={300}
                    height={300}
                    className="h-auto w-[min(74vw,300px)] border border-black/5"
                    unoptimized
                  />
                </div>
                <p className="text-center text-gray-300 text-sm">请使用手机扫码登录</p>
              </div>
            ) : (
              <div className="rounded-xl border border-red-500/30 bg-red-500/10 px-4 py-3 text-sm text-red-200">
                二维码获取失败,请重试
              </div>
            )}

            <button
              onClick={closeQrModal}
              className="w-full mt-4 px-4 py-2 bg-gray-200 rounded-lg hover:bg-gray-300"
              className="w-full px-4 py-2.5 bg-white/10 hover:bg-white/20 text-white rounded-lg transition-colors"
            >
              取消
            </button>
          </div>
        </div>
        </AppModal>
      )}

      {/* Header (unified style) */}

@@ -227,76 +252,112 @@ export function PublishPage() {
      {/* Video selection */}
      <div className="bg-white/5 rounded-2xl p-6 border border-white/10 backdrop-blur-sm">
        <h2 className="text-lg font-semibold text-white mb-4">八、选择发布作品</h2>

        <div className="flex items-center gap-3 mb-4">
          <Search className="text-gray-400 w-4 h-4" />
          <input
            type="text"
            value={videoFilter}
            onChange={(e) => setVideoFilter(e.target.value)}
            placeholder="搜索视频名称..."
            className="flex-1 bg-black/30 border border-white/10 rounded-lg px-3 py-2 text-sm text-white placeholder-gray-500 focus:outline-none focus:border-purple-500"
          />
        </div>

        {isVideosLoading ? (
          <div className="space-y-2">
            {Array.from({ length: 2 }).map((_, index) => (
              <div
                key={`video-skeleton-${index}`}
                className="p-3 rounded-lg border border-white/10 bg-white/5 animate-pulse"
              >
                <div className="h-4 w-40 bg-white/10 rounded" />
                <div className="h-3 w-24 bg-white/5 rounded mt-2" />
              </div>
            ))}
          </div>
        ) : filteredVideos.length === 0 ? (
          <div className="text-center py-8 text-gray-400">
            暂无可发布的视频
          </div>
        ) : (
          <div className="space-y-2 max-h-64 overflow-y-auto hide-scrollbar" style={{ contentVisibility: "auto" }}>
            {filteredVideos.map((v) => (
              <div
                key={v.id}
                onClick={() => setSelectedVideo(v.id)}
                className={`p-3 rounded-lg border transition-all flex items-center justify-between group cursor-pointer ${selectedVideo === v.id
                  ? "border-purple-500 bg-purple-500/20"
                  : "border-white/10 bg-white/5 hover:border-white/30"
                  }`}
              >
                <div className="flex flex-col">
                  <span className="text-sm text-white">{v.name}</span>
                </div>
                <div className="flex items-center gap-2 pl-2">
                  <button
                    onClick={(e) => {
                      e.stopPropagation();
                      handlePreviewVideo(v.id);
                    }}
                    onMouseEnter={() => {
                      const src = v.path.startsWith("/") ? v.path : `/${v.path}`;
                      const prefetch = document.createElement("link");
                      prefetch.rel = "preload";
                      prefetch.as = "video";
                      prefetch.href = src;
                      document.head.appendChild(prefetch);
                      setTimeout(() => prefetch.remove(), 2000);
                    }}
                    className="p-1 text-gray-500 hover:text-purple-400 transition-colors"
                    title="预览"
                  >
                    <Eye className="h-4 w-4" />
                  </button>
                  {selectedVideo === v.id && (
                    <span className="text-xs text-purple-300">已选</span>
                  )}
        <SelectPopover
          sheetTitle="选择发布作品"
          onOpen={() => setVideoFilter("")}
          trigger={({ open, toggle }) => (
            <button
              type="button"
              onClick={toggle}
              className="w-full rounded-xl border border-white/10 bg-black/25 px-3 py-2.5 text-left transition-colors hover:border-white/30"
            >
              <span className="flex items-center justify-between gap-3">
                <span className="min-w-0">
                  <span className="block text-xs text-gray-400">当前作品</span>
                  <span className="mt-0.5 block truncate text-sm text-white">
                    {selectedVideoItem?.name || (isVideosLoading ? "正在加载作品..." : "请选择发布作品")}
                  </span>
                </span>
                <ChevronDown className={`h-4 w-4 text-gray-300 transition-transform ${open ? "rotate-180" : ""}`} />
              </span>
            </button>
          )}
        >
          {({ close }) => (
            <div className="space-y-2">
              <div className="rounded-lg border border-white/10 bg-black/30 px-3 py-2">
                <div className="flex items-center gap-2">
                  <Search className="h-4 w-4 text-gray-400" />
                  <input
                    type="text"
                    value={videoFilter}
                    onChange={(e) => setVideoFilter(e.target.value)}
                    placeholder="搜索视频名称..."
                    className="w-full bg-transparent text-sm text-white placeholder-gray-500 outline-none"
                  />
                </div>
              </div>
            ))}
          </div>
        )}

              {isVideosLoading ? (
                <div className="space-y-2 p-1">
                  {Array.from({ length: 2 }).map((_, index) => (
                    <div
                      key={`video-skeleton-${index}`}
                      className="p-3 rounded-lg border border-white/10 bg-white/5 animate-pulse"
                    >
                      <div className="h-4 w-40 bg-white/10 rounded" />
                    </div>
                  ))}
                </div>
              ) : filteredVideos.length === 0 ? (
                <div className="py-8 text-center text-sm text-gray-400">
                  暂无可发布的视频
                </div>
              ) : (
                <div className="space-y-1 pb-1" style={{ contentVisibility: "auto" }}>
                  {filteredVideos.map((v) => {
                    const isSelected = selectedVideo === v.id;

                    return (
                      <div
                        key={v.id}
                        data-popover-selected={isSelected ? "true" : undefined}
                        className={`flex items-center gap-2 rounded-lg border px-3 py-2 transition-colors ${isSelected
                          ? "border-purple-500 bg-purple-500/20"
                          : "border-white/10 bg-white/5 hover:border-white/30"
                          }`}
                      >
                        <button
                          type="button"
                          onClick={() => {
                            setSelectedVideo(v.id);
                            close();
                          }}
                          className="min-w-0 flex-1 text-left"
                        >
                          <span className="block truncate text-sm text-white">{v.name}</span>
                        </button>

                        <button
                          type="button"
                          onClick={(e) => {
                            e.stopPropagation();
                            handlePreviewVideo(v.id);
                          }}
                          onMouseEnter={() => {
                            const src = v.path.startsWith("/") ? v.path : `/${v.path}`;
                            const prefetch = document.createElement("link");
                            prefetch.rel = "preload";
                            prefetch.as = "video";
                            prefetch.href = src;
                            document.head.appendChild(prefetch);
                            setTimeout(() => prefetch.remove(), 2000);
                          }}
                          className="p-1 text-gray-400 hover:text-purple-300"
                          title="预览"
                        >
                          <Eye className="h-4 w-4" />
                        </button>

                        {isSelected && <Check className="h-4 w-4 text-purple-300" />}
                      </div>
                    );
                  })}
                </div>
              )}
            </div>
          )}
        </SelectPopover>
      </div>

      {/* Publish details form */}
frontend/src/shared/ui/AppModal.tsx (new file, 132 lines)
@@ -0,0 +1,132 @@
"use client";

import { useEffect, useRef, type ReactNode } from "react";
import { createPortal } from "react-dom";
import { X } from "lucide-react";

interface AppModalProps {
  isOpen: boolean;
  onClose: () => void;
  children: ReactNode;
  zIndexClassName?: string;
  panelClassName?: string;
  closeOnOverlay?: boolean;
  lockBodyScroll?: boolean;
}

export function AppModal({
  isOpen,
  onClose,
  children,
  zIndexClassName = "z-[220]",
  panelClassName = "w-full max-w-2xl rounded-2xl border border-white/10 bg-[#171821]/95 shadow-[0_24px_80px_rgba(0,0,0,0.55)] overflow-hidden",
  closeOnOverlay = true,
  lockBodyScroll = true,
}: AppModalProps) {
  const containerRef = useRef<HTMLDivElement | null>(null);

  useEffect(() => {
    if (!isOpen) return;

    const handleEsc = (event: KeyboardEvent) => {
      if (event.key === "Escape") onClose();
    };

    const previousActiveElement = document.activeElement as HTMLElement | null;

    if (lockBodyScroll) {
      const openCount = Number(document.body.dataset.appModalOpenCount ?? "0");
      if (openCount === 0) {
        document.body.dataset.appModalPrevOverflow = document.body.style.overflow;
        document.body.style.overflow = "hidden";
      }
      document.body.dataset.appModalOpenCount = String(openCount + 1);
    }

    document.addEventListener("keydown", handleEsc);
    requestAnimationFrame(() => containerRef.current?.focus());

    return () => {
      document.removeEventListener("keydown", handleEsc);

      if (lockBodyScroll) {
        const openCount = Number(document.body.dataset.appModalOpenCount ?? "0");
        const nextCount = Math.max(0, openCount - 1);

        if (nextCount === 0) {
          document.body.style.overflow = document.body.dataset.appModalPrevOverflow ?? "";
          delete document.body.dataset.appModalPrevOverflow;
          delete document.body.dataset.appModalOpenCount;
        } else {
          document.body.dataset.appModalOpenCount = String(nextCount);
        }
      }

      previousActiveElement?.focus?.();
    };
  }, [isOpen, lockBodyScroll, onClose]);

  if (!isOpen || typeof document === "undefined") return null;

  return createPortal(
    <div
      ref={containerRef}
      role="dialog"
      aria-modal="true"
      tabIndex={-1}
      className={`fixed inset-0 ${zIndexClassName} flex items-center justify-center bg-black/80 backdrop-blur-sm p-4 animate-in fade-in duration-200`}
      onClick={closeOnOverlay ? onClose : undefined}
    >
      <div className={panelClassName} onClick={(event) => event.stopPropagation()}>
        {children}
      </div>
    </div>,
    document.body
  );
}

interface AppModalHeaderProps {
  title: ReactNode;
  subtitle?: ReactNode;
  icon?: ReactNode;
  onClose?: () => void;
  actions?: ReactNode;
}

export function AppModalHeader({
  title,
  subtitle,
  icon,
  onClose,
  actions,
}: AppModalHeaderProps) {
  return (
    <div className="flex items-center justify-between gap-3 border-b border-white/10 bg-gradient-to-r from-white/[0.08] via-white/[0.03] to-white/[0.08] px-4 py-3">
      <div className="min-w-0 flex items-center gap-3">
        {icon ? (
          <div className="h-9 w-9 rounded-lg bg-white/10 text-white flex items-center justify-center">
            {icon}
          </div>
        ) : null}
        <div className="min-w-0">
          <h3 className="truncate text-base font-semibold text-white">{title}</h3>
          {subtitle ? <p className="mt-0.5 text-xs text-gray-400">{subtitle}</p> : null}
        </div>
      </div>

      <div className="flex items-center gap-2">
        {actions}
        {onClose ? (
          <button
            type="button"
            onClick={onClose}
            aria-label="关闭弹窗"
            className="p-2 text-gray-400 hover:text-white hover:bg-white/10 rounded-lg transition-colors"
          >
            <X className="h-5 w-5" />
          </button>
        ) : null}
      </div>
    </div>
  );
}
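AppModal's body-scroll lock uses a counter stored on `document.body.dataset` so that nested modals do not fight over `overflow`: the first open saves the previous value and hides scrolling, and only the last close restores it. A DOM-free sketch of that counted lock (types and function names here are illustrative, not part of the component):

```typescript
// Counted scroll-lock sketch: acquire on modal open, release on close.
interface ScrollLockState {
  openCount: number;
  prevOverflow: string | null;
  overflow: string; // stands in for document.body.style.overflow
}

function acquireLock(state: ScrollLockState): void {
  if (state.openCount === 0) {
    state.prevOverflow = state.overflow; // remember what to restore
    state.overflow = "hidden";
  }
  state.openCount += 1;
}

function releaseLock(state: ScrollLockState): void {
  state.openCount = Math.max(0, state.openCount - 1);
  if (state.openCount === 0) {
    state.overflow = state.prevOverflow ?? "";
    state.prevOverflow = null;
  }
}

const body: ScrollLockState = { openCount: 0, prevOverflow: null, overflow: "auto" };
acquireLock(body); // outer modal opens
acquireLock(body); // nested modal opens
releaseLock(body); // nested modal closes: page must stay locked
const stillLocked = body.overflow;
releaseLock(body); // outer modal closes: original overflow restored
```

Without the counter, closing the nested modal would restore scrolling while the outer modal is still on screen.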
frontend/src/shared/ui/SelectPopover.tsx (new file, 233 lines)
@@ -0,0 +1,233 @@
"use client";

import { type ReactNode, useEffect, useRef, useState } from "react";
import { createPortal } from "react-dom";

interface SelectPopoverTriggerContext {
  open: boolean;
  isMobile: boolean;
  toggle: () => void;
  close: () => void;
}

interface SelectPopoverPanelContext {
  isMobile: boolean;
  close: () => void;
}

interface SelectPopoverProps {
  trigger: (ctx: SelectPopoverTriggerContext) => ReactNode;
  children: (ctx: SelectPopoverPanelContext) => ReactNode;
  sheetTitle?: string;
  disabled?: boolean;
  panelClassName?: string;
  onOpen?: () => void;
}

const MOBILE_QUERY = "(max-width: 639px)";

export function SelectPopover({
  trigger,
  children,
  sheetTitle,
  disabled = false,
  panelClassName = "",
  onOpen,
}: SelectPopoverProps) {
  type DesktopRect = {
    left: number;
    top: number;
    width: number;
    maxHeight: number;
    direction: "up" | "down";
  };

  const containerRef = useRef<HTMLDivElement | null>(null);
  const panelRef = useRef<HTMLDivElement | null>(null);
  const desktopScrollRef = useRef<HTMLDivElement | null>(null);
  const mobileScrollRef = useRef<HTMLDivElement | null>(null);
  const [open, setOpen] = useState(false);
  const [isMobile, setIsMobile] = useState(false);
  const [desktopRect, setDesktopRect] = useState<DesktopRect | null>(null);
  const isOpen = open && !disabled;

  const canUseDOM = typeof window !== "undefined" && typeof document !== "undefined";

  useEffect(() => {
    if (typeof window === "undefined") return;

    const mq = window.matchMedia(MOBILE_QUERY);
    const handleChange = () => setIsMobile(mq.matches);
    handleChange();

    if (mq.addEventListener) {
      mq.addEventListener("change", handleChange);
      return () => mq.removeEventListener("change", handleChange);
    }

    mq.addListener(handleChange);
    return () => mq.removeListener(handleChange);
  }, []);

  useEffect(() => {
    if (!isOpen || isMobile) return;

    const handlePointerDown = (event: MouseEvent) => {
      if (canUseDOM && document.querySelector("[data-video-preview-open='true']")) {
        return;
      }

      const target = event.target as Node;
      const clickedTrigger = containerRef.current?.contains(target) ?? false;
      const clickedPanel = panelRef.current?.contains(target) ?? false;
      if (!clickedTrigger && !clickedPanel) {
        setOpen(false);
      }
    };

    document.addEventListener("mousedown", handlePointerDown);
    return () => document.removeEventListener("mousedown", handlePointerDown);
  }, [isOpen, isMobile, canUseDOM]);

  useEffect(() => {
    if (!isOpen) return;

    const handleKeyDown = (event: KeyboardEvent) => {
      const previewOpen = canUseDOM && Boolean(document.querySelector("[data-video-preview-open='true']"));
      if (event.key === "Escape" && !previewOpen) {
        setOpen(false);
      }
    };

    document.addEventListener("keydown", handleKeyDown);
    return () => document.removeEventListener("keydown", handleKeyDown);
  }, [isOpen, canUseDOM]);

  useEffect(() => {
    if (isOpen) {
      onOpen?.();
    }
  }, [isOpen, onOpen]);

  useEffect(() => {
    if (!isOpen || !canUseDOM) return;

    let raf1 = 0;
    let raf2 = 0;
    const scrollSelectedIntoView = () => {
      const container = isMobile ? mobileScrollRef.current : desktopScrollRef.current;
      if (!container) return;

      const selectedEl = container.querySelector<HTMLElement>(
        "[data-popover-selected='true'], [aria-selected='true']",
      );
      selectedEl?.scrollIntoView({ block: "nearest", behavior: "auto" });
    };

    raf1 = window.requestAnimationFrame(() => {
      raf2 = window.requestAnimationFrame(scrollSelectedIntoView);
    });

    return () => {
      if (raf1) window.cancelAnimationFrame(raf1);
      if (raf2) window.cancelAnimationFrame(raf2);
    };
  }, [isOpen, isMobile, canUseDOM]);

  useEffect(() => {
    if (!isOpen || isMobile || !canUseDOM) return;

    const updateDesktopRect = () => {
      const triggerEl = containerRef.current;
      if (!triggerEl) return;

      const viewportPadding = 8;
      const gap = 8;
      const preferredMaxHeight = 352;
      const rect = triggerEl.getBoundingClientRect();
      const width = rect.width;
      const maxLeft = Math.max(viewportPadding, window.innerWidth - width - viewportPadding);
      const left = Math.min(Math.max(viewportPadding, rect.left), maxLeft);

      const spaceBelow = window.innerHeight - rect.bottom - gap - viewportPadding;
      const spaceAbove = rect.top - gap - viewportPadding;
      const openUp = spaceBelow < 220 && spaceAbove > spaceBelow;
      const direction: "up" | "down" = openUp ? "up" : "down";
      const chosenSpace = openUp ? spaceAbove : spaceBelow;
      const maxHeight = Math.max(120, Math.min(preferredMaxHeight, Math.floor(chosenSpace)));
      const top = openUp
        ? Math.max(viewportPadding, rect.top - gap)
        : Math.min(rect.bottom + gap, window.innerHeight - viewportPadding);

      setDesktopRect({ left, top, width, maxHeight, direction });
    };

    updateDesktopRect();
    window.addEventListener("resize", updateDesktopRect);
    window.addEventListener("scroll", updateDesktopRect, true);

    return () => {
      window.removeEventListener("resize", updateDesktopRect);
      window.removeEventListener("scroll", updateDesktopRect, true);
    };
  }, [isOpen, isMobile, canUseDOM]);

  const close = () => setOpen(false);
  const toggle = () => {
    if (disabled) return;
    setOpen((prev) => !prev);
  };

  const desktopPanel = canUseDOM && isOpen && !isMobile && desktopRect
    ? createPortal(
        <div
          ref={panelRef}
          className={`fixed z-[260] overflow-hidden rounded-2xl border border-white/20 bg-[#130f20]/95 backdrop-blur-md shadow-[0_20px_48px_rgba(8,10,20,0.5)] ${panelClassName}`}
          style={{
            left: desktopRect.left,
            top: desktopRect.top,
            width: desktopRect.width,
            transform: desktopRect.direction === "up" ? "translateY(-100%)" : undefined,
          }}
          role="dialog"
          aria-modal="false"
        >
          <div ref={desktopScrollRef} className="hide-scrollbar overflow-y-auto p-2" style={{ maxHeight: desktopRect.maxHeight }}>
            {children({ isMobile: false, close })}
          </div>
        </div>,
        document.body,
      )
    : null;

  const mobileSheet = canUseDOM && isOpen && isMobile
    ? createPortal(
        <div
          className="fixed inset-0 z-[220] bg-black/60"
          onMouseDown={close}
          role="dialog"
          aria-modal="true"
        >
          <div
            className="fixed inset-x-0 bottom-0 max-h-[78dvh] overflow-hidden rounded-t-3xl border-t border-white/20 bg-[#130f20]/95"
            onMouseDown={(e) => e.stopPropagation()}
          >
            <div className="mx-auto mt-2 h-1.5 w-12 rounded-full bg-white/20" />
            {sheetTitle && (
              <div className="px-5 pt-3 pb-2 text-sm font-medium text-gray-300">{sheetTitle}</div>
            )}
            <div ref={mobileScrollRef} className="hide-scrollbar max-h-[calc(78dvh-56px)] overflow-y-auto p-3">{children({ isMobile: true, close })}</div>
          </div>
        </div>,
        document.body,
      )
    : null;

  return (
    <div className="relative" ref={containerRef}>
      {trigger({ open: isOpen, isMobile, toggle, close })}
      {desktopPanel}
      {mobileSheet}
    </div>
  );
}
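The `updateDesktopRect` effect above decides whether the panel opens downward or flips upward: it opens down by default and flips up only when the space below the trigger drops under 220px and there is more room above, then clamps the panel height into the chosen side. That placement rule, extracted as a pure function for clarity (the name `placePanel` and its parameter list are illustrative, not the component's API):

```typescript
interface Placement {
  direction: "up" | "down";
  maxHeight: number;
}

// Open down by default; flip up only when below-space < 220px and
// above-space is larger. Clamp height between 120px and the preferred max.
function placePanel(
  triggerTop: number,
  triggerBottom: number,
  viewportHeight: number,
  preferredMaxHeight = 352,
  gap = 8,
  viewportPadding = 8,
): Placement {
  const spaceBelow = viewportHeight - triggerBottom - gap - viewportPadding;
  const spaceAbove = triggerTop - gap - viewportPadding;
  const openUp = spaceBelow < 220 && spaceAbove > spaceBelow;
  const chosen = openUp ? spaceAbove : spaceBelow;
  return {
    direction: openUp ? "up" : "down",
    maxHeight: Math.max(120, Math.min(preferredMaxHeight, Math.floor(chosen))),
  };
}
```

For a trigger near the bottom of an 800px viewport (top 700, bottom 740) the panel flips up; with the trigger near the top it stays down.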
@@ -65,7 +65,7 @@ def load_model():
    start = time.time()

    from cosyvoice.cli.cosyvoice import AutoModel
    _model = AutoModel(model_dir=str(MODEL_DIR))
    _model = AutoModel(model_dir=str(MODEL_DIR), fp16=True)

    _model_loaded = True
    print(f"✅ CosyVoice 3.0 model loaded in {time.time() - start:.1f}s")
@@ -174,6 +174,7 @@ async def generate(
    ref_text: str = Form(...),
    language: str = Form("Chinese"),
    speed: float = Form(1.0),
    instruct_text: str = Form(""),
):
    """
    Voice clone generation

@@ -236,16 +237,30 @@ async def generate(
    # prompt_text format for CosyVoice3
    prompt_text = f"You are a helpful assistant.<|endofprompt|>{ref_text}"

    use_instruct = bool(instruct_text.strip())
    if use_instruct:
        print(f"🎭 Instruct mode: {instruct_text[:60]}...")

    def _do_inference():
        """Run inference in the thread pool"""
        results = list(_model.inference_zero_shot(
            text,
            prompt_text,
            ref_audio_path,
            stream=False,
            speed=speed,
            text_frontend=True,
        ))
        if use_instruct:
            results = list(_model.inference_instruct2(
                text,
                instruct_text,
                ref_audio_path,
                stream=False,
                speed=speed,
                text_frontend=True,
            ))
        else:
            results = list(_model.inference_zero_shot(
                text,
                prompt_text,
                ref_audio_path,
                stream=False,
                speed=speed,
                text_frontend=True,
            ))
        if not results:
            raise RuntimeError("CosyVoice returned empty results")
@@ -253,21 +253,58 @@ class LipsyncPipeline(DiffusionPipeline):
        faces = []
        boxes = []
        affine_matrices = []
        valid_face_flags = []
        print(f"Affine transforming {len(video_frames)} faces...")
        for frame in tqdm.tqdm(video_frames):
            face, box, affine_matrix = self.image_processor.affine_transform(frame)
            faces.append(face)
            boxes.append(box)
            affine_matrices.append(affine_matrix)
            try:
                face, box, affine_matrix = self.image_processor.affine_transform(frame)
                faces.append(face)
                boxes.append(box)
                affine_matrices.append(affine_matrix)
                valid_face_flags.append(True)
            except Exception:
                faces.append(None)
                boxes.append(None)
                affine_matrices.append(None)
                valid_face_flags.append(False)

        valid_indices = [i for i, flag in enumerate(valid_face_flags) if flag]
        if not valid_indices:
            raise RuntimeError("Face not detected in any frame")

        for i in range(len(faces)):
            if faces[i] is not None:
                continue
            nearest_idx = min(valid_indices, key=lambda idx: abs(idx - i))
            faces[i] = faces[nearest_idx].clone()
            boxes[i] = boxes[nearest_idx]
            affine_matrices[i] = affine_matrices[nearest_idx]

        missing_count = len(valid_face_flags) - len(valid_indices)
        if missing_count > 0:
            print(
                f"Warning: face not detected in {missing_count}/{len(valid_face_flags)} frames. "
                "Those frames will keep original content."
            )

        faces = torch.stack(faces)
        return faces, boxes, affine_matrices
        return faces, boxes, affine_matrices, valid_face_flags

    def restore_video(self, faces: torch.Tensor, video_frames: np.ndarray, boxes: list, affine_matrices: list):
    def restore_video(
        self,
        faces: torch.Tensor,
        video_frames: np.ndarray,
        boxes: list,
        affine_matrices: list,
        valid_face_flags: Optional[list] = None,
    ):
        video_frames = video_frames[: len(faces)]
        out_frames = []
        print(f"Restoring {len(faces)} faces...")
        for index, face in enumerate(tqdm.tqdm(faces)):
            if valid_face_flags is not None and not valid_face_flags[index]:
                out_frames.append(video_frames[index])
                continue
            x1, y1, x2, y2 = boxes[index]
            height = int(y2 - y1)
            width = int(x2 - x1)
||||
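The fill step above borrows, for each frame where detection failed, the entry of the nearest successfully detected frame. The rule can be sketched in isolation (a generic helper, not the pipeline's API):

```python
def fill_from_nearest(values):
    """For entries where detection failed (None), borrow the nearest
    valid entry, mirroring the rule the patch applies to faces, boxes
    and affine matrices."""
    valid = [i for i, v in enumerate(values) if v is not None]
    if not valid:
        raise RuntimeError("no valid entries")
    return [
        v if v is not None else values[min(valid, key=lambda j: abs(j - i))]
        for i, v in enumerate(values)
    ]

print(fill_from_nearest([None, "f1", None, None, "f4"]))
# -> ['f1', 'f1', 'f1', 'f4', 'f4']
```

On a distance tie, `min` keeps the earlier index, so the fill is deterministic.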
@@ -281,33 +318,37 @@ class LipsyncPipeline(DiffusionPipeline):
     def loop_video(self, whisper_chunks: list, video_frames: np.ndarray):
         # If the audio is longer than the video, we need to loop the video
         if len(whisper_chunks) > len(video_frames):
-            faces, boxes, affine_matrices = self.affine_transform_video(video_frames)
+            faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames)
             num_loops = math.ceil(len(whisper_chunks) / len(video_frames))
             loop_video_frames = []
             loop_faces = []
             loop_boxes = []
             loop_affine_matrices = []
+            loop_valid_face_flags = []
             for i in range(num_loops):
                 if i % 2 == 0:
                     loop_video_frames.append(video_frames)
                     loop_faces.append(faces)
                     loop_boxes += boxes
                     loop_affine_matrices += affine_matrices
+                    loop_valid_face_flags += valid_face_flags
                 else:
                     loop_video_frames.append(video_frames[::-1])
                     loop_faces.append(faces.flip(0))
                     loop_boxes += boxes[::-1]
                     loop_affine_matrices += affine_matrices[::-1]
+                    loop_valid_face_flags += valid_face_flags[::-1]

             video_frames = np.concatenate(loop_video_frames, axis=0)[: len(whisper_chunks)]
             faces = torch.cat(loop_faces, dim=0)[: len(whisper_chunks)]
             boxes = loop_boxes[: len(whisper_chunks)]
             affine_matrices = loop_affine_matrices[: len(whisper_chunks)]
+            valid_face_flags = loop_valid_face_flags[: len(whisper_chunks)]
         else:
             video_frames = video_frames[: len(whisper_chunks)]
-            faces, boxes, affine_matrices = self.affine_transform_video(video_frames)
+            faces, boxes, affine_matrices, valid_face_flags = self.affine_transform_video(video_frames)

-        return video_frames, faces, boxes, affine_matrices
+        return video_frames, faces, boxes, affine_matrices, valid_face_flags
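`loop_video` extends the clip by appending it alternately forward and reversed, so the loop seam never jumps from the last frame back to the first. The ordering can be sketched on plain lists (the helper name is illustrative):

```python
import math

def ping_pong(frames, target_len):
    """Extend a frame sequence forward/backward alternately, as loop_video
    does, so consecutive frames at each loop seam stay adjacent."""
    out = []
    for i in range(math.ceil(target_len / len(frames))):
        out += frames if i % 2 == 0 else frames[::-1]
    return out[:target_len]

print(ping_pong([1, 2, 3], 8))  # [1, 2, 3, 3, 2, 1, 1, 2]
```

The patch applies the same alternation to the boxes, affine matrices, and the new `valid_face_flags` list so all per-frame metadata stays aligned with the looped frames.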
     @torch.no_grad()
     def __call__(
@@ -367,7 +408,7 @@ class LipsyncPipeline(DiffusionPipeline):
         audio_samples = read_audio(audio_path)
         video_frames = read_video(video_path, use_decord=False)

-        video_frames, faces, boxes, affine_matrices = self.loop_video(whisper_chunks, video_frames)
+        video_frames, faces, boxes, affine_matrices, valid_face_flags = self.loop_video(whisper_chunks, video_frames)

         synced_video_frames = []

@@ -457,7 +498,13 @@ class LipsyncPipeline(DiffusionPipeline):
             )
             synced_video_frames.append(decoded_latents)

-        synced_video_frames = self.restore_video(torch.cat(synced_video_frames), video_frames, boxes, affine_matrices)
+        synced_video_frames = self.restore_video(
+            torch.cat(synced_video_frames),
+            video_frames,
+            boxes,
+            affine_matrices,
+            valid_face_flags=valid_face_flags,
+        )

         audio_samples_remain_length = int(synced_video_frames.shape[0] / video_fps * audio_sample_rate)
         audio_samples = audio_samples[:audio_samples_remain_length].cpu().numpy()
@@ -473,5 +520,5 @@ class LipsyncPipeline(DiffusionPipeline):

         sf.write(os.path.join(temp_dir, "audio.wav"), audio_samples, audio_sample_rate)

-        command = f"ffmpeg -y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v libx264 -crf 18 -c:a aac -q:v 0 -q:a 0 {video_out_path}"
+        command = f"ffmpeg -y -loglevel error -nostdin -i {os.path.join(temp_dir, 'video.mp4')} -i {os.path.join(temp_dir, 'audio.wav')} -c:v copy -c:a aac -q:a 0 {video_out_path}"
         subprocess.run(command, shell=True)
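The last hunk replaces `-c:v libx264 -crf 18` with `-c:v copy`, so the final mux stream-copies the already-encoded video instead of re-encoding it a second time (faster, and no extra generation loss). A hedged sketch of the patched invocation built as an argv list, which also avoids `shell=True` string interpolation for paths with spaces (the helper is hypothetical, not part of the pipeline):

```python
import os

def mux_command(temp_dir, video_out_path):
    """Build the patched ffmpeg mux call as an argv list for
    subprocess.run(cmd) without shell=True."""
    return [
        "ffmpeg", "-y", "-loglevel", "error", "-nostdin",
        "-i", os.path.join(temp_dir, "video.mp4"),
        "-i", os.path.join(temp_dir, "audio.wav"),
        "-c:v", "copy",           # stream-copy video: no second lossy encode
        "-c:a", "aac", "-q:a", "0",
        video_out_path,
    ]

cmd = mux_command("/tmp/job", "/tmp/out.mp4")
print(cmd[0], cmd[-1])
```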
@@ -49,11 +49,22 @@ def read_video(video_path: str, change_fps=True, use_decord=True):
         if os.path.exists(temp_dir):
             shutil.rmtree(temp_dir)
         os.makedirs(temp_dir, exist_ok=True)
-        command = (
-            f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}"
-        )
-        subprocess.run(command, shell=True)
-        target_video_path = os.path.join(temp_dir, "video.mp4")
+
+        # Probe the input FPS; skip re-encoding when it is already 25 fps
+        cap = cv2.VideoCapture(video_path)
+        current_fps = cap.get(cv2.CAP_PROP_FPS)
+        cap.release()
+
+        if abs(current_fps - 25.0) < 0.5:
+            # Already 25 fps: use the original file and avoid a lossy re-encode
+            print(f"Video already at {current_fps:.1f}fps, skipping FPS conversion")
+            target_video_path = video_path
+        else:
+            command = (
+                f"ffmpeg -loglevel error -y -nostdin -i {video_path} -r 25 -crf 18 {os.path.join(temp_dir, 'video.mp4')}"
+            )
+            subprocess.run(command, shell=True)
+            target_video_path = os.path.join(temp_dir, "video.mp4")
     else:
         target_video_path = video_path
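The skip condition above tolerates small FPS drift (e.g. a container reporting 24.98 instead of 25.0) while still converting genuinely different frame rates. Isolated as a predicate (the function name is illustrative):

```python
def needs_fps_conversion(fps: float, target: float = 25.0, tol: float = 0.5) -> bool:
    """The patch's rule inverted: re-encode only when the probed FPS is
    at least `tol` away from the 25 fps the pipeline expects."""
    return abs(fps - target) >= tol

print(needs_fps_conversion(25.0))   # False: use the file as-is
print(needs_fps_conversion(24.8))   # False: within tolerance
print(needs_fps_conversion(29.97))  # True: re-encode to 25 fps
```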
models/MuseTalk/LICENSE (new file, 159 lines)
@@ -0,0 +1,159 @@
MIT License

Copyright (c) 2024 Tencent Music Entertainment Group

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


Other dependencies and licenses:


Open Source Software Licensed under the MIT License:
--------------------------------------------------------------------
1. sd-vae-ft-mse
Files: https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main
License: MIT license
For details: https://choosealicense.com/licenses/mit/

2. whisper
Files: https://github.com/openai/whisper
License: MIT license
Copyright (c) 2022 OpenAI
For details: https://github.com/openai/whisper/blob/main/LICENSE

3. face-parsing.PyTorch
Files: https://github.com/zllrunning/face-parsing.PyTorch
License: MIT License
Copyright (c) 2019 zll
For details: https://github.com/zllrunning/face-parsing.PyTorch/blob/master/LICENSE


Open Source Software Licensed under the Apache License Version 2.0:
--------------------------------------------------------------------
1. DWpose
Files: https://huggingface.co/yzd-v/DWPose/tree/main
License: Apache-2.0
For details: https://choosealicense.com/licenses/apache-2.0/


Terms of the Apache License Version 2.0:
--------------------------------------------------------------------
Apache License

Version 2.0, January 2004

http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

You must give any other recipients of the Work or Derivative Works a copy of this License; and

You must cause any modified files to carry prominent notices stating that You changed the files; and

You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS


Open Source Software Licensed under the BSD 3-Clause License:
--------------------------------------------------------------------
1. face-alignment
Files: https://github.com/1adrianb/face-alignment/tree/master
License: BSD 3-Clause License
Copyright (c) 2017, Adrian Bulat
All rights reserved.
For details: https://github.com/1adrianb/face-alignment/blob/master/LICENSE


Terms of the BSD 3-Clause License:
--------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Open Source Software:
--------------------------------------------------------------------
1. s3FD
Files: https://github.com/yxlijun/S3FD.pytorch
models/MuseTalk/README.md (new file, 556 lines)
@@ -0,0 +1,556 @@
# MuseTalk

> **ViGent2 integration notes**
>
> This directory is a deployment copy of MuseTalk v1.5, used as the long-video engine of the hybrid lip-sync scheme.
>
> - **Service**: `scripts/server.py` — resident FastAPI inference service (port 8011, GPU0)
> - **PM2**: `vigent2-musetalk` (launch script `run_musetalk.sh`)
> - **Routing**: audio >= 120s is routed to MuseTalk automatically; < 120s goes to LatentSync
> - **Deployment doc**: [`Docs/MUSETALK_DEPLOY.md`](../../Docs/MUSETALK_DEPLOY.md)
> - **Local changes**: `scripts/inference.py` — hardened FFmpeg invocation + CLI arguments; `musetalk/utils/audio_processor.py` — zero-pads when audio and video lengths mismatch

---
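The duration-based routing described in the integration notes can be sketched as a single predicate; the 120-second default is the code default in this deployment (configurable via `LIPSYNC_DURATION_THRESHOLD`), and the function name is illustrative:

```python
def pick_lipsync_engine(audio_seconds: float, threshold: float = 120.0) -> str:
    """Route long audio to MuseTalk (fast, resident service) and short
    clips to LatentSync (higher fidelity), per the rule above."""
    return "musetalk" if audio_seconds >= threshold else "latentsync"

print(pick_lipsync_engine(45))    # short clip -> LatentSync
print(pick_lipsync_engine(300))   # long video -> MuseTalk
```

Note the boundary is inclusive: exactly 120s goes to MuseTalk, matching ">=" in the routing rule.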
<strong>MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling</strong>

Yue Zhang<sup>\*</sup>,
Zhizhou Zhong<sup>\*</sup>,
Minhao Liu<sup>\*</sup>,
Zhaokang Chen,
Bin Wu<sup>†</sup>,
Yubin Zeng,
Chao Zhan,
Junxin Huang,
Yingjie He,
Wenjiang Zhou
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)

Lyra Lab, Tencent Music Entertainment

**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[space](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **[Technical report](https://arxiv.org/abs/2410.10122)**

We introduce `MuseTalk`, a **real-time high-quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual-human solution.

## 🔥 Updates
We're excited to unveil MuseTalk 1.5.
This version **(1)** integrates training with perceptual loss, GAN loss, and sync loss, significantly boosting overall performance, and **(2)** adopts a two-stage training strategy and a spatio-temporal data-sampling approach to balance visual quality against lip-sync accuracy.
Learn more details [here](https://arxiv.org/abs/2410.10122).
**The inference code, training code, and model weights of MuseTalk 1.5 are all available now!** 🚀

# Overview
`MuseTalk` is a real-time, high-quality, audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which

1. modifies an unseen face according to the input audio, with a face-region size of `256 x 256`;
1. supports audio in various languages, such as Chinese, English, and Japanese;
1. supports real-time inference at 30fps+ on an NVIDIA Tesla V100;
1. supports modifying the proposed center point of the face region, which **SIGNIFICANTLY** affects generation results;
1. provides a checkpoint trained on HDTF and a private dataset.

# News
- [04/05/2025] :mega: We are excited to announce that the training code is now open-sourced! You can now train your own MuseTalk model using our provided training scripts and configurations.
- [03/28/2025] We are thrilled to announce the release of our 1.5 version. This version is a significant improvement over the 1.0 version, with enhanced clarity, identity consistency, and precise lip-speech synchronization. We updated the [technical report](https://arxiv.org/abs/2410.10122) with more details.
- [10/18/2024] We release the [technical report](https://arxiv.org/abs/2410.10122v2). Our report details a model superior to the open-source L1-loss version. It includes GAN and perceptual losses for improved clarity, and sync loss for enhanced performance.
- [04/17/2024] We release a pipeline that utilizes MuseTalk for real-time inference.
- [04/16/2024] Release Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to the HF team for their community grant).
- [04/02/2024] Release MuseTalk project and pretrained models.
## Model

MuseTalk was trained in latent spaces, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings are fused with the image embeddings by cross-attention.

Note that although we use a very similar architecture to Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with a single step.

## Cases

<table>
<tr>
<td width="33%">

### Input Video
---
https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107

---
https://github.com/user-attachments/assets/1ce3e850-90ac-4a31-a45f-8dfa4f2960ac

---
https://github.com/user-attachments/assets/fa3b13a1-ae26-4d1d-899e-87435f8d22b3

---
https://github.com/user-attachments/assets/15800692-39d1-4f4c-99f2-aef044dc3251

---
https://github.com/user-attachments/assets/a843f9c9-136d-4ed4-9303-4a7269787a60

---
https://github.com/user-attachments/assets/6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb

</td>
<td width="33%">

### MuseTalk 1.0
---
https://github.com/user-attachments/assets/c04f3cd5-9f77-40e9-aafd-61978380d0ef

---
https://github.com/user-attachments/assets/2051a388-1cef-4c1d-b2a2-3c1ceee5dc99

---
https://github.com/user-attachments/assets/b5f56f71-5cdc-4e2e-a519-454242000d32

---
https://github.com/user-attachments/assets/a5843835-04ab-4c31-989f-0995cfc22f34

---
https://github.com/user-attachments/assets/3dc7f1d7-8747-4733-bbdd-97874af0c028

---
https://github.com/user-attachments/assets/3c78064e-faad-4637-83ae-28452a22b09a

</td>
<td width="33%">

### MuseTalk 1.5
---
https://github.com/user-attachments/assets/999a6f5b-61dd-48e1-b902-bb3f9cbc7247

---
https://github.com/user-attachments/assets/d26a5c9a-003c-489d-a043-c9a331456e75

---
https://github.com/user-attachments/assets/471290d7-b157-4cf6-8a6d-7e899afa302c

---
https://github.com/user-attachments/assets/1ee77c4c-8c70-4add-b6db-583a12faa7dc

---
https://github.com/user-attachments/assets/370510ea-624c-43b7-bbb0-ab5333e0fcc4

---
https://github.com/user-attachments/assets/b011ece9-a332-4bc1-b8b7-ef6e383d7bde

</td>
</tr>
</table>
# TODO:
|
||||
- [x] trained models and inference codes.
|
||||
- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
|
||||
- [x] codes for real-time inference.
|
||||
- [x] [technical report](https://arxiv.org/abs/2410.10122v2).
|
||||
- [x] a better model with updated [technical report](https://arxiv.org/abs/2410.10122).
|
||||
- [x] realtime inference code for 1.5 version.
|
||||
- [x] training and data preprocessing codes.
|
||||
- [ ] **always** welcome to submit issues and PRs to improve this repository! 😊
|
||||
|
||||
|
||||
# Getting Started
|
||||
We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:
|
||||
|
||||
## Third party integration
|
||||
Thanks for the third-party integration, which makes installation and use more convenient for everyone.
|
||||
We also hope you note that we have not verified, maintained, or updated third-party. Please refer to this project for specific results.
|
||||
|
||||
### [ComfyUI](https://github.com/chaojie/ComfyUI-MuseTalk)
|
||||
|
||||
## Installation
|
||||
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
|
||||
|
||||
### Build environment
|
||||
We recommend Python 3.10 and CUDA 11.7. Set up your environment as follows:
|
||||
|
||||
```shell
|
||||
conda create -n MuseTalk python==3.10
|
||||
conda activate MuseTalk
|
||||
```
|
||||
|
||||
### Install PyTorch 2.0.1
|
||||
Choose one of the following installation methods:
|
||||
|
||||
```shell
|
||||
# Option 1: Using pip
|
||||
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
|
||||
|
||||
# Option 2: Using conda
|
||||
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
|
||||
```
|
||||
|
||||
### Install Dependencies
|
||||
Install the remaining required packages:
|
||||
|
||||
```shell
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
### Install MMLab Packages
|
||||
Install the MMLab ecosystem packages:
|
||||
|
||||
```bash
|
||||
pip install --no-cache-dir -U openmim
|
||||
mim install mmengine
|
||||
mim install "mmcv==2.0.1"
|
||||
mim install "mmdet==3.1.0"
|
||||
mim install "mmpose==1.1.0"
|
||||
```
|
||||
|
||||
### Setup FFmpeg
|
||||
1. [Download](https://github.com/BtbN/FFmpeg-Builds/releases) the ffmpeg-static package
|
||||
|
||||
2. Configure FFmpeg based on your operating system:
|
||||
|
||||
For Linux:
|
||||
```bash
|
||||
export FFMPEG_PATH=/path/to/ffmpeg
|
||||
# Example:
|
||||
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
|
||||
```
|
||||
|
||||
For Windows:
|
||||
Add the `ffmpeg-xxx\bin` directory to your system's PATH environment variable. Verify the installation by running `ffmpeg -version` in the command prompt - it should display the ffmpeg version information.
|
||||
|
||||
### Download weights

You can download the weights in two ways:

#### Option 1: Using Download Scripts

We provide two scripts for automatic downloading:

For Linux:
```bash
sh ./download_weights.sh
```

For Windows:
```batch
:: Run the script
download_weights.bat
```

#### Option 2: Manual Download

You can also download the weights manually from the following links:

1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk/tree/main)
2. Download the weights of the other components:
   - [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main)
   - [whisper](https://huggingface.co/openai/whisper-tiny/tree/main)
   - [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main)
   - [syncnet](https://huggingface.co/ByteDance/LatentSync/tree/main)
   - [face-parse-bisent](https://drive.google.com/file/d/154JgKpzCPW82qINcVieuPH3fZ2e0P812/view?pli=1)
   - [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth)

Finally, these weights should be organized in `models` as follows:

```
./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── musetalkV15
│   ├── musetalk.json
│   └── unet.pth
├── syncnet
│   └── latentsync_syncnet.pt
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    ├── config.json
    ├── pytorch_model.bin
    └── preprocessor_config.json
```

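Before launching anything heavy, it can help to confirm the layout on disk. The sketch below simply mirrors the tree above as a list of relative paths and reports which files are absent; the helper name `missing_weights` is ours, not part of the repository:

```python
import os

# Relative paths mirroring the weights tree shown above.
REQUIRED_FILES = [
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "musetalkV15/musetalk.json",
    "musetalkV15/unet.pth",
    "syncnet/latentsync_syncnet.pt",
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
    "sd-vae/config.json",
    "sd-vae/diffusion_pytorch_model.bin",
    "whisper/config.json",
    "whisper/pytorch_model.bin",
    "whisper/preprocessor_config.json",
]

def missing_weights(models_dir="./models", required=REQUIRED_FILES):
    """Return the required weight files that are not present under models_dir."""
    return [rel for rel in required
            if not os.path.isfile(os.path.join(models_dir, rel))]

if __name__ == "__main__":
    gaps = missing_weights()
    if gaps:
        print("Missing weight files:")
        for rel in gaps:
            print(f"  - {rel}")
    else:
        print("All weight files found.")
```

If anything is listed as missing, re-run the download scripts or fetch the file from the corresponding link above.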
## Quickstart

### Inference

We provide inference scripts for both versions of MuseTalk.

#### Prerequisites

Before running inference, please ensure ffmpeg is installed and accessible:

```bash
# Check ffmpeg installation
ffmpeg -version
```

If ffmpeg is not found, please install it first:
- Windows: Download from [ffmpeg-static](https://github.com/BtbN/FFmpeg-Builds/releases) and add to PATH
- Linux: `sudo apt-get install ffmpeg`

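The same check can be done programmatically; the snippet below is a minimal sketch (the app itself performs an equivalent check at startup). Note that `shutil.which` only inspects `PATH`, so it will not see an FFmpeg configured solely via `FFMPEG_PATH`:

```python
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable on PATH runs successfully."""
    if shutil.which("ffmpeg") is None:
        return False
    try:
        subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
        return True
    except (OSError, subprocess.CalledProcessError):
        return False

print("ffmpeg found" if ffmpeg_available() else "ffmpeg NOT found - install it first")
```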
#### Normal Inference

##### Linux Environment

```bash
# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 normal

# MuseTalk 1.0
sh inference.sh v1.0 normal
```

##### Windows Environment

Please ensure that you set `ffmpeg_path` to match the actual location of your FFmpeg installation.

```bash
# MuseTalk 1.5 (Recommended)
python -m scripts.inference --inference_config configs\inference\test.yaml --result_dir results\test --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin

# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1
```

#### Real-time Inference

##### Linux Environment

```bash
# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 realtime

# MuseTalk 1.0
sh inference.sh v1.0 realtime
```

##### Windows Environment

```bash
# MuseTalk 1.5 (Recommended)
python -m scripts.realtime_inference --inference_config configs\inference\realtime.yaml --result_dir results\realtime --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --fps 25 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin

# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1
```

The configuration file `configs/inference/test.yaml` contains the inference settings, including:
- `video_path`: Path to the input video, image file, or directory of images
- `audio_path`: Path to the input audio file

Note: For optimal results, we recommend using input videos at 25fps, the same frame rate used during model training. If your video has a lower frame rate, you can use frame interpolation or convert it to 25fps using ffmpeg.

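For reference, the demo app's own conversion simply reuses, for every 25fps timestamp, the source frame covering that timestamp. A self-contained sketch of one such nearest-previous-frame policy (our illustration; the repository's exact index selection may differ slightly at frame boundaries):

```python
def resample_indices(n_frames, src_fps, target_fps=25.0):
    """Map each target-fps frame to the index of the source frame to reuse."""
    n_target = int(n_frames / src_fps * target_fps)
    out = []
    for t in range(1, n_target + 1):
        ts = t / target_fps                      # timestamp of target frame t
        idx = int(round(ts * src_fps)) - 1       # source frame covering ts
        out.append(min(max(idx, 0), n_frames - 1))
    return out

# 50fps -> 25fps keeps every other frame; 25fps -> 25fps is the identity.
print(resample_indices(50, 50.0)[:4])
```

Dropping frames this way halves a 50fps clip cleanly, but for low-frame-rate sources frame interpolation generally looks better than frame duplication.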
Important notes for real-time inference:
1. Set `preparation` to `True` when processing a new avatar.
2. After preparation, the avatar will generate videos using audio clips from `audio_clips`.
3. The generation process can achieve 30fps+ on an NVIDIA Tesla V100.
4. Set `preparation` to `False` when generating more videos with the same avatar.

For faster generation without saving intermediate images, you can use:
```bash
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images
```

## Gradio Demo

We provide an intuitive web interface through Gradio for users to easily adjust input parameters. To optimize inference time, users can generate only the **first frame** to fine-tune the best lip-sync parameters, which helps reduce facial artifacts in the final output.

For minimum hardware requirements, we tested the system on a Windows environment using an NVIDIA GeForce RTX 3050 Ti Laptop GPU with 4GB VRAM. In fp16 mode, generating an 8-second video takes approximately 5 minutes.

Both Linux and Windows users can launch the demo using the following command. Please ensure that the `ffmpeg_path` parameter matches your actual FFmpeg installation path:

```bash
# You can remove --use_float16 for better quality, but it will increase VRAM usage and inference time
python app.py --use_float16 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
```

## Training

### Data Preparation

To train MuseTalk, you need to prepare your dataset following these steps:

1. **Place your source videos**

   For example, if you're using the HDTF dataset, place all your video files in `./dataset/HDTF/source`.

2. **Run the preprocessing script**

   ```bash
   python -m scripts.preprocess --config ./configs/training/preprocess.yaml
   ```

   This script will:
   - Extract frames from videos
   - Detect and align faces
   - Generate audio features
   - Create the necessary data structure for training

### Training Process

After data preprocessing, you can start the training process:

1. **First Stage**

   ```bash
   sh train.sh stage1
   ```

2. **Second Stage**

   ```bash
   sh train.sh stage2
   ```

### Configuration Adjustment

Before starting the training, you should adjust the configuration files according to your hardware and requirements:

1. **GPU Configuration** (`configs/training/gpu.yaml`):
   - `gpu_ids`: Specify the GPU IDs you want to use (e.g., "0,1,2,3")
   - `num_processes`: Set this to match the number of GPUs you're using

2. **Stage 1 Configuration** (`configs/training/stage1.yaml`):
   - `data.train_bs`: Adjust batch size based on your GPU memory (default: 32)
   - `data.n_sample_frames`: Number of sampled frames per video (default: 1)

3. **Stage 2 Configuration** (`configs/training/stage2.yaml`):
   - `random_init_unet`: Must be set to `False` to use the model from stage 1
   - `data.train_bs`: Smaller batch size due to high GPU memory cost (default: 2)
   - `data.n_sample_frames`: Higher value for temporal consistency (default: 16)
   - `solver.gradient_accumulation_steps`: Increase to simulate larger batch sizes (default: 8)

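Because gradient accumulation and multi-GPU data parallelism both multiply the samples seen per optimizer step, the effective batch size is `train_bs × gradient_accumulation_steps × num_gpus`. A quick sanity check of the defaults above (the helper name is ours, not part of the repo):

```python
def effective_batch_size(train_bs: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return train_bs * grad_accum_steps * num_gpus

# Stage 1 defaults on 8 GPUs: 32 * 1 * 8
print(effective_batch_size(32, 1, 8))   # 256
# Stage 2 defaults on 8 GPUs: 2 * 8 * 8
print(effective_batch_size(2, 8, 8))    # 128
```

Keeping this product roughly constant is the usual way to trade `train_bs` for `gradient_accumulation_steps` when you have less GPU memory.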
### GPU Memory Requirements

Based on our testing on a machine with 8 NVIDIA H20 GPUs:

#### Stage 1 Memory Usage

| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation |
|:----------:|:---------------------:|:--------------:|:--------------:|
| 8 | 1 | ~32GB | |
| 16 | 1 | ~45GB | |
| 32 | 1 | ~74GB | ✓ |

#### Stage 2 Memory Usage

| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation |
|:----------:|:---------------------:|:--------------:|:--------------:|
| 1 | 8 | ~54GB | |
| 2 | 2 | ~80GB | |
| 2 | 8 | ~85GB | ✓ |

<details close>

## TestCases For 1.0
<table class="center">
  <tr style="font-weight: bolder;text-align:center;">
    <td width="33%">Image</td>
    <td width="33%">MuseV</td>
    <td width="33%">+MuseTalk</td>
  </tr>
  <tr>
    <td>
      <img src=assets/demo/musk/musk.png width="95%">
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/4a4bb2d1-9d14-4ca9-85c8-7f19c39f712e controls preload></video>
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/b2a879c2-e23a-4d39-911d-51f0343218e4 controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <img src=assets/demo/yongen/yongen.jpeg width="95%">
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/57ef9dee-a9fd-4dc8-839b-3fbbbf0ff3f4 controls preload></video>
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/94d8dcba-1bcd-4b54-9d1d-8b6fc53228f0 controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <img src=assets/demo/sit/sit.jpeg width="95%">
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/5fbab81b-d3f2-4c75-abb5-14c76e51769e controls preload></video>
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/f8100f4a-3df8-4151-8de2-291b09269f66 controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <img src=assets/demo/man/man.png width="95%">
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/a6e7d431-5643-4745-9868-8b423a454153 controls preload></video>
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/6ccf7bc7-cb48-42de-85bd-076d5ee8a623 controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <img src=assets/demo/monalisa/monalisa.png width="95%">
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/1568f604-a34f-4526-a13a-7d282aa2e773 controls preload></video>
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/a40784fc-a885-4c1f-9b7e-8f87b7caf4e0 controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <img src=assets/demo/sun1/sun.png width="95%">
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107 controls preload></video>
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/172f4ff1-d432-45bd-a5a7-a07dec33a26b controls preload></video>
    </td>
  </tr>
  <tr>
    <td>
      <img src=assets/demo/sun2/sun.png width="95%">
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107 controls preload></video>
    </td>
    <td>
      <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/85a6873d-a028-4cce-af2b-6c59a1f2971d controls preload></video>
    </td>
  </tr>
</table>

#### Use of bbox_shift to have adjustable results (for 1.0)

:mag_right: We have found that the upper bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half of the face) increase mouth openness, while negative values (moving towards the upper half) decrease it.

You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.

For example, in the case of `Xinying Sun`, running the default configuration shows that the adjustable value range is [-9, 9]. Then, to decrease the mouth openness, we set the value to `-7`:
```bash
python -m scripts.inference --inference_config configs/inference/test.yaml --bbox_shift -7
```
:pushpin: More technical details can be found in [bbox_shift](assets/BBOX_SHIFT.md).

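Conceptually, `bbox_shift` offsets the upper bound of the mask region by a number of pixels before cropping. The toy sketch below is purely our illustration of that idea (not the repository's implementation, which derives the bound from facial landmarks):

```python
def shift_upper_bound(bbox, bbox_shift, frame_height):
    """Shift the top edge of a face bbox (x1, y1, x2, y2) by bbox_shift pixels.

    Positive values move the bound towards the lower half of the face
    (more mouth openness); negative values move it up (less openness).
    The result is clamped to stay inside the frame.
    """
    x1, y1, x2, y2 = bbox
    new_y1 = min(max(y1 + bbox_shift, 0), frame_height - 1)
    return (x1, new_y1, x2, y2)

print(shift_upper_bound((100, 120, 200, 260), -7, 480))  # (100, 113, 200, 260)
```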
#### Combining MuseV and MuseTalk

As a complete virtual-human generation solution, we suggest first applying [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video, or pose-to-video) by referring to [its documentation](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase the frame rate. Then, you can use MuseTalk to generate a lip-sync video by referring to [the inference section](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).

# Acknowledgement
1. We thank open-source components such as [whisper](https://github.com/openai/whisper), [dwpose](https://github.com/IDEA-Research/DWPose), [face-alignment](https://github.com/1adrianb/face-alignment), [face-parsing](https://github.com/zllrunning/face-parsing.PyTorch), [S3FD](https://github.com/yxlijun/S3FD.pytorch) and [LatentSync](https://huggingface.co/ByteDance/LatentSync/tree/main).
1. MuseTalk has referred much to [diffusers](https://github.com/huggingface/diffusers) and [isaacOnline/whisper](https://github.com/isaacOnline/whisper/tree/extract-embeddings).
1. MuseTalk has been built on the [HDTF](https://github.com/MRzzm/HDTF) dataset.

Thanks for open-sourcing!

# Limitations
- Resolution: Though MuseTalk uses a face region size of 256 x 256, which makes it better than other open-source methods, it has not yet reached the theoretical resolution bound. We will continue to work on this problem. If you need higher resolution, you can apply super-resolution models such as [GFPGAN](https://github.com/TencentARC/GFPGAN) in combination with MuseTalk.

- Identity preservation: Some details of the original face, such as the mustache, lip shape, and lip color, are not well preserved.

- Jitter: Some jitter exists because the current pipeline adopts single-frame generation.

# Citation
```bib
@article{musetalk,
  title={MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling},
  author={Zhang, Yue and Zhong, Zhizhou and Liu, Minhao and Chen, Zhaokang and Wu, Bin and Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},
  journal={arxiv},
  year={2025}
}
```
# Disclaimer/License
1. `code`: The code of MuseTalk is released under the MIT License. There is no limitation for either academic or commercial usage.
1. `model`: The trained models are available for any purpose, even commercially.
1. `other open-source models`: Other open-source models used must comply with their own licenses, such as `whisper`, `ft-mse-vae`, `dwpose`, `S3FD`, etc.
1. The test data are collected from the internet and are available for non-commercial research purposes only.
1. `AIGC`: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.

---

# models/MuseTalk/app.py (new file, 570 lines)

```python
import os
import time
import pdb
import re
import sys
import subprocess
import argparse
import glob
import pickle
import copy
import shutil
from argparse import Namespace

import gradio as gr
import numpy as np
import cv2
import torch
import gdown
import imageio
import ffmpeg
import requests
from tqdm import tqdm
from omegaconf import OmegaConf
from moviepy.editor import *
from huggingface_hub import snapshot_download
from transformers import WhisperModel

ProjectDir = os.path.abspath(os.path.dirname(__file__))
CheckpointsDir = os.path.join(ProjectDir, "models")


@torch.no_grad()
def debug_inpainting(video_path, bbox_shift, extra_margin=10, parsing_mode="jaw",
                     left_cheek_width=90, right_cheek_width=90):
    """Debug inpainting parameters; only process the first frame."""
    # Set default parameters
    args_dict = {
        "result_dir": './results/debug',
        "fps": 25,
        "batch_size": 1,
        "output_vid_name": '',
        "use_saved_coord": False,
        "audio_padding_length_left": 2,
        "audio_padding_length_right": 2,
        "version": "v15",
        "extra_margin": extra_margin,
        "parsing_mode": parsing_mode,
        "left_cheek_width": left_cheek_width,
        "right_cheek_width": right_cheek_width
    }
    args = Namespace(**args_dict)

    # Create debug directory
    os.makedirs(args.result_dir, exist_ok=True)

    # Read first frame
    if get_file_type(video_path) == "video":
        reader = imageio.get_reader(video_path)
        first_frame = reader.get_data(0)
        reader.close()
    else:
        first_frame = cv2.imread(video_path)
        first_frame = cv2.cvtColor(first_frame, cv2.COLOR_BGR2RGB)

    # Save first frame
    debug_frame_path = os.path.join(args.result_dir, "debug_frame.png")
    cv2.imwrite(debug_frame_path, cv2.cvtColor(first_frame, cv2.COLOR_RGB2BGR))

    # Get face coordinates
    coord_list, frame_list = get_landmark_and_bbox([debug_frame_path], bbox_shift)
    bbox = coord_list[0]
    frame = frame_list[0]

    if bbox == coord_placeholder:
        return None, "No face detected, please adjust bbox_shift parameter"

    # Initialize face parser
    fp = FaceParsing(
        left_cheek_width=args.left_cheek_width,
        right_cheek_width=args.right_cheek_width
    )

    # Process first frame
    x1, y1, x2, y2 = bbox
    y2 = y2 + args.extra_margin
    y2 = min(y2, frame.shape[0])
    crop_frame = frame[y1:y2, x1:x2]
    crop_frame = cv2.resize(crop_frame, (256, 256), interpolation=cv2.INTER_LANCZOS4)

    # Generate random audio features
    random_audio = torch.randn(1, 50, 384, device=device, dtype=weight_dtype)
    audio_feature = pe(random_audio)

    # Get latents
    latents = vae.get_latents_for_unet(crop_frame)
    latents = latents.to(dtype=weight_dtype)

    # Generate prediction results
    pred_latents = unet.model(latents, timesteps, encoder_hidden_states=audio_feature).sample
    recon = vae.decode_latents(pred_latents)

    # Inpaint back to original image
    res_frame = recon[0]
    res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
    combine_frame = get_image(frame, res_frame, [x1, y1, x2, y2], mode=args.parsing_mode, fp=fp)

    # Save results (no need to convert color space again since get_image already returns RGB format)
    debug_result_path = os.path.join(args.result_dir, "debug_result.png")
    cv2.imwrite(debug_result_path, combine_frame)

    # Create information text
    info_text = (
        f"Parameter information:\n"
        f"bbox_shift: {bbox_shift}\n"
        f"extra_margin: {extra_margin}\n"
        f"parsing_mode: {parsing_mode}\n"
        f"left_cheek_width: {left_cheek_width}\n"
        f"right_cheek_width: {right_cheek_width}\n"
        f"Detected face coordinates: [{x1}, {y1}, {x2}, {y2}]"
    )

    return cv2.cvtColor(combine_frame, cv2.COLOR_RGB2BGR), info_text


def print_directory_contents(path):
    for child in os.listdir(path):
        child_path = os.path.join(path, child)
        if os.path.isdir(child_path):
            print(child_path)


def download_model():
    # Check whether the required model files exist
    # (keys renamed to be unique: the original dict repeated "MuseTalk",
    # silently dropping the unet.pth check)
    required_models = {
        "MuseTalk UNet": f"{CheckpointsDir}/musetalkV15/unet.pth",
        "MuseTalk Config": f"{CheckpointsDir}/musetalkV15/musetalk.json",
        "SD VAE": f"{CheckpointsDir}/sd-vae/config.json",
        "Whisper": f"{CheckpointsDir}/whisper/config.json",
        "DWPose": f"{CheckpointsDir}/dwpose/dw-ll_ucoco_384.pth",
        "SyncNet": f"{CheckpointsDir}/syncnet/latentsync_syncnet.pt",
        "Face Parse": f"{CheckpointsDir}/face-parse-bisent/79999_iter.pth",
        "ResNet": f"{CheckpointsDir}/face-parse-bisent/resnet18-5c106cde.pth"
    }

    missing_models = []
    for model_name, model_path in required_models.items():
        if not os.path.exists(model_path):
            missing_models.append(model_name)

    if missing_models:
        print("The following required model files are missing:")
        for model in missing_models:
            print(f"- {model}")
        print("\nPlease run the download script to download the missing models:")
        if sys.platform == "win32":
            print("Windows: Run download_weights.bat")
        else:
            print("Linux/Mac: Run ./download_weights.sh")
        sys.exit(1)
    else:
        print("All required model files exist.")


download_model()  # for huggingface deployment.

from musetalk.utils.blending import get_image
from musetalk.utils.face_parsing import FaceParsing
from musetalk.utils.audio_processor import AudioProcessor
from musetalk.utils.utils import get_file_type, get_video_fps, datagen, load_all_model
from musetalk.utils.preprocessing import get_landmark_and_bbox, read_imgs, coord_placeholder, get_bbox_range


def fast_check_ffmpeg():
    try:
        subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
        return True
    except Exception:
        return False


@torch.no_grad()
def inference(audio_path, video_path, bbox_shift, extra_margin=10, parsing_mode="jaw",
              left_cheek_width=90, right_cheek_width=90, progress=gr.Progress(track_tqdm=True)):
    # Set default parameters, aligned with inference.py
    args_dict = {
        "result_dir": './results/output',
        "fps": 25,
        "batch_size": 8,
        "output_vid_name": '',
        "use_saved_coord": False,
        "audio_padding_length_left": 2,
        "audio_padding_length_right": 2,
        "version": "v15",  # Fixed to the v15 version
        "extra_margin": extra_margin,
        "parsing_mode": parsing_mode,
        "left_cheek_width": left_cheek_width,
        "right_cheek_width": right_cheek_width
    }
    args = Namespace(**args_dict)

    # Check ffmpeg
    if not fast_check_ffmpeg():
        print("Warning: Unable to find ffmpeg, please ensure ffmpeg is properly installed")

    input_basename = os.path.basename(video_path).split('.')[0]
    audio_basename = os.path.basename(audio_path).split('.')[0]
    output_basename = f"{input_basename}_{audio_basename}"

    # Create temporary directory
    temp_dir = os.path.join(args.result_dir, f"{args.version}")
    os.makedirs(temp_dir, exist_ok=True)

    # Set result save paths
    result_img_save_path = os.path.join(temp_dir, output_basename)
    crop_coord_save_path = os.path.join(args.result_dir, "../", input_basename + ".pkl")
    os.makedirs(result_img_save_path, exist_ok=True)

    if args.output_vid_name == "":
        output_vid_name = os.path.join(temp_dir, output_basename + ".mp4")
    else:
        output_vid_name = os.path.join(temp_dir, args.output_vid_name)

    ################### extract frames from source video ###################
    if get_file_type(video_path) == "video":
        save_dir_full = os.path.join(temp_dir, input_basename)
        os.makedirs(save_dir_full, exist_ok=True)
        # Read video
        reader = imageio.get_reader(video_path)

        # Save images
        for i, im in enumerate(reader):
            imageio.imwrite(f"{save_dir_full}/{i:08d}.png", im)
        input_img_list = sorted(glob.glob(os.path.join(save_dir_full, '*.[jpJP][pnPN]*[gG]')))
        fps = get_video_fps(video_path)
    else:  # input image folder
        input_img_list = glob.glob(os.path.join(video_path, '*.[jpJP][pnPN]*[gG]'))
        input_img_list = sorted(input_img_list, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))
        fps = args.fps

    ################### extract audio features ###################
    whisper_input_features, librosa_length = audio_processor.get_audio_feature(audio_path)
    whisper_chunks = audio_processor.get_whisper_chunk(
        whisper_input_features,
        device,
        weight_dtype,
        whisper,
        librosa_length,
        fps=fps,
        audio_padding_length_left=args.audio_padding_length_left,
        audio_padding_length_right=args.audio_padding_length_right,
    )

    ################### preprocess input images ###################
    if os.path.exists(crop_coord_save_path) and args.use_saved_coord:
        print("using extracted coordinates")
        with open(crop_coord_save_path, 'rb') as f:
            coord_list = pickle.load(f)
        frame_list = read_imgs(input_img_list)
    else:
        print("extracting landmarks... time-consuming")
        coord_list, frame_list = get_landmark_and_bbox(input_img_list, bbox_shift)
        with open(crop_coord_save_path, 'wb') as f:
            pickle.dump(coord_list, f)
    bbox_shift_text = get_bbox_range(input_img_list, bbox_shift)

    # Initialize face parser
    fp = FaceParsing(
        left_cheek_width=args.left_cheek_width,
        right_cheek_width=args.right_cheek_width
    )

    i = 0
    input_latent_list = []
    for bbox, frame in zip(coord_list, frame_list):
        if bbox == coord_placeholder:
            continue
        x1, y1, x2, y2 = bbox
        y2 = y2 + args.extra_margin
        y2 = min(y2, frame.shape[0])
        crop_frame = frame[y1:y2, x1:x2]
        crop_frame = cv2.resize(crop_frame, (256, 256), interpolation=cv2.INTER_LANCZOS4)
        latents = vae.get_latents_for_unet(crop_frame)
        input_latent_list.append(latents)

    # to smooth the first and the last frame
    frame_list_cycle = frame_list + frame_list[::-1]
    coord_list_cycle = coord_list + coord_list[::-1]
    input_latent_list_cycle = input_latent_list + input_latent_list[::-1]

    ################### inference batch by batch ###################
    print("start inference")
    video_num = len(whisper_chunks)
    batch_size = args.batch_size
    gen = datagen(
        whisper_chunks=whisper_chunks,
        vae_encode_latents=input_latent_list_cycle,
        batch_size=batch_size,
        delay_frame=0,
        device=device,
    )
    res_frame_list = []
    for i, (whisper_batch, latent_batch) in enumerate(tqdm(gen, total=int(np.ceil(float(video_num) / batch_size)))):
        audio_feature_batch = pe(whisper_batch)
        # Ensure latent_batch is consistent with the model weight type
        latent_batch = latent_batch.to(dtype=weight_dtype)

        pred_latents = unet.model(latent_batch, timesteps, encoder_hidden_states=audio_feature_batch).sample
        recon = vae.decode_latents(pred_latents)
        for res_frame in recon:
            res_frame_list.append(res_frame)

    ################### pad back to the full image ###################
    print("pad talking image to original video")
    for i, res_frame in enumerate(tqdm(res_frame_list)):
        bbox = coord_list_cycle[i % (len(coord_list_cycle))]
        ori_frame = copy.deepcopy(frame_list_cycle[i % (len(frame_list_cycle))])
        x1, y1, x2, y2 = bbox
        y2 = y2 + args.extra_margin
        y2 = min(y2, frame.shape[0])
        try:
            res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
        except Exception:
            continue

        # Use v15 blending
        combine_frame = get_image(ori_frame, res_frame, [x1, y1, x2, y2], mode=args.parsing_mode, fp=fp)

        cv2.imwrite(f"{result_img_save_path}/{str(i).zfill(8)}.png", combine_frame)

    # Frame rate
    fps = 25
    # Output video path
    output_video = 'temp.mp4'

    # Read images
    def is_valid_image(file):
        pattern = re.compile(r'\d{8}\.png')
        return pattern.match(file)

    images = []
    files = [file for file in os.listdir(result_img_save_path) if is_valid_image(file)]
    files.sort(key=lambda x: int(x.split('.')[0]))

    for file in files:
        filename = os.path.join(result_img_save_path, file)
        images.append(imageio.imread(filename))

    # Save video
    imageio.mimwrite(output_video, images, 'FFMPEG', fps=fps, codec='libx264', pixelformat='yuv420p')

    input_video = './temp.mp4'
    # Check that input_video and audio_path exist
    if not os.path.exists(input_video):
        raise FileNotFoundError(f"Input video file not found: {input_video}")
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    # Read video
    reader = imageio.get_reader(input_video)
    fps = reader.get_meta_data()['fps']  # Get original video frame rate
    reader.close()  # Otherwise Windows 11 raises PermissionError: [WinError 32] on 'temp.mp4'
    # Store frames in a list
    frames = images

    print(len(frames))

    # Load the video
    video_clip = VideoFileClip(input_video)

    # Load the audio
    audio_clip = AudioFileClip(audio_path)

    # Set the audio on the video
    video_clip = video_clip.set_audio(audio_clip)

    # Write the output video
    video_clip.write_videofile(output_vid_name, codec='libx264', audio_codec='aac', fps=25)

    os.remove("temp.mp4")
    # shutil.rmtree(result_img_save_path)
    print(f"result is saved to {output_vid_name}")
    return output_vid_name, bbox_shift_text


# load model weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vae, unet, pe = load_all_model(
    unet_model_path="./models/musetalkV15/unet.pth",
    vae_type="sd-vae",
    unet_config="./models/musetalkV15/musetalk.json",
    device=device
)

# Parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--ffmpeg_path", type=str, default=r"ffmpeg-master-latest-win64-gpl-shared\bin", help="Path to ffmpeg executable")
parser.add_argument("--ip", type=str, default="127.0.0.1", help="IP address to bind to")
parser.add_argument("--port", type=int, default=7860, help="Port to bind to")
parser.add_argument("--share", action="store_true", help="Create a public link")
parser.add_argument("--use_float16", action="store_true", help="Use float16 for faster inference")
args = parser.parse_args()

# Set data type
if args.use_float16:
    # Convert models to half precision for better performance
    pe = pe.half()
    vae.vae = vae.vae.half()
    unet.model = unet.model.half()
    weight_dtype = torch.float16
else:
    weight_dtype = torch.float32

# Move models to the target device
pe = pe.to(device)
vae.vae = vae.vae.to(device)
unet.model = unet.model.to(device)

timesteps = torch.tensor([0], device=device)

# Initialize the audio processor and Whisper model
audio_processor = AudioProcessor(feature_extractor_path="./models/whisper")
whisper = WhisperModel.from_pretrained("./models/whisper")
whisper = whisper.to(device=device, dtype=weight_dtype).eval()
whisper.requires_grad_(False)


def check_video(video):
    if not isinstance(video, str):
        return video  # in case of None type
    # Define the output video file name
    dir_path, file_name = os.path.split(video)
    if file_name.startswith("outputxxx_"):
        return video
    # Add the output prefix to the file name
    output_file_name = "outputxxx_" + file_name

    os.makedirs('./results', exist_ok=True)
    os.makedirs('./results/output', exist_ok=True)
    os.makedirs('./results/input', exist_ok=True)

    # Combine the directory path and the new file name
    output_video = os.path.join('./results/input', output_file_name)

    # Read video
    reader = imageio.get_reader(video)
    fps = reader.get_meta_data()['fps']  # get fps from original video

    # Convert fps to 25
    frames = [im for im in reader]
    target_fps = 25

    L = len(frames)
    L_target = int(L / fps * target_fps)
    original_t = [x / fps for x in range(1, L + 1)]
    t_idx = 0
    target_frames = []
    for target_t in range(1, L_target + 1):
        while target_t / target_fps > original_t[t_idx]:
            t_idx += 1  # find the first t_idx so that target_t / target_fps <= original_t[t_idx]
            if t_idx >= L:
                break
        target_frames.append(frames[t_idx])

    # Save video
    imageio.mimwrite(output_video, target_frames, 'FFMPEG', fps=25, codec='libx264', quality=9, pixelformat='yuv420p')
    return output_video


css = """#input_img {max-width: 1024px !important} #output_vid {max-width: 1024px; max-height: 576px}"""

with gr.Blocks(css=css) as demo:
    gr.Markdown(
        """<div align='center'> <h1>MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling</h1> \
        <h2 style='font-weight: 450; font-size: 1rem; margin: 0rem'>\
        </br>\
        Yue Zhang <sup>*</sup>,\
```
|
||||
Zhizhou Zhong <sup>*</sup>,\
|
||||
Minhao Liu<sup>*</sup>,\
|
||||
Zhaokang Chen,\
|
||||
Bin Wu<sup>†</sup>,\
|
||||
Yubin Zeng,\
|
||||
Chao Zhang,\
|
||||
Yingjie He,\
|
||||
Junxin Huang,\
|
||||
Wenjiang Zhou <br>\
|
||||
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)\
|
||||
Lyra Lab, Tencent Music Entertainment\
|
||||
</h2> \
|
||||
<a style='font-size:18px;color: #000000' href='https://github.com/TMElyralab/MuseTalk'>[Github Repo]</a>\
|
||||
<a style='font-size:18px;color: #000000' href='https://github.com/TMElyralab/MuseTalk'>[Huggingface]</a>\
|
||||
<a style='font-size:18px;color: #000000' href='https://arxiv.org/abs/2410.10122'> [Technical report] </a>"""
|
||||
)
|
||||
|
||||
with gr.Row():
|
||||
with gr.Column():
|
||||
audio = gr.Audio(label="Drving Audio",type="filepath")
|
||||
video = gr.Video(label="Reference Video",sources=['upload'])
|
||||
bbox_shift = gr.Number(label="BBox_shift value, px", value=0)
|
||||
extra_margin = gr.Slider(label="Extra Margin", minimum=0, maximum=40, value=10, step=1)
|
||||
parsing_mode = gr.Radio(label="Parsing Mode", choices=["jaw", "raw"], value="jaw")
|
||||
left_cheek_width = gr.Slider(label="Left Cheek Width", minimum=20, maximum=160, value=90, step=5)
|
||||
right_cheek_width = gr.Slider(label="Right Cheek Width", minimum=20, maximum=160, value=90, step=5)
|
||||
bbox_shift_scale = gr.Textbox(label="'left_cheek_width' and 'right_cheek_width' parameters determine the range of left and right cheeks editing when parsing model is 'jaw'. The 'extra_margin' parameter determines the movement range of the jaw. Users can freely adjust these three parameters to obtain better inpainting results.")
|
||||
|
||||
with gr.Row():
|
||||
debug_btn = gr.Button("1. Test Inpainting ")
|
||||
btn = gr.Button("2. Generate")
|
||||
with gr.Column():
|
||||
debug_image = gr.Image(label="Test Inpainting Result (First Frame)")
|
||||
debug_info = gr.Textbox(label="Parameter Information", lines=5)
|
||||
out1 = gr.Video()
|
||||
|
||||
video.change(
|
||||
fn=check_video, inputs=[video], outputs=[video]
|
||||
)
|
||||
btn.click(
|
||||
fn=inference,
|
||||
inputs=[
|
||||
audio,
|
||||
video,
|
||||
bbox_shift,
|
||||
extra_margin,
|
||||
parsing_mode,
|
||||
left_cheek_width,
|
||||
right_cheek_width
|
||||
],
|
||||
outputs=[out1,bbox_shift_scale]
|
||||
)
|
||||
debug_btn.click(
|
||||
fn=debug_inpainting,
|
||||
inputs=[
|
||||
video,
|
||||
bbox_shift,
|
||||
extra_margin,
|
||||
parsing_mode,
|
||||
left_cheek_width,
|
||||
right_cheek_width
|
||||
],
|
||||
outputs=[debug_image, debug_info]
|
||||
)
|
||||
|
||||
# Check ffmpeg and add to PATH
|
||||
if not fast_check_ffmpeg():
|
||||
print(f"Adding ffmpeg to PATH: {args.ffmpeg_path}")
|
||||
# According to operating system, choose path separator
|
||||
path_separator = ';' if sys.platform == 'win32' else ':'
|
||||
os.environ["PATH"] = f"{args.ffmpeg_path}{path_separator}{os.environ['PATH']}"
|
||||
if not fast_check_ffmpeg():
|
||||
print("Warning: Unable to find ffmpeg, please ensure ffmpeg is properly installed")
|
||||
|
||||
# Solve asynchronous IO issues on Windows
|
||||
if sys.platform == 'win32':
|
||||
import asyncio
|
||||
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
|
||||
|
||||
# Start Gradio application
|
||||
demo.queue().launch(
|
||||
share=args.share,
|
||||
debug=True,
|
||||
server_name=args.ip,
|
||||
server_port=args.port
|
||||
)
|
||||
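The fps-conversion loop inside `check_video` above can be checked without any video IO. The sketch below isolates the frame-index selection as a standalone function (a simplified re-implementation for illustration, not the exact upstream code; the name `resample_indices` is ours):

```python
def resample_indices(n_frames: int, src_fps: float, target_fps: float = 25.0):
    """Map each 25 fps target timestamp to the first source frame at or after it.

    Mirrors the index-selection loop in check_video, returning indices
    instead of frames so the logic can be tested in isolation.
    """
    n_target = int(n_frames / src_fps * target_fps)
    original_t = [x / src_fps for x in range(1, n_frames + 1)]
    t_idx = 0
    indices = []
    for target_t in range(1, n_target + 1):
        # advance until target_t / target_fps <= original_t[t_idx]
        while t_idx < n_frames - 1 and target_t / target_fps > original_t[t_idx]:
            t_idx += 1
        indices.append(t_idx)
    return indices

# A 2-second clip at 50 fps maps onto 50 target frames,
# keeping every other source frame.
print(resample_indices(100, 50.0))
```

Note that indices are monotonically non-decreasing, so frames are only dropped or repeated, never reordered.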
10 models/MuseTalk/configs/inference/realtime.yaml Normal file
@@ -0,0 +1,10 @@
avator_1:
  preparation: True  # you can set it to False to reuse an existing avatar, which saves time
  bbox_shift: 5
  video_path: "data/video/yongen.mp4"
  audio_clips:
    audio_0: "data/audio/yongen.wav"
    audio_1: "data/audio/eng.wav"
10 models/MuseTalk/configs/inference/test.yaml Normal file
@@ -0,0 +1,10 @@
task_0:
  video_path: "data/video/yongen.mp4"
  audio_path: "data/audio/yongen.wav"

task_1:
  video_path: "data/video/yongen.mp4"
  audio_path: "data/audio/eng.wav"
  bbox_shift: -7
21 models/MuseTalk/configs/training/gpu.yaml Normal file
@@ -0,0 +1,21 @@
compute_environment: LOCAL_MACHINE
debug: True
deepspeed_config:
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: False
  zero_stage: 2

distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: "5, 7"  # modify this according to your GPU ids
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2  # should equal the number of GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
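The config above couples `gpu_ids` with `num_processes` (the comments note they must agree). A tiny hypothetical check of that constraint, assuming the comma-separated `gpu_ids` format shown (`gpu_count_matches` is not part of MuseTalk or accelerate):

```python
def gpu_count_matches(gpu_ids: str, num_processes: int) -> bool:
    """Return True when num_processes equals the number of GPUs listed."""
    ids = [part.strip() for part in gpu_ids.split(",") if part.strip()]
    return len(ids) == num_processes

print(gpu_count_matches("5, 7", 2))  # prints True: the values used in gpu.yaml
```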
31 models/MuseTalk/configs/training/preprocess.yaml Normal file
@@ -0,0 +1,31 @@
clip_len_second: 30  # the length of each video clip
video_root_raw: "./dataset/HDTF/source/"  # the path of the original videos
val_list_hdtf:
  - RD_Radio7_000
  - RD_Radio8_000
  - RD_Radio9_000
  - WDA_TinaSmith_000
  - WDA_TomCarper_000
  - WDA_TomPerez_000
  - WDA_TomUdall_000
  - WDA_VeronicaEscobar0_000
  - WDA_VeronicaEscobar1_000
  - WDA_WhipJimClyburn_000
  - WDA_XavierBecerra_000
  - WDA_XavierBecerra_001
  - WDA_XavierBecerra_002
  - WDA_ZoeLofgren_000
  - WRA_SteveScalise1_000
  - WRA_TimScott_000
  - WRA_ToddYoung_000
  - WRA_TomCotton_000
  - WRA_TomPrice_000
  - WRA_VickyHartzler_000

# the following directories will be generated automatically
video_root_25fps: "./dataset/HDTF/video_root_25fps/"
video_file_list: "./dataset/HDTF/video_file_list.txt"
video_audio_clip_root: "./dataset/HDTF/video_audio_clip_root/"
meta_root: "./dataset/HDTF/meta/"
video_clip_file_list_train: "./dataset/HDTF/train.txt"
video_clip_file_list_val: "./dataset/HDTF/val.txt"
89 models/MuseTalk/configs/training/stage1.yaml Normal file
@@ -0,0 +1,89 @@
exp_name: 'test'  # Name of the experiment
output_dir: './exp_out/stage1/'  # Directory to save experiment outputs
unet_sub_folder: musetalk  # Subfolder name for the UNet model
random_init_unet: True  # Whether to randomly initialize UNet (stage1) or use pretrained weights (stage2)
whisper_path: "./models/whisper"  # Path to the Whisper model
pretrained_model_name_or_path: "./models"  # Path to pretrained models
resume_from_checkpoint: True  # Whether to resume training from a checkpoint
padding_pixel_mouth: 10  # Number of pixels to pad around the mouth region
vae_type: "sd-vae"  # Type of VAE model to use
# Validation parameters
num_images_to_keep: 8  # Number of validation images to keep
ref_dropout_rate: 0  # Dropout rate for reference images
syncnet_config_path: "./configs/training/syncnet.yaml"  # Path to the SyncNet configuration
use_adapted_weight: False  # Whether to use adapted weights for loss calculation
cropping_jaw2edge_margin_mean: 10  # Mean margin for jaw-to-edge cropping
cropping_jaw2edge_margin_std: 10  # Standard deviation for jaw-to-edge cropping
crop_type: "crop_resize"  # Type of cropping method
random_margin_method: "normal"  # Method for random margin generation
num_backward_frames: 16  # Number of frames used in the SyncNet backward pass

data:
  dataset_key: "HDTF"  # Dataset to use for training
  train_bs: 32  # Training batch size (actual batch size is train_bs * n_sample_frames)
  image_size: 256  # Size of input images
  n_sample_frames: 1  # Number of frames to sample per batch
  num_workers: 8  # Number of data loading workers
  audio_padding_length_left: 2  # Left padding length for audio features
  audio_padding_length_right: 2  # Right padding length for audio features
  sample_method: pose_similarity_and_mouth_dissimilarity  # Method for sampling frames
  top_k_ratio: 0.51  # Ratio for top-k sampling
  contorl_face_min_size: True  # Whether to enforce a minimum face size
  min_face_size: 150  # Minimum face size in pixels

loss_params:
  l1_loss: 1.0  # Weight for L1 loss
  vgg_loss: 0.01  # Weight for VGG perceptual loss
  vgg_layer_weight: [1, 1, 1, 1, 1]  # Weights for different VGG layers
  pyramid_scale: [1, 0.5, 0.25, 0.125]  # Scales for the image pyramid
  gan_loss: 0  # Weight for GAN loss
  fm_loss: [1.0, 1.0, 1.0, 1.0]  # Weights for feature matching loss
  sync_loss: 0  # Weight for sync loss
  mouth_gan_loss: 0  # Weight for mouth-specific GAN loss

model_params:
  discriminator_params:
    scales: [1]  # Scales for the discriminator
    block_expansion: 32  # Expansion factor for discriminator blocks
    max_features: 512  # Maximum number of features in the discriminator
    num_blocks: 4  # Number of blocks in the discriminator
    sn: True  # Whether to use spectral normalization
    image_channel: 3  # Number of image channels
    estimate_jacobian: False  # Whether to estimate the Jacobian

discriminator_train_params:
  lr: 0.000005  # Learning rate for the discriminator
  eps: 0.00000001  # Epsilon for the optimizer
  weight_decay: 0.01  # Weight decay for the optimizer
  patch_size: 1  # Size of patches for the discriminator
  betas: [0.5, 0.999]  # Beta parameters for the Adam optimizer
  epochs: 10000  # Number of training epochs
  start_gan: 1000  # Step at which GAN training starts

solver:
  gradient_accumulation_steps: 1  # Number of steps for gradient accumulation
  uncond_steps: 10  # Number of unconditional steps
  mixed_precision: 'fp32'  # Precision mode for training
  enable_xformers_memory_efficient_attention: True  # Whether to use memory-efficient attention
  gradient_checkpointing: True  # Whether to use gradient checkpointing
  max_train_steps: 250000  # Maximum number of training steps
  max_grad_norm: 1.0  # Maximum gradient norm for clipping
  # Learning rate parameters
  learning_rate: 2.0e-5  # Base learning rate
  scale_lr: False  # Whether to scale the learning rate
  lr_warmup_steps: 1000  # Number of warmup steps for the learning rate
  lr_scheduler: "linear"  # Type of learning rate scheduler
  # Optimizer parameters
  use_8bit_adam: False  # Whether to use the 8-bit Adam optimizer
  adam_beta1: 0.5  # Beta1 parameter for the Adam optimizer
  adam_beta2: 0.999  # Beta2 parameter for the Adam optimizer
  adam_weight_decay: 1.0e-2  # Weight decay for the Adam optimizer
  adam_epsilon: 1.0e-8  # Epsilon for the Adam optimizer

total_limit: 10  # Maximum number of checkpoints to keep
save_model_epoch_interval: 250000  # Interval between model saves
checkpointing_steps: 10000  # Number of steps between checkpoints
val_freq: 2000  # Frequency of validation

seed: 41  # Random seed for reproducibility
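The `train_bs` comment in the config above notes that the actual batch size is `train_bs * n_sample_frames`, and gradient accumulation multiplies it again at the optimizer step. A hypothetical helper (not part of MuseTalk) making that arithmetic explicit:

```python
def effective_batch(train_bs: int, n_sample_frames: int, grad_accum: int) -> int:
    """Frames contributing to one optimizer step, per process.

    Illustrative arithmetic from the config comments; not MuseTalk code.
    """
    return train_bs * n_sample_frames * grad_accum

print(effective_batch(32, 1, 1))  # prints 32: stage1 defaults
```

For comparison, the stage2 defaults (`train_bs: 2`, `n_sample_frames: 16`, `gradient_accumulation_steps: 8`) give 256 frames per optimizer step per process.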
89 models/MuseTalk/configs/training/stage2.yaml Normal file
@@ -0,0 +1,89 @@
exp_name: 'test'  # Name of the experiment
output_dir: './exp_out/stage2/'  # Directory to save experiment outputs
unet_sub_folder: musetalk  # Subfolder name for the UNet model
random_init_unet: False  # Whether to randomly initialize UNet (stage1) or use pretrained weights (stage2)
whisper_path: "./models/whisper"  # Path to the Whisper model
pretrained_model_name_or_path: "./models"  # Path to pretrained models
resume_from_checkpoint: True  # Whether to resume training from a checkpoint
padding_pixel_mouth: 10  # Number of pixels to pad around the mouth region
vae_type: "sd-vae"  # Type of VAE model to use
# Validation parameters
num_images_to_keep: 8  # Number of validation images to keep
ref_dropout_rate: 0  # Dropout rate for reference images
syncnet_config_path: "./configs/training/syncnet.yaml"  # Path to the SyncNet configuration
use_adapted_weight: False  # Whether to use adapted weights for loss calculation
cropping_jaw2edge_margin_mean: 10  # Mean margin for jaw-to-edge cropping
cropping_jaw2edge_margin_std: 10  # Standard deviation for jaw-to-edge cropping
crop_type: "dynamic_margin_crop_resize"  # Type of cropping method
random_margin_method: "normal"  # Method for random margin generation
num_backward_frames: 16  # Number of frames used in the SyncNet backward pass

data:
  dataset_key: "HDTF"  # Dataset to use for training
  train_bs: 2  # Training batch size (actual batch size is train_bs * n_sample_frames)
  image_size: 256  # Size of input images
  n_sample_frames: 16  # Number of frames to sample per batch
  num_workers: 8  # Number of data loading workers
  audio_padding_length_left: 2  # Left padding length for audio features
  audio_padding_length_right: 2  # Right padding length for audio features
  sample_method: pose_similarity_and_mouth_dissimilarity  # Method for sampling frames
  top_k_ratio: 0.51  # Ratio for top-k sampling
  contorl_face_min_size: True  # Whether to enforce a minimum face size
  min_face_size: 200  # Minimum face size in pixels

loss_params:
  l1_loss: 1.0  # Weight for L1 loss
  vgg_loss: 0.01  # Weight for VGG perceptual loss
  vgg_layer_weight: [1, 1, 1, 1, 1]  # Weights for different VGG layers
  pyramid_scale: [1, 0.5, 0.25, 0.125]  # Scales for the image pyramid
  gan_loss: 0.01  # Weight for GAN loss
  fm_loss: [1.0, 1.0, 1.0, 1.0]  # Weights for feature matching loss
  sync_loss: 0.05  # Weight for sync loss
  mouth_gan_loss: 0.01  # Weight for mouth-specific GAN loss

model_params:
  discriminator_params:
    scales: [1]  # Scales for the discriminator
    block_expansion: 32  # Expansion factor for discriminator blocks
    max_features: 512  # Maximum number of features in the discriminator
    num_blocks: 4  # Number of blocks in the discriminator
    sn: True  # Whether to use spectral normalization
    image_channel: 3  # Number of image channels
    estimate_jacobian: False  # Whether to estimate the Jacobian

discriminator_train_params:
  lr: 0.000005  # Learning rate for the discriminator
  eps: 0.00000001  # Epsilon for the optimizer
  weight_decay: 0.01  # Weight decay for the optimizer
  patch_size: 1  # Size of patches for the discriminator
  betas: [0.5, 0.999]  # Beta parameters for the Adam optimizer
  epochs: 10000  # Number of training epochs
  start_gan: 1000  # Step at which GAN training starts

solver:
  gradient_accumulation_steps: 8  # Number of steps for gradient accumulation
  uncond_steps: 10  # Number of unconditional steps
  mixed_precision: 'fp32'  # Precision mode for training
  enable_xformers_memory_efficient_attention: True  # Whether to use memory-efficient attention
  gradient_checkpointing: True  # Whether to use gradient checkpointing
  max_train_steps: 250000  # Maximum number of training steps
  max_grad_norm: 1.0  # Maximum gradient norm for clipping
  # Learning rate parameters
  learning_rate: 5.0e-6  # Base learning rate
  scale_lr: False  # Whether to scale the learning rate
  lr_warmup_steps: 1000  # Number of warmup steps for the learning rate
  lr_scheduler: "linear"  # Type of learning rate scheduler
  # Optimizer parameters
  use_8bit_adam: False  # Whether to use the 8-bit Adam optimizer
  adam_beta1: 0.5  # Beta1 parameter for the Adam optimizer
  adam_beta2: 0.999  # Beta2 parameter for the Adam optimizer
  adam_weight_decay: 1.0e-2  # Weight decay for the Adam optimizer
  adam_epsilon: 1.0e-8  # Epsilon for the Adam optimizer

total_limit: 10  # Maximum number of checkpoints to keep
save_model_epoch_interval: 250000  # Interval between model saves
checkpointing_steps: 2000  # Number of steps between checkpoints
val_freq: 2000  # Frequency of validation

seed: 41  # Random seed for reproducibility
19 models/MuseTalk/configs/training/syncnet.yaml Normal file
@@ -0,0 +1,19 @@
# This file is modified from LatentSync (https://github.com/bytedance/LatentSync/blob/main/latentsync/configs/training/syncnet_16_pixel.yaml).
model:
  audio_encoder:  # input (1, 80, 52)
    in_channels: 1
    block_out_channels: [32, 64, 128, 256, 512, 1024, 2048]
    downsample_factors: [[2, 1], 2, 2, 1, 2, 2, [2, 3]]
    attn_blocks: [0, 0, 0, 0, 0, 0, 0]
    dropout: 0.0
  visual_encoder:  # input (48, 128, 256)
    in_channels: 48
    block_out_channels: [64, 128, 256, 256, 512, 1024, 2048, 2048]
    downsample_factors: [[1, 2], 2, 2, 2, 2, 2, 2, 2]
    attn_blocks: [0, 0, 0, 0, 0, 0, 0, 0]
    dropout: 0.0

ckpt:
  resume_ckpt_path: ""
  inference_ckpt_path: ./models/syncnet/latentsync_syncnet.pt  # this pretrained model is from LatentSync (https://huggingface.co/ByteDance/LatentSync/tree/main)
  save_ckpt_steps: 2500
41 models/MuseTalk/download_weights.bat Normal file
@@ -0,0 +1,41 @@
@echo off
setlocal

:: Set the checkpoints directory
set CheckpointsDir=models

:: Create necessary directories
mkdir %CheckpointsDir%\musetalk
mkdir %CheckpointsDir%\musetalkV15
mkdir %CheckpointsDir%\syncnet
mkdir %CheckpointsDir%\dwpose
mkdir %CheckpointsDir%\face-parse-bisent
mkdir %CheckpointsDir%\sd-vae-ft-mse
mkdir %CheckpointsDir%\whisper

:: Install required packages
pip install -U "huggingface_hub[hf_xet]"

:: Set the HuggingFace endpoint
set HF_ENDPOINT=https://hf-mirror.com

:: Download MuseTalk weights
hf download TMElyralab/MuseTalk --local-dir %CheckpointsDir%

:: Download SD VAE weights
hf download stabilityai/sd-vae-ft-mse --local-dir %CheckpointsDir%\sd-vae --include "config.json" "diffusion_pytorch_model.bin"

:: Download Whisper weights
hf download openai/whisper-tiny --local-dir %CheckpointsDir%\whisper --include "config.json" "pytorch_model.bin" "preprocessor_config.json"

:: Download DWPose weights
hf download yzd-v/DWPose --local-dir %CheckpointsDir%\dwpose --include "dw-ll_ucoco_384.pth"

:: Download SyncNet weights
hf download ByteDance/LatentSync --local-dir %CheckpointsDir%\syncnet --include "latentsync_syncnet.pt"

:: Download face-parse-bisent weights
hf download ManyOtherFunctions/face-parse-bisent --local-dir %CheckpointsDir%\face-parse-bisent --include "79999_iter.pth" "resnet18-5c106cde.pth"

echo All weights have been downloaded successfully!
endlocal
51 models/MuseTalk/download_weights.sh Normal file
@@ -0,0 +1,51 @@
#!/bin/bash

# Set the checkpoints directory
CheckpointsDir="models"

# Create necessary directories
mkdir -p models/musetalk models/musetalkV15 models/syncnet models/dwpose models/face-parse-bisent models/sd-vae models/whisper

# Install required packages
pip install -U "huggingface_hub[cli]"
pip install gdown

# Set the HuggingFace mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com

# Download MuseTalk V1.0 weights
huggingface-cli download TMElyralab/MuseTalk \
    --local-dir $CheckpointsDir \
    --include "musetalk/musetalk.json" "musetalk/pytorch_model.bin"

# Download MuseTalk V1.5 weights (unet.pth)
huggingface-cli download TMElyralab/MuseTalk \
    --local-dir $CheckpointsDir \
    --include "musetalkV15/musetalk.json" "musetalkV15/unet.pth"

# Download SD VAE weights
huggingface-cli download stabilityai/sd-vae-ft-mse \
    --local-dir $CheckpointsDir/sd-vae \
    --include "config.json" "diffusion_pytorch_model.bin"

# Download Whisper weights
huggingface-cli download openai/whisper-tiny \
    --local-dir $CheckpointsDir/whisper \
    --include "config.json" "pytorch_model.bin" "preprocessor_config.json"

# Download DWPose weights
huggingface-cli download yzd-v/DWPose \
    --local-dir $CheckpointsDir/dwpose \
    --include "dw-ll_ucoco_384.pth"

# Download SyncNet weights
huggingface-cli download ByteDance/LatentSync \
    --local-dir $CheckpointsDir/syncnet \
    --include "latentsync_syncnet.pt"

# Download Face Parse Bisent weights
gdown --id 154JgKpzCPW82qINcVieuPH3fZ2e0P812 -O $CheckpointsDir/face-parse-bisent/79999_iter.pth
curl -L https://download.pytorch.org/models/resnet18-5c106cde.pth \
    -o $CheckpointsDir/face-parse-bisent/resnet18-5c106cde.pth

echo "✅ All weights have been downloaded successfully!"
9 models/MuseTalk/entrypoint.sh Normal file
@@ -0,0 +1,9 @@
#!/bin/bash

echo "entrypoint.sh"
whoami
which python
source /opt/conda/etc/profile.d/conda.sh
conda activate musev
which python
python app.py
72 models/MuseTalk/inference.sh Normal file
@@ -0,0 +1,72 @@
#!/bin/bash

# This script runs inference based on the version and mode specified by the user.
# Usage:
#   To run v1.0 inference: sh inference.sh v1.0 [normal|realtime]
#   To run v1.5 inference: sh inference.sh v1.5 [normal|realtime]

# Check if the correct number of arguments is provided
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <version> <mode>"
    echo "Example: $0 v1.0 normal or $0 v1.5 realtime"
    exit 1
fi

# Get the version and mode from the user input
version=$1
mode=$2

# Validate mode
if [ "$mode" != "normal" ] && [ "$mode" != "realtime" ]; then
    echo "Invalid mode specified. Please use 'normal' or 'realtime'."
    exit 1
fi

# Set the config path based on mode
if [ "$mode" = "normal" ]; then
    config_path="./configs/inference/test.yaml"
    result_dir="./results/test"
else
    config_path="./configs/inference/realtime.yaml"
    result_dir="./results/realtime"
fi

# Define the model paths based on the version
if [ "$version" = "v1.0" ]; then
    model_dir="./models/musetalk"
    unet_model_path="$model_dir/pytorch_model.bin"
    unet_config="$model_dir/musetalk.json"
    version_arg="v1"
elif [ "$version" = "v1.5" ]; then
    model_dir="./models/musetalkV15"
    unet_model_path="$model_dir/unet.pth"
    unet_config="$model_dir/musetalk.json"
    version_arg="v15"
else
    echo "Invalid version specified. Please use v1.0 or v1.5."
    exit 1
fi

# Set the script name based on mode
if [ "$mode" = "normal" ]; then
    script_name="scripts.inference"
else
    script_name="scripts.realtime_inference"
fi

# Base command arguments
cmd_args="--inference_config $config_path \
    --result_dir $result_dir \
    --unet_model_path $unet_model_path \
    --unet_config $unet_config \
    --version $version_arg"

# Add realtime-specific arguments if in realtime mode
# (--version is already part of the base arguments, so only --fps is added here)
if [ "$mode" = "realtime" ]; then
    cmd_args="$cmd_args \
        --fps 25"
fi

# Run inference
python3 -m $script_name $cmd_args
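The branching in inference.sh maps a `(version, mode)` pair onto a module, config, and model paths. A pure-Python sketch of that resolution logic (for illustration only; the shell script remains the actual entry point, and `resolve_run` is our name):

```python
def resolve_run(version: str, mode: str):
    """Mirror inference.sh's argument resolution as a pure function.

    Returns (module, config_path, unet_model_path, unet_config, version_arg).
    """
    if mode not in ("normal", "realtime"):
        raise ValueError("mode must be 'normal' or 'realtime'")
    models = {
        "v1.0": ("./models/musetalk/pytorch_model.bin", "./models/musetalk/musetalk.json", "v1"),
        "v1.5": ("./models/musetalkV15/unet.pth", "./models/musetalkV15/musetalk.json", "v15"),
    }
    if version not in models:
        raise ValueError("version must be 'v1.0' or 'v1.5'")
    unet_model_path, unet_config, version_arg = models[version]
    module = "scripts.inference" if mode == "normal" else "scripts.realtime_inference"
    config = "./configs/inference/test.yaml" if mode == "normal" else "./configs/inference/realtime.yaml"
    return module, config, unet_model_path, unet_config, version_arg

print(resolve_run("v1.5", "realtime")[0])  # prints scripts.realtime_inference
```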
168
models/MuseTalk/musetalk/data/audio.py
Normal file
168
models/MuseTalk/musetalk/data/audio.py
Normal file
@@ -0,0 +1,168 @@
|
||||
import librosa
|
||||
import librosa.filters
|
||||
import numpy as np
|
||||
from scipy import signal
|
||||
from scipy.io import wavfile
|
||||
|
||||
class HParams:
|
||||
# copy from wav2lip
|
||||
def __init__(self):
|
||||
self.n_fft = 800
|
||||
self.hop_size = 200
|
||||
self.win_size = 800
|
||||
self.sample_rate = 16000
|
||||
self.frame_shift_ms = None
|
||||
self.signal_normalization = True
|
||||
|
||||
self.allow_clipping_in_normalization = True
|
||||
self.symmetric_mels = True
|
||||
self.max_abs_value = 4.0
|
||||
self.preemphasize = True
|
||||
self.preemphasis = 0.97
|
||||
self.min_level_db = -100
|
||||
self.ref_level_db = 20
|
||||
self.fmin = 55
|
||||
self.fmax=7600
|
||||
|
||||
self.use_lws=False
|
||||
self.num_mels=80 # Number of mel-spectrogram channels and local conditioning dimensionality
|
||||
self.rescale=True # Whether to rescale audio prior to preprocessing
|
||||
self.rescaling_max=0.9 # Rescaling value
|
||||
self.use_lws=False
|
||||
|
||||
|
||||
hp = HParams()
|
||||
|
||||
def load_wav(path, sr):
|
||||
return librosa.core.load(path, sr=sr)[0]
|
||||
#def load_wav(path, sr):
|
||||
# audio, sr_native = sf.read(path)
|
||||
# if sr != sr_native:
|
||||
# audio = librosa.resample(audio.T, sr_native, sr).T
|
||||
# return audio
|
||||
|
||||
def save_wav(wav, path, sr):
|
||||
wav *= 32767 / max(0.01, np.max(np.abs(wav)))
|
||||
#proposed by @dsmiller
|
||||
wavfile.write(path, sr, wav.astype(np.int16))
|
||||
|
||||
def save_wavenet_wav(wav, path, sr):
|
||||
librosa.output.write_wav(path, wav, sr=sr)
|
||||
|
||||
def preemphasis(wav, k, preemphasize=True):
|
||||
if preemphasize:
|
||||
return signal.lfilter([1, -k], [1], wav)
|
||||
return wav
|
||||
|
||||
def inv_preemphasis(wav, k, inv_preemphasize=True):
|
||||
if inv_preemphasize:
|
||||
return signal.lfilter([1], [1, -k], wav)
|
||||
return wav
|
||||
|
||||
def get_hop_size():
|
||||
hop_size = hp.hop_size
|
||||
if hop_size is None:
|
||||
assert hp.frame_shift_ms is not None
|
||||
hop_size = int(hp.frame_shift_ms / 1000 * hp.sample_rate)
|
||||
return hop_size
|
||||
|
||||
def linearspectrogram(wav):
|
||||
D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))
|
||||
S = _amp_to_db(np.abs(D)) - hp.ref_level_db
|
||||
|
||||
if hp.signal_normalization:
|
||||
return _normalize(S)
|
||||
return S
|
||||
|
||||
def melspectrogram(wav):
|
||||
D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))
|
||||
S = _amp_to_db(_linear_to_mel(np.abs(D))) - hp.ref_level_db
|
||||
|
||||
if hp.signal_normalization:
|
||||
return _normalize(S)
|
||||
return S
|
||||
|
||||
def _lws_processor():
|
||||
import lws
|
||||
return lws.lws(hp.n_fft, get_hop_size(), fftsize=hp.win_size, mode="speech")
|
||||
|
||||
def _stft(y):
|
||||
if hp.use_lws:
|
||||
return _lws_processor(hp).stft(y).T
|
||||
else:
|
||||
return librosa.stft(y=y, n_fft=hp.n_fft, hop_length=get_hop_size(), win_length=hp.win_size)
|
||||
|
||||
##########################################################
|
||||
#Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!)
|
||||
def num_frames(length, fsize, fshift):
|
||||
"""Compute number of time frames of spectrogram
|
||||
"""
|
||||
pad = (fsize - fshift)
|
||||
if length % fshift == 0:
|
||||
        M = (length + pad * 2 - fsize) // fshift + 1
    else:
        M = (length + pad * 2 - fsize) // fshift + 2
    return M


def pad_lr(x, fsize, fshift):
    """Compute left and right padding"""
    M = num_frames(len(x), fsize, fshift)
    pad = (fsize - fshift)
    T = len(x) + 2 * pad
    r = (M - 1) * fshift + fsize - T
    return pad, pad + r

##########################################################
# Librosa correct padding
def librosa_pad_lr(x, fsize, fshift):
    return 0, (x.shape[0] // fshift + 1) * fshift - x.shape[0]


# Conversions
_mel_basis = None


def _linear_to_mel(spectogram):
    global _mel_basis
    if _mel_basis is None:
        _mel_basis = _build_mel_basis()
    return np.dot(_mel_basis, spectogram)


def _build_mel_basis():
    assert hp.fmax <= hp.sample_rate // 2
    return librosa.filters.mel(sr=hp.sample_rate, n_fft=hp.n_fft, n_mels=hp.num_mels,
                               fmin=hp.fmin, fmax=hp.fmax)


def _amp_to_db(x):
    min_level = np.exp(hp.min_level_db / 20 * np.log(10))
    return 20 * np.log10(np.maximum(min_level, x))


def _db_to_amp(x):
    return np.power(10.0, (x) * 0.05)


def _normalize(S):
    if hp.allow_clipping_in_normalization:
        if hp.symmetric_mels:
            return np.clip((2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - hp.max_abs_value,
                           -hp.max_abs_value, hp.max_abs_value)
        else:
            return np.clip(hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db)), 0, hp.max_abs_value)

    assert S.max() <= 0 and S.min() - hp.min_level_db >= 0
    if hp.symmetric_mels:
        return (2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - hp.max_abs_value
    else:
        return hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db))


def _denormalize(D):
    if hp.allow_clipping_in_normalization:
        if hp.symmetric_mels:
            return (((np.clip(D, -hp.max_abs_value,
                              hp.max_abs_value) + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value))
                    + hp.min_level_db)
        else:
            return ((np.clip(D, 0, hp.max_abs_value) * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)

    if hp.symmetric_mels:
        return (((D + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value)) + hp.min_level_db)
    else:
        return ((D * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)
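The `_normalize` / `_denormalize` pair above maps a dB-scale mel spectrogram into a bounded symmetric range and back. A minimal standalone sketch of the symmetric, clipping branch, assuming typical hyperparameter values (`max_abs_value=4.0`, `min_level_db=-100.0`; these constants are assumptions, not read from the repo's `hp` config):

```python
import numpy as np

# Assumed hyperparameters, mirroring a common Wav2Lip/MuseTalk-style config
MAX_ABS_VALUE = 4.0
MIN_LEVEL_DB = -100.0

def normalize(S):
    # Map dB values in [MIN_LEVEL_DB, 0] to the symmetric range [-MAX_ABS_VALUE, MAX_ABS_VALUE]
    return np.clip((2 * MAX_ABS_VALUE) * ((S - MIN_LEVEL_DB) / (-MIN_LEVEL_DB)) - MAX_ABS_VALUE,
                   -MAX_ABS_VALUE, MAX_ABS_VALUE)

def denormalize(D):
    # Inverse mapping back to the dB scale
    return ((np.clip(D, -MAX_ABS_VALUE, MAX_ABS_VALUE) + MAX_ABS_VALUE)
            * -MIN_LEVEL_DB / (2 * MAX_ABS_VALUE) + MIN_LEVEL_DB)

S = np.array([-100.0, -50.0, 0.0])   # dB values at the floor, midpoint, and ceiling
roundtrip = denormalize(normalize(S))  # values inside the valid range survive a round trip
```

Inside the unclipped range the two functions are exact inverses, which is why the non-clipping branch in `_normalize` asserts `S.max() <= 0` and `S.min() >= min_level_db` before applying the same affine map.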
models/MuseTalk/musetalk/data/dataset.py (new file, 610 lines) @@ -0,0 +1,610 @@
import os
import numpy as np
import random
from PIL import Image
import torch
from torch.utils.data import Dataset, ConcatDataset
import torchvision.transforms as transforms
from transformers import AutoFeatureExtractor
import librosa
import time
import json
import math
from decord import AudioReader, VideoReader
from decord.ndarray import cpu

from musetalk.data.sample_method import get_src_idx, shift_landmarks_to_face_coordinates, resize_landmark
from musetalk.data import audio
from musetalk.utils.audio_utils import ensure_wav

syncnet_mel_step_size = math.ceil(16 / 5 * 16)  # latentsync


class FaceDataset(Dataset):
    """Dataset class for loading and processing video data

    Each video can be represented as:
    - Concatenated frame images
    - '.mp4' or '.gif' files
    - Folder containing all frames
    """
    def __init__(self,
                 cfg,
                 list_paths,
                 root_path='./dataset/',
                 repeats=None):
        # Initialize dataset paths
        meta_paths = []
        if repeats is None:
            repeats = [1] * len(list_paths)
        assert len(repeats) == len(list_paths)

        # Load data list
        for list_path, repeat_time in zip(list_paths, repeats):
            with open(list_path, 'r') as f:
                num = 0
                f.readline()  # Skip header line
                for line in f.readlines():
                    line_info = line.strip()
                    meta = line_info.split()
                    meta = meta[0]
                    meta_paths.extend([os.path.join(root_path, meta)] * repeat_time)
                    num += 1
                print(f'{list_path}: {num} x {repeat_time} = {num * repeat_time} samples')

        # Set basic attributes
        self.meta_paths = meta_paths
        self.root_path = root_path
        self.image_size = cfg['image_size']
        self.min_face_size = cfg['min_face_size']
        self.T = cfg['T']
        self.sample_method = cfg['sample_method']
        self.top_k_ratio = cfg['top_k_ratio']
        self.max_attempts = 200
        self.padding_pixel_mouth = cfg['padding_pixel_mouth']

        # Cropping related parameters
        self.crop_type = cfg['crop_type']
        self.jaw2edge_margin_mean = cfg['cropping_jaw2edge_margin_mean']
        self.jaw2edge_margin_std = cfg['cropping_jaw2edge_margin_std']
        self.random_margin_method = cfg['random_margin_method']

        # Image transformations
        self.to_tensor = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        self.pose_to_tensor = transforms.Compose([
            transforms.ToTensor(),
        ])

        # Feature extractor
        self.feature_extractor = AutoFeatureExtractor.from_pretrained(cfg['whisper_path'])
        self.contorl_face_min_size = cfg["contorl_face_min_size"]

        print("The sample method is: ", self.sample_method)
        print(f"only use face size > {self.min_face_size}", self.contorl_face_min_size)

    def generate_random_value(self):
        """Generate a random extra margin

        Returns:
            float: Generated random value
        """
        if self.random_margin_method == "uniform":
            random_value = np.random.uniform(
                self.jaw2edge_margin_mean - self.jaw2edge_margin_std,
                self.jaw2edge_margin_mean + self.jaw2edge_margin_std
            )
        elif self.random_margin_method == "normal":
            random_value = np.random.normal(
                loc=self.jaw2edge_margin_mean,
                scale=self.jaw2edge_margin_std
            )
            random_value = np.clip(
                random_value,
                self.jaw2edge_margin_mean - self.jaw2edge_margin_std,
                self.jaw2edge_margin_mean + self.jaw2edge_margin_std,
            )
        else:
            raise ValueError(f"Invalid random margin method: {self.random_margin_method}")
        return max(0, random_value)

    def dynamic_margin_crop(self, img, original_bbox, extra_margin=None):
        """Crop image with a dynamically sampled jaw margin

        Args:
            img: Input image
            original_bbox: Original bounding box
            extra_margin: Extra margin

        Returns:
            tuple: (x1, y1, x2, y2, extra_margin)
        """
        if extra_margin is None:
            extra_margin = self.generate_random_value()
        w, h = img.size
        x1, y1, x2, y2 = original_bbox
        y2 = min(y2 + int(extra_margin), h)
        return x1, y1, x2, y2, extra_margin

    def crop_resize_img(self, img, bbox, crop_type='crop_resize', extra_margin=None):
        """Crop and resize image

        Args:
            img: Input image
            bbox: Bounding box
            crop_type: Type of cropping
            extra_margin: Extra margin

        Returns:
            tuple: (Processed image, extra_margin, mask_scaled_factor)
        """
        mask_scaled_factor = 1.
        if crop_type == 'crop_resize':
            x1, y1, x2, y2 = bbox
            img = img.crop((x1, y1, x2, y2))
            img = img.resize((self.image_size, self.image_size), Image.LANCZOS)
        elif crop_type == 'dynamic_margin_crop_resize':
            x1, y1, x2, y2, extra_margin = self.dynamic_margin_crop(img, bbox, extra_margin)
            w_original, _ = img.size
            img = img.crop((x1, y1, x2, y2))
            w_cropped, _ = img.size
            mask_scaled_factor = w_cropped / w_original
            img = img.resize((self.image_size, self.image_size), Image.LANCZOS)
        elif crop_type == 'resize':
            w, h = img.size
            scale = np.sqrt(self.image_size ** 2 / (h * w))
            new_w = int(w * scale) // 64 * 64  # integer division: round down to a multiple of 64
            new_h = int(h * scale) // 64 * 64
            img = img.resize((new_w, new_h), Image.LANCZOS)
        return img, extra_margin, mask_scaled_factor

    def get_audio_file(self, wav_path, start_index):
        """Get audio file features

        Args:
            wav_path: Audio file path
            start_index: Starting index

        Returns:
            tuple: (Audio features, start index)
        """
        if not os.path.exists(wav_path):
            return None
        wav_path_converted = ensure_wav(wav_path)
        audio_input_librosa, sampling_rate = librosa.load(wav_path_converted, sr=16000)
        assert sampling_rate == 16000

        while start_index >= 25 * 30:
            audio_input = audio_input_librosa[16000 * 30:]
            start_index -= 25 * 30
        if start_index + 2 * 25 >= 25 * 30:
            start_index -= 4 * 25
            audio_input = audio_input_librosa[16000 * 4:16000 * 34]
        else:
            audio_input = audio_input_librosa[:16000 * 30]

        assert 2 * (start_index) >= 0
        assert 2 * (start_index + 2 * 25) <= 1500

        audio_input = self.feature_extractor(
            audio_input,
            return_tensors="pt",
            sampling_rate=sampling_rate
        ).input_features
        return audio_input, start_index

    def get_audio_file_mel(self, wav_path, start_index):
        """Get mel spectrogram of audio file

        Args:
            wav_path: Audio file path
            start_index: Starting index

        Returns:
            tuple: (Mel spectrogram, start index)
        """
        if not os.path.exists(wav_path):
            return None

        wav_path_converted = ensure_wav(wav_path)
        audio_input_librosa, sampling_rate = librosa.load(wav_path_converted, sr=16000)
        assert sampling_rate == 16000

        audio_mel = self.mel_feature_extractor(audio_input_librosa)
        return audio_mel, start_index

    def mel_feature_extractor(self, audio_input):
        """Extract mel spectrogram features

        Args:
            audio_input: Input audio

        Returns:
            ndarray: Mel spectrogram features
        """
        orig_mel = audio.melspectrogram(audio_input)
        return orig_mel.T

    def crop_audio_window(self, spec, start_frame_num, fps=25):
        """Crop audio window

        Args:
            spec: Spectrogram
            start_frame_num: Starting frame number
            fps: Frames per second

        Returns:
            ndarray: Cropped spectrogram
        """
        start_idx = int(80. * (start_frame_num / float(fps)))
        end_idx = start_idx + syncnet_mel_step_size
        return spec[start_idx: end_idx, :]

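`crop_audio_window` maps a video frame index to a mel-frame index (80 mel frames per second of audio at the usual hop length) and slices a fixed `syncnet_mel_step_size` window. A minimal standalone sketch of that indexing, using a hypothetical zero-filled spectrogram in place of real audio:

```python
import math
import numpy as np

SYNCNET_MEL_STEP_SIZE = math.ceil(16 / 5 * 16)  # 52 mel frames, as in the dataset code

def crop_audio_window(spec, start_frame_num, fps=25):
    # 80 mel frames per second of audio; convert a video frame index to a mel index
    start_idx = int(80.0 * (start_frame_num / float(fps)))
    return spec[start_idx:start_idx + SYNCNET_MEL_STEP_SIZE, :]

spec = np.zeros((1000, 80))  # hypothetical (time, n_mels) spectrogram
window = crop_audio_window(spec, start_frame_num=25)  # one second in -> mel index 80
```

Windows that would run past the end of the spectrogram come back shorter than `SYNCNET_MEL_STEP_SIZE`, which is exactly the condition `__getitem__` checks before accepting a sample.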
    def get_syncnet_input(self, video_path):
        """Get SyncNet input features

        Args:
            video_path: Video file path

        Returns:
            ndarray: SyncNet input features
        """
        ar = AudioReader(video_path, sample_rate=16000)
        original_mel = audio.melspectrogram(ar[:].asnumpy().squeeze(0))
        return original_mel.T

    def get_resized_mouth_mask(
        self,
        img_resized,
        landmark_array,
        face_shape,
        padding_pixel_mouth=0,
        image_size=256,
        crop_margin=0
    ):
        landmark_array = np.array(landmark_array)
        resized_landmark = resize_landmark(
            landmark_array, w=face_shape[0], h=face_shape[1], new_w=image_size, new_h=image_size)

        landmark_array = np.array(resized_landmark[48:67])  # the lip landmarks in 68 landmarks format
        min_x, min_y = np.min(landmark_array, axis=0)
        max_x, max_y = np.max(landmark_array, axis=0)
        min_x = min_x - padding_pixel_mouth
        max_x = max_x + padding_pixel_mouth

        # Calculate x-axis length and use it for y-axis
        width = max_x - min_x

        # Calculate old center point
        center_y = (max_y + min_y) / 2

        # Determine new min_y and max_y based on width
        min_y = center_y - width / 4
        max_y = center_y + width / 4

        # Adjust mask position for dynamic crop, shift y-axis
        min_y = min_y - crop_margin
        max_y = max_y - crop_margin

        # Prevent out of bounds
        min_x = max(min_x, 0)
        min_y = max(min_y, 0)
        max_x = min(max_x, face_shape[0])
        max_y = min(max_y, face_shape[1])

        mask = np.zeros_like(np.array(img_resized))
        mask[round(min_y):round(max_y), round(min_x):round(max_x)] = 255
        return Image.fromarray(mask)

    def __len__(self):
        return 100000

    def __getitem__(self, idx):
        attempts = 0
        while attempts < self.max_attempts:
            try:
                meta_path = random.sample(self.meta_paths, k=1)[0]
                with open(meta_path, 'r') as f:
                    meta_data = json.load(f)
            except Exception as e:
                print(f"meta file error:{meta_path}")
                print(e)
                attempts += 1
                time.sleep(0.1)
                continue

            video_path = meta_data["mp4_path"]
            wav_path = meta_data["wav_path"]
            bbox_list = meta_data["face_list"]
            landmark_list = meta_data["landmark_list"]
            T = self.T

            s = 0
            e = meta_data["frames"]
            len_valid_clip = e - s

            if len_valid_clip < T * 10:
                attempts += 1
                print(f"video {video_path} has less than {T * 10} frames")
                continue

            try:
                cap = VideoReader(video_path, fault_tol=1, ctx=cpu(0))
                total_frames = len(cap)
                assert total_frames == len(landmark_list)
                assert total_frames == len(bbox_list)
                landmark_shape = np.array(landmark_list).shape
                if landmark_shape != (total_frames, 68, 2):
                    attempts += 1
                    print(f"video {video_path} has invalid landmark shape: {landmark_shape}, expected: {(total_frames, 68, 2)}")  # we use 68 landmarks
                    continue
            except Exception as e:
                print(f"video file error:{video_path}")
                print(e)
                attempts += 1
                time.sleep(0.1)
                continue

            shift_landmarks, bbox_list_union, face_shapes = shift_landmarks_to_face_coordinates(
                landmark_list,
                bbox_list
            )
            if self.contorl_face_min_size and face_shapes[0][0] < self.min_face_size:
                print(f"video {video_path} has face size {face_shapes[0][0]} less than minimum required {self.min_face_size}")
                attempts += 1
                continue

            step = 1
            drive_idx_start = random.randint(s, e - T * step)
            drive_idx_list = list(
                range(drive_idx_start, drive_idx_start + T * step, step))
            assert len(drive_idx_list) == T

            src_idx_list = []
            list_index_out_of_range = False
            for drive_idx in drive_idx_list:
                src_idx = get_src_idx(
                    drive_idx, T, self.sample_method, shift_landmarks, face_shapes, self.top_k_ratio)
                if src_idx is None:
                    list_index_out_of_range = True
                    break
                src_idx = min(src_idx, e - 1)
                src_idx = max(src_idx, s)
                src_idx_list.append(src_idx)

            if list_index_out_of_range:
                attempts += 1
                print(f"video {video_path} has invalid source index for drive frames")
                continue

            ref_face_valid_flag = True
            extra_margin = self.generate_random_value()

            # Get reference images
            ref_imgs = []
            for src_idx in src_idx_list:
                imSrc = Image.fromarray(cap[src_idx].asnumpy())
                bbox_s = bbox_list_union[src_idx]
                imSrc, _, _ = self.crop_resize_img(
                    imSrc,
                    bbox_s,
                    self.crop_type,
                    extra_margin=None
                )
                if self.contorl_face_min_size and min(imSrc.size[0], imSrc.size[1]) < self.min_face_size:
                    ref_face_valid_flag = False
                    break
                ref_imgs.append(imSrc)

            if not ref_face_valid_flag:
                attempts += 1
                print(f"video {video_path} has reference face size smaller than minimum required {self.min_face_size}")
                continue

            # Get target images and masks
            imSameIDs = []
            bboxes = []
            face_masks = []
            face_mask_valid = True
            target_face_valid_flag = True

            for drive_idx in drive_idx_list:
                imSameID = Image.fromarray(cap[drive_idx].asnumpy())
                bbox_s = bbox_list_union[drive_idx]
                imSameID, _, mask_scaled_factor = self.crop_resize_img(
                    imSameID,
                    bbox_s,
                    self.crop_type,
                    extra_margin=extra_margin
                )
                if self.contorl_face_min_size and min(imSameID.size[0], imSameID.size[1]) < self.min_face_size:
                    target_face_valid_flag = False
                    break
                crop_margin = extra_margin * mask_scaled_factor
                face_mask = self.get_resized_mouth_mask(
                    imSameID,
                    shift_landmarks[drive_idx],
                    face_shapes[drive_idx],
                    self.padding_pixel_mouth,
                    self.image_size,
                    crop_margin=crop_margin
                )
                if np.count_nonzero(face_mask) == 0:
                    face_mask_valid = False
                    break

                if face_mask.size[1] == 0 or face_mask.size[0] == 0:
                    print(f"video {video_path} has invalid face mask size at frame {drive_idx}")
                    face_mask_valid = False
                    break

                imSameIDs.append(imSameID)
                bboxes.append(bbox_s)
                face_masks.append(face_mask)

            if not face_mask_valid:
                attempts += 1
                print(f"video {video_path} has invalid face mask")
                continue

            if not target_face_valid_flag:
                attempts += 1
                print(f"video {video_path} has target face size smaller than minimum required {self.min_face_size}")
                continue

            # Process audio features
            audio_offset = drive_idx_list[0]
            audio_step = step
            fps = 25.0 / step

            try:
                audio_feature, audio_offset = self.get_audio_file(wav_path, audio_offset)
                _, audio_offset = self.get_audio_file_mel(wav_path, audio_offset)
                audio_feature_mel = self.get_syncnet_input(video_path)
            except Exception as e:
                print(f"audio file error:{wav_path}")
                print(e)
                attempts += 1
                time.sleep(0.1)
                continue

            mel = self.crop_audio_window(audio_feature_mel, audio_offset)
            if mel.shape[0] != syncnet_mel_step_size:
                attempts += 1
                print(f"video {video_path} has invalid mel spectrogram shape: {mel.shape}, expected: {syncnet_mel_step_size}")
                continue

            mel = torch.FloatTensor(mel.T).unsqueeze(0)

            # Build sample dictionary
            sample = dict(
                pixel_values_vid=torch.stack(
                    [self.to_tensor(imSameID) for imSameID in imSameIDs], dim=0),
                pixel_values_ref_img=torch.stack(
                    [self.to_tensor(ref_img) for ref_img in ref_imgs], dim=0),
                pixel_values_face_mask=torch.stack(
                    [self.pose_to_tensor(face_mask) for face_mask in face_masks], dim=0),
                audio_feature=audio_feature[0],
                audio_offset=audio_offset,
                audio_step=audio_step,
                mel=mel,
                wav_path=wav_path,
                fps=fps,
            )

            return sample

        raise ValueError("Unable to find a valid sample after maximum attempts.")

class HDTFDataset(FaceDataset):
    """HDTF dataset class"""
    def __init__(self, cfg):
        root_path = './dataset/HDTF/meta'
        list_paths = [
            './dataset/HDTF/train.txt',
        ]

        repeats = [10]
        super().__init__(cfg, list_paths, root_path, repeats)
        print('HDTFDataset: ', len(self))


class VFHQDataset(FaceDataset):
    """VFHQ dataset class"""
    def __init__(self, cfg):
        root_path = './dataset/VFHQ/meta'
        list_paths = [
            './dataset/VFHQ/train.txt',
        ]
        repeats = [1]
        super().__init__(cfg, list_paths, root_path, repeats)
        print('VFHQDataset: ', len(self))


def PortraitDataset(cfg=None):
    """Return dataset based on configuration

    Args:
        cfg: Configuration dictionary

    Returns:
        Dataset: Combined dataset
    """
    if cfg["dataset_key"] == "HDTF":
        return ConcatDataset([HDTFDataset(cfg)])
    elif cfg["dataset_key"] == "VFHQ":
        return ConcatDataset([VFHQDataset(cfg)])
    else:
        print("############ use all dataset ############ ")
        return ConcatDataset([HDTFDataset(cfg), VFHQDataset(cfg)])


if __name__ == '__main__':
    # Set random seeds for reproducibility
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Create dataset with configuration parameters
    dataset = PortraitDataset(cfg={
        'T': 1,  # Number of frames to process at once
        'random_margin_method': "normal",  # Method for generating random margins: "normal" or "uniform"
        'dataset_key': "HDTF",  # Dataset to use: "HDTF", "VFHQ", or None for both
        'image_size': 256,  # Size of processed images (height and width)
        'sample_method': 'pose_similarity_and_mouth_dissimilarity',  # Method for selecting reference frames
        'top_k_ratio': 0.51,  # Ratio for top-k selection in reference frame sampling
        'contorl_face_min_size': True,  # Whether to enforce minimum face size
        'padding_pixel_mouth': 10,  # Padding pixels around mouth region in mask
        'min_face_size': 200,  # Minimum face size requirement for dataset
        'whisper_path': "./models/whisper",  # Path to Whisper model
        'cropping_jaw2edge_margin_mean': 10,  # Mean margin for jaw-to-edge cropping
        'cropping_jaw2edge_margin_std': 10,  # Standard deviation for jaw-to-edge cropping
        'crop_type': "dynamic_margin_crop_resize",  # "crop_resize", "dynamic_margin_crop_resize", or "resize"
    })
    print(len(dataset))

    import torchvision
    os.makedirs('debug', exist_ok=True)
    for i in range(10):  # Check 10 samples
        sample = dataset[0]
        print(f"processing {i}")

        # Get images and mask
        ref_img = (sample['pixel_values_ref_img'] + 1.0) / 2  # (b, c, h, w)
        target_img = (sample['pixel_values_vid'] + 1.0) / 2
        face_mask = sample['pixel_values_face_mask']

        # Print dimension information
        print(f"ref_img shape: {ref_img.shape}")
        print(f"target_img shape: {target_img.shape}")
        print(f"face_mask shape: {face_mask.shape}")

        # Create visualization images
        b, c, h, w = ref_img.shape

        # Apply mask only to target image
        target_mask = face_mask

        # Keep reference image unchanged
        ref_with_mask = ref_img.clone()

        # Create mask overlay for target image
        target_with_mask = target_img.clone()
        target_with_mask = target_with_mask * (1 - target_mask) + target_mask  # Apply mask only to target

        # Save original images, mask, and overlay results
        # Column 1: original images; column 2: mask (black for ref); column 3: overlay effect
        concatenated_img = torch.cat((
            ref_img, target_img,  # Original images
            torch.zeros_like(ref_img), target_mask,  # Mask (black for ref)
            ref_with_mask, target_with_mask  # Overlay effect
        ), dim=3)

        torchvision.utils.save_image(
            concatenated_img, f'debug/mask_check_{i}.jpg', nrow=2)
models/MuseTalk/musetalk/data/sample_method.py (new file, 233 lines) @@ -0,0 +1,233 @@
import numpy as np
import random


def summarize_tensor(x):
    return f"\033[34m{str(tuple(x.shape)).ljust(24)}\033[0m (\033[31mmin {x.min().item():+.4f}\033[0m / \033[32mmean {x.mean().item():+.4f}\033[0m / \033[33mmax {x.max().item():+.4f}\033[0m)"


def calculate_mouth_open_similarity(landmarks_list, select_idx, top_k=50, ascending=True):
    num_landmarks = len(landmarks_list)
    mouth_open_ratios = np.zeros(num_landmarks)

    # Calculate mouth opening ratios
    for i, landmarks in enumerate(landmarks_list):
        # Assuming landmarks are in the format [x, y] and accessible by index
        mouth_top = landmarks[165]  # Adjust index according to your landmarks format
        mouth_bottom = landmarks[147]  # Adjust index according to your landmarks format
        mouth_open_ratios[i] = np.linalg.norm(mouth_top - mouth_bottom)

    # Differences of every frame against the selected frame
    differences = np.abs(mouth_open_ratios - mouth_open_ratios[select_idx])
    differences_with_signs = mouth_open_ratios - mouth_open_ratios[select_idx]

    # Find the top_k most (ascending) or least (descending) similar indices
    if ascending:
        top_indices = np.argsort(differences)[:top_k]
    else:
        top_indices = np.argsort(-differences)[:top_k]
    similar_landmarks_indices = top_indices.tolist()
    similar_landmarks_distances = differences_with_signs.tolist()  # note: keep unsorted

    return similar_landmarks_indices, similar_landmarks_distances
#############################################################################################
def get_closed_mouth(landmarks_list, ascending=True, top_k=50):
    num_landmarks = len(landmarks_list)

    mouth_open_ratios = np.zeros(num_landmarks)
    # Calculate mouth opening ratios
    for i, landmarks in enumerate(landmarks_list):
        # Assuming landmarks are in the format [x, y] and accessible by index
        mouth_top = np.array(landmarks[165])  # Adjust index according to your landmarks format
        mouth_bottom = np.array(landmarks[147])  # Adjust index according to your landmarks format
        mouth_open_ratios[i] = np.linalg.norm(mouth_top - mouth_bottom)

    # Find the top_k indices with the smallest (or largest) mouth opening
    if ascending:
        top_indices = np.argsort(mouth_open_ratios)[:top_k]
    else:
        top_indices = np.argsort(-mouth_open_ratios)[:top_k]
    return top_indices

def calculate_landmarks_similarity(selected_idx, landmarks_list, image_shapes, start_index, end_index, top_k=50, ascending=True):
    """
    Calculate the similarity between sets of facial landmarks and return the indices of the most similar faces.

    Parameters:
        selected_idx (int): Index of the reference landmark set.
        landmarks_list (list): A list containing sets of facial landmarks, each element is a set of landmarks.
        image_shapes (list): A list containing the shape of each image, each element is a (width, height) tuple.
        start_index (int): The starting index of the facial landmarks.
        end_index (int): The ending index of the facial landmarks.
        top_k (int): The number of most similar landmark sets to return. Default is 50.
        ascending (bool): Controls the sorting order. If True, sort in ascending order; if False, in descending order. Default is True.

    Returns:
        similar_landmarks_indices (list): Indices of the most similar facial landmark sets.
    """
    num_landmarks = len(landmarks_list)
    resized_landmarks = []

    # Preprocess landmarks
    for i in range(num_landmarks):
        landmark_array = np.array(landmarks_list[i])
        selected_landmarks = landmark_array[start_index:end_index]
        resized_landmark = resize_landmark(selected_landmarks, w=image_shapes[i][0], h=image_shapes[i][1], new_w=256, new_h=256)
        resized_landmarks.append(resized_landmark)

    resized_landmarks_array = np.array(resized_landmarks)  # Convert list to array for easier manipulation

    # Calculate similarity
    distances = np.linalg.norm(resized_landmarks_array - resized_landmarks_array[selected_idx][np.newaxis, :], axis=2)
    overall_distances = np.mean(distances, axis=1)  # Mean distance for each set of landmarks

    if ascending:
        sorted_indices = np.argsort(overall_distances)
        similar_landmarks_indices = sorted_indices[1:top_k + 1].tolist()  # Exclude self and take top_k
    else:
        sorted_indices = np.argsort(-overall_distances)
        similar_landmarks_indices = sorted_indices[0:top_k].tolist()

    return similar_landmarks_indices

def process_bbox_musetalk(face_array, landmark_array):
    x_min_face, y_min_face, x_max_face, y_max_face = map(int, face_array)
    x_min_lm = min([int(x) for x, y in landmark_array])
    y_min_lm = min([int(y) for x, y in landmark_array])
    x_max_lm = max([int(x) for x, y in landmark_array])
    y_max_lm = max([int(y) for x, y in landmark_array])
    x_min = min(x_min_face, x_min_lm)
    y_min = min(y_min_face, y_min_lm)
    x_max = max(x_max_face, x_max_lm)
    y_max = max(y_max_face, y_max_lm)

    x_min = max(x_min, 0)
    y_min = max(y_min, 0)

    return [x_min, y_min, x_max, y_max]

def shift_landmarks_to_face_coordinates(landmark_list, face_list):
    """
    Translates the data in landmark_list to the coordinates of the cropped larger face.

    Parameters:
        landmark_list (list): A list containing multiple sets of facial landmarks.
        face_list (list): A list containing multiple facial bounding boxes.

    Returns:
        landmark_list_shift (list): The list of translated landmarks.
        bbox_union (list): The list of union bounding boxes.
        face_shapes (list): The list of facial shapes.
    """
    landmark_list_shift = []
    bbox_union = []
    face_shapes = []

    for i in range(len(face_list)):
        landmark_array = np.array(landmark_list[i])  # convert to a numpy array (creates a copy)
        face_array = face_list[i]
        f_landmark_bbox = process_bbox_musetalk(face_array, landmark_array)
        x_min, y_min, x_max, y_max = f_landmark_bbox
        landmark_array[:, 0] = landmark_array[:, 0] - f_landmark_bbox[0]
        landmark_array[:, 1] = landmark_array[:, 1] - f_landmark_bbox[1]
        landmark_list_shift.append(landmark_array)
        bbox_union.append(f_landmark_bbox)
        face_shapes.append((x_max - x_min, y_max - y_min))

    return landmark_list_shift, bbox_union, face_shapes


def resize_landmark(landmark, w, h, new_w, new_h):
    landmark_norm = landmark / [w, h]
    landmark_resized = landmark_norm * [new_w, new_h]

    return landmark_resized

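`resize_landmark` is a pure coordinate rescale: normalize each point by the source frame size, then scale into the target frame. A self-contained sketch with hypothetical landmark values:

```python
import numpy as np

def resize_landmark(landmark, w, h, new_w, new_h):
    # Normalize to [0, 1] in the source frame, then scale into the target frame
    landmark_norm = landmark / [w, h]
    return landmark_norm * [new_w, new_h]

pts = np.array([[100.0, 50.0], [400.0, 300.0]])  # hypothetical (x, y) landmarks
resized = resize_landmark(pts, w=400, h=400, new_w=256, new_h=256)
```

Dividing a NumPy array by a plain `[w, h]` list broadcasts per column, so the same function handles a single point or a full 68-point set.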
def get_src_idx(drive_idx, T, sample_method,landmarks_list,image_shapes,top_k_ratio):
|
||||
"""
|
||||
Calculate the source index (src_idx) based on the given drive index, T, s, e, and sampling method.
|
||||
|
||||
Parameters:
|
||||
- drive_idx (int): The current drive index.
|
||||
- T (int): Total number of frames or a specific range limit.
|
||||
- sample_method (str): Sampling method, which can be "random" or other methods.
|
||||
- landmarks_list (list): List of facial landmarks.
|
||||
- image_shapes (list): List of image shapes.
|
||||
- top_k_ratio (float): Ratio for selecting top k similar frames.
|
||||
|
||||
Returns:
|
||||
- src_idx (int): The calculated source index.
|
||||
"""
|
||||
if sample_method == "random":
|
||||
src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
|
||||
elif sample_method == "pose_similarity":
|
||||
top_k = int(top_k_ratio*len(landmarks_list))
|
||||
try:
|
||||
            top_k = int(top_k_ratio * len(landmarks_list))
            # facial contour
            landmark_start_idx = 0
            landmark_end_idx = 16
            pose_similarity_list = calculate_landmarks_similarity(
                drive_idx, landmarks_list, image_shapes,
                landmark_start_idx, landmark_end_idx, top_k=top_k, ascending=True)
            src_idx = random.choice(pose_similarity_list)
            while abs(src_idx - drive_idx) < 5:
                src_idx = random.choice(pose_similarity_list)
        except Exception as e:
            print(e)
            return None
    elif sample_method == "pose_similarity_and_closed_mouth":
        # facial contour
        landmark_start_idx = 0
        landmark_end_idx = 16
        try:
            top_k = int(top_k_ratio * len(landmarks_list))
            closed_mouth_list = get_closed_mouth(landmarks_list, ascending=True, top_k=top_k)
            pose_similarity_list = calculate_landmarks_similarity(
                drive_idx, landmarks_list, image_shapes,
                landmark_start_idx, landmark_end_idx, top_k=top_k, ascending=True)
            common_list = list(set(closed_mouth_list).intersection(set(pose_similarity_list)))
            if len(common_list) == 0:
                src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
            else:
                src_idx = random.choice(common_list)

            while abs(src_idx - drive_idx) < 5:
                src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)

        except Exception as e:
            print(e)
            return None

    elif sample_method == "pose_similarity_and_mouth_dissimilarity":
        try:
            top_k = int(top_k_ratio * len(landmarks_list))

            # facial contour for 68-landmark format
            landmark_start_idx = 0
            landmark_end_idx = 16
            pose_similarity_list = calculate_landmarks_similarity(
                drive_idx, landmarks_list, image_shapes,
                landmark_start_idx, landmark_end_idx, top_k=top_k, ascending=True)

            # mouth inner contour for 68-landmark format
            landmark_start_idx = 60
            landmark_end_idx = 67
            mouth_dissimilarity_list = calculate_landmarks_similarity(
                drive_idx, landmarks_list, image_shapes,
                landmark_start_idx, landmark_end_idx, top_k=top_k, ascending=False)

            common_list = list(set(pose_similarity_list).intersection(set(mouth_dissimilarity_list)))
            if len(common_list) == 0:
                src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
            else:
                src_idx = random.choice(common_list)

            while abs(src_idx - drive_idx) < 5:
                src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)

        except Exception as e:
            print(e)
            return None

    else:
        raise ValueError(f"Unknown sample_method: {sample_method}")
    return src_idx
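The branches above all follow the same selection pattern: intersect two candidate index lists, fall back to a random window around `drive_idx` when the intersection is empty, and re-draw while the pick is too close to the driving frame. A minimal sketch of that pattern (function name and signature are hypothetical, not part of the diff):

```python
import random

def pick_src_idx(candidates_a, candidates_b, drive_idx, T, min_gap=5):
    """Intersect two candidate lists; fall back to a window around drive_idx."""
    common = list(set(candidates_a) & set(candidates_b))
    if not common:
        src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
    else:
        src_idx = random.choice(common)
    # Re-draw until the source frame is far enough from the driving frame
    while abs(src_idx - drive_idx) < min_gap:
        src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
    return src_idx
```

Note the fallback draw can itself land inside the exclusion zone, which is why the `while` loop re-draws rather than assuming one draw suffices.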
81
models/MuseTalk/musetalk/loss/basic_loss.py
Normal file
@@ -0,0 +1,81 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.optim.lr_scheduler import CosineAnnealingLR
from omegaconf import OmegaConf

from musetalk.loss.discriminator import MultiScaleDiscriminator, DiscriminatorFullModel
import musetalk.loss.vgg_face as vgg_face


class Interpolate(nn.Module):
    def __init__(self, size=None, scale_factor=None, mode='nearest', align_corners=None):
        super(Interpolate, self).__init__()
        self.size = size
        self.scale_factor = scale_factor
        self.mode = mode
        self.align_corners = align_corners

    def forward(self, input):
        return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners)


def set_requires_grad(net, requires_grad=False):
    if net is not None:
        for param in net.parameters():
            param.requires_grad = requires_grad


if __name__ == "__main__":
    cfg = OmegaConf.load("config/audio_adapter/E7.yaml")

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    pyramid_scale = [1, 0.5, 0.25, 0.125]
    vgg_IN = vgg_face.Vgg19().to(device)
    pyramid = vgg_face.ImagePyramide(cfg.loss_params.pyramid_scale, 3).to(device)
    vgg_IN.eval()
    downsampler = Interpolate(size=(224, 224), mode='bilinear', align_corners=False)

    image = torch.rand(8, 3, 256, 256).to(device)
    image_pred = torch.rand(8, 3, 256, 256).to(device)
    pyramide_real = pyramid(downsampler(image))
    pyramide_generated = pyramid(downsampler(image_pred))

    loss_IN = 0
    for scale in cfg.loss_params.pyramid_scale:
        x_vgg = vgg_IN(pyramide_generated['prediction_' + str(scale)])
        y_vgg = vgg_IN(pyramide_real['prediction_' + str(scale)])
        for i, weight in enumerate(cfg.loss_params.vgg_layer_weight):
            value = torch.abs(x_vgg[i] - y_vgg[i].detach()).mean()
            loss_IN += weight * value
    loss_IN /= sum(cfg.loss_params.vgg_layer_weight)  # average over the VGG layers; the pyramid loss sums over levels
    print(loss_IN)

    discriminator = MultiScaleDiscriminator(**cfg.model_params.discriminator_params).to(device)
    discriminator_full = DiscriminatorFullModel(discriminator)
    disc_scales = cfg.model_params.discriminator_params.scales
    # Prepare optimizer and loss function
    optimizer_D = optim.AdamW(discriminator.parameters(),
                              lr=cfg.discriminator_train_params.lr,
                              weight_decay=cfg.discriminator_train_params.weight_decay,
                              betas=cfg.discriminator_train_params.betas,
                              eps=cfg.discriminator_train_params.eps)
    scheduler_D = CosineAnnealingLR(optimizer_D,
                                    T_max=cfg.discriminator_train_params.epochs,
                                    eta_min=1e-6)

    discriminator.train()

    set_requires_grad(discriminator, False)

    loss_G = 0.
    discriminator_maps_generated = discriminator(pyramide_generated)
    discriminator_maps_real = discriminator(pyramide_real)

    for scale in disc_scales:
        key = 'prediction_map_%s' % scale
        value = ((1 - discriminator_maps_generated[key]) ** 2).mean()
        loss_G += value

    print(loss_G)
44
models/MuseTalk/musetalk/loss/conv.py
Normal file
@@ -0,0 +1,44 @@
import torch
from torch import nn
from torch.nn import functional as F


class Conv2d(nn.Module):
    def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.conv_block = nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size, stride, padding),
            nn.BatchNorm2d(cout)
        )
        self.act = nn.ReLU()
        self.residual = residual

    def forward(self, x):
        out = self.conv_block(x)
        if self.residual:
            out += x
        return self.act(out)


class nonorm_Conv2d(nn.Module):
    def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.conv_block = nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size, stride, padding),
        )
        self.act = nn.LeakyReLU(0.01, inplace=True)

    def forward(self, x):
        out = self.conv_block(x)
        return self.act(out)


class Conv2dTranspose(nn.Module):
    def __init__(self, cin, cout, kernel_size, stride, padding, output_padding=0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.conv_block = nn.Sequential(
            nn.ConvTranspose2d(cin, cout, kernel_size, stride, padding, output_padding),
            nn.BatchNorm2d(cout)
        )
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.conv_block(x)
        return self.act(out)
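The `Conv2d` wrapper in conv.py adds the input back before the activation when `residual=True`, which is only valid when the convolution preserves both the channel count and the spatial size (e.g. kernel 3, stride 1, padding 1 with `cin == cout`). A quick shape check using plain `torch.nn` layers (not the wrapper itself) to illustrate the constraint:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 32, 32)

# kernel 3 / stride 1 / padding 1 keeps H and W, so the residual add is valid
conv = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
out = conv(x) + x
assert out.shape == x.shape

# stride 2 halves the spatial dims (32 -> 16) and changes channels,
# so a residual add against x would raise a shape error here
down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
assert down(x).shape == (2, 128, 16, 16)
```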
145
models/MuseTalk/musetalk/loss/discriminator.py
Normal file
@@ -0,0 +1,145 @@
from torch import nn
import torch.nn.functional as F
import torch

from musetalk.loss.vgg_face import ImagePyramide


class DownBlock2d(nn.Module):
    """
    Simple block for processing video (encoder).
    """

    def __init__(self, in_features, out_features, norm=False, kernel_size=4, pool=False, sn=False):
        super(DownBlock2d, self).__init__()
        self.conv = nn.Conv2d(in_channels=in_features, out_channels=out_features, kernel_size=kernel_size)

        if sn:
            self.conv = nn.utils.spectral_norm(self.conv)

        if norm:
            self.norm = nn.InstanceNorm2d(out_features, affine=True)
        else:
            self.norm = None
        self.pool = pool

    def forward(self, x):
        out = x
        out = self.conv(out)
        if self.norm:
            out = self.norm(out)
        out = F.leaky_relu(out, 0.2)
        if self.pool:
            out = F.avg_pool2d(out, (2, 2))
        return out


class Discriminator(nn.Module):
    """
    Discriminator similar to Pix2Pix
    """

    def __init__(self, num_channels=3, block_expansion=64, num_blocks=4, max_features=512,
                 sn=False, **kwargs):
        super(Discriminator, self).__init__()

        down_blocks = []
        for i in range(num_blocks):
            down_blocks.append(
                DownBlock2d(num_channels if i == 0 else min(max_features, block_expansion * (2 ** i)),
                            min(max_features, block_expansion * (2 ** (i + 1))),
                            norm=(i != 0), kernel_size=4, pool=(i != num_blocks - 1), sn=sn))

        self.down_blocks = nn.ModuleList(down_blocks)
        self.conv = nn.Conv2d(self.down_blocks[-1].conv.out_channels, out_channels=1, kernel_size=1)
        if sn:
            self.conv = nn.utils.spectral_norm(self.conv)

    def forward(self, x):
        feature_maps = []
        out = x

        for down_block in self.down_blocks:
            feature_maps.append(down_block(out))
            out = feature_maps[-1]
        prediction_map = self.conv(out)

        return feature_maps, prediction_map


class MultiScaleDiscriminator(nn.Module):
    """
    Multi-scale discriminator
    """

    def __init__(self, scales=(), **kwargs):
        super(MultiScaleDiscriminator, self).__init__()
        self.scales = scales
        discs = {}
        for scale in scales:
            discs[str(scale).replace('.', '-')] = Discriminator(**kwargs)
        self.discs = nn.ModuleDict(discs)

    def forward(self, x):
        out_dict = {}
        for scale, disc in self.discs.items():
            scale = str(scale).replace('-', '.')
            key = 'prediction_' + scale
            feature_maps, prediction_map = disc(x[key])
            out_dict['feature_maps_' + scale] = feature_maps
            out_dict['prediction_map_' + scale] = prediction_map
        return out_dict


class DiscriminatorFullModel(torch.nn.Module):
    """
    Merge all discriminator related updates into single model for better multi-gpu usage
    """

    def __init__(self, discriminator):
        super(DiscriminatorFullModel, self).__init__()
        self.discriminator = discriminator
        self.scales = self.discriminator.scales
        print("scales", self.scales)
        self.pyramid = ImagePyramide(self.scales, 3)
        if torch.cuda.is_available():
            self.pyramid = self.pyramid.cuda()

        self.zero_tensor = None

    def get_zero_tensor(self, input):
        if self.zero_tensor is None:
            self.zero_tensor = torch.FloatTensor(1).fill_(0).cuda()
            self.zero_tensor.requires_grad_(False)
        return self.zero_tensor.expand_as(input)

    def forward(self, x, generated, gan_mode='ls'):
        pyramide_real = self.pyramid(x)
        pyramide_generated = self.pyramid(generated.detach())

        discriminator_maps_generated = self.discriminator(pyramide_generated)
        discriminator_maps_real = self.discriminator(pyramide_real)

        value_total = 0
        for scale in self.scales:
            key = 'prediction_map_%s' % scale
            if gan_mode == 'hinge':
                value = -torch.mean(torch.min(discriminator_maps_real[key] - 1,
                                              self.get_zero_tensor(discriminator_maps_real[key]))) \
                        - torch.mean(torch.min(-discriminator_maps_generated[key] - 1,
                                               self.get_zero_tensor(discriminator_maps_generated[key])))
            elif gan_mode == 'ls':
                value = ((1 - discriminator_maps_real[key]) ** 2 + discriminator_maps_generated[key] ** 2).mean()
            else:
                raise ValueError('Unexpected gan_mode {}'.format(gan_mode))

            value_total += value

        return value_total


def main():
    discriminator = MultiScaleDiscriminator(scales=[1],
                                            block_expansion=32,
                                            max_features=512,
                                            num_blocks=4,
                                            sn=True,
                                            image_channel=3,
                                            estimate_jacobian=False)
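The `'ls'` branch of `DiscriminatorFullModel.forward` is the least-squares GAN objective: real prediction maps are pushed toward 1 and generated maps toward 0. The per-scale term worked through on dummy tensors (values chosen for illustration only):

```python
import torch

d_real = torch.full((2, 1, 4, 4), 0.9)   # discriminator output on real frames
d_fake = torch.full((2, 1, 4, 4), 0.1)   # discriminator output on generated frames

# LS-GAN discriminator loss: (1 - D(real))^2 + D(fake)^2, averaged over the map
value = ((1 - d_real) ** 2 + d_fake ** 2).mean()
assert abs(value.item() - 0.02) < 1e-6   # 0.1^2 + 0.1^2 = 0.02
```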
152
models/MuseTalk/musetalk/loss/resnet.py
Normal file
@@ -0,0 +1,152 @@
import math

import torch.nn as nn

__all__ = ['ResNet', 'resnet50']


def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, include_top=True):
        self.inplanes = 64
        super(ResNet, self).__init__()
        self.include_top = include_top

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=0, ceil_mode=True)

        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AvgPool2d(7, stride=1)
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = x * 255.
        x = x.flip(1)
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)

        if not self.include_top:
            return x

        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


def resnet50(**kwargs):
    """Constructs a ResNet-50 model."""
    model = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
    return model
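Because `avgpool` is a fixed `AvgPool2d(7)`, this ResNet assumes 224×224 inputs: conv1 (stride 2) and the ceil-mode maxpool take 224 down to 56, and the three stride-2 stages halve it to 7. The spatial arithmetic can be checked without torch (the helper below is illustrative, not part of the file):

```python
import math

def conv_out(size, kernel, stride, padding, ceil=False):
    """Spatial output size of a conv/pool layer."""
    val = (size + 2 * padding - kernel) / stride + 1
    return math.ceil(val) if ceil else math.floor(val)

size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)             # conv1   -> 112
size = conv_out(size, kernel=3, stride=2, padding=0, ceil=True)  # maxpool -> 56
for _ in range(3):                                               # layer2..layer4, stride 2 each
    size = conv_out(size, kernel=1, stride=2, padding=0)         # -> 28, 14, 7
size = conv_out(size, kernel=7, stride=1, padding=0)             # avgpool -> 1
assert size == 1
```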
95
models/MuseTalk/musetalk/loss/syncnet.py
Normal file
@@ -0,0 +1,95 @@
import torch
from torch import nn
from torch.nn import functional as F

from .conv import Conv2d

logloss = nn.BCELoss(reduction="none")


def cosine_loss(a, v, y):
    d = nn.functional.cosine_similarity(a, v)
    # cosine_similarity lies in [-1, 1]; BCE raises a CUDA device-side assert
    # on negative inputs, so clamp to [0, 1]
    d = d.clamp(0, 1)
    loss = logloss(d.unsqueeze(1), y).squeeze()
    loss = loss.mean()
    return loss, d


def get_sync_loss(
    audio_embed,
    gt_frames,
    pred_frames,
    syncnet,
    adapted_weight,
    frames_left_index=0,
    frames_right_index=16,
):
    # Splice the predicted frames back into the ground-truth sequence to save GPU memory
    assert pred_frames.shape[1] == (frames_right_index - frames_left_index) * 3
    # 3-channel images
    frames_sync_loss = torch.cat(
        [gt_frames[:, :3 * frames_left_index, ...], pred_frames, gt_frames[:, 3 * frames_right_index:, ...]],
        axis=1
    )
    vision_embed = syncnet.get_image_embed(frames_sync_loss)
    y = torch.ones(frames_sync_loss.size(0), 1).float().to(audio_embed.device)
    loss, score = cosine_loss(audio_embed, vision_embed, y)
    return loss, score


class SyncNet_color(nn.Module):
    def __init__(self):
        super(SyncNet_color, self).__init__()

        self.face_encoder = nn.Sequential(
            Conv2d(15, 32, kernel_size=(7, 7), stride=1, padding=3),

            Conv2d(32, 64, kernel_size=5, stride=(1, 2), padding=1),
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(512, 512, kernel_size=3, stride=2, padding=1),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=0),
            Conv2d(512, 512, kernel_size=1, stride=1, padding=0),)

        self.audio_encoder = nn.Sequential(
            Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(32, 64, kernel_size=3, stride=(3, 1), padding=1),
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(64, 128, kernel_size=3, stride=3, padding=1),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(128, 256, kernel_size=3, stride=(3, 2), padding=1),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),

            Conv2d(256, 512, kernel_size=3, stride=1, padding=0),
            Conv2d(512, 512, kernel_size=1, stride=1, padding=0),)

    def forward(self, audio_sequences, face_sequences):  # audio_sequences := (B, dim, T)
        face_embedding = self.face_encoder(face_sequences)
        audio_embedding = self.audio_encoder(audio_sequences)

        audio_embedding = audio_embedding.view(audio_embedding.size(0), -1)
        face_embedding = face_embedding.view(face_embedding.size(0), -1)

        audio_embedding = F.normalize(audio_embedding, p=2, dim=1)
        face_embedding = F.normalize(face_embedding, p=2, dim=1)

        return audio_embedding, face_embedding
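`cosine_loss` clamps the similarity into [0, 1] before BCE because BCE expects probabilities: identical embeddings give similarity 1 and near-zero loss, while opposite embeddings clamp to 0 and take the maximal penalty against a target of 1. A self-contained check (the two helper lines duplicate the file's own definition for runnability):

```python
import torch
import torch.nn as nn

logloss = nn.BCELoss(reduction="none")

def cosine_loss(a, v, y):
    d = nn.functional.cosine_similarity(a, v)
    d = d.clamp(0, 1)  # BCE rejects inputs outside [0, 1]
    loss = logloss(d.unsqueeze(1), y).squeeze()
    return loss.mean(), d

a = torch.randn(4, 512)
y = torch.ones(4, 1)  # target: "in sync"

# identical embeddings -> similarity 1 -> BCE(1, 1) = 0
loss, d = cosine_loss(a, a.clone(), y)
assert torch.allclose(d, torch.ones(4), atol=1e-5)
assert loss.item() < 1e-4

# opposite embeddings clamp to 0 -> large BCE against target 1
loss_neg, d_neg = cosine_loss(a, -a, y)
assert loss_neg.item() > loss.item()
```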
237
models/MuseTalk/musetalk/loss/vgg_face.py
Normal file
@@ -0,0 +1,237 @@
'''
This part of code contains a pretrained vgg_face model.
ref link: https://github.com/prlz77/vgg-face.pytorch
'''
import pickle

import numpy as np
import torch
import torch.nn.functional as F
import torch.utils.model_zoo
from torch import nn
from torchvision import models

from musetalk.loss import resnet as ResNet


MODEL_URL = "https://github.com/claudio-unipv/vggface-pytorch/releases/download/v0.1/vggface-9d491dd7c30312.pth"
VGG_FACE_PATH = '/apdcephfs_cq8/share_1367250/zhentaoyu/Driving/00_VASA/00_data/models/pretrain_models/resnet50_ft_weight.pkl'

# It was 93.5940, 104.7624, 129.1863 before dividing by 255
MEAN_RGB = [
    0.367035294117647,
    0.41083294117647057,
    0.5066129411764705
]


def load_state_dict(model, fname):
    """
    Set parameters converted from Caffe models authors of VGGFace2 provide.
    See https://www.robots.ox.ac.uk/~vgg/data/vgg_face2/.

    Arguments:
        model: model
        fname: file name of parameters converted from a Caffe model, assuming the file format is Pickle.
    """
    with open(fname, 'rb') as f:
        weights = pickle.load(f, encoding='latin1')

    own_state = model.state_dict()
    for name, param in weights.items():
        if name in own_state:
            try:
                own_state[name].copy_(torch.from_numpy(param))
            except Exception:
                raise RuntimeError('While copying the parameter named {}, whose dimensions in the model are {} and whose '
                                   'dimensions in the checkpoint are {}.'.format(name, own_state[name].size(), param.size()))
        else:
            raise KeyError('unexpected key "{}" in state_dict'.format(name))


def vggface2(pretrained=True):
    vggface = ResNet.resnet50(num_classes=8631, include_top=True)
    load_state_dict(vggface, VGG_FACE_PATH)
    return vggface


def vggface(pretrained=False, **kwargs):
    """VGGFace model.

    Args:
        pretrained (bool): If True, returns pre-trained model
    """
    model = VggFace(**kwargs)
    if pretrained:
        state = torch.utils.model_zoo.load_url(MODEL_URL)
        model.load_state_dict(state)
    return model


class VggFace(torch.nn.Module):
    def __init__(self, classes=2622):
        """VGGFace model.

        Face recognition network. It takes as input a Bx3x224x224
        batch of face images and gives as output a BxC score vector
        (C is the number of identities).
        Input images need to be scaled in the 0-1 range and then
        normalized with respect to the mean RGB used during training.

        Args:
            classes (int): number of identities recognized by the network
        """
        super().__init__()
        self.conv1 = _ConvBlock(3, 64, 64)
        self.conv2 = _ConvBlock(64, 128, 128)
        self.conv3 = _ConvBlock(128, 256, 256, 256)
        self.conv4 = _ConvBlock(256, 512, 512, 512)
        self.conv5 = _ConvBlock(512, 512, 512, 512)
        self.dropout = torch.nn.Dropout(0.5)
        self.fc1 = torch.nn.Linear(7 * 7 * 512, 4096)
        self.fc2 = torch.nn.Linear(4096, 4096)
        self.fc3 = torch.nn.Linear(4096, classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = x.view(x.size(0), -1)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.fc3(x)
        return x


class _ConvBlock(torch.nn.Module):
    """A Convolutional block."""

    def __init__(self, *units):
        """Create a block with len(units) - 1 convolutions.

        Convolution number i transforms the number of channels from
        units[i - 1] to units[i] channels.
        """
        super().__init__()
        self.convs = torch.nn.ModuleList([
            torch.nn.Conv2d(in_, out, 3, 1, 1)
            for in_, out in zip(units[:-1], units[1:])
        ])

    def forward(self, x):
        # Each convolution is followed by a ReLU, then the block is
        # concluded by a max pooling.
        for c in self.convs:
            x = F.relu(c(x))
        return F.max_pool2d(x, 2, 2, 0, ceil_mode=True)


class Vgg19(torch.nn.Module):
    """
    Vgg19 network for perceptual loss.
    """
    def __init__(self, requires_grad=False):
        super(Vgg19, self).__init__()
        vgg_pretrained_features = models.vgg19(pretrained=True).features
        self.slice1 = torch.nn.Sequential()
        self.slice2 = torch.nn.Sequential()
        self.slice3 = torch.nn.Sequential()
        self.slice4 = torch.nn.Sequential()
        self.slice5 = torch.nn.Sequential()
        for x in range(2):
            self.slice1.add_module(str(x), vgg_pretrained_features[x])
        for x in range(2, 7):
            self.slice2.add_module(str(x), vgg_pretrained_features[x])
        for x in range(7, 12):
            self.slice3.add_module(str(x), vgg_pretrained_features[x])
        for x in range(12, 21):
            self.slice4.add_module(str(x), vgg_pretrained_features[x])
        for x in range(21, 30):
            self.slice5.add_module(str(x), vgg_pretrained_features[x])

        self.mean = torch.nn.Parameter(data=torch.Tensor(np.array([0.485, 0.456, 0.406]).reshape((1, 3, 1, 1))),
                                       requires_grad=False)
        self.std = torch.nn.Parameter(data=torch.Tensor(np.array([0.229, 0.224, 0.225]).reshape((1, 3, 1, 1))),
                                      requires_grad=False)

        if not requires_grad:
            for param in self.parameters():
                param.requires_grad = False

    def forward(self, X):
        X = (X - self.mean) / self.std
        h_relu1 = self.slice1(X)
        h_relu2 = self.slice2(h_relu1)
        h_relu3 = self.slice3(h_relu2)
        h_relu4 = self.slice4(h_relu3)
        h_relu5 = self.slice5(h_relu4)
        out = [h_relu1, h_relu2, h_relu3, h_relu4, h_relu5]
        return out


class AntiAliasInterpolation2d(nn.Module):
    """
    Band-limited downsampling, for better preservation of the input signal.
    """
    def __init__(self, channels, scale):
        super(AntiAliasInterpolation2d, self).__init__()
        sigma = (1 / scale - 1) / 2
        kernel_size = 2 * round(sigma * 4) + 1
        self.ka = kernel_size // 2
        self.kb = self.ka - 1 if kernel_size % 2 == 0 else self.ka

        kernel_size = [kernel_size, kernel_size]
        sigma = [sigma, sigma]
        # The gaussian kernel is the product of the
        # gaussian function of each dimension.
        kernel = 1
        meshgrids = torch.meshgrid(
            [
                torch.arange(size, dtype=torch.float32)
                for size in kernel_size
            ]
        )
        for size, std, mgrid in zip(kernel_size, sigma, meshgrids):
            mean = (size - 1) / 2
            kernel *= torch.exp(-(mgrid - mean) ** 2 / (2 * std ** 2))

        # Make sure sum of values in gaussian kernel equals 1.
        kernel = kernel / torch.sum(kernel)
        # Reshape to depthwise convolutional weight
        kernel = kernel.view(1, 1, *kernel.size())
        kernel = kernel.repeat(channels, *[1] * (kernel.dim() - 1))

        self.register_buffer('weight', kernel)
        self.groups = channels
        self.scale = scale
        inv_scale = 1 / scale
        self.int_inv_scale = int(inv_scale)

    def forward(self, input):
        if self.scale == 1.0:
            return input

        out = F.pad(input, (self.ka, self.kb, self.ka, self.kb))
        out = F.conv2d(out, weight=self.weight, groups=self.groups)
        out = out[:, :, ::self.int_inv_scale, ::self.int_inv_scale]

        return out


class ImagePyramide(torch.nn.Module):
    """
    Create image pyramide for computing pyramide perceptual loss.
    """
    def __init__(self, scales, num_channels):
        super(ImagePyramide, self).__init__()
        downs = {}
        for scale in scales:
            downs[str(scale).replace('.', '-')] = AntiAliasInterpolation2d(num_channels, scale)
        self.downs = nn.ModuleDict(downs)

    def forward(self, x):
        out_dict = {}
        for scale, down_module in self.downs.items():
            out_dict['prediction_' + str(scale).replace('-', '.')] = down_module(x)
        return out_dict
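`AntiAliasInterpolation2d` derives its Gaussian low-pass filter from the scale alone: `sigma = (1/scale - 1) / 2` and `kernel_size = 2 * round(4 * sigma) + 1`, so half resolution uses sigma 0.5 with a 5×5 kernel and scale 1 degenerates to the identity. The arithmetic, isolated for checking (the helper name is illustrative):

```python
def gaussian_params(scale):
    """Sigma and kernel size used by AntiAliasInterpolation2d for a given scale."""
    sigma = (1 / scale - 1) / 2
    kernel_size = 2 * round(sigma * 4) + 1
    return sigma, kernel_size

assert gaussian_params(0.5) == (0.5, 5)     # half resolution: sigma 0.5, 5x5 kernel
assert gaussian_params(0.25) == (1.5, 13)   # quarter resolution: sigma 1.5, 13x13 kernel
assert gaussian_params(1.0) == (0.0, 1)     # scale 1: 1x1 kernel, forward() returns input unchanged
```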
240
models/MuseTalk/musetalk/models/syncnet.py
Normal file
@@ -0,0 +1,240 @@
"""
This file is modified from LatentSync (https://github.com/bytedance/LatentSync/blob/main/latentsync/models/stable_syncnet.py).
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange

from diffusers.models.attention import Attention as CrossAttention, FeedForward
from diffusers.utils.import_utils import is_xformers_available


class SyncNet(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.audio_encoder = DownEncoder2D(
            in_channels=config["audio_encoder"]["in_channels"],
            block_out_channels=config["audio_encoder"]["block_out_channels"],
            downsample_factors=config["audio_encoder"]["downsample_factors"],
            dropout=config["audio_encoder"]["dropout"],
            attn_blocks=config["audio_encoder"]["attn_blocks"],
        )

        self.visual_encoder = DownEncoder2D(
            in_channels=config["visual_encoder"]["in_channels"],
            block_out_channels=config["visual_encoder"]["block_out_channels"],
            downsample_factors=config["visual_encoder"]["downsample_factors"],
            dropout=config["visual_encoder"]["dropout"],
            attn_blocks=config["visual_encoder"]["attn_blocks"],
        )

        self.eval()

    def forward(self, image_sequences, audio_sequences):
        vision_embeds = self.visual_encoder(image_sequences)  # (b, c, 1, 1)
        audio_embeds = self.audio_encoder(audio_sequences)  # (b, c, 1, 1)

        vision_embeds = vision_embeds.reshape(vision_embeds.shape[0], -1)  # (b, c)
        audio_embeds = audio_embeds.reshape(audio_embeds.shape[0], -1)  # (b, c)

        # Make them unit vectors
        vision_embeds = F.normalize(vision_embeds, p=2, dim=1)
        audio_embeds = F.normalize(audio_embeds, p=2, dim=1)

        return vision_embeds, audio_embeds

    def get_image_embed(self, image_sequences):
        vision_embeds = self.visual_encoder(image_sequences)  # (b, c, 1, 1)
        vision_embeds = vision_embeds.reshape(vision_embeds.shape[0], -1)  # (b, c)
        # Make them unit vectors
        vision_embeds = F.normalize(vision_embeds, p=2, dim=1)
        return vision_embeds

    def get_audio_embed(self, audio_sequences):
        audio_embeds = self.audio_encoder(audio_sequences)  # (b, c, 1, 1)
        audio_embeds = audio_embeds.reshape(audio_embeds.shape[0], -1)  # (b, c)
        audio_embeds = F.normalize(audio_embeds, p=2, dim=1)
        return audio_embeds


class ResnetBlock2D(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        dropout: float = 0.0,
        norm_num_groups: int = 32,
        eps: float = 1e-6,
        act_fn: str = "silu",
        downsample_factor=2,
    ):
        super().__init__()

        self.norm1 = nn.GroupNorm(num_groups=norm_num_groups, num_channels=in_channels, eps=eps, affine=True)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)

        self.norm2 = nn.GroupNorm(num_groups=norm_num_groups, num_channels=out_channels, eps=eps, affine=True)
        self.dropout = nn.Dropout(dropout)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)

        if act_fn == "relu":
            self.act_fn = nn.ReLU()
        elif act_fn == "silu":
            self.act_fn = nn.SiLU()

        if in_channels != out_channels:
            self.conv_shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
        else:
            self.conv_shortcut = None

        if isinstance(downsample_factor, list):
            downsample_factor = tuple(downsample_factor)

        if downsample_factor == 1:
            self.downsample_conv = None
        else:
            self.downsample_conv = nn.Conv2d(
                out_channels, out_channels, kernel_size=3, stride=downsample_factor, padding=0
            )
            self.pad = (0, 1, 0, 1)
            if isinstance(downsample_factor, tuple):
                if downsample_factor[0] == 1:
                    self.pad = (0, 1, 1, 1)  # The padding order is from back to front
                elif downsample_factor[1] == 1:
                    self.pad = (1, 1, 0, 1)

    def forward(self, input_tensor):
        hidden_states = input_tensor

        hidden_states = self.norm1(hidden_states)
        hidden_states = self.act_fn(hidden_states)

        hidden_states = self.conv1(hidden_states)
        hidden_states = self.norm2(hidden_states)
        hidden_states = self.act_fn(hidden_states)

        hidden_states = self.dropout(hidden_states)
        hidden_states = self.conv2(hidden_states)
||||
|
||||
if self.conv_shortcut is not None:
|
||||
input_tensor = self.conv_shortcut(input_tensor)
|
||||
|
||||
hidden_states += input_tensor
|
||||
|
||||
if self.downsample_conv is not None:
|
||||
hidden_states = F.pad(hidden_states, self.pad, mode="constant", value=0)
|
||||
hidden_states = self.downsample_conv(hidden_states)
|
||||
|
||||
return hidden_states
|
||||
|
||||
|
||||
class AttentionBlock2D(nn.Module):
|
||||
def __init__(self, query_dim, norm_num_groups=32, dropout=0.0):
|
||||
super().__init__()
|
||||
if not is_xformers_available():
|
||||
raise ModuleNotFoundError(
|
||||
"You have to install xformers to enable memory efficient attetion", name="xformers"
|
||||
)
|
||||
# inner_dim = dim_head * heads
|
||||
self.norm1 = torch.nn.GroupNorm(num_groups=norm_num_groups, num_channels=query_dim, eps=1e-6, affine=True)
|
||||
self.norm2 = nn.LayerNorm(query_dim)
|
||||
self.norm3 = nn.LayerNorm(query_dim)
|
||||
|
||||
self.ff = FeedForward(query_dim, dropout=dropout, activation_fn="geglu")
|
||||
|
||||
self.conv_in = nn.Conv2d(query_dim, query_dim, kernel_size=1, stride=1, padding=0)
|
||||
self.conv_out = nn.Conv2d(query_dim, query_dim, kernel_size=1, stride=1, padding=0)
|
||||
|
||||
self.attn = CrossAttention(query_dim=query_dim, heads=8, dim_head=query_dim // 8, dropout=dropout, bias=True)
|
||||
self.attn._use_memory_efficient_attention_xformers = True
|
||||
|
||||
def forward(self, hidden_states):
|
||||
assert hidden_states.dim() == 4, f"Expected hidden_states to have ndim=4, but got ndim={hidden_states.dim()}."
|
||||
|
||||
batch, channel, height, width = hidden_states.shape
|
||||
residual = hidden_states
|
||||
|
||||
hidden_states = self.norm1(hidden_states)
|
||||
hidden_states = self.conv_in(hidden_states)
|
||||
hidden_states = rearrange(hidden_states, "b c h w -> b (h w) c")
|
||||
|
||||
norm_hidden_states = self.norm2(hidden_states)
|
||||
hidden_states = self.attn(norm_hidden_states, attention_mask=None) + hidden_states
|
||||
hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states
|
||||
|
||||
hidden_states = rearrange(hidden_states, "b (h w) c -> b c h w", h=height, w=width)
|
||||
hidden_states = self.conv_out(hidden_states)
|
||||
|
||||
hidden_states = hidden_states + residual
|
||||
return hidden_states
|
||||
|
||||
|
||||
class DownEncoder2D(nn.Module):
|
||||
def __init__(
|
||||
self,
|
||||
in_channels=4 * 16,
|
||||
block_out_channels=[64, 128, 256, 256],
|
||||
downsample_factors=[2, 2, 2, 2],
|
||||
layers_per_block=2,
|
||||
norm_num_groups=32,
|
||||
attn_blocks=[1, 1, 1, 1],
|
||||
dropout: float = 0.0,
|
||||
act_fn="silu",
|
||||
):
|
||||
super().__init__()
|
||||
self.layers_per_block = layers_per_block
|
||||
|
||||
# in
|
||||
self.conv_in = nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, stride=1, padding=1)
|
||||
|
||||
# down
|
||||
self.down_blocks = nn.ModuleList([])
|
||||
|
||||
output_channels = block_out_channels[0]
|
||||
for i, block_out_channel in enumerate(block_out_channels):
|
||||
input_channels = output_channels
|
||||
output_channels = block_out_channel
|
||||
# is_final_block = i == len(block_out_channels) - 1
|
||||
|
||||
down_block = ResnetBlock2D(
|
||||
in_channels=input_channels,
|
||||
out_channels=output_channels,
|
||||
downsample_factor=downsample_factors[i],
|
||||
norm_num_groups=norm_num_groups,
|
||||
dropout=dropout,
|
||||
act_fn=act_fn,
|
||||
)
|
||||
|
||||
self.down_blocks.append(down_block)
|
||||
|
||||
if attn_blocks[i] == 1:
|
||||
attention_block = AttentionBlock2D(query_dim=output_channels, dropout=dropout)
|
||||
self.down_blocks.append(attention_block)
|
||||
|
||||
# out
|
||||
self.norm_out = nn.GroupNorm(num_channels=block_out_channels[-1], num_groups=norm_num_groups, eps=1e-6)
|
||||
self.act_fn_out = nn.ReLU()
|
||||
|
||||
def forward(self, hidden_states):
|
||||
hidden_states = self.conv_in(hidden_states)
|
||||
|
||||
# down
|
||||
for down_block in self.down_blocks:
|
||||
hidden_states = down_block(hidden_states)
|
||||
|
||||
# post-process
|
||||
hidden_states = self.norm_out(hidden_states)
|
||||
hidden_states = self.act_fn_out(hidden_states)
|
||||
|
||||
return hidden_states
|
||||
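Because `forward` L2-normalizes both embeddings, any downstream audio-visual sync score reduces to a plain dot product of unit vectors. A pure-Python sanity sketch of that property (the helper names here are illustrative, not part of the file):

```python
import math

def l2_normalize(v):
    # Same effect as F.normalize(v, p=2, dim=1) for a single vector.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine_from_unit(a, b):
    # For unit vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(round(cosine_from_unit(a, b), 4))  # → 0.96
```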
models/MuseTalk/musetalk/models/unet.py (new file, 51 lines)
@@ -0,0 +1,51 @@
import torch
import torch.nn as nn
import math
import json

from diffusers import UNet2DConditionModel
import sys
import time
import numpy as np
import os

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=384, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        b, seq_len, d_model = x.size()
        pe = self.pe[:, :seq_len, :]
        x = x + pe.to(x.device)
        return x

class UNet():
    def __init__(self,
                 unet_config,
                 model_path,
                 use_float16=False,
                 device=None
                 ):
        with open(unet_config, 'r') as f:
            unet_config = json.load(f)
        self.model = UNet2DConditionModel(**unet_config)
        self.pe = PositionalEncoding(d_model=384)
        if device is not None:
            self.device = device
        else:
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        weights = torch.load(model_path) if torch.cuda.is_available() else torch.load(model_path, map_location=self.device)
        self.model.load_state_dict(weights)
        if use_float16:
            self.model = self.model.half()
        self.model.to(self.device)

if __name__ == "__main__":
    unet = UNet()  # NOTE: requires unet_config and model_path arguments in practice
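The sinusoidal table that `PositionalEncoding` registers can be cross-checked without torch; a minimal pure-Python sketch of the same formula (`sinusoidal_pe` is an illustrative name, not part of the module):

```python
import math

def sinusoidal_pe(max_len: int, d_model: int):
    # Pure-Python version of the buffer PositionalEncoding registers as `pe`:
    # even columns get sin, odd columns get cos, at the same angle.
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos * math.exp(-i * math.log(10000.0) / d_model)
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

table = sinusoidal_pe(16, 8)
print(table[0][0], table[0][1])  # → 0.0 1.0  (sin(0), cos(0) at position 0)
```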
models/MuseTalk/musetalk/models/vae.py (new file, 148 lines)
@@ -0,0 +1,148 @@
from diffusers import AutoencoderKL
import torch
import torchvision.transforms as transforms
import torch.nn.functional as F
import cv2
import numpy as np
from PIL import Image
import os

class VAE():
    """
    VAE (Variational Autoencoder) class for image processing.
    """

    def __init__(self, model_path="./models/sd-vae-ft-mse/", resized_img=256, use_float16=False):
        """
        Initialize the VAE instance.

        :param model_path: Path to the trained model.
        :param resized_img: The size to which images are resized.
        :param use_float16: Whether to use float16 precision.
        """
        self.model_path = model_path
        self.vae = AutoencoderKL.from_pretrained(self.model_path)

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.vae.to(self.device)

        if use_float16:
            self.vae = self.vae.half()
            self._use_float16 = True
        else:
            self._use_float16 = False

        self.scaling_factor = self.vae.config.scaling_factor
        self.transform = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
        self._resized_img = resized_img
        self._mask_tensor = self.get_mask_tensor()

    def get_mask_tensor(self):
        """
        Creates a mask tensor for image processing.
        :return: A mask tensor.
        """
        mask_tensor = torch.zeros((self._resized_img, self._resized_img))
        mask_tensor[:self._resized_img // 2, :] = 1
        mask_tensor[mask_tensor < 0.5] = 0
        mask_tensor[mask_tensor >= 0.5] = 1
        return mask_tensor

    def preprocess_img(self, img_name, half_mask=False):
        """
        Preprocess an image for the VAE.

        :param img_name: The image file path or a list of image file paths.
        :param half_mask: Whether to apply a half mask to the image.
        :return: A preprocessed image tensor.
        """
        window = []
        if isinstance(img_name, str):
            window_fnames = [img_name]
            for fname in window_fnames:
                img = cv2.imread(fname)
                img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
                img = cv2.resize(img, (self._resized_img, self._resized_img),
                                 interpolation=cv2.INTER_LANCZOS4)
                window.append(img)
        else:
            img = cv2.cvtColor(img_name, cv2.COLOR_BGR2RGB)
            window.append(img)

        x = np.asarray(window) / 255.
        x = np.transpose(x, (3, 0, 1, 2))
        x = torch.squeeze(torch.FloatTensor(x))
        if half_mask:
            x = x * (self._mask_tensor > 0.5)
        x = self.transform(x)

        x = x.unsqueeze(0)  # [1, 3, 256, 256] torch tensor
        x = x.to(self.vae.device)

        return x

    def encode_latents(self, image):
        """
        Encode an image into latent variables.

        :param image: The image tensor to encode.
        :return: The encoded latent variables.
        """
        with torch.no_grad():
            init_latent_dist = self.vae.encode(image.to(self.vae.dtype)).latent_dist
            init_latents = self.scaling_factor * init_latent_dist.sample()
        return init_latents

    def decode_latents(self, latents):
        """
        Decode latent variables back into an image.
        :param latents: The latent variables to decode.
        :return: A NumPy array representing the decoded image.
        """
        latents = (1 / self.scaling_factor) * latents
        image = self.vae.decode(latents.to(self.vae.dtype)).sample
        image = (image / 2 + 0.5).clamp(0, 1)
        image = image.detach().cpu().permute(0, 2, 3, 1).float().numpy()
        image = (image * 255).round().astype("uint8")
        image = image[..., ::-1]  # RGB to BGR
        return image

    def get_latents_for_unet(self, img):
        """
        Prepare latent variables for a U-Net model.
        :param img: The image to process.
        :return: A concatenated tensor of latents for U-Net input.
        """
        ref_image = self.preprocess_img(img, half_mask=True)  # [1, 3, 256, 256] RGB, torch tensor
        masked_latents = self.encode_latents(ref_image)  # [1, 4, 32, 32], torch tensor
        ref_image = self.preprocess_img(img, half_mask=False)  # [1, 3, 256, 256] RGB, torch tensor
        ref_latents = self.encode_latents(ref_image)  # [1, 4, 32, 32], torch tensor
        latent_model_input = torch.cat([masked_latents, ref_latents], dim=1)
        return latent_model_input

if __name__ == "__main__":
    vae_mode_path = "./models/sd-vae-ft-mse/"
    vae = VAE(model_path=vae_mode_path, use_float16=False)
    img_path = "./results/sun001_crop/00000.png"

    crop_imgs_path = "./results/sun001_crop/"
    latents_out_path = "./results/latents/"
    if not os.path.exists(latents_out_path):
        os.mkdir(latents_out_path)

    files = os.listdir(crop_imgs_path)
    files.sort()
    files = [file for file in files if file.split(".")[-1] == "png"]

    for file in files:
        index = file.split(".")[0]
        img_path = crop_imgs_path + file
        latents = vae.get_latents_for_unet(img_path)
        print(img_path, "latents", latents.size())
        # torch.save(latents, os.path.join(latents_out_path, index + ".pt"))
        # reload_tensor = torch.load('tensor.pt')
        # print(reload_tensor.size())
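`get_mask_tensor` keeps the top half of the face crop and zeroes the lower (mouth) half before encoding, so the masked latents carry no lip information. The pattern is easy to verify in a list-based sketch (illustrative, not part of the file):

```python
def half_mask(size: int):
    # Same shape as get_mask_tensor above: 1.0 in the top half rows,
    # 0.0 in the bottom half, for a size x size image.
    return [[1.0 if row < size // 2 else 0.0 for _ in range(size)]
            for row in range(size)]

m = half_mask(4)
print(m[0], m[3])  # → [1.0, 1.0, 1.0, 1.0] [0.0, 0.0, 0.0, 0.0]
```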
models/MuseTalk/musetalk/utils/__init__.py (new file, 5 lines)
@@ -0,0 +1,5 @@
import sys
from os.path import abspath, dirname
current_dir = dirname(abspath(__file__))
parent_dir = dirname(current_dir)
sys.path.append(parent_dir + '/utils')
models/MuseTalk/musetalk/utils/audio_processor.py (new file, 113 lines)
@@ -0,0 +1,113 @@
import math
import os

import librosa
import numpy as np
import torch
from einops import rearrange
from transformers import AutoFeatureExtractor


class AudioProcessor:
    def __init__(self, feature_extractor_path="openai/whisper-tiny/"):
        self.feature_extractor = AutoFeatureExtractor.from_pretrained(feature_extractor_path)

    def get_audio_feature(self, wav_path, start_index=0, weight_dtype=None):
        if not os.path.exists(wav_path):
            return None
        librosa_output, sampling_rate = librosa.load(wav_path, sr=16000)
        assert sampling_rate == 16000
        # Split audio into 30s segments
        segment_length = 30 * sampling_rate
        segments = [librosa_output[i:i + segment_length] for i in range(0, len(librosa_output), segment_length)]

        features = []
        for segment in segments:
            audio_feature = self.feature_extractor(
                segment,
                return_tensors="pt",
                sampling_rate=sampling_rate
            ).input_features
            if weight_dtype is not None:
                audio_feature = audio_feature.to(dtype=weight_dtype)
            features.append(audio_feature)

        return features, len(librosa_output)

    def get_whisper_chunk(
        self,
        whisper_input_features,
        device,
        weight_dtype,
        whisper,
        librosa_length,
        fps=25,
        audio_padding_length_left=2,
        audio_padding_length_right=2,
    ):
        audio_feature_length_per_frame = 2 * (audio_padding_length_left + audio_padding_length_right + 1)
        whisper_feature = []
        # Process multiple 30s mel input features
        for input_feature in whisper_input_features:
            input_feature = input_feature.to(device).to(weight_dtype)
            audio_feats = whisper.encoder(input_feature, output_hidden_states=True).hidden_states
            audio_feats = torch.stack(audio_feats, dim=2)
            whisper_feature.append(audio_feats)

        whisper_feature = torch.cat(whisper_feature, dim=1)
        # Trim the last segment to remove padding
        sr = 16000
        audio_fps = 50
        fps = int(fps)
        whisper_idx_multiplier = audio_fps / fps
        num_frames = math.floor((librosa_length / sr) * fps)
        actual_length = math.floor((librosa_length / sr) * audio_fps)
        whisper_feature = whisper_feature[:, :actual_length, ...]

        # Calculate padding amount
        padding_nums = math.ceil(whisper_idx_multiplier)
        # Add padding at start and end
        whisper_feature = torch.cat([
            torch.zeros_like(whisper_feature[:, :padding_nums * audio_padding_length_left]),
            whisper_feature,
            # Add extra padding to prevent out of bounds
            torch.zeros_like(whisper_feature[:, :padding_nums * 3 * audio_padding_length_right])
        ], 1)

        audio_prompts = []
        for frame_index in range(num_frames):
            audio_index = math.floor(frame_index * whisper_idx_multiplier)
            end_index = audio_index + audio_feature_length_per_frame

            # Handle case where audio is shorter than video
            if end_index > whisper_feature.shape[1]:
                available = whisper_feature[:, audio_index:]
                padding_size = end_index - whisper_feature.shape[1]
                if padding_size > 0:
                    padding = torch.zeros((whisper_feature.shape[0], padding_size, *whisper_feature.shape[2:]),
                                          device=whisper_feature.device, dtype=whisper_feature.dtype)
                    audio_clip = torch.cat([available, padding], dim=1)
                else:
                    audio_clip = available
            else:
                audio_clip = whisper_feature[:, audio_index: end_index]

            # Final size check and padding
            if audio_clip.shape[1] < audio_feature_length_per_frame:
                padding_size = audio_feature_length_per_frame - audio_clip.shape[1]
                padding = torch.zeros((whisper_feature.shape[0], padding_size, *whisper_feature.shape[2:]),
                                      device=whisper_feature.device, dtype=whisper_feature.dtype)
                audio_clip = torch.cat([audio_clip, padding], dim=1)

            audio_prompts.append(audio_clip)

        audio_prompts = torch.cat(audio_prompts, dim=0)  # T, 10, 5, 384
        audio_prompts = rearrange(audio_prompts, 'b c h w -> b (c h) w')
        return audio_prompts

if __name__ == "__main__":
    audio_processor = AudioProcessor()
    wav_path = "./2.wav"
    audio_feature, librosa_feature_length = audio_processor.get_audio_feature(wav_path)
    # get_audio_feature returns a list of per-segment tensors, not a single tensor
    print("Audio feature segments:", len(audio_feature), "first segment shape:", audio_feature[0].shape)
    print("librosa_feature_length:", librosa_feature_length)
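In `get_whisper_chunk`, each video frame at `fps` is mapped to a window of whisper encoder features sampled at `audio_fps = 50`. The index arithmetic can be sketched on its own (illustrative helper names, not part of the file):

```python
import math

AUDIO_FPS = 50  # whisper encoder steps per second of 16 kHz audio

def frame_to_audio_index(frame_index: int, fps: int = 25) -> int:
    # Same mapping as get_whisper_chunk: floor(frame * audio_fps / fps).
    return math.floor(frame_index * (AUDIO_FPS / fps))

def window_length(pad_left: int = 2, pad_right: int = 2) -> int:
    # audio_feature_length_per_frame = 2 * (left + right + 1)
    return 2 * (pad_left + pad_right + 1)

# At 25 fps every frame advances 2 audio steps, and each frame reads a
# 10-step window with the default padding of 2 on each side.
print(frame_to_audio_index(5), window_length())  # → 10 10
```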
models/MuseTalk/musetalk/utils/audio_utils.py (new file, 17 lines)
@@ -0,0 +1,17 @@
import os
import subprocess

def ensure_wav(input_path: str, target_path: str | None = None) -> str:
    """
    Convert any audio (mp3/ogg/m4a/wav/…) to 16kHz mono PCM WAV via ffmpeg.
    Returns the path to the converted .wav, or the input path unchanged if
    it does not exist.
    """
    if not isinstance(input_path, str) or not os.path.exists(input_path):
        return input_path
    base, ext = os.path.splitext(input_path)
    ext = ext.lower()  # NOTE: currently unused; every existing file is re-encoded

    if target_path is None:
        target_path = base + "_16k.wav"
    cmd = ["ffmpeg", "-y", "-i", input_path, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", target_path]
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return target_path
Some files were not shown because too many files have changed in this diff.