Compare commits

...

5 Commits

Author SHA1 Message Date
Kevin Wong
a6cc919e5c Update 2026-02-11 13:57:41 +08:00
Kevin Wong
96a298e51c Update 2026-02-11 13:48:45 +08:00
Kevin Wong
e33dfc3031 Update 2026-02-10 13:31:29 +08:00
Kevin Wong
3129d45b25 Update 2026-02-09 14:47:19 +08:00
Kevin Wong
e226224119 Update 2026-02-08 19:54:11 +08:00
335 changed files with 99536 additions and 1869 deletions

.gitignore vendored

@@ -20,11 +20,14 @@ node_modules/
out/
.turbo/
# ============ IDE ============
# ============ IDE / AI tools ============
.vscode/
.idea/
*.swp
*.swo
.agents/
.opencode/
.claude/
# ============ System files ============
.DS_Store
@@ -35,11 +38,21 @@ desktop.ini
backend/outputs/
backend/uploads/
backend/cookies/
backend/user_data/
backend/debug_screenshots/
*_cookies.json
# ============ MuseTalk ============
# ============ Model weights ============
models/*/checkpoints/
models/MuseTalk/models/
models/MuseTalk/results/
models/LatentSync/temp/
# ============ Remotion build ============
remotion/dist/
# ============ Temporary files ============
Temp/
# ============ Logs ============
*.log


@@ -29,15 +29,16 @@ backend/
├── app/
│ ├── core/ # config, deps, security, response
│ ├── modules/ # Business modules (routes + logic)
│ │ ├── videos/ # Video generation tasks
│ │ ├── materials/ # Material management
│ │ ├── videos/ # Video generation tasks (router/schemas/service/workflow)
│ │ ├── materials/ # Material management (router/schemas/service)
│ │ ├── publish/ # Multi-platform publishing
│ │ ├── auth/ # Authentication and sessions
│ │ ├── ai/ # AI features (title/tag generation)
│ │ ├── ai/ # AI features (title/tag generation, multilingual translation)
│ │ ├── assets/ # Static assets (fonts/styles/BGM)
│ │ ├── ref_audios/ # Voice-clone reference audio
│ │ ├── ref_audios/ # Voice-clone reference audio (router/schemas/service)
│ │ ├── generated_audios/ # Pre-generated voiceover management (router/schemas/service)
│ │ ├── login_helper/ # QR-code login helper
│ │ ├── tools/ # Utility endpoints
│ │ ├── tools/ # Utility endpoints (router/schemas/service)
│ │ └── admin/ # Admin features
│ ├── repositories/ # Supabase data access
│ ├── services/ # External service integrations
@@ -124,10 +125,13 @@ backend/user_data/{user_uuid}/cookies/
## 8. Recommended Development Workflow
- **New features**: create the module first, then write router/service/workflow.
- **Bug fixes**: while you are there, extract the affected logic into the corresponding service/workflow.
- **New features**: create the module first; it **must** contain `router.py + schemas.py + service.py` — router-only modules are not allowed.
- **Bug fixes**: while you are there, extract the affected logic into the corresponding service/workflow (incremental refactoring).
- **Changing old modules**: split only the parts you touch; no need to refactor the whole file at once.
- **Core-flow changes**: always run the smoke tests (login/generate/publish).
> **Incremental principle**: hold new code to the high standard and improve old code gradually. Avoid large one-shot refactors to reduce regression risk.
---
## 9. Common Environment Variables


@@ -15,16 +15,17 @@ backend/
├── app/
│ ├── core/ # Core configuration (config.py, security.py, response.py)
│ ├── modules/ # Business modules (router/service/workflow/schemas)
│ │ ├── videos/ # Video generation tasks
│ │ ├── materials/ # Material management
│ │ ├── videos/ # Video generation tasks (router/schemas/service/workflow)
│ │ ├── materials/ # Material management (router/schemas/service)
│ │ ├── publish/ # Multi-platform publishing
│ │ ├── auth/ # Authentication and sessions
│ │ ├── ai/ # AI features (title/tag generation)
│ │ ├── assets/ # Static assets (fonts/styles/BGM)
│ │ ├── ref_audios/ # Voice-clone reference audio
│ │ ├── login_helper/ # QR-code login helper
│ │ ├── tools/ # Utility endpoints (script extraction, etc.)
│ │ └── admin/ # Admin features
│ │ ├── ai/ # AI features (title/tag generation, multilingual translation)
│ │ ├── assets/ # Static assets (fonts/styles/BGM)
│ │ ├── ref_audios/ # Voice-clone reference audio (router/schemas/service)
│ │ ├── generated_audios/ # Pre-generated voiceover management (router/schemas/service)
│ │ ├── login_helper/ # QR-code login helper
│ │ ├── tools/ # Utility endpoints (router/schemas/service)
│ │ └── admin/ # Admin features
│ ├── repositories/ # Supabase data access
│ ├── services/ # External service integrations (TTS/Remotion/Storage/Uploader, etc.)
│ └── tests/ # Unit and integration tests
@@ -50,6 +51,8 @@ backend/
* `POST /api/auth/register`: User registration
* `GET /api/auth/me`: Get current user info
> Authorization-expiry policy: during login and on protected endpoints, the backend checks `users.expires_at`. When an account expires it is automatically deactivated (`is_active=false`), its sessions are cleared, and the API returns `403: Membership expired, please renew`.
2. **Video Generation (Videos)**
* `POST /api/videos/generate`: Submit a generation task
* `GET /api/videos/tasks/{task_id}`: Query a single task's status
@@ -76,20 +79,29 @@ backend/
* `GET /api/assets/bgm`: Background music list
6. **Voice Cloning (Ref Audios)**
* `POST /api/ref-audios`: Upload reference audio (multipart/form-data)
* `POST /api/ref-audios`: Upload reference audio (multipart/form-data; ref_text is auto-transcribed with Whisper)
* `GET /api/ref-audios`: List reference audios
* `PUT /api/ref-audios/{id}`: Rename a reference audio
* `DELETE /api/ref-audios/{id}`: Delete a reference audio
* `POST /api/ref-audios/{id}/retranscribe`: Re-transcribe a reference audio (Whisper transcription; audio over 10s is auto-trimmed)
7. **AI Features (AI)**
* `POST /api/ai/generate-meta`: AI-generated titles and tags
* `POST /api/ai/translate`: AI multilingual translation (9 target languages supported)
8. **Tools**
8. **Pre-generated Voiceovers (Generated Audios)**
* `POST /api/generated-audios/generate`: Generate a voiceover asynchronously (returns task_id)
* `GET /api/generated-audios/tasks/{task_id}`: Poll generation progress
* `GET /api/generated-audios`: List all of the user's voiceovers
* `DELETE /api/generated-audios/{audio_id}`: Delete a voiceover
* `PUT /api/generated-audios/{audio_id}`: Rename a voiceover
9. **Tools**
* `POST /api/tools/extract-script`: Extract a script from a video link
9. **Health Checks**
10. **Health Checks**
* `GET /api/lipsync/health`: LatentSync service health
* `GET /api/voiceclone/health`: Qwen3-TTS service health
* `GET /api/voiceclone/health`: CosyVoice 3.0 service health
### Unified Response Structure
@@ -108,9 +120,16 @@ backend/
`POST /api/videos/generate` supports the following optional fields:
- `material_path`: video material path (single-material mode)
- `material_paths`: array of material paths (multi-camera mode; with ≥2 materials, switches automatically per sentence)
- `tts_mode`: TTS mode (`edgetts` / `voiceclone`)
- `voice`: EdgeTTS voice ID (edgetts mode)
- `ref_audio_id` / `ref_text`: reference audio ID and text (voiceclone mode)
- `generated_audio_id`: pre-generated voiceover ID (when present, inline TTS is skipped and the existing voiceover file is used)
- `speed`: speech rate (voice-clone mode; default 1.0, range 0.8-1.2)
- `custom_assignments`: array of custom material assignments (each entry has `material_path` / `start` / `end` / `source_start` / `source_end?`); when present, generation follows the timeline's visible segments
- `output_aspect_ratio`: output aspect ratio (`9:16` or `16:9`, default `9:16`)
- `language`: TTS language (auto-detected by default; passed through to CosyVoice 3.0 in voice-clone mode)
- `title`: intro title text
- `subtitle_style_id`: subtitle style ID
- `title_style_id`: title style ID
@@ -122,6 +141,12 @@ backend/
- `bgm_id`: background music ID
- `bgm_volume`: background music volume (0-1, default 0.2)
### Multi-material Stability Notes
- Multi-material segments are uniformly re-encoded before concatenation, forcing `25fps + CFR` to reduce stutter caused by mismatched timebases at segment boundaries.
- The concat step enables `+genpts` to rebuild timestamps, improving timeline continuity after joining.
- MOV materials with rotation metadata are orientation-normalized before resolution checks and the rest of the pipeline.
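A minimal sketch of the re-encode and concat steps described above; the exact flags used in the backend are assumptions, and these helpers only build the ffmpeg command lines to illustrate the `25fps + CFR` re-encode and the `+genpts` concat:

```python
def reencode_cfr_cmd(src: str, dst: str) -> list[str]:
    # Re-encode one segment to 25fps constant frame rate so all
    # segments share a single timebase before concatenation.
    return ["ffmpeg", "-y", "-i", src,
            "-vf", "fps=25", "-vsync", "cfr",
            "-c:v", "libx264", "-preset", "fast", dst]

def concat_list_text(segments: list[str]) -> str:
    # Body of the concat-demuxer list file, one segment per line.
    return "".join(f"file '{s}'\n" for s in segments)

def concat_cmd(list_file: str, dst: str) -> list[str]:
    # +genpts rebuilds presentation timestamps across segment joins.
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-fflags", "+genpts", "-i", list_file, "-c", "copy", dst]
```

The commands would be run with `subprocess.run(cmd, check=True)` after writing the list-file text to disk.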
## 📦 Asset Library and Static Assets
- Local asset directories: `backend/assets/{fonts,bgm,styles}`

Docs/COSYVOICE3_DEPLOY.md Normal file

@@ -0,0 +1,211 @@
# CosyVoice 3.0 Deployment Guide
## Overview
| Item | Value |
|------|------|
| Model | Fun-CosyVoice3-0.5B-2512 (0.5B parameters) |
| Port | 8010 |
| GPU | 0 (CUDA_VISIBLE_DEVICES=0) |
| PM2 name | vigent2-cosyvoice (id=15) |
| Conda env | cosyvoice (Python 3.10) |
| Launch script | `run_cosyvoice.sh` |
| Server script | `models/CosyVoice/cosyvoice_server.py` |
| Model load time | ~22-34 s |
| VRAM usage | ~3-5 GB |
## Supported Languages
Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian, plus 18+ Chinese dialects
## Directory Layout
```
models/CosyVoice/
├── cosyvoice_server.py # FastAPI service (port 8010)
├── cosyvoice/ # CosyVoice source
│ └── cli/cosyvoice.py # AutoModel entry point
├── third_party/Matcha-TTS/ # submodule dependency
├── pretrained_models/
│ ├── Fun-CosyVoice3-0.5B/ # model files (~8.2GB)
│ │ ├── llm.pt # LLM model (1.9GB)
│ │ ├── llm.rl.pt # RL model (1.9GB, spare)
│ │ ├── flow.pt # Flow model (1.3GB)
│ │ ├── hift.pt # HiFT vocoder (80MB)
│ │ ├── campplus.onnx # speaker embedding (27MB)
│ │ ├── speech_tokenizer_v3.onnx # speech tokenizer (925MB)
│ │ ├── cosyvoice3.yaml # model config
│ │ └── CosyVoice-BlankEN/ # Qwen tokenizer
│ └── CosyVoice-ttsfrd/ # text-normalization resources
│ ├── resource/ # unpacked ttsfrd resources
│ └── resource.zip
run_cosyvoice.sh # PM2 launch script
```
## API
### GET /health
Health check; returns:
```json
{
"service": "CosyVoice 3.0 Voice Clone",
"model": "Fun-CosyVoice3-0.5B",
"ready": true,
"gpu_id": 0
}
```
### POST /generate
Voice-clone synthesis.
**Parameters (multipart/form-data)**
| Param | Type | Required | Description |
|------|------|------|------|
| ref_audio | File | yes | reference audio (WAV) |
| text | string | yes | text to synthesize |
| ref_text | string | yes | transcript of the reference audio |
| language | string | no | language (default "Chinese"; CosyVoice auto-detects) |
| speed | float | no | speech rate (default 1.0, range 0.5-2.0, recommended 0.8-1.2) |
**Returns:** a WAV audio file
**Status codes:**
- 200: success
- 429: GPU busy, retry
- 500: generation failed / timed out
- 503: model not loaded / service poisoned
## Safety Mechanisms
1. **GPU inference lock** (`asyncio.Lock`): prevents concurrent inference from corrupting GPU state
2. **429 rejection**: returns 429 immediately when the lock is held; clients retry
3. **Timeout guard**: `60 + len(text) * 2` seconds, capped at 300
4. **Poisoned flag**: after a timeout the service is marked poisoned and the health check returns `ready: false`
5. **Forced exit**: 1.5 s after a timeout the process calls `os._exit(1)` and PM2 restarts it
6. **Startup self-test**: on startup, run one real inference on a short text to verify the GPU inference path; on failure set `_model_loaded = False` so the health check returns `ready: false`, avoiding false positives
7. **Reference-audio auto-trim**: reference audio longer than 10 s is trimmed to the first 10 s (CosyVoice recommends 3-10 s) to avoid sampling anomalies
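The timeout and trim rules above can be stated precisely; a hypothetical sketch (the function names are not taken from `cosyvoice_server.py`):

```python
def inference_timeout(text: str) -> int:
    # 60s base + 2s per character, capped at 300s.
    return min(60 + len(text) * 2, 300)

def trim_ref_audio(duration_sec: float, max_sec: float = 10.0) -> float:
    # Keep at most the first 10 seconds of the reference audio.
    return min(duration_sec, max_sec)
```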
## Operations
```bash
# Start
pm2 start run_cosyvoice.sh --name vigent2-cosyvoice
# Restart
pm2 restart vigent2-cosyvoice
# Tail logs
pm2 logs vigent2-cosyvoice --lines 50
# Health check
curl http://localhost:8010/health
# Stop
pm2 stop vigent2-cosyvoice
```
## Deploying from Scratch
### 1. Clone the repository
```bash
cd /home/rongye/ProgramFiles/ViGent2/models
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
```
### 2. Create the Conda environment
```bash
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
```
### 3. Install dependencies
Note: do not run `pip install -r requirements.txt` directly — it has version conflicts that must be handled manually.
```bash
# Install PyTorch 2.3.1 (CUDA 12.1) — must come first; the version is strict
pip install torch==2.3.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
# Core inference dependencies
pip install conformer==0.3.2 HyperPyYAML==1.2.2 inflect==7.3.1 \
librosa==0.10.2 lightning==2.2.4 modelscope==1.20.0 omegaconf==2.3.0 \
pydantic==2.7.0 soundfile==0.12.1 fastapi==0.115.6 uvicorn==0.30.0 \
transformers==4.51.3 protobuf==4.25 hydra-core==1.3.2 \
rich==13.7.1 diffusers==0.29.0 x-transformers==2.11.24 wetext==0.0.4
# onnxruntime-gpu
pip install onnxruntime-gpu==1.18.0 \
--extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
# Other required dependencies
pip install gdown matplotlib pyarrow wget onnx python-multipart httpx
# openai-whisper needs setuptools < 71 (provides pkg_resources)
pip install "setuptools<71"
pip install --no-build-isolation openai-whisper==20231117
# pyworld needs g++ and Cython
pip install Cython
PATH="/usr/bin:$PATH" pip install pyworld==0.3.4
# Critical version pins
pip install "numpy<2" # onnxruntime-gpu is incompatible with numpy 2.x
pip install "ruamel.yaml<0.18" # hyperpyyaml is incompatible with ruamel.yaml 0.19+
```
> **Important**: CosyVoice requires torch==2.3.1; torch 2.10+ triggers CUBLAS_STATUS_INVALID_VALUE errors.
> torch 2.3.1+cu121 bundles nvidia-cudnn-cu12, so onnxruntime's CUDAExecutionProvider works out of the box.
### 4. Download the models
```bash
# Via huggingface_hub (use hf-mirror.com inside China)
HF_ENDPOINT=https://hf-mirror.com python -c "
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
"
```
### 5. Install ttsfrd (optional; improves text-normalization quality)
```bash
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
### 6. Register with PM2
```bash
pm2 start run_cosyvoice.sh --name vigent2-cosyvoice
pm2 save
```
## Known Issues
1. **ttsfrd "prepare tts engine failed"**: internal log from the ttsfrd C library; Python-level initialization succeeds, no impact on use
2. **Sliding Window Attention warning**: a transformers library notice; does not affect inference results
3. **onnxruntime Memcpy notice**: `Memcpy nodes are not supported by the CUDA EP` is a performance hint only, not a functional problem
> The libcudnn.so.8 issue is resolved under torch 2.3.1+cu121 (which bundles nvidia-cudnn-cu12); onnxruntime's CUDAExecutionProvider loads normally.
## Comparison with Qwen3-TTS
| Feature | Qwen3-TTS (retired) | CosyVoice 3.0 (current) |
|------|-----------|----------------|
| Port | 8009 | 8010 |
| Model size | 0.6B | 0.5B |
| Languages | zh/en/ja/ko | 9 languages + 18 dialects |
| Cloning inputs | ref_audio + ref_text | ref_audio + ref_text |
| Prompt format | ref_text passed directly | `You are a helpful assistant.<\|endofprompt\|>` + ref_text |
| Built-in segmentation | none; client must segment | built-in text_normalize auto-segmentation |
| Status | retired (PM2 stopped) | in production |


@@ -165,6 +165,8 @@ playwright install chromium
CREATE POLICY "Allow public read" ON storage.objects FOR SELECT TO anon USING (bucket_id = 'materials' OR bucket_id = 'outputs');
EOF
```
> **Note**: the backend automatically creates the extra storage buckets (`ref-audios`, `generated-audios`) at startup; no manual creation is needed.
---
@@ -334,34 +336,28 @@ chmod +x run_latentsync.sh
pm2 start ./run_latentsync.sh --name vigent2-latentsync
```
### 4. Start the Qwen3-TTS voice-clone service (optional)
### 4. Start the CosyVoice 3.0 voice-clone service (optional)
> Start this service if you need the voice-clone feature.
> Start this service if you need the voice-clone feature. See the [CosyVoice 3.0 deployment guide](COSYVOICE3_DEPLOY.md) for detailed steps.
1. Install the HTTP service dependencies:
```bash
conda activate qwen-tts
pip install fastapi uvicorn python-multipart
```
1. The launch script lives in the project root: `run_cosyvoice.sh`
2. The launch script lives in the project root: `run_qwen_tts.sh`
3. Start it with pm2:
2. Start it with pm2:
```bash
cd /home/rongye/ProgramFiles/ViGent2
pm2 start ./run_qwen_tts.sh --name vigent2-qwen-tts
pm2 start ./run_cosyvoice.sh --name vigent2-cosyvoice
pm2 save
```
4. Verify the service:
3. Verify the service:
```bash
# Check health
curl http://localhost:8009/health
curl http://localhost:8010/health
```
### 5. Start the service watchdog (Watchdog)
> 🛡️ **Recommended**: monitors the Qwen-TTS and LatentSync services and auto-restarts them when they hang.
> 🛡️ **Recommended**: monitors the CosyVoice and LatentSync services and auto-restarts them when they hang.
```bash
cd /home/rongye/ProgramFiles/ViGent2
@@ -382,7 +378,7 @@ pm2 startup
pm2 status # show all service statuses
pm2 logs # tail all logs
pm2 logs vigent2-backend # backend logs
pm2 logs vigent2-qwen-tts # Qwen3-TTS logs
pm2 logs vigent2-cosyvoice # CosyVoice logs
pm2 restart all # restart all services
pm2 stop vigent2-latentsync # stop the LatentSync service
pm2 delete all # delete all services
@@ -521,7 +517,7 @@ python3 -c "import torch; print(torch.cuda.is_available())"
sudo lsof -i :8006
sudo lsof -i :3002
sudo lsof -i :8007
sudo lsof -i :8009 # Qwen3-TTS
sudo lsof -i :8010 # CosyVoice
```
### Viewing Logs
@@ -531,7 +527,7 @@ sudo lsof -i :8009 # Qwen3-TTS
pm2 logs vigent2-backend
pm2 logs vigent2-frontend
pm2 logs vigent2-latentsync
pm2 logs vigent2-qwen-tts
pm2 logs vigent2-cosyvoice
```
### SSH Lag / Slow System Response
@@ -570,6 +566,7 @@ pm2 logs vigent2-qwen-tts
| `next` | React framework |
| `swr` | data fetching and caching |
| `tailwindcss` | CSS styling |
| `wavesurfer.js` | audio waveform (timeline editor) |
### Key LatentSync Dependencies


@@ -342,6 +342,6 @@ models/Qwen3-TTS/
## 🔗 Related Docs
- [task_complete.md](../task_complete.md) - Task overview
- [TASK_COMPLETE.md](../TASK_COMPLETE.md) - Task overview
- [Day11.md](./Day11.md) - Upload architecture refactor
- [QWEN3_TTS_DEPLOY.md](../QWEN3_TTS_DEPLOY.md) - Qwen3-TTS deployment guide


@@ -273,7 +273,7 @@ pm2 logs vigent2-qwen-tts --lines 50
## 🔗 Related Docs
- [task_complete.md](../task_complete.md) - Task overview
- [TASK_COMPLETE.md](../TASK_COMPLETE.md) - Task overview
- [Day12.md](./Day12.md) - iOS compatibility and Qwen3-TTS deployment
- [QWEN3_TTS_DEPLOY.md](../QWEN3_TTS_DEPLOY.md) - Qwen3-TTS deployment guide
- [SUBTITLE_DEPLOY.md](../SUBTITLE_DEPLOY.md) - Subtitle feature deployment guide


@@ -397,6 +397,6 @@ if ((status === 401 || status === 403) && !isRedirecting && !isPublicPath) {
## 🔗 Related Docs
- [task_complete.md](../task_complete.md) - Task overview
- [TASK_COMPLETE.md](../TASK_COMPLETE.md) - Task overview
- [Day13.md](./Day13.md) - Voice-clone integration + subtitle feature
- [QWEN3_TTS_DEPLOY.md](../QWEN3_TTS_DEPLOY.md) - Qwen3-TTS 1.7B deployment guide


@@ -342,7 +342,7 @@ pm2 restart vigent2-backend vigent2-frontend
## 🔗 Related Docs
- [task_complete.md](../task_complete.md) - Task overview
- [TASK_COMPLETE.md](../TASK_COMPLETE.md) - Task overview
- [Day14.md](./Day14.md) - Model upgrade + AI titles/tags
- [AUTH_DEPLOY.md](../AUTH_DEPLOY.md) - Auth system deployment guide


@@ -136,4 +136,4 @@ if service["failures"] >= service['threshold']:
- [x] `Docs/QWEN3_TTS_DEPLOY.md`: added a Flash Attention installation guide
- [x] `Docs/DEPLOY_MANUAL.md`: added Watchdog deployment notes
- [x] `Docs/task_complete.md`: progress updated to 100% (Day 16)
- [x] `Docs/TASK_COMPLETE.md`: progress updated to 100% (Day 16)


@@ -246,3 +246,204 @@ PLATFORM_CONFIGS = {
pm2 restart vigent2-backend # publish service + QR login
npm run build && pm2 restart vigent2-frontend # face-verification UI
```
---
## 🏗️ Architecture Cleanup: Frontend Structure Tweaks + Backend Module Layering (Day 21)
### Overview
Following the architecture audit, normalized the frontend directory layout and completed layering for the core backend modules.
### Part 1: Frontend Structure Tweaks
#### 1. ScriptExtractionModal move
- `components/ScriptExtractionModal.tsx` → `features/home/ui/ScriptExtractionModal.tsx`
- The `components/script-extraction/` directory moved along with it to `features/home/ui/script-extraction/`
- Updated the import path in `HomePage.tsx`
#### 2. contexts/ directory consolidation
- `src/contexts/AuthContext.tsx` → `src/shared/contexts/AuthContext.tsx`
- `src/contexts/TaskContext.tsx` → `src/shared/contexts/TaskContext.tsx`
- Updated 6 imports (layout.tsx, useHomeController.ts, usePublishController.ts, AccountSettingsDropdown.tsx, GlobalTaskIndicator.tsx)
- Deleted the now-empty `src/contexts/` directory
#### 3. Removed empty directories left over from refactoring
- Deleted `src/lib/`, `src/components/home/`, `src/hooks/`
### Part 2: Backend Module Layering
Split three 400+-line router-only modules into `router.py + schemas.py + service.py`:
| Module | Before | Router after |
|------|--------|--------------|
| `materials/` | 416 lines | 63 lines |
| `tools/` | 417 lines | 33 lines |
| `ref_audios/` | 421 lines | 71 lines |
All business logic moved into `service.py` and data models into `schemas.py`; the router only validates parameters, calls the service, and returns the response.
### Part 3: Development Standards Update
Added an incremental principle to `BACKEND_DEV.md` section 8:
- New modules **must** contain `router.py + schemas.py + service.py`
- When touching an old module, split out the parts involved as you go
- Hold new code to the high standard; improve old code gradually
### Affected Files
| File | Change |
|------|------|
| `frontend/src/features/home/ui/ScriptExtractionModal.tsx` | moved in from components/ |
| `frontend/src/features/home/ui/script-extraction/` | moved in from components/ |
| `frontend/src/shared/contexts/AuthContext.tsx` | moved in from contexts/ |
| `frontend/src/shared/contexts/TaskContext.tsx` | moved in from contexts/ |
| `backend/app/modules/materials/schemas.py` | **new** |
| `backend/app/modules/materials/service.py` | **new** |
| `backend/app/modules/materials/router.py` | slimmed to a thin router |
| `backend/app/modules/tools/schemas.py` | **new** |
| `backend/app/modules/tools/service.py` | **new** |
| `backend/app/modules/tools/router.py` | slimmed to a thin router |
| `backend/app/modules/ref_audios/schemas.py` | **new** |
| `backend/app/modules/ref_audios/service.py` | **new** |
| `backend/app/modules/ref_audios/router.py` | slimmed to a thin router |
| `Docs/BACKEND_DEV.md` | directory tree annotated with layering; incremental principle added |
| `Docs/BACKEND_README.md` | directory tree annotated with layering |
| `Docs/FRONTEND_DEV.md` | directory tree updated (contexts move, ScriptExtractionModal move) |
### Restart Requirements
```bash
pm2 restart vigent2-backend
npm run build && pm2 restart vigent2-frontend
```
---
## 🎬 Multi-material Video Generation (Multi-camera Effect)
### Overview
Users can upload several selfie videos shot from different angles; during generation the pipeline switches materials automatically per sentence, producing a multi-camera look. Single-material jobs follow the original pipeline with no extra overhead.
### Core Architecture
#### Pipeline changes
```
[Single material (unchanged)]
text → TTS → audio → LatentSync(1 material + full audio) → Whisper subtitles → Remotion → final video
[Multi-material (new)]
text → TTS → audio → Whisper subtitles (moved earlier) → split duration evenly by material count (aligned to word boundaries)
→ per segment: split audio + LatentSync(material[i] + audio segment[i])
→ FFmpeg concat of all segments → Remotion (full subtitle timestamps) → final video
```
#### Material-switching logic (equal-split scheme)
1. Whisper transcribes the full audio, yielding word-level timestamps
2. The total audio duration is **split evenly by material count**: `total_duration / N`
3. Each split point snaps to the nearest Whisper word boundary to avoid cutting mid-word
4. The first segment's start is extended to 0.0 and the last segment's end to the end of the audio, guaranteeing full coverage
> **Design decision**: the original scheme split sentences on punctuation in the source script, but user scripts often contain no full stops (only commas), so only 1 segment was produced. The equal-split scheme does not depend on punctuation and segments correctly for any input.
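A hypothetical sketch of the equal-split logic; the real `_split_equal` lives in `workflow.py`, and the names and the exact boundary-snapping rule shown here are assumptions:

```python
def split_equal(word_ends: list[float], total: float, n: int):
    # Ideal boundaries at total/N intervals, each snapped to the
    # nearest Whisper word end; first start forced to 0.0 and the
    # last end forced to the full audio duration.
    n = min(n, len(word_ends))  # guard: more materials than words
    bounds = []
    for i in range(1, n):
        ideal = total * i / n
        bounds.append(min(word_ends, key=lambda w: abs(w - ideal)))
    starts = [0.0] + bounds
    ends = bounds + [total]
    return list(zip(starts, ends))
```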
---
### Part 1: Backend Changes
#### 1. `backend/app/modules/videos/schemas.py`
- Added a `material_paths: Optional[List[str]]` field
- Kept `material_path: str` for backward compatibility
#### 2. `backend/app/modules/videos/workflow.py` (core change)
**New function**
- `_split_equal(segments, material_paths)`: splits the audio duration evenly by material count, snapping to the nearest Whisper word boundary
**Changes to `process_video_generation()`**
- `is_multi = len(material_paths) > 1` selects the multi- vs single-material branch
- Multi-material branch: Whisper first → equal split → audio splitting → LatentSync per segment → FFmpeg concat
#### 3. `backend/app/services/video_service.py`
- Added `concat_videos()`: joins video segments with the FFmpeg concat demuxer (`-c copy`)
- Added `split_audio()`: cuts audio by time range with FFmpeg (`-ss` + `-t` + `-c copy`)
#### 4. `backend/scripts/watchdog.py`
- Health-check failure threshold raised from 3 to 5 (2.5-minute tolerance window)
- Added a 120-second cooldown after each restart so model loading is not misjudged as a failure
- All services get a 60-second initial cooldown at startup
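A minimal sketch of the threshold-plus-cooldown decision described above; the field names are assumptions, not the actual `watchdog.py` structure:

```python
def should_restart(svc: dict, now: float) -> bool:
    # Skip checks entirely while a service is inside its cooldown
    # window (120s after a restart, 60s after watchdog startup).
    if now - svc["last_restart"] < svc["cooldown_sec"]:
        return False
    # Restart only after the failure threshold (raised 3 -> 5) is hit.
    return svc["failures"] >= svc["threshold"]
```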
---
### Part 2: Frontend Changes
#### 1. New dependencies
```bash
npm install @dnd-kit/core @dnd-kit/sortable @dnd-kit/utilities
```
#### 2. `frontend/src/features/home/model/useMaterials.ts`
- `selectedMaterial: string` → `selectedMaterials: string[]` (multi-select)
- Added `toggleMaterial(id)`: toggle selection (at least 1 must remain selected)
- Added `reorderMaterials(activeId, overId)`: drag-and-drop reordering
- Upload formats extended: added `.mkv/.webm/.flv/.wmv/.m4v/.ts/.mts`
#### 3. `frontend/src/features/home/ui/MaterialSelector.tsx` (rewritten)
- Each material row gains a checkbox + order badge (①②③)
- With ≥2 selected, a drag-to-reorder area appears (@dnd-kit `SortableContext`)
- Each sortable item: drag handle + order number + material name + remove button
- HTML input accept changed to `video/*`
#### 4. `frontend/src/features/home/model/useHomeController.ts`
- Multi-material payload: `material_paths` array + `material_path` for backward compatibility
- `enable_subtitles` hardcoded to `true` (toggle removed)
- Validation: at least 1 material must be selected
#### 5. `frontend/src/features/home/model/useHomePersistence.ts`
- Material persistence changed to a JSON array, backward compatible with the old format (single string)
- Removed `enableSubtitles` persistence
#### 6. `frontend/src/features/home/ui/TitleSubtitlePanel.tsx`
- Removed the "word-highlight subtitles" toggle; the subtitle-style section is always shown
#### 7. `frontend/src/features/home/ui/HomePage.tsx`
- Updated prop passing (`selectedMaterials`, `toggleMaterial`, `reorderMaterials`)
---
### Part 3: Bug Fix Log
#### BUG-1: Multi-material jobs used only the first video (punctuation-based sentence splitting failed)
- **Symptom**: 2 materials selected but the generated video used only the first; logs showed `Multi-material: 1 segments, 2 materials`
- **Root cause v1**: the first attempt split sentences with the regex `[。!?!?]` over Whisper output, but Whisper emits no punctuation.
- **Fix v1**: switched to splitting on punctuation in the source script — but user scripts often contain only commas (,) and no sentence-ending marks (。!?), so it still degenerated to 1 segment.
- **Final fix**: abandoned punctuation-based splitting entirely in favor of `_split_equal()`, which **splits the audio duration evenly by material count** and snaps to the nearest Whisper word boundary. It depends on no punctuation and works for any script.
#### BUG-2: Lip sync off (audio time offset)
- **Root cause**: `split_audio` cut audio using Whisper start/end times (e.g. 0.11-7.21), but `compose()` mixed with the full original audio (0.0 to end), producing a time offset.
- **Fix**: force the first segment's start to 0.0 and the last segment's end to the actual audio duration, so the split audio covers the whole track.
#### BUG-3: Over-merging via min_segment_sec caused degeneration (removed with the scheme change)
- **Root cause**: in the old scheme, when the 2nd of 2 sentences was under 3 seconds, the minimum-duration check merged them into 1 segment, degrading multi-material to single-material.
- **Status**: the equal-split scheme has no such issue; the related code was removed.
---
### Affected Files
| File | Change | Notes |
|------|----------|------|
| `backend/app/modules/videos/schemas.py` | modified | added the material_paths field |
| `backend/app/modules/videos/workflow.py` | modified | multi-material pipeline core logic + 3 bug fixes |
| `backend/app/services/video_service.py` | modified | added concat_videos / split_audio |
| `backend/scripts/watchdog.py` | modified | threshold tuning + cooldown mechanism |
| `frontend/package.json` | modified | added @dnd-kit dependencies |
| `frontend/src/features/home/model/useMaterials.ts` | modified | multi-select + ordering state |
| `frontend/src/features/home/ui/MaterialSelector.tsx` | rewritten | multi-select checkboxes + drag-reorder UI |
| `frontend/src/features/home/model/useHomeController.ts` | modified | multi-material payload + subtitle toggle removed |
| `frontend/src/features/home/model/useHomePersistence.ts` | modified | JSON-array persistence |
| `frontend/src/features/home/ui/TitleSubtitlePanel.tsx` | modified | subtitle toggle removed |
| `frontend/src/features/home/ui/HomePage.tsx` | modified | prop passing updated |
### Restart Requirements
```bash
pm2 restart vigent2-backend
npm run build && pm2 restart vigent2-frontend
```

Docs/DevLogs/Day22.md Normal file

@@ -0,0 +1,221 @@
## 🔧 Multi-material Generation Hardening (Day 22)
### Overview
Fully reviewed the multi-material (multi-camera) video generation implemented on Day 21: fixed 6 high-priority bugs, shipped 8 UX improvements, and refactored the multi-material pipeline from "LatentSync per segment" to "concat first, then infer", cutting inference runs from N to 1.
---
### Part 1: High-priority Backend Bug Fixes
#### 1. `_split_equal()` boundary overflow when materials > characters
- **Problem**: with 5 materials but only 2 Whisper characters, boundary indices repeated and some materials were skipped
- **Fix**: added the upper-bound guard `n = min(n, len(all_chars))`
- **File**: `backend/app/modules/videos/workflow.py`
#### 2. No fallback when one multi-material LatentSync segment fails
- **Problem**: in single-material mode a LatentSync failure falls back to the raw material, but multi-material mode raised and failed the whole task
- **Fix**: wrapped the multi-material loop in try-except; on failure, fall back to the raw material segment
- **File**: `backend/app/modules/videos/workflow.py`
#### 3. ZeroDivisionError when `num_segments == 0`
- **Problem**: if all assignments were skipped, `i / num_segments` divided by zero
- **Fix**: check `if num_segments == 0` before the loop and raise a clear error
- **File**: `backend/app/modules/videos/workflow.py`
#### 4. `split_audio` did not validate duration > 0
- **Problem**: FFmpeg misbehaves when `end <= start`
- **Fix**: added `if duration <= 0: raise ValueError(...)`
- **File**: `backend/app/services/video_service.py`
#### 5. Equal-duration fallback when Whisper fails
- **Problem**: after a Whisper failure the job degraded to single-material, wasting the other materials
- **Fix**: split evenly by `audio_duration / len(material_paths)` without relying on character alignment
- **File**: `backend/app/modules/videos/workflow.py`
#### 6. `concat_videos` did not check for an empty list
- **Problem**: FFmpeg errors when given an empty `video_paths`
- **Fix**: added `if not video_paths: raise ValueError(...)`
- **File**: `backend/app/services/video_service.py`
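The `video_service.py` guards from fixes 4 and 6 are tiny; a hypothetical sketch of just the validation layer (the function signatures are assumptions):

```python
def validate_split(start: float, end: float) -> float:
    # Guard from fix 4: reject zero or negative durations before
    # handing the range to FFmpeg.
    duration = end - start
    if duration <= 0:
        raise ValueError(f"split_audio: non-positive duration ({start}..{end})")
    return duration

def validate_concat(video_paths: list[str]) -> list[str]:
    # Guard from fix 6: refuse to build a concat list from nothing.
    if not video_paths:
        raise ValueError("concat_videos: empty video_paths")
    return video_paths
```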
---
### Part 2: Frontend Improvements
#### 1. Non-null assertion fix in payload construction
- `m!.path` → `m?.path` + `.filter(Boolean)`, preventing a crash when a material is deleted mid-flight
- **File**: `frontend/src/features/home/model/useHomeController.ts`
#### 2. Generate button shows backend progress messages
- Added a `message` prop; while generating, shows e.g. "(processing segment 2/3...)"
- **Files**: `frontend/src/features/home/ui/GenerateActionBar.tsx`, `HomePage.tsx`
#### 3. Newly uploaded materials auto-selected
- After a successful upload, diff the before/after material lists and append the new IDs to `selectedMaterials`
- **File**: `frontend/src/features/home/model/useMaterials.ts`
#### 4. Unified Material interface
- Three duplicate `interface Material` definitions extracted to `shared/types/material.ts`
- **Files**: `frontend/src/shared/types/material.ts` (new), `useMaterials.ts`, `useHomeController.ts`, `MaterialSelector.tsx`
#### 5. Drag-and-drop fix
- Removed `DragOverlay` (`backdrop-blur` created a new containing block and broke positioning)
- Switched to native `useSortable` dragging + `CSS.Translate`; the dragged element gets a highlight and shadow
- **File**: `frontend/src/features/home/ui/MaterialSelector.tsx`
#### 6. Material selection capped at 4
- `toggleMaterial` gained a `MAX_MATERIALS = 4` limit
- When the cap is reached, unselected items turn semi-transparent and disabled; hint text changed to "multi-select, up to 4"
- **Files**: `useMaterials.ts`, `MaterialSelector.tsx`
#### 7. Responsive sorting area on mobile
- Material list `max-h-64` → `max-h-48 sm:max-h-64`
- **File**: `MaterialSelector.tsx`
#### 8. Multi-material duration hint
- With ≥2 materials selected, the generate button shows "multi-material mode (N cameras); generation takes longer" underneath
- **Files**: `GenerateActionBar.tsx`, `HomePage.tsx`
---
### Part 3: Core Architecture Refactor — Concat First, Then Infer
#### V1 (Day 21): LatentSync per segment
```
material A → LatentSync(material A, audio segment 1) → lipsync_A
material B → LatentSync(material B, audio segment 2) → lipsync_B
FFmpeg concat(lipsync_A, lipsync_B) → final video
```
- Drawback: N materials = N LatentSync inference runs (~30s each)
#### V2 (Day 22): concat first, then infer
```
material A → prepare_segment(trim to 3.67s) → prepared_A
material B → prepare_segment(trim to 4.00s) → prepared_B
FFmpeg concat(prepared_A, prepared_B) → concat_video (7.67s)
LatentSync(concat_video, full audio) → final video
```
- Benefit: only **1** LatentSync inference run; time drops from N×30s to 1×30s
#### New `prepare_segment()` method
```python
def prepare_segment(self, video_path, target_duration, output_path, target_resolution=None):
    # material longer than target: trim (-t)
    # material shorter than target: loop (-stream_loop) + trim
    # resolutions match: -c copy, lossless (no re-encode)
    # resolutions differ: scale + pad to the first material's resolution
```
#### Resolution handling strategy
- Added a `get_resolution()` method to detect each material's resolution
- When all materials share a resolution: lossless trim with `-c copy` (original quality preserved)
- When resolutions differ: normalize to the first material's resolution with `force_original_aspect_ratio=decrease` + `pad` centering
- LatentSync only touches the 512×512 mouth region; output keeps the original resolution
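The scale+pad normalization described above is a standard ffmpeg filter pattern; a hypothetical helper that builds it (the function name is not from `video_service.py`):

```python
def scale_pad_filter(width: int, height: int) -> str:
    # Shrink to fit inside the target while keeping aspect ratio,
    # then pad with bars to center the frame in the target canvas.
    return (
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2"
    )
```

The returned string would be passed to ffmpeg via `-vf`.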
#### Time-alignment verification
| Stage | Time base | Alignment |
|------|---------|---------|
| TTS audio | raw duration (7.67s) | reference |
| Whisper subtitles | based on TTS audio | timestamps aligned to audio |
| Equal split | assignments total = audio duration | first start=0, last end=audio_duration |
| prepare per segment | `-t seg_dur` exact truncation | sum ≈ audio duration |
| LatentSync | concat_video + full audio | internal 0.5s tolerance |
| compose | lipsync_video + audio/BGM | `-shortest` keeps sync |
| Remotion | subtitles rendered from captions_path | timestamps aligned to audio |
---
### Affected Files
| File | Change | Notes |
|------|----------|------|
| `backend/app/modules/videos/workflow.py` | modified | 6 bug fixes + pipeline refactor (concat first, then infer) |
| `backend/app/services/video_service.py` | modified | added `prepare_segment()` / `get_resolution()`; `split_audio` validation; `concat_videos` empty-list check |
| `frontend/src/shared/types/material.ts` | new | unified Material interface |
| `frontend/src/features/home/model/useMaterials.ts` | modified | auto-select uploads, 4-material cap |
| `frontend/src/features/home/model/useHomeController.ts` | modified | non-null-assertion fix, Material interface reference |
| `frontend/src/features/home/ui/MaterialSelector.tsx` | modified | drag fix, 4-cap UI, mobile responsiveness |
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | modified | progress messages, multi-material duration hint |
| `frontend/src/features/home/ui/HomePage.tsx` | modified | pass message, materialCount props |
---
### Part 4: AI Multilingual Translation
#### Feature
Added an "AI Languages" button to the script editor that translates the Chinese voiceover script into any of 9 languages with one click, with the ability to restore the original at any time.
#### Supported languages
English, Japanese (日本語), Korean (한국어), French (Français), German (Deutsch), Spanish (Español), Russian (Русский), Italian (Italiano), Portuguese (Português)
#### Implementation
##### Backend
- **`backend/app/services/glm_service.py`** — added a `translate_text()` method that calls the Zhipu GLM API (temperature=0.3; the prompt requires returning only the translation while preserving tone and style)
- **`backend/app/modules/ai/router.py`** — added the `POST /api/ai/translate` endpoint, accepting `{text, target_lang}` and returning `{translated_text}`
##### Frontend
- **`frontend/src/features/home/ui/ScriptEditor.tsx`** — added a `LANGUAGES` list (9 languages), a language dropdown (closes on outside click), a translating loading state, and a "Restore original" button (appears at the top of the menu after a translation)
- **`frontend/src/features/home/model/useHomeController.ts`** — added `handleTranslate` (calls the translation API; the first translation saves the original), an `originalText` state, and `handleRestoreOriginal` (restores the original)
#### Affected files
| File | Change | Notes |
|------|------|------|
| `backend/app/services/glm_service.py` | modified | added `translate_text()` |
| `backend/app/modules/ai/router.py` | modified | added the `/api/ai/translate` endpoint |
| `frontend/src/features/home/ui/ScriptEditor.tsx` | modified | language menu UI, translation loading, restore button |
| `frontend/src/features/home/model/useHomeController.ts` | modified | `handleTranslate`, `originalText`, `handleRestoreOriginal` |
---
### Part 5: TTS Multilingual Support
#### Background
With translation in place, users can translate the Chinese script into other languages, but video generation still only supported Chinese TTS:
- **EdgeTTS**: the voice list contained only 5 `zh-CN-*` Chinese voices
- **Voice clone (Qwen3-TTS)**: the `language` parameter was hardcoded to `"Chinese"`
#### Implementation
##### 1. Frontend: language-aware voice lists
- `VOICES` grew from a flat array into a `Record<string, VoiceOption[]>` covering 10 locales (zh-CN / en-US / ja-JP / ko-KR / fr-FR / de-DE / es-ES / ru-RU / it-IT / pt-BR), with 2 voices (male/female) per locale
- Added a `LANG_TO_LOCALE` map: translation target language name → EdgeTTS locale (`"English" → "en-US"`)
- Added a `textLang` state tracking the current script language, default `"zh-CN"`
##### 2. Automatic voice switching on translation
- On a successful `handleTranslate`: set `textLang` from the target language; in EdgeTTS mode, automatically switch `voice` to the target locale's default voice
- On `handleRestoreOriginal`: reset `textLang` to `"zh-CN"` and restore the default Chinese voice
- `VoiceSelector` shows the voice list for the current `textLang`
##### 3. Passing the language through to voice cloning
- Frontend: added a `LOCALE_TO_QWEN_LANG` map (`zh-CN→"Chinese"`, `en-US→"English"`, others→`"Auto"`)
- The generation payload gains a `language` field (voice-clone mode only)
- Backend `GenerateRequest` schema gained `language: str = "Chinese"`
- `workflow.py` replaced the hardcoded `language="Chinese"` with `language=req.language`
##### 4. Bug fix: textLang persistence
- **Problem**: `voice` was persisted but `textLang` was not; after a refresh, `voice` came back as an English voice while `textLang` defaulted to Chinese, so VoiceSelector showed the Chinese voice list with an English voice selected and no highlighted button
- **Fix**: added localStorage read/write for `textLang` in `useHomePersistence`
#### Data flow
```
user translates to "English"
→ ScriptEditor.onTranslate("English")
→ LANG_TO_LOCALE["English"] = "en-US"
→ setTextLang("en-US"), setVoice("en-US-GuyNeural")
→ VoiceSelector shows VOICES["en-US"] = [Guy, Jenny]
→ at generation time:
EdgeTTS: payload.voice = "en-US-GuyNeural"
voice clone: payload.language = "English" (via getQwenLanguage)
```
#### Affected files
| File | Change | Notes |
|------|------|------|
| `frontend/src/features/home/model/useHomeController.ts` | modified | multilingual VOICES record, textLang state, LANG_TO_LOCALE / LOCALE_TO_QWEN_LANG maps, auto voice switch on translation |
| `frontend/src/features/home/model/useHomePersistence.ts` | modified | textLang persistence |
| `backend/app/modules/videos/schemas.py` | modified | `language` field on GenerateRequest |
| `backend/app/modules/videos/workflow.py` | modified | use `req.language` instead of the hardcoded value in the voice-clone call |

Docs/DevLogs/Day23.md Normal file

@@ -0,0 +1,856 @@
## 🎙️ Voiceover-first Refactor — Phase 1 (Day 23)
### Overview
Decoupled voiceover generation from the video pipeline, enabling a new workflow: generate voiceover → select voiceover → pick materials → generate video. Users can manage voiceovers independently (generate/preview/rename/delete/select) and see duration info once one is selected, laying the data foundation for Phase 2's material timeline editing.
**Old flow**: script + pick materials → one-click generation (inline TTS → Whisper → equal split → LipSync → compose)
**New flow**: script → voiceover mode → **generate voiceover** → select voiceover → pick materials → background music → generate video
---
### Part 1: Backend — New `generated_audios` Module
#### Module layout
```
backend/app/modules/generated_audios/
├── __init__.py
├── router.py # 5 API endpoints
├── schemas.py # request/response models
└── service.py # generate/list/delete/rename
```
#### API endpoints
| Method | Path | Description |
|------|------|------|
| POST | `/api/generated-audios/generate` | generate a voiceover asynchronously (returns task_id) |
| GET | `/api/generated-audios/tasks/{task_id}` | poll generation progress |
| GET | `/api/generated-audios` | list the user's voiceovers |
| DELETE | `/api/generated-audios/{audio_id}` | delete a voiceover |
| PUT | `/api/generated-audios/{audio_id}` | rename |
#### Storage
- Supabase bucket: `generated-audios` (auto-created at startup)
- Audio file: `{user_id}/{timestamp}_audio.wav`
- Metadata file: `{user_id}/{timestamp}_audio.json` (holds display_name, text, tts_mode, duration_sec, etc.)
#### Generation flow
Reuses the existing `TTSService` / `voice_clone_service` / `task_store`:
```
POST /generate → create task → BackgroundTask:
1. edgetts → TTSService.generate_audio()
   voiceclone → download ref_audio → voice_clone_service.generate_audio()
2. ffprobe to get the duration
3. upload .wav + .json to the generated-audios bucket
4. update task (status=completed, output={audio_id, duration_sec, ...})
```
---
### Part 2: Backend — Video-generation Workflow Changes
#### New `GenerateRequest` field
```python
generated_audio_id: Optional[str] = None  # pre-generated voiceover ID; when set, inline TTS is skipped
```
#### New branch in the `workflow.py` TTS stage
```python
if req.generated_audio_id:
    # download the pre-generated voiceover + read language from its metadata
elif req.tts_mode == "voiceclone":
    # existing voice-clone path
else:
    # existing EdgeTTS path
```
Backward compatible: when `generated_audio_id` is absent, the existing inline TTS flow is unchanged.
---
### Part 3: Frontend — New Voiceover Hook + Panel
#### `useGeneratedAudios.ts`
- State: `generatedAudios[]`, `selectedAudio`, `isGeneratingAudio`, `audioTask`
- Methods: `fetchGeneratedAudios()`, `generateAudio()`, `deleteAudio()`, `renameAudio()`, `selectAudio()`
- Polling: after generation starts, polls task status every 1s; on completion, refreshes the list and auto-selects the newest voiceover
- Independent of the video-generation TaskContext; no cross-interference
#### `GeneratedAudiosPanel.tsx`
- Each entry: play/pause, name, duration, rename, delete
- Selected state: `border-purple-500 bg-purple-500/20`
- Inline progress bar (shown while generating)
- The bottom shows the selected voiceover's original script (truncated)
- Playback logic is self-contained in the panel (`new Audio()` + play/pause toggle)
---
### Part 4: Frontend — UI Panel Reordering
**Old order**: MaterialSelector → ScriptEditor → TitleSubtitle → VoiceSelector → BgmPanel → GenerateActionBar
**New order**:
1. ScriptEditor (script editing)
2. TitleSubtitlePanel (title and subtitle styles)
3. VoiceSelector (voiceover mode)
4. **GeneratedAudiosPanel** (voiceover list) ← new
5. MaterialSelector (video materials) ← moved down; unlocked only once a voiceover is selected
6. BgmPanel (background music)
7. GenerateActionBar (generate video)
#### Material-section gating
Until a voiceover is selected, the material section shows a semi-transparent overlay with a "generate and select a voiceover first" hint. Upload/preview/rename/delete remain available; only the selection checkboxes are gated.
#### Duration info
Once a voiceover is selected, the top of MaterialSelector shows:
```
Selected voiceover: 45.2 s | 3 materials selected (auto split, ~15.1 s each)
```
#### Generate-button condition update
```typescript
// old condition
disabled={isGenerating || selectedMaterials.length === 0 || (ttsMode === "voiceclone" && !selectedRefAudio)}
// new condition
disabled={isGenerating || selectedMaterials.length === 0 || !selectedAudio}
```
---
### Part 5: Persistence
`useHomePersistence` gained localStorage read/write for `selectedAudioId`, so the selected voiceover is restored after a page refresh.
---
### Affected Files
#### Backend, new
| File | Description |
|------|------|
| `backend/app/modules/generated_audios/__init__.py` | module marker |
| `backend/app/modules/generated_audios/router.py` | 5 API endpoints |
| `backend/app/modules/generated_audios/service.py` | generate/list/delete/rename |
| `backend/app/modules/generated_audios/schemas.py` | request/response models |
#### Backend, modified
| File | Change |
|------|------|
| `backend/app/main.py` | register the generated_audios router |
| `backend/app/services/storage.py` | added `BUCKET_GENERATED_AUDIOS`, auto-created at startup |
| `backend/app/modules/videos/schemas.py` | `GenerateRequest` gained `generated_audio_id` |
| `backend/app/modules/videos/workflow.py` | pre-generated-audio branch in the TTS stage |
#### Frontend, new
| File | Description |
|------|------|
| `frontend/src/features/home/model/useGeneratedAudios.ts` | voiceover list hook |
| `frontend/src/features/home/ui/GeneratedAudiosPanel.tsx` | voiceover list panel |
#### Frontend, modified
| File | Change |
|------|------|
| `frontend/src/features/home/ui/HomePage.tsx` | panel reordering + material gating + GeneratedAudiosPanel |
| `frontend/src/features/home/ui/MaterialSelector.tsx` | new `selectedAudioDuration` prop + duration display |
| `frontend/src/features/home/ui/GenerateActionBar.tsx` | disabled condition changed to `!selectedAudio` |
| `frontend/src/features/home/model/useHomeController.ts` | integrated useGeneratedAudios, added handleGenerateAudio, handleGenerate now uses generated_audio_id |
| `frontend/src/features/home/model/useHomePersistence.ts` | selectedAudioId persistence |
---
## 🎞️ 素材时间轴编排 — 第二阶段 (Day 23)
### 概述
在第一阶段"配音前置"基础上,新增**时间轴编辑器**,用户可以:
1. 在音频波形上查看各素材块的时长分配
2. 拖拽分割线调整每段素材的时长(无缝铺满,调整一段自动压缩/扩展相邻段)
3. 为每段素材设置**源视频截取起点**(从视频任意位置开始,而非始终从头)
**旧行为**: 多素材时自动均分(`_split_equal`),无法控制每段时长和源视频起始点
**新行为**: 时间轴编辑器可视化分配 + 拖拽调整 + ClipTrimmer 截取设置
---
### 一、后端改动
#### 1.1 新增 `CustomAssignment` 模型
```python
# backend/app/modules/videos/schemas.py
class CustomAssignment(BaseModel):
material_path: str
start: float # 音频时间轴起点
end: float # 音频时间轴终点
source_start: float = 0.0 # 源视频截取起点
```
`GenerateRequest` 新增 `custom_assignments: Optional[List[CustomAssignment]] = None`。存在时跳过 Whisper 均分,直接使用用户定义的分配。
#### 1.2 `prepare_segment` 支持 `source_start`
```python
def prepare_segment(self, video_path, target_duration, output_path,
target_resolution=None, source_start: float = 0.0):
```
关键逻辑:
- `source_start > 0` 时使用 `-ss` 快速 seek并强制重编码避免 stream copy 关键帧不精确)
- 当需要循环且有 `source_start` 时,先裁剪出 `source_start` 到视频结尾的片段,再循环裁剪后的文件(避免 `stream_loop` 从视频 0s 开始循环)
- 裁剪临时文件在 `finally` 中自动清理
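"先裁剪再循环"的两步命令编排可以示意如下(仅演示命令构造思路,参数组织为假设,实际 `prepare_segment` 还处理分辨率、帧率与临时文件清理):

```python
def build_segment_cmds(src: str, out: str, target: float,
                       source_start: float, source_len: float) -> list[list[str]]:
    """source_start > 0 时先裁剪出可用区间(强制重编码保证 seek 精确),
    可用时长不足 target 时再对裁剪结果循环,避免 stream_loop 从视频 0s 开始。"""
    cmds: list[list[str]] = []
    clip = src
    if source_start > 0:
        clip = out + ".clip.mp4"  # 临时文件,实际实现中在 finally 清理
        cmds.append(["ffmpeg", "-y", "-ss", str(source_start), "-i", src,
                     "-t", str(source_len), "-c:v", "libx264", clip])
    if source_len < target:       # 需要循环补足
        cmds.append(["ffmpeg", "-y", "-stream_loop", "-1", "-i", clip,
                     "-t", str(target), "-c:v", "libx264", out])
    else:                         # 直接截断到目标时长
        cmds.append(["ffmpeg", "-y", "-i", clip, "-t", str(target),
                     "-c:v", "libx264", out])
    return cmds
```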
#### 1.3 `workflow.py` 支持 `custom_assignments`
- **多素材模式**: `custom_assignments` 存在时,直接使用用户分配(仍运行 Whisper 生成字幕),每个 `prepare_segment` 调用传入 `source_start`
- **单素材模式**: `custom_assignments` 有 1 条且 `source_start > 0` 时,先截取片段再传入 LatentSync
- **向后兼容**: `custom_assignments``None` 时完全走旧路径
---
### 二、前端新增组件
#### 2.1 `useTimelineEditor.ts` — 时间轴段管理 hook
```typescript
interface TimelineSegment {
id: string; // React key
materialId: string; // 素材 ID
materialName: string; // 显示名
start: number; // 音频时间轴开始秒数
end: number; // 音频时间轴结束秒数
sourceStart: number; // 源视频截取起点(默认 0
sourceEnd: number; // 源视频截取终点0 = 到结尾)
color: string; // 色块颜色
}
```
核心方法:
- `initSegments()`: selectedMaterials 变化时按数量均分 audioDuration
- `resizeSegment(id, newEnd)`: 拖拽右边界,约束每段最小 1s
- `setSourceRange(id, sourceStart, sourceEnd)`: 设置截取范围
- `toCustomAssignments()`: 转为后端 `CustomAssignment[]` 格式
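拖拽右边界的"无缝铺满"约束逻辑示意(简化为只调整右侧相邻段的纯函数,字段名为示意,非 hook 实际代码):

```python
MIN_SEG = 1.0  # 每段最小 1 秒

def resize_segment(segments: list[dict], idx: int, new_end: float) -> None:
    """拖动第 idx 段右边界到 new_end:本段与右侧相邻段此消彼长,
    两段各自不小于 MIN_SEG,总时长保持不变。"""
    cur, nxt = segments[idx], segments[idx + 1]
    lo = cur["start"] + MIN_SEG   # 当前段不得小于最小时长
    hi = nxt["end"] - MIN_SEG     # 相邻段同样不得小于最小时长
    new_end = max(lo, min(new_end, hi))
    cur["end"] = new_end
    nxt["start"] = new_end
```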
#### 2.2 `TimelineEditor.tsx` — 波形 + 色块时间轴
- **wavesurfer.js** 渲染音频波形(仅展示,不播放)
- 色块层按比例排列,显示素材名 + 时长 + 截取标记
- 色块间分割线可拖拽(`onPointerDown/Move/Up` 实现连续像素拖拽)
- 点击色块打开 ClipTrimmer
#### 2.3 `ClipTrimmer.tsx` — 素材截取模态框
- HTML5 `<video>` 实时预览,拖拽滑块时 `video.currentTime` 跟随
- 双端 Range Slider起点/终点),互锁约束 ≥ 0.5s
- 显示截取时长 vs 分配时长对比(循环补足/截断提示)
- `loadedmetadata` 获取源视频时长
---
### 三、前端整合改动
#### 3.1 `useHomeController.ts`
- 集成 `useTimelineEditor` hook
- 新增 `clipTrimmerOpen` / `clipTrimmerSegmentId` 状态
- `handleGenerate` 多素材时始终发送 `custom_assignments`;单素材 + `sourceStart > 0` 时也发送
- 移除不再使用的 `reorderMaterials` 导出
#### 3.2 `HomePage.tsx`
- 在 MaterialSelector 和 BgmPanel 之间插入 TimelineEditor仅当有配音且已选素材时显示
- 底部新增 ClipTrimmer 模态框
- 移除 `reorderMaterials``selectedAudioDuration` prop 传递
#### 3.3 `MaterialSelector.tsx`
- 移除配音时长信息栏(功能迁至 TimelineEditor
- 移除拖拽排序区SortableChip + @dnd-kit 相关代码)
- 移除 `onReorderMaterials` / `selectedAudioDuration` prop
---
### 四、审查修复的 Bug
| # | 严重程度 | 问题 | 修复 |
|---|---------|------|------|
| 1 | **中** | `prepare_segment` 使用 `source_start > 0` + stream copy 时 seek 不精确 | 添加 `source_start > 0` 到重编码条件 |
| 2 | **高** | `stream_loop + source_start` 循环时从视频 0s 开始而非从 source_start 循环 | 改为两步:先裁剪片段再循环裁剪后的文件 |
| 3 | **低** | `useHomeController` 导出已废弃的 `reorderMaterials` | 移除 |
---
### 涉及文件汇总
#### 后端修改
| 文件 | 变更 |
|------|------|
| `backend/app/modules/videos/schemas.py` | 新增 `CustomAssignment` model`GenerateRequest` 新增 `custom_assignments` 字段 |
| `backend/app/services/video_service.py` | `prepare_segment` 新增 `source_start` 参数,循环+截取两步处理 |
| `backend/app/modules/videos/workflow.py` | 多素材/单素材流水线支持 `custom_assignments`,传递 `source_start` |
#### 前端新增
| 文件 | 说明 |
|------|------|
| `frontend/src/features/home/model/useTimelineEditor.ts` | 时间轴段管理 hook |
| `frontend/src/features/home/ui/TimelineEditor.tsx` | 波形 + 色块时间轴组件 |
| `frontend/src/features/home/ui/ClipTrimmer.tsx` | 素材截取模态框 |
#### 前端修改
| 文件 | 变更 |
|------|------|
| `frontend/src/features/home/ui/HomePage.tsx` | 插入 TimelineEditor + ClipTrimmer |
| `frontend/src/features/home/ui/MaterialSelector.tsx` | 移除时长信息 + 拖拽排序区 + 相关 prop |
| `frontend/src/features/home/model/useHomeController.ts` | 集成 useTimelineEditorhandleGenerate 发送 custom_assignments |
| `frontend/package.json` | 新增 `wavesurfer.js` 依赖 |
---
## 🎨 UI 体验优化 + TTS 稳定性修复 — 第三阶段 (Day 23)
### 概述
根据用户反馈,修复 6 项 UI 体验问题,同时修复声音克隆服务的 SoX 路径问题和显存缓存管理。
> **注**: Qwen3-TTS 已在后续被 CosyVoice 3.0 (端口 8010) 替换,以下记录为当时的修复过程。
---
### 一、Qwen3-TTS 稳定性修复 (已被 CosyVoice 3.0 替换)
#### 1.1 SoX PATH 修复
**问题**: PM2 启动 qwen-tts 时,`sox` 工具安装在 conda env 的 bin 目录中,系统 PATH 找不到,导致音频编解码走 fallback 路径CPU 密集型),日志中出现 `SoX could not be found!` 警告。
**修复**: `run_qwen_tts.sh` 中 export conda env bin 到 PATH
```bash
export PATH="/home/rongye/ProgramFiles/miniconda3/envs/qwen-tts/bin:$PATH"
```
#### 1.2 CUDA 缓存清理
**修复**: `qwen_tts_server.py` 每次生成完成后(无论成功或失败)调用 `torch.cuda.empty_cache()`,防止显存碎片累积。使用 `asyncio.to_thread()` 在线程池中运行推理,避免阻塞事件循环导致健康检查超时。
> **后续**: Qwen3-TTS 已停用CosyVoice 3.0 沿用了相同的保护机制GPU 推理锁、超时保护、显存清理、启动自检)。
---
### 二、配音列表按钮布局统一 (反馈 #1 + #6)
**问题**: `GeneratedAudiosPanel` 的试听按钮位于左侧(独立于 Edit/Delete`RefAudioPanel` 的布局不一致。底部文案摘要区域不需要展示。
**修复**:
- Play/Edit/Delete 按钮统一放在右侧同组hover 显示,顺序为 试听→重命名→删除
- 移除选中配音的文案摘要区域
- 布局与 RefAudioPanel 一致:左侧名称+时长,右侧操作按钮组
---
### 三、视频素材区域移除配音依赖遮罩 (反馈 #2)
**问题**: MaterialSelector 被 `!selectedAudio` 遮罩覆盖,必须先选配音才能操作素材。
**修复**: 移除 `HomePage.tsx` 中 MaterialSelector 外层的 disabled overlay `<div>`。素材随时可上传/预览/管理,仅 TimelineEditor 需要选中配音才显示(已有独立条件 `selectedAudio && selectedMaterials.length > 0`)。
---
### 四、时间轴拖拽排序 (反馈 #3)
**问题**: TimelineEditor 不支持调换素材顺序。
**修复**:
- `useTimelineEditor` 已有 `reorderSegments()` 方法(交换两个段的素材信息但保留时间范围)
- 通过 `useHomeController` 暴露 `reorderSegments`,传入 `TimelineEditor`
- 色块支持 HTML5 Drag & Drop`draggable` + `onDragStart/Over/Drop/End`
- 拖拽时:源色块半透明(`opacity-50`),目标色块高亮 ring`ring-2 ring-purple-400 scale-[1.02]`
- 光标样式:`cursor-grab` / `active:cursor-grabbing`
---
### 五、截取设置双手柄 Range Slider (反馈 #4)
**问题**: ClipTrimmer 使用两个独立的 `<input type="range">` 滑块,起点和终点分开操作,体验不直观。
**修复**: 改为自定义双手柄 range slider
- 单条轨道,紫色圆形手柄(起点)+ 粉色圆形手柄(终点)
- 轨道底色 `bg-white/10`,选中范围用素材对应颜色高亮
- Pointer Events 实现拖拽:`onPointerDown` 捕获手柄 → `onPointerMove` 更新位置 → `onPointerUp` 释放
- 手柄互锁约束:起点不超过终点 - 0.5s,终点不低于起点 + 0.5s
- 底部显示起点(紫色)和终点(粉色)时间标签
---
### 六、截取设置视频预览 (反馈 #5)
**问题**: ClipTrimmer 的视频只能静态查看,无法播放预览截取范围。
**修复**:
- 视频区域点击可播放/暂停Play/Pause 图标覆盖层)
- 播放范围:从 sourceStart 播放到 sourceEnd 自动停止
- 播放结束后回到起点
- 拖拽手柄时 `video.currentTime` 实时跟随seek 到当前位置查看画面)
- 播放进度条(白色竖线)叠加在 range slider 轨道上
- `preload="auto"` 预加载视频,确保拖拽时快速 seek
---
### 涉及文件汇总
#### 后端修改
| 文件 | 变更 |
|------|------|
| `run_qwen_tts.sh` | export conda env bin 到 PATH修复 SoX 找不到问题 (已停用) |
| `models/Qwen3-TTS/qwen_tts_server.py` | 每次生成后 `torch.cuda.empty_cache()`asyncio.to_thread 避免阻塞 (已停用) |
#### 前端修改
| 文件 | 变更 |
|------|------|
| `frontend/src/features/home/ui/GeneratedAudiosPanel.tsx` | 按钮布局统一Play/Edit/Delete 右侧同组),移除文案摘要 |
| `frontend/src/features/home/ui/HomePage.tsx` | 移除 MaterialSelector 配音遮罩,传入 onReorderSegment |
| `frontend/src/features/home/ui/TimelineEditor.tsx` | 新增 HTML5 Drag & Drop 排序,新增 onReorderSegment prop |
| `frontend/src/features/home/ui/ClipTrimmer.tsx` | 双手柄 range slider + 视频播放预览 + 播放进度指示 |
| `frontend/src/features/home/model/useHomeController.ts` | 暴露 reorderSegments 方法 |
---
## 📝 历史文案保存 + 时间轴拖拽修复 — 第四阶段 (Day 23)
### 概述
新增文案手动保存与加载功能,修复时间轴拖拽排序后素材时长不跟随的 Bug统一按钮视觉规范。
---
### 一、历史文案保存与加载
#### 功能
用户可手动保存当前文案到历史列表,随时从历史中加载恢复。只有手动保存的文案才出现在历史列表中,与自动保存(`useHomePersistence`)完全独立。
#### UI 布局
```
按钮栏: [历史文案▼] [文案提取助手] [AI多语言▼] [AI生成标题标签]
底部栏: 128 字 [保存文案]
```
- **历史文案下拉**: 展示已保存列表(名称 + 日期 + 删除按钮),点击条目加载文案,空列表显示"暂无保存的文案"
- **保存文案按钮**: 文案为空时 disabled点击后 `toast.success("文案已保存")`
- **预计时长已移除**: 底部栏只保留字数 + 保存按钮
#### 实现
##### `useSavedScripts.ts`(新建)
```typescript
interface SavedScript { id: string; name: string; content: string; savedAt: number }
```
- localStorage key: `vigent_{storageKey}_savedScripts`
- `saveScript(content)`: 取前 15 字符自动命名,新条目插入列表头部,**直接写入 localStorage**
- `deleteScript(id)`: 删除指定条目,直接写入 localStorage
- `useEffect([lsKey])`: lsKey 变化时guest → userId重新从 localStorage 读取
- **不使用自动持久化 effect**,避免 storageKey 切换时空数组覆盖已有数据
##### 数据流
```
ScriptEditor (UI)
↑ savedScripts / onSaveScript / onLoadScript / onDeleteScript (纯 props + callbacks)
useHomeController
├── useSavedScripts(storageKey) → { savedScripts, saveScript, deleteScript }
└── handleSaveScript() → saveScript(text) + toast
HomePage
└── 传递 props 到 ScriptEditor
```
---
### 二、时间轴拖拽排序 Bug 修复
#### 问题
拖拽调换素材顺序后,各素材的时长没有跟随素材移动,而是留在原槽位。例如素材1(3s) + 素材2(8s+4s循环),拖拽后变成素材2(3s) + 素材1(8s+4s循环),时长分配没变。
#### 根因
`reorderSegments` 使用**属性交换**方式:逐个拷贝 `materialId``sourceStart``sourceEnd` 等属性在两个槽位间交换,然后调用 `recalcPositions` 重算位置。
#### 修复
改为**数组移动**splice将整个 segment 对象从旧位置取出插入到新位置。segment 对象携带全部属性materialId、sourceStart、sourceEnd、color 等)作为一个整体移动,再由 `recalcPositions` 重算位置。
```typescript
// 修复前:属性交换
const fromMat = { materialId: next[fromIdx].materialId, ... };
const toMat = { materialId: next[toIdx].materialId, ... };
next[fromIdx] = { ...next[fromIdx], ...toMat };
next[toIdx] = { ...next[toIdx], ...fromMat };
// 修复后:数组移动
const [moved] = next.splice(fromIdx, 1);
next.splice(toIdx, 0, moved);
```
附带优势3+ 素材拖拽行为从"交换"变为"插入",更符合用户直觉。
---
### 三、按钮视觉统一
#### 问题
历史文案、文案提取助手、AI多语言、AI生成标题标签 4 个按钮高度不一致AI 按钮的文本被 `<span>` 嵌套包裹导致内部布局差异。
#### 修复
- 4 个按钮统一为 `h-7 px-2.5 text-xs rounded inline-flex items-center gap-1`(固定高度 28px
- 移除 AI多语言 / AI生成标题标签 按钮内多余的 `<span>` 嵌套,改为 `<>...</>` fragment
---
### 涉及文件汇总
#### 前端新增
| 文件 | 说明 |
|------|------|
| `frontend/src/features/home/model/useSavedScripts.ts` | 历史文案 hooklocalStorage 持久化) |
#### 前端修改
| 文件 | 变更 |
|------|------|
| `frontend/src/features/home/ui/ScriptEditor.tsx` | 历史文案下拉 + 保存按钮 + 移除预计时长 + 按钮高度统一 |
| `frontend/src/features/home/model/useHomeController.ts` | 集成 useSavedScripts新增 handleSaveScript |
| `frontend/src/features/home/ui/HomePage.tsx` | 传递 savedScripts / handleSaveScript / deleteSavedScript 到 ScriptEditor |
| `frontend/src/features/home/model/useTimelineEditor.ts` | reorderSegments 从属性交换改为数组移动splice |
---
## 🔤 字幕语言不匹配 + 视频比例错位修复 — 第五阶段 (Day 23)
### 概述
修复两个视频生成 Bug
1. **字幕语言不匹配**: 中文配音 + 英文翻译文案 → 字幕错误显示英文Whisper 独立转录,忽略原文)
2. **标题字幕比例错位**: 9:16 竖屏素材生成视频后,标题/字幕按 16:9 横屏布局渲染
附带修复代码审查中发现的 `split_word_to_chars` 英文空格丢失问题。
---
### 一、字幕用原文替换 Whisper 转录文字
#### 根因
Whisper 对音频独立转录,完全忽略传入的 `text` 参数。当配音语言与编辑器文案语言不一致时(例如:用户先写中文文案 → 翻译成英文 → 生成英文配音 → 再改回中文文案Whisper "听到"英文语音就输出英文字幕。
#### 修复思路
Whisper 仅负责检测**语音总时间范围**`first_start``last_end`),字幕文字永远用配音保存的原始文案。
#### `whisper_service.py` — `align()` 新增 `original_text` 参数
```python
async def align(self, audio_path, text, output_path=None,
language="zh", original_text=None):
```
`original_text` 非空时:
1. 正常运行 Whisper 转录,记录 `whisper_first_start``whisper_last_end`
2.`original_text` 传入 `split_word_to_chars()` 在总时间范围上线性分布
3.`split_segment_to_lines()` 按标点和字数断行
4. 替换 Whisper 的转录结果
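"线性分布"这一步的计算示意(假设按非空白字符等宽分布,实际实现中分词由 `split_word_to_chars` 完成):

```python
def distribute_text(text: str, first_start: float, last_end: float) -> list[dict]:
    """把原文非空白字符在 Whisper 检测到的语音区间内线性分配时间戳。"""
    chars = [c for c in text if c.strip()]
    if not chars:
        return []
    step = (last_end - first_start) / len(chars)
    return [
        {"char": c,
         "start": round(first_start + i * step, 3),
         "end": round(first_start + (i + 1) * step, 3)}
        for i, c in enumerate(chars)
    ]
```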
#### `workflow.py` — 配音元数据无条件覆盖 + 传入原文
```python
# 改前(只在文案为空时覆盖)
if not req.text.strip():
req.text = meta.get("text", req.text)
# 改后(无条件用配音元数据覆盖)
meta_text = meta.get("text", "")
if meta_text:
req.text = meta_text
```
所有 4 处 `whisper_service.align()` 调用添加 `original_text=req.text`
---
### 二、Remotion 动态传入视频尺寸
#### 根因
`remotion/src/Root.tsx` 硬编码 `width={1280} height={720}`。虽然 `render.ts` 用 ffprobe 检测真实尺寸后覆盖 `composition.width/height`,但 `selectComposition` 阶段组件已按 1280×720 初始化,标题和字幕定位基于错误的画布尺寸。
#### 修复
##### `Root.tsx` — `calculateMetadata` 从 props 读取尺寸
```tsx
<Composition
id="ViGentVideo"
component={Video}
durationInFrames={300}
fps={25}
width={1080}
height={1920}
calculateMetadata={async ({ props }) => ({
width: props.width || 1080,
height: props.height || 1920,
})}
defaultProps={{
videoSrc: '',
width: 1080,
height: 1920,
// ...
}}
/>
```
默认从 1280×720 改为 1080×1920竖屏优先`calculateMetadata` 确保 `selectComposition` 阶段使用 ffprobe 检测的真实尺寸。
##### `Video.tsx` — VideoProps 新增可选 `width/height`
仅供 `calculateMetadata` 访问,组件渲染不引用。
##### `render.ts` — inputProps 统一传入视频尺寸
```typescript
const inputProps = {
videoSrc: videoFileName,
captions,
title: options.title,
// ...
width: videoWidth, // ffprobe 检测值
height: videoHeight, // ffprobe 检测值
};
```
`selectComposition``renderMedia` 使用同一个 `inputProps`。保留显式 `composition.width/height` 覆盖作为保险。
---
### 三、代码审查修复:英文空格丢失
#### 问题
`split_word_to_chars` 原设计处理 Whisper 单个词(如 `" Hello"`),但 `original_text` 传入整段文本时,中间空格被 `continue` 跳过且不 flush `ascii_buffer`,导致 `"Hello World"` 变成 `"HelloWorld"`
#### 执行路径追踪
```
输入: "Hello World"
H,e,l,l,o → ascii_buffer = "Hello"
' ' → continue跳过不 flush
W,o,r,l,d → ascii_buffer = "HelloWorld"
结果: tokens = ["HelloWorld"] ← 空格丢失
```
#### 修复
遇到空格时 flush `ascii_buffer`,并用 `pending_space` 标记给下一个 token 前置空格:
```python
if not char.strip():
if ascii_buffer:
tokens.append(ascii_buffer)
ascii_buffer = ""
if tokens:
pending_space = True
continue
```
修复后:`"Hello World"` → tokens = `["Hello", " World"]` → 字幕正确显示。中文不受影响。
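修复后的分词逻辑可整理为如下可运行示意(简化版,省略了时间戳分配,只演示空格 flush 与 pending_space

```python
def split_to_tokens(text: str) -> list[str]:
    """简化版分词:中文逐字成 token,连续 ASCII 聚合成词,
    空格触发 flush 并为下一个 token 前置空格pending_space。"""
    tokens: list[str] = []
    ascii_buffer = ""
    pending_space = False

    def flush() -> None:
        nonlocal ascii_buffer, pending_space
        if ascii_buffer:
            tokens.append((" " if pending_space else "") + ascii_buffer)
            ascii_buffer = ""
            pending_space = False

    for char in text:
        if not char.strip():       # 空白flush 缓冲并标记前置空格
            flush()
            if tokens:
                pending_space = True
            continue
        if char.isascii():
            ascii_buffer += char   # 英文字符聚合成词
        else:
            flush()                # 中文等非 ASCII单字成 token
            tokens.append(char)
    flush()
    return tokens
```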
---
### 涉及文件汇总
#### 后端修改
| 文件 | 变更 |
|------|------|
| `backend/app/services/whisper_service.py` | `align()` 新增 `original_text` 参数;`split_word_to_chars` 修复英文空格丢失 |
| `backend/app/modules/videos/workflow.py` | 配音元数据无条件覆盖 text/language4 处 `align()` 调用传入 `original_text` |
#### 前端修改Remotion
| 文件 | 变更 |
|------|------|
| `remotion/src/Root.tsx` | 默认尺寸改为 1080×1920新增 `calculateMetadata` + width/height defaultProps |
| `remotion/src/Video.tsx` | VideoProps 新增可选 `width`/`height` |
| `remotion/render.ts` | inputProps 统一传入 `videoWidth`/`videoHeight`selectComposition 和 renderMedia 共用 |
---
## 🎤 参考音频自动转写 + 语速控制 — 第六阶段 (Day 23)
### 概述
解决声音克隆 ref_text 不匹配问题:旧方案使用前端固定文字作为 ref_textCosyVoice zero-shot 克隆要求 ref_text 必须与参考音频实际内容匹配,不匹配时模型会在生成音频开头"幻觉"出多余片段。
**改进**:上传参考音频时自动调用 Whisper 转写内容作为 ref_text同时新增语速控制功能。
---
### 一、Whisper 自动转写参考音频
#### 1.1 `whisper_service.py` — 语言自动检测
`transcribe()` 方法原先硬编码 `language="zh"`,改为接受可选 `language` 参数(默认 `None` = 自动检测),支持多语言参考音频。
#### 1.2 `ref_audios/service.py` — 上传时自动转写
上传流程变更:转码 WAV → 检查时长(≥1s) → 超 10s 在静音点截取 → **Whisper 自动转写** → 验证非空 → 上传。
```python
try:
transcribed = await whisper_service.transcribe(tmp_wav_path)
if transcribed.strip():
ref_text = transcribed.strip()
except Exception as e:
logger.warning(f"Auto-transcribe failed: {e}")
if not ref_text or not ref_text.strip():
raise ValueError("无法识别音频内容,请确保音频包含清晰的语音")
```
#### 1.3 `ref_audios/router.py` — ref_text 改为可选
`ref_text: str = Form("")`(不再必填),前端不再发送固定文字。
---
### 二、参考音频智能截取10 秒上限)
CosyVoice 对 3-10 秒参考音频效果最好。
#### 2.1 静音点检测
使用 ffmpeg `silencedetect` 找 10 秒内最后一个静音结束点(阈值 -30dB最短 0.3s),避免在字词中间硬切:
```python
def _find_silence_cut_point(file_path, max_duration):
# silencedetect → 解析 silence_end → 找 3s~max_duration 内最后的静音点
# 找不到则回退到 max_duration
```
#### 2.2 淡出处理
截取时末尾 0.1 秒淡出(`afade=t=out`),避免截断爆音。
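静音点解析的示意实现(假设直接从 ffmpeg stderr 文本提取 `silence_end`silencedetect 的阈值参数在调用命令中给定,此处略):

```python
import re

def find_silence_cut_point(ffmpeg_stderr: str,
                           max_duration: float = 10.0,
                           min_duration: float = 3.0) -> float:
    """从 silencedetect 日志中取 [min_duration, max_duration] 区间内
    最后一个 silence_end 作为截取点;找不到则回退到 max_duration 硬切。"""
    points = [float(p) for p in re.findall(r"silence_end:\s*([\d.]+)", ffmpeg_stderr)]
    candidates = [p for p in points if min_duration <= p <= max_duration]
    return max(candidates) if candidates else max_duration
```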
---
### 三、重新识别功能(旧数据迁移)
#### 3.1 新增 API
`POST /api/ref-audios/{audio_id}/retranscribe` — 下载音频 → 超 10s 截取 → Whisper 转写 → 重新上传音频和元数据。
#### 3.2 前端 UI
- RefAudioPanel 新增 RotateCw 按钮("重新识别文字"),转写中显示 `animate-spin`
- 旧音频 ref_text 以固定文字开头时显示 ⚠ 黄色警告
---
### 四、语速控制CosyVoice speed 参数)
#### 4.1 全链路传递
```
前端 GeneratedAudiosPanel (速度选择器)
→ useHomeController (speed state + persistence)
→ useGeneratedAudios.generateAudio(params)
→ POST /api/generated-audios/generate { speed: 1.0 }
→ GenerateAudioRequest.speed (Pydantic)
→ generate_audio_task → voice_clone_service.generate_audio(speed=)
→ _generate_once → POST /generate { speed: "1.0" }
→ cosyvoice_server → _model.inference_zero_shot(speed=speed)
```
#### 4.2 前端 UI
声音克隆模式下,配音列表面板标题栏"生成配音"按钮左侧显示语速下拉菜单(`语速: 正常 ▼`
| 标签 | speed 值 |
|------|----------|
| 较慢 | 0.8 |
| 稍慢 | 0.9 |
| 正常 | 1.0 (默认) |
| 稍快 | 1.1 |
| 较快 | 1.2 |
语速选择持久化到 localStorage`vigent_{storageKey}_speed`)。
---
### 五、缺少参考音频门控
声音克隆模式下未选参考音频时:
- "生成配音"按钮禁用 + title 提示"请先选择参考音频"
- 面板内显示黄色警告条"声音克隆模式需要先选择参考音频"
---
### 六、前端清理
- 移除 `FIXED_REF_TEXT` 常量和 `fixedRefText` prop
- 移除"请朗读以下内容"引导区块
- 上传提示简化为"上传任意语音样本3-10秒系统将自动识别内容并克隆声音"
- 录音区备注"建议 3-10 秒,超出将自动截取"
---
### 涉及文件汇总
#### 后端修改
| 文件 | 变更 |
|------|------|
| `backend/app/services/whisper_service.py` | `transcribe()` 增加可选 `language` 参数,默认 None (自动检测) |
| `backend/app/modules/ref_audios/service.py` | 上传自动转写 + 静音点截取 + 淡出 + retranscribe 函数 |
| `backend/app/modules/ref_audios/router.py` | `ref_text` 改为 Form(""),新增 retranscribe 端点 |
| `backend/app/modules/generated_audios/schemas.py` | `GenerateAudioRequest` 新增 `speed: float = 1.0` |
| `backend/app/modules/generated_audios/service.py` | 传递 `req.speed` 到 voice_clone_service |
| `backend/app/services/voice_clone_service.py` | `generate_audio()` / `_generate_once()` 接受并传递 speed |
| `models/CosyVoice/cosyvoice_server.py` | `/generate` 端点接受 `speed` 参数,传递到 `inference_zero_shot(speed=)` |
#### 前端修改
| 文件 | 变更 |
|------|------|
| `frontend/src/features/home/model/useHomeController.ts` | 新增 speed state移除 FIXED_REF_TEXThandleGenerateAudio 传 speed |
| `frontend/src/features/home/model/useHomePersistence.ts` | 新增 speed 持久化 |
| `frontend/src/features/home/model/useRefAudios.ts` | 移除 fixedRefText新增 retranscribe |
| `frontend/src/features/home/model/useGeneratedAudios.ts` | generateAudio params 新增 speed |
| `frontend/src/features/home/ui/GeneratedAudiosPanel.tsx` | 新增语速选择器 + 缺少参考音频门控 |
| `frontend/src/features/home/ui/RefAudioPanel.tsx` | 移除朗读引导,新增重新识别按钮 + ⚠ 警告 |
| `frontend/src/features/home/ui/HomePage.tsx` | 传递 speed/setSpeed/ttsMode 到 GeneratedAudiosPanel |

Docs/DevLogs/Day24.md Normal file
## 🔧 鉴权到期治理 + 多素材时间轴稳定性修复 (Day 24)
### 概述
本日主要完成两条主线:
1. **账号与鉴权治理**:会员到期改为请求时自动失效(登录/鉴权接口触发),并统一返回续费提示。
2. **视频生成稳定性**:围绕多素材时间轴、截取语义、拼接边界冻结、画面比例与字幕标题适配进行一轮端到端修复。
---
## 🔐 会员到期请求时失效 — 第一阶段 (Day 24)
### 目标
避免依赖定时任务,用户在触发登录或访问受保护接口时即可完成到期判定与账号停用。
### 行为调整
- 到期判断基于 `users.expires_at`
- 判定到期后:
-`is_active` 自动置为 `false`
- 删除该用户全部 session
- 返回 `403`,提示:`会员已到期,请续费`
### 实现点
- `users.py` 新增 `deactivate_user_if_expired()`,并补充 `_parse_expires_at()` 统一时区解析。
- `deps.py``get_current_user` / `get_current_user_optional` 中统一接入到期检查。
- `auth/router.py` 在登录路径增加到期停用逻辑;`/api/auth/me` 统一走 `Depends(get_current_user)`
---
## 🖼️ 画面比例控制 + 字幕标题适配 — 第二阶段 (Day 24)
### 2.1 输出画面比例可配置
- 时间轴顶部新增“画面比例”下拉:`9:16` / `16:9`
- 默认值 `9:16`,并持久化到 localStorage。
- 生成请求携带 `output_aspect_ratio`,后端在单素材与多素材流程中统一按目标分辨率处理。
### 2.2 标题/字幕在窄屏画布防溢出
为减少“预览正常、成片溢出”的差异,统一了预览与渲染策略:
- 根据 composition 宽度进行响应式缩放。
- 开启可换行:`white-space: normal` + `word-break` + `overflow-wrap`
- 描边、字距、上下边距同步按比例缩放。
---
## 🎥 方向归一化 + 多素材拼接稳定性 — 第三阶段 (Day 24)
### 3.1 MOV 旋转元数据导致横竖识别错误
问题场景:编码分辨率是横屏,但依赖 rotation side-data 才能正确显示为竖屏(常见于手机 MOV
修复方案:
- `get_video_metadata()` 扩展返回 `rotation/effective_width/effective_height`
- 新增 `normalize_orientation()`,在流程前对带旋转元数据素材做物理方向归一化。
- 单素材和多素材下载后统一执行方向归一化,再做分辨率决策。
### 3.2 多素材“只看到第一段”与边界冻结
针对拼接可靠性补了两类保护:
- **分配保护**`custom_assignments` 与素材数量不一致时,后端回退自动分配,避免异常输入导致仅首段生效。
- **编码一致性**
- 片段准备阶段统一重编码;
- concat 阶段不再走拷贝;
- 进一步统一为 `25fps + CFR`,并在 concat 增加 `+genpts`,降低段边界时间基不连续导致的“画面冻结口型还动”风险。
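concat 阶段的命令示意(简化版,省略音频流与分辨率参数,具体旗标组合为假设):

```python
def build_concat_cmd(list_file: str, out: str) -> list[str]:
    """concat demuxer 拼接并统一 25fps CFR
    +genpts 重建时间戳,重编码(不走 stream copy避免段边界时间基不连续。"""
    return [
        "ffmpeg", "-y", "-fflags", "+genpts",
        "-f", "concat", "-safe", "0", "-i", list_file,
        "-r", "25", "-vsync", "cfr",
        "-c:v", "libx264", "-preset", "fast",
        out,
    ]
```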
---
## ⏱️ 时间轴截取语义对齐修复 — 第四阶段 (Day 24)
### 背景
时间轴设计语义是:
- 每段可以设置 `sourceStart/sourceEnd`
- 总时长超出音频时,仅保留可见段,末段截齐音频;
- 总时长不足时,由最后可见段循环补齐。
本日将前后端对齐到这一语义。
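这一语义可以用一个纯函数示意(假设的简化形式,前端实际实现分散在 hook 内):

```python
def visible_segments(segments: list[tuple[float, float]],
                     audio_duration: float) -> list[tuple[float, float]]:
    """完全超出音频的段被丢弃,跨越边界的段截齐到音频末尾;
    总时长不足时,最后一个可见段补齐(对应该素材循环)。"""
    out: list[tuple[float, float]] = []
    for start, end in segments:
        if start >= audio_duration:
            break  # 超出时间轴的素材不参与本次生成
        out.append((start, min(end, audio_duration)))
    if out and out[-1][1] < audio_duration:
        out[-1] = (out[-1][0], audio_duration)
    return out
```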
### 4.1 `source_end` 全链路打通
此前仅传 `source_start`,导致后端无法准确知道“截到哪里”。
本次改动:
- 前端 `toCustomAssignments()` 增加可选 `source_end`
- 后端 `CustomAssignment` schema 增加 `source_end`
- workflow 将 `source_end` 透传到 `prepare_segment()`(单素材/多素材均支持)。
- `prepare_segment()` 增加 `source_end` 参数,按 `[source_start, source_end)` 计算可用片段,并在需要循环时先裁剪再循环,避免循环范围错位。
### 4.2 时间轴有效时长计算修复
修复 `sourceStart > 0 且 sourceEnd = 0` 时的有效时长错误:
- 旧逻辑会按整段素材时长计算;
- 新逻辑改为 `materialDuration - sourceStart`
该修复同时用于:
- `recalcPositions()` 的段时长计算;
- TimelineEditor 中“循环补足”可视化比例计算。
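修复后的有效时长计算示意(`sourceEnd = 0` 约定为开放终点):

```python
def effective_duration(material_duration: float,
                       source_start: float = 0.0,
                       source_end: float = 0.0) -> float:
    """截取后可用的素材时长source_end = 0 表示开放终点(到素材结尾)。"""
    end = source_end if source_end > 0 else material_duration
    return max(0.0, end - source_start)
```

旧逻辑在 `sourceStart > 0 且 sourceEnd = 0` 时按整段 `material_duration` 计算,新逻辑返回 `material_duration - source_start`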
### 4.3 可见段分配优先级修复
修复“可见段数 < 已选素材数时custom_assignments 被丢弃回退自动分配”的问题:
- 生成请求优先以时间轴可见段的 `assignments` 为准;
- 超出时间轴的素材不参与本次生成。
### 4.4 单素材截取触发条件补齐
单素材模式下,若只改了终点(`sourceEnd > 0`)也会发送 `custom_assignments`,确保截取生效。
---
## 🧭 页面交互与体验细节 — 第五阶段 (Day 24)
- 页面刷新后自动回到顶部,避免从历史滚动位置进入页面。
- 素材列表与历史视频列表滚动增加“跳过首次自动滚动”保护,减少恢复状态时页面跳动。
- 时间轴比例区移除多余文案,保持信息简洁。
---
## 涉及文件汇总
### 后端修改
| 文件 | 变更 |
|------|------|
| `backend/app/repositories/users.py` | 新增 `deactivate_user_if_expired()``_parse_expires_at()` |
| `backend/app/core/deps.py` | `get_current_user` / `get_current_user_optional` 接入到期失效检查 |
| `backend/app/modules/auth/router.py` | 登录时到期停用 + `/api/auth/me` 统一鉴权依赖 |
| `backend/app/modules/videos/schemas.py` | `CustomAssignment` 新增 `source_end`;保留 `output_aspect_ratio` |
| `backend/app/modules/videos/workflow.py` | 多素材/单素材透传 `source_end`;多素材 prepare/concat 统一 25fps |
| `backend/app/services/video_service.py` | 旋转元数据解析与方向归一化;`prepare_segment` 支持 `source_end/target_fps`concat 强制 CFR + `+genpts` |
### 前端修改
| 文件 | 变更 |
|------|------|
| `frontend/src/features/home/model/useTimelineEditor.ts` | `CustomAssignment` 新增 `source_end`;修复 sourceStart 开放终点时长计算 |
| `frontend/src/features/home/model/useHomeController.ts` | 多素材以可见 assignments 为准发送;单素材截取触发条件补齐 |
| `frontend/src/features/home/ui/TimelineEditor.tsx` | 画面比例下拉;循环比例按截取后有效时长计算 |
| `frontend/src/features/home/model/useHomePersistence.ts` | `outputAspectRatio` 持久化 |
| `frontend/src/features/home/ui/HomePage.tsx` | 页面进入滚动到顶部ClipTrimmer/Timeline 交互保持一致 |
| `frontend/src/features/home/ui/FloatingStylePreview.tsx` | 标题/字幕样式预览与成片渲染策略对齐 |
### Remotion 修改
| 文件 | 变更 |
|------|------|
| `remotion/src/components/Title.tsx` | 标题响应式缩放与自动换行,优化竖屏窄画布适配 |
| `remotion/src/components/Subtitles.tsx` | 字幕响应式缩放与自动换行,减少预览/成片差异 |
---
## 验证记录
- 后端语法检查:`python -m py_compile backend/app/modules/videos/schemas.py backend/app/modules/videos/workflow.py backend/app/services/video_service.py`
- 前端类型检查:`npx tsc --noEmit`

## 📋 文档规则优化 (16:42 - 17:10)
**问题**DOC_RULES需要优化,避免误删历史内容、规范工具使用、防止任务清单遗漏
**优化内容(最终版)**
- 移除无关项目组件
**修改文件**
- `Docs/DOC_RULES.md` - 包含检查清单的最终完善版
---

| 规则 | 说明 |
|------|------|
| **默认更新** | 更新 `DayN.md``TASK_COMPLETE.md` |
| **按需更新** | 其他文档仅在内容变化涉及时更新 |
| **智能修改** | 错误→替换,改进→追加(见下方详细规则) |
| **先读后写** | 更新前先查看文件当前内容 |
| **日内合并** | 同一天的多次小修改合并为最终版本 |
| 优先级 | 文件路径 | 检查重点 |
| :---: | :--- | :--- |
| 🔥 **High** | `Docs/DevLogs/DayN.md` | **(最新日志)** 详细记录变更、修复、代码片段 |
| 🔥 **High** | `Docs/TASK_COMPLETE.md` | **(任务总览)** 更新 `[x]`、进度条、时间线 |
| ⚡ **Med** | `README.md` | **(项目主页)** 功能特性、技术栈、最新截图 |
| ⚡ **Med** | `Docs/DEPLOY_MANUAL.md` | **(部署手册)** 环境变量、依赖包、启动命令变更 |
| ⚡ **Med** | `Docs/BACKEND_DEV.md` | **(后端规范)** 接口契约、模块划分、环境变量 |
| ⚡ **Med** | `Docs/BACKEND_README.md` | **(后端文档)** 接口说明、架构设计 |
| ⚡ **Med** | `Docs/FRONTEND_DEV.md` | **(前端规范)** API封装、日期格式化、新页面规范 |
| ⚡ **Med** | `Docs/FRONTEND_README.md` | **(前端文档)** 功能说明、页面变更 |
| 🧊 **Low** | `Docs/*_DEPLOY.md` | **(子系统部署)** LatentSync/CosyVoice/字幕等独立部署文档 |
---
```
ViGent2/Docs/
├── TASK_COMPLETE.md # 任务总览(仅按需更新)
├── DOC_RULES.md # 本文件
├── BACKEND_DEV.md # 后端开发规范
├── BACKEND_README.md # 后端功能文档
├── FRONTEND_DEV.md # 前端开发规范
├── FRONTEND_README.md # 前端功能文档
├── DEPLOY_MANUAL.md # 部署手册
├── SUPABASE_DEPLOY.md # Supabase 部署文档
├── LATENTSYNC_DEPLOY.md # LatentSync 部署文档
├── COSYVOICE3_DEPLOY.md # 声音克隆部署文档
├── SUBTITLE_DEPLOY.md # 字幕系统部署文档
└── DevLogs/
├── Day1.md # 开发日志
## 📅 DayN.md 更新规则(日常更新)
### 更新时机
> **边开发边记录,不要等到最后才写。**
- 每完成一个功能/修复后,**立即**追加到 DayN.md
- 避免积攒到对话末尾一次性补写,容易遗漏变更
- `TASK_COMPLETE.md` 同理,重要变更完成后及时同步
### 新建判断 (对话开始前)
1. **回顾进度**:查看 `TASK_COMPLETE.md` 了解当前状态
2. **检查日期**:查看最新 `DayN.md`
- **今天 (与当前日期相同)** → 🚨 **绝对禁止创建新文件**,必须**追加**到现有 `DayN.md` 末尾!即使是完全不同的功能模块。
- **之前 (昨天或更早)** → 创建 `Day{N+1}.md`
---
## 📝 TASK_COMPLETE.md 更新规则
> 与 DayN.md 同步更新,记录重要变更时更新任务总览。
### 更新原则
- **格式一致性**:直接参考 `TASK_COMPLETE.md` 现有格式追加内容。
- **进度更新**:仅在阶段性里程碑时更新进度百分比。
### 🔍 完整性检查清单 (必做)
每次更新 `TASK_COMPLETE.md` 时,必须**逐一检查**以下所有板块:
1. **文件头部 & 导航**
- [ ] `更新时间`:必须是当天日期

│ │ │ ├── useHomePersistence.ts # 持久化管理
│ │ │ ├── useBgm.ts
│ │ │ ├── useGeneratedVideos.ts
│ │ │ ├── useGeneratedAudios.ts
│ │ │ ├── useMaterials.ts
│ │ │ ├── useMediaPlayers.ts
│ │ │ ├── useRefAudios.ts
│ │ │ ├── useSavedScripts.ts
│ │ │ ├── useTimelineEditor.ts
│ │ │ └── useTitleSubtitleStyles.ts
│ │ └── ui/ # UI 组件(纯 props + 回调)
│ │ ├── HomePage.tsx
│ │ ├── HomeHeader.tsx
│ │ ├── MaterialSelector.tsx
│ │ ├── ScriptEditor.tsx
│ │ ├── ScriptExtractionModal.tsx
│ │ ├── script-extraction/
│ │ │ └── useScriptExtraction.ts
│ │ ├── TitleSubtitlePanel.tsx
│ │ ├── FloatingStylePreview.tsx
│ │ ├── VoiceSelector.tsx
│ │ ├── RefAudioPanel.tsx
│ │ ├── GeneratedAudiosPanel.tsx
│ │ ├── TimelineEditor.tsx
│ │ ├── ClipTrimmer.tsx
│ │ ├── BgmPanel.tsx
│ │ ├── GenerateActionBar.tsx
│ │ ├── PreviewPanel.tsx
│ ├── types/
│ │ ├── user.ts # User 类型定义
│ │ └── publish.ts # 发布相关类型
│ └── contexts/ # 全局 ContextAuth、Task
│ ├── AuthContext.tsx
│ └── TaskContext.tsx
├── components/ # 遗留通用组件
── VideoPreviewModal.tsx
│ └── ScriptExtractionModal.tsx
── VideoPreviewModal.tsx
└── proxy.ts # Next.js middleware路由保护
```
- `shared/lib`通用工具函数media.ts / auth.ts / title.ts
- `shared/hooks`:跨功能通用 hooks
- `shared/types`跨功能实体类型User / PublishVideo 等)
- `shared/contexts`:全局 ContextAuthContext / TaskContext
- `components/`遗留通用组件VideoPreviewModal
## 类型定义规范
- 标题字号 / 字幕字号
- 背景音乐选择 / 音量 / 开关状态
- 素材选择 / 历史作品选择
- 选中配音 ID (`selectedAudioId`)
- 语速 (`speed`,声音克隆模式)
- 时间轴段信息 (`useTimelineEditor` 的 localStorage)
### 历史文案(独立持久化)
`useSavedScripts` hook 独立管理历史文案的 localStorage 持久化:
- key: `vigent_{storageKey}_savedScripts`
- 仅在用户手动保存/删除时写入 localStorage不使用自动持久化 effect
-`useHomePersistence` 完全独立,互不影响
### 实施规范
- 使用 `storageKey = userId || 'guest'`,按用户隔离。
| 接口 | 方法 | 功能 |
|------|------|------|
| `/api/ref-audios` | POST | 上传参考音频 (multipart/form-data: fileref_text 可选,后端自动 Whisper 转写) |
| `/api/ref-audios` | GET | 列出用户的参考音频 |
| `/api/ref-audios/{id}` | PUT | 重命名参考音频 |
| `/api/ref-audios/{id}` | DELETE | 删除参考音频 (id 需 encodeURIComponent) |
| `/api/ref-audios/{id}/retranscribe` | POST | 重新识别参考音频文字Whisper 转写 + 超 10s 自动截取) |
### 视频生成 API 扩展
```typescript
await api.post('/api/videos/generate', {
text: '口播文案',
tts_mode: 'voiceclone',
ref_audio_id: 'user_id/timestamp_name.wav',
ref_text: '参考音频对应文字', // 从参考音频 metadata 自动获取
speed: 1.0, // 语速 (0.8-1.2)
});
```
```typescript
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
```
### 参考音频自动处理
- **自动转写**: 上传参考音频时后端自动调用 Whisper 转写内容作为 `ref_text`,无需用户手动输入
- **自动截取**: 参考音频超过 10 秒时自动在静音点截取前 10 秒CosyVoice 建议 3-10 秒)
- **重新识别**: 旧参考音频可通过 retranscribe 端点重新转写并截取
### UI 结构
配音方式使用 Tab 切换:
- **EdgeTTS 音色** - 预设音色 2x3 网格
- **声音克隆** - 参考音频列表 + 在线录音 + 参考文字输入
- **声音克隆** - 参考音频列表 + 在线录音 + 语速下拉菜单 (5 档: 较慢/稍慢/正常/稍快/较快)


ViGent2 的前端界面,采用 Next.js 16 + TailwindCSS 构建。
- **作品预览**: 生成完成后直接播放下载(作品预览 + 历史作品)。
- **预览优化**: 预览视频 `metadata` 预取,首帧加载更快。
- **本地保存**: 文案/标题/偏好由 `useHomePersistence` 统一持久化,刷新后恢复 (Day 14/17)。
- **历史文案**: 手动保存/加载/删除历史文案,独立 localStorage 持久化 (Day 23)。
- **选择持久化**: 首页/发布页作品选择均使用稳定 `id` 持久化,刷新保持用户选择;新视频生成后自动选中最新 (Day 21)。
- **AI 多语言翻译**: 支持 9 种目标语言翻译文案 + 还原原文 (Day 22)。
### 2. 全自动发布 (`/publish`) [Day 7 新增]
- **多平台管理**: 统一管理抖音、微信视频号、B站、小红书账号状态。
### 3. 声音克隆 [Day 13 新增]
- **TTS 模式选择**: EdgeTTS (预设音色) / 声音克隆 (自定义音色) 切换。
- **参考音频管理**: 上传/列表/重命名/删除参考音频,上传后自动 Whisper 转写 ref_text + 超 10s 自动截取
- **重新识别**: 参考音频可重新转写并截取 (RotateCw 按钮)
- **一键克隆**: 选择参考音频后自动调用 CosyVoice 3.0 服务。
- **语速控制**: 声音克隆模式下支持 5 档语速 (0.8-1.2),选择持久化 (Day 23)。
- **多语言支持**: EdgeTTS 10 语言声音列表,声音克隆 language 透传 (Day 22)。
### 4. 配音前置 + 时间轴编排 [Day 23 新增]
- **配音独立生成**: 先生成配音 → 选中配音 → 再选素材 → 生成视频。
- **配音管理面板**: 生成/试听/改名/删除/选中,异步生成 + 进度轮询。
- **时间轴编辑器**: wavesurfer.js 音频波形 + 色块可视化素材分配,拖拽分割线调整各段时长。
- **素材截取设置**: ClipTrimmer 双手柄 range slider + HTML5 视频预览播放。
- **拖拽排序**: 时间轴色块支持 HTML5 Drag & Drop 调换素材顺序。
- **自定义分配**: 后端 `custom_assignments` 支持用户定义的素材分配方案(含 `source_start/source_end` 截取区间)。
- **时间轴语义对齐**: 超出音频时仅保留可见段并截齐末段,超出段不参与生成;不足音频时最后可见段自动循环补齐。
- **画面比例控制**: 时间轴顶部支持 `9:16 / 16:9` 输出比例选择,设置持久化并透传后端。
### 5. 字幕与标题 [Day 13 新增]
- **片头标题**: 可选输入,限制 15 字,视频开头显示 3 秒淡入淡出标题。
- **标题同步**: 首页片头标题修改会同步到发布信息标题。
- **逐字高亮字幕**: 卡拉OK效果默认开启可关闭。
- **默认样式**: 标题 90px 站酷快乐体;字幕 60px 经典黄字 + DingTalkJinBuTi (Day 17)。
- **样式持久化**: 标题/字幕样式与字号刷新保留 (Day 17)。
### 6. 背景音乐 [Day 16 新增]
- **试听预览**: 点击试听即选中,音量滑块实时生效。
- **混音控制**: 仅影响 BGM配音保持原音量。
### 7. 账户设置 [Day 15 新增]
- **手机号登录**: 11位中国手机号验证登录。
- **账户下拉菜单**: 显示有效期 + 修改密码 + 安全退出。
- **修改密码**: 弹窗输入当前密码与新密码,修改后强制重新登录。
### 8. 文案提取助手 (`ScriptExtractionModal`) [Day 15 新增]
- **多源提取**: 支持文件拖拽上传与 URL 粘贴 (B站/抖音/TikTok)。
- **AI 洗稿**: 集成 GLM-4.7-Flash自动改写为口播文案。
- **一键填入**: 提取结果直接填充至视频生成输入框。
- **样式**: TailwindCSS
- **图标**: Lucide React
- **组件**: 自定义现代化组件 (Glassmorphism 风格)
- **音频波形**: wavesurfer.js (时间轴编辑器)
- **API**: Axios 实例 `@/shared/api/axios` (对接后端 FastAPI :8006)
## 🚀 开发指南

SoX could not be found!
```
**解决**:
1. 通过 conda 安装 sox
```bash
conda install -y -c conda-forge sox
```
2. 确保启动脚本 `run_qwen_tts.sh` 中已 export conda env bin 到 PATHPM2 启动时系统 PATH 不含 conda 环境目录):
```bash
export PATH="/home/rongye/ProgramFiles/miniconda3/envs/qwen-tts/bin:$PATH"
```
### CUDA 内存不足
Qwen3-TTS 1.7B 通常需要 8-10GB VRAM。如果遇到 OOM
| 日期 | 版本 | 说明 |
|------|------|------|
| 2026-02-09 | 1.2.0 | 修复 SoX PATH 问题run_qwen_tts.sh export conda bin每次生成后 empty_cache() |
| 2026-01-30 | 1.1.0 | 明确默认模型升级为 1.7B-Base替换旧版 0.6B 路径 |
---

原有流程:
文本 → EdgeTTS → 音频 → LatentSync → FFmpeg合成 → 最终视频
新流程 (单素材):
文本 → EdgeTTS/Qwen3-TTS/预生成配音 → 音频 ─┬→ LatentSync → 唇形视频 ─┐
└→ faster-whisper → 字幕JSON ─┴→ Remotion合成 → 最终视频
新流程 (多素材):
音频 → 多素材按 custom_assignments 拼接 → LatentSync (单次推理) → 唇形视频 ─┐
音频 → faster-whisper → 字幕JSON ─────────────────────────────────────────────┴→ Remotion合成 → 最终视频
```
## 系统要求
@@ -140,7 +144,7 @@ remotion/
| 阶段 | 进度 | 说明 |
|------|------|------|
| 下载素材 | 0% → 5% | 从 Supabase 下载输入视频 |
| TTS 语音生成 | 5% → 25% | EdgeTTS Qwen3-TTS 生成音频 |
| TTS 语音生成 | 5% → 25% | EdgeTTS / Qwen3-TTS / 预生成配音下载 |
| 唇形同步 | 25% → 80% | LatentSync 推理 |
| 字幕对齐 | 80% → 85% | faster-whisper 生成字级别时间戳 |
| Remotion 渲染 | 85% → 95% | 合成字幕和标题 |
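The stage-to-percentage mapping above can be sketched as a small helper that scales each stage's local progress into the overall bar. The stage names, the final upload stage (95% → 100%), and the function itself are illustrative assumptions, not the backend's actual implementation.

```python
# Stage bounds taken from the progress table; "upload" assumed to
# complete the remaining 95-100% range.
STAGES = {
    "download": (0, 5),
    "tts":      (5, 25),
    "lipsync":  (25, 80),
    "align":    (80, 85),
    "remotion": (85, 95),
    "upload":   (95, 100),
}

def overall_progress(stage: str, sub: float) -> int:
    """Map a stage-local progress value in [0, 1] to the overall percent."""
    lo, hi = STAGES[stage]
    sub = min(max(sub, 0.0), 1.0)  # clamp defensive inputs
    return round(lo + (hi - lo) * sub)
```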
@@ -282,4 +286,5 @@ WhisperService(device="cuda:0") # 或 "cuda:1"
| 日期 | 版本 | 说明 |
|------|------|------|
| 2026-01-29 | 1.0.0 | 初始版本,使用 faster-whisper + Remotion 实现逐字高亮字幕和片头标题 |
| 2026-02-10 | 1.1.0 | 更新架构图:多素材 concat-then-infer、预生成配音选项 |
| 2026-01-30 | 1.0.1 | 字幕高亮样式与标题动画优化,视觉表现更清晰 |

View File

@@ -1,8 +1,8 @@
# ViGent2 开发任务清单 (Task Log)
**项目**: ViGent2 数字人口播视频生成系统
**进度**: 100% (Day 21 - 缺陷修复与持久化回归治理)
**更新时间**: 2026-02-08
**进度**: 100% (Day 24 - 鉴权到期治理 + 多素材时间轴稳定性修复)
**更新时间**: 2026-02-11
---
@@ -10,18 +10,92 @@
> 这里记录了每一天的核心开发内容与 milestone。
### Day 21: 缺陷修复与持久化回归治理 (Current)
### Day 24: 鉴权到期治理 + 多素材时间轴稳定性修复 (Current)
- [x] **会员到期请求时失效**: 登录与鉴权接口统一执行 `expires_at` 检查;到期后自动停用账号、清理 session并返回“会员已到期,请续费”。
- [x] **画面比例控制**: 时间轴新增 `9:16 / 16:9` 输出比例选择,前端持久化并透传后端,单素材/多素材统一按目标分辨率处理。
- [x] **标题/字幕防溢出**: Remotion 与前端预览统一响应式缩放、自动换行、描边/字距/边距比例缩放,降低预览与成片差异。
- [x] **MOV 方向归一化**: 新增旋转元数据解析与 orientation normalize修复“编码横屏+旋转元数据”导致的竖屏判断偏差。
- [x] **多素材拼接稳定性**: 片段 prepare 与 concat 统一 25fps/CFRconcat 增加 `+genpts`,缓解段切换处“画面冻结口型还动”。
- [x] **时间轴语义对齐**: 打通 `source_end` 全链路;修复 `sourceStart>0 且 sourceEnd=0` 时长计算;生成时以时间轴可见段 assignments 为准,超出段不参与。
- [x] **交互细节优化**: 页面刷新回顶部;素材/历史列表首轮自动滚动抑制,减少恢复状态时页面跳动。
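The unified `expires_at` check described above boils down to one predicate applied both at login and on every authenticated request. A minimal sketch, assuming the ISO-8601 string format with a trailing `Z` that the backend normalizes to `+00:00` (the function name is illustrative):

```python
from datetime import datetime, timezone
from typing import Optional

def is_expired(expires_at: Optional[str]) -> bool:
    """True if the membership expiry timestamp is in the past.
    Accepts ISO strings ending in 'Z' or an explicit '+00:00' offset."""
    if not expires_at:
        return False  # no expiry set: never expires
    dt = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    return datetime.now(timezone.utc) > dt
```

When the predicate fires, the real code additionally deactivates the account and deletes its sessions before returning 403.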
### Day 23: 配音前置重构 + 素材时间轴编排 + UI 体验优化 + 声音克隆增强
#### 第一阶段:配音前置
- [x] **配音生成独立化**: 新增 `generated_audios` 后端模块router/schemas/service5 个 API 端点,复用现有 TTSService / voice_clone_service / task_store。
- [x] **配音管理面板**: 前端新增 `useGeneratedAudios` hook + `GeneratedAudiosPanel` 组件,支持生成/试听/改名/删除/选中。
- [x] **UI 面板重排序**: 文案 → 标题字幕 → 配音方式 → 配音列表 → 素材选择 → BGM → 生成视频。
- [x] **素材区门控**: 未选中配音时素材区显示遮罩,选中后显示配音时长 + 素材均分信息。
- [x] **视频生成对接**: workflow.py 新增预生成音频分支(`generated_audio_id`),跳过内联 TTS向后兼容。
- [x] **持久化**: selectedAudioId 加入 useHomePersistence刷新页面恢复选中配音。
#### 第二阶段:素材时间轴编排
- [x] **时间轴编辑器**: 新增 `TimelineEditor` 组件wavesurfer.js 音频波形 + 色块可视化素材分配,拖拽分割线调整各段时长。
- [x] **素材截取设置**: 新增 `ClipTrimmer` 模态框HTML5 视频预览 + 双端滑块设置源视频截取起点/终点。
- [x] **后端自定义分配**: 新增 `CustomAssignment` 模型,`prepare_segment` 支持 `source_start`workflow 多素材/单素材流水线支持 `custom_assignments`
- [x] **循环截取修复**: `stream_loop + source_start` 改为两步处理(先裁剪再循环),确保从截取起点循环而非从视频 0s 开始。
- [x] **MaterialSelector 精简**: 移除旧的时长信息栏和拖拽排序区(功能迁移到 TimelineEditor
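The loop-trim fix above (cut first, then loop, so the loop restarts at `source_start` instead of 0s) can be sketched as building two ffmpeg invocations. This is a hedged illustration: the exact flags (`-c copy` trims on keyframes and is imprecise; the real pipeline re-encodes at 25fps CFR per Day 24) and the function name are assumptions, not the project's actual command construction.

```python
from typing import List

def build_trim_then_loop_cmds(src: str, trimmed: str, out: str,
                              source_start: float,
                              target_dur: float) -> List[List[str]]:
    """Two-step fix: (1) cut the clip from source_start, (2) loop the
    cut file up to target_dur. Looping the original with -ss in one
    pass would restart each loop iteration at 0s, not source_start."""
    trim_cmd = ["ffmpeg", "-y", "-ss", str(source_start), "-i", src,
                "-c", "copy", trimmed]
    loop_cmd = ["ffmpeg", "-y", "-stream_loop", "-1", "-i", trimmed,
                "-t", str(target_dur), "-r", "25", out]
    return [trim_cmd, loop_cmd]
```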
#### 第三阶段UI 体验优化 + TTS 稳定性
- [x] **TTS SoX PATH 修复**: `run_qwen_tts.sh` export conda env bin 到 PATH (Qwen3-TTS 已停用,已被 CosyVoice 3.0 替换)。
- [x] **TTS 显存管理**: 每次生成后 `torch.cuda.empty_cache()`asyncio.to_thread 避免阻塞事件循环 (CosyVoice 沿用相同机制)。
- [x] **配音列表按钮统一**: Play/Edit/Delete 按钮右侧同组 hover 显示,与 RefAudioPanel 一致,移除文案摘要。
- [x] **素材区解除配音门控**: 移除 MaterialSelector 的 selectedAudio 遮罩,素材随时可上传管理。
- [x] **时间轴拖拽排序**: TimelineEditor 色块支持 HTML5 Drag & Drop 调换素材顺序。
- [x] **截取设置 Range Slider**: ClipTrimmer 改为单轨道双手柄(紫色起点+粉色终点),替换两个独立滑块。
- [x] **截取设置视频预览**: 视频区域可播放/暂停,从 sourceStart 到 sourceEnd 自动停止,拖拽手柄时实时 seek。
#### 第四阶段:历史文案 + Bug 修复
- [x] **历史文案保存与加载**: 新增 `useSavedScripts` hook手动保存/加载/删除历史文案,独立 localStorage 持久化。
- [x] **时间轴拖拽修复**: `reorderSegments` 从属性交换改为数组移动splice修复拖拽后时长不跟随素材的 Bug。
- [x] **按钮视觉统一**: 文案编辑区 4 个按钮统一为固定高度 `h-7`,移除多余 `<span>` 嵌套。
- [x] **底部栏调整**: "保存文案"按钮移至底部右侧,移除预计时长显示。
#### 第五阶段:字幕语言不匹配 + 视频比例错位修复
- [x] **字幕用原文替换 Whisper 转录**: `align()` 新增 `original_text` 参数,字幕文字永远用配音保存的原始文案。
- [x] **Remotion 动态视频尺寸**: `calculateMetadata` 从 props 读取真实尺寸,修复标题/字幕比例错位。
- [x] **英文空格丢失修复**: `split_word_to_chars` 遇到空格时 flush buffer + pending_space 标记。
#### 第六阶段:参考音频自动转写 + 语速控制
- [x] **Whisper 自动转写 ref_text**: 上传参考音频时自动调用 Whisper 转写内容作为 ref_text不再使用前端固定文字。
- [x] **参考音频自动截取**: 超过 10 秒自动在静音点截取ffmpeg silencedetect末尾 0.1 秒淡出避免截断爆音。
- [x] **重新识别功能**: 新增 `POST /ref-audios/{id}/retranscribe` 端点 + 前端 RotateCw 按钮,旧音频可重新转写并截取。
- [x] **语速控制**: 全链路 speed 参数(前端选择器 → 持久化 → 后端 → CosyVoice `inference_zero_shot(speed=)`5 档:较慢(0.8)/稍慢(0.9)/正常(1.0)/稍快(1.1)/较快(1.2)。
- [x] **缺少参考音频门控**: 声音克隆模式下未选参考音频时,生成配音按钮禁用 + 黄色警告提示。
- [x] **Whisper 语言自动检测**: `transcribe()` language 参数改为可选(默认 None = 自动检测),支持多语言参考音频。
- [x] **前端清理**: 移除固定 ref_text 常量、朗读引导文字,简化为"上传任意语音样本,系统将自动识别内容并克隆声音"。
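The auto-trim step above picks a cut point inside a silence so the 10-second cap does not land mid-word. A sketch of the parsing half, assuming ffmpeg `silencedetect`'s stderr format (`silence_start: <t>` lines); the helper name and the "last silence before the cap" policy are assumptions:

```python
import re
from typing import List, Optional

def pick_cut_point(silencedetect_log: str,
                   max_len: float = 10.0) -> Optional[float]:
    """Return the last silence_start at or before max_len seconds,
    or None if no silence was detected in range."""
    starts: List[float] = [
        float(m)
        for m in re.findall(r"silence_start:\s*([0-9.]+)", silencedetect_log)
    ]
    candidates = [s for s in starts if s <= max_len]
    return max(candidates) if candidates else None
```

The real pipeline then cuts at the returned timestamp and applies a 0.1s fade-out to avoid a truncation click.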
### Day 22: 多素材优化 + AI 翻译 + TTS 多语言
- [x] **多素材 Bug 修复**: 6 个高优 Bug边界溢出、单段 fallback、除零、duration 校验、Whisper 兜底、空列表检查)。
- [x] **架构重构**: 多素材从"逐段 LatentSync"重构为"先拼接再推理",推理次数 N→1。
- [x] **前端优化**: payload 安全、进度消息、上传自动选中、Material 接口统一、拖拽修复、素材上限 4 个。
- [x] **AI 多语言翻译**: 新增 `/api/ai/translate` 接口,前端 9 种语言翻译 + 还原原文。
- [x] **TTS 多语言**: EdgeTTS 10 语言声音列表、翻译自动切换声音、声音克隆 language 透传、textLang 持久化。
### Day 21: 缺陷修复 + 浮动预览 + 发布重构 + 架构优化 + 多素材生成
- [x] **Remotion 崩溃容错**: 渲染进程 SIGABRT 退出时检查输出文件,避免误判失败导致标题/字幕丢失。
- [x] **首页作品选择持久化**: 修复 `fetchGeneratedVideos` 无条件覆盖恢复值的问题,新增 `preferVideoId` 参数控制选中逻辑。
- [x] **发布页作品选择持久化**: 根因为签名 URL 不稳定,全面改用 `video.id` 替代 `path` 进行选择/持久化/比较。
- [x] **预取缓存补全**: 首页预取发布页数据时加入 `id` 字段,确保缓存数据可用于持久化匹配。
- [x] **浮动样式预览窗口**: 标题字幕预览改为 `position: fixed` 浮动窗口,固定左上角,滚动时始终可见。
- [x] **移动端适配**: ScriptEditor 按钮换行、预览默认比例改为 9:16 竖屏。
- [x] **多平台发布重构**: 平台配置独立化DOUYIN_*/WEIXIN_*)、用户隔离 Cookie 管理、抖音刷脸验证二维码、微信发布流程优化。
- [x] **前端结构微调**: ScriptExtractionModal 迁移到 features/、contexts 迁移到 shared/contexts/、清理空目录。
- [x] **后端模块分层**: materials/tools/ref_audios 三个模块补全 router+schemas+service 分层。
- [x] **开发规范更新**: BACKEND_DEV.md 新增渐进原则、DOC_RULES.md 取消 TASK_COMPLETE.md 手动触发约束。
- [x] **文档全面更新**: BACKEND_DEV/README、FRONTEND_DEV、DEPLOY_MANUAL、README.md 同步更新。
- [x] **多素材视频生成(多机位效果)**: 支持多选素材 + 拖拽排序,按素材数量均分音频时长(对齐 Whisper 字边界)自动切换机位。逐段 LatentSync + FFmpeg 拼接。前端 @dnd-kit 拖拽排序 UI。
- [x] **字幕开关移除**: 默认启用逐字高亮字幕,移除开关及相关死代码。
- [x] **视频格式扩展**: 上传支持 mkv/webm/flv/wmv/m4v/ts/mts 等常见格式。
- [x] **Watchdog 优化**: 健康检查阈值提高到 5 次,新增重启冷却期 120 秒,避免误重启。
- [x] **多素材 Bug 修复**: 修复标点分句方案对无句末标点文案无效(改为均分方案)、音频时间偏移导致口型不对齐等缺陷。
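The even-split-with-word-alignment scheme above (divide audio time equally across N materials, snapping each cut to a Whisper word boundary) can be sketched as follows. The data shape `[(word, end_time)]` and the nearest-boundary snapping are assumptions about the approach, not the actual implementation:

```python
from typing import List, Tuple

def split_by_materials(words: List[Tuple[str, float]],
                       n: int, total: float) -> List[float]:
    """Return n segment end-times: ideal cuts at total*i/n, each snapped
    to the nearest word end so camera switches never land mid-word."""
    ends = [e for _, e in words]
    bounds: List[float] = []
    for i in range(1, n):
        target = total * i / n
        bounds.append(min(ends, key=lambda e: abs(e - target)))
    return bounds + [total]
```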
### Day 20: 代码质量与安全优化
- [x] **功能性修复**: LatentSync 回退逻辑、任务状态接口认证、User 类型统一。
- [x] **性能优化**: N+1 查询修复、视频上传流式处理、httpx 异步替换、GLM 异步包装。
- [x] **安全修复**: 硬编码 Cookie 配置化、日志敏感信息脱敏、ffprobe 安全调用、CORS 配置化。
- [x] **配置优化**: 存储路径环境变量化、Remotion 预编译加速、LatentSync 绝对路径。
- [x] **文档更新**: 更新 Doc_Rules.md 清单,补齐后端与部署文档;更新 SUBTITLE_DEPLOY.md, FRONTEND_DEV.md, implementation_plan.md。
- [x] **文档更新**: 更新 DOC_RULES.md 清单,补齐后端与部署文档;更新 SUBTITLE_DEPLOY.md, FRONTEND_DEV.md, implementation_plan.md。
- [x] **缺陷修复**: 修复 Remotion 路径解析、发布页持久化竞态、首页选中回归、素材闭包陷阱。
### Day 19: 自动发布稳定性与发布体验优化 🚀
@@ -66,7 +140,7 @@
- [x] **体验细节优化**: 录音预览 URL 回收,预览弹窗滚动恢复,全局任务提示挂载。
### Day 16: 深度性能优化
- [x] **Qwen-TTS 加速**: 集成 Flash Attention 2,模型加载速度提升至 8.9s
- [x] **Qwen-TTS 加速**: 集成 Flash Attention 2 (已停用,被 CosyVoice 3.0 替换)
- [x] **服务守护**: 开发 `Watchdog` 看门狗机制,自动监控并重启僵死服务。
- [x] **LatentSync 性能确认**: 验证 DeepCache + 原生 Flash Attn 生效。
- [x] **文档重构**: 全面更新 README、部署手册及后端文档。
@@ -79,10 +153,10 @@
### Day 14: AI 增强与体验优化
- [x] **AI 标题/标签**: 集成 GLM-4API 自动生成视频元数据。
- [x] **字幕升级**: Remotion 逐字高亮字幕 (卡拉OK效果) 及动画片头。
- [x] **模型升级**: Qwen3-TTS 升级至 1.7B-Base 版本
- [x] **模型升级**: 声音克隆已迁移至 CosyVoice 3.0 (0.5B)
### Day 13: 声音克隆集成
- [x] **声音克隆微服务**: 封装 Qwen3-TTS 为独立 API (8009端口)。
- [x] **声音克隆微服务**: 封装 CosyVoice 3.0 为独立 API (8010端口替换 Qwen3-TTS)。
- [x] **参考音频管理**: Supabase 存储桶配置与管理接口。
- [x] **多模态 TTS**: 前端支持 EdgeTTS / Clone Voice 切换。
@@ -117,6 +191,7 @@
## 🛤️ 后续规划 (Roadmap)
### 🔴 优先待办
- [x] ~~**配音前置重构 — 第二阶段**: 素材片段截取 + 语音时间轴编排~~ ✅ Day 23 已完成
- [ ] **批量生成架构**: 支持 Excel 导入,批量生产视频。
- [ ] **定时任务后台化**: 迁移前端触发的定时发布到后端 APScheduler。
- [ ] **发布任务恢复机制**: 发布任务化 + 状态持久化 + 前端断点恢复,解决刷新后状态丢失。
@@ -134,7 +209,7 @@
| **核心 API** | 100% | ✅ 稳定 |
| **Web UI** | 100% | ✅ 稳定 (移动端适配) |
| **唇形同步** | 100% | ✅ LatentSync 1.6 |
| **TTS 配音** | 100% | ✅ EdgeTTS + Qwen3 |
| **TTS 配音** | 100% | ✅ EdgeTTS + CosyVoice 3.0 + 配音前置 + 时间轴编排 + 自动转写 + 语速控制 |
| **自动发布** | 100% | ✅ 抖音/微信视频号/B站/小红书 |
| **用户认证** | 100% | ✅ 手机号 + JWT |
| **部署运维** | 100% | ✅ PM2 + Watchdog |

View File

@@ -5,7 +5,7 @@
> 📹 **上传人物** · 🎙️ **输入文案** · 🎬 **一键成片**
基于 **LatentSync 1.6 + EdgeTTS** 的开源数字人口播视频生成系统。
集成 **Qwen3-TTS** 声音克隆与自动社交媒体发布功能。
集成 **CosyVoice 3.0** 声音克隆与自动社交媒体发布功能。
[功能特性](#-功能特性) • [技术栈](#-技术栈) • [文档中心](#-文档中心) • [部署指南](Docs/DEPLOY_MANUAL.md)
@@ -17,13 +17,15 @@
### 核心能力
- 🎬 **高清唇形同步** - LatentSync 1.6 驱动512×512 高分辨率 Latent Diffusion 模型。
- 🎙️ **多模态配音** - 支持 **EdgeTTS** (微软超自然语音) 和 **Qwen3-TTS** (3秒极速声音克隆)
- 🎙️ **多模态配音** - 支持 **EdgeTTS** (微软超自然语音, 10 语言) 和 **CosyVoice 3.0** (3秒极速声音克隆, 9语言+18方言, 语速可调)。上传参考音频自动 Whisper 转写 + 智能截取。配音前置工作流:先生成配音 → 选素材 → 生成视频
- 📝 **智能字幕** - 集成 faster-whisper + Remotion自动生成逐字高亮 (卡拉OK效果) 字幕。
- 🎨 **样式预设** - 标题/字幕样式选择 + 预览 + 字号调节,支持自定义字体库。
- 🖼️ **作品预览一致性** - 标题/字幕预览按素材分辨率缩放,效果更接近成片
- 💾 **用户偏好持久化** - 首页状态统一恢复/保存,刷新后延续上次配置
- 🖼️ **作品预览一致性** - 标题/字幕预览与 Remotion 成片统一响应式缩放和自动换行,窄屏画布也稳定显示
- 🎞️ **多素材多机位** - 支持多选素材 + 时间轴编辑器 (wavesurfer.js 波形可视化),拖拽分割线调整时长、拖拽排序切换机位、按 `source_start/source_end` 截取片段
- 📐 **画面比例控制** - 时间轴一键切换 `9:16 / 16:9` 输出比例,生成链路全程按目标比例处理。
- 💾 **用户偏好持久化** - 首页状态统一恢复/保存,刷新后延续上次配置。历史文案手动保存与加载。
- 🎵 **背景音乐** - 试听 + 音量控制 + 混音,保持配音音量稳定。
- 🤖 **AI 辅助创作** - 内置 GLM-4.7-Flash支持 B站/抖音链接文案提取、AI 洗稿、标题/标签自动生成。
- 🤖 **AI 辅助创作** - 内置 GLM-4.7-Flash支持 B站/抖音链接文案提取、AI 洗稿、标题/标签自动生成、9 语言翻译
### 平台化功能
- 📱 **全自动发布** - 支持抖音/微信视频号/B站/小红书立即发布;扫码登录 + Cookie 持久化。
@@ -40,11 +42,11 @@
| 领域 | 核心技术 | 说明 |
|------|----------|------|
| **前端** | Next.js 16 | TypeScript, TailwindCSS, SWR |
| **前端** | Next.js 16 | TypeScript, TailwindCSS, SWR, wavesurfer.js |
| **后端** | FastAPI | Python 3.10, AsyncIO, PM2 |
| **数据库** | Supabase | PostgreSQL, Storage (本地/S3), Auth |
| **唇形同步** | LatentSync 1.6 | PyTorch 2.5, Diffusers, DeepCache |
| **声音克隆** | Qwen3-TTS | 1.7B 参数量Flash Attention 2 加速 |
| **声音克隆** | CosyVoice 3.0 | 0.5B 参数量9 语言 + 18 方言 |
| **自动化** | Playwright | 社交媒体无头浏览器自动化 |
| **部署** | Docker & PM2 | 混合部署架构 |
@@ -56,7 +58,7 @@
### 部署运维
- **[部署手册 (DEPLOY_MANUAL.md)](Docs/DEPLOY_MANUAL.md)** - 👈 **部署请看这里**!包含完整的环境搭建步骤。
- [参考音频服务部署 (QWEN3_TTS_DEPLOY.md)](Docs/QWEN3_TTS_DEPLOY.md) - 声音克隆模型部署指南。
- [参考音频服务部署 (COSYVOICE3_DEPLOY.md)](Docs/COSYVOICE3_DEPLOY.md) - 声音克隆模型部署指南。
- [LatentSync 部署指南](models/LatentSync/DEPLOY.md) - 唇形同步模型独立部署。
- [Supabase 部署指南 (SUPABASE_DEPLOY.md)](Docs/SUPABASE_DEPLOY.md) - Supabase 与认证系统配置。
@@ -81,7 +83,7 @@ ViGent2/
├── remotion/ # Remotion 视频渲染 (标题/字幕合成)
├── models/ # AI 模型仓库
│ ├── LatentSync/ # 唇形同步服务
│ └── Qwen3-TTS/ # 声音克隆服务
│ └── CosyVoice/ # 声音克隆服务
└── Docs/ # 项目文档
```
@@ -96,7 +98,7 @@ ViGent2/
| **Web UI** | 3002 | 用户访问入口 (Next.js) |
| **Backend API** | 8006 | 核心业务接口 (FastAPI) |
| **LatentSync** | 8007 | 唇形同步推理服务 |
| **Qwen3-TTS** | 8009 | 声音克隆推理服务 |
| **CosyVoice 3.0** | 8010 | 声音克隆推理服务 |
| **Supabase** | 8008 | 数据库与认证网关 |
---
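A quick liveness sweep over the service ports in the table can be sketched as below. The `/health` paths, localhost host, and 200-means-healthy convention are assumptions for illustration; the actual Watchdog's probe logic is not shown here.

```python
import urllib.request

# Ports from the service table above; endpoint paths are assumed.
SERVICES = {
    "Web UI": 3002,
    "Backend API": 8006,
    "LatentSync": 8007,
    "CosyVoice 3.0": 8010,
    "Supabase": 8008,
}

def probe(host: str = "127.0.0.1", timeout: float = 2.0) -> dict:
    """Return {service_name: reachable} for every configured port."""
    status = {}
    for name, port in SERVICES.items():
        url = f"http://{host}:{port}/health"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                status[name] = resp.status == 200
        except Exception:
            status[name] = False  # refused / timed out / non-HTTP
    return status
```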

View File

@@ -2,11 +2,11 @@
依赖注入模块:认证和用户获取
"""
from typing import Optional, Any, Dict, cast
from fastapi import Request, HTTPException, Depends, status
from app.core.security import decode_access_token, TokenData
from app.repositories.sessions import get_session
from app.repositories.users import get_user_by_id
from loguru import logger
from fastapi import Request, HTTPException, Depends, status
from app.core.security import decode_access_token
from app.repositories.sessions import get_session, delete_sessions
from app.repositories.users import get_user_by_id, deactivate_user_if_expired
from loguru import logger
async def get_token_from_cookie(request: Request) -> Optional[str]:
@@ -35,8 +35,12 @@ async def get_current_user_optional(
logger.warning(f"Session token 无效: user_id={token_data.user_id}")
return None
user = get_user_by_id(token_data.user_id)
return cast(Optional[Dict[str, Any]], user)
user = cast(Optional[Dict[str, Any]], get_user_by_id(token_data.user_id))
if user and deactivate_user_if_expired(user):
delete_sessions(token_data.user_id)
return None
return user
except Exception as e:
logger.error(f"获取用户信息失败: {e}")
return None
@@ -82,14 +86,12 @@ async def get_current_user(
)
user = cast(Dict[str, Any], user)
if user.get("expires_at"):
from datetime import datetime, timezone
expires_at = datetime.fromisoformat(user["expires_at"].replace("Z", "+00:00"))
if datetime.now(timezone.utc) > expires_at:
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="授权已过期,请联系管理员续期"
)
if deactivate_user_if_expired(user):
delete_sessions(token_data.user_id)
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="会员已到期,请续费"
)
return user
except HTTPException:

View File

@@ -15,6 +15,7 @@ from app.modules.ref_audios.router import router as ref_audios_router
from app.modules.ai.router import router as ai_router
from app.modules.tools.router import router as tools_router
from app.modules.assets.router import router as assets_router
from app.modules.generated_audios.router import router as generated_audios_router
from loguru import logger
import os
@@ -124,6 +125,7 @@ app.include_router(ref_audios_router, prefix="/api/ref-audios", tags=["RefAudios
app.include_router(ai_router) # /api/ai
app.include_router(tools_router, prefix="/api/tools", tags=["Tools"])
app.include_router(assets_router, prefix="/api/assets", tags=["Assets"])
app.include_router(generated_audios_router, prefix="/api/generated-audios", tags=["GeneratedAudios"])
@app.on_event("startup")

View File

@@ -24,6 +24,33 @@ class GenerateMetaResponse(BaseModel):
tags: list[str]
class TranslateRequest(BaseModel):
"""翻译请求"""
text: str
target_lang: str
@router.post("/translate")
async def translate_text(req: TranslateRequest):
"""
AI 翻译文案
将文案翻译为指定目标语言
"""
if not req.text or not req.text.strip():
raise HTTPException(status_code=400, detail="文案不能为空")
if not req.target_lang or not req.target_lang.strip():
raise HTTPException(status_code=400, detail="目标语言不能为空")
try:
logger.info(f"Translating text to {req.target_lang}: {req.text[:50]}...")
translated = await glm_service.translate_text(req.text.strip(), req.target_lang.strip())
return success_response({"translated_text": translated})
except Exception as e:
logger.error(f"Translate failed: {e}")
raise HTTPException(status_code=500, detail=str(e))
@router.post("/generate-meta")
async def generate_meta(req: GenerateMetaRequest):
"""

View File

@@ -1,7 +1,7 @@
"""
认证 API注册、登录、登出、修改密码
"""
from fastapi import APIRouter, HTTPException, Response, status, Request
from fastapi import APIRouter, HTTPException, Response, status, Request, Depends
from pydantic import BaseModel, field_validator
from app.core.security import (
get_password_hash,
@@ -13,7 +13,15 @@ from app.core.security import (
decode_access_token
)
from app.repositories.sessions import create_session, delete_sessions
from app.repositories.users import create_user, get_user_by_id, get_user_by_phone, user_exists_by_phone, update_user
from app.repositories.users import (
create_user,
get_user_by_id,
get_user_by_phone,
user_exists_by_phone,
update_user,
deactivate_user_if_expired,
)
from app.core.deps import get_current_user
from app.core.response import success_response
from loguru import logger
from typing import Optional, Any, cast
@@ -130,22 +138,20 @@ async def login(request: LoginRequest, response: Response):
detail="手机号或密码错误"
)
# 检查是否激活
if not user["is_active"]:
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="账号未激活,请等待管理员审核"
)
# 检查授权是否过期
if user.get("expires_at"):
from datetime import datetime, timezone
expires_at = datetime.fromisoformat(user["expires_at"].replace("Z", "+00:00"))
if datetime.now(timezone.utc) > expires_at:
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="授权已过期,请联系管理员续期"
)
# 授权过期时自动停用账号
if deactivate_user_if_expired(user):
delete_sessions(user["id"])
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="会员已到期,请续费"
)
# 检查是否激活
if not user["is_active"]:
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="账号未激活,请等待管理员审核"
)
# 生成新的 session_token (后踢前)
session_token = generate_session_token()
@@ -258,31 +264,9 @@ async def change_password(request: ChangePasswordRequest, req: Request, response
)
@router.get("/me")
async def get_me(request: Request):
"""获取当前用户信息"""
# 从 Cookie 获取用户
token = request.cookies.get("access_token")
if not token:
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="未登录"
)
token_data = decode_access_token(token)
if not token_data:
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Token 无效"
)
user = cast(dict[str, Any], get_user_by_id(token_data.user_id) or {})
if not user:
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="用户不存在"
)
@router.get("/me")
async def get_me(user: dict = Depends(get_current_user)):
"""获取当前用户信息"""
return success_response(UserResponse(
id=user["id"],
phone=user["phone"],

View File

@@ -0,0 +1,77 @@
"""生成配音 API"""
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException
import uuid
from loguru import logger
from app.core.deps import get_current_user
from app.core.response import success_response
from app.modules.videos.task_store import create_task, get_task
from app.modules.generated_audios.schemas import GenerateAudioRequest, RenameAudioRequest
from app.modules.generated_audios import service
router = APIRouter()
@router.post("/generate")
async def generate_audio(
req: GenerateAudioRequest,
background_tasks: BackgroundTasks,
user: dict = Depends(get_current_user),
):
"""异步生成配音(返回 task_id"""
task_id = str(uuid.uuid4())
create_task(task_id, user["id"])
background_tasks.add_task(service.generate_audio_task, task_id, req, user["id"])
return success_response({"task_id": task_id})
@router.get("/tasks/{task_id}")
async def get_audio_task(task_id: str, user: dict = Depends(get_current_user)):
"""轮询配音生成进度"""
task = get_task(task_id)
if task.get("status") != "not_found" and task.get("user_id") != user["id"]:
return success_response({"status": "not_found"})
return success_response(task)
@router.get("")
async def list_audios(user: dict = Depends(get_current_user)):
"""列出当前用户所有已生成配音"""
try:
result = await service.list_generated_audios(user["id"])
return success_response(result)
except Exception as e:
logger.error(f"列出配音失败: {e}")
raise HTTPException(status_code=500, detail=f"获取列表失败: {str(e)}")
@router.delete("/{audio_id:path}")
async def delete_audio(audio_id: str, user: dict = Depends(get_current_user)):
"""删除配音"""
try:
await service.delete_generated_audio(audio_id, user["id"])
return success_response(message="删除成功")
except PermissionError as e:
raise HTTPException(status_code=403, detail=str(e))
except Exception as e:
logger.error(f"删除配音失败: {e}")
raise HTTPException(status_code=500, detail=f"删除失败: {str(e)}")
@router.put("/{audio_id:path}")
async def rename_audio(
audio_id: str,
request: RenameAudioRequest,
user: dict = Depends(get_current_user),
):
"""重命名配音"""
try:
result = await service.rename_generated_audio(audio_id, request.new_name, user["id"])
return success_response(result, message="重命名成功")
except PermissionError as e:
raise HTTPException(status_code=403, detail=str(e))
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
except Exception as e:
logger.error(f"重命名配音失败: {e}")
raise HTTPException(status_code=500, detail=f"重命名失败: {str(e)}")

View File

@@ -0,0 +1,31 @@
from pydantic import BaseModel
from typing import Optional, List
class GenerateAudioRequest(BaseModel):
text: str
tts_mode: str = "edgetts"
voice: str = "zh-CN-YunxiNeural"
ref_audio_id: Optional[str] = None
ref_text: Optional[str] = None
language: str = "zh-CN"
speed: float = 1.0
class RenameAudioRequest(BaseModel):
new_name: str
class GeneratedAudioItem(BaseModel):
id: str
name: str
path: str
duration_sec: float
text: str
tts_mode: str
language: str
created_at: int
class GeneratedAudioListResponse(BaseModel):
items: List[GeneratedAudioItem]

View File

@@ -0,0 +1,264 @@
"""生成配音 - 业务逻辑"""
import re
import json
import time
import asyncio
import subprocess
import tempfile
import os
from pathlib import Path
from typing import Optional
import httpx
from loguru import logger
from app.services.storage import storage_service
from app.services.tts_service import TTSService
from app.services.voice_clone_service import voice_clone_service
from app.modules.videos.task_store import task_store
from app.modules.generated_audios.schemas import (
GenerateAudioRequest,
GeneratedAudioItem,
GeneratedAudioListResponse,
)
BUCKET = "generated-audios"
def _locale_to_tts_lang(locale: str) -> str:
mapping = {"zh": "Chinese", "en": "English"}
return mapping.get(locale.split("-")[0], "Auto")
def _get_audio_duration(file_path: str) -> float:
try:
result = subprocess.run(
['ffprobe', '-v', 'quiet', '-show_entries', 'format=duration',
'-of', 'csv=p=0', file_path],
capture_output=True, text=True, timeout=10
)
return float(result.stdout.strip())
except Exception as e:
logger.warning(f"获取音频时长失败: {e}")
return 0.0
async def generate_audio_task(task_id: str, req: GenerateAudioRequest, user_id: str):
"""后台任务:生成配音"""
try:
task_store.update(task_id, {"status": "processing", "progress": 10, "message": "正在生成配音..."})
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
audio_path = tmp.name
try:
if req.tts_mode == "voiceclone":
if not req.ref_audio_id or not req.ref_text:
raise ValueError("声音克隆模式需要提供参考音频和参考文字")
task_store.update(task_id, {"progress": 20, "message": "正在下载参考音频..."})
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_ref:
ref_local = tmp_ref.name
try:
ref_url = await storage_service.get_signed_url(
bucket="ref-audios", path=req.ref_audio_id
)
timeout = httpx.Timeout(None)
async with httpx.AsyncClient(timeout=timeout) as client:
async with client.stream("GET", ref_url) as resp:
resp.raise_for_status()
with open(ref_local, "wb") as f:
async for chunk in resp.aiter_bytes():
f.write(chunk)
task_store.update(task_id, {"progress": 40, "message": "正在克隆声音..."})
await voice_clone_service.generate_audio(
text=req.text,
ref_audio_path=ref_local,
ref_text=req.ref_text,
output_path=audio_path,
language=_locale_to_tts_lang(req.language),
speed=req.speed,
)
finally:
if os.path.exists(ref_local):
os.unlink(ref_local)
else:
task_store.update(task_id, {"progress": 30, "message": "正在生成语音..."})
tts = TTSService()
await tts.generate_audio(req.text, req.voice, audio_path)
task_store.update(task_id, {"progress": 70, "message": "正在上传配音..."})
duration = _get_audio_duration(audio_path)
timestamp = int(time.time())
audio_id = f"{user_id}/{timestamp}_audio.wav"
meta_id = f"{user_id}/{timestamp}_audio.json"
# 生成 display_name
now = time.strftime("%Y%m%d_%H%M", time.localtime(timestamp))
display_name = f"配音_{now}"
with open(audio_path, "rb") as f:
wav_data = f.read()
await storage_service.upload_file(
bucket=BUCKET, path=audio_id,
file_data=wav_data, content_type="audio/wav",
)
metadata = {
"display_name": display_name,
"text": req.text,
"tts_mode": req.tts_mode,
"voice": req.voice if req.tts_mode == "edgetts" else None,
"ref_audio_id": req.ref_audio_id,
"language": req.language,
"duration_sec": duration,
"created_at": timestamp,
}
await storage_service.upload_file(
bucket=BUCKET, path=meta_id,
file_data=json.dumps(metadata, ensure_ascii=False).encode("utf-8"),
content_type="application/json",
)
signed_url = await storage_service.get_signed_url(BUCKET, audio_id)
task_store.update(task_id, {
"status": "completed",
"progress": 100,
"message": f"配音生成完成 ({duration:.1f}s)",
"output": {
"audio_id": audio_id,
"name": display_name,
"path": signed_url,
"duration_sec": duration,
"text": req.text,
"tts_mode": req.tts_mode,
"language": req.language,
"created_at": timestamp,
},
})
finally:
if os.path.exists(audio_path):
os.unlink(audio_path)
except Exception as e:
import traceback
task_store.update(task_id, {
"status": "failed",
"message": f"配音生成失败: {str(e)}",
"error": traceback.format_exc(),
})
logger.error(f"Generate audio failed: {e}")
async def list_generated_audios(user_id: str) -> dict:
"""列出用户的所有已生成配音"""
files = await storage_service.list_files(BUCKET, user_id)
wav_files = [f for f in files if f.get("name", "").endswith("_audio.wav")]
if not wav_files:
return GeneratedAudioListResponse(items=[]).model_dump()
async def fetch_info(f):
name = f.get("name", "")
storage_path = f"{user_id}/{name}"
meta_name = name.replace("_audio.wav", "_audio.json")
meta_path = f"{user_id}/{meta_name}"
display_name = name
text = ""
tts_mode = "edgetts"
language = "zh-CN"
duration_sec = 0.0
created_at = 0
try:
meta_url = await storage_service.get_signed_url(BUCKET, meta_path)
async with httpx.AsyncClient(timeout=5.0) as client:
resp = await client.get(meta_url)
if resp.status_code == 200:
meta = resp.json()
display_name = meta.get("display_name", name)
text = meta.get("text", "")
tts_mode = meta.get("tts_mode", "edgetts")
language = meta.get("language", "zh-CN")
duration_sec = meta.get("duration_sec", 0.0)
created_at = meta.get("created_at", 0)
except Exception as e:
logger.debug(f"读取配音 metadata 失败: {e}")
try:
created_at = int(name.split("_")[0])
except (ValueError, IndexError):
pass
signed_url = await storage_service.get_signed_url(BUCKET, storage_path)
return GeneratedAudioItem(
id=storage_path,
name=display_name,
path=signed_url,
duration_sec=duration_sec,
text=text,
tts_mode=tts_mode,
language=language,
created_at=created_at,
)
items = await asyncio.gather(*[fetch_info(f) for f in wav_files])
items = sorted(items, key=lambda x: x.created_at, reverse=True)
return GeneratedAudioListResponse(items=items).model_dump()
async def delete_generated_audio(audio_id: str, user_id: str) -> None:
if not audio_id.startswith(f"{user_id}/"):
raise PermissionError("无权删除此文件")
await storage_service.delete_file(BUCKET, audio_id)
meta_path = audio_id.replace("_audio.wav", "_audio.json")
try:
await storage_service.delete_file(BUCKET, meta_path)
except Exception:
pass
async def rename_generated_audio(audio_id: str, new_name: str, user_id: str) -> dict:
if not audio_id.startswith(f"{user_id}/"):
raise PermissionError("无权修改此文件")
new_name = new_name.strip()
if not new_name:
raise ValueError("新名称不能为空")
meta_path = audio_id.replace("_audio.wav", "_audio.json")
try:
meta_url = await storage_service.get_signed_url(BUCKET, meta_path)
async with httpx.AsyncClient() as client:
resp = await client.get(meta_url)
if resp.status_code == 200:
metadata = resp.json()
else:
raise Exception(f"Failed to fetch metadata: {resp.status_code}")
except Exception as e:
logger.warning(f"无法读取配音元数据: {e}, 将创建新的")
metadata = {
"display_name": new_name,
"text": "",
"tts_mode": "edgetts",
"language": "zh-CN",
"duration_sec": 0.0,
"created_at": int(time.time()),
}
metadata["display_name"] = new_name
await storage_service.upload_file(
bucket=BUCKET,
path=meta_path,
file_data=json.dumps(metadata, ensure_ascii=False).encode("utf-8"),
content_type="application/json",
)
return {"name": new_name}

View File

@@ -1,416 +1,62 @@
from fastapi import APIRouter, UploadFile, File, HTTPException, Request, BackgroundTasks, Depends
from app.core.config import settings
from app.core.deps import get_current_user
from app.core.response import success_response
from app.services.storage import storage_service
import re
import time
import traceback
import os
import aiofiles
from pathlib import Path
from loguru import logger
import asyncio
from pydantic import BaseModel
from typing import Optional
import httpx
from fastapi import APIRouter, HTTPException, Request, Depends
from loguru import logger
from app.core.deps import get_current_user
from app.core.response import success_response
from app.modules.materials.schemas import RenameMaterialRequest
from app.modules.materials import service
router = APIRouter()
class RenameMaterialRequest(BaseModel):
new_name: str
def sanitize_filename(filename: str) -> str:
safe_name = re.sub(r'[<>:"/\\|?*]', '_', filename)
if len(safe_name) > 100:
ext = Path(safe_name).suffix
safe_name = safe_name[:100 - len(ext)] + ext
return safe_name
async def process_and_upload(temp_file_path: str, original_filename: str, content_type: str, user_id: str):
"""Background task to strip multipart headers and upload to Supabase"""
try:
logger.info(f"Processing raw upload: {temp_file_path} for user {user_id}")
# 1. Analyze file to find actual video content (strip multipart boundaries)
# This is a simplified manual parser for a SINGLE file upload.
# Structure:
# --boundary
# Content-Disposition: form-data; name="file"; filename="..."
# Content-Type: video/mp4
# \r\n\r\n
# [DATA]
# \r\n--boundary--
# We need to read the first few KB to find the header end
start_offset = 0
end_offset = 0
boundary = b""
file_size = os.path.getsize(temp_file_path)
with open(temp_file_path, 'rb') as f:
# Read first 4KB to find header
head = f.read(4096)
# Find boundary
first_line_end = head.find(b'\r\n')
if first_line_end == -1:
raise Exception("Could not find boundary in multipart body")
boundary = head[:first_line_end] # e.g. --boundary123
logger.info(f"Detected boundary: {boundary}")
# Find end of headers (\r\n\r\n)
header_end = head.find(b'\r\n\r\n')
if header_end == -1:
raise Exception("Could not find end of multipart headers")
start_offset = header_end + 4
logger.info(f"Video data starts at offset: {start_offset}")
# Find end boundary (read from end of file)
# It should be \r\n + boundary + -- + \r\n
# We seek to end-200 bytes
f.seek(max(0, file_size - 200))
tail = f.read()
# The closing boundary is usually --boundary--
# We look for the last occurrence of the boundary
last_boundary_pos = tail.rfind(boundary)
if last_boundary_pos != -1:
# The data ends before \r\n + boundary: convert the tail-relative
# position to an absolute offset and drop the preceding CRLF
end_offset = (max(0, file_size - 200) + last_boundary_pos) - 2
else:
logger.warning("Could not find closing boundary, assuming EOF")
end_offset = file_size
logger.info(f"Video data ends at offset: {end_offset}. Total video size: {end_offset - start_offset}")
# 2. Extract the video slice and upload to Supabase. upload_file expects
# bytes, so instead of slicing the original in memory, copy the payload
# to a fresh temp file in chunks; this keeps memory bounded for large uploads.
video_path = temp_file_path + "_video.mp4"
with open(temp_file_path, 'rb') as src, open(video_path, 'wb') as dst:
src.seek(start_offset)
# Copy in chunks
bytes_to_copy = end_offset - start_offset
copied = 0
while copied < bytes_to_copy:
chunk_size = min(1024*1024*10, bytes_to_copy - copied) # 10MB chunks
chunk = src.read(chunk_size)
if not chunk:
break
dst.write(chunk)
copied += len(chunk)
logger.info(f"Extracted video content to {video_path}")
# 3. Upload to Supabase with user isolation
timestamp = int(time.time())
safe_name = re.sub(r'[^a-zA-Z0-9._-]', '', original_filename)
# 使用 user_id 作为目录前缀实现隔离
storage_path = f"{user_id}/{timestamp}_{safe_name}"
# Use storage service (this calls Supabase which might do its own http request)
# We read the cleaned video file
with open(video_path, 'rb') as f:
file_content = f.read() # Still reading into memory for simple upload call, but server has 32GB RAM so ok for 500MB
await storage_service.upload_file(
bucket=storage_service.BUCKET_MATERIALS,
path=storage_path,
file_data=file_content,
content_type=content_type
)
logger.info(f"Upload to Supabase complete: {storage_path}")
# Cleanup
os.remove(temp_file_path)
os.remove(video_path)
return storage_path
except Exception as e:
logger.error(f"Background upload processing failed: {e}\n{traceback.format_exc()}")
raise
router = APIRouter()
@router.post("")
async def upload_material(
request: Request,
background_tasks: BackgroundTasks,
current_user: dict = Depends(get_current_user)
):
user_id = current_user["id"]
logger.info(f"ENTERED upload_material (Streaming Mode) for user {user_id}. Headers: {request.headers}")
filename = "unknown_video.mp4" # Fallback
content_type = "video/mp4"
# The frontend sends standard multipart; the real filename is recovered
# later by scanning the saved body for its Content-Disposition header.
# Create temp file
timestamp = int(time.time())
temp_filename = f"upload_{timestamp}.raw"
temp_path = os.path.join("/tmp", temp_filename) # Use /tmp on Linux
# /tmp exists on Linux; fall back to a local dir on Windows dev machines
if os.name == 'nt': # Local dev
temp_path = f"d:/tmp/{temp_filename}"
os.makedirs("d:/tmp", exist_ok=True)
logger.info(f"Upload material request from user {user_id}")
try:
total_size = 0
last_log = 0
async with aiofiles.open(temp_path, 'wb') as f:
async for chunk in request.stream():
await f.write(chunk)
total_size += len(chunk)
# Log progress every 20MB
if total_size - last_log > 20 * 1024 * 1024:
logger.info(f"Receiving stream... Processed {total_size / (1024*1024):.2f} MB")
last_log = total_size
logger.info(f"Stream reception complete. Total size: {total_size} bytes. Saved to {temp_path}")
if total_size == 0:
raise HTTPException(400, "Received empty body")
# Recover the original filename by scanning the first 4 KB of the saved
# body for the multipart Content-Disposition header.
# Processing (header strip + Supabase upload) is awaited here rather than
# deferred to a background task: the client expects the file to appear in
# the materials list immediately, and the outgoing Supabase upload is
# stable enough not to risk a request timeout.
with open(temp_path, 'rb') as f:
head = f.read(4096).decode('utf-8', errors='ignore')
match = re.search(r'filename="([^"]+)"', head)
if match:
filename = match.group(1)
logger.info(f"Extracted filename from body: {filename}")
# Run processing sync (in await)
storage_path = await process_and_upload(temp_path, filename, content_type, user_id)
# Get signed URL (it exists now)
signed_url = await storage_service.get_signed_url(
bucket=storage_service.BUCKET_MATERIALS,
path=storage_path
)
size_mb = total_size / (1024 * 1024) # Approximate (includes headers)
# 从 storage_path 提取显示名
display_name = storage_path.split('/')[-1] # 去掉 user_id 前缀
if '_' in display_name:
parts = display_name.split('_', 1)
if parts[0].isdigit():
display_name = parts[1]
return success_response({
"id": storage_path,
"name": display_name,
"path": signed_url,
"size_mb": size_mb,
"type": "video"
})
result = await service.upload_material(request, user_id)
return success_response(result)
except ValueError as e:
raise HTTPException(400, str(e))
except Exception as e:
error_msg = f"Streaming upload failed: {str(e)}"
detail_msg = f"Exception: {repr(e)}\nArgs: {e.args}\n{traceback.format_exc()}"
logger.error(error_msg + "\n" + detail_msg)
# Write to debug file
try:
with open("debug_upload.log", "a") as logf:
logf.write(f"\n--- Error at {time.ctime()} ---\n")
logf.write(detail_msg)
logf.write("\n-----------------------------\n")
except Exception:
pass
if os.path.exists(temp_path):
try:
os.remove(temp_path)
except Exception:
pass
raise HTTPException(500, f"Upload failed. Check server logs. Error: {str(e)}")
raise HTTPException(500, f"Upload failed. Error: {str(e)}")
@router.get("")
async def list_materials(current_user: dict = Depends(get_current_user)):
user_id = current_user["id"]
try:
# 只列出当前用户目录下的文件
files_obj = await storage_service.list_files(
bucket=storage_service.BUCKET_MATERIALS,
path=user_id
)
semaphore = asyncio.Semaphore(8)
async def build_item(f):
name = f.get('name')
if not name or name == '.emptyFolderPlaceholder':
return None
display_name = name
if '_' in name:
parts = name.split('_', 1)
if parts[0].isdigit():
display_name = parts[1]
full_path = f"{user_id}/{name}"
async with semaphore:
signed_url = await storage_service.get_signed_url(
bucket=storage_service.BUCKET_MATERIALS,
path=full_path
)
metadata = f.get('metadata', {})
size = metadata.get('size', 0)
created_at_str = f.get('created_at', '')
created_at = 0
if created_at_str:
from datetime import datetime
try:
dt = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
created_at = int(dt.timestamp())
except Exception:
pass
return {
"id": full_path,
"name": display_name,
"path": signed_url,
"size_mb": size / (1024 * 1024),
"type": "video",
"created_at": created_at
}
tasks = [build_item(f) for f in files_obj]
results = await asyncio.gather(*tasks, return_exceptions=True)
materials = []
for item in results:
if not item:
continue
if isinstance(item, Exception):
logger.warning(f"Material signed url build failed: {item}")
continue
materials.append(item)
materials.sort(key=lambda x: x['id'], reverse=True)
return success_response({"materials": materials})
except Exception as e:
logger.error(f"List materials failed: {e}")
return success_response({"materials": []}, message="获取素材失败")
materials = await service.list_materials(user_id)
return success_response({"materials": materials})
@router.delete("/{material_id:path}")
async def delete_material(material_id: str, current_user: dict = Depends(get_current_user)):
user_id = current_user["id"]
# 验证 material_id 属于当前用户
if not material_id.startswith(f"{user_id}/"):
raise HTTPException(403, "无权删除此素材")
try:
await storage_service.delete_file(
bucket=storage_service.BUCKET_MATERIALS,
path=material_id
)
return success_response(message="素材已删除")
except Exception as e:
raise HTTPException(500, f"删除失败: {str(e)}")
@router.put("/{material_id:path}")
async def rename_material(
material_id: str,
payload: RenameMaterialRequest,
current_user: dict = Depends(get_current_user)
):
user_id = current_user["id"]
if not material_id.startswith(f"{user_id}/"):
raise HTTPException(403, "无权重命名此素材")
new_name_raw = payload.new_name.strip() if payload.new_name else ""
if not new_name_raw:
raise HTTPException(400, "新名称不能为空")
old_name = material_id.split("/", 1)[1]
old_ext = Path(old_name).suffix
base_name = Path(new_name_raw).stem if Path(new_name_raw).suffix else new_name_raw
safe_base = sanitize_filename(base_name).strip()
if not safe_base:
raise HTTPException(400, "新名称无效")
new_filename = f"{safe_base}{old_ext}"
prefix = None
if "_" in old_name:
maybe_prefix, _ = old_name.split("_", 1)
if maybe_prefix.isdigit():
prefix = maybe_prefix
if prefix:
new_filename = f"{prefix}_{new_filename}"
new_path = f"{user_id}/{new_filename}"
try:
if new_path != material_id:
await storage_service.move_file(
bucket=storage_service.BUCKET_MATERIALS,
from_path=material_id,
to_path=new_path
)
signed_url = await storage_service.get_signed_url(
bucket=storage_service.BUCKET_MATERIALS,
path=new_path
)
display_name = new_filename
if "_" in new_filename:
parts = new_filename.split("_", 1)
if parts[0].isdigit():
display_name = parts[1]
return success_response({
"id": new_path,
"name": display_name,
"path": signed_url,
}, message="重命名成功")
except Exception as e:
raise HTTPException(500, f"重命名失败: {str(e)}")
await service.delete_material(material_id, user_id)
return success_response(message="素材已删除")
except PermissionError as e:
raise HTTPException(403, str(e))
except Exception as e:
raise HTTPException(500, f"删除失败: {str(e)}")
@router.put("/{material_id:path}")
async def rename_material(
material_id: str,
payload: RenameMaterialRequest,
current_user: dict = Depends(get_current_user)
):
user_id = current_user["id"]
try:
result = await service.rename_material(material_id, payload.new_name, user_id)
return success_response(result, message="重命名成功")
except PermissionError as e:
raise HTTPException(403, str(e))
except ValueError as e:
raise HTTPException(400, str(e))
except Exception as e:
raise HTTPException(500, f"重命名失败: {str(e)}")


@@ -0,0 +1,14 @@
from pydantic import BaseModel
class RenameMaterialRequest(BaseModel):
new_name: str
class MaterialItem(BaseModel):
id: str
name: str
path: str
size_mb: float
type: str = "video"
created_at: int = 0


@@ -0,0 +1,296 @@
import re
import os
import time
import asyncio
import traceback
import aiofiles
from pathlib import Path
from loguru import logger
from app.services.storage import storage_service
def sanitize_filename(filename: str) -> str:
safe_name = re.sub(r'[<>:"/\\|?*]', '_', filename)
if len(safe_name) > 100:
ext = Path(safe_name).suffix
safe_name = safe_name[:100 - len(ext)] + ext
return safe_name
def _extract_display_name(storage_name: str) -> str:
"""从存储文件名中提取显示名(去掉时间戳前缀)"""
if '_' in storage_name:
parts = storage_name.split('_', 1)
if parts[0].isdigit():
return parts[1]
return storage_name
async def _process_and_upload(temp_file_path: str, original_filename: str, content_type: str, user_id: str) -> str:
"""Strip multipart headers and upload to Supabase, return storage_path"""
try:
logger.info(f"Processing raw upload: {temp_file_path} for user {user_id}")
file_size = os.path.getsize(temp_file_path)
with open(temp_file_path, 'rb') as f:
head = f.read(4096)
first_line_end = head.find(b'\r\n')
if first_line_end == -1:
raise Exception("Could not find boundary in multipart body")
boundary = head[:first_line_end]
logger.info(f"Detected boundary: {boundary}")
header_end = head.find(b'\r\n\r\n')
if header_end == -1:
raise Exception("Could not find end of multipart headers")
start_offset = header_end + 4
logger.info(f"Video data starts at offset: {start_offset}")
f.seek(max(0, file_size - 200))
tail = f.read()
last_boundary_pos = tail.rfind(boundary)
if last_boundary_pos != -1:
end_offset = (max(0, file_size - 200) + last_boundary_pos) - 2
else:
logger.warning("Could not find closing boundary, assuming EOF")
end_offset = file_size
logger.info(f"Video data ends at offset: {end_offset}. Total video size: {end_offset - start_offset}")
video_path = temp_file_path + "_video.mp4"
with open(temp_file_path, 'rb') as src, open(video_path, 'wb') as dst:
src.seek(start_offset)
bytes_to_copy = end_offset - start_offset
copied = 0
while copied < bytes_to_copy:
chunk_size = min(1024 * 1024 * 10, bytes_to_copy - copied)
chunk = src.read(chunk_size)
if not chunk:
break
dst.write(chunk)
copied += len(chunk)
logger.info(f"Extracted video content to {video_path}")
timestamp = int(time.time())
safe_name = re.sub(r'[^a-zA-Z0-9._-]', '', original_filename)
storage_path = f"{user_id}/{timestamp}_{safe_name}"
with open(video_path, 'rb') as f:
file_content = f.read()
await storage_service.upload_file(
bucket=storage_service.BUCKET_MATERIALS,
path=storage_path,
file_data=file_content,
content_type=content_type
)
logger.info(f"Upload to Supabase complete: {storage_path}")
os.remove(temp_file_path)
os.remove(video_path)
return storage_path
except Exception as e:
logger.error(f"Background upload processing failed: {e}\n{traceback.format_exc()}")
raise
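The boundary/offset arithmetic in `_process_and_upload` can be exercised in isolation. A minimal sketch against a synthetic multipart body — `strip_multipart` is an illustrative helper, not part of this module:

```python
def strip_multipart(body: bytes) -> bytes:
    # First line is the boundary marker, e.g. b'------WebKitFormBoundary...'
    boundary = body[:body.find(b'\r\n')]
    # Part headers end at the first blank line; payload starts right after
    start = body.find(b'\r\n\r\n') + 4
    # Closing boundary sits near EOF, preceded by \r\n
    tail_pos = body.rfind(boundary)
    end = tail_pos - 2 if tail_pos > start else len(body)
    return body[start:end]

payload = b'\x00\x01FAKE-MP4-DATA\x02'
body = (b'------XYZ\r\n'
        b'Content-Disposition: form-data; name="file"; filename="a.mp4"\r\n'
        b'Content-Type: video/mp4\r\n'
        b'\r\n'
        + payload +
        b'\r\n------XYZ--\r\n')
assert strip_multipart(body) == payload
```

The same slicing applied above on disk avoids loading the whole upload into memory at once.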
async def upload_material(request, user_id: str) -> dict:
"""接收流式上传并存储到 Supabase返回素材信息"""
filename = "unknown_video.mp4"
content_type = "video/mp4"
timestamp = int(time.time())
temp_filename = f"upload_{timestamp}.raw"
temp_path = os.path.join("/tmp", temp_filename)
if os.name == 'nt':
temp_path = f"d:/tmp/{temp_filename}"
os.makedirs("d:/tmp", exist_ok=True)
try:
total_size = 0
last_log = 0
async with aiofiles.open(temp_path, 'wb') as f:
async for chunk in request.stream():
await f.write(chunk)
total_size += len(chunk)
if total_size - last_log > 20 * 1024 * 1024:
logger.info(f"Receiving stream... Processed {total_size / (1024*1024):.2f} MB")
last_log = total_size
logger.info(f"Stream reception complete. Total size: {total_size} bytes. Saved to {temp_path}")
if total_size == 0:
raise ValueError("Received empty body")
with open(temp_path, 'rb') as f:
head = f.read(4096).decode('utf-8', errors='ignore')
match = re.search(r'filename="([^"]+)"', head)
if match:
filename = match.group(1)
logger.info(f"Extracted filename from body: {filename}")
storage_path = await _process_and_upload(temp_path, filename, content_type, user_id)
signed_url = await storage_service.get_signed_url(
bucket=storage_service.BUCKET_MATERIALS,
path=storage_path
)
size_mb = total_size / (1024 * 1024)
display_name = _extract_display_name(storage_path.split('/')[-1])
return {
"id": storage_path,
"name": display_name,
"path": signed_url,
"size_mb": size_mb,
"type": "video"
}
except Exception as e:
error_msg = f"Streaming upload failed: {str(e)}"
detail_msg = f"Exception: {repr(e)}\nArgs: {e.args}\n{traceback.format_exc()}"
logger.error(error_msg + "\n" + detail_msg)
try:
with open("debug_upload.log", "a") as logf:
logf.write(f"\n--- Error at {time.ctime()} ---\n")
logf.write(detail_msg)
logf.write("\n-----------------------------\n")
except Exception:
pass
if os.path.exists(temp_path):
try:
os.remove(temp_path)
except Exception:
pass
raise
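The receive loop above reduces to a small pattern that runs without FastAPI; here a plain async generator stands in for `request.stream()` and all names are illustrative:

```python
import asyncio

async def save_stream(chunks, write):
    # Mirror of the receive loop: persist each chunk, track the running total
    total = 0
    async for chunk in chunks:
        await write(chunk)
        total += len(chunk)
    return total

async def fake_stream():
    # Stand-in for FastAPI's request.stream()
    for part in (b'a' * 10, b'b' * 5):
        yield part

buf = bytearray()

async def write(chunk):
    buf.extend(chunk)

total = asyncio.run(save_stream(fake_stream(), write))
print(total, len(buf))
```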
async def list_materials(user_id: str) -> list[dict]:
"""列出用户的所有素材"""
try:
files_obj = await storage_service.list_files(
bucket=storage_service.BUCKET_MATERIALS,
path=user_id
)
semaphore = asyncio.Semaphore(8)
async def build_item(f):
name = f.get('name')
if not name or name == '.emptyFolderPlaceholder':
return None
display_name = _extract_display_name(name)
full_path = f"{user_id}/{name}"
async with semaphore:
signed_url = await storage_service.get_signed_url(
bucket=storage_service.BUCKET_MATERIALS,
path=full_path
)
metadata = f.get('metadata', {})
size = metadata.get('size', 0)
created_at_str = f.get('created_at', '')
created_at = 0
if created_at_str:
from datetime import datetime
try:
dt = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
created_at = int(dt.timestamp())
except Exception:
pass
return {
"id": full_path,
"name": display_name,
"path": signed_url,
"size_mb": size / (1024 * 1024),
"type": "video",
"created_at": created_at
}
tasks = [build_item(f) for f in files_obj]
results = await asyncio.gather(*tasks, return_exceptions=True)
materials = []
for item in results:
if not item:
continue
if isinstance(item, Exception):
logger.warning(f"Material signed url build failed: {item}")
continue
materials.append(item)
materials.sort(key=lambda x: x['id'], reverse=True)
return materials
except Exception as e:
logger.error(f"List materials failed: {e}")
return []
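The signed-URL fan-out in `list_materials` uses a standard semaphore-bounded `gather`; a self-contained sketch of the same pattern, with a stub in place of `get_signed_url` (helper names are illustrative):

```python
import asyncio

async def bounded_gather(items, worker, limit=8):
    # At most `limit` workers in flight at once; exceptions are
    # returned as values instead of cancelling the whole batch.
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(run(i) for i in items), return_exceptions=True)

async def fake_sign(name):
    # Stand-in for storage_service.get_signed_url
    await asyncio.sleep(0.001)
    return f"https://example.invalid/{name}?signed=1"

urls = asyncio.run(bounded_gather([f"clip{i}.mp4" for i in range(20)], fake_sign))
print(len(urls))
```

The limit of 8 matches the semaphore used above and keeps concurrent Supabase sign requests from piling up.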
async def delete_material(material_id: str, user_id: str) -> None:
"""删除素材"""
if not material_id.startswith(f"{user_id}/"):
raise PermissionError("无权删除此素材")
await storage_service.delete_file(
bucket=storage_service.BUCKET_MATERIALS,
path=material_id
)
async def rename_material(material_id: str, new_name_raw: str, user_id: str) -> dict:
"""重命名素材,返回更新后的素材信息"""
if not material_id.startswith(f"{user_id}/"):
raise PermissionError("无权重命名此素材")
new_name_raw = new_name_raw.strip() if new_name_raw else ""
if not new_name_raw:
raise ValueError("新名称不能为空")
old_name = material_id.split("/", 1)[1]
old_ext = Path(old_name).suffix
base_name = Path(new_name_raw).stem if Path(new_name_raw).suffix else new_name_raw
safe_base = sanitize_filename(base_name).strip()
if not safe_base:
raise ValueError("新名称无效")
new_filename = f"{safe_base}{old_ext}"
prefix = None
if "_" in old_name:
maybe_prefix, _ = old_name.split("_", 1)
if maybe_prefix.isdigit():
prefix = maybe_prefix
if prefix:
new_filename = f"{prefix}_{new_filename}"
new_path = f"{user_id}/{new_filename}"
if new_path != material_id:
await storage_service.move_file(
bucket=storage_service.BUCKET_MATERIALS,
from_path=material_id,
to_path=new_path
)
signed_url = await storage_service.get_signed_url(
bucket=storage_service.BUCKET_MATERIALS,
path=new_path
)
display_name = _extract_display_name(new_filename)
return {
"id": new_path,
"name": display_name,
"path": signed_url,
}
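The timestamp-prefix scheme shared by upload, list and rename can be round-tripped in a few lines (helper names are illustrative, mirroring the logic above):

```python
import re

def make_storage_name(filename: str, ts: int) -> str:
    # Upload-side scheme: strip unsafe characters, prefix a timestamp
    safe = re.sub(r'[^a-zA-Z0-9._-]', '', filename)
    return f"{ts}_{safe}"

def extract_display_name(storage_name: str) -> str:
    # List/rename-side inverse: drop a purely numeric prefix
    parts = storage_name.split('_', 1)
    if len(parts) == 2 and parts[0].isdigit():
        return parts[1]
    return storage_name

stored = make_storage_name("my clip.mp4", 1700000000)
print(stored, "->", extract_display_name(stored))
```

One quirk of the scheme: a stored file that lacks the timestamp prefix but happens to start with digits and an underscore (e.g. `2024_report.mp4`) would have that segment stripped from its display name.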


@@ -1,240 +1,27 @@
"""
参考音频管理 API
支持上传/列表/删除参考音频,用于 Qwen3-TTS 声音克隆
"""
"""参考音频管理 API"""
from fastapi import APIRouter, UploadFile, File, Form, HTTPException, Depends
from pydantic import BaseModel
from typing import List, Optional
from pathlib import Path
from loguru import logger
import time
import json
import subprocess
import tempfile
import os
import re
from app.core.deps import get_current_user
from app.services.storage import storage_service
from app.core.response import success_response
from app.modules.ref_audios.schemas import RenameRequest
from app.modules.ref_audios import service
router = APIRouter()
# 支持的音频格式
ALLOWED_AUDIO_EXTENSIONS = {'.wav', '.mp3', '.m4a', '.webm', '.ogg', '.flac', '.aac'}
# 参考音频 bucket
BUCKET_REF_AUDIOS = "ref-audios"
class RefAudioResponse(BaseModel):
id: str
name: str
path: str # signed URL for playback
ref_text: str
duration_sec: float
created_at: int
class RefAudioListResponse(BaseModel):
items: List[RefAudioResponse]
def sanitize_filename(filename: str) -> str:
"""清理文件名,移除特殊字符"""
safe_name = re.sub(r'[<>:"/\\|?*\s]', '_', filename)
if len(safe_name) > 50:
ext = Path(safe_name).suffix
safe_name = safe_name[:50 - len(ext)] + ext
return safe_name
def get_audio_duration(file_path: str) -> float:
"""获取音频时长 (秒)"""
try:
result = subprocess.run(
['ffprobe', '-v', 'quiet', '-show_entries', 'format=duration',
'-of', 'csv=p=0', file_path],
capture_output=True, text=True, timeout=10
)
return float(result.stdout.strip())
except Exception as e:
logger.warning(f"获取音频时长失败: {e}")
return 0.0
def convert_to_wav(input_path: str, output_path: str) -> bool:
"""将音频转换为 WAV 格式 (16kHz, mono)"""
try:
subprocess.run([
'ffmpeg', '-y', '-i', input_path,
'-ar', '16000', # 16kHz 采样率
'-ac', '1', # 单声道
'-acodec', 'pcm_s16le', # 16-bit PCM
output_path
], capture_output=True, timeout=60, check=True)
return True
except Exception as e:
logger.error(f"音频转换失败: {e}")
return False
@router.post("")
async def upload_ref_audio(
file: UploadFile = File(...),
ref_text: str = Form(...),
ref_text: str = Form(""),
user: dict = Depends(get_current_user)
):
"""
上传参考音频
- file: 音频文件 (支持 wav, mp3, m4a, webm 等)
- ref_text: 参考音频的转写文字 (必填)
"""
user_id = user["id"]
if not file.filename:
raise HTTPException(status_code=400, detail="文件名无效")
filename = file.filename
# 验证文件扩展名
ext = Path(filename).suffix.lower()
if ext not in ALLOWED_AUDIO_EXTENSIONS:
raise HTTPException(
status_code=400,
detail=f"不支持的音频格式: {ext}。支持的格式: {', '.join(ALLOWED_AUDIO_EXTENSIONS)}"
)
# 验证 ref_text
if not ref_text or len(ref_text.strip()) < 2:
raise HTTPException(status_code=400, detail="参考文字不能为空")
"""上传参考音频"""
try:
# 创建临时文件
with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp_input:
content = await file.read()
tmp_input.write(content)
tmp_input_path = tmp_input.name
# 转换为 WAV 格式
tmp_wav_path = tmp_input_path + ".wav"
if ext != '.wav':
if not convert_to_wav(tmp_input_path, tmp_wav_path):
raise HTTPException(status_code=500, detail="音频格式转换失败")
else:
# 即使是 wav 也要标准化格式
convert_to_wav(tmp_input_path, tmp_wav_path)
# 获取音频时长
duration = get_audio_duration(tmp_wav_path)
if duration < 1.0:
raise HTTPException(status_code=400, detail="音频时长过短,至少需要 1 秒")
if duration > 60.0:
raise HTTPException(status_code=400, detail="音频时长过长,最多 60 秒")
# 3. 处理重名逻辑 (Friendly Display Name)
original_name = filename
# 获取用户现有的所有参考音频列表 (为了检查文件名冲突)
# 注意: 这种列表方式在文件极多时性能一般,但考虑到单用户参考音频数量有限,目前可行
existing_files = await storage_service.list_files(BUCKET_REF_AUDIOS, user_id)
existing_names = set()
# Without a metadata database we cannot cheaply enumerate existing display
# names (that would mean downloading every metadata JSON). Lightweight
# compromise: count files already stored as *_{original_name}; if N exist,
# suffix the new display name with "(N)". Storage paths themselves stay
# unique via the timestamp prefix.
dup_count = 0
search_suffix = f"_{original_name}" # 比如 _test.wav
for f in existing_files:
fname = f.get('name', '')
if fname.endswith(search_suffix):
dup_count += 1
final_display_name = original_name
if dup_count > 0:
name_stem = Path(original_name).stem
name_ext = Path(original_name).suffix
final_display_name = f"{name_stem}({dup_count}){name_ext}"
# 生成存储路径 (唯一ID)
timestamp = int(time.time())
safe_name = sanitize_filename(Path(filename).stem)
storage_path = f"{user_id}/{timestamp}_{safe_name}.wav"
# 上传 WAV 文件到 Supabase
with open(tmp_wav_path, 'rb') as f:
wav_data = f.read()
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS,
path=storage_path,
file_data=wav_data,
content_type="audio/wav"
)
# 上传元数据 JSON
metadata = {
"ref_text": ref_text.strip(),
"original_filename": final_display_name, # 这里的名字如果有重复会自动加(1)
"duration_sec": duration,
"created_at": timestamp
}
metadata_path = f"{user_id}/{timestamp}_{safe_name}.json"
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS,
path=metadata_path,
file_data=json.dumps(metadata, ensure_ascii=False).encode('utf-8'),
content_type="application/json"
)
# 获取签名 URL
signed_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, storage_path)
# 清理临时文件
os.unlink(tmp_input_path)
if os.path.exists(tmp_wav_path):
os.unlink(tmp_wav_path)
return success_response(RefAudioResponse(
id=storage_path,
name=filename,
path=signed_url,
ref_text=ref_text.strip(),
duration_sec=duration,
created_at=timestamp
).model_dump())
except HTTPException:
raise
result = await service.upload_ref_audio(file, ref_text, user["id"])
return success_response(result)
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
except Exception as e:
logger.error(f"上传参考音频失败: {e}")
raise HTTPException(status_code=500, detail=f"上传失败: {str(e)}")
@@ -243,81 +30,9 @@ async def upload_ref_audio(
@router.get("")
async def list_ref_audios(user: dict = Depends(get_current_user)):
"""列出当前用户的所有参考音频"""
user_id = user["id"]
try:
# 列出用户目录下的文件
files = await storage_service.list_files(BUCKET_REF_AUDIOS, user_id)
# 过滤出 .wav 文件
wav_files = [f for f in files if f.get("name", "").endswith(".wav")]
if not wav_files:
return success_response(RefAudioListResponse(items=[]).model_dump())
# 并发获取所有 metadata 和签名 URL
async def fetch_audio_info(f):
"""获取单个音频的信息metadata + signed URL"""
name = f.get("name", "")
storage_path = f"{user_id}/{name}"
metadata_name = name.replace(".wav", ".json")
metadata_path = f"{user_id}/{metadata_name}"
ref_text = ""
duration_sec = 0.0
created_at = 0
original_filename = ""
try:
# 获取 metadata 内容
metadata_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, metadata_path)
import httpx
async with httpx.AsyncClient(timeout=5.0) as client:
resp = await client.get(metadata_url)
if resp.status_code == 200:
metadata = resp.json()
ref_text = metadata.get("ref_text", "")
duration_sec = metadata.get("duration_sec", 0.0)
created_at = metadata.get("created_at", 0)
original_filename = metadata.get("original_filename", "")
except Exception as e:
logger.debug(f"读取 metadata 失败: {e}")
# 从文件名提取时间戳
try:
created_at = int(name.split("_")[0])
except Exception:
pass
# 获取音频签名 URL
signed_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, storage_path)
# 优先显示原始文件名 (去掉时间戳前缀)
display_name = original_filename if original_filename else name
# 如果原始文件名丢失,尝试从现有文件名中通过正则去掉时间戳
if not display_name or display_name == name:
# 匹配 "1234567890_filename.wav"
match = re.match(r'^\d+_(.+)$', name)
if match:
display_name = match.group(1)
return RefAudioResponse(
id=storage_path,
name=display_name,
path=signed_url,
ref_text=ref_text,
duration_sec=duration_sec,
created_at=created_at
)
# 使用 asyncio.gather 并发获取所有音频信息
import asyncio
items = await asyncio.gather(*[fetch_audio_info(f) for f in wav_files])
# 按创建时间倒序排列
items = sorted(items, key=lambda x: x.created_at, reverse=True)
return success_response(RefAudioListResponse(items=items).model_dump())
result = await service.list_ref_audios(user["id"])
return success_response(result)
except Exception as e:
logger.error(f"列出参考音频失败: {e}")
raise HTTPException(status_code=500, detail=f"获取列表失败: {str(e)}")
@@ -326,96 +41,48 @@ async def list_ref_audios(user: dict = Depends(get_current_user)):
@router.delete("/{audio_id:path}")
async def delete_ref_audio(audio_id: str, user: dict = Depends(get_current_user)):
"""删除参考音频"""
user_id = user["id"]
# 安全检查:确保只能删除自己的文件
if not audio_id.startswith(f"{user_id}/"):
raise HTTPException(status_code=403, detail="无权删除此文件")
try:
# 删除 WAV 文件
await storage_service.delete_file(BUCKET_REF_AUDIOS, audio_id)
# 删除 metadata JSON
metadata_path = audio_id.replace(".wav", ".json")
try:
await storage_service.delete_file(BUCKET_REF_AUDIOS, metadata_path)
except Exception:
pass # metadata 可能不存在
await service.delete_ref_audio(audio_id, user["id"])
return success_response(message="删除成功")
except PermissionError as e:
raise HTTPException(status_code=403, detail=str(e))
except Exception as e:
logger.error(f"删除参考音频失败: {e}")
raise HTTPException(status_code=500, detail=f"删除失败: {str(e)}")
class RenameRequest(BaseModel):
new_name: str
@router.put("/{audio_id:path}")
async def rename_ref_audio(
audio_id: str,
request: RenameRequest,
user: dict = Depends(get_current_user)
):
"""重命名参考音频 (修改 metadata 中的 display name)"""
user_id = user["id"]
# 安全检查
if not audio_id.startswith(f"{user_id}/"):
raise HTTPException(status_code=403, detail="无权修改此文件")
new_name = request.new_name.strip()
if not new_name:
raise HTTPException(status_code=400, detail="新名称不能为空")
# 确保新名称有后缀 (保留原后缀或添加 .wav)
if not Path(new_name).suffix:
new_name += ".wav"
"""重命名参考音频"""
try:
# 1. 下载现有的 metadata
metadata_path = audio_id.replace(".wav", ".json")
try:
# 获取已有的 JSON
import httpx
metadata_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, metadata_path)
if not metadata_url:
# 如果 json 不存在,则需要新建一个基础的
raise Exception("Metadata not found")
async with httpx.AsyncClient() as client:
resp = await client.get(metadata_url)
if resp.status_code == 200:
metadata = resp.json()
else:
raise Exception(f"Failed to fetch metadata: {resp.status_code}")
except Exception as e:
logger.warning(f"无法读取元数据: {e}, 将创建新的元数据")
# 兜底:如果读取失败,构建最小元数据
metadata = {
"ref_text": "", # 可能丢失
"duration_sec": 0.0,
"created_at": int(time.time()),
"original_filename": new_name
}
# 2. 更新 original_filename
metadata["original_filename"] = new_name
# 3. 覆盖上传 metadata
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS,
path=metadata_path,
file_data=json.dumps(metadata, ensure_ascii=False).encode('utf-8'),
content_type="application/json"
)
return success_response({"name": new_name}, message="重命名成功")
result = await service.rename_ref_audio(audio_id, request.new_name, user["id"])
return success_response(result, message="重命名成功")
except PermissionError as e:
raise HTTPException(status_code=403, detail=str(e))
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
except Exception as e:
logger.error(f"重命名失败: {e}")
raise HTTPException(status_code=500, detail=f"重命名失败: {str(e)}")
@router.post("/{audio_id:path}/retranscribe")
async def retranscribe_ref_audio(
audio_id: str,
user: dict = Depends(get_current_user)
):
"""重新识别参考音频的文字内容"""
try:
result = await service.retranscribe_ref_audio(audio_id, user["id"])
return success_response(result, message="识别完成")
except PermissionError as e:
raise HTTPException(status_code=403, detail=str(e))
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
except Exception as e:
logger.error(f"重新识别失败: {e}")
raise HTTPException(status_code=500, detail=f"识别失败: {str(e)}")


@@ -0,0 +1,19 @@
from pydantic import BaseModel
from typing import List
class RefAudioResponse(BaseModel):
id: str
name: str
path: str
ref_text: str
duration_sec: float
created_at: int
class RefAudioListResponse(BaseModel):
items: List[RefAudioResponse]
class RenameRequest(BaseModel):
new_name: str


@@ -0,0 +1,395 @@
import re
import os
import time
import json
import asyncio
import subprocess
import tempfile
from pathlib import Path
from typing import Optional
import httpx
from loguru import logger
from app.services.storage import storage_service
from app.modules.ref_audios.schemas import RefAudioResponse, RefAudioListResponse
ALLOWED_AUDIO_EXTENSIONS = {'.wav', '.mp3', '.m4a', '.webm', '.ogg', '.flac', '.aac'}
BUCKET_REF_AUDIOS = "ref-audios"
def sanitize_filename(filename: str) -> str:
"""清理文件名,移除特殊字符"""
safe_name = re.sub(r'[<>:"/\\|?*\s]', '_', filename)
if len(safe_name) > 50:
ext = Path(safe_name).suffix
safe_name = safe_name[:50 - len(ext)] + ext
return safe_name
def _get_audio_duration(file_path: str) -> float:
"""获取音频时长 (秒)"""
try:
result = subprocess.run(
['ffprobe', '-v', 'quiet', '-show_entries', 'format=duration',
'-of', 'csv=p=0', file_path],
capture_output=True, text=True, timeout=10
)
return float(result.stdout.strip())
except Exception as e:
logger.warning(f"获取音频时长失败: {e}")
return 0.0
def _find_silence_cut_point(file_path: str, max_duration: float) -> float:
"""在 max_duration 附近找一个静音点作为截取位置,找不到则回退到 max_duration"""
try:
# 用 silencedetect 找所有静音段(阈值 -30dB最短 0.3 秒)
result = subprocess.run(
['ffmpeg', '-i', file_path, '-af',
'silencedetect=noise=-30dB:d=0.3', '-f', 'null', '-'],
capture_output=True, text=True, timeout=30
)
# 解析 silence_end 时间点
ends = [float(m) for m in re.findall(r'silence_end:\s*([\d.]+)', result.stderr)]
# 找 max_duration 之前最后一个静音结束点(至少 3 秒)
candidates = [t for t in ends if 3.0 <= t <= max_duration]
if candidates:
cut = candidates[-1]
logger.info(f"Found silence cut point at {cut:.1f}s (max={max_duration}s)")
return cut
except Exception as e:
logger.warning(f"Silence detection failed: {e}")
return max_duration
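`_find_silence_cut_point` parses ffmpeg's `silencedetect` log lines from stderr; the parsing and cut-point selection can be checked against a canned sample in that log format:

```python
import re

# Canned stderr in ffmpeg's silencedetect output format
sample_stderr = """\
[silencedetect @ 0x5591] silence_start: 2.48
[silencedetect @ 0x5591] silence_end: 2.91 | silence_duration: 0.43
[silencedetect @ 0x5591] silence_start: 8.10
[silencedetect @ 0x5591] silence_end: 8.55 | silence_duration: 0.45
"""

ends = [float(m) for m in re.findall(r'silence_end:\s*([\d.]+)', sample_stderr)]
# Last silence end in the [3s, max] window, same rule as above
candidates = [t for t in ends if 3.0 <= t <= 10.0]
cut = candidates[-1] if candidates else 10.0
print(ends, cut)
```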
def _convert_to_wav(input_path: str, output_path: str, max_duration: float = 0) -> bool:
"""将音频转换为 WAV 格式 (16kHz, mono),可选截取前 max_duration 秒并淡出"""
try:
cmd = ['ffmpeg', '-y', '-i', input_path]
if max_duration > 0:
cmd += ['-t', str(max_duration)]
# 末尾 0.1 秒淡出,避免截断爆音
fade_start = max(0, max_duration - 0.1)
cmd += ['-af', f'afade=t=out:st={fade_start}:d=0.1']
cmd += ['-ar', '16000', '-ac', '1', '-acodec', 'pcm_s16le', output_path]
subprocess.run(cmd, capture_output=True, timeout=60, check=True)
return True
except Exception as e:
logger.error(f"音频转换失败: {e}")
return False
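The exact ffmpeg invocation `_convert_to_wav` builds for a 10-second cut can be previewed without running ffmpeg — a dry-run sketch of the same argument construction (`build_wav_cmd` is illustrative):

```python
def build_wav_cmd(input_path: str, output_path: str, max_duration: float = 0.0):
    # Same argument construction as _convert_to_wav, minus the subprocess call
    cmd = ['ffmpeg', '-y', '-i', input_path]
    if max_duration > 0:
        cmd += ['-t', str(max_duration)]
        # 0.1s fade-out at the end to avoid a click at the cut
        fade_start = max(0, max_duration - 0.1)
        cmd += ['-af', f'afade=t=out:st={fade_start}:d=0.1']
    cmd += ['-ar', '16000', '-ac', '1', '-acodec', 'pcm_s16le', output_path]
    return cmd

print(' '.join(build_wav_cmd('in.m4a', 'out.wav', max_duration=10.0)))
```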
async def upload_ref_audio(file, ref_text: str, user_id: str) -> dict:
"""上传参考音频:转码、获取时长、存储到 Supabase"""
if not file.filename:
raise ValueError("文件名无效")
filename = file.filename
ext = Path(filename).suffix.lower()
if ext not in ALLOWED_AUDIO_EXTENSIONS:
raise ValueError(f"不支持的音频格式: {ext}。支持的格式: {', '.join(ALLOWED_AUDIO_EXTENSIONS)}")
# 创建临时文件
with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp_input:
content = await file.read()
tmp_input.write(content)
tmp_input_path = tmp_input.name
try:
# 转换为 WAV 格式
tmp_wav_path = tmp_input_path + ".wav"
if not _convert_to_wav(tmp_input_path, tmp_wav_path):
raise RuntimeError("音频格式转换失败")
# 获取音频时长
duration = _get_audio_duration(tmp_wav_path)
if duration < 1.0:
raise ValueError("音频时长过短,至少需要 1 秒")
# 超过 10 秒自动在静音点截取CosyVoice 对 3-10 秒效果最好)
MAX_REF_DURATION = 10.0
if duration > MAX_REF_DURATION:
cut_point = _find_silence_cut_point(tmp_wav_path, MAX_REF_DURATION)
logger.info(f"Ref audio {duration:.1f}s > {MAX_REF_DURATION}s, trimming at {cut_point:.1f}s")
trimmed_path = tmp_input_path + "_trimmed.wav"
if not _convert_to_wav(tmp_wav_path, trimmed_path, max_duration=cut_point):
raise RuntimeError("音频截取失败")
os.unlink(tmp_wav_path)
tmp_wav_path = trimmed_path
duration = _get_audio_duration(tmp_wav_path)
# 自动转写参考音频内容
try:
from app.services.whisper_service import whisper_service
transcribed = await whisper_service.transcribe(tmp_wav_path)
if transcribed.strip():
ref_text = transcribed.strip()
logger.info(f"Auto-transcribed ref audio: {ref_text[:50]}...")
except Exception as e:
logger.warning(f"Auto-transcribe failed: {e}")
if not ref_text or not ref_text.strip():
raise ValueError("无法识别音频内容,请确保音频包含清晰的语音")
        # 检查重名(按 sanitize 后的存储名后缀统计已有同名文件)
        safe_name = sanitize_filename(Path(filename).stem)
        existing_files = await storage_service.list_files(BUCKET_REF_AUDIOS, user_id)
        dup_count = 0
        search_suffix = f"_{safe_name}.wav"
        for f in existing_files:
            fname = f.get('name', '')
            if fname.endswith(search_suffix):
                dup_count += 1
        final_display_name = filename
        if dup_count > 0:
            name_stem = Path(filename).stem
            name_ext = Path(filename).suffix
            final_display_name = f"{name_stem}({dup_count}){name_ext}"
        # 生成存储路径
        timestamp = int(time.time())
        storage_path = f"{user_id}/{timestamp}_{safe_name}.wav"
# 上传 WAV 文件
with open(tmp_wav_path, 'rb') as f:
wav_data = f.read()
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS,
path=storage_path,
file_data=wav_data,
content_type="audio/wav"
)
# 上传元数据 JSON
metadata = {
"ref_text": ref_text.strip(),
"original_filename": final_display_name,
"duration_sec": duration,
"created_at": timestamp
}
metadata_path = f"{user_id}/{timestamp}_{safe_name}.json"
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS,
path=metadata_path,
file_data=json.dumps(metadata, ensure_ascii=False).encode('utf-8'),
content_type="application/json"
)
# 获取签名 URL
signed_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, storage_path)
return RefAudioResponse(
id=storage_path,
name=filename,
path=signed_url,
ref_text=ref_text.strip(),
duration_sec=duration,
created_at=timestamp
).model_dump()
    finally:
        os.unlink(tmp_input_path)
        # 清理转码与截取产物(截取时 tmp_wav_path 已指向 _trimmed.wav原清理漏掉了它
        for leftover in (tmp_input_path + ".wav", tmp_input_path + "_trimmed.wav"):
            if os.path.exists(leftover):
                os.unlink(leftover)
async def list_ref_audios(user_id: str) -> dict:
"""列出用户的所有参考音频"""
files = await storage_service.list_files(BUCKET_REF_AUDIOS, user_id)
wav_files = [f for f in files if f.get("name", "").endswith(".wav")]
if not wav_files:
return RefAudioListResponse(items=[]).model_dump()
async def fetch_audio_info(f):
name = f.get("name", "")
storage_path = f"{user_id}/{name}"
metadata_name = name.replace(".wav", ".json")
metadata_path = f"{user_id}/{metadata_name}"
ref_text = ""
duration_sec = 0.0
created_at = 0
original_filename = ""
try:
metadata_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, metadata_path)
async with httpx.AsyncClient(timeout=5.0) as client:
resp = await client.get(metadata_url)
if resp.status_code == 200:
metadata = resp.json()
ref_text = metadata.get("ref_text", "")
duration_sec = metadata.get("duration_sec", 0.0)
created_at = metadata.get("created_at", 0)
original_filename = metadata.get("original_filename", "")
except Exception as e:
logger.debug(f"读取 metadata 失败: {e}")
            try:
                created_at = int(name.split("_")[0])
            except (ValueError, IndexError):
                pass
signed_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, storage_path)
display_name = original_filename if original_filename else name
if not display_name or display_name == name:
match = re.match(r'^\d+_(.+)$', name)
if match:
display_name = match.group(1)
return RefAudioResponse(
id=storage_path,
name=display_name,
path=signed_url,
ref_text=ref_text,
duration_sec=duration_sec,
created_at=created_at
)
items = await asyncio.gather(*[fetch_audio_info(f) for f in wav_files])
items = sorted(items, key=lambda x: x.created_at, reverse=True)
return RefAudioListResponse(items=items).model_dump()
async def delete_ref_audio(audio_id: str, user_id: str) -> None:
"""删除参考音频及其元数据"""
if not audio_id.startswith(f"{user_id}/"):
raise PermissionError("无权删除此文件")
await storage_service.delete_file(BUCKET_REF_AUDIOS, audio_id)
metadata_path = audio_id.replace(".wav", ".json")
    try:
        await storage_service.delete_file(BUCKET_REF_AUDIOS, metadata_path)
    except Exception as e:
        logger.warning(f"删除元数据失败(已忽略): {e}")
async def rename_ref_audio(audio_id: str, new_name: str, user_id: str) -> dict:
"""重命名参考音频(修改 metadata 中的 display name"""
if not audio_id.startswith(f"{user_id}/"):
raise PermissionError("无权修改此文件")
new_name = new_name.strip()
if not new_name:
raise ValueError("新名称不能为空")
if not Path(new_name).suffix:
new_name += ".wav"
# 下载现有 metadata
metadata_path = audio_id.replace(".wav", ".json")
try:
metadata_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, metadata_path)
async with httpx.AsyncClient() as client:
resp = await client.get(metadata_url)
if resp.status_code == 200:
metadata = resp.json()
else:
raise Exception(f"Failed to fetch metadata: {resp.status_code}")
except Exception as e:
logger.warning(f"无法读取元数据: {e}, 将创建新的元数据")
metadata = {
"ref_text": "",
"duration_sec": 0.0,
"created_at": int(time.time()),
"original_filename": new_name
}
# 更新并覆盖上传
metadata["original_filename"] = new_name
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS,
path=metadata_path,
file_data=json.dumps(metadata, ensure_ascii=False).encode('utf-8'),
content_type="application/json"
)
return {"name": new_name}
async def retranscribe_ref_audio(audio_id: str, user_id: str) -> dict:
"""重新转写参考音频的 ref_text并截取前 10 秒重新上传(用于迁移旧数据)"""
if not audio_id.startswith(f"{user_id}/"):
raise PermissionError("无权修改此文件")
# 下载音频到临时文件
audio_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, audio_id)
tmp_wav_path = None
trimmed_path = None
try:
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
tmp_wav_path = tmp.name
timeout = httpx.Timeout(None)
async with httpx.AsyncClient(timeout=timeout) as client:
async with client.stream("GET", audio_url) as resp:
resp.raise_for_status()
async for chunk in resp.aiter_bytes():
tmp.write(chunk)
# 超过 10 秒则截取前 10 秒并重新上传音频
MAX_REF_DURATION = 10.0
duration = _get_audio_duration(tmp_wav_path)
transcribe_path = tmp_wav_path
need_reupload = False
if duration > MAX_REF_DURATION:
cut_point = _find_silence_cut_point(tmp_wav_path, MAX_REF_DURATION)
logger.info(f"Retranscribe: trimming {audio_id} from {duration:.1f}s at {cut_point:.1f}s")
trimmed_path = tmp_wav_path + "_trimmed.wav"
if _convert_to_wav(tmp_wav_path, trimmed_path, max_duration=cut_point):
transcribe_path = trimmed_path
duration = _get_audio_duration(trimmed_path)
need_reupload = True
# Whisper 转写
from app.services.whisper_service import whisper_service
transcribed = await whisper_service.transcribe(transcribe_path)
if not transcribed or not transcribed.strip():
raise ValueError("无法识别音频内容")
ref_text = transcribed.strip()
logger.info(f"Re-transcribed ref audio {audio_id}: {ref_text[:50]}...")
# 截取过的音频重新上传覆盖原文件
if need_reupload and trimmed_path:
with open(trimmed_path, "rb") as f:
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS, path=audio_id,
file_data=f.read(), content_type="audio/wav",
)
logger.info(f"Re-uploaded trimmed audio: {audio_id} ({duration:.1f}s)")
# 更新 metadata
metadata_path = audio_id.replace(".wav", ".json")
try:
meta_url = await storage_service.get_signed_url(BUCKET_REF_AUDIOS, metadata_path)
async with httpx.AsyncClient(timeout=5.0) as client:
resp = await client.get(meta_url)
if resp.status_code == 200:
metadata = resp.json()
else:
raise Exception(f"status {resp.status_code}")
except Exception:
metadata = {}
metadata["ref_text"] = ref_text
metadata["duration_sec"] = duration
await storage_service.upload_file(
bucket=BUCKET_REF_AUDIOS,
path=metadata_path,
file_data=json.dumps(metadata, ensure_ascii=False).encode('utf-8'),
content_type="application/json"
)
return {"ref_text": ref_text, "duration_sec": duration}
finally:
if tmp_wav_path and os.path.exists(tmp_wav_path):
os.unlink(tmp_wav_path)
if trimmed_path and os.path.exists(trimmed_path):
os.unlink(trimmed_path)


@@ -1,417 +1,32 @@
from fastapi import APIRouter, UploadFile, File, Form, HTTPException
from typing import Optional
import traceback
from loguru import logger
from app.core.response import success_response
from app.modules.tools import service

router = APIRouter()


@router.post("/extract-script")
async def extract_script_tool(
    file: Optional[UploadFile] = File(None),
    url: Optional[str] = Form(None),
    rewrite: bool = Form(True)
):
    """独立文案提取工具:上传视频/音频或输入视频链接 -> 提取文字 -> (可选) AI 洗稿"""
    try:
        result = await service.extract_script(file=file, url=url, rewrite=rewrite)
        return success_response(result)
    except ValueError as e:
        raise HTTPException(400, str(e))
    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Tool extract failed: {e}")
        logger.error(traceback.format_exc())
        msg = str(e)
        if "Fresh cookies" in msg:
            msg = "下载失败:目标平台开启了反爬验证,请过段时间重试或直接上传视频文件。"
        raise HTTPException(500, f"提取失败: {msg}")


@@ -0,0 +1,7 @@
from pydantic import BaseModel
from typing import Optional
class ExtractScriptResponse(BaseModel):
original_script: Optional[str] = None
rewritten_script: Optional[str] = None


@@ -0,0 +1,355 @@
import asyncio
import os
import re
import json
import time
import shutil
import subprocess
import traceback
from pathlib import Path
from typing import Optional, Any
from urllib.parse import unquote
import httpx
from loguru import logger
from app.services.whisper_service import whisper_service
from app.services.glm_service import glm_service
async def extract_script(file=None, url: Optional[str] = None, rewrite: bool = True) -> dict:
    """
    文案提取:上传文件或视频链接 -> Whisper 转写 -> (可选) GLM 洗稿
    """
    if not file and not url:
        raise ValueError("必须提供文件或视频链接")
    temp_path = None
    audio_path = None
    try:
        timestamp = int(time.time())
        temp_dir = Path("/tmp")
        if os.name == 'nt':
            temp_dir = Path("d:/tmp")
        temp_dir.mkdir(parents=True, exist_ok=True)
        loop = asyncio.get_event_loop()
        # 1. 获取/保存文件
        if file:
            filename = file.filename
            if not filename:
                raise ValueError("文件名无效")
            safe_filename = Path(filename).name.replace(" ", "_")
            temp_path = temp_dir / f"tool_extract_{timestamp}_{safe_filename}"

            def _save_upload():
                # 文件 I/O 放入线程池,并确保句柄关闭
                with open(temp_path, "wb") as out:
                    shutil.copyfileobj(file.file, out)

            await loop.run_in_executor(None, _save_upload)
            logger.info(f"Tool processing upload file: {temp_path}")
        else:
            temp_path = await _download_video(url, temp_dir, timestamp)
        if not temp_path or not temp_path.exists():
            raise ValueError("文件获取失败")
        # 1.5 安全转换: 强制转为 WAV (16k)
        audio_path = temp_dir / f"extract_audio_{timestamp}.wav"
        try:
            await loop.run_in_executor(None, lambda: _convert_to_wav(temp_path, audio_path))
            logger.info(f"Converted to WAV: {audio_path}")
        except ValueError as ve:
            if str(ve) == "HTML_DETECTED":
                raise ValueError("下载的文件是网页而非视频,请重试或手动上传。")
            raise ValueError("下载的文件已损坏或格式无法识别。")
        # 2. 提取文案 (Whisper)
        script = await whisper_service.transcribe(str(audio_path))
        # 3. AI 洗稿 (GLM)
        rewritten = None
        if rewrite and script and script.strip():
            logger.info("Rewriting script...")
            rewritten = await glm_service.rewrite_script(script)
        return {
            "original_script": script,
            "rewritten_script": rewritten
        }
    finally:
        # 清理临时文件(含转码产物)
        for p in (temp_path, audio_path):
            if p and p.exists():
                try:
                    os.remove(p)
                    logger.info(f"Cleaned up temp file: {p}")
                except Exception as e:
                    logger.warning(f"Failed to cleanup temp file {p}: {e}")
def _convert_to_wav(input_path: Path, output_path: Path) -> None:
"""FFmpeg 转换为 16k WAV"""
try:
convert_cmd = [
'ffmpeg',
'-i', str(input_path),
'-vn',
'-acodec', 'pcm_s16le',
'-ar', '16000',
'-ac', '1',
'-y',
str(output_path)
]
subprocess.run(convert_cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
except subprocess.CalledProcessError as e:
error_log = e.stderr.decode('utf-8', errors='ignore') if e.stderr else str(e)
logger.error(f"FFmpeg check/convert failed: {error_log}")
head = b""
try:
    with open(input_path, 'rb') as f:
        head = f.read(100)
except OSError:
    pass
if b'<!DOCTYPE html' in head or b'<html' in head:
    raise ValueError("HTML_DETECTED")
raise ValueError("CONVERT_FAILED")
async def _download_video(url: str, temp_dir: Path, timestamp: int) -> Path:
"""下载视频yt-dlp 优先,失败回退手动解析)"""
url_value = url
url_match = re.search(r'https?://[^\s]+', url_value)
if url_match:
extracted_url = url_match.group(0)
logger.info(f"Extracted URL from text: {extracted_url}")
url_value = extracted_url
logger.info(f"Tool downloading URL: {url_value}")
loop = asyncio.get_event_loop()
# 先尝试 yt-dlp
try:
temp_path = await loop.run_in_executor(None, lambda: _download_yt_dlp(url_value, temp_dir, timestamp))
logger.info(f"yt-dlp downloaded to: {temp_path}")
return temp_path
except Exception as e:
logger.warning(f"yt-dlp download failed: {e}. Trying manual fallback...")
if "douyin" in url_value:
manual_path = await _download_douyin_manual(url_value, temp_dir, timestamp)
if manual_path:
return manual_path
raise ValueError(f"视频下载失败。yt-dlp 报错: {str(e)}")
elif "bilibili" in url_value:
manual_path = await _download_bilibili_manual(url_value, temp_dir, timestamp)
if manual_path:
return manual_path
raise ValueError(f"视频下载失败。yt-dlp 报错: {str(e)}")
else:
raise ValueError(f"视频下载失败: {str(e)}")
def _download_yt_dlp(url_value: str, temp_dir: Path, timestamp: int) -> Path:
"""yt-dlp 下载(阻塞调用,应在线程池中运行)"""
import yt_dlp
logger.info("Attempting download with yt-dlp...")
ydl_opts = {
'format': 'bestaudio/best',
'outtmpl': str(temp_dir / f"tool_download_{timestamp}_%(id)s.%(ext)s"),
'quiet': True,
'no_warnings': True,
'http_headers': {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Referer': 'https://www.douyin.com/',
}
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url_value, download=True)
if 'requested_downloads' in info:
    downloaded_file = info['requested_downloads'][0]['filepath']
else:
    ext = info.get('ext', 'mp4')
    video_id = info.get('id')
    downloaded_file = str(temp_dir / f"tool_download_{timestamp}_{video_id}.{ext}")
return Path(downloaded_file)
async def _download_douyin_manual(url: str, temp_dir: Path, timestamp: int) -> Optional[Path]:
"""手动下载抖音视频 (Fallback)"""
logger.info(f"[SuperIPAgent] Starting download for: {url}")
try:
headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
}
async with httpx.AsyncClient(follow_redirects=True, timeout=10.0) as client:
resp = await client.get(url, headers=headers)
final_url = str(resp.url)
logger.info(f"[SuperIPAgent] Final URL: {final_url}")
modal_id = None
match = re.search(r'/video/(\d+)', final_url)
if match:
modal_id = match.group(1)
if not modal_id:
logger.error("[SuperIPAgent] Could not extract modal_id")
return None
logger.info(f"[SuperIPAgent] Extracted modal_id: {modal_id}")
target_url = f"https://www.douyin.com/user/MS4wLjABAAAAN_s_hups7LD0N4qnrM3o2gI0vuG3pozNaEolz2_py3cHTTrpVr1Z4dukFD9SOlwY?from_tab_name=main&modal_id={modal_id}"
from app.core.config import settings
if not settings.DOUYIN_COOKIE:
logger.warning("[SuperIPAgent] DOUYIN_COOKIE 未配置,视频下载可能失败")
headers_with_cookie = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"cookie": settings.DOUYIN_COOKIE,
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
}
logger.info(f"[SuperIPAgent] Requesting page with Cookie...")
async with httpx.AsyncClient(timeout=10.0) as client:
response = await client.get(target_url, headers=headers_with_cookie)
content_match = re.findall(r'<script id="RENDER_DATA" type="application/json">(.*?)</script>', response.text)
if not content_match:
if "SSR_HYDRATED_DATA" in response.text:
content_match = re.findall(r'<script id="SSR_HYDRATED_DATA" type="application/json">(.*?)</script>', response.text)
if not content_match:
logger.error(f"[SuperIPAgent] Could not find RENDER_DATA in page (len={len(response.text)})")
return None
content = unquote(content_match[0])
try:
    data = json.loads(content)
except json.JSONDecodeError:
    logger.error("[SuperIPAgent] JSON decode failed")
    return None
video_url = None
try:
if "app" in data and "videoDetail" in data["app"]:
info = data["app"]["videoDetail"]["video"]
if "bitRateList" in info and info["bitRateList"]:
video_url = info["bitRateList"][0]["playAddr"][0]["src"]
elif "playAddr" in info and info["playAddr"]:
video_url = info["playAddr"][0]["src"]
except Exception as e:
logger.error(f"[SuperIPAgent] Path extraction failed: {e}")
if not video_url:
logger.error("[SuperIPAgent] No video_url found")
return None
if video_url.startswith("//"):
video_url = "https:" + video_url
logger.info(f"[SuperIPAgent] Found video URL: {video_url[:50]}...")
temp_path = temp_dir / f"douyin_manual_{timestamp}.mp4"
download_headers = {
'Referer': 'https://www.douyin.com/',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
}
async with httpx.AsyncClient(timeout=60.0) as client:
async with client.stream("GET", video_url, headers=download_headers) as dl_resp:
if dl_resp.status_code == 200:
with open(temp_path, 'wb') as f:
async for chunk in dl_resp.aiter_bytes(chunk_size=8192):
f.write(chunk)
logger.info(f"[SuperIPAgent] Downloaded successfully: {temp_path}")
return temp_path
else:
logger.error(f"[SuperIPAgent] Download failed: {dl_resp.status_code}")
return None
except Exception as e:
logger.error(f"[SuperIPAgent] Logic failed: {e}")
return None
async def _download_bilibili_manual(url: str, temp_dir: Path, timestamp: int) -> Optional[Path]:
"""手动下载 Bilibili 视频 (Playwright Fallback)"""
from playwright.async_api import async_playwright
logger.info(f"[Playwright] Starting Bilibili download for: {url}")
playwright = None
browser = None
try:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=True, args=['--no-sandbox', '--disable-setuid-sandbox'])
context = await browser.new_context(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
page = await context.new_page()
logger.info("[Playwright] Navigating to Bilibili...")
await page.goto(url, timeout=45000)
try:
    await page.wait_for_selector('video', timeout=15000)
except Exception:
    logger.warning("[Playwright] Video selector timeout")
playinfo = await page.evaluate("window.__playinfo__")
audio_url = None
if playinfo and "data" in playinfo and "dash" in playinfo["data"]:
dash = playinfo["data"]["dash"]
if "audio" in dash and dash["audio"]:
audio_url = dash["audio"][0]["baseUrl"]
logger.info(f"[Playwright] Found audio stream in __playinfo__: {audio_url[:50]}...")
if not audio_url:
logger.warning("[Playwright] Could not find audio in __playinfo__")
return None
temp_path = temp_dir / f"bilibili_audio_{timestamp}.m4s"
try:
api_request = context.request
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Referer": "https://www.bilibili.com/"
}
logger.info(f"[Playwright] Downloading audio stream...")
response = await api_request.get(audio_url, headers=headers)
if response.status == 200:
body = await response.body()
with open(temp_path, 'wb') as f:
f.write(body)
logger.info(f"[Playwright] Downloaded successfully: {temp_path}")
return temp_path
else:
logger.error(f"[Playwright] API Request failed: {response.status}")
return None
except Exception as e:
logger.error(f"[Playwright] Download logic error: {e}")
return None
except Exception as e:
logger.error(f"[Playwright] Bilibili download failed: {e}")
return None
finally:
if browser:
await browser.close()
if playwright:
await playwright.stop()
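`_download_video` strips Douyin/Bilibili share blurbs by grabbing the first URL in the pasted text with a plain regex; the same pattern in isolation:

```python
import re

def extract_first_url(text: str):
    # Same pattern the service uses: everything from http(s):// up to the next whitespace
    m = re.search(r'https?://[^\s]+', text)
    return m.group(0) if m else None

share_text = "7.8 复制打开抖音 https://v.douyin.com/abc123/ 看看这个视频"
```

Because the match runs to the first whitespace, trailing Chinese text after the link is dropped automatically.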


@@ -1,14 +1,25 @@
from pydantic import BaseModel
from typing import Optional, List, Literal
class CustomAssignment(BaseModel):
material_path: str
start: float # 音频时间轴起点
end: float # 音频时间轴终点
source_start: float = 0.0 # 源视频截取起点
source_end: Optional[float] = None # 源视频截取终点(可选)
class GenerateRequest(BaseModel):
text: str
voice: str = "zh-CN-YunxiNeural"
material_path: str
material_paths: Optional[List[str]] = None
tts_mode: str = "edgetts"
ref_audio_id: Optional[str] = None
ref_text: Optional[str] = None
language: str = "zh-CN"
generated_audio_id: Optional[str] = None # 预生成配音 ID存在时跳过内联 TTS
title: Optional[str] = None
enable_subtitles: bool = True
subtitle_style_id: Optional[str] = None
@@ -19,3 +30,5 @@ class GenerateRequest(BaseModel):
subtitle_bottom_margin: Optional[int] = None
bgm_id: Optional[str] = None
bgm_volume: Optional[float] = 0.2
custom_assignments: Optional[List[CustomAssignment]] = None
output_aspect_ratio: Literal["9:16", "16:9"] = "9:16"
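The new `CustomAssignment` entries map a slice of the audio timeline to a slice of a source clip; a standalone re-declaration showing the defaults (illustrative only; the real model lives in this schemas module, and pydantic is assumed installed):

```python
from pydantic import BaseModel
from typing import Optional

class CustomAssignment(BaseModel):
    material_path: str
    start: float                        # 音频时间轴起点
    end: float                          # 音频时间轴终点
    source_start: float = 0.0           # 源视频截取起点
    source_end: Optional[float] = None  # 源视频截取终点(可选)

seg = CustomAssignment(material_path="clips/a.mp4", start=0.0, end=5.2)
```

Leaving `source_start`/`source_end` at their defaults means the clip is used from its beginning with no explicit cut-out point.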


@@ -1,4 +1,4 @@
from typing import Optional, Any, List
from pathlib import Path
import time
import traceback
@@ -24,6 +24,17 @@ from .schemas import GenerateRequest
from .task_store import task_store
def _locale_to_whisper_lang(locale: str) -> str:
"""'en-US''en', 'zh-CN''zh'"""
return locale.split("-")[0] if "-" in locale else locale
def _locale_to_tts_lang(locale: str) -> str:
"""'zh-CN''Chinese', 'en-US''English', 其他 → 'Auto'"""
mapping = {"zh": "Chinese", "en": "English"}
return mapping.get(locale.split("-")[0], "Auto")
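A quick sanity check of the two locale helpers above, as standalone copies (runnable on their own, without the module's other imports):

```python
# Standalone copies of the locale-mapping helpers for illustration.
def locale_to_whisper_lang(locale: str) -> str:
    # 'en-US' → 'en'; bare codes pass through unchanged
    return locale.split("-")[0] if "-" in locale else locale

def locale_to_tts_lang(locale: str) -> str:
    # only zh/en are mapped explicitly; everything else falls back to 'Auto'
    return {"zh": "Chinese", "en": "English"}.get(locale.split("-")[0], "Auto")

print(locale_to_whisper_lang("en-US"))  # en
print(locale_to_tts_lang("ja-JP"))      # Auto
```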
_lipsync_service: Optional[LipSyncService] = None
_lipsync_ready: Optional[bool] = None
_lipsync_last_check: float = 0
@@ -79,26 +90,162 @@ def _update_task(task_id: str, **updates: Any) -> None:
task_store.update(task_id, updates)
# ── 多素材辅助函数 ──
def _split_equal(segments: List[dict], material_paths: List[str]) -> List[dict]:
"""按素材数量均分音频时长,对齐到最近的 Whisper 字边界。
Args:
segments: Whisper 产出的 segment 列表, 每个包含 words (字级时间戳)
material_paths: 素材路径列表
Returns:
[{"material_path": "...", "start": 0.0, "end": 5.2, "index": 0}, ...]
"""
# 展平所有 Whisper 字符
all_chars: List[dict] = []
for seg in segments:
for w in seg.get("words", []):
all_chars.append(w)
n = len(material_paths)
if not all_chars or n == 0:
return [{"material_path": material_paths[0] if material_paths else "",
"start": 0.0, "end": 99999.0, "index": 0}]
# 素材数不能超过字符数,否则边界会重复
if n > len(all_chars):
logger.warning(f"[MultiMat] 素材数({n}) > 字符数({len(all_chars)}),裁剪为 {len(all_chars)}")
n = len(all_chars)
total_start = all_chars[0]["start"]
total_end = all_chars[-1]["end"]
seg_dur = (total_end - total_start) / n
# 计算 N-1 个分割点,对齐到最近的字边界
boundaries = [0] # 第一段从第 0 个字开始
for i in range(1, n):
target_time = total_start + i * seg_dur
# 找到 start 时间最接近 target_time 的字
best_idx = boundaries[-1] + 1 # 至少比上一个边界后移 1
best_diff = float("inf")
for j in range(boundaries[-1] + 1, len(all_chars)):
diff = abs(all_chars[j]["start"] - target_time)
if diff < best_diff:
best_diff = diff
best_idx = j
elif diff > best_diff:
break # 时间递增,差值开始变大后可以停了
boundaries.append(min(best_idx, len(all_chars) - 1))
boundaries.append(len(all_chars)) # 最后一段到末尾
# 按边界生成分配结果
assignments: List[dict] = []
for i in range(n):
s_idx = boundaries[i]
e_idx = boundaries[i + 1]
if s_idx >= len(all_chars) or s_idx >= e_idx:
continue
assignments.append({
"material_path": material_paths[i],
"start": all_chars[s_idx]["start"],
"end": all_chars[e_idx - 1]["end"],
"text": "".join(c["word"] for c in all_chars[s_idx:e_idx]),
"index": len(assignments),
})
if not assignments:
return [{"material_path": material_paths[0], "start": 0.0, "end": 99999.0, "index": 0}]
logger.info(f"[MultiMat] 均分 {len(all_chars)} 字为 {len(assignments)}")
for a in assignments:
dur = a["end"] - a["start"]
logger.info(f"{a['index']}: [{a['start']:.2f}-{a['end']:.2f}s] ({dur:.1f}s) {a['text'][:20]}")
return assignments
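The boundary-alignment idea in `_split_equal` can be sketched independently with synthetic word timestamps. This is a simplified version (no early-break optimization, no text field, and it assumes there are at least as many characters as materials):

```python
# Simplified sketch of equal-duration splitting aligned to word boundaries.
def split_equal(chars, materials):
    n = min(len(materials), len(chars))
    total_start, total_end = chars[0]["start"], chars[-1]["end"]
    seg_dur = (total_end - total_start) / n
    boundaries = [0]
    for i in range(1, n):
        target = total_start + i * seg_dur
        # pick the character whose start time is closest to the ideal cut point
        idx = min(range(boundaries[-1] + 1, len(chars)),
                  key=lambda j: abs(chars[j]["start"] - target))
        boundaries.append(idx)
    boundaries.append(len(chars))
    return [
        {"material": materials[i],
         "start": chars[boundaries[i]]["start"],
         "end": chars[boundaries[i + 1] - 1]["end"]}
        for i in range(n)
    ]

# four 0.5s characters split across two materials → cut lands at 1.0s
chars = [{"start": 0.5 * i, "end": 0.5 * i + 0.5} for i in range(4)]
print(split_equal(chars, ["a.mp4", "b.mp4"]))
```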
async def process_video_generation(task_id: str, req: GenerateRequest, user_id: str):
temp_files = []
try:
start_time = time.time()
# ── 确定素材列表 ──
material_paths: List[str] = []
if req.custom_assignments and len(req.custom_assignments) > 1:
material_paths = [a.material_path for a in req.custom_assignments if a.material_path]
elif req.material_paths and len(req.material_paths) > 1:
material_paths = req.material_paths
else:
material_paths = [req.material_path]
is_multi = len(material_paths) > 1
target_resolution = (1080, 1920) if req.output_aspect_ratio == "9:16" else (1920, 1080)
logger.info(
f"[Render] 输出画面比例: {req.output_aspect_ratio}, "
f"目标分辨率: {target_resolution[0]}x{target_resolution[1]}"
)
_update_task(task_id, status="processing", progress=5, message="正在下载素材...")
temp_dir = settings.UPLOAD_DIR / "temp"
temp_dir.mkdir(parents=True, exist_ok=True)
video = VideoService()
input_material_path: Optional[Path] = None
# 单素材模式:下载主素材
if not is_multi:
input_material_path = temp_dir / f"{task_id}_input.mp4"
temp_files.append(input_material_path)
await _download_material(material_paths[0], input_material_path)
# 归一化旋转元数据(如 iPhone MOV 1920x1080 + rotation=-90
normalized_input_path = temp_dir / f"{task_id}_input_norm.mp4"
normalized_result = video.normalize_orientation(
str(input_material_path),
str(normalized_input_path),
)
if normalized_result != str(input_material_path):
temp_files.append(normalized_input_path)
input_material_path = normalized_input_path
_update_task(task_id, message="正在生成语音...", progress=10)
audio_path = temp_dir / f"{task_id}_audio.wav"
temp_files.append(audio_path)
if req.tts_mode == "voiceclone":
if req.generated_audio_id:
# 新流程:使用预生成的配音
_update_task(task_id, message="正在下载配音...", progress=12)
audio_url = await storage_service.get_signed_url(
bucket="generated-audios",
path=req.generated_audio_id,
)
await _download_material(audio_url, audio_path)
# 从元数据获取 language
meta_path = req.generated_audio_id.replace("_audio.wav", "_audio.json")
try:
meta_url = await storage_service.get_signed_url(
bucket="generated-audios", path=meta_path,
)
import httpx as _httpx
async with _httpx.AsyncClient(timeout=5.0) as client:
resp = await client.get(meta_url)
if resp.status_code == 200:
meta = resp.json()
req.language = meta.get("language", req.language)
# 无条件用配音元数据覆盖文案,确保字幕与配音语言一致
meta_text = meta.get("text", "")
if meta_text:
req.text = meta_text
except Exception as e:
logger.warning(f"读取配音元数据失败: {e}")
elif req.tts_mode == "voiceclone":
if not req.ref_audio_id or not req.ref_text:
raise ValueError("声音克隆模式需要提供参考音频和参考文字")
@@ -113,13 +260,13 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
)
await _download_material(ref_audio_url, ref_audio_local)
_update_task(task_id, message="正在克隆声音 (Qwen3-TTS)...")
_update_task(task_id, message="正在克隆声音...")
await voice_clone_service.generate_audio(
text=req.text,
ref_audio_path=str(ref_audio_local),
ref_text=req.ref_text,
output_path=str(audio_path),
language="Chinese"
language=_locale_to_tts_lang(req.language)
)
else:
_update_task(task_id, message="正在生成语音 (EdgeTTS)...")
@@ -128,52 +275,306 @@ async def process_video_generation(task_id: str, req: GenerateRequest, user_id:
tts_time = time.time() - start_time
print(f"[Pipeline] TTS completed in {tts_time:.1f}s")
_update_task(task_id, progress=25)
_update_task(task_id, message="正在合成唇形 (LatentSync)...", progress=30)
lipsync = _get_lipsync_service()
lipsync_video_path = temp_dir / f"{task_id}_lipsync.mp4"
temp_files.append(lipsync_video_path)
captions_path = None
if is_multi:
# ══════════════════════════════════════
# 多素材流水线
# ══════════════════════════════════════
_update_task(task_id, progress=12, message="正在分配素材...")
if req.custom_assignments and len(req.custom_assignments) == len(material_paths):
# 用户自定义分配,跳过 Whisper 均分
assignments = [
{
"material_path": a.material_path,
"start": a.start,
"end": a.end,
"source_start": a.source_start,
"source_end": a.source_end,
"index": i,
}
for i, a in enumerate(req.custom_assignments)
]
# 仍然需要 Whisper 生成字幕(如果启用)
captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)
if req.enable_subtitles:
_update_task(task_id, message="正在生成字幕 (Whisper)...")
try:
await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(captions_path),
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed (custom assignments)")
except Exception as e:
logger.warning(f"Whisper alignment failed: {e}")
captions_path = None
else:
captions_path = None
elif req.custom_assignments:
logger.warning(
f"[MultiMat] custom_assignments 数量({len(req.custom_assignments)})"
f" 与素材数量({len(material_paths)})不一致,回退自动分配"
)
print(f"[Pipeline] Whisper alignment completed")
except Exception as e:
logger.warning(f"Whisper alignment failed, skipping subtitles: {e}")
# 原有逻辑Whisper → _split_equal
_update_task(task_id, message="正在生成字幕 (Whisper)...")
captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)
try:
captions_data = await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(captions_path),
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed (multi-material)")
except Exception as e:
logger.warning(f"Whisper alignment failed: {e}")
captions_data = None
captions_path = None
_update_task(task_id, progress=15, message="正在分配素材...")
if captions_data and captions_data.get("segments"):
assignments = _split_equal(captions_data["segments"], material_paths)
else:
# Whisper 失败 → 按时长均分(不依赖字符对齐)
logger.warning("[MultiMat] Whisper 无数据,按时长均分")
audio_dur = video._get_duration(str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0 # 安全兜底
seg_dur = audio_dur / len(material_paths)
assignments = [
{"material_path": material_paths[i], "start": i * seg_dur,
"end": (i + 1) * seg_dur, "index": i}
for i in range(len(material_paths))
]
else:
# 原有逻辑Whisper → _split_equal
_update_task(task_id, message="正在生成字幕 (Whisper)...")
captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)
try:
captions_data = await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(captions_path),
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed (multi-material)")
except Exception as e:
logger.warning(f"Whisper alignment failed: {e}")
captions_data = None
captions_path = None
_update_task(task_id, progress=15, message="正在分配素材...")
if captions_data and captions_data.get("segments"):
assignments = _split_equal(captions_data["segments"], material_paths)
else:
# Whisper 失败 → 按时长均分(不依赖字符对齐)
logger.warning("[MultiMat] Whisper 无数据,按时长均分")
audio_dur = video._get_duration(str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0 # 安全兜底
seg_dur = audio_dur / len(material_paths)
assignments = [
{"material_path": material_paths[i], "start": i * seg_dur,
"end": (i + 1) * seg_dur, "index": i}
for i in range(len(material_paths))
]
# 扩展段覆盖完整音频范围首段从0开始末段到音频结尾
audio_duration = video._get_duration(str(audio_path))
if assignments and audio_duration > 0:
assignments[0]["start"] = 0.0
assignments[-1]["end"] = audio_duration
num_segments = len(assignments)
print(f"[Pipeline] Multi-material: {num_segments} segments, {len(material_paths)} materials")
if num_segments == 0:
raise RuntimeError("Multi-material: no valid segments after splitting")
lipsync_start = time.time()
# ── 第一步:下载所有素材并检测分辨率 ──
material_locals: List[Path] = []
resolutions = []
for i, assignment in enumerate(assignments):
material_local = temp_dir / f"{task_id}_material_{i}.mp4"
temp_files.append(material_local)
await _download_material(assignment["material_path"], material_local)
# 归一化旋转元数据,确保分辨率判断与后续推理一致
normalized_material = temp_dir / f"{task_id}_material_{i}_norm.mp4"
normalized_result = video.normalize_orientation(
str(material_local),
str(normalized_material),
)
if normalized_result != str(material_local):
temp_files.append(normalized_material)
material_local = normalized_material
material_locals.append(material_local)
resolutions.append(video.get_resolution(str(material_local)))
# 按用户选择的画面比例统一分辨率
base_res = target_resolution
need_scale = any(r != base_res for r in resolutions)
if need_scale:
logger.info(f"[MultiMat] 素材分辨率不一致,统一到 {base_res[0]}x{base_res[1]}")
# ── 第二步:裁剪每段素材到对应时长 ──
prepared_segments: List[Path] = []
for i, assignment in enumerate(assignments):
seg_progress = 15 + int((i / num_segments) * 30) # 15% → 45%
seg_dur = assignment["end"] - assignment["start"]
_update_task(
task_id,
progress=seg_progress,
message=f"正在准备素材 {i+1}/{num_segments}..."
)
prepared_path = temp_dir / f"{task_id}_prepared_{i}.mp4"
temp_files.append(prepared_path)
video.prepare_segment(
str(material_locals[i]), seg_dur, str(prepared_path),
# 多素材拼接前统一重编码为同分辨率/同编码,避免 concat 仅保留首段
target_resolution=base_res,
source_start=assignment.get("source_start", 0.0),
source_end=assignment.get("source_end"),
target_fps=25,
)
prepared_segments.append(prepared_path)
# ── 第三步:拼接所有素材片段 ──
_update_task(task_id, progress=50, message="正在拼接素材片段...")
concat_path = temp_dir / f"{task_id}_concat.mp4"
temp_files.append(concat_path)
video.concat_videos(
[str(p) for p in prepared_segments],
str(concat_path),
target_fps=25,
)
# ── 第四步:一次 LatentSync 推理 ──
is_ready = await _check_lipsync_ready()
if is_ready:
_update_task(task_id, progress=55, message="正在合成唇形 (LatentSync)...")
print(f"[LipSync] Multi-material: single LatentSync on concatenated video")
try:
await lipsync.generate(str(concat_path), str(audio_path), str(lipsync_video_path))
except Exception as e:
logger.warning(f"[LipSync] Failed, fallback to concat without lipsync: {e}")
import shutil
shutil.copy(str(concat_path), str(lipsync_video_path))
else:
print(f"[LipSync] Not ready, using concatenated video without lipsync")
import shutil
shutil.copy(str(concat_path), str(lipsync_video_path))
lipsync_time = time.time() - lipsync_start
print(f"[Pipeline] Multi-material prepare + concat + LipSync completed in {lipsync_time:.1f}s")
_update_task(task_id, progress=80)
# 如果用户关闭了字幕,清除 captions_pathWhisper 仅用于句子切分)
if not req.enable_subtitles:
captions_path = None
else:
# ══════════════════════════════════════
# 单素材流水线(原有逻辑)
# ══════════════════════════════════════
if input_material_path is None:
raise RuntimeError("单素材流程缺少输入素材")
# 单素材:按用户选择画面比例统一到目标分辨率,并应用 source_start
single_source_start = 0.0
single_source_end = None
if req.custom_assignments and len(req.custom_assignments) == 1:
single_source_start = req.custom_assignments[0].source_start
single_source_end = req.custom_assignments[0].source_end
_update_task(task_id, progress=20, message="正在准备素材片段...")
audio_dur = video._get_duration(str(audio_path))
if audio_dur <= 0:
audio_dur = 30.0
prepared_single_path = temp_dir / f"{task_id}_prepared_single.mp4"
temp_files.append(prepared_single_path)
video.prepare_segment(
str(input_material_path),
audio_dur,
str(prepared_single_path),
target_resolution=target_resolution,
source_start=single_source_start,
source_end=single_source_end,
)
input_material_path = prepared_single_path
_update_task(task_id, progress=25)
_update_task(task_id, message="正在合成唇形 (LatentSync)...", progress=30)
lipsync_start = time.time()
is_ready = await _check_lipsync_ready()
if is_ready:
print(f"[LipSync] Starting LatentSync inference...")
_update_task(task_id, progress=35, message="正在运行 LatentSync 推理...")
await lipsync.generate(str(input_material_path), str(audio_path), str(lipsync_video_path))
else:
print(f"[LipSync] LatentSync not ready, copying original video")
_update_task(task_id, message="唇形同步不可用,使用原始视频...")
import shutil
shutil.copy(str(input_material_path), lipsync_video_path)
lipsync_time = time.time() - lipsync_start
print(f"[Pipeline] LipSync completed in {lipsync_time:.1f}s")
_update_task(task_id, progress=80)
# 单素材模式Whisper 在 LatentSync 之后
if req.enable_subtitles:
_update_task(task_id, message="正在生成字幕 (Whisper)...", progress=82)
captions_path = temp_dir / f"{task_id}_captions.json"
temp_files.append(captions_path)
try:
await whisper_service.align(
audio_path=str(audio_path),
text=req.text,
output_path=str(captions_path),
language=_locale_to_whisper_lang(req.language),
original_text=req.text,
)
print(f"[Pipeline] Whisper alignment completed")
except Exception as e:
logger.warning(f"Whisper alignment failed, skipping subtitles: {e}")
captions_path = None
_update_task(task_id, progress=85)
final_audio_path = audio_path
if req.bgm_id:
_update_task(task_id, message="正在合成背景音乐...", progress=86)


@@ -1,3 +1,4 @@
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, cast
from app.core.supabase import get_supabase
@@ -37,3 +38,33 @@ def update_user(user_id: str, payload: Dict[str, Any]) -> List[Dict[str, Any]]:
supabase = get_supabase()
result = supabase.table("users").update(payload).eq("id", user_id).execute()
return cast(List[Dict[str, Any]], result.data or [])
def _parse_expires_at(expires_at: Any) -> Optional[datetime]:
try:
expires_at_dt = datetime.fromisoformat(str(expires_at).replace("Z", "+00:00"))
except Exception:
return None
if expires_at_dt.tzinfo is None:
expires_at_dt = expires_at_dt.replace(tzinfo=timezone.utc)
return expires_at_dt.astimezone(timezone.utc)
def deactivate_user_if_expired(user: Dict[str, Any]) -> bool:
expires_at = user.get("expires_at")
if not expires_at:
return False
expires_at_dt = _parse_expires_at(expires_at)
if not expires_at_dt:
return False
if datetime.now(timezone.utc) <= expires_at_dt:
return False
user_id = user.get("id")
if user.get("is_active") and user_id:
update_user(cast(str, user_id), {"is_active": False})
return True
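The expiry check above can be reduced to a standalone sketch. Here `parse_expires_at` mirrors `_parse_expires_at`, while the repository update call is stubbed out of scope:

```python
from datetime import datetime, timezone

def parse_expires_at(expires_at):
    # ISO-8601 with a trailing 'Z' is normalized to an explicit UTC offset
    try:
        dt = datetime.fromisoformat(str(expires_at).replace("Z", "+00:00"))
    except Exception:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def is_expired(user):
    # users without expires_at never expire
    dt = parse_expires_at(user.get("expires_at"))
    return bool(dt) and datetime.now(timezone.utc) > dt

print(is_expired({"expires_at": "2000-01-01T00:00:00Z"}))  # True
print(is_expired({"expires_at": None}))                    # False
```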


@@ -43,6 +43,7 @@ class GLMService:
要求:
1. 标题要简洁有力能吸引观众点击不超过10个字
2. 标签要与内容相关便于搜索和推荐只要3个
3. 标题和标签必须使用与口播文案相同的语言(如文案是英文就用英文,日文就用日文)
请严格按以下JSON格式返回不要包含其他内容
{{"title": "标题", "tags": ["标签1", "标签2", "标签3"]}}"""
@@ -120,6 +121,49 @@ class GLMService:
async def translate_text(self, text: str, target_lang: str) -> str:
"""
将文案翻译为指定语言
Args:
text: 原始文案
target_lang: 目标语言(如 English, 日本語 等)
Returns:
翻译后的文案
"""
prompt = f"""请将以下文案翻译为{target_lang}
原文:
{text}
要求:
1. 只返回翻译后的文案,不要添加任何解释或说明
2. 保持原文的语气和风格
3. 翻译要自然流畅,符合目标语言的表达习惯"""
try:
client = self._get_client()
logger.info(f"Using GLM to translate text to {target_lang}")
import asyncio
response = await asyncio.to_thread(
client.chat.completions.create,
model=settings.GLM_MODEL,
messages=[{"role": "user", "content": prompt}],
thinking={"type": "disabled"},
max_tokens=2000,
temperature=0.3
)
content = response.choices[0].message.content
logger.info("GLM translation completed")
return content.strip()
except Exception as e:
logger.error(f"GLM translate error: {e}")
raise Exception(f"AI 翻译失败: {str(e)}")
def _parse_json_response(self, content: str) -> dict:
"""解析 GLM 返回的 JSON 内容"""
# 尝试直接解析


@@ -20,12 +20,13 @@ class StorageService:
self.BUCKET_MATERIALS = "materials"
self.BUCKET_OUTPUTS = "outputs"
self.BUCKET_REF_AUDIOS = "ref-audios"
self.BUCKET_GENERATED_AUDIOS = "generated-audios"
# 确保所有 bucket 存在
self._ensure_buckets()
def _ensure_buckets(self):
"""确保所有必需的 bucket 存在"""
buckets = [self.BUCKET_MATERIALS, self.BUCKET_OUTPUTS, self.BUCKET_REF_AUDIOS, self.BUCKET_GENERATED_AUDIOS]
try:
existing = self.supabase.storage.list_buckets()
existing_names = {b.name for b in existing} if existing else set()


@@ -9,9 +9,110 @@ from pathlib import Path
from loguru import logger
from typing import Optional
class VideoService:
def __init__(self):
pass
def get_video_metadata(self, file_path: str) -> dict:
"""获取视频元信息(含旋转角与有效显示分辨率)"""
cmd = [
"ffprobe", "-v", "error",
"-select_streams", "v:0",
"-show_entries", "stream=width,height:stream_side_data=rotation",
"-of", "json",
file_path,
]
default_info = {
"width": 0,
"height": 0,
"rotation": 0,
"effective_width": 0,
"effective_height": 0,
}
try:
result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
if result.returncode != 0:
return default_info
payload = json.loads(result.stdout or "{}")
streams = payload.get("streams") or []
if not streams:
return default_info
stream = streams[0]
width = int(stream.get("width") or 0)
height = int(stream.get("height") or 0)
rotation = 0
for side_data in stream.get("side_data_list") or []:
if not isinstance(side_data, dict):
continue
raw_rotation = side_data.get("rotation")
if raw_rotation is None:
continue
try:
rotation = int(round(float(str(raw_rotation))))
except Exception:
rotation = 0
break
norm_rotation = rotation % 360
if norm_rotation > 180:
norm_rotation -= 360
swap_wh = abs(norm_rotation) == 90
effective_width = height if swap_wh else width
effective_height = width if swap_wh else height
return {
"width": width,
"height": height,
"rotation": norm_rotation,
"effective_width": effective_width,
"effective_height": effective_height,
}
except Exception as e:
logger.warning(f"获取视频元信息失败: {e}")
return default_info
def normalize_orientation(self, video_path: str, output_path: str) -> str:
"""将带旋转元数据的视频转为物理方向,避免后续流程忽略 rotation。"""
info = self.get_video_metadata(video_path)
rotation = int(info.get("rotation") or 0)
if rotation == 0:
return video_path
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
logger.info(
f"检测到旋转元数据 rotation={rotation},归一化方向: "
f"{info.get('effective_width', 0)}x{info.get('effective_height', 0)}"
)
cmd = [
"ffmpeg", "-y",
"-i", video_path,
"-map", "0:v:0",
"-map", "0:a?",
"-c:v", "libx264",
"-preset", "fast",
"-crf", "18",
"-c:a", "copy",
"-movflags", "+faststart",
output_path,
]
if self._run_ffmpeg(cmd):
normalized = self.get_video_metadata(output_path)
logger.info(
"视频方向归一化完成: "
f"coded={normalized.get('width', 0)}x{normalized.get('height', 0)}, "
f"rotation={normalized.get('rotation', 0)}"
)
return output_path
logger.warning("视频方向归一化失败,回退使用原视频")
return video_path
def _run_ffmpeg(self, cmd: list) -> bool:
cmd_str = ' '.join(shlex.quote(str(c)) for c in cmd)
@@ -138,3 +239,167 @@ class VideoService:
return output_path
else:
raise RuntimeError("FFmpeg composition failed")
def concat_videos(self, video_paths: list, output_path: str, target_fps: int = 25) -> str:
"""使用 FFmpeg concat demuxer 拼接多个视频片段"""
if not video_paths:
raise ValueError("No video segments to concat")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
# 生成 concat list 文件
list_path = Path(output_path).parent / f"{Path(output_path).stem}_concat.txt"
with open(list_path, "w", encoding="utf-8") as f:
for vp in video_paths:
f.write(f"file '{vp}'\n")
cmd = [
"ffmpeg", "-y",
"-f", "concat",
"-safe", "0",
"-fflags", "+genpts",
"-i", str(list_path),
"-an",
"-vsync", "cfr",
"-r", str(target_fps),
"-c:v", "libx264",
"-preset", "fast",
"-crf", "18",
"-pix_fmt", "yuv420p",
"-movflags", "+faststart",
output_path,
]
try:
if self._run_ffmpeg(cmd):
return output_path
else:
raise RuntimeError("FFmpeg concat failed")
finally:
try:
list_path.unlink(missing_ok=True)
except Exception:
pass
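The concat demuxer input that `concat_videos` above generates is just a text file of `file '<path>'` lines. A standalone sketch of the writer (with hypothetical filenames):

```python
from pathlib import Path
import tempfile

def write_concat_list(paths, list_path):
    # one "file '<path>'" entry per segment, in playback order
    with open(list_path, "w", encoding="utf-8") as f:
        for p in paths:
            f.write(f"file '{p}'\n")
    return list_path

tmp = Path(tempfile.mkdtemp()) / "concat.txt"
write_concat_list(["a.mp4", "b.mp4"], tmp)
print(tmp.read_text())
```

Note that paths containing single quotes would need escaping for the demuxer; the sketch (like the original) assumes plain temp-file names.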
def split_audio(self, audio_path: str, start: float, end: float, output_path: str) -> str:
"""用 FFmpeg 按时间范围切分音频"""
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
duration = end - start
if duration <= 0:
raise ValueError(f"Invalid audio split range: start={start}, end={end}, duration={duration}")
cmd = [
"ffmpeg", "-y",
"-ss", str(start),
"-t", str(duration),
"-i", audio_path,
"-c", "copy",
output_path,
]
if self._run_ffmpeg(cmd):
return output_path
raise RuntimeError(f"FFmpeg audio split failed: {start}-{end}")
def get_resolution(self, file_path: str) -> tuple[int, int]:
"""获取视频有效显示分辨率(考虑旋转元数据)。"""
info = self.get_video_metadata(file_path)
return (
int(info.get("effective_width") or 0),
int(info.get("effective_height") or 0),
)
def prepare_segment(self, video_path: str, target_duration: float, output_path: str,
target_resolution: Optional[tuple] = None, source_start: float = 0.0,
source_end: Optional[float] = None, target_fps: Optional[int] = None) -> str:
"""将素材视频裁剪或循环到指定时长(无音频)。
target_resolution: (width, height) 如需统一分辨率则传入,否则保持原分辨率。
source_start: 源视频截取起点(秒),默认 0。
source_end: 源视频截取终点(秒),默认到素材结尾。
target_fps: 输出帧率(可选),用于多素材拼接前统一时间基。
"""
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
video_dur = self._get_duration(video_path)
if video_dur <= 0:
video_dur = target_duration
clip_end = video_dur
if source_end is not None:
try:
source_end_value = float(source_end)
if source_end_value > source_start:
clip_end = min(source_end_value, video_dur)
except Exception:
pass
# 可用时长 = 从 source_start 到截取终点(或视频结尾)
available = max(clip_end - source_start, 0.1)
needs_loop = target_duration > available
needs_scale = target_resolution is not None
needs_fps = bool(target_fps and target_fps > 0)
has_source_end = clip_end < video_dur
# 当需要循环且存在截取范围时,先裁剪出片段,再循环裁剪后的文件
# 避免 stream_loop 循环整个视频(而不是截取后的片段)
actual_input = video_path
trim_temp = None
if needs_loop and (source_start > 0 or has_source_end):
trim_temp = str(Path(output_path).parent / (Path(output_path).stem + "_trim_tmp.mp4"))
trim_cmd = [
"ffmpeg", "-y",
"-ss", str(source_start),
"-i", video_path,
"-t", str(available),
"-an",
"-c:v", "libx264", "-preset", "fast", "-crf", "18",
trim_temp,
]
if not self._run_ffmpeg(trim_cmd):
raise RuntimeError(f"FFmpeg trim for loop failed: {video_path}")
actual_input = trim_temp
source_start = 0.0 # 已裁剪,不需要再 seek
# 重新计算循环次数(基于裁剪后文件)
available = self._get_duration(trim_temp) or available
loop_count = int(target_duration / available) + 1 if needs_loop else 0
cmd = ["ffmpeg", "-y"]
if needs_loop:
cmd.extend(["-stream_loop", str(loop_count)])
if source_start > 0:
cmd.extend(["-ss", str(source_start)])
cmd.extend(["-i", actual_input, "-t", str(target_duration), "-an"])
filters = []
if needs_fps:
filters.append(f"fps={int(target_fps)}")
if needs_scale:
w, h = target_resolution
filters.append(f"scale={w}:{h}:force_original_aspect_ratio=decrease,pad={w}:{h}:(ow-iw)/2:(oh-ih)/2")
if filters:
cmd.extend(["-vf", ",".join(filters)])
if needs_fps:
cmd.extend(["-vsync", "cfr", "-r", str(int(target_fps))])
# 需要循环、缩放或指定起点时必须重编码,否则用 stream copy 保持原画质
if needs_loop or needs_scale or source_start > 0 or has_source_end or needs_fps:
cmd.extend(["-c:v", "libx264", "-preset", "fast", "-crf", "18"])
else:
cmd.extend(["-c:v", "copy"])
cmd.append(output_path)
try:
if self._run_ffmpeg(cmd):
return output_path
raise RuntimeError(f"FFmpeg prepare_segment failed: {video_path}")
finally:
# 清理裁剪临时文件
if trim_temp:
try:
Path(trim_temp).unlink(missing_ok=True)
except Exception:
pass
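The looping math in `prepare_segment` above, in isolation: `-stream_loop N` plays the input N+1 times, so the loop count must make the repeated clip cover the target duration.

```python
def loop_count(available: float, target: float) -> int:
    # no looping needed when the clip already covers the target
    if target <= available:
        return 0
    # int(target/available) + 1 extra loops → (count + 1) plays ≥ target
    return int(target / available) + 1

# a 4s clip looped 3 extra times plays 4 times = 16s, covering a 10s target
print(loop_count(4.0, 10.0))  # 3
```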


@@ -1,37 +1,104 @@
"""
声音克隆服务
通过 HTTP 调用 CosyVoice 3.0 独立服务 (端口 8010)
"""
import asyncio
from pathlib import Path
from typing import Optional
import httpx
from loguru import logger
from app.core.config import settings
# CosyVoice 3.0 服务地址
VOICE_CLONE_URL = "http://localhost:8010"
class VoiceCloneService:
"""声音克隆服务 - 调用 CosyVoice 3.0 HTTP API"""
def __init__(self):
self.base_url = VOICE_CLONE_URL
# 健康状态缓存
self._health_cache: Optional[dict] = None
self._health_cache_time: float = 0
# GPU 并发锁 (Serial Queue)
self._lock = asyncio.Lock()
async def _generate_once(
self,
*,
text: str,
ref_audio_data: bytes,
ref_text: str,
language: str,
speed: float = 1.0,
max_retries: int = 4,
) -> bytes:
timeout = httpx.Timeout(240.0)
for attempt in range(max_retries):
try:
async with httpx.AsyncClient(timeout=timeout) as client:
response = await client.post(
f"{self.base_url}/generate",
files={"ref_audio": ("ref.wav", ref_audio_data, "audio/wav")},
data={
"text": text,
"ref_text": ref_text,
"language": language,
"speed": str(speed),
},
)
retryable = False
reason = ""
if response.status_code in (429, 502, 503, 504):
retryable = True
reason = f"HTTP {response.status_code}"
elif response.status_code == 500 and (
"生成超时" in response.text or "timeout" in response.text.lower()
):
retryable = True
reason = "upstream timeout"
if retryable and attempt < max_retries - 1:
wait = 8 * (attempt + 1)
logger.warning(
f"Voice clone retryable error ({reason}), retrying in {wait}s "
f"(attempt {attempt + 1}/{max_retries})"
)
await asyncio.sleep(wait)
continue
response.raise_for_status()
return response.content
except httpx.HTTPStatusError as e:
logger.error(f"Voice clone API error: {e.response.status_code} - {e.response.text}")
raise RuntimeError(f"声音克隆服务错误: {e.response.text}")
except httpx.RequestError as e:
if attempt < max_retries - 1:
wait = 6 * (attempt + 1)
logger.warning(
f"Voice clone connection error: {e}; retrying in {wait}s "
f"(attempt {attempt + 1}/{max_retries})"
)
await asyncio.sleep(wait)
continue
logger.error(f"Voice clone connection error: {e}")
raise RuntimeError("无法连接声音克隆服务,请检查服务是否启动")
raise RuntimeError("声音克隆服务繁忙,请稍后重试")
async def generate_audio(
self,
text: str,
ref_audio_path: str,
ref_text: str,
output_path: str,
language: str = "Chinese"
language: str = "Chinese",
speed: float = 1.0,
) -> str:
"""
使用声音克隆生成语音
@@ -48,63 +115,52 @@ class VoiceCloneService:
"""
# 使用锁确保串行执行,避免 GPU 显存溢出
async with self._lock:
logger.info(f"🎤 Voice Clone: {text[:30]}...")
logger.info(f"🎤 Voice Clone: {text[:30]}... (language={language})")
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
# 读取参考音频
text = text.strip()
if not text:
raise RuntimeError("文本为空,无法生成语音")
with open(ref_audio_path, "rb") as f:
ref_audio_data = f.read()
# CosyVoice 内部自带 text_normalize 分段,无需客户端切分
audio_bytes = await self._generate_once(
text=text,
ref_audio_data=ref_audio_data,
ref_text=ref_text,
language=language,
speed=speed,
)
with open(output_path, "wb") as f:
f.write(audio_bytes)
logger.info(f"✅ Voice clone saved: {output_path}")
return output_path
async def check_health(self) -> dict:
"""健康检查"""
import time
# 30秒缓存
now = time.time()
cached = self._health_cache
if cached is not None and (now - self._health_cache_time) < 30:
return cached
try:
async with httpx.AsyncClient(timeout=5.0) as client:
response = await client.get(f"{self.base_url}/health")
response.raise_for_status()
payload = response.json()
self._health_cache = payload
self._health_cache_time = now
return payload
except Exception as e:
logger.warning(f"Qwen3-TTS health check failed: {e}")
logger.warning(f"Voice clone health check failed: {e}")
return {
"service": "Qwen3-TTS Voice Clone",
"model": "0.6B-Base",
"service": "CosyVoice 3.0 Voice Clone",
"model": "unknown",
"ready": False,
"gpu_id": 0,
"error": str(e)


@@ -20,24 +20,41 @@ MAX_CHARS_PER_LINE = 12
def split_word_to_chars(word: str, start: float, end: float) -> list:
"""
将词拆分成单个字符,时间戳线性插值
保留英文词前的空格Whisper 输出如 " Hello"),用于正确重建英文字幕。
Args:
word: 词文本(可能含前导空格)
start: 词开始时间
end: 词结束时间
Returns:
单字符列表,每个包含 word/start/end
"""
# 保留前导空格(英文 Whisper 输出常见 " Hello" 形式)
leading_space = ""
if word and not word[0].strip():
leading_space = " "
word = word.lstrip()
tokens = []
ascii_buffer = ""
pending_space = False # 记录是否有待处理的空格(用于英文单词间距)
for char in word:
if not char.strip():
# 空格flush ascii_buffer标记下一个 token 需要前导空格
if ascii_buffer:
tokens.append(ascii_buffer)
ascii_buffer = ""
if tokens: # 仅在已有 token 时标记(避免开头重复空格)
pending_space = True
continue
if char.isascii() and char.isalnum():
if pending_space and not ascii_buffer:
ascii_buffer = " " # 将空格前置到新英文单词
pending_space = False
ascii_buffer += char
continue
@@ -45,7 +62,9 @@ def split_word_to_chars(word: str, start: float, end: float) -> list:
tokens.append(ascii_buffer)
ascii_buffer = ""
tokens.append(char)
prefix = " " if pending_space else ""
pending_space = False
tokens.append(prefix + char)
if ascii_buffer:
tokens.append(ascii_buffer)
@@ -54,7 +73,8 @@ def split_word_to_chars(word: str, start: float, end: float) -> list:
return []
if len(tokens) == 1:
return [{"word": tokens[0], "start": start, "end": end}]
w = leading_space + tokens[0] if leading_space else tokens[0]
return [{"word": w, "start": start, "end": end}]
# 线性插值时间戳
duration = end - start
@@ -64,8 +84,11 @@ def split_word_to_chars(word: str, start: float, end: float) -> list:
for i, token in enumerate(tokens):
token_start = start + i * token_duration
token_end = start + (i + 1) * token_duration
w = token
if i == 0 and leading_space:
w = leading_space + w
result.append({
"word": token,
"word": w,
"start": round(token_start, 3),
"end": round(token_end, 3)
})
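The splitting above can be summarized as: CJK characters become one token each, runs of ASCII alphanumerics stay together, an inter-word space is carried as a prefix onto the next token (so English captions rebuild with correct spacing), and each token gets an equal slice of the word's time span. A simplified self-contained sketch of that behavior (punctuation handling from the real function omitted):

```python
def split_chars(word, start, end):
    """Split a word into tokens with linearly interpolated timestamps."""
    tokens, buf, pending = [], "", False
    for ch in word.lstrip():
        if not ch.strip():               # space: flush buffer, remember the gap
            if buf:
                tokens.append(buf)
                buf = ""
            pending = bool(tokens)
            continue
        if ch.isascii() and ch.isalnum():
            if pending and not buf:
                buf = " "                # carry the gap into the next English word
                pending = False
            buf += ch
            continue
        if buf:                          # non-ASCII char: flush any ASCII run first
            tokens.append(buf)
            buf = ""
        tokens.append((" " if pending else "") + ch)
        pending = False
    if buf:
        tokens.append(buf)
    if not tokens:
        return []
    step = (end - start) / len(tokens)
    return [{"word": t,
             "start": round(start + i * step, 3),
             "end": round(start + (i + 1) * step, 3)}
            for i, t in enumerate(tokens)]
```

With ` Hello world` over 0.0–1.0s this yields `Hello` (0.0–0.5) and ` world` (0.5–1.0); with mixed `你好ok` each CJK character and the ASCII run get their own equal slice.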
@@ -108,7 +131,7 @@ def split_segment_to_lines(words: List[dict], max_chars: int = MAX_CHARS_PER_LIN
if should_break and current_words:
segments.append({
"text": current_text,
"text": current_text.strip(),
"start": current_words[0]["start"],
"end": current_words[-1]["end"],
"words": current_words.copy()
@@ -119,7 +142,7 @@ def split_segment_to_lines(words: List[dict], max_chars: int = MAX_CHARS_PER_LIN
# 处理剩余的字
if current_words:
segments.append({
"text": current_text,
"text": current_text.strip(),
"start": current_words[0]["start"],
"end": current_words[-1]["end"],
"words": current_words.copy()
@@ -162,7 +185,9 @@ class WhisperService:
self,
audio_path: str,
text: str,
output_path: Optional[str] = None
output_path: Optional[str] = None,
language: str = "zh",
original_text: Optional[str] = None,
) -> dict:
"""
对音频进行转录,生成字级别时间戳
@@ -171,12 +196,18 @@ class WhisperService:
audio_path: 音频文件路径
text: 原始文本(用于参考,但实际使用 whisper 转录结果)
output_path: 可选,输出 JSON 文件路径
language: 语言代码 (zh/en 等)
original_text: 原始文案。非空时Whisper 仅用于检测总时间范围,
字幕文字用此原文替换(解决语言不匹配问题)
Returns:
包含字级别时间戳的字典
"""
import asyncio
# 英文等西文需要更大的每行字数
max_chars = 40 if language != "zh" else MAX_CHARS_PER_LINE
def _do_transcribe():
model = self._load_model()
@@ -185,22 +216,26 @@ class WhisperService:
# 转录并获取字级别时间戳
segments_iter, info = model.transcribe(
audio_path,
language="zh",
language=language,
word_timestamps=True, # 启用字级别时间戳
vad_filter=True, # 启用 VAD 过滤静音
)
logger.info(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
# 收集 Whisper 转录结果(始终需要,用于获取时间范围)
all_segments = []
whisper_first_start = None
whisper_last_end = None
for segment in segments_iter:
# 提取每个字的时间戳,并拆分成单字
all_words = []
if segment.words:
for word_info in segment.words:
word_text = word_info.word.strip()
if word_text:
# 将词拆分成单字,时间戳线性插值
word_text = word_info.word
if word_text.strip():
if whisper_first_start is None:
whisper_first_start = word_info.start
whisper_last_end = word_info.end
chars = split_word_to_chars(
word_text,
word_info.start,
@@ -208,11 +243,24 @@ class WhisperService:
)
all_words.extend(chars)
# 将长段落按标点和字数拆分成多行
if all_words:
line_segments = split_segment_to_lines(all_words, MAX_CHARS_PER_LINE)
line_segments = split_segment_to_lines(all_words, max_chars)
all_segments.extend(line_segments)
# 如果提供了 original_text用原文替换 Whisper 转录文字
if original_text and original_text.strip() and whisper_first_start is not None:
logger.info(f"Using original_text for subtitles (len={len(original_text)}), "
f"Whisper time range: {whisper_first_start:.2f}-{whisper_last_end:.2f}s")
# 用 split_word_to_chars 拆分原文
orig_chars = split_word_to_chars(
original_text.strip(),
whisper_first_start,
whisper_last_end
)
if orig_chars:
all_segments = split_segment_to_lines(orig_chars, max_chars)
logger.info(f"Rebuilt {len(all_segments)} subtitle segments from original text")
logger.info(f"Generated {len(all_segments)} subtitle segments")
return {"segments": all_segments}
@@ -230,12 +278,13 @@ class WhisperService:
return result
async def transcribe(self, audio_path: str) -> str:
async def transcribe(self, audio_path: str, language: str | None = None) -> str:
"""
仅转录文本(用于提取文案)
Args:
audio_path: 音频/视频文件路径
language: 语言代码None 表示自动检测
Returns:
纯文本内容
@@ -249,7 +298,7 @@ class WhisperService:
# 转录 (无需字级时间戳)
segments_iter, _ = model.transcribe(
audio_path,
language="zh",
language=language,
word_timestamps=False,
vad_filter=True,
)

View File

@@ -20,62 +20,81 @@ logger = logging.getLogger("Watchdog")
# 服务配置
SERVICES = [
{
"name": "vigent2-qwen-tts",
"url": "http://localhost:8009/health",
"name": "vigent2-cosyvoice",
"url": "http://localhost:8010/health",
"failures": 0,
"threshold": 3,
"threshold": 3, # 连续3次失败才重启3×15s ≈ 45秒容忍期
"timeout": 10.0,
"restart_cmd": ["pm2", "restart", "vigent2-qwen-tts"]
"restart_cmd": ["pm2", "restart", "vigent2-cosyvoice"],
"cooldown_until": 0, # 重启后的冷却截止时间戳
"cooldown_sec": 45, # 重启后等待45秒再开始检查
}
]
async def check_service(service):
"""检查单个服务健康状态"""
# 冷却期内跳过检查
now = time.time()
if now < service.get("cooldown_until", 0):
remaining = int(service["cooldown_until"] - now)
logger.debug(f"⏳ 服务 {service['name']} 冷却中,剩余 {remaining}s")
return True
try:
timeout = service.get("timeout", 10.0)
async with httpx.AsyncClient(timeout=timeout) as client:
response = await client.get(service["url"])
if response.status_code == 200:
# 成功
if service["failures"] > 0:
logger.info(f"✅ 服务 {service['name']} 已恢复正常")
service["failures"] = 0
return True
ready = True
try:
payload = response.json()
ready = bool(payload.get("ready", True))
except Exception:
payload = {}
if ready:
if service["failures"] > 0:
logger.info(f"✅ 服务 {service['name']} 已恢复正常")
service["failures"] = 0
return True
logger.warning(f"⚠️ 服务 {service['name']} ready=false,健康检查未通过: {payload}")
else:
logger.warning(f"⚠️ 服务 {service['name']} 返回状态码 {response.status_code}")
except Exception as e:
logger.warning(f"⚠️ 无法连接服务 {service['name']}: {str(e)}")
# 失败处理
service["failures"] += 1
logger.warning(f"❌ 服务 {service['name']} 连续失败 {service['failures']}/{service['threshold']}")
if service["failures"] >= service['threshold']:
logger.error(f"🚨 服务 {service['name']} 已达到失败阈值,正在重启...")
try:
subprocess.run(service["restart_cmd"], check=True)
logger.info(f"♻️ 服务 {service['name']} 重启命令已发送")
# 重启后给予一段宽限期 (例如 60秒) 不检查,等待服务启动
service["failures"] = 0 # 重置计数
return "restarting"
service["failures"] = 0
# 设置冷却期,等待服务完成启动和模型加载
service["cooldown_until"] = time.time() + service.get("cooldown_sec", 120)
return "restarting"
except Exception as restart_error:
logger.error(f"💥 重启服务 {service['name']} 失败: {restart_error}")
return False
async def main():
logger.info("🛡️ ViGent2 服务看门狗 (Watchdog) 已启动")
# 启动时给所有服务一个初始冷却期,避免服务还没起来就被判定失败
for service in SERVICES:
service["cooldown_until"] = time.time() + 60
while True:
# 并发检查所有服务
for service in SERVICES:
result = await check_service(service)
if result == "restarting":
# 如果有服务重启,额外等待包含启动时间
pass
# 每 30 秒检查一次
await asyncio.sleep(30)
await check_service(service)
# 每 15 秒检查一次
await asyncio.sleep(15)
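The watchdog's core decision (consecutive-failure threshold plus a post-restart cooldown during which checks are skipped, giving the service time to reload models) can be isolated into a small policy object. A sketch with an injectable clock for testing; the class and method names are illustrative:

```python
import time

class RestartPolicy:
    """Decide when a supervised service should be restarted."""

    def __init__(self, threshold=3, cooldown=45.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.cooldown_until = 0.0

    def record(self, healthy):
        """Return 'skip', 'ok', 'wait', or 'restart' for one health probe."""
        if self.clock() < self.cooldown_until:
            return "skip"                 # still in post-restart grace period
        if healthy:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures < self.threshold:
            return "wait"                 # tolerate transient failures
        self.failures = 0
        self.cooldown_until = self.clock() + self.cooldown
        return "restart"
```

With a 15-second probe interval and `threshold=3` this matches the roughly 45-second tolerance window described in the config comment above.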
if __name__ == "__main__":
try:

View File

@@ -8,6 +8,9 @@
"name": "frontend",
"version": "0.1.0",
"dependencies": {
"@dnd-kit/core": "^6.3.1",
"@dnd-kit/sortable": "^10.0.0",
"@dnd-kit/utilities": "^3.2.2",
"@supabase/supabase-js": "^2.93.1",
"axios": "^1.13.4",
"lucide-react": "^0.563.0",
@@ -15,7 +18,8 @@
"react": "19.2.3",
"react-dom": "19.2.3",
"sonner": "^2.0.7",
"swr": "^2.3.8"
"swr": "^2.3.8",
"wavesurfer.js": "^7.12.1"
},
"devDependencies": {
"@tailwindcss/postcss": "^4",
@@ -281,6 +285,59 @@
"node": ">=6.9.0"
}
},
"node_modules/@dnd-kit/accessibility": {
"version": "3.1.1",
"resolved": "https://registry.npmjs.org/@dnd-kit/accessibility/-/accessibility-3.1.1.tgz",
"integrity": "sha512-2P+YgaXF+gRsIihwwY1gCsQSYnu9Zyj2py8kY5fFvUM1qm2WA2u639R6YNVfU4GWr+ZM5mqEsfHZZLoRONbemw==",
"license": "MIT",
"dependencies": {
"tslib": "^2.0.0"
},
"peerDependencies": {
"react": ">=16.8.0"
}
},
"node_modules/@dnd-kit/core": {
"version": "6.3.1",
"resolved": "https://registry.npmjs.org/@dnd-kit/core/-/core-6.3.1.tgz",
"integrity": "sha512-xkGBRQQab4RLwgXxoqETICr6S5JlogafbhNsidmrkVv2YRs5MLwpjoF2qpiGjQt8S9AoxtIV603s0GIUpY5eYQ==",
"license": "MIT",
"dependencies": {
"@dnd-kit/accessibility": "^3.1.1",
"@dnd-kit/utilities": "^3.2.2",
"tslib": "^2.0.0"
},
"peerDependencies": {
"react": ">=16.8.0",
"react-dom": ">=16.8.0"
}
},
"node_modules/@dnd-kit/sortable": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/@dnd-kit/sortable/-/sortable-10.0.0.tgz",
"integrity": "sha512-+xqhmIIzvAYMGfBYYnbKuNicfSsk4RksY2XdmJhT+HAC01nix6fHCztU68jooFiMUB01Ky3F0FyOvhG/BZrWkg==",
"license": "MIT",
"dependencies": {
"@dnd-kit/utilities": "^3.2.2",
"tslib": "^2.0.0"
},
"peerDependencies": {
"@dnd-kit/core": "^6.3.0",
"react": ">=16.8.0"
}
},
"node_modules/@dnd-kit/utilities": {
"version": "3.2.2",
"resolved": "https://registry.npmjs.org/@dnd-kit/utilities/-/utilities-3.2.2.tgz",
"integrity": "sha512-+MKAJEOfaBe5SmV6t34p80MMKhjvUz0vRrvVJbPT0WElzaOJ/1xs+D+KDv+tD/NE5ujfrChEcshd4fLn0wpiqg==",
"license": "MIT",
"dependencies": {
"tslib": "^2.0.0"
},
"peerDependencies": {
"react": ">=16.8.0"
}
},
"node_modules/@emnapi/core": {
"version": "1.8.1",
"resolved": "https://registry.npmjs.org/@emnapi/core/-/core-1.8.1.tgz",
@@ -6611,6 +6668,12 @@
"react": "^16.8.0 || ^17.0.0 || ^18.0.0 || ^19.0.0"
}
},
"node_modules/wavesurfer.js": {
"version": "7.12.1",
"resolved": "https://registry.npmjs.org/wavesurfer.js/-/wavesurfer.js-7.12.1.tgz",
"integrity": "sha512-NswPjVHxk0Q1F/VMRemCPUzSojjuHHisQrBqQiRXg7MVbe3f5vQ6r0rTTXA/a/neC/4hnOEC4YpXca4LpH0SUg==",
"license": "BSD-3-Clause"
},
"node_modules/which": {
"version": "2.0.2",
"resolved": "https://registry.npmjs.org/which/-/which-2.0.2.tgz",

View File

@@ -9,6 +9,9 @@
"lint": "eslint"
},
"dependencies": {
"@dnd-kit/core": "^6.3.1",
"@dnd-kit/sortable": "^10.0.0",
"@dnd-kit/utilities": "^3.2.2",
"@supabase/supabase-js": "^2.93.1",
"axios": "^1.13.4",
"lucide-react": "^0.563.0",
@@ -16,7 +19,8 @@
"react": "19.2.3",
"react-dom": "19.2.3",
"sonner": "^2.0.7",
"swr": "^2.3.8"
"swr": "^2.3.8",
"wavesurfer.js": "^7.12.1"
},
"devDependencies": {
"@tailwindcss/postcss": "^4",

View File

@@ -1,8 +1,8 @@
import type { Metadata, Viewport } from "next";
import { Geist, Geist_Mono } from "next/font/google";
import "./globals.css";
import { AuthProvider } from "@/contexts/AuthContext";
import { TaskProvider } from "@/contexts/TaskContext";
import { AuthProvider } from "@/shared/contexts/AuthContext";
import { TaskProvider } from "@/shared/contexts/TaskContext";
import { Toaster } from "sonner";
@@ -46,7 +46,6 @@ export default function RootLayout({
<Toaster
position="top-center"
richColors
closeButton
toastOptions={{
duration: 3000,
className: "text-sm",

View File

@@ -1,7 +1,7 @@
"use client";
import { useState, useEffect, useRef } from "react";
import { useAuth } from "@/contexts/AuthContext";
import { useAuth } from "@/shared/contexts/AuthContext";
import api from "@/shared/api/axios";
import { ApiResponse } from "@/shared/api/types";

View File

@@ -1,6 +1,6 @@
"use client";
import { useTask } from "@/contexts/TaskContext";
import { useTask } from "@/shared/contexts/TaskContext";
import Link from "next/link";
import { usePathname } from "next/navigation";

View File

@@ -0,0 +1,193 @@
import { useCallback, useEffect, useRef, useState } from "react";
import api from "@/shared/api/axios";
import { ApiResponse, unwrap } from "@/shared/api/types";
import { toast } from "sonner";
export interface GeneratedAudio {
id: string;
name: string;
path: string;
duration_sec: number;
text: string;
tts_mode: string;
language: string;
created_at: number;
}
interface AudioTask {
status: string;
progress?: number;
message?: string;
output?: GeneratedAudio & { audio_id: string };
}
interface UseGeneratedAudiosOptions {
selectedAudioId: string | null;
setSelectedAudioId: React.Dispatch<React.SetStateAction<string | null>>;
}
export const useGeneratedAudios = ({
selectedAudioId,
setSelectedAudioId,
}: UseGeneratedAudiosOptions) => {
const [generatedAudios, setGeneratedAudios] = useState<GeneratedAudio[]>([]);
const [selectedAudio, setSelectedAudio] = useState<GeneratedAudio | null>(null);
const [isGeneratingAudio, setIsGeneratingAudio] = useState(false);
const [audioTaskId, setAudioTaskId] = useState<string | null>(null);
const [audioTask, setAudioTask] = useState<AudioTask | null>(null);
const pollRef = useRef<NodeJS.Timeout | null>(null);
const fetchGeneratedAudios = useCallback(async (selectId?: string) => {
try {
const { data: res } = await api.get<ApiResponse<{ items: GeneratedAudio[] }>>(
"/api/generated-audios"
);
const payload = unwrap(res);
const items: GeneratedAudio[] = payload.items || [];
setGeneratedAudios(items);
if (selectId && items.length > 0) {
if (selectId === "__latest__") {
setSelectedAudioId(items[0].id);
setSelectedAudio(items[0]);
} else {
const found = items.find((a) => a.id === selectId);
if (found) {
setSelectedAudioId(found.id);
setSelectedAudio(found);
}
}
}
} catch (error) {
console.error("获取配音列表失败:", error);
}
}, [setSelectedAudioId]);
// Sync selectedAudio when selectedAudioId changes externally (e.g. from persistence)
useEffect(() => {
if (!selectedAudioId || generatedAudios.length === 0) return;
const found = generatedAudios.find((a) => a.id === selectedAudioId);
if (found) {
setSelectedAudio(found);
}
}, [selectedAudioId, generatedAudios]);
const stopPolling = useCallback(() => {
if (pollRef.current) {
clearInterval(pollRef.current);
pollRef.current = null;
}
}, []);
const startPolling = useCallback((taskId: string) => {
stopPolling();
pollRef.current = setInterval(async () => {
try {
const { data: res } = await api.get<ApiResponse<AudioTask>>(
`/api/generated-audios/tasks/${taskId}`
);
const task = unwrap(res);
setAudioTask(task);
if (task.status === "completed") {
stopPolling();
setIsGeneratingAudio(false);
setAudioTaskId(null);
// Refresh list and select the new audio
await fetchGeneratedAudios("__latest__");
toast.success(task.message || "配音生成完成");
} else if (task.status === "failed") {
stopPolling();
setIsGeneratingAudio(false);
setAudioTaskId(null);
toast.error(task.message || "配音生成失败");
} else if (task.status === "not_found") {
stopPolling();
setIsGeneratingAudio(false);
setAudioTaskId(null);
setAudioTask(null);
toast.error("任务已丢失(服务可能已重启),请重新生成");
}
} catch {
// Network error, keep polling
}
}, 1000);
}, [stopPolling, fetchGeneratedAudios]);
// Cleanup on unmount
useEffect(() => {
return () => stopPolling();
}, [stopPolling]);
const generateAudio = useCallback(async (params: {
text: string;
tts_mode: string;
voice?: string;
ref_audio_id?: string;
ref_text?: string;
language: string;
speed?: number;
}) => {
setIsGeneratingAudio(true);
setAudioTask({ status: "pending", progress: 0, message: "正在提交..." });
try {
const { data: res } = await api.post<ApiResponse<{ task_id: string }>>(
"/api/generated-audios/generate",
params
);
const { task_id } = unwrap(res);
setAudioTaskId(task_id);
startPolling(task_id);
} catch (err: unknown) {
setIsGeneratingAudio(false);
setAudioTask(null);
const axiosErr = err as { response?: { data?: { message?: string } }; message?: string };
const errorMsg = axiosErr.response?.data?.message || axiosErr.message || String(err);
toast.error(`配音生成失败: ${errorMsg}`);
}
}, [startPolling]);
const deleteAudio = useCallback(async (audioId: string) => {
if (!confirm("确定要删除这个配音吗?")) return;
try {
await api.delete(`/api/generated-audios/${encodeURIComponent(audioId)}`);
if (selectedAudioId === audioId) {
setSelectedAudioId(null);
setSelectedAudio(null);
}
fetchGeneratedAudios();
} catch (error) {
toast.error("删除失败: " + error);
}
}, [fetchGeneratedAudios, selectedAudioId, setSelectedAudioId]);
const renameAudio = useCallback(async (audioId: string, newName: string) => {
try {
await api.put(`/api/generated-audios/${encodeURIComponent(audioId)}`, {
new_name: newName,
});
fetchGeneratedAudios();
} catch (err: unknown) {
toast.error("重命名失败: " + String(err));
}
}, [fetchGeneratedAudios]);
const selectAudio = useCallback((audio: GeneratedAudio) => {
setSelectedAudioId(audio.id);
setSelectedAudio(audio);
}, [setSelectedAudioId]);
return {
generatedAudios,
selectedAudio,
selectedAudioId,
isGeneratingAudio,
audioTask,
fetchGeneratedAudios,
generateAudio,
deleteAudio,
renameAudio,
selectAudio,
};
};
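The hook polls `/api/generated-audios/tasks/{id}` once per second and stops on any terminal status, including `not_found` when the backend has restarted and lost its in-memory task. The client-side mapping from polled payload to next action can be summarized language-agnostically; a Python sketch (payload shape assumed from the hook above, not from backend documentation):

```python
TERMINAL = {"completed", "failed", "not_found"}

def next_action(task):
    """Map a polled task payload to what the client should do next."""
    status = task.get("status", "pending")
    if status == "completed":
        return ("refresh_and_select_latest", task.get("message") or "done")
    if status == "failed":
        return ("show_error", task.get("message") or "generation failed")
    if status == "not_found":
        return ("show_error", "task lost (service restarted)")
    return ("keep_polling", None)

def is_terminal(task):
    """True once polling should stop and the UI spinner should clear."""
    return task.get("status") in TERMINAL
```

Treating `not_found` as terminal is the important detail: without it, a backend restart would leave the client polling (and the UI in a "generating" state) forever.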

View File

@@ -11,33 +11,84 @@ import {
} from "@/shared/lib/media";
import { clampTitle } from "@/shared/lib/title";
import { useTitleInput } from "@/shared/hooks/useTitleInput";
import { useAuth } from "@/contexts/AuthContext";
import { useTask } from "@/contexts/TaskContext";
import { useAuth } from "@/shared/contexts/AuthContext";
import { useTask } from "@/shared/contexts/TaskContext";
import { toast } from "sonner";
import { usePublishPrefetch } from "@/shared/hooks/usePublishPrefetch";
import { PublishAccount } from "@/shared/types/publish";
import { useBgm } from "@/features/home/model/useBgm";
import { useGeneratedVideos } from "@/features/home/model/useGeneratedVideos";
import { useGeneratedAudios } from "@/features/home/model/useGeneratedAudios";
import { useHomePersistence } from "@/features/home/model/useHomePersistence";
import { useMaterials } from "@/features/home/model/useMaterials";
import { useMediaPlayers } from "@/features/home/model/useMediaPlayers";
import { useRefAudios } from "@/features/home/model/useRefAudios";
import { useTitleSubtitleStyles } from "@/features/home/model/useTitleSubtitleStyles";
import { useTimelineEditor } from "@/features/home/model/useTimelineEditor";
import { useSavedScripts } from "@/features/home/model/useSavedScripts";
import { ApiResponse, unwrap } from "@/shared/api/types";
const VOICES = [
{ id: "zh-CN-YunxiNeural", name: "云溪 (男声-年轻)" },
{ id: "zh-CN-YunjianNeural", name: "云健 (男声-新闻)" },
{ id: "zh-CN-YunyangNeural", name: "云扬 (男声-专业)" },
{ id: "zh-CN-XiaoxiaoNeural", name: "晓晓 (女声-活泼)" },
{ id: "zh-CN-XiaoyiNeural", name: "晓伊 (女声-温柔)" },
];
const VOICES: Record<string, { id: string; name: string }[]> = {
"zh-CN": [
{ id: "zh-CN-YunxiNeural", name: "云希 (男声-年轻)" },
{ id: "zh-CN-YunjianNeural", name: "云健 (男声-新闻)" },
{ id: "zh-CN-YunyangNeural", name: "云扬 (男声-专业)" },
{ id: "zh-CN-XiaoxiaoNeural", name: "晓晓 (女声-活泼)" },
{ id: "zh-CN-XiaoyiNeural", name: "晓伊 (女声-温柔)" },
],
"en-US": [
{ id: "en-US-GuyNeural", name: "Guy (Male)" },
{ id: "en-US-JennyNeural", name: "Jenny (Female)" },
],
"ja-JP": [
{ id: "ja-JP-KeitaNeural", name: "圭太 (男声)" },
{ id: "ja-JP-NanamiNeural", name: "七海 (女声)" },
],
"ko-KR": [
{ id: "ko-KR-InJoonNeural", name: "인준 (男声)" },
{ id: "ko-KR-SunHiNeural", name: "선히 (女声)" },
],
"fr-FR": [
{ id: "fr-FR-HenriNeural", name: "Henri (Male)" },
{ id: "fr-FR-DeniseNeural", name: "Denise (Female)" },
],
"de-DE": [
{ id: "de-DE-ConradNeural", name: "Conrad (Male)" },
{ id: "de-DE-KatjaNeural", name: "Katja (Female)" },
],
"es-ES": [
{ id: "es-ES-AlvaroNeural", name: "Álvaro (Male)" },
{ id: "es-ES-ElviraNeural", name: "Elvira (Female)" },
],
"ru-RU": [
{ id: "ru-RU-DmitryNeural", name: "Дмитрий (Male)" },
{ id: "ru-RU-SvetlanaNeural", name: "Светлана (Female)" },
],
"it-IT": [
{ id: "it-IT-DiegoNeural", name: "Diego (Male)" },
{ id: "it-IT-ElsaNeural", name: "Elsa (Female)" },
],
"pt-BR": [
{ id: "pt-BR-AntonioNeural", name: "Antonio (Male)" },
{ id: "pt-BR-FranciscaNeural", name: "Francisca (Female)" },
],
};
const LANG_TO_LOCALE: Record<string, string> = {
"中文": "zh-CN",
"English": "en-US",
"日本語": "ja-JP",
"한국어": "ko-KR",
"Français": "fr-FR",
"Deutsch": "de-DE",
"Español": "es-ES",
"Русский": "ru-RU",
"Italiano": "it-IT",
"Português": "pt-BR",
};
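After a translation, the controller maps the UI language label to a locale via `LANG_TO_LOCALE` and, in Edge TTS mode, picks that locale's first voice as the default. A condensed Python sketch of the lookup (tables trimmed to two locales; unknown labels fall back to `zh-CN` as in the hook):

```python
VOICES = {
    "zh-CN": ["zh-CN-YunxiNeural", "zh-CN-XiaoxiaoNeural"],
    "en-US": ["en-US-GuyNeural", "en-US-JennyNeural"],
}
LANG_TO_LOCALE = {"中文": "zh-CN", "English": "en-US"}

def default_voice(target_lang):
    """Resolve a translation target label to (locale, default voice id)."""
    locale = LANG_TO_LOCALE.get(target_lang, "zh-CN")   # unknown label -> zh-CN
    voices = VOICES.get(locale, VOICES["zh-CN"])
    return locale, voices[0]
```

Keeping the fallback in one place means `handleTranslate` and `handleRestoreOriginal` cannot drift apart on what the default voice should be.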
const FIXED_REF_TEXT =
"其实生活中有许多美好的瞬间,比如清晨的阳光,或者一杯温热的清茶。希望这次生成的音色能够自然、流畅,完美还原出我最真实的声音状态。";
const scrollContainerToItem = (container: HTMLDivElement, item: HTMLDivElement) => {
const containerRect = container.getBoundingClientRect();
const itemRect = item.getBoundingClientRect();
@@ -70,22 +121,17 @@ interface RefAudio {
created_at: number;
}
interface Material {
id: string;
name: string;
path: string;
size_mb: number;
scene?: string;
}
import type { Material } from "@/shared/types/material";
export const useHomeController = () => {
const apiBase = getApiBaseUrl();
const [selectedMaterial, setSelectedMaterial] = useState<string>("");
const [selectedMaterials, setSelectedMaterials] = useState<string[]>([]);
const [previewMaterial, setPreviewMaterial] = useState<string | null>(null);
const [text, setText] = useState<string>("");
const [voice, setVoice] = useState<string>("zh-CN-YunxiNeural");
const [textLang, setTextLang] = useState<string>("zh-CN");
// 使用全局任务状态
const { currentTask, isGenerating, startTask } = useTask();
@@ -96,7 +142,6 @@ export const useHomeController = () => {
// 字幕和标题相关状态
const [videoTitle, setVideoTitle] = useState<string>("");
const [enableSubtitles, setEnableSubtitles] = useState<boolean>(true);
const [selectedSubtitleStyleId, setSelectedSubtitleStyleId] = useState<string>("");
const [selectedTitleStyleId, setSelectedTitleStyleId] = useState<string>("");
const [subtitleFontSize, setSubtitleFontSize] = useState<number>(80);
@@ -105,6 +150,7 @@ export const useHomeController = () => {
const [titleSizeLocked, setTitleSizeLocked] = useState<boolean>(false);
const [titleTopMargin, setTitleTopMargin] = useState<number>(62);
const [subtitleBottomMargin, setSubtitleBottomMargin] = useState<number>(80);
const [outputAspectRatio, setOutputAspectRatio] = useState<"9:16" | "16:9">("9:16");
const [showStylePreview, setShowStylePreview] = useState<boolean>(false);
const [materialDimensions, setMaterialDimensions] = useState<{ width: number; height: number } | null>(null);
@@ -117,7 +163,17 @@ export const useHomeController = () => {
// 声音克隆相关状态
const [ttsMode, setTtsMode] = useState<"edgetts" | "voiceclone">("edgetts");
const [selectedRefAudio, setSelectedRefAudio] = useState<RefAudio | null>(null);
const [refText, setRefText] = useState(FIXED_REF_TEXT);
const [refText, setRefText] = useState("");
// 预生成配音选中 ID
const [selectedAudioId, setSelectedAudioId] = useState<string | null>(null);
// 语速控制
const [speed, setSpeed] = useState<number>(1.0);
// ClipTrimmer 模态框状态
const [clipTrimmerOpen, setClipTrimmerOpen] = useState(false);
const [clipTrimmerSegmentId, setClipTrimmerSegmentId] = useState<string | null>(null);
// 音频预览与重命名状态
const [editingAudioId, setEditingAudioId] = useState<string | null>(null);
@@ -181,8 +237,8 @@ export const useHomeController = () => {
{ new_name: editMaterialName.trim() }
);
const payload = unwrap(res);
if (selectedMaterial === materialId && payload?.id) {
setSelectedMaterial(payload.id);
if (selectedMaterials.includes(materialId) && payload?.id) {
setSelectedMaterials((prev) => prev.map((x) => (x === materialId ? payload.id : x)));
}
setEditingMaterialId(null);
setEditMaterialName("");
@@ -197,6 +253,10 @@ export const useHomeController = () => {
// AI 生成标题标签
const [isGeneratingMeta, setIsGeneratingMeta] = useState(false);
// AI 多语言翻译
const [isTranslating, setIsTranslating] = useState(false);
const [originalText, setOriginalText] = useState<string | null>(null);
// 在线录音相关
const [isRecording, setIsRecording] = useState(false);
const [recordedBlob, setRecordedBlob] = useState<Blob | null>(null);
@@ -226,11 +286,12 @@ export const useHomeController = () => {
uploadError,
setUploadError,
fetchMaterials,
toggleMaterial,
deleteMaterial,
handleUpload,
} = useMaterials({
selectedMaterial,
setSelectedMaterial,
selectedMaterials,
setSelectedMaterials,
});
const {
@@ -253,8 +314,9 @@ export const useHomeController = () => {
fetchRefAudios,
uploadRefAudio,
deleteRefAudio,
retranscribeRefAudio,
retranscribingId,
} = useRefAudios({
fixedRefText: FIXED_REF_TEXT,
selectedRefAudio,
setSelectedRefAudio,
setRefText,
@@ -296,6 +358,33 @@ export const useHomeController = () => {
resolveMediaUrl,
});
const {
generatedAudios,
selectedAudio,
isGeneratingAudio,
audioTask,
fetchGeneratedAudios,
generateAudio,
deleteAudio,
renameAudio,
selectAudio,
} = useGeneratedAudios({
selectedAudioId,
setSelectedAudioId,
});
const {
segments: timelineSegments,
reorderSegments,
setSourceRange,
toCustomAssignments,
} = useTimelineEditor({
audioDuration: selectedAudio?.duration_sec ?? 0,
materials,
selectedMaterials,
storageKey,
});
useEffect(() => {
if (isAuthLoading || !userId) return;
let active = true;
@@ -338,14 +427,14 @@ export const useHomeController = () => {
setText,
videoTitle,
setVideoTitle,
enableSubtitles,
setEnableSubtitles,
ttsMode,
setTtsMode,
voice,
setVoice,
selectedMaterial,
setSelectedMaterial,
textLang,
setTextLang,
selectedMaterials,
setSelectedMaterials,
selectedSubtitleStyleId,
setSelectedSubtitleStyleId,
selectedTitleStyleId,
@@ -360,6 +449,8 @@ export const useHomeController = () => {
setTitleTopMargin,
subtitleBottomMargin,
setSubtitleBottomMargin,
outputAspectRatio,
setOutputAspectRatio,
selectedBgmId,
setSelectedBgmId,
bgmVolume,
@@ -369,8 +460,20 @@ export const useHomeController = () => {
selectedVideoId,
setSelectedVideoId,
selectedRefAudio,
selectedAudioId,
setSelectedAudioId,
speed,
setSpeed,
});
const { savedScripts, saveScript, deleteScript: deleteSavedScript } = useSavedScripts(storageKey);
const handleSaveScript = () => {
if (!text.trim()) return;
saveScript(text);
toast.success("文案已保存");
};
const syncTitleToPublish = (value: string) => {
if (typeof window !== "undefined") {
localStorage.setItem(`vigent_${storageKey}_publish_title`, value);
@@ -390,6 +493,7 @@ export const useHomeController = () => {
fetchMaterials(),
fetchGeneratedVideos(),
fetchRefAudios(),
fetchGeneratedAudios(),
refreshSubtitleStyles(),
refreshTitleStyles(),
fetchBgmList(),
@@ -410,7 +514,8 @@ export const useHomeController = () => {
}, [isGenerating, currentTask, fetchGeneratedVideos]);
useEffect(() => {
const material = materials.find((item) => item.id === selectedMaterial);
const firstSelected = selectedMaterials[0];
const material = materials.find((item) => item.id === firstSelected);
if (!material?.path) {
setMaterialDimensions(null);
return;
@@ -423,7 +528,6 @@ export const useHomeController = () => {
let isActive = true;
const video = document.createElement("video");
video.crossOrigin = "anonymous";
video.preload = "metadata";
video.src = url;
video.load();
@@ -450,7 +554,7 @@ export const useHomeController = () => {
video.removeEventListener("loadedmetadata", handleLoaded);
video.removeEventListener("error", handleError);
};
}, [materials, selectedMaterial]);
}, [materials, selectedMaterials]);
useEffect(() => {
@@ -485,13 +589,22 @@ export const useHomeController = () => {
}
}, [selectedBgmId, bgmList]);
// 素材列表滚动:跳过首次恢复,仅用户主动操作时滚动
const materialScrollReady = useRef(false);
useEffect(() => {
if (!selectedMaterial) return;
const target = materialItemRefs.current[selectedMaterial];
const firstSelected = selectedMaterials[0];
if (!firstSelected) return;
if (!materialScrollReady.current) {
// 首次有选中素材时标记就绪,但不滚动(避免刷新后整页跳动)
materialScrollReady.current = true;
return;
}
const target = materialItemRefs.current[firstSelected];
if (target) {
target.scrollIntoView({ block: "nearest", behavior: "smooth" });
}
}, [selectedMaterial, materials]);
// eslint-disable-next-line react-hooks/exhaustive-deps
}, [selectedMaterials.length]);
// 【修复】历史视频默认选中逻辑
// 当持久化恢复完成,且列表加载完毕,如果没选中任何视频,默认选中第一个
@@ -501,7 +614,7 @@ export const useHomeController = () => {
setSelectedVideoId(firstId);
setGeneratedVideo(resolveMediaUrl(generatedVideos[0].path));
}
}, [isRestored, generatedVideos, selectedVideoId, setSelectedVideoId, setGeneratedVideo, resolveMediaUrl]);
}, [isRestored, generatedVideos, selectedVideoId, setSelectedVideoId, setGeneratedVideo]);
// 【修复】BGM 默认选中逻辑
useEffect(() => {
@@ -510,8 +623,14 @@ export const useHomeController = () => {
}
}, [isRestored, bgmList, selectedBgmId, enableBgm, setSelectedBgmId]);
const videoScrollReady = useRef(false);
useEffect(() => {
if (!selectedVideoId) return;
if (!videoScrollReady.current) {
videoScrollReady.current = true;
return;
}
const target = videoItemRefs.current[selectedVideoId];
if (target) {
target.scrollIntoView({ block: "nearest", behavior: "smooth" });
@@ -639,19 +758,88 @@ export const useHomeController = () => {
}
};
// AI 多语言翻译
const handleTranslate = async (targetLang: string) => {
if (!text.trim()) {
toast.error("请先输入口播文案");
return;
}
// 首次翻译时保存原文
if (originalText === null) {
setOriginalText(text);
}
setIsTranslating(true);
try {
const { data: res } = await api.post<ApiResponse<{ translated_text: string }>>(
"/api/ai/translate",
{ text: text.trim(), target_lang: targetLang }
);
const payload = unwrap(res);
setText(payload.translated_text || "");
// 根据翻译目标语言更新 textLang 并自动切换声音
const locale = LANG_TO_LOCALE[targetLang] || "zh-CN";
setTextLang(locale);
if (ttsMode === "edgetts") {
const langVoices = VOICES[locale] || VOICES["zh-CN"];
setVoice(langVoices[0].id);
}
} catch (err: unknown) {
console.error("AI translate failed:", err);
const axiosErr = err as { response?: { data?: { message?: string } }; message?: string };
const errorMsg = axiosErr.response?.data?.message || axiosErr.message || String(err);
toast.error(`AI 翻译失败: ${errorMsg}`);
} finally {
setIsTranslating(false);
}
};
const handleRestoreOriginal = () => {
if (originalText !== null) {
setText(originalText);
setOriginalText(null);
setTextLang("zh-CN");
if (ttsMode === "edgetts") {
setVoice(VOICES["zh-CN"][0].id);
}
}
};
// 生成配音
const handleGenerateAudio = async () => {
if (!text.trim()) {
toast.error("请先输入文案");
return;
}
if (ttsMode === "voiceclone" && !selectedRefAudio) {
toast.error("请选择参考音频");
return;
}
const params = {
text: text.trim(),
tts_mode: ttsMode,
voice: ttsMode === "edgetts" ? voice : undefined,
ref_audio_id: ttsMode === "voiceclone" ? selectedRefAudio!.id : undefined,
ref_text: ttsMode === "voiceclone" ? refText : undefined,
language: textLang,
speed: ttsMode === "voiceclone" ? speed : undefined,
};
await generateAudio(params);
};
// 生成视频
const handleGenerate = async () => {
if (!selectedMaterial || !text.trim()) {
if (selectedMaterials.length === 0 || !text.trim()) {
toast.error("请先选择素材并填写文案");
return;
}
// 声音克隆模式校验
if (ttsMode === "voiceclone") {
if (!selectedRefAudio) {
toast.error("请选择或上传参考音频");
return;
}
if (!selectedAudio) {
toast.error("请先生成并选中配音");
return;
}
if (enableBgm && !selectedBgmId) {
@@ -663,26 +851,81 @@ export const useHomeController = () => {
try {
// 查找选中的素材对象以获取路径
const materialObj = materials.find((m) => m.id === selectedMaterial);
if (!materialObj) {
const firstMaterialObj = materials.find((m) => m.id === selectedMaterials[0]);
if (!firstMaterialObj) {
toast.error("素材数据异常");
return;
}
// 构建请求参数
// 构建请求参数 - 使用预生成配音
const payload: Record<string, unknown> = {
material_path: materialObj.path,
text: text,
tts_mode: ttsMode,
material_path: firstMaterialObj.path,
text: selectedAudio.text || text,
generated_audio_id: selectedAudio.id,
language: selectedAudio.language || textLang,
title: videoTitle.trim() || undefined,
enable_subtitles: enableSubtitles,
enable_subtitles: true,
output_aspect_ratio: outputAspectRatio,
};
if (enableSubtitles && selectedSubtitleStyleId) {
// 多素材
if (selectedMaterials.length > 1) {
const timelineOrderedIds = timelineSegments
.map((seg) => seg.materialId)
.filter((id, index, arr) => arr.indexOf(id) === index);
const orderedMaterialIds = [
...timelineOrderedIds.filter((id) => selectedMaterials.includes(id)),
...selectedMaterials.filter((id) => !timelineOrderedIds.includes(id)),
];
const materialPaths = orderedMaterialIds
.map((id) => materials.find((x) => x.id === id)?.path)
.filter((path): path is string => !!path);
if (materialPaths.length === 0) {
toast.error("多素材解析失败,请刷新素材后重试");
return;
}
payload.material_paths = materialPaths;
payload.material_path = materialPaths[0];
// 发送自定义时间轴分配
const assignments = toCustomAssignments();
if (assignments.length > 0) {
const assignmentPaths = assignments
.map((a) => a.material_path)
.filter((path): path is string => !!path);
if (assignmentPaths.length === assignments.length) {
// Timeline-visible segments are authoritative: materials beyond the timeline are excluded from this run
payload.material_paths = assignmentPaths;
payload.material_path = assignmentPaths[0];
}
payload.custom_assignments = assignments;
} else {
console.warn(
"[Timeline] custom_assignments is empty; falling back to backend auto-assignment",
{ materials: materialPaths.length }
);
}
}
// Single material + trim range
const singleSeg = timelineSegments[0];
if (
selectedMaterials.length === 1
&& singleSeg
&& (singleSeg.sourceStart > 0 || singleSeg.sourceEnd > 0)
) {
payload.custom_assignments = toCustomAssignments();
}
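The multi-material branch above merges two orderings: materials keep their first-appearance order on the timeline, and any selected materials missing from the timeline are appended in selection order. A standalone sketch of that rule (`orderMaterials` is a hypothetical helper, shown only for illustration):

```typescript
// Timeline order wins; selected materials absent from the timeline are
// appended afterwards in selection order. Duplicate timeline entries are
// deduplicated by first appearance. (Hypothetical helper, not part of the hook.)
function orderMaterials(timelineIds: string[], selectedIds: string[]): string[] {
  const deduped = timelineIds.filter((id, i, arr) => arr.indexOf(id) === i);
  return [
    ...deduped.filter((id) => selectedIds.includes(id)),
    ...selectedIds.filter((id) => !deduped.includes(id)),
  ];
}
```

This mirrors the `timelineOrderedIds`/`orderedMaterialIds` computation in `handleGenerate` without the surrounding state.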
if (selectedSubtitleStyleId) {
payload.subtitle_style_id = selectedSubtitleStyleId;
}
if (subtitleFontSize) {
payload.subtitle_font_size = Math.round(subtitleFontSize);
}
@@ -698,22 +941,13 @@ export const useHomeController = () => {
payload.title_top_margin = Math.round(titleTopMargin);
}
payload.subtitle_bottom_margin = Math.round(subtitleBottomMargin);
if (enableBgm && selectedBgmId) {
payload.bgm_id = selectedBgmId;
payload.bgm_volume = bgmVolume;
}
// Create the generation task
const { data: res } = await api.post<ApiResponse<{ task_id: string }>>(
"/api/videos/generate",
@@ -774,8 +1008,8 @@ export const useHomeController = () => {
fetchMaterials,
deleteMaterial,
handleUpload,
selectedMaterials,
toggleMaterial,
handlePreviewMaterial,
editingMaterialId,
editMaterialName,
@@ -789,6 +1023,13 @@ export const useHomeController = () => {
setExtractModalOpen,
handleGenerateMeta,
isGeneratingMeta,
handleTranslate,
isTranslating,
originalText,
handleRestoreOriginal,
savedScripts,
handleSaveScript,
deleteSavedScript,
showStylePreview,
setShowStylePreview,
videoTitle,
@@ -809,17 +1050,18 @@ export const useHomeController = () => {
setTitleTopMargin,
subtitleBottomMargin,
setSubtitleBottomMargin,
outputAspectRatio,
setOutputAspectRatio,
resolveAssetUrl,
getFontFormat,
buildTextShadow,
materialDimensions,
ttsMode,
setTtsMode,
voices: VOICES[textLang] || VOICES["zh-CN"],
voice,
setVoice,
textLang,
refAudios,
selectedRefAudio,
handleSelectRefAudio,
@@ -837,6 +1079,8 @@ export const useHomeController = () => {
saveEditing,
cancelEditing,
deleteRefAudio,
retranscribeRefAudio,
retranscribingId,
recordedBlob,
isRecording,
recordingTime,
@@ -844,7 +1088,6 @@ export const useHomeController = () => {
stopRecording,
useRecording,
formatRecordingTime,
bgmList,
bgmLoading,
bgmError,
@@ -870,5 +1113,24 @@ export const useHomeController = () => {
fetchGeneratedVideos,
registerVideoRef,
formatDate,
generatedAudios,
selectedAudio,
selectedAudioId,
isGeneratingAudio,
audioTask,
fetchGeneratedAudios,
handleGenerateAudio,
deleteAudio,
renameAudio,
selectAudio,
speed,
setSpeed,
timelineSegments,
reorderSegments,
setSourceRange,
clipTrimmerOpen,
setClipTrimmerOpen,
clipTrimmerSegmentId,
setClipTrimmerSegmentId,
};
};

View File

@@ -17,14 +17,14 @@ interface UseHomePersistenceOptions {
setText: React.Dispatch<React.SetStateAction<string>>;
videoTitle: string;
setVideoTitle: React.Dispatch<React.SetStateAction<string>>;
ttsMode: 'edgetts' | 'voiceclone';
setTtsMode: React.Dispatch<React.SetStateAction<'edgetts' | 'voiceclone'>>;
voice: string;
setVoice: React.Dispatch<React.SetStateAction<string>>;
textLang: string;
setTextLang: React.Dispatch<React.SetStateAction<string>>;
selectedMaterials: string[];
setSelectedMaterials: React.Dispatch<React.SetStateAction<string[]>>;
selectedSubtitleStyleId: string;
setSelectedSubtitleStyleId: React.Dispatch<React.SetStateAction<string>>;
selectedTitleStyleId: string;
@@ -39,6 +39,8 @@ interface UseHomePersistenceOptions {
setTitleTopMargin: React.Dispatch<React.SetStateAction<number>>;
subtitleBottomMargin: number;
setSubtitleBottomMargin: React.Dispatch<React.SetStateAction<number>>;
outputAspectRatio: '9:16' | '16:9';
setOutputAspectRatio: React.Dispatch<React.SetStateAction<'9:16' | '16:9'>>;
selectedBgmId: string;
setSelectedBgmId: React.Dispatch<React.SetStateAction<string>>;
bgmVolume: number;
@@ -48,6 +50,10 @@ interface UseHomePersistenceOptions {
selectedVideoId: string | null;
setSelectedVideoId: React.Dispatch<React.SetStateAction<string | null>>;
selectedRefAudio: RefAudio | null;
selectedAudioId: string | null;
setSelectedAudioId: React.Dispatch<React.SetStateAction<string | null>>;
speed: number;
setSpeed: React.Dispatch<React.SetStateAction<number>>;
}
export const useHomePersistence = ({
@@ -57,14 +63,14 @@ export const useHomePersistence = ({
setText,
videoTitle,
setVideoTitle,
ttsMode,
setTtsMode,
voice,
setVoice,
textLang,
setTextLang,
selectedMaterials,
setSelectedMaterials,
selectedSubtitleStyleId,
setSelectedSubtitleStyleId,
selectedTitleStyleId,
@@ -79,6 +85,8 @@ export const useHomePersistence = ({
setTitleTopMargin,
subtitleBottomMargin,
setSubtitleBottomMargin,
outputAspectRatio,
setOutputAspectRatio,
selectedBgmId,
setSelectedBgmId,
bgmVolume,
@@ -88,6 +96,10 @@ export const useHomePersistence = ({
selectedVideoId,
setSelectedVideoId,
selectedRefAudio,
selectedAudioId,
setSelectedAudioId,
speed,
setSpeed,
}: UseHomePersistenceOptions) => {
const [isRestored, setIsRestored] = useState(false);
@@ -96,9 +108,9 @@ export const useHomePersistence = ({
const savedText = localStorage.getItem(`vigent_${storageKey}_text`);
const savedTitle = localStorage.getItem(`vigent_${storageKey}_title`);
const savedTtsMode = localStorage.getItem(`vigent_${storageKey}_ttsMode`);
const savedVoice = localStorage.getItem(`vigent_${storageKey}_voice`);
const savedTextLang = localStorage.getItem(`vigent_${storageKey}_textLang`);
const savedMaterial = localStorage.getItem(`vigent_${storageKey}_material`);
const savedSubtitleStyle = localStorage.getItem(`vigent_${storageKey}_subtitleStyle`);
const savedTitleStyle = localStorage.getItem(`vigent_${storageKey}_titleStyle`);
@@ -106,18 +118,33 @@ export const useHomePersistence = ({
const savedTitleFontSize = localStorage.getItem(`vigent_${storageKey}_titleFontSize`);
const savedBgmId = localStorage.getItem(`vigent_${storageKey}_bgmId`);
const savedSelectedVideoId = localStorage.getItem(`vigent_${storageKey}_selectedVideoId`);
const savedSelectedAudioId = localStorage.getItem(`vigent_${storageKey}_selectedAudioId`);
const savedBgmVolume = localStorage.getItem(`vigent_${storageKey}_bgmVolume`);
const savedEnableBgm = localStorage.getItem(`vigent_${storageKey}_enableBgm`);
const savedTitleTopMargin = localStorage.getItem(`vigent_${storageKey}_titleTopMargin`);
const savedSubtitleBottomMargin = localStorage.getItem(`vigent_${storageKey}_subtitleBottomMargin`);
const savedOutputAspectRatio = localStorage.getItem(`vigent_${storageKey}_outputAspectRatio`);
const savedSpeed = localStorage.getItem(`vigent_${storageKey}_speed`);
setText(savedText || "大家好,欢迎来到我的频道,今天给大家分享一些有趣的内容。");
setVideoTitle(savedTitle ? clampTitle(savedTitle) : "");
setTtsMode((savedTtsMode as 'edgetts' | 'voiceclone') || 'edgetts');
setVoice(savedVoice || "zh-CN-YunxiNeural");
if (savedTextLang) setTextLang(savedTextLang);
if (savedMaterial) {
try {
const parsed = JSON.parse(savedMaterial);
if (Array.isArray(parsed)) {
setSelectedMaterials(parsed);
} else {
setSelectedMaterials([savedMaterial]);
}
} catch {
// Legacy format: a single string
setSelectedMaterials([savedMaterial]);
}
}
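The restore path above migrates between two storage formats for the material key. The parsing rule can be isolated as a small pure function (`parseSavedMaterials` is a hypothetical helper that only illustrates the migration):

```typescript
// New format: JSON array of material ids. Legacy format: a bare id string,
// which either fails JSON.parse or parses to a non-array value.
function parseSavedMaterials(raw: string): string[] {
  try {
    const parsed = JSON.parse(raw);
    if (Array.isArray(parsed)) return parsed;
  } catch {
    // fall through to legacy handling
  }
  return [raw];
}
```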
if (savedSubtitleStyle) setSelectedSubtitleStyleId(savedSubtitleStyle);
if (savedTitleStyle) setSelectedTitleStyleId(savedTitleStyle);
@@ -141,6 +168,7 @@ export const useHomePersistence = ({
if (savedBgmVolume) setBgmVolume(parseFloat(savedBgmVolume));
if (savedEnableBgm !== null) setEnableBgm(savedEnableBgm === 'true');
if (savedSelectedVideoId) setSelectedVideoId(savedSelectedVideoId);
if (savedSelectedAudioId) setSelectedAudioId(savedSelectedAudioId);
if (savedTitleTopMargin) {
const parsed = parseInt(savedTitleTopMargin, 10);
@@ -151,25 +179,37 @@ export const useHomePersistence = ({
if (!Number.isNaN(parsed)) setSubtitleBottomMargin(parsed);
}
if (savedOutputAspectRatio === '9:16' || savedOutputAspectRatio === '16:9') {
setOutputAspectRatio(savedOutputAspectRatio);
}
if (savedSpeed) {
const parsed = parseFloat(savedSpeed);
if (!Number.isNaN(parsed)) setSpeed(parsed);
}
// eslint-disable-next-line react-hooks/set-state-in-effect
setIsRestored(true);
}, [
isAuthLoading,
setBgmVolume,
setEnableBgm,
setSelectedBgmId,
setSelectedMaterials,
setSelectedSubtitleStyleId,
setSelectedTitleStyleId,
setSelectedVideoId,
setSelectedAudioId,
setSpeed,
setSubtitleFontSize,
setSubtitleSizeLocked,
setText,
setTextLang,
setTitleFontSize,
setTitleSizeLocked,
setTitleTopMargin,
setSubtitleBottomMargin,
setOutputAspectRatio,
setTtsMode,
setVideoTitle,
setVoice,
@@ -192,10 +232,6 @@ export const useHomePersistence = ({
return () => clearTimeout(timeout);
}, [videoTitle, storageKey, isRestored]);
useEffect(() => {
if (isRestored) localStorage.setItem(`vigent_${storageKey}_ttsMode`, ttsMode);
}, [ttsMode, storageKey, isRestored]);
@@ -205,10 +241,14 @@ export const useHomePersistence = ({
}, [voice, storageKey, isRestored]);
useEffect(() => {
if (isRestored) localStorage.setItem(`vigent_${storageKey}_textLang`, textLang);
}, [textLang, storageKey, isRestored]);
useEffect(() => {
if (isRestored && selectedMaterials.length > 0) {
localStorage.setItem(`vigent_${storageKey}_material`, JSON.stringify(selectedMaterials));
}
}, [selectedMaterials, storageKey, isRestored]);
useEffect(() => {
if (isRestored && selectedSubtitleStyleId) {
@@ -246,6 +286,12 @@ export const useHomePersistence = ({
}
}, [subtitleBottomMargin, storageKey, isRestored]);
useEffect(() => {
if (isRestored) {
localStorage.setItem(`vigent_${storageKey}_outputAspectRatio`, outputAspectRatio);
}
}, [outputAspectRatio, storageKey, isRestored]);
useEffect(() => {
if (isRestored) {
localStorage.setItem(`vigent_${storageKey}_bgmId`, selectedBgmId);
@@ -275,11 +321,26 @@ export const useHomePersistence = ({
}
}, [selectedVideoId, storageKey, isRestored]);
useEffect(() => {
if (!isRestored) return;
if (selectedAudioId) {
localStorage.setItem(`vigent_${storageKey}_selectedAudioId`, selectedAudioId);
} else {
localStorage.removeItem(`vigent_${storageKey}_selectedAudioId`);
}
}, [selectedAudioId, storageKey, isRestored]);
useEffect(() => {
if (isRestored && selectedRefAudio) {
localStorage.setItem(`vigent_${storageKey}_refAudioId`, selectedRefAudio.id);
}
}, [selectedRefAudio, storageKey, isRestored]);
useEffect(() => {
if (isRestored) {
localStorage.setItem(`vigent_${storageKey}_speed`, String(speed));
}
}, [speed, storageKey, isRestored]);
return { isRestored };
};

View File

@@ -2,23 +2,44 @@ import { useCallback, useState } from "react";
import api from "@/shared/api/axios";
import { ApiResponse, unwrap } from "@/shared/api/types";
import { toast } from "sonner";
import { resolveMediaUrl } from "@/shared/lib/media";
import type { Material } from "@/shared/types/material";
/** Probe video duration from a URL using <video> element */
function probeVideoDuration(url: string): Promise<number> {
return new Promise((resolve) => {
const video = document.createElement("video");
video.preload = "metadata";
video.crossOrigin = "anonymous";
const cleanup = () => {
video.removeEventListener("loadedmetadata", onMeta);
video.removeEventListener("error", onError);
video.src = "";
};
const onMeta = () => {
const dur = video.duration;
cleanup();
resolve(Number.isFinite(dur) ? dur : 0);
};
const onError = () => {
cleanup();
resolve(0);
};
video.addEventListener("loadedmetadata", onMeta);
video.addEventListener("error", onError);
video.src = url;
video.load();
});
}
interface UseMaterialsOptions {
selectedMaterials: string[];
setSelectedMaterials: React.Dispatch<React.SetStateAction<string[]>>;
}
export const useMaterials = ({
selectedMaterials,
setSelectedMaterials,
}: UseMaterialsOptions) => {
const [materials, setMaterials] = useState<Material[]>([]);
const [fetchError, setFetchError] = useState<string | null>(null);
@@ -41,12 +62,25 @@ export const useMaterials = ({
setMaterials(nextMaterials);
setLastMaterialCount(nextMaterials.length);
// Probe video durations in background
if (nextMaterials.length > 0) {
Promise.all(
nextMaterials.map(async (m) => {
const url = resolveMediaUrl(m.path);
if (!url) return m;
const dur = await probeVideoDuration(url);
return { ...m, duration_sec: dur };
})
).then((enriched) => setMaterials(enriched));
}
setSelectedMaterials((prev) => {
// Keep selections that still exist in the list
const existingIds = new Set(nextMaterials.map((m) => m.id));
const kept = prev.filter((id) => existingIds.has(id));
if (kept.length > 0) return kept;
// Otherwise default to selecting the first one
return nextMaterials[0]?.id ? [nextMaterials[0].id] : [];
});
} catch (error) {
console.error("获取素材失败:", error);
@@ -54,29 +88,58 @@ export const useMaterials = ({
} finally {
setIsFetching(false);
}
}, [setSelectedMaterials]);
const MAX_MATERIALS = 4;
const toggleMaterial = useCallback((id: string) => {
setSelectedMaterials((prev) => {
if (prev.includes(id)) {
// Cannot deselect the last one
if (prev.length <= 1) return prev;
return prev.filter((x) => x !== id);
}
if (prev.length >= MAX_MATERIALS) return prev;
return [...prev, id];
});
}, [setSelectedMaterials]);
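`toggleMaterial` enforces two invariants: the selection never becomes empty, and it never exceeds the cap. The same rule as a pure function, for clarity (a hypothetical sketch; `MAX` mirrors `MAX_MATERIALS`):

```typescript
const MAX = 4; // mirrors MAX_MATERIALS in the hook

// Deselecting is blocked while only one material remains;
// selecting is blocked once the cap is reached.
function toggleSelection(prev: string[], id: string): string[] {
  if (prev.includes(id)) {
    if (prev.length <= 1) return prev;
    return prev.filter((x) => x !== id);
  }
  if (prev.length >= MAX) return prev;
  return [...prev, id];
}
```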
const reorderMaterials = useCallback((activeId: string, overId: string) => {
setSelectedMaterials((prev) => {
const oldIndex = prev.indexOf(activeId);
const newIndex = prev.indexOf(overId);
if (oldIndex === -1 || newIndex === -1) return prev;
const next = [...prev];
next.splice(oldIndex, 1);
next.splice(newIndex, 0, activeId);
return next;
});
}, [setSelectedMaterials]);
const deleteMaterial = useCallback(async (materialId: string) => {
if (!confirm("确定要删除这个素材吗?")) return;
try {
await api.delete(`/api/materials/${materialId}`);
fetchMaterials();
if (selectedMaterials.includes(materialId)) {
setSelectedMaterials((prev) => {
const next = prev.filter((id) => id !== materialId);
return next.length > 0 ? next : [];
});
}
} catch (error) {
toast.error("删除失败: " + error);
}
}, [fetchMaterials, selectedMaterials, setSelectedMaterials]);
const handleUpload = useCallback(async (e: React.ChangeEvent<HTMLInputElement>) => {
const file = e.target.files?.[0];
if (!file) return;
const validTypes = ['.mp4', '.mov', '.avi', '.mkv', '.webm', '.flv', '.wmv', '.m4v', '.ts', '.mts'];
const ext = file.name.toLowerCase().slice(file.name.lastIndexOf('.'));
if (!validTypes.includes(ext)) {
setUploadError('仅支持 MP4、MOV、AVI、MKV、WebM 等常见视频格式');
return;
}
@@ -100,7 +163,37 @@ export const useMaterials = ({
setUploadProgress(100);
setIsUploading(false);
// After upload, refetch the list and auto-select the new material
const { data: res } = await api.get<ApiResponse<{ materials: Material[] }>>(
`/api/materials?t=${new Date().getTime()}`
);
const payload = unwrap(res);
const nextMaterials = payload.materials || [];
setMaterials(nextMaterials);
setLastMaterialCount(nextMaterials.length);
// Probe video durations in background
if (nextMaterials.length > 0) {
Promise.all(
nextMaterials.map(async (m) => {
const url = resolveMediaUrl(m.path);
if (!url) return m;
const dur = await probeVideoDuration(url);
return { ...m, duration_sec: dur };
})
).then((enriched) => setMaterials(enriched));
}
// Identify newly added materials and select only the new upload, to avoid accidentally triggering multi-material mode
const oldIds = new Set(materials.map((m) => m.id));
const newIds = nextMaterials.filter((m) => !oldIds.has(m.id)).map((m) => m.id);
if (newIds.length > 0) {
setSelectedMaterials([newIds[0]]);
} else if (nextMaterials[0]?.id) {
// Fallback: even if no new item is detected, keep a single-material selection on the most recent one
setSelectedMaterials([nextMaterials[0].id]);
}
} catch (err: unknown) {
console.error("Upload failed:", err);
setIsUploading(false);
@@ -110,7 +203,7 @@ export const useMaterials = ({
}
e.target.value = '';
}, [materials, setSelectedMaterials]);
return {
materials,
@@ -122,6 +215,8 @@ export const useMaterials = ({
uploadError,
setUploadError,
fetchMaterials,
toggleMaterial,
reorderMaterials,
deleteMaterial,
handleUpload,
};

View File

@@ -13,14 +13,12 @@ interface RefAudio {
}
interface UseRefAudiosOptions {
selectedRefAudio: RefAudio | null;
setSelectedRefAudio: React.Dispatch<React.SetStateAction<RefAudio | null>>;
setRefText: React.Dispatch<React.SetStateAction<string>>;
}
export const useRefAudios = ({
selectedRefAudio,
setSelectedRefAudio,
setRefText,
@@ -28,6 +26,7 @@ export const useRefAudios = ({
const [refAudios, setRefAudios] = useState<RefAudio[]>([]);
const [isUploadingRef, setIsUploadingRef] = useState(false);
const [uploadRefError, setUploadRefError] = useState<string | null>(null);
const [retranscribingId, setRetranscribingId] = useState<string | null>(null);
const fetchRefAudios = useCallback(async () => {
try {
@@ -42,15 +41,12 @@ export const useRefAudios = ({
}, []);
const uploadRefAudio = useCallback(async (file: File) => {
setIsUploadingRef(true);
setUploadRefError(null);
try {
const formData = new FormData();
formData.append('file', file);
const { data: res } = await api.post<ApiResponse<RefAudio>>('/api/ref-audios', formData, {
headers: { 'Content-Type': 'multipart/form-data' },
@@ -68,7 +64,7 @@ export const useRefAudios = ({
const errorMsg = axiosErr.response?.data?.message || axiosErr.message || String(err);
setUploadRefError(`上传失败: ${errorMsg}`);
}
}, [fetchRefAudios, setRefText, setSelectedRefAudio]);
const deleteRefAudio = useCallback(async (audioId: string) => {
if (!confirm("确定要删除这个参考音频吗?")) return;
@@ -84,6 +80,28 @@ export const useRefAudios = ({
}
}, [fetchRefAudios, selectedRefAudio, setRefText, setSelectedRefAudio]);
const retranscribeRefAudio = useCallback(async (audioId: string) => {
setRetranscribingId(audioId);
try {
const { data: res } = await api.post<ApiResponse<{ ref_text: string }>>(
`/api/ref-audios/${encodeURIComponent(audioId)}/retranscribe`
);
const payload = unwrap(res);
toast.success("识别完成");
// Refresh the list and the currently selected entry
await fetchRefAudios();
if (selectedRefAudio?.id === audioId) {
setRefText(payload.ref_text);
}
} catch (err: unknown) {
const axiosErr = err as { response?: { data?: { message?: string } }; message?: string };
const errorMsg = axiosErr.response?.data?.message || axiosErr.message || String(err);
toast.error(`识别失败: ${errorMsg}`);
} finally {
setRetranscribingId(null);
}
}, [fetchRefAudios, selectedRefAudio, setRefText]);
return {
refAudios,
isUploadingRef,
@@ -92,5 +110,7 @@ export const useRefAudios = ({
fetchRefAudios,
uploadRefAudio,
deleteRefAudio,
retranscribeRefAudio,
retranscribingId,
};
};

View File

@@ -0,0 +1,51 @@
import { useState, useEffect, useRef } from "react";
export interface SavedScript {
id: string;
name: string;
content: string;
savedAt: number;
}
export function useSavedScripts(storageKey: string) {
const lsKey = `vigent_${storageKey}_savedScripts`;
const lsKeyRef = useRef(lsKey);
lsKeyRef.current = lsKey;
const [savedScripts, setSavedScripts] = useState<SavedScript[]>([]);
// Re-read from localStorage whenever lsKey changes (e.g. guest → userId)
useEffect(() => {
try {
const raw = localStorage.getItem(lsKey);
setSavedScripts(raw ? JSON.parse(raw) : []);
} catch {
setSavedScripts([]);
}
}, [lsKey]);
const saveScript = (content: string) => {
const name = content.slice(0, 15).replace(/\n/g, " ") || "未命名";
const entry: SavedScript = {
id: Date.now().toString(36) + Math.random().toString(36).slice(2, 6),
name,
content,
savedAt: Date.now(),
};
setSavedScripts((prev) => {
const next = [entry, ...prev];
localStorage.setItem(lsKeyRef.current, JSON.stringify(next));
return next;
});
};
const deleteScript = (id: string) => {
setSavedScripts((prev) => {
const next = prev.filter((s) => s.id !== id);
localStorage.setItem(lsKeyRef.current, JSON.stringify(next));
return next;
});
};
return { savedScripts, saveScript, deleteScript };
}

View File

@@ -0,0 +1,256 @@
import { useCallback, useEffect, useRef, useState } from "react";
import type { Material } from "@/shared/types/material";
export interface TimelineSegment {
id: string;
materialId: string;
materialName: string;
start: number;
end: number;
sourceStart: number;
sourceEnd: number;
color: string;
}
export interface CustomAssignment {
material_path: string;
start: number;
end: number;
source_start: number;
source_end?: number;
}
const COLORS = ["#8b5cf6", "#ec4899", "#06b6d4", "#f59e0b", "#10b981", "#f97316"];
/** Serializable subset for localStorage */
interface SegmentSnapshot {
materialId: string;
start: number;
end: number;
sourceStart: number;
sourceEnd: number;
}
/** Get effective duration of a segment (clipped range or full material duration) */
function getEffectiveDuration(
seg: { sourceStart: number; sourceEnd: number; materialId: string },
mats: Material[]
): number {
const mat = mats.find((m) => m.id === seg.materialId);
const matDur = mat?.duration_sec ?? 0;
if (seg.sourceEnd > seg.sourceStart) return seg.sourceEnd - seg.sourceStart;
if (seg.sourceStart > 0) return Math.max(matDur - seg.sourceStart, 0);
return matDur;
}
/**
* Recalculate segment start/end positions based on effective durations.
* - Segments placed sequentially by effective duration
* - Segments exceeding audioDuration keep their positions (overflow, start >= duration)
* - Last visible segment is capped/extended to exactly audioDuration (loop fill)
*/
function recalcPositions(
segs: TimelineSegment[],
mats: Material[],
duration: number
): TimelineSegment[] {
if (segs.length === 0 || duration <= 0) return segs;
const fallbackDur = duration / segs.length;
let cursor = 0;
const result = segs.map((seg) => {
const effDur = getEffectiveDuration(seg, mats);
const dur = effDur > 0 ? effDur : fallbackDur;
const newSeg = { ...seg, start: cursor, end: cursor + dur };
cursor += dur;
return newSeg;
});
// Find last segment that starts before audioDuration
let lastVisibleIdx = -1;
for (let i = result.length - 1; i >= 0; i--) {
if (result[i].start < duration) {
lastVisibleIdx = i;
break;
}
}
// Cap/extend last visible segment to exactly audioDuration
if (lastVisibleIdx >= 0) {
result[lastVisibleIdx] = { ...result[lastVisibleIdx], end: duration };
}
return result;
}
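A worked example of the placement rule above, with the types pared down to plain numbers (a sketch under simplified assumptions, not the hook's actual `TimelineSegment`/`Material` types):

```typescript
interface PlacedSeg { start: number; end: number }

// Place segments sequentially by duration, then cap/extend the last segment
// that starts before audioDuration so visible segments exactly fill the audio.
// Segments starting at or beyond audioDuration keep their overflow positions.
function place(durations: number[], audioDuration: number): PlacedSeg[] {
  let cursor = 0;
  const segs = durations.map((dur) => {
    const seg = { start: cursor, end: cursor + dur };
    cursor += dur;
    return seg;
  });
  for (let i = segs.length - 1; i >= 0; i--) {
    if (segs[i].start < audioDuration) {
      segs[i] = { ...segs[i], end: audioDuration };
      break;
    }
  }
  return segs;
}
```

With a 10 s voiceover, `place([3, 5], 10)` extends the second clip to fill the tail, while `place([12, 4], 10)` caps the first clip at 10 s and leaves the second as overflow.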
interface UseTimelineEditorOptions {
audioDuration: number;
materials: Material[];
selectedMaterials: string[];
storageKey?: string;
}
export const useTimelineEditor = ({
audioDuration,
materials,
selectedMaterials,
storageKey,
}: UseTimelineEditorOptions) => {
const [segments, setSegments] = useState<TimelineSegment[]>([]);
const prevKey = useRef("");
const restoredRef = useRef(false);
// Refs for stable callbacks (avoid recreating on every materials/duration change)
const materialsRef = useRef(materials);
const audioDurationRef = useRef(audioDuration);
useEffect(() => {
materialsRef.current = materials;
}, [materials]);
useEffect(() => {
audioDurationRef.current = audioDuration;
}, [audioDuration]);
// Build a durationsKey so segments re-init when material durations become available
const durationsKey = selectedMaterials
.map((id) => materials.find((m) => m.id === id)?.duration_sec ?? 0)
.join(",");
// Build a cache key from materials + duration
const cacheKey = `${selectedMaterials.join(",")}_${audioDuration.toFixed(1)}`;
const lsKey = storageKey ? `vigent_${storageKey}_timeline` : null;
const initSegments = useCallback(() => {
if (selectedMaterials.length === 0 || audioDuration <= 0) {
setSegments([]);
return;
}
// Try restore from localStorage
if (lsKey) {
try {
const raw = localStorage.getItem(lsKey);
if (raw) {
const saved = JSON.parse(raw) as { key: string; segments: SegmentSnapshot[] };
if (saved.key === cacheKey && saved.segments.length === selectedMaterials.length) {
const savedIds = new Set(saved.segments.map((s) => s.materialId));
const allMatch = selectedMaterials.every((id) => savedIds.has(id));
if (allMatch) {
const restored: TimelineSegment[] = saved.segments.map((s, i) => {
const mat = materials.find((m) => m.id === s.materialId);
return {
id: `seg-${i}-${Date.now()}`,
materialId: s.materialId,
materialName: mat?.scene || mat?.name || s.materialId,
start: 0,
end: 0,
sourceStart: s.sourceStart,
sourceEnd: s.sourceEnd,
color: COLORS[i % COLORS.length],
};
});
setSegments(recalcPositions(restored, materials, audioDuration));
restoredRef.current = true;
return;
}
}
}
} catch {
// ignore parse errors
}
}
// Create fresh segments — positions derived by recalcPositions
const newSegments: TimelineSegment[] = selectedMaterials.map((matId, i) => {
const mat = materials.find((m) => m.id === matId);
return {
id: `seg-${i}-${Date.now()}`,
materialId: matId,
materialName: mat?.scene || mat?.name || matId,
start: 0,
end: 0,
sourceStart: 0,
sourceEnd: 0,
color: COLORS[i % COLORS.length],
};
});
setSegments(recalcPositions(newSegments, materials, audioDuration));
}, [audioDuration, materials, selectedMaterials, lsKey, cacheKey]);
// Auto-init when selectedMaterials, audioDuration, or material durations change
useEffect(() => {
const key = `${selectedMaterials.join(",")}_${audioDuration}_${durationsKey}`;
if (key !== prevKey.current) {
prevKey.current = key;
initSegments();
}
}, [selectedMaterials, audioDuration, durationsKey, initSegments]);
// Persist segments to localStorage on change (debounced)
useEffect(() => {
if (!lsKey || segments.length === 0) return;
const timeout = setTimeout(() => {
const snapshots: SegmentSnapshot[] = segments.map((s) => ({
materialId: s.materialId,
start: s.start,
end: s.end,
sourceStart: s.sourceStart,
sourceEnd: s.sourceEnd,
}));
localStorage.setItem(lsKey, JSON.stringify({ key: cacheKey, segments: snapshots }));
}, 300);
return () => clearTimeout(timeout);
}, [segments, lsKey, cacheKey]);
const reorderSegments = useCallback(
(fromIdx: number, toIdx: number) => {
setSegments((prev) => {
if (fromIdx < 0 || toIdx < 0 || fromIdx >= prev.length || toIdx >= prev.length) return prev;
if (fromIdx === toIdx) return prev;
const next = [...prev];
// Move the segment: remove from old position, insert at new position
const [moved] = next.splice(fromIdx, 1);
next.splice(toIdx, 0, moved);
return recalcPositions(next, materialsRef.current, audioDurationRef.current);
});
},
[]
);
const setSourceRange = useCallback(
(id: string, sourceStart: number, sourceEnd: number) => {
setSegments((prev) => {
const updated = prev.map((s) => (s.id === id ? { ...s, sourceStart, sourceEnd } : s));
return recalcPositions(updated, materialsRef.current, audioDurationRef.current);
});
},
[]
);
const toCustomAssignments = useCallback((): CustomAssignment[] => {
const duration = audioDurationRef.current;
return segments
.filter((seg) => seg.start < duration)
.map((seg) => {
const mat = materialsRef.current.find((m) => m.id === seg.materialId);
return {
material_path: mat?.path || seg.materialId,
start: seg.start,
end: seg.end,
source_start: seg.sourceStart,
source_end: seg.sourceEnd > seg.sourceStart ? seg.sourceEnd : undefined,
};
});
}, [segments]);
return {
segments,
initSegments,
reorderSegments,
setSourceRange,
toCustomAssignments,
};
};

View File

@@ -0,0 +1,293 @@
import { useCallback, useEffect, useRef, useState } from "react";
import { X, Play, Pause } from "lucide-react";
import type { TimelineSegment } from "@/features/home/model/useTimelineEditor";
interface ClipTrimmerProps {
isOpen: boolean;
segment: TimelineSegment | null;
materialUrl: string | null;
onConfirm: (sourceStart: number, sourceEnd: number) => void;
onClose: () => void;
}
function formatSec(sec: number): string {
const m = Math.floor(sec / 60);
const s = sec % 60;
return `${String(m).padStart(2, "0")}:${s.toFixed(1).padStart(4, "0")}`;
}
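`formatSec` behaves as below (copied verbatim so the check is self-contained). One quirk worth noting: because the seconds remainder is rounded independently of the minutes, values within 0.05 s of a full minute display as `60.0` rather than carrying into the minute:

```typescript
// Verbatim copy of formatSec for a standalone check.
function formatSec(sec: number): string {
  const m = Math.floor(sec / 60);
  const s = sec % 60;
  return `${String(m).padStart(2, "0")}:${s.toFixed(1).padStart(4, "0")}`;
}
```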
export function ClipTrimmer({
isOpen,
segment,
materialUrl,
onConfirm,
onClose,
}: ClipTrimmerProps) {
const videoRef = useRef<HTMLVideoElement>(null);
const trackRef = useRef<HTMLDivElement>(null);
const [duration, setDuration] = useState(0);
const [sourceStart, setSourceStart] = useState(0);
const [sourceEnd, setSourceEnd] = useState(0);
const [currentTime, setCurrentTime] = useState(0);
const [isPlaying, setIsPlaying] = useState(false);
const [dragging, setDragging] = useState<"start" | "end" | null>(null);
const animRef = useRef<number>(0);
// Reset state when segment changes
useEffect(() => {
if (segment && isOpen) {
setSourceStart(segment.sourceStart);
setSourceEnd(segment.sourceEnd);
setCurrentTime(segment.sourceStart);
setIsPlaying(false);
}
}, [segment, isOpen]);
// Track currentTime during playback
useEffect(() => {
if (!isPlaying || !videoRef.current) return;
const tick = () => {
if (!videoRef.current) return;
const t = videoRef.current.currentTime;
const end = sourceEnd || duration;
if (t >= end) {
videoRef.current.pause();
videoRef.current.currentTime = sourceStart;
setCurrentTime(sourceStart);
setIsPlaying(false);
return;
}
setCurrentTime(t);
animRef.current = requestAnimationFrame(tick);
};
animRef.current = requestAnimationFrame(tick);
return () => cancelAnimationFrame(animRef.current);
}, [isPlaying, sourceStart, sourceEnd, duration]);
// Seek video when not playing and currentTime changes
useEffect(() => {
if (videoRef.current && !isPlaying) {
videoRef.current.currentTime = currentTime;
}
}, [currentTime, isPlaying]);
const handleLoadedMetadata = useCallback(() => {
if (videoRef.current) {
const dur = videoRef.current.duration;
setDuration(dur);
if (sourceEnd === 0) {
setSourceEnd(dur);
}
}
}, [sourceEnd]);
const togglePlay = useCallback(() => {
if (!videoRef.current || duration === 0) return;
if (isPlaying) {
videoRef.current.pause();
setIsPlaying(false);
} else {
const end = sourceEnd || duration;
if (videoRef.current.currentTime >= end || videoRef.current.currentTime < sourceStart) {
videoRef.current.currentTime = sourceStart;
setCurrentTime(sourceStart);
}
videoRef.current.play().catch(() => {});
setIsPlaying(true);
}
}, [isPlaying, sourceStart, sourceEnd, duration]);
// --- Dual-handle slider logic ---
const getPositionFromEvent = useCallback(
(clientX: number) => {
if (!trackRef.current || duration === 0) return 0;
const rect = trackRef.current.getBoundingClientRect();
const ratio = Math.max(0, Math.min(1, (clientX - rect.left) / rect.width));
return ratio * duration;
},
[duration]
);
const handleThumbPointerDown = useCallback(
(which: "start" | "end", e: React.PointerEvent) => {
e.preventDefault();
e.stopPropagation();
setDragging(which);
(e.target as HTMLElement).setPointerCapture(e.pointerId);
},
[]
);
const handleTrackPointerMove = useCallback(
(e: React.PointerEvent) => {
if (!dragging) return;
const pos = getPositionFromEvent(e.clientX);
const minGap = 0.5;
if (dragging === "start") {
const clamped = Math.max(0, Math.min(pos, (sourceEnd || duration) - minGap));
setSourceStart(clamped);
setCurrentTime(clamped);
} else {
const clamped = Math.min(duration, Math.max(pos, sourceStart + minGap));
setSourceEnd(clamped);
}
},
[dragging, getPositionFromEvent, sourceStart, sourceEnd, duration]
);
const handleTrackPointerUp = useCallback(() => {
setDragging(null);
}, []);
const handleConfirm = () => {
onConfirm(sourceStart, sourceEnd >= duration ? 0 : sourceEnd);
};
if (!isOpen || !segment) return null;
const assignedDur = segment.end - segment.start;
const effectiveEnd = sourceEnd || duration;
const clipDur = effectiveEnd - sourceStart;
const startPct = duration > 0 ? (sourceStart / duration) * 100 : 0;
const endPct = duration > 0 ? (effectiveEnd / duration) * 100 : 100;
const playheadPct = duration > 0 ? (currentTime / duration) * 100 : 0;
return (
<div className="fixed inset-0 z-50 flex items-center justify-center bg-black/60 backdrop-blur-sm" onClick={onClose}>
<div
className="bg-gray-900 border border-white/10 rounded-2xl w-full max-w-lg mx-4 overflow-hidden"
onClick={(e) => e.stopPropagation()}
>
{/* Header */}
<div className="flex items-center justify-between px-5 py-3 border-b border-white/10">
<h3 className="text-white font-semibold text-sm">
Trim - {segment.materialName}
</h3>
<button onClick={onClose} className="text-gray-400 hover:text-white">
<X className="h-4 w-4" />
</button>
</div>
{/* Video preview */}
<div className="px-5 pt-4">
<div className="relative bg-black rounded-lg overflow-hidden aspect-video group">
{materialUrl ? (
<video
ref={videoRef}
src={materialUrl}
className="w-full h-full object-contain"
onLoadedMetadata={handleLoadedMetadata}
onEnded={() => setIsPlaying(false)}
preload="auto"
muted
/>
) : (
<div className="flex items-center justify-center h-full text-gray-500 text-sm">
Material unavailable
</div>
)}
{/* Play/Pause overlay */}
{materialUrl && (
<button
onClick={togglePlay}
className="absolute inset-0 flex items-center justify-center bg-black/0 hover:bg-black/30 transition-colors"
>
<div className={`p-3 rounded-full bg-black/60 text-white transition-opacity ${isPlaying ? "opacity-0 group-hover:opacity-100" : "opacity-100"}`}>
{isPlaying ? <Pause className="h-6 w-6" /> : <Play className="h-6 w-6" />}
</div>
</button>
)}
<div className="absolute bottom-2 right-2 bg-black/70 text-white text-[10px] px-2 py-0.5 rounded pointer-events-none">
{formatSec(currentTime)}
</div>
</div>
</div>
{/* Dual-handle range slider */}
<div className="px-5 py-4 space-y-3">
<div className="text-xs text-gray-400 flex justify-between">
<span>Total: {duration > 0 ? formatSec(duration) : "Loading..."}</span>
</div>
{/* Custom range track */}
<div
ref={trackRef}
className="relative h-8 cursor-pointer select-none touch-none"
onPointerMove={handleTrackPointerMove}
onPointerUp={handleTrackPointerUp}
onPointerLeave={handleTrackPointerUp}
>
{/* Background track */}
<div className="absolute top-1/2 -translate-y-1/2 left-0 right-0 h-2 bg-white/10 rounded-full" />
{/* Selected range */}
<div
className="absolute top-1/2 -translate-y-1/2 h-2 rounded-full"
style={{
left: `${startPct}%`,
width: `${endPct - startPct}%`,
backgroundColor: segment.color + "88",
}}
/>
{/* Playhead indicator */}
{duration > 0 && (
<div
className="absolute top-1/2 -translate-y-1/2 w-0.5 h-4 bg-white/60 rounded-full pointer-events-none"
style={{ left: `${playheadPct}%` }}
/>
)}
{/* Start thumb */}
<div
onPointerDown={(e) => handleThumbPointerDown("start", e)}
className="absolute top-1/2 -translate-y-1/2 -translate-x-1/2 w-4 h-4 rounded-full bg-purple-500 border-2 border-white shadow-lg cursor-grab active:cursor-grabbing hover:scale-110 transition-transform z-10"
style={{ left: `${startPct}%` }}
title={`Start: ${formatSec(sourceStart)}`}
/>
{/* End thumb */}
<div
onPointerDown={(e) => handleThumbPointerDown("end", e)}
className="absolute top-1/2 -translate-y-1/2 -translate-x-1/2 w-4 h-4 rounded-full bg-pink-500 border-2 border-white shadow-lg cursor-grab active:cursor-grabbing hover:scale-110 transition-transform z-10"
style={{ left: `${endPct}%` }}
title={`End: ${formatSec(effectiveEnd)}`}
/>
</div>
{/* Time labels */}
<div className="flex justify-between text-xs text-gray-400">
<span className="text-purple-400">{formatSec(sourceStart)}</span>
<span className="text-pink-400">{formatSec(effectiveEnd)}</span>
</div>
{/* Info */}
<div className="text-[11px] text-gray-500 flex items-center gap-2 flex-wrap">
<span>Clip: {clipDur.toFixed(1)}s</span>
<span className="text-gray-600">|</span>
<span>Assigned: {assignedDur.toFixed(1)}s</span>
{clipDur < assignedDur && <span className="text-amber-500">(shorter than assigned)</span>}
{clipDur > assignedDur && <span className="text-cyan-500">(longer than assigned)</span>}
</div>
</div>
{/* Actions */}
<div className="flex justify-end gap-2 px-5 pb-4">
<button
onClick={onClose}
className="px-4 py-1.5 text-xs bg-white/10 hover:bg-white/20 rounded-lg text-gray-300 transition-colors"
>
Cancel
</button>
<button
onClick={handleConfirm}
className="px-4 py-1.5 text-xs bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white rounded-lg transition-colors"
>
Confirm
</button>
</div>
</div>
</div>
);
}

View File

@@ -86,6 +86,8 @@ export function FloatingStylePreview({
const previewScale = windowWidth / previewBaseWidth;
const previewHeight = previewBaseHeight * previewScale;
const widthScale = Math.min(1, previewBaseWidth / 1080);
const responsiveScale = Math.max(0.55, widthScale);
const activeSubtitleStyle = subtitleStyles.find((s) => s.id === selectedSubtitleStyleId)
|| subtitleStyles.find((s) => s.is_default)
@@ -102,8 +104,8 @@ export function FloatingStylePreview({
const subtitleHighlightColor = activeSubtitleStyle?.highlight_color || "#FFE600";
const subtitleNormalColor = activeSubtitleStyle?.normal_color || "#FFFFFF";
const subtitleStrokeColor = activeSubtitleStyle?.stroke_color || "#000000";
const subtitleStrokeSize = activeSubtitleStyle?.stroke_size ?? 3;
const subtitleLetterSpacing = activeSubtitleStyle?.letter_spacing ?? 2;
const subtitleStrokeSize = Math.max(1, Math.round((activeSubtitleStyle?.stroke_size ?? 3) * responsiveScale));
const subtitleLetterSpacing = Math.max(0, (activeSubtitleStyle?.letter_spacing ?? 2) * responsiveScale);
const subtitleFontFamilyName = `SubtitlePreview-${activeSubtitleStyle?.id || "default"}`;
const subtitleFontUrl = activeSubtitleStyle?.font_file
? resolveAssetUrl(`fonts/${activeSubtitleStyle.font_file}`)
@@ -111,14 +113,19 @@ export function FloatingStylePreview({
const titleColor = activeTitleStyle?.color || "#FFFFFF";
const titleStrokeColor = activeTitleStyle?.stroke_color || "#000000";
const titleStrokeSize = activeTitleStyle?.stroke_size ?? 8;
const titleLetterSpacing = activeTitleStyle?.letter_spacing ?? 4;
const titleStrokeSize = Math.max(1, Math.round((activeTitleStyle?.stroke_size ?? 8) * responsiveScale));
const titleLetterSpacing = Math.max(0, (activeTitleStyle?.letter_spacing ?? 4) * responsiveScale);
const titleFontWeight = activeTitleStyle?.font_weight ?? 900;
const titleFontFamilyName = `TitlePreview-${activeTitleStyle?.id || "default"}`;
const titleFontUrl = activeTitleStyle?.font_file
? resolveAssetUrl(`fonts/${activeTitleStyle.font_file}`)
: null;
const scaledTitleFontSize = Math.max(36, Math.round(titleFontSize * responsiveScale));
const scaledSubtitleFontSize = Math.max(28, Math.round(subtitleFontSize * responsiveScale));
const scaledTitleTopMargin = Math.max(0, Math.round(titleTopMargin * responsiveScale));
const scaledSubtitleBottomMargin = Math.max(0, Math.round(subtitleBottomMargin * responsiveScale));
const content = (
<div
style={{
@@ -172,11 +179,11 @@ export function FloatingStylePreview({
className="w-full text-center"
style={{
position: 'absolute',
top: `${titleTopMargin}px`,
top: `${scaledTitleTopMargin}px`,
left: 0,
right: 0,
color: titleColor,
fontSize: `${titleFontSize}px`,
fontSize: `${scaledTitleFontSize}px`,
fontWeight: titleFontWeight,
fontFamily: titleFontUrl
? `'${titleFontFamilyName}', "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "Noto Sans SC", sans-serif`
@@ -184,6 +191,10 @@ export function FloatingStylePreview({
textShadow: buildTextShadow(titleStrokeColor, titleStrokeSize),
letterSpacing: `${titleLetterSpacing}px`,
lineHeight: 1.2,
whiteSpace: 'normal',
wordBreak: 'break-word',
overflowWrap: 'anywhere',
boxSizing: 'border-box',
opacity: videoTitle.trim() ? 1 : 0.7,
padding: '0 5%',
}}
@@ -195,16 +206,20 @@ export function FloatingStylePreview({
className="w-full text-center"
style={{
position: 'absolute',
bottom: `${subtitleBottomMargin}px`,
bottom: `${scaledSubtitleBottomMargin}px`,
left: 0,
right: 0,
fontSize: `${subtitleFontSize}px`,
fontSize: `${scaledSubtitleFontSize}px`,
fontFamily: subtitleFontUrl
? `'${subtitleFontFamilyName}', "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "Noto Sans SC", sans-serif`
: '"PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "Noto Sans SC", sans-serif',
textShadow: buildTextShadow(subtitleStrokeColor, subtitleStrokeSize),
letterSpacing: `${subtitleLetterSpacing}px`,
lineHeight: 1.35,
whiteSpace: 'normal',
wordBreak: 'break-word',
overflowWrap: 'anywhere',
boxSizing: 'border-box',
padding: '0 6%',
}}
>

View File

@@ -4,6 +4,7 @@ interface GenerateActionBarProps {
isGenerating: boolean;
progress: number;
disabled: boolean;
materialCount?: number;
onGenerate: () => void;
}
@@ -11,43 +12,51 @@ export function GenerateActionBar({
isGenerating,
progress,
disabled,
materialCount = 1,
onGenerate,
}: GenerateActionBarProps) {
return (
<button
onClick={onGenerate}
disabled={disabled}
className={`w-full py-4 rounded-xl font-bold text-lg transition-all ${disabled
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
}`}
>
{isGenerating ? (
<span className="flex items-center justify-center gap-3">
<svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
<circle
className="opacity-25"
cx="12"
cy="12"
r="10"
stroke="currentColor"
strokeWidth="4"
fill="none"
/>
<path
className="opacity-75"
fill="currentColor"
d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z"
/>
</svg>
Generating... {progress}%
</span>
) : (
<span className="flex items-center justify-center gap-2">
<Rocket className="h-5 w-5" />
Generate Video
</span>
<div>
<button
onClick={onGenerate}
disabled={disabled}
className={`w-full py-4 rounded-xl font-bold text-lg transition-all ${disabled
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white shadow-lg hover:shadow-purple-500/25"
}`}
>
{isGenerating ? (
<span className="flex items-center justify-center gap-3">
<svg className="animate-spin h-5 w-5" viewBox="0 0 24 24">
<circle
className="opacity-25"
cx="12"
cy="12"
r="10"
stroke="currentColor"
strokeWidth="4"
fill="none"
/>
<path
className="opacity-75"
fill="currentColor"
d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4z"
/>
</svg>
Generating... {progress}%
</span>
) : (
<span className="flex items-center justify-center gap-2">
<Rocket className="h-5 w-5" />
Generate Video
</span>
)}
</button>
{!isGenerating && materialCount >= 2 && (
<p className="text-xs text-gray-400 text-center mt-1.5">
({materialCount} clips will be stitched in order)
</p>
)}
</button>
</div>
);
}

View File

@@ -0,0 +1,293 @@
import { useState, useRef, useCallback, useEffect } from "react";
import { Play, Pause, Pencil, Trash2, Check, X, RefreshCw, Mic, ChevronDown } from "lucide-react";
import type { GeneratedAudio } from "@/features/home/model/useGeneratedAudios";
interface AudioTask {
status: string;
progress?: number;
message?: string;
}
interface GeneratedAudiosPanelProps {
generatedAudios: GeneratedAudio[];
selectedAudioId: string | null;
isGeneratingAudio: boolean;
audioTask: AudioTask | null;
onGenerateAudio: () => void;
onRefresh: () => void;
onSelectAudio: (audio: GeneratedAudio) => void;
onDeleteAudio: (id: string) => void;
onRenameAudio: (id: string, newName: string) => void;
hasText: boolean;
missingRefAudio?: boolean;
speed: number;
onSpeedChange: (speed: number) => void;
ttsMode: string;
}
export function GeneratedAudiosPanel({
generatedAudios,
selectedAudioId,
isGeneratingAudio,
audioTask,
onGenerateAudio,
onRefresh,
onSelectAudio,
onDeleteAudio,
onRenameAudio,
hasText,
missingRefAudio = false,
speed,
onSpeedChange,
ttsMode,
}: GeneratedAudiosPanelProps) {
const [editingId, setEditingId] = useState<string | null>(null);
const [editName, setEditName] = useState("");
const [playingId, setPlayingId] = useState<string | null>(null);
const [speedOpen, setSpeedOpen] = useState(false);
const audioRef = useRef<HTMLAudioElement | null>(null);
const speedRef = useRef<HTMLDivElement>(null);
const stopPlaying = useCallback(() => {
if (audioRef.current) {
audioRef.current.pause();
audioRef.current.currentTime = 0;
audioRef.current = null;
}
setPlayingId(null);
}, []);
// Cleanup on unmount
useEffect(() => {
return () => {
if (audioRef.current) {
audioRef.current.pause();
audioRef.current = null;
}
};
}, []);
// Close speed dropdown on click outside
useEffect(() => {
const handler = (e: MouseEvent) => {
if (speedRef.current && !speedRef.current.contains(e.target as Node)) {
setSpeedOpen(false);
}
};
if (speedOpen) document.addEventListener("mousedown", handler);
return () => document.removeEventListener("mousedown", handler);
}, [speedOpen]);
const togglePlay = (audio: GeneratedAudio, e: React.MouseEvent) => {
e.stopPropagation();
if (playingId === audio.id) {
stopPlaying();
return;
}
stopPlaying();
const player = new Audio(audio.path);
player.onended = () => setPlayingId(null);
player.play().catch(() => {});
audioRef.current = player;
setPlayingId(audio.id);
};
const startEditing = (audio: GeneratedAudio, e: React.MouseEvent) => {
e.stopPropagation();
setEditingId(audio.id);
setEditName(audio.name);
};
const saveEditing = (audioId: string, e: React.MouseEvent) => {
e.stopPropagation();
if (!editName.trim()) return;
onRenameAudio(audioId, editName.trim());
setEditingId(null);
setEditName("");
};
const cancelEditing = (e: React.MouseEvent) => {
e.stopPropagation();
setEditingId(null);
setEditName("");
};
const canGenerate = hasText && !missingRefAudio;
const speedOptions = [
{ value: 0.8, label: "Slower" },
{ value: 0.9, label: "Slightly slow" },
{ value: 1.0, label: "Normal" },
{ value: 1.1, label: "Slightly fast" },
{ value: 1.2, label: "Faster" },
] as const;
const currentSpeedLabel = speedOptions.find((o) => o.value === speed)?.label ?? "Normal";
return (
<div className="bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm relative z-10">
<div className="flex justify-between items-center gap-2 mb-4">
<h2 className="text-base sm:text-lg font-semibold text-white flex items-center gap-2 whitespace-nowrap">
<Mic className="h-4 w-4 text-purple-400" />
Voiceovers
</h2>
<div className="flex gap-1.5">
{/* Speed dropdown (voice-clone mode only) */}
{ttsMode === "voiceclone" && (
<div ref={speedRef} className="relative">
<button
onClick={() => setSpeedOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
>
Speed: {currentSpeedLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${speedOpen ? "rotate-180" : ""}`} />
</button>
{speedOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[80px]">
{speedOptions.map((opt) => (
<button
key={opt.value}
onClick={() => { onSpeedChange(opt.value); setSpeedOpen(false); }}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
speed === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
)}
<button
onClick={onGenerateAudio}
disabled={isGeneratingAudio || !canGenerate}
title={missingRefAudio ? "Select a reference audio first" : !hasText ? "Enter a script first" : ""}
className={`px-2 py-1 text-xs rounded transition-all whitespace-nowrap flex items-center gap-1 ${
isGeneratingAudio || !canGenerate
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-purple-600 to-pink-600 hover:from-purple-700 hover:to-pink-700 text-white"
}`}
>
<Mic className="h-3.5 w-3.5" />
Generate
</button>
<button
onClick={onRefresh}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1"
>
<RefreshCw className="h-3.5 w-3.5" />
Refresh
</button>
</div>
</div>
{/* Missing reference audio notice */}
{missingRefAudio && (
<div className="mb-3 px-3 py-2 bg-yellow-500/10 border border-yellow-500/30 rounded-lg text-yellow-300 text-xs">
Select a reference audio above before generating
</div>
)}
{/* Generation progress */}
{isGeneratingAudio && audioTask && (
<div className="mb-4 p-3 bg-purple-500/10 rounded-xl border border-purple-500/30">
<div className="flex justify-between text-sm text-purple-300 mb-2">
<span>{audioTask.message || "Generating..."}</span>
<span>{audioTask.progress || 0}%</span>
</div>
<div className="h-2 bg-black/30 rounded-full overflow-hidden">
<div
className="h-full bg-gradient-to-r from-purple-500 to-pink-500 transition-all duration-300"
style={{ width: `${audioTask.progress || 0}%` }}
/>
</div>
</div>
)}
{/* Voiceover list */}
{generatedAudios.length === 0 ? (
<div className="text-center py-6 text-gray-400">
<p className="text-sm"></p>
<p className="text-xs mt-1 text-gray-500"></p>
</div>
) : (
<div className="space-y-2 max-h-48 sm:max-h-56 overflow-y-auto hide-scrollbar">
{generatedAudios.map((audio) => {
const isSelected = selectedAudioId === audio.id;
return (
<div
key={audio.id}
onClick={() => onSelectAudio(audio)}
className={`p-3 rounded-lg border transition-all cursor-pointer flex items-center justify-between group ${
isSelected
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{editingId === audio.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
value={editName}
onChange={(e) => setEditName(e.target.value)}
className="flex-1 bg-black/40 border border-white/20 rounded-md px-2 py-1 text-xs text-white"
autoFocus
onKeyDown={(e) => {
if (e.key === "Enter") saveEditing(audio.id, e as unknown as React.MouseEvent);
if (e.key === "Escape") cancelEditing(e as unknown as React.MouseEvent);
}}
/>
<button onClick={(e) => saveEditing(audio.id, e)} className="p-1 text-green-400 hover:text-green-300" title="Save">
<Check className="h-4 w-4" />
</button>
<button onClick={cancelEditing} className="p-1 text-gray-400 hover:text-white" title="Cancel">
<X className="h-4 w-4" />
</button>
</div>
) : (
<>
<div className="min-w-0 flex-1">
<div className="text-white text-sm truncate">{audio.name}</div>
<div className="text-gray-400 text-xs">{audio.duration_sec.toFixed(1)}s</div>
</div>
<div className="flex items-center gap-1 pl-2 opacity-0 group-hover:opacity-100 transition-opacity">
<button
onClick={(e) => togglePlay(audio, e)}
className="p-1 text-gray-500 hover:text-purple-400 transition-colors"
title={playingId === audio.id ? "Pause" : "Play"}
>
{playingId === audio.id ? (
<Pause className="h-3.5 w-3.5" />
) : (
<Play className="h-3.5 w-3.5" />
)}
</button>
<button
onClick={(e) => startEditing(audio, e)}
className="p-1 text-gray-500 hover:text-white transition-colors"
title="重命名"
>
<Pencil className="h-3.5 w-3.5" />
</button>
<button
onClick={(e) => {
e.stopPropagation();
onDeleteAudio(audio.id);
}}
className="p-1 text-gray-500 hover:text-red-400 transition-colors"
title="删除"
>
<Trash2 className="h-3.5 w-3.5" />
</button>
</div>
</>
)}
</div>
);
})}
</div>
)}
</div>
);
}

View File

@@ -1,20 +1,24 @@
"use client";
import { useEffect } from "react";
import { useEffect, useMemo } from "react";
import { useRouter } from "next/navigation";
import VideoPreviewModal from "@/components/VideoPreviewModal";
import ScriptExtractionModal from "@/components/ScriptExtractionModal";
import ScriptExtractionModal from "./ScriptExtractionModal";
import { useHomeController } from "@/features/home/model/useHomeController";
import { resolveMediaUrl } from "@/shared/lib/media";
import { BgmPanel } from "@/features/home/ui/BgmPanel";
import { GenerateActionBar } from "@/features/home/ui/GenerateActionBar";
import { HistoryList } from "@/features/home/ui/HistoryList";
import { HomeHeader } from "@/features/home/ui/HomeHeader";
import { MaterialSelector } from "@/features/home/ui/MaterialSelector";
import { TimelineEditor } from "@/features/home/ui/TimelineEditor";
import { ClipTrimmer } from "@/features/home/ui/ClipTrimmer";
import { PreviewPanel } from "@/features/home/ui/PreviewPanel";
import { RefAudioPanel } from "@/features/home/ui/RefAudioPanel";
import { ScriptEditor } from "@/features/home/ui/ScriptEditor";
import { TitleSubtitlePanel } from "@/features/home/ui/TitleSubtitlePanel";
import { VoiceSelector } from "@/features/home/ui/VoiceSelector";
import { GeneratedAudiosPanel } from "@/features/home/ui/GeneratedAudiosPanel";
export function HomePage() {
const router = useRouter();
@@ -34,8 +38,8 @@ export function HomePage() {
fetchMaterials,
deleteMaterial,
handleUpload,
selectedMaterial,
setSelectedMaterial,
selectedMaterials,
toggleMaterial,
handlePreviewMaterial,
editingMaterialId,
editMaterialName,
@@ -49,6 +53,13 @@ export function HomePage() {
setExtractModalOpen,
handleGenerateMeta,
isGeneratingMeta,
handleTranslate,
isTranslating,
originalText,
handleRestoreOriginal,
savedScripts,
handleSaveScript,
deleteSavedScript,
showStylePreview,
setShowStylePreview,
videoTitle,
@@ -69,12 +80,11 @@ export function HomePage() {
setTitleTopMargin,
subtitleBottomMargin,
setSubtitleBottomMargin,
enableSubtitles,
setEnableSubtitles,
outputAspectRatio,
setOutputAspectRatio,
resolveAssetUrl,
getFontFormat,
buildTextShadow,
materialDimensions,
ttsMode,
setTtsMode,
voices,
@@ -97,6 +107,8 @@ export function HomePage() {
saveEditing,
cancelEditing,
deleteRefAudio,
retranscribeRefAudio,
retranscribingId,
recordedBlob,
isRecording,
recordingTime,
@@ -104,7 +116,6 @@ export function HomePage() {
stopRecording,
useRecording,
formatRecordingTime,
fixedRefText,
bgmList,
bgmLoading,
bgmError,
@@ -130,12 +141,47 @@ export function HomePage() {
fetchGeneratedVideos,
registerVideoRef,
formatDate,
generatedAudios,
selectedAudio,
selectedAudioId,
isGeneratingAudio,
audioTask,
fetchGeneratedAudios,
handleGenerateAudio,
deleteAudio,
renameAudio,
selectAudio,
speed,
setSpeed,
timelineSegments,
reorderSegments,
setSourceRange,
clipTrimmerOpen,
setClipTrimmerOpen,
clipTrimmerSegmentId,
setClipTrimmerSegmentId,
} = useHomeController();
useEffect(() => {
router.prefetch("/publish");
}, [router]);
useEffect(() => {
if (typeof window === "undefined") return;
window.scrollTo({ top: 0, left: 0, behavior: "auto" });
}, []);
const clipTrimmerSegment = useMemo(
() => timelineSegments.find((s) => s.id === clipTrimmerSegmentId) ?? null,
[timelineSegments, clipTrimmerSegmentId]
);
const clipTrimmerMaterialUrl = useMemo(() => {
if (!clipTrimmerSegment) return null;
const mat = materials.find((m) => m.id === clipTrimmerSegment.materialId);
return mat?.path ? resolveMediaUrl(mat.path) : null;
}, [clipTrimmerSegment, materials]);
return (
<div className="min-h-dvh">
<HomeHeader />
@@ -144,42 +190,24 @@ export function HomePage() {
<div className="grid grid-cols-1 lg:grid-cols-2 gap-8">
{/* Left: input area */}
<div className="space-y-6">
{/* Material selection */}
<MaterialSelector
materials={materials}
selectedMaterial={selectedMaterial}
isFetching={isFetching}
lastMaterialCount={lastMaterialCount}
editingMaterialId={editingMaterialId}
editMaterialName={editMaterialName}
isUploading={isUploading}
uploadProgress={uploadProgress}
uploadError={uploadError}
fetchError={fetchError}
apiBase={apiBase}
onUploadChange={handleUpload}
onRefresh={fetchMaterials}
onSelectMaterial={setSelectedMaterial}
onPreviewMaterial={handlePreviewMaterial}
onStartEditing={startMaterialEditing}
onEditNameChange={setEditMaterialName}
onSaveEditing={saveMaterialEditing}
onCancelEditing={cancelMaterialEditing}
onDeleteMaterial={deleteMaterial}
onClearUploadError={() => setUploadError(null)}
registerMaterialRef={registerMaterialRef}
/>
{/* Script input */}
{/* 1. Script input */}
<ScriptEditor
text={text}
onChangeText={setText}
onOpenExtractModal={() => setExtractModalOpen(true)}
onGenerateMeta={handleGenerateMeta}
isGeneratingMeta={isGeneratingMeta}
onTranslate={handleTranslate}
isTranslating={isTranslating}
hasOriginalText={originalText !== null}
onRestoreOriginal={handleRestoreOriginal}
savedScripts={savedScripts}
onSaveScript={handleSaveScript}
onLoadScript={setText}
onDeleteScript={deleteSavedScript}
/>
{/* Title & subtitle settings */}
{/* 2. Title & subtitle settings */}
<TitleSubtitlePanel
showStylePreview={showStylePreview}
onTogglePreview={() => setShowStylePreview((prev) => !prev)}
@@ -207,16 +235,14 @@ export function HomePage() {
onTitleTopMarginChange={setTitleTopMargin}
subtitleBottomMargin={subtitleBottomMargin}
onSubtitleBottomMarginChange={setSubtitleBottomMargin}
enableSubtitles={enableSubtitles}
onToggleSubtitles={setEnableSubtitles}
resolveAssetUrl={resolveAssetUrl}
getFontFormat={getFontFormat}
buildTextShadow={buildTextShadow}
previewBaseWidth={materialDimensions?.width || 1080}
previewBaseHeight={materialDimensions?.height || 1920}
previewBaseWidth={outputAspectRatio === "16:9" ? 1920 : 1080}
previewBaseHeight={outputAspectRatio === "16:9" ? 1080 : 1920}
/>
{/* Voiceover mode */}
{/* 3. Voiceover mode */}
<VoiceSelector
ttsMode={ttsMode}
onSelectTtsMode={setTtsMode}
@@ -242,6 +268,8 @@ export function HomePage() {
onSaveEditing={saveEditing}
onCancelEditing={cancelEditing}
onDeleteRefAudio={deleteRefAudio}
onRetranscribe={retranscribeRefAudio}
retranscribingId={retranscribingId}
recordedBlob={recordedBlob}
isRecording={isRecording}
recordingTime={recordingTime}
@@ -249,12 +277,79 @@ export function HomePage() {
onStopRecording={stopRecording}
onUseRecording={useRecording}
formatRecordingTime={formatRecordingTime}
fixedRefText={fixedRefText}
/>
)}
/>
{/* Background music */}
{/* 4. Voiceover list */}
<GeneratedAudiosPanel
generatedAudios={generatedAudios}
selectedAudioId={selectedAudioId}
isGeneratingAudio={isGeneratingAudio}
audioTask={audioTask}
onGenerateAudio={handleGenerateAudio}
onRefresh={() => fetchGeneratedAudios()}
onSelectAudio={selectAudio}
onDeleteAudio={deleteAudio}
onRenameAudio={renameAudio}
hasText={!!text.trim()}
missingRefAudio={ttsMode === "voiceclone" && !selectedRefAudio}
speed={speed}
onSpeedChange={setSpeed}
ttsMode={ttsMode}
/>
{/* 5. Video materials */}
<MaterialSelector
materials={materials}
selectedMaterials={selectedMaterials}
isFetching={isFetching}
lastMaterialCount={lastMaterialCount}
editingMaterialId={editingMaterialId}
editMaterialName={editMaterialName}
isUploading={isUploading}
uploadProgress={uploadProgress}
uploadError={uploadError}
fetchError={fetchError}
apiBase={apiBase}
onUploadChange={handleUpload}
onRefresh={fetchMaterials}
onToggleMaterial={toggleMaterial}
onPreviewMaterial={handlePreviewMaterial}
onStartEditing={startMaterialEditing}
onEditNameChange={setEditMaterialName}
onSaveEditing={saveMaterialEditing}
onCancelEditing={cancelMaterialEditing}
onDeleteMaterial={deleteMaterial}
onClearUploadError={() => setUploadError(null)}
registerMaterialRef={registerMaterialRef}
/>
{/* 5.5 Timeline editor - blurred overlay until a voiceover and materials are selected */}
<div className="relative">
{(!selectedAudio || selectedMaterials.length === 0) && (
<div className="absolute inset-0 bg-black/50 backdrop-blur-sm rounded-2xl flex items-center justify-center z-10">
<p className="text-gray-400">
{!selectedAudio ? "Generate and select a voiceover first" : "Select materials first"}
</p>
</div>
)}
<TimelineEditor
audioDuration={selectedAudio?.duration_sec ?? 0}
audioUrl={selectedAudio ? (resolveMediaUrl(selectedAudio.path) || "") : ""}
segments={timelineSegments}
materials={materials}
outputAspectRatio={outputAspectRatio}
onOutputAspectRatioChange={setOutputAspectRatio}
onReorderSegment={reorderSegments}
onClickSegment={(seg) => {
setClipTrimmerSegmentId(seg.id);
setClipTrimmerOpen(true);
}}
/>
</div>
{/* 6. Background music */}
<BgmPanel
bgmList={bgmList}
bgmLoading={bgmLoading}
@@ -272,11 +367,12 @@ export function HomePage() {
registerBgmItemRef={registerBgmItemRef}
/>
{/* Generate button */}
{/* 7. Generate button */}
<GenerateActionBar
isGenerating={isGenerating}
progress={currentTask?.progress || 0}
disabled={isGenerating || !selectedMaterial || (ttsMode === "voiceclone" && !selectedRefAudio)}
materialCount={selectedMaterials.length}
disabled={isGenerating || selectedMaterials.length === 0 || !selectedAudio}
onGenerate={handleGenerate}
/>
</div>
@@ -312,6 +408,19 @@ export function HomePage() {
onClose={() => setExtractModalOpen(false)}
onApply={(nextText) => setText(nextText)}
/>
<ClipTrimmer
isOpen={clipTrimmerOpen}
segment={clipTrimmerSegment}
materialUrl={clipTrimmerMaterialUrl}
onConfirm={(sourceStart, sourceEnd) => {
if (clipTrimmerSegmentId) {
setSourceRange(clipTrimmerSegmentId, sourceStart, sourceEnd);
}
setClipTrimmerOpen(false);
}}
onClose={() => setClipTrimmerOpen(false)}
/>
</div>
);
}

View File

@@ -1,17 +1,10 @@
import type { ChangeEvent, MouseEvent } from "react";
import { type ChangeEvent, type MouseEvent } from "react";
import { Upload, RefreshCw, Eye, Trash2, X, Pencil, Check } from "lucide-react";
interface Material {
id: string;
name: string;
scene: string;
size_mb: number;
path: string;
}
import type { Material } from "@/shared/types/material";
interface MaterialSelectorProps {
materials: Material[];
selectedMaterial: string;
selectedMaterials: string[];
isFetching: boolean;
lastMaterialCount: number;
editingMaterialId: string | null;
@@ -23,7 +16,7 @@ interface MaterialSelectorProps {
apiBase: string;
onUploadChange: (event: ChangeEvent<HTMLInputElement>) => void;
onRefresh: () => void;
onSelectMaterial: (id: string) => void;
onToggleMaterial: (id: string) => void;
onPreviewMaterial: (path: string) => void;
onStartEditing: (material: Material, event: MouseEvent) => void;
onEditNameChange: (value: string) => void;
@@ -36,7 +29,7 @@ interface MaterialSelectorProps {
export function MaterialSelector({
materials,
selectedMaterial,
selectedMaterials,
isFetching,
lastMaterialCount,
editingMaterialId,
@@ -48,7 +41,7 @@ export function MaterialSelector({
apiBase,
onUploadChange,
onRefresh,
onSelectMaterial,
onToggleMaterial,
onPreviewMaterial,
onStartEditing,
onEditNameChange,
@@ -58,20 +51,23 @@ export function MaterialSelector({
onClearUploadError,
registerMaterialRef,
}: MaterialSelectorProps) {
const selectedSet = new Set(selectedMaterials);
const isFull = selectedMaterials.length >= 4;
return (
<div className="bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
<div className="flex justify-between items-center gap-2 mb-4">
<h2 className="text-base sm:text-lg font-semibold text-white flex items-center gap-2 whitespace-nowrap">
📹 Video Materials
<span className="ml-1 text-[11px] sm:text-xs text-gray-400/90 font-normal">
()
(up to 4)
</span>
</h2>
<div className="flex gap-1.5">
<input
type="file"
id="video-upload"
accept=".mp4,.mov,.avi"
accept="video/*"
onChange={onUploadChange}
className="hidden"
/>
@@ -126,7 +122,7 @@ export function MaterialSelector({
API: {apiBase}/api/materials/
</div>
) : isFetching && materials.length === 0 ? (
<div className="space-y-2 max-h-64 overflow-y-auto hide-scrollbar" style={{ contentVisibility: 'auto' }}>
<div className="space-y-2 max-h-48 sm:max-h-64 overflow-y-auto hide-scrollbar" style={{ contentVisibility: 'auto' }}>
{Array.from({ length: Math.min(4, Math.max(1, lastMaterialCount || 1)) }).map((_, index) => (
<div
key={`material-skeleton-${index}`}
@@ -147,82 +143,98 @@ export function MaterialSelector({
</div>
) : (
<div
className="space-y-2 max-h-64 overflow-y-auto hide-scrollbar"
className="space-y-2 max-h-48 sm:max-h-64 overflow-y-auto hide-scrollbar"
style={{ contentVisibility: 'auto' }}
>
{materials.map((m) => (
<div
key={m.id}
ref={(el) => registerMaterialRef(m.id, el)}
className={`p-3 rounded-lg border transition-all flex items-center justify-between group ${selectedMaterial === m.id
? "border-purple-500 bg-purple-500/20"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{editingMaterialId === m.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
value={editMaterialName}
onChange={(e) => onEditNameChange(e.target.value)}
className="flex-1 bg-black/40 border border-white/20 rounded-md px-2 py-1 text-xs text-white"
autoFocus
/>
<button
onClick={(e) => onSaveEditing(m.id, e)}
className="p-1 text-green-400 hover:text-green-300"
title="保存"
>
<Check className="h-4 w-4" />
</button>
<button
onClick={onCancelEditing}
className="p-1 text-gray-400 hover:text-white"
title="取消"
>
<X className="h-4 w-4" />
</button>
</div>
) : (
<button onClick={() => onSelectMaterial(m.id)} className="flex-1 text-left">
<div className="text-white text-sm truncate">{m.scene || m.name}</div>
<div className="text-gray-400 text-xs">{m.size_mb.toFixed(1)} MB</div>
</button>
)}
<div className="flex items-center gap-2 pl-2">
<button
onClick={(e) => {
e.stopPropagation();
if (m.path) {
onPreviewMaterial(m.path);
}
}}
className="p-1 text-gray-500 hover:text-white opacity-0 group-hover:opacity-100 transition-opacity"
title="预览视频"
>
<Eye className="h-4 w-4" />
</button>
{editingMaterialId !== m.id && (
<button
onClick={(e) => onStartEditing(m, e)}
className="p-1 text-gray-500 hover:text-white opacity-0 group-hover:opacity-100 transition-opacity"
title="重命名"
>
<Pencil className="h-4 w-4" />
{materials.map((m) => {
const isSelected = selectedSet.has(m.id);
return (
<div
key={m.id}
ref={(el) => registerMaterialRef(m.id, el)}
className={`p-3 rounded-lg border transition-all flex items-center justify-between group ${isSelected
? "border-purple-500 bg-purple-500/20"
: isFull
? "border-white/5 bg-white/[0.02] opacity-50 cursor-not-allowed"
: "border-white/10 bg-white/5 hover:border-white/30"
}`}
>
{editingMaterialId === m.id ? (
<div className="flex-1 flex items-center gap-2" onClick={(e) => e.stopPropagation()}>
<input
value={editMaterialName}
onChange={(e) => onEditNameChange(e.target.value)}
className="flex-1 bg-black/40 border border-white/20 rounded-md px-2 py-1 text-xs text-white"
autoFocus
/>
<button
onClick={(e) => onSaveEditing(m.id, e)}
className="p-1 text-green-400 hover:text-green-300"
title="保存"
>
<Check className="h-4 w-4" />
</button>
<button
onClick={onCancelEditing}
className="p-1 text-gray-400 hover:text-white"
title="取消"
>
<X className="h-4 w-4" />
</button>
</div>
) : (
<button onClick={() => onToggleMaterial(m.id)} className="flex-1 text-left flex items-center gap-2">
{/* 复选框 */}
<span
className={`flex-shrink-0 w-4 h-4 rounded border flex items-center justify-center text-[10px] ${isSelected
? "border-purple-500 bg-purple-500 text-white"
: "border-white/30 text-transparent"
}`}
>
{isSelected ? "✓" : ""}
</span>
<div className="min-w-0">
<div className="text-white text-sm truncate">{m.scene || m.name}</div>
<div className="text-gray-400 text-xs">{m.size_mb.toFixed(1)} MB</div>
</div>
</button>
)}
<button
onClick={(e) => {
e.stopPropagation();
onDeleteMaterial(m.id);
}}
className="p-1 text-gray-500 hover:text-red-400 opacity-0 group-hover:opacity-100 transition-opacity"
title="删除素材"
>
<Trash2 className="h-4 w-4" />
</button>
<div className="flex items-center gap-2 pl-2">
<button
onClick={(e) => {
e.stopPropagation();
if (m.path) {
onPreviewMaterial(m.path);
}
}}
className="p-1 text-gray-500 hover:text-white opacity-0 group-hover:opacity-100 transition-opacity"
title="预览视频"
>
<Eye className="h-4 w-4" />
</button>
{editingMaterialId !== m.id && (
<button
onClick={(e) => onStartEditing(m, e)}
className="p-1 text-gray-500 hover:text-white opacity-0 group-hover:opacity-100 transition-opacity"
title="重命名"
>
<Pencil className="h-4 w-4" />
</button>
)}
<button
onClick={(e) => {
e.stopPropagation();
onDeleteMaterial(m.id);
}}
className="p-1 text-gray-500 hover:text-red-400 opacity-0 group-hover:opacity-100 transition-opacity"
title="删除素材"
>
<Trash2 className="h-4 w-4" />
</button>
</div>
</div>
</div>
))}
);
})}
</div>
)}
</div>
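The hunk above switches MaterialSelector from single-select (`onSelectMaterial`) to multi-select with a 4-item cap (`selectedSet`, `isFull`, disabled rows). A minimal sketch of the parent-side toggle logic those props imply — function and constant names are assumptions, since the owning page component is not part of this diff:

```typescript
// Assumed cap, matching the component's `selectedMaterials.length >= 4` check.
const MAX_MATERIALS = 4;

// Toggle one material id in the selection, enforcing the cap.
function toggleMaterial(selected: string[], id: string): string[] {
  if (selected.includes(id)) {
    // Already selected: clicking again deselects.
    return selected.filter((s) => s !== id);
  }
  if (selected.length >= MAX_MATERIALS) {
    // Cap reached: mirror the disabled/opacity-50 rows and ignore the click.
    return selected;
  }
  return [...selected, id];
}
```

Returning a new array (rather than mutating) keeps this usable directly inside a React state setter.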


@@ -1,6 +1,6 @@
import { useEffect, useState } from "react";
import type { MouseEvent } from "react";
import { Upload, RefreshCw, Play, Pause, Pencil, Trash2, Check, X, Mic, Square } from "lucide-react";
import { Upload, RefreshCw, Play, Pause, Pencil, Trash2, Check, X, Mic, Square, RotateCw } from "lucide-react";
interface RefAudio {
id: string;
@@ -29,6 +29,8 @@ interface RefAudioPanelProps {
onSaveEditing: (id: string, event: MouseEvent) => void;
onCancelEditing: (event: MouseEvent) => void;
onDeleteRefAudio: (id: string) => void;
onRetranscribe: (id: string) => void;
retranscribingId: string | null;
recordedBlob: Blob | null;
isRecording: boolean;
recordingTime: number;
@@ -36,9 +38,10 @@ interface RefAudioPanelProps {
onStopRecording: () => void;
onUseRecording: () => void;
formatRecordingTime: (seconds: number) => string;
fixedRefText: string;
}
const OLD_FIXED_REF_TEXT = "其实生活中有许多美好的瞬间";
export function RefAudioPanel({
refAudios,
selectedRefAudio,
@@ -57,6 +60,8 @@ export function RefAudioPanel({
onSaveEditing,
onCancelEditing,
onDeleteRefAudio,
onRetranscribe,
retranscribingId,
recordedBlob,
isRecording,
recordingTime,
@@ -64,7 +69,6 @@ export function RefAudioPanel({
onStopRecording,
onUseRecording,
formatRecordingTime,
fixedRefText,
}: RefAudioPanelProps) {
const [recordedUrl, setRecordedUrl] = useState<string | null>(null);
@@ -81,6 +85,9 @@ export function RefAudioPanel({
};
}, [recordedBlob]);
const needsRetranscribe = (audio: RefAudio) =>
audio.ref_text.startsWith(OLD_FIXED_REF_TEXT);
return (
<div className="space-y-4">
<div>
@@ -122,7 +129,7 @@ export function RefAudioPanel({
{isUploadingRef && (
<div className="mb-2 p-2 bg-purple-500/10 rounded text-sm text-purple-300">
...
...
</div>
)}
@@ -192,6 +199,17 @@ export function RefAudioPanel({
<Play className="h-3.5 w-3.5" />
)}
</button>
<button
onClick={(e) => {
e.stopPropagation();
onRetranscribe(audio.id);
}}
disabled={retranscribingId === audio.id}
className="text-gray-400 hover:text-cyan-400 text-xs disabled:opacity-50"
title="重新识别文字"
>
<RotateCw className={`h-3.5 w-3.5 ${retranscribingId === audio.id ? 'animate-spin' : ''}`} />
</button>
<button
onClick={(e) => onStartEditing(audio, e)}
className="text-gray-400 hover:text-blue-400 text-xs"
@@ -211,7 +229,12 @@ export function RefAudioPanel({
</button>
</div>
</div>
<div className="text-gray-400 text-xs">{audio.duration_sec.toFixed(1)}s</div>
<div className="text-gray-400 text-xs">
{audio.duration_sec.toFixed(1)}s
{needsRetranscribe(audio) && (
<span className="text-yellow-500 ml-1" title="需要重新识别文字"></span>
)}
</div>
</>
)}
</div>
@@ -221,7 +244,7 @@ export function RefAudioPanel({
</div>
<div className="border-t border-white/10 pt-4">
<span className="text-sm text-gray-300 mb-2 block">🎤 线</span>
<span className="text-sm text-gray-300 mb-2 block">🎤 线 <span className="text-xs text-gray-500"> 3-10 </span></span>
<div className="flex gap-2 items-center">
{!isRecording ? (
<button
@@ -264,15 +287,9 @@ export function RefAudioPanel({
)}
</div>
<div className="border-t border-white/10 pt-4">
<label className="text-sm text-gray-300 mb-2 block">📝 /</label>
<div className="w-full bg-black/30 border border-white/10 rounded-lg p-3 text-white text-sm">
{fixedRefText}
</div>
<p className="text-xs text-gray-500 mt-1">
</p>
</div>
<p className="text-xs text-gray-500 mt-2 border-t border-white/10 pt-3">
3-10
</p>
</div>
);
}
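The check behind the new warning badge and RotateCw button can be isolated as a pure predicate: reference audios whose `ref_text` still begins with the old fixed prompt were never transcribed by ASR and should be re-run. The string literal is copied from the component; reducing `RefAudio` to the one field used is a simplification for the sketch:

```typescript
// Legacy fixed prompt that pre-ASR reference audios were saved with.
const OLD_FIXED_REF_TEXT = "其实生活中有许多美好的瞬间";

// True when the clip's text is the old fixed prompt (possibly with a suffix),
// i.e. it needs re-transcription.
function needsRetranscribe(audio: { ref_text: string }): boolean {
  return audio.ref_text.startsWith(OLD_FIXED_REF_TEXT);
}
```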


@@ -1,4 +1,18 @@
import { FileText, Loader2, Sparkles } from "lucide-react";
import { useEffect, useRef, useState } from "react";
import { FileText, History, Languages, Loader2, RotateCcw, Save, Sparkles, Trash2 } from "lucide-react";
import type { SavedScript } from "@/features/home/model/useSavedScripts";
const LANGUAGES = [
{ code: "English", label: "英语 English" },
{ code: "日本語", label: "日语 日本語" },
{ code: "한국어", label: "韩语 한국어" },
{ code: "Français", label: "法语 Français" },
{ code: "Deutsch", label: "德语 Deutsch" },
{ code: "Español", label: "西班牙语 Español" },
{ code: "Русский", label: "俄语 Русский" },
{ code: "Italiano", label: "意大利语 Italiano" },
{ code: "Português", label: "葡萄牙语 Português" },
];
interface ScriptEditorProps {
text: string;
@@ -6,6 +20,14 @@ interface ScriptEditorProps {
onOpenExtractModal: () => void;
onGenerateMeta: () => void;
isGeneratingMeta: boolean;
onTranslate: (targetLang: string) => void;
isTranslating: boolean;
hasOriginalText: boolean;
onRestoreOriginal: () => void;
savedScripts: SavedScript[];
onSaveScript: () => void;
onLoadScript: (content: string) => void;
onDeleteScript: (id: string) => void;
}
export function ScriptEditor({
@@ -14,39 +36,176 @@ export function ScriptEditor({
onOpenExtractModal,
onGenerateMeta,
isGeneratingMeta,
onTranslate,
isTranslating,
hasOriginalText,
onRestoreOriginal,
savedScripts,
onSaveScript,
onLoadScript,
onDeleteScript,
}: ScriptEditorProps) {
const [showLangMenu, setShowLangMenu] = useState(false);
const langMenuRef = useRef<HTMLDivElement>(null);
const [showHistoryMenu, setShowHistoryMenu] = useState(false);
const historyMenuRef = useRef<HTMLDivElement>(null);
useEffect(() => {
if (!showLangMenu) return;
const handleClickOutside = (e: MouseEvent) => {
if (langMenuRef.current && !langMenuRef.current.contains(e.target as Node)) {
setShowLangMenu(false);
}
};
document.addEventListener("mousedown", handleClickOutside);
return () => document.removeEventListener("mousedown", handleClickOutside);
}, [showLangMenu]);
useEffect(() => {
if (!showHistoryMenu) return;
const handleClickOutside = (e: MouseEvent) => {
if (historyMenuRef.current && !historyMenuRef.current.contains(e.target as Node)) {
setShowHistoryMenu(false);
}
};
document.addEventListener("mousedown", handleClickOutside);
return () => document.removeEventListener("mousedown", handleClickOutside);
}, [showHistoryMenu]);
const handleSelectLang = (langCode: string) => {
setShowLangMenu(false);
onTranslate(langCode);
};
const formatDate = (ts: number) => {
const d = new Date(ts);
return `${(d.getMonth() + 1).toString().padStart(2, "0")}-${d.getDate().toString().padStart(2, "0")} ${d.getHours().toString().padStart(2, "0")}:${d.getMinutes().toString().padStart(2, "0")}`;
};
return (
<div className="bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
<div className="flex flex-wrap justify-between items-center gap-2 mb-4">
<h2 className="text-base sm:text-lg font-semibold text-white flex items-center gap-2 whitespace-nowrap">
<div className="relative z-10 bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
<div className="mb-4 space-y-3">
<h2 className="text-base sm:text-lg font-semibold text-white flex items-center gap-2">
</h2>
<div className="flex gap-2 flex-shrink-0">
<div className="flex gap-2 flex-wrap justify-end items-center">
{/* 历史文案 */}
<div className="relative" ref={historyMenuRef}>
<button
onClick={() => setShowHistoryMenu((prev) => !prev)}
className="h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap bg-gray-600 hover:bg-gray-500 text-white inline-flex items-center gap-1"
>
<History className="h-3.5 w-3.5" />
</button>
{showHistoryMenu && (
<div className="absolute left-0 top-full mt-1 z-50 bg-gray-800 border border-white/10 rounded-lg shadow-xl py-1 min-w-[220px] max-h-[280px] overflow-y-auto">
{savedScripts.length === 0 ? (
<div className="px-3 py-3 text-xs text-gray-500 text-center"></div>
) : (
savedScripts.map((script) => (
<div
key={script.id}
className="flex items-center gap-1 px-3 py-1.5 hover:bg-white/10 transition-colors group"
>
<button
onClick={() => {
onLoadScript(script.content);
setShowHistoryMenu(false);
}}
className="flex-1 text-left min-w-0"
>
<div className="text-xs text-gray-200 truncate">{script.name}</div>
<div className="text-[10px] text-gray-500">{formatDate(script.savedAt)}</div>
</button>
<button
onClick={(e) => {
e.stopPropagation();
onDeleteScript(script.id);
}}
className="opacity-0 group-hover:opacity-100 p-1 text-gray-500 hover:text-red-400 transition-all shrink-0"
>
<Trash2 className="h-3 w-3" />
</button>
</div>
))
)}
</div>
)}
</div>
<button
onClick={onOpenExtractModal}
className="px-2 py-1 text-xs rounded transition-all whitespace-nowrap bg-purple-600 hover:bg-purple-700 text-white flex items-center gap-1"
className="h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap bg-purple-600 hover:bg-purple-700 text-white inline-flex items-center gap-1"
>
<FileText className="h-3.5 w-3.5" />
</button>
<div className="relative" ref={langMenuRef}>
<button
onClick={() => setShowLangMenu((prev) => !prev)}
disabled={isTranslating || !text.trim()}
className={`h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap inline-flex items-center gap-1 ${
isTranslating || !text.trim()
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-emerald-600 to-teal-600 hover:from-emerald-700 hover:to-teal-700 text-white"
}`}
>
{isTranslating ? (
<>
<Loader2 className="h-3.5 w-3.5 animate-spin" />
...
</>
) : (
<>
<Languages className="h-3.5 w-3.5" />
AI多语言
</>
)}
</button>
{showLangMenu && (
<div className="absolute right-0 top-full mt-1 z-50 bg-gray-800 border border-white/10 rounded-lg shadow-xl py-1 min-w-[160px]">
{hasOriginalText && (
<>
<button
onClick={() => { setShowLangMenu(false); onRestoreOriginal(); }}
className="w-full text-left px-3 py-1.5 text-xs text-amber-400 hover:bg-white/10 transition-colors flex items-center gap-1"
>
<RotateCcw className="h-3 w-3" />
</button>
<div className="border-t border-white/10 my-1" />
</>
)}
{LANGUAGES.map((lang) => (
<button
key={lang.code}
onClick={() => handleSelectLang(lang.code)}
className="w-full text-left px-3 py-1.5 text-xs text-gray-200 hover:bg-white/10 transition-colors"
>
{lang.label}
</button>
))}
</div>
)}
</div>
<button
onClick={onGenerateMeta}
disabled={isGeneratingMeta || !text.trim()}
className={`px-2 py-1 text-xs rounded transition-all whitespace-nowrap ${isGeneratingMeta || !text.trim()
className={`h-7 px-2.5 text-xs rounded transition-all whitespace-nowrap inline-flex items-center gap-1 ${isGeneratingMeta || !text.trim()
? "bg-gray-600 cursor-not-allowed text-gray-400"
: "bg-gradient-to-r from-blue-600 to-cyan-600 hover:from-blue-700 hover:to-cyan-700 text-white"
}`}
>
{isGeneratingMeta ? (
<span className="flex items-center gap-1">
<>
<Loader2 className="h-3.5 w-3.5 animate-spin" />
...
</span>
</>
) : (
<span className="flex items-center gap-1">
<>
<Sparkles className="h-3.5 w-3.5" />
AI生成标题标签
</span>
</>
)}
</button>
</div>
@@ -57,9 +216,20 @@ export function ScriptEditor({
placeholder="请输入你想说的话..."
className="w-full h-40 bg-black/30 border border-white/10 rounded-xl p-4 text-white placeholder-gray-500 resize-none focus:outline-none focus:border-purple-500 transition-colors hide-scrollbar"
/>
<div className="flex justify-between mt-2 text-sm text-gray-400">
<div className="flex items-center justify-between mt-2 text-sm text-gray-400">
<span>{text.length} </span>
<span>: ~{Math.ceil(text.length / 4)} </span>
<button
onClick={onSaveScript}
disabled={!text.trim()}
className={`px-2.5 py-1 text-xs rounded transition-all flex items-center gap-1 ${
!text.trim()
? "bg-gray-700 cursor-not-allowed text-gray-500"
: "bg-amber-600/80 hover:bg-amber-600 text-white"
}`}
>
<Save className="h-3 w-3" />
</button>
</div>
</div>
);
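The new `onTranslate` / `hasOriginalText` / `onRestoreOriginal` props imply translate-and-undo bookkeeping in the parent: keep the pre-translation text so "restore original" can revert an AI translation, even after translating several times. A sketch under that assumption (state shape and helper names are hypothetical; the real state lives in the page component, not shown here):

```typescript
interface ScriptState {
  text: string;
  originalText: string | null; // non-null once a translation replaced the text
}

// Replace the text with a translation, remembering the first original only.
function applyTranslation(state: ScriptState, translated: string): ScriptState {
  return {
    text: translated,
    // Repeated translations must not overwrite the saved original.
    originalText: state.originalText ?? state.text,
  };
}

// Revert to the saved original; no-op when nothing was translated.
function restoreOriginal(state: ScriptState): ScriptState {
  return state.originalText === null
    ? state
    : { text: state.originalText, originalText: null };
}
```

`hasOriginalText` then falls out as `state.originalText !== null`, which is what gates the "restore" entry in the language menu.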


@@ -0,0 +1,349 @@
import { useEffect, useRef, useCallback, useState } from "react";
import WaveSurfer from "wavesurfer.js";
import { ChevronDown } from "lucide-react";
import type { TimelineSegment } from "@/features/home/model/useTimelineEditor";
import type { Material } from "@/shared/types/material";
interface TimelineEditorProps {
audioDuration: number;
audioUrl: string;
segments: TimelineSegment[];
materials: Material[];
outputAspectRatio: "9:16" | "16:9";
onOutputAspectRatioChange: (ratio: "9:16" | "16:9") => void;
onReorderSegment: (fromIdx: number, toIdx: number) => void;
onClickSegment: (segment: TimelineSegment) => void;
}
function formatTime(sec: number): string {
const m = Math.floor(sec / 60);
const s = sec % 60;
return `${String(m).padStart(2, "0")}:${s.toFixed(1).padStart(4, "0")}`;
}
export function TimelineEditor({
audioDuration,
audioUrl,
segments,
materials,
outputAspectRatio,
onOutputAspectRatioChange,
onReorderSegment,
onClickSegment,
}: TimelineEditorProps) {
const waveRef = useRef<HTMLDivElement>(null);
const wsRef = useRef<WaveSurfer | null>(null);
const [waveReady, setWaveReady] = useState(false);
const [isPlaying, setIsPlaying] = useState(false);
// Refs for high-frequency DOM updates (avoid 60fps re-renders)
const playheadRef = useRef<HTMLDivElement>(null);
const timeRef = useRef<HTMLSpanElement>(null);
const audioDurationRef = useRef(audioDuration);
useEffect(() => {
audioDurationRef.current = audioDuration;
}, [audioDuration]);
// Drag-to-reorder state
const [dragFromIdx, setDragFromIdx] = useState<number | null>(null);
const [dragOverIdx, setDragOverIdx] = useState<number | null>(null);
// Aspect ratio dropdown
const [ratioOpen, setRatioOpen] = useState(false);
const ratioRef = useRef<HTMLDivElement>(null);
const ratioOptions = [
{ value: "9:16" as const, label: "竖屏 9:16" },
{ value: "16:9" as const, label: "横屏 16:9" },
];
const currentRatioLabel =
ratioOptions.find((opt) => opt.value === outputAspectRatio)?.label ?? "竖屏 9:16";
useEffect(() => {
const handler = (e: MouseEvent) => {
if (ratioRef.current && !ratioRef.current.contains(e.target as Node)) {
setRatioOpen(false);
}
};
if (ratioOpen) document.addEventListener("mousedown", handler);
return () => document.removeEventListener("mousedown", handler);
}, [ratioOpen]);
// Create / recreate wavesurfer when audioUrl changes
useEffect(() => {
if (!waveRef.current || !audioUrl) return;
const playheadEl = playheadRef.current;
const timeEl = timeRef.current;
// Destroy previous instance
if (wsRef.current) {
wsRef.current.destroy();
wsRef.current = null;
}
const ws = WaveSurfer.create({
container: waveRef.current,
height: 56,
waveColor: "#6d28d9",
progressColor: "#a855f7",
barWidth: 2,
barGap: 1,
barRadius: 2,
cursorWidth: 1,
cursorColor: "#e879f9",
interact: true,
normalize: true,
});
// Click waveform → seek + auto-play
ws.on("interaction", () => ws.play());
ws.on("play", () => setIsPlaying(true));
ws.on("pause", () => setIsPlaying(false));
ws.on("finish", () => {
setIsPlaying(false);
if (playheadRef.current) playheadRef.current.style.display = "none";
});
// High-frequency: update playhead + time via refs (no React re-render)
ws.on("timeupdate", (time: number) => {
const dur = audioDurationRef.current;
if (playheadRef.current && dur > 0) {
playheadRef.current.style.left = `${(time / dur) * 100}%`;
playheadRef.current.style.display = "block";
}
if (timeRef.current) {
timeRef.current.textContent = formatTime(time);
}
});
ws.load(audioUrl);
wsRef.current = ws;
return () => {
ws.destroy();
wsRef.current = null;
setIsPlaying(false);
if (playheadEl) playheadEl.style.display = "none";
if (timeEl) timeEl.textContent = formatTime(0);
};
}, [audioUrl, waveReady]);
// Callback ref to detect when waveRef div mounts
const waveCallbackRef = useCallback((node: HTMLDivElement | null) => {
(waveRef as React.MutableRefObject<HTMLDivElement | null>).current = node;
setWaveReady(!!node);
}, []);
const handlePlayPause = useCallback(() => {
wsRef.current?.playPause();
}, []);
// Drag-to-reorder handlers
const handleDragStart = useCallback((idx: number, e: React.DragEvent) => {
setDragFromIdx(idx);
e.dataTransfer.effectAllowed = "move";
e.dataTransfer.setData("text/plain", String(idx));
}, []);
const handleDragOver = useCallback((idx: number, e: React.DragEvent) => {
e.preventDefault();
e.dataTransfer.dropEffect = "move";
setDragOverIdx(idx);
}, []);
const handleDragLeave = useCallback(() => {
setDragOverIdx(null);
}, []);
const handleDrop = useCallback((toIdx: number, e: React.DragEvent) => {
e.preventDefault();
const fromIdx = parseInt(e.dataTransfer.getData("text/plain"), 10);
if (!isNaN(fromIdx) && fromIdx !== toIdx) {
onReorderSegment(fromIdx, toIdx);
}
setDragFromIdx(null);
setDragOverIdx(null);
}, [onReorderSegment]);
const handleDragEnd = useCallback(() => {
setDragFromIdx(null);
setDragOverIdx(null);
}, []);
// Filter visible vs overflow segments
const visibleSegments = segments.filter((s) => s.start < audioDuration);
const overflowSegments = segments.filter((s) => s.start >= audioDuration);
const hasSegments = visibleSegments.length > 0;
return (
<div className="bg-white/5 rounded-2xl p-4 sm:p-6 border border-white/10 backdrop-blur-sm">
<div className="flex items-center justify-between mb-3">
<h2 className="text-base sm:text-lg font-semibold text-white flex items-center gap-2">
🎞
</h2>
<div className="flex items-center gap-2 text-xs text-gray-400">
<div ref={ratioRef} className="relative">
<button
type="button"
onClick={() => setRatioOpen((v) => !v)}
className="px-2 py-1 text-xs bg-white/10 hover:bg-white/20 rounded text-gray-300 whitespace-nowrap flex items-center gap-1 transition-all"
title="设置输出画面比例"
>
: {currentRatioLabel}
<ChevronDown className={`h-3 w-3 transition-transform ${ratioOpen ? "rotate-180" : ""}`} />
</button>
{ratioOpen && (
<div className="absolute right-0 top-full mt-1 bg-gray-800 border border-white/20 rounded-lg shadow-xl py-1 z-50 min-w-[106px]">
{ratioOptions.map((opt) => (
<button
key={opt.value}
type="button"
onClick={() => {
onOutputAspectRatioChange(opt.value);
setRatioOpen(false);
}}
className={`w-full text-left px-3 py-1.5 text-xs transition-colors ${
outputAspectRatio === opt.value
? "bg-purple-600/40 text-purple-200"
: "text-gray-300 hover:bg-white/10"
}`}
>
{opt.label}
</button>
))}
</div>
)}
</div>
{audioUrl && (
<>
<button
onClick={handlePlayPause}
className="w-7 h-7 flex items-center justify-center rounded-full bg-white/10 hover:bg-white/20 text-white transition-colors"
title={isPlaying ? "暂停" : "播放"}
>
{isPlaying ? "⏸" : "▶"}
</button>
<span ref={timeRef} className="tabular-nums">00:00.0</span>
<span className="text-gray-600">/</span>
<span className="tabular-nums">{formatTime(audioDuration)}</span>
</>
)}
</div>
</div>
{/* Waveform — always rendered so ref stays mounted */}
<div className="relative mb-1">
<div ref={waveCallbackRef} className="rounded-lg overflow-hidden bg-black/20 cursor-pointer" style={{ minHeight: 56 }} />
</div>
{/* Segment blocks or empty placeholder */}
{hasSegments ? (
<>
<div className="relative h-14 flex select-none">
{/* Playhead — syncs with audio playback */}
<div
ref={playheadRef}
className="absolute top-0 h-full w-0.5 bg-fuchsia-400 z-10 pointer-events-none"
style={{ display: "none", left: "0%" }}
/>
{visibleSegments.map((seg, i) => {
const left = (seg.start / audioDuration) * 100;
const width = ((seg.end - seg.start) / audioDuration) * 100;
const segDur = seg.end - seg.start;
const isDragTarget = dragOverIdx === i && dragFromIdx !== i;
// Compute loop portion for the last visible segment
const isLastVisible = i === visibleSegments.length - 1;
let loopPercent = 0;
if (isLastVisible && audioDuration > 0) {
const mat = materials.find((m) => m.id === seg.materialId);
const matDur = mat?.duration_sec ?? 0;
const effDur = (seg.sourceEnd > seg.sourceStart)
? (seg.sourceEnd - seg.sourceStart)
: Math.max(matDur - seg.sourceStart, 0);
if (effDur > 0 && segDur > effDur + 0.1) {
loopPercent = ((segDur - effDur) / segDur) * 100;
}
}
return (
<div key={seg.id} className="absolute top-0 h-full" style={{ left: `${left}%`, width: `${width}%` }}>
<button
draggable
onDragStart={(e) => handleDragStart(i, e)}
onDragOver={(e) => handleDragOver(i, e)}
onDragLeave={handleDragLeave}
onDrop={(e) => handleDrop(i, e)}
onDragEnd={handleDragEnd}
onClick={() => onClickSegment(seg)}
className={`relative w-full h-full rounded-lg flex flex-col items-center justify-center overflow-hidden cursor-grab active:cursor-grabbing transition-all border ${
isDragTarget
? "ring-2 ring-purple-400 border-purple-400 scale-[1.02]"
: dragFromIdx === i
? "opacity-50 border-white/10"
: "hover:opacity-90 border-white/10"
}`}
style={{ backgroundColor: seg.color + "33", borderColor: isDragTarget ? undefined : seg.color + "66" }}
title={`拖拽可调换顺序 · 点击设置截取范围\n${seg.materialName}\n${segDur.toFixed(1)}s${loopPercent > 0 ? ` (含循环 ${(segDur * loopPercent / 100).toFixed(1)}s)` : ""}`}
>
<span className="text-[11px] text-white/90 truncate max-w-full px-1 leading-tight z-[1]">
{seg.materialName}
</span>
<span className="text-[10px] text-white/60 leading-tight z-[1]">
{segDur.toFixed(1)}s
</span>
{seg.sourceStart > 0 && (
<span className="text-[9px] text-amber-400/80 leading-tight z-[1]">
{seg.sourceStart.toFixed(1)}s
</span>
)}
{/* Loop fill stripe overlay */}
{loopPercent > 0 && (
<div
className="absolute top-0 right-0 h-full pointer-events-none flex items-center justify-center"
style={{
width: `${loopPercent}%`,
background: `repeating-linear-gradient(-45deg, transparent, transparent 3px, rgba(255,255,255,0.07) 3px, rgba(255,255,255,0.07) 6px)`,
borderLeft: "1px dashed rgba(255,255,255,0.25)",
}}
>
<span className="text-[9px] text-white/30"></span>
</div>
)}
</button>
</div>
);
})}
</div>
{/* Overflow segments — shown as gray chips */}
{overflowSegments.length > 0 && (
<div className="flex flex-wrap items-center gap-1.5 mt-1.5">
<span className="text-[10px] text-gray-500">使:</span>
{overflowSegments.map((seg) => (
<span
key={seg.id}
className="text-[10px] text-gray-500 bg-white/5 border border-white/10 rounded px-1.5 py-0.5"
>
{seg.materialName}
</span>
))}
</div>
)}
<p className="text-[10px] text-gray-500 mt-1.5">
· ·
</p>
</>
) : (
<>
<div className="h-14 bg-white/5 rounded-lg" />
<p className="text-[10px] text-gray-500 mt-1.5">
</p>
</>
)}
</div>
);
}
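The striped "loop" overlay on the last visible segment comes from a small calculation: if the segment outlasts the usable source clip, the excess portion is looped, and its share of the segment width is rendered as stripes. Extracted as a pure function mirroring the inline logic above (the trimmed-down `Seg` shape is an assumption for the sketch):

```typescript
interface Seg {
  start: number;       // segment position on the timeline (s)
  end: number;
  sourceStart: number; // trim range within the source material (s)
  sourceEnd: number;   // 0 or <= sourceStart means "until material end"
}

// Percentage of the segment's width that is looped (0 when it fits).
function loopPercent(seg: Seg, materialDurationSec: number): number {
  const segDur = seg.end - seg.start;
  const effDur = seg.sourceEnd > seg.sourceStart
    ? seg.sourceEnd - seg.sourceStart
    : Math.max(materialDurationSec - seg.sourceStart, 0);
  // The 0.1 s slack avoids flagging rounding noise as a loop.
  if (effDur > 0 && segDur > effDur + 0.1) {
    return ((segDur - effDur) / segDur) * 100;
  }
  return 0;
}
```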


@@ -52,8 +52,6 @@ interface TitleSubtitlePanelProps {
onTitleTopMarginChange: (value: number) => void;
subtitleBottomMargin: number;
onSubtitleBottomMarginChange: (value: number) => void;
enableSubtitles: boolean;
onToggleSubtitles: (value: boolean) => void;
resolveAssetUrl: (path?: string | null) => string | null;
getFontFormat: (fontFile?: string) => string;
buildTextShadow: (color: string, size: number) => string;
@@ -82,8 +80,6 @@ export function TitleSubtitlePanel({
onTitleTopMarginChange,
subtitleBottomMargin,
onSubtitleBottomMarginChange,
enableSubtitles,
onToggleSubtitles,
resolveAssetUrl,
getFontFormat,
buildTextShadow,
@@ -117,7 +113,7 @@ export function TitleSubtitlePanel({
subtitleFontSize={subtitleFontSize}
titleTopMargin={titleTopMargin}
subtitleBottomMargin={subtitleBottomMargin}
enableSubtitles={enableSubtitles}
enableSubtitles={true}
resolveAssetUrl={resolveAssetUrl}
getFontFormat={getFontFormat}
buildTextShadow={buildTextShadow}
@@ -186,7 +182,7 @@ export function TitleSubtitlePanel({
</div>
)}
{enableSubtitles && subtitleStyles.length > 0 && (
{subtitleStyles.length > 0 && (
<div className="mt-4">
<label className="text-sm text-gray-300 mb-2 block"></label>
<div className="grid grid-cols-2 gap-2">
@@ -232,22 +228,6 @@ export function TitleSubtitlePanel({
</div>
</div>
)}
<div className="mt-4 pt-4 border-t border-white/10 flex items-center justify-between">
<div>
<span className="text-sm text-gray-300"></span>
<p className="text-xs text-gray-500 mt-1">OK效果字幕</p>
</div>
<label className="relative inline-flex items-center cursor-pointer">
<input
type="checkbox"
checked={enableSubtitles}
onChange={(e) => onToggleSubtitles(e.target.checked)}
className="sr-only peer"
/>
<div className="w-11 h-6 bg-gray-600 peer-focus:outline-none rounded-full peer peer-checked:after:translate-x-full peer-checked:after:border-white after:content-[''] after:absolute after:top-[2px] after:left-[2px] after:bg-white after:border-gray-300 after:border after:rounded-full after:h-5 after:w-5 after:transition-all peer-checked:bg-purple-600"></div>
</label>
</div>
</div>
);
}


@@ -5,8 +5,8 @@ import { ApiResponse, unwrap } from "@/shared/api/types";
import { formatDate, getApiBaseUrl, isAbsoluteUrl, resolveMediaUrl } from "@/shared/lib/media";
import { clampTitle } from "@/shared/lib/title";
import { useTitleInput } from "@/shared/hooks/useTitleInput";
import { useAuth } from "@/contexts/AuthContext";
import { useTask } from "@/contexts/TaskContext";
import { useAuth } from "@/shared/contexts/AuthContext";
import { useTask } from "@/shared/contexts/TaskContext";
import { toast } from "sonner";
import { usePublishPrefetch } from "@/shared/hooks/usePublishPrefetch";
import {


@@ -0,0 +1,8 @@
export interface Material {
id: string;
name: string;
path: string;
size_mb: number;
scene?: string;
duration_sec?: number;
}


@@ -0,0 +1,76 @@
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at mikelei@mobvoi.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

models/CosyVoice/FAQ.md (new file)

@@ -0,0 +1,16 @@
## ModuleNotFoundError: No module named 'matcha'
Matcha-TTS is a third-party submodule; check the `third_party` directory. If `Matcha-TTS` is missing, run `git submodule update --init --recursive`.
Run `export PYTHONPATH=third_party/Matcha-TTS` if you want to use `from cosyvoice.cli.cosyvoice import CosyVoice` in a Python script.
## cannot find resource.zip or cannot unzip resource.zip
Please make sure you have git-lfs installed. Execute
```sh
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd-0.3.6-cp38-cp38-linux_x86_64.whl
```

201
models/CosyVoice/LICENSE Normal file
View File

@@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

264
models/CosyVoice/README.md Normal file
View File

@@ -0,0 +1,264 @@
![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)
## 👉🏻 CosyVoice 👈🏻
**Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/pdf/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [Huggingface](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)
**CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/pdf/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
**CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)
## Highlight🔥
**Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
### Key Features
- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
- **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
- **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and thus suitable for production use.
- **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
- **Bi-Streaming**: Supports both text-in streaming and audio-out streaming, and achieves latency as low as 150ms while maintaining high-quality audio output.
- **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
## Roadmap
- [x] 2025/12
- [x] release Fun-CosyVoice3-0.5B-2512 base model, rl model and its training/inference script
- [x] release Fun-CosyVoice3-0.5B modelscope gradio space
- [x] 2025/08
- [x] Thanks to the contribution from NVIDIA Yuekai Zhang, add triton trtllm runtime support and cosyvoice2 grpo training support
- [x] 2025/07
- [x] release Fun-CosyVoice 3.0 eval set
- [x] 2025/05
- [x] add CosyVoice2-0.5B vllm support
- [x] 2024/12
- [x] 25hz CosyVoice2-0.5B released
- [x] 2024/09
- [x] 25hz CosyVoice-300M base model
- [x] 25hz CosyVoice-300M voice conversion function
- [x] 2024/08
- [x] Repetition Aware Sampling(RAS) inference for llm stability
- [x] Streaming inference mode support, including kv cache and sdpa for rtf optimization
- [x] 2024/07
- [x] Flow matching training support
- [x] WeTextProcessing support when ttsfrd is not available
- [x] Fastapi server and client
## Evaluation
| Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>SS (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>SS (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>SS (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
## Install
### Clone and install
- Clone the repo
``` sh
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If you failed to clone the submodule due to network failures, please run the following command until success
cd CosyVoice
git submodule update --init --recursive
```
- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create Conda env:
``` sh
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# If you encounter sox compatibility issues
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
```
### Model download
We strongly recommend that you download our pretrained `Fun-CosyVoice3-0.5B`, `CosyVoice2-0.5B`, `CosyVoice-300M`, `CosyVoice-300M-SFT`, and `CosyVoice-300M-Instruct` models and the `CosyVoice-ttsfrd` resource.
``` python
# modelscope SDK model download
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
# for oversea users, huggingface SDK model download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('FunAudioLLM/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('FunAudioLLM/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
Optionally, you can unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization performance.
Note that this step is not necessary; if you do not install the `ttsfrd` package, `wetext` is used by default.
``` sh
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
### Basic Usage
We strongly recommend using `Fun-CosyVoice3-0.5B` for better performance.
Follow the code in `example.py` for detailed usage of each model.
```sh
python example.py
```
#### vLLM Usage
CosyVoice2/3 now supports **vLLM 0.11.x+ (V1 engine)** and **vLLM 0.9.0 (legacy)**.
Older vLLM versions (<0.9.0) do not support CosyVoice inference, and versions in between (e.g., 0.10.x) are not tested.
Notice that `vllm` pins many specific dependencies. We recommend creating a separate env, so that if your hardware does not support vllm, your existing env is not corrupted.
``` sh
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
# for vllm==0.9.0
pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# for vllm>=0.11.0
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
python vllm_example.py
```
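The version pairings above can be captured in a small helper. This is our own sketch, not part of the upstream repo: the function name is ours, the pins come from the install commands above, and the fallback for untested 0.10.x versions is an assumption.

```python
# Sketch: map an installed vLLM version to the transformers pin used in the
# install commands above. The helper name and the behaviour for untested
# 0.10.x versions are our assumptions, not part of the upstream repo.
def transformers_pin(vllm_version: str) -> str:
    major, minor = (int(p) for p in vllm_version.split(".")[:2])
    if (major, minor) < (0, 9):
        raise ValueError("vLLM < 0.9.0 does not support CosyVoice inference")
    # 0.9.x -> 4.51.3; 0.11.x+ -> 4.57.1. 0.10.x is untested upstream,
    # so we conservatively fall back to the legacy pin.
    return "4.51.3" if (major, minor) < (0, 11) else "4.57.1"
```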
#### Start web demo
You can use our web demo page to get familiar with CosyVoice quickly.
Please see the demo website for details.
``` python
# change iic/CosyVoice-300M-SFT for sft inference, or iic/CosyVoice-300M-Instruct for instruct inference
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
```
#### Advanced Usage
For advanced users, we have provided training and inference scripts in `examples/libritts`.
#### Build for deployment
Optionally, if you want service deployment, you can run the following steps.
``` sh
cd runtime/python
docker build -t cosyvoice:v1.0 .
# change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference
# for grpc usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity"
cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
# for fastapi usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity"
cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
```
#### Using Nvidia TensorRT-LLM for deployment
Using TensorRT-LLM to accelerate the CosyVoice2 LLM can give a 4x speedup compared with the Hugging Face Transformers implementation.
For a quick start:
``` sh
cd runtime/triton_trtllm
docker compose up -d
```
For more details, you could check [here](https://github.com/FunAudioLLM/CosyVoice/tree/main/runtime/triton_trtllm)
## Discussion & Communication
You can directly discuss on [Github Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
You can also scan the QR code to join our official Dingding chat group.
<img src="./asset/dingding.png" width="250px">
## Acknowledge
1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).
## Citations
``` bibtex
@article{du2024cosyvoice,
title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
journal={arXiv preprint arXiv:2407.05407},
year={2024}
}
@article{du2024cosyvoice,
title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
journal={arXiv preprint arXiv:2412.10117},
year={2024}
}
@article{du2025cosyvoice,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}
@inproceedings{lyu2025build,
title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--2},
year={2025},
organization={IEEE}
}
```
## Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.

View File

@@ -0,0 +1,93 @@
# Copyright (c) 2020 Mobvoi Inc (Di Wu)
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import glob
import yaml
import torch
def get_args():
parser = argparse.ArgumentParser(description='average model')
parser.add_argument('--dst_model', required=True, help='averaged model')
parser.add_argument('--src_path',
required=True,
help='src model path for average')
parser.add_argument('--val_best',
action="store_true",
help='averaged model')
parser.add_argument('--num',
default=5,
type=int,
help='nums for averaged model')
args = parser.parse_args()
print(args)
return args
def main():
args = get_args()
val_scores = []
if args.val_best:
yamls = glob.glob('{}/*.yaml'.format(args.src_path))
yamls = [
f for f in yamls
if not (os.path.basename(f).startswith('train')
or os.path.basename(f).startswith('init'))
]
for y in yamls:
with open(y, 'r') as f:
dic_yaml = yaml.load(f, Loader=yaml.BaseLoader)
loss = float(dic_yaml['loss_dict']['loss'])
epoch = int(dic_yaml['epoch'])
step = int(dic_yaml['step'])
tag = dic_yaml['tag']
val_scores += [[epoch, step, loss, tag]]
sorted_val_scores = sorted(val_scores,
key=lambda x: x[2],
reverse=False)
print("best val (epoch, step, loss, tag) = " +
str(sorted_val_scores[:args.num]))
path_list = [
args.src_path + '/epoch_{}_whole.pt'.format(score[0])
for score in sorted_val_scores[:args.num]
]
print(path_list)
avg = {}
num = args.num
assert num == len(path_list)
for path in path_list:
print('Processing {}'.format(path))
states = torch.load(path, map_location=torch.device('cpu'))
for k in states.keys():
if k not in ['step', 'epoch']:
if k not in avg.keys():
avg[k] = states[k].clone()
else:
avg[k] += states[k]
# average
for k in avg.keys():
if avg[k] is not None:
# pytorch 1.6 use true_divide instead of /=
avg[k] = torch.true_divide(avg[k], num)
print('Saving to {}'.format(args.dst_model))
torch.save(avg, args.dst_model)
if __name__ == '__main__':
main()
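The averaging loop above reduces to summing each parameter across checkpoints and dividing by the checkpoint count. A torch-free sketch with plain dicts of floats (the helper name is ours, for illustration only):

```python
# Torch-free sketch of the checkpoint-averaging step above: sum each
# parameter across checkpoints, then divide by the number of checkpoints.
# Keys like 'step'/'epoch' are skipped, mirroring the script's filter.
def average_states(states_list, skip=("step", "epoch")):
    avg = {}
    for states in states_list:
        for k, v in states.items():
            if k in skip:
                continue
            avg[k] = avg.get(k, 0.0) + v
    n = len(states_list)
    return {k: v / n for k, v in avg.items()}
```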

View File

@@ -0,0 +1,99 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import argparse
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
import os
import sys
import torch
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../..'.format(ROOT_DIR))
sys.path.append('{}/../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import AutoModel
from cosyvoice.utils.file_utils import logging
def get_args():
parser = argparse.ArgumentParser(description='export your model for deployment')
parser.add_argument('--model_dir',
type=str,
default='pretrained_models/CosyVoice-300M',
help='local path')
args = parser.parse_args()
print(args)
return args
def get_optimized_script(model, preserved_attrs=[]):
script = torch.jit.script(model)
if preserved_attrs != []:
script = torch.jit.freeze(script, preserved_attrs=preserved_attrs)
else:
script = torch.jit.freeze(script)
script = torch.jit.optimize_for_inference(script)
return script
def main():
args = get_args()
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(levelname)s %(message)s')
torch._C._jit_set_fusion_strategy([('STATIC', 1)])
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)
model = AutoModel(model_dir=args.model_dir)
if model.__class__.__name__ == 'CosyVoice':
# 1. export llm text_encoder
llm_text_encoder = model.model.llm.text_encoder
script = get_optimized_script(llm_text_encoder)
script.save('{}/llm.text_encoder.fp32.zip'.format(args.model_dir))
script = get_optimized_script(llm_text_encoder.half())
script.save('{}/llm.text_encoder.fp16.zip'.format(args.model_dir))
logging.info('successfully export llm_text_encoder')
# 2. export llm llm
llm_llm = model.model.llm.llm
script = get_optimized_script(llm_llm, ['forward_chunk'])
script.save('{}/llm.llm.fp32.zip'.format(args.model_dir))
script = get_optimized_script(llm_llm.half(), ['forward_chunk'])
script.save('{}/llm.llm.fp16.zip'.format(args.model_dir))
logging.info('successfully export llm_llm')
# 3. export flow encoder
flow_encoder = model.model.flow.encoder
script = get_optimized_script(flow_encoder)
script.save('{}/flow.encoder.fp32.zip'.format(args.model_dir))
script = get_optimized_script(flow_encoder.half())
script.save('{}/flow.encoder.fp16.zip'.format(args.model_dir))
logging.info('successfully export flow_encoder')
elif model.__class__.__name__ == 'CosyVoice2':
# 1. export flow encoder
flow_encoder = model.model.flow.encoder
script = get_optimized_script(flow_encoder)
script.save('{}/flow.encoder.fp32.zip'.format(args.model_dir))
script = get_optimized_script(flow_encoder.half())
script.save('{}/flow.encoder.fp16.zip'.format(args.model_dir))
logging.info('successfully export flow_encoder')
else:
raise ValueError('unsupported model type')
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,114 @@
# Copyright (c) 2024 Antgroup Inc (authors: Zhoubofan, hexisyztem@icloud.com)
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import argparse
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
import os
import sys
import onnxruntime
import random
import torch
from tqdm import tqdm
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../..'.format(ROOT_DIR))
sys.path.append('{}/../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import AutoModel
from cosyvoice.utils.file_utils import logging
def get_dummy_input(batch_size, seq_len, out_channels, device):
x = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
mask = torch.ones((batch_size, 1, seq_len), dtype=torch.float32, device=device)
mu = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
t = torch.rand((batch_size), dtype=torch.float32, device=device)
spks = torch.rand((batch_size, out_channels), dtype=torch.float32, device=device)
cond = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
return x, mask, mu, t, spks, cond
def get_args():
parser = argparse.ArgumentParser(description='export your model for deployment')
parser.add_argument('--model_dir',
type=str,
default='pretrained_models/CosyVoice-300M',
help='local path')
args = parser.parse_args()
print(args)
return args
@torch.no_grad()
def main():
args = get_args()
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(levelname)s %(message)s')
model = AutoModel(model_dir=args.model_dir)
# 1. export flow decoder estimator
estimator = model.model.flow.decoder.estimator
estimator.eval()
device = model.model.device
batch_size, seq_len = 2, 256
out_channels = model.model.flow.decoder.estimator.out_channels
x, mask, mu, t, spks, cond = get_dummy_input(batch_size, seq_len, out_channels, device)
torch.onnx.export(
estimator,
(x, mask, mu, t, spks, cond),
'{}/flow.decoder.estimator.fp32.onnx'.format(args.model_dir),
export_params=True,
opset_version=18,
do_constant_folding=True,
input_names=['x', 'mask', 'mu', 't', 'spks', 'cond'],
output_names=['estimator_out'],
dynamic_axes={
'x': {2: 'seq_len'},
'mask': {2: 'seq_len'},
'mu': {2: 'seq_len'},
'cond': {2: 'seq_len'},
'estimator_out': {2: 'seq_len'},
}
)
# 2. test computation consistency
option = onnxruntime.SessionOptions()
option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
option.intra_op_num_threads = 1
providers = ['CUDAExecutionProvider' if torch.cuda.is_available() else 'CPUExecutionProvider']
estimator_onnx = onnxruntime.InferenceSession('{}/flow.decoder.estimator.fp32.onnx'.format(args.model_dir),
sess_options=option, providers=providers)
for _ in tqdm(range(10)):
x, mask, mu, t, spks, cond = get_dummy_input(batch_size, random.randint(16, 512), out_channels, device)
output_pytorch = estimator(x, mask, mu, t, spks, cond)
ort_inputs = {
'x': x.cpu().numpy(),
'mask': mask.cpu().numpy(),
'mu': mu.cpu().numpy(),
't': t.cpu().numpy(),
'spks': spks.cpu().numpy(),
'cond': cond.cpu().numpy()
}
output_onnx = estimator_onnx.run(None, ort_inputs)[0]
torch.testing.assert_close(output_pytorch, torch.from_numpy(output_onnx).to(device), rtol=1e-2, atol=1e-4)
logging.info('successfully export estimator')
if __name__ == "__main__":
main()
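The parity loop above accepts ONNX outputs that agree with PyTorch within a mixed tolerance: an elementwise check of the form `|a - b| <= atol + rtol * |b|`. A minimal stdlib sketch of that rule (the helper name `allclose_mixed` is illustrative, not part of CosyVoice):

```python
def allclose_mixed(a, b, rtol=1e-2, atol=1e-4):
    """Elementwise mixed absolute/relative tolerance check, mirroring
    the rule used by torch.testing / numpy.allclose."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# A small relative drift on large values passes; the same absolute
# drift on values near zero is judged by atol alone.
print(allclose_mixed([100.0, 0.0], [100.5, 0.00005]))  # True
print(allclose_mixed([100.0, 0.0], [100.5, 0.01]))     # False
```

This is why the test tolerates `rtol=1e-2` on large activations while still catching sub-`1e-4` errors near zero.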

View File

@@ -0,0 +1,195 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import argparse
import datetime
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
from copy import deepcopy
import os
import torch
import torch.distributed as dist
import deepspeed
from hyperpyyaml import load_hyperpyyaml
from torch.distributed.elastic.multiprocessing.errors import record
from cosyvoice.utils.losses import DPOLoss
from cosyvoice.utils.executor import Executor
from cosyvoice.utils.train_utils import (
init_distributed,
init_dataset_and_dataloader,
init_optimizer_and_scheduler,
init_summarywriter, save_model,
wrap_cuda_model, check_modify_and_save_config)
def get_args():
parser = argparse.ArgumentParser(description='training your network')
parser.add_argument('--train_engine',
default='torch_ddp',
choices=['torch_ddp', 'deepspeed'],
help='Engine for paralleled training')
parser.add_argument('--model', required=True, help='model which will be trained')
parser.add_argument('--ref_model', required=False, help='ref model used in dpo')
parser.add_argument('--config', required=True, help='config file')
parser.add_argument('--train_data', required=True, help='train data file')
parser.add_argument('--cv_data', required=True, help='cv data file')
parser.add_argument('--qwen_pretrain_path', required=False, help='qwen pretrain path')
parser.add_argument('--onnx_path', required=False, help='onnx path, which is required for online feature extraction')
parser.add_argument('--checkpoint', help='checkpoint model')
parser.add_argument('--model_dir', required=True, help='save model dir')
parser.add_argument('--tensorboard_dir',
default='tensorboard',
help='tensorboard log dir')
parser.add_argument('--ddp.dist_backend',
dest='dist_backend',
default='nccl',
choices=['nccl', 'gloo'],
help='distributed backend')
parser.add_argument('--num_workers',
default=0,
type=int,
help='num of subprocess workers for reading')
parser.add_argument('--prefetch',
default=100,
type=int,
help='prefetch number')
parser.add_argument('--pin_memory',
action='store_true',
default=False,
help='Use pinned memory buffers used for reading')
parser.add_argument('--use_amp',
action='store_true',
default=False,
help='Use automatic mixed precision training')
parser.add_argument('--dpo',
action='store_true',
default=False,
help='Use Direct Preference Optimization')
parser.add_argument('--deepspeed.save_states',
dest='save_states',
default='model_only',
choices=['model_only', 'model+optimizer'],
help='save model/optimizer states')
parser.add_argument('--timeout',
default=60,
type=int,
help='timeout (in seconds) of cosyvoice_join.')
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()
return args
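Flags such as `--ddp.dist_backend` and `--deepspeed.save_states` contain a dot, which is not a valid Python attribute name, so `dest=` remaps them onto plain attributes. A minimal sketch of that argparse pattern:

```python
import argparse

parser = argparse.ArgumentParser()
# The dotted flag would produce an invalid attribute name on the
# namespace, so dest= remaps it to a plain identifier.
parser.add_argument('--ddp.dist_backend', dest='dist_backend',
                    default='nccl', choices=['nccl', 'gloo'])
args = parser.parse_args(['--ddp.dist_backend', 'gloo'])
print(args.dist_backend)  # gloo
```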
@record
def main():
args = get_args()
if args.onnx_path is not None:  # optional flag; os.environ rejects None
    os.environ['onnx_path'] = args.onnx_path
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(levelname)s %(message)s')
# gan train has some special initialization logic
gan = args.model == 'hifigan'
override_dict = {k: None for k in ['llm', 'flow', 'hift', 'hifigan'] if k != args.model}
if gan is True:
override_dict.pop('hift')
if args.qwen_pretrain_path is not None:
override_dict['qwen_pretrain_path'] = args.qwen_pretrain_path
with open(args.config, 'r') as f:
configs = load_hyperpyyaml(f, overrides=override_dict)
if gan is True:
configs['train_conf'] = configs['train_conf_gan']
configs['train_conf'].update(vars(args))
# Init env for ddp
init_distributed(args)
# Get dataset & dataloader
train_dataset, cv_dataset, train_data_loader, cv_data_loader = \
init_dataset_and_dataloader(args, configs, gan, args.dpo)
# Do some sanity checks and save config to args.model_dir
configs = check_modify_and_save_config(args, configs)
# Tensorboard summary
writer = init_summarywriter(args)
# load checkpoint
if args.dpo is True:
configs[args.model].forward = configs[args.model].forward_dpo
model = configs[args.model]
start_step, start_epoch = 0, -1
if args.checkpoint is not None:
if os.path.exists(args.checkpoint):
state_dict = torch.load(args.checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
if 'step' in state_dict:
start_step = state_dict['step']
if 'epoch' in state_dict:
start_epoch = state_dict['epoch']
else:
logging.warning('checkpoint {} does not exist!'.format(args.checkpoint))
# Dispatch model from cpu to gpu
model = wrap_cuda_model(args, model)
# Get optimizer & scheduler
model, optimizer, scheduler, optimizer_d, scheduler_d = init_optimizer_and_scheduler(args, configs, model, gan)
scheduler.set_step(start_step)
if scheduler_d is not None:
scheduler_d.set_step(start_step)
# Save init checkpoints
info_dict = deepcopy(configs['train_conf'])
info_dict['step'] = start_step
info_dict['epoch'] = start_epoch
save_model(model, 'init', info_dict)
# DPO related
if args.dpo is True:
ref_model = deepcopy(configs[args.model])
state_dict = torch.load(args.ref_model, map_location='cpu')
ref_model.load_state_dict(state_dict, strict=False)
dpo_loss = DPOLoss(beta=0.01, label_smoothing=0.0, ipo=False)
# NOTE wrapping ref_model in DDP may be unnecessary since its parameters are never updated
ref_model = wrap_cuda_model(args, ref_model)
else:
ref_model, dpo_loss = None, None
# Get executor
executor = Executor(gan=gan, ref_model=ref_model, dpo_loss=dpo_loss)
executor.step = start_step
# Init scaler, used for pytorch amp mixed precision training
scaler = torch.cuda.amp.GradScaler() if args.use_amp else None
logging.info('start step {} start epoch {}'.format(start_step, start_epoch))
# Start training loop
for epoch in range(start_epoch + 1, info_dict['max_epoch']):
executor.epoch = epoch
train_dataset.set_epoch(epoch)
dist.barrier()
group_join = dist.new_group(backend="gloo", timeout=datetime.timedelta(seconds=args.timeout))
if gan is True:
executor.train_one_epoc_gan(model, optimizer, scheduler, optimizer_d, scheduler_d, train_data_loader, cv_data_loader,
writer, info_dict, scaler, group_join)
else:
executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, scaler, group_join, ref_model=ref_model)
dist.destroy_process_group(group_join)
if __name__ == '__main__':
main()
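The checkpoint branch above falls back to step 0 / epoch -1 when the optional bookkeeping keys are absent, so a fresh run and a resumed run share one code path. Assuming a plain-dict checkpoint, the pattern reduces to:

```python
def resume_position(state_dict):
    """Return (start_step, start_epoch), defaulting to a fresh run
    when the checkpoint carries no bookkeeping keys."""
    start_step = state_dict.get('step', 0)
    start_epoch = state_dict.get('epoch', -1)
    return start_step, start_epoch

print(resume_position({'step': 1200, 'epoch': 3}))  # (1200, 3)
print(resume_position({}))                          # (0, -1)
```

The `-1` default matters because the training loop starts at `start_epoch + 1`, so a fresh run begins at epoch 0.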

View File

@@ -0,0 +1,240 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
from typing import Generator
from tqdm import tqdm
from hyperpyyaml import load_hyperpyyaml
from modelscope import snapshot_download
import torch
from cosyvoice.cli.frontend import CosyVoiceFrontEnd
from cosyvoice.cli.model import CosyVoiceModel, CosyVoice2Model, CosyVoice3Model
from cosyvoice.utils.file_utils import logging
from cosyvoice.utils.class_utils import get_model_type
class CosyVoice:
def __init__(self, model_dir, load_jit=False, load_trt=False, fp16=False, trt_concurrent=1):
self.model_dir = model_dir
self.fp16 = fp16
if not os.path.exists(model_dir):
model_dir = snapshot_download(model_dir)
hyper_yaml_path = '{}/cosyvoice.yaml'.format(model_dir)
if not os.path.exists(hyper_yaml_path):
raise ValueError('{} not found!'.format(hyper_yaml_path))
with open(hyper_yaml_path, 'r') as f:
configs = load_hyperpyyaml(f)
assert get_model_type(configs) == CosyVoiceModel, 'do not use {} for CosyVoice initialization!'.format(model_dir)
self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
configs['feat_extractor'],
'{}/campplus.onnx'.format(model_dir),
'{}/speech_tokenizer_v1.onnx'.format(model_dir),
'{}/spk2info.pt'.format(model_dir),
configs['allowed_special'])
self.sample_rate = configs['sample_rate']
if torch.cuda.is_available() is False and (load_jit is True or load_trt is True or fp16 is True):
load_jit, load_trt, fp16 = False, False, False
logging.warning('no cuda device, set load_jit/load_trt/fp16 to False')
self.model = CosyVoiceModel(configs['llm'], configs['flow'], configs['hift'], fp16)
self.model.load('{}/llm.pt'.format(model_dir),
'{}/flow.pt'.format(model_dir),
'{}/hift.pt'.format(model_dir))
if load_jit:
self.model.load_jit('{}/llm.text_encoder.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/llm.llm.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/flow.encoder.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'))
if load_trt:
self.model.load_trt('{}/flow.decoder.estimator.{}.mygpu.plan'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/flow.decoder.estimator.fp32.onnx'.format(model_dir),
trt_concurrent,
self.fp16)
del configs
def list_available_spks(self):
spks = list(self.frontend.spk2info.keys())
return spks
def add_zero_shot_spk(self, prompt_text, prompt_wav, zero_shot_spk_id):
assert zero_shot_spk_id != '', 'do not use empty zero_shot_spk_id'
model_input = self.frontend.frontend_zero_shot('', prompt_text, prompt_wav, self.sample_rate, '')
del model_input['text']
del model_input['text_len']
self.frontend.spk2info[zero_shot_spk_id] = model_input
return True
def save_spkinfo(self):
torch.save(self.frontend.spk2info, '{}/spk2info.pt'.format(self.model_dir))
def inference_sft(self, tts_text, spk_id, stream=False, speed=1.0, text_frontend=True):
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_sft(i, spk_id)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_zero_shot(self, tts_text, prompt_text, prompt_wav, zero_shot_spk_id='', stream=False, speed=1.0, text_frontend=True):
if self.__class__.__name__ == 'CosyVoice3' and '<|endofprompt|>' not in prompt_text + tts_text:
logging.warning('<|endofprompt|> not found in CosyVoice3 inference, check your input text')
prompt_text = self.frontend.text_normalize(prompt_text, split=False, text_frontend=text_frontend)
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
if (not isinstance(i, Generator)) and len(i) < 0.5 * len(prompt_text):
logging.warning('synthesis text {} is much shorter than prompt text {}, this may lead to bad performance'.format(i, prompt_text))
model_input = self.frontend.frontend_zero_shot(i, prompt_text, prompt_wav, self.sample_rate, zero_shot_spk_id)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_cross_lingual(self, tts_text, prompt_wav, zero_shot_spk_id='', stream=False, speed=1.0, text_frontend=True):
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_cross_lingual(i, prompt_wav, self.sample_rate, zero_shot_spk_id)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_instruct(self, tts_text, spk_id, instruct_text, stream=False, speed=1.0, text_frontend=True):
assert self.__class__.__name__ == 'CosyVoice', 'inference_instruct is only implemented for CosyVoice!'
instruct_text = self.frontend.text_normalize(instruct_text, split=False, text_frontend=text_frontend)
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_instruct(i, spk_id, instruct_text)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_vc(self, source_wav, prompt_wav, stream=False, speed=1.0):
model_input = self.frontend.frontend_vc(source_wav, prompt_wav, self.sample_rate)
start_time = time.time()
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
class CosyVoice2(CosyVoice):
def __init__(self, model_dir, load_jit=False, load_trt=False, load_vllm=False, fp16=False, trt_concurrent=1):
self.model_dir = model_dir
self.fp16 = fp16
if not os.path.exists(model_dir):
model_dir = snapshot_download(model_dir)
hyper_yaml_path = '{}/cosyvoice2.yaml'.format(model_dir)
if not os.path.exists(hyper_yaml_path):
raise ValueError('{} not found!'.format(hyper_yaml_path))
with open(hyper_yaml_path, 'r') as f:
configs = load_hyperpyyaml(f, overrides={'qwen_pretrain_path': os.path.join(model_dir, 'CosyVoice-BlankEN')})
assert get_model_type(configs) == CosyVoice2Model, 'do not use {} for CosyVoice2 initialization!'.format(model_dir)
self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
configs['feat_extractor'],
'{}/campplus.onnx'.format(model_dir),
'{}/speech_tokenizer_v2.onnx'.format(model_dir),
'{}/spk2info.pt'.format(model_dir),
configs['allowed_special'])
self.sample_rate = configs['sample_rate']
if torch.cuda.is_available() is False and (load_jit is True or load_trt is True or load_vllm is True or fp16 is True):
load_jit, load_trt, load_vllm, fp16 = False, False, False, False
logging.warning('no cuda device, set load_jit/load_trt/load_vllm/fp16 to False')
self.model = CosyVoice2Model(configs['llm'], configs['flow'], configs['hift'], fp16)
self.model.load('{}/llm.pt'.format(model_dir),
'{}/flow.pt'.format(model_dir),
'{}/hift.pt'.format(model_dir))
if load_vllm:
self.model.load_vllm('{}/vllm'.format(model_dir))
if load_jit:
self.model.load_jit('{}/flow.encoder.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'))
if load_trt:
self.model.load_trt('{}/flow.decoder.estimator.{}.mygpu.plan'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/flow.decoder.estimator.fp32.onnx'.format(model_dir),
trt_concurrent,
self.fp16)
del configs
def inference_instruct2(self, tts_text, instruct_text, prompt_wav, zero_shot_spk_id='', stream=False, speed=1.0, text_frontend=True):
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_instruct2(i, instruct_text, prompt_wav, self.sample_rate, zero_shot_spk_id)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
class CosyVoice3(CosyVoice2):
def __init__(self, model_dir, load_trt=False, load_vllm=False, fp16=False, trt_concurrent=1):
self.model_dir = model_dir
self.fp16 = fp16
if not os.path.exists(model_dir):
model_dir = snapshot_download(model_dir)
hyper_yaml_path = '{}/cosyvoice3.yaml'.format(model_dir)
if not os.path.exists(hyper_yaml_path):
raise ValueError('{} not found!'.format(hyper_yaml_path))
with open(hyper_yaml_path, 'r') as f:
configs = load_hyperpyyaml(f, overrides={'qwen_pretrain_path': os.path.join(model_dir, 'CosyVoice-BlankEN')})
assert get_model_type(configs) == CosyVoice3Model, 'do not use {} for CosyVoice3 initialization!'.format(model_dir)
self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
configs['feat_extractor'],
'{}/campplus.onnx'.format(model_dir),
'{}/speech_tokenizer_v3.onnx'.format(model_dir),
'{}/spk2info.pt'.format(model_dir),
configs['allowed_special'])
self.sample_rate = configs['sample_rate']
if torch.cuda.is_available() is False and (load_trt is True or fp16 is True):
load_trt, fp16 = False, False
logging.warning('no cuda device, set load_trt/fp16 to False')
self.model = CosyVoice3Model(configs['llm'], configs['flow'], configs['hift'], fp16)
self.model.load('{}/llm.pt'.format(model_dir),
'{}/flow.pt'.format(model_dir),
'{}/hift.pt'.format(model_dir))
if load_vllm:
self.model.load_vllm('{}/vllm'.format(model_dir))
if load_trt:
if self.fp16 is True:
logging.warning('DiT tensorRT fp16 engine has known performance issues, use with caution!')
self.model.load_trt('{}/flow.decoder.estimator.{}.mygpu.plan'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/flow.decoder.estimator.fp32.onnx'.format(model_dir),
trt_concurrent,
self.fp16)
del configs
def AutoModel(**kwargs):
if not os.path.exists(kwargs['model_dir']):
kwargs['model_dir'] = snapshot_download(kwargs['model_dir'])
if os.path.exists('{}/cosyvoice.yaml'.format(kwargs['model_dir'])):
return CosyVoice(**kwargs)
elif os.path.exists('{}/cosyvoice2.yaml'.format(kwargs['model_dir'])):
return CosyVoice2(**kwargs)
elif os.path.exists('{}/cosyvoice3.yaml'.format(kwargs['model_dir'])):
return CosyVoice3(**kwargs)
else:
raise TypeError('No valid model type found!')
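`AutoModel` dispatches purely on which yaml config is present in the model directory. A self-contained sketch of that probe (the helper `detect_model_version` is illustrative; the version strings stand in for the real CosyVoice classes):

```python
import os
import tempfile

def detect_model_version(model_dir):
    """Mimic AutoModel's dispatch: the yaml config found in the model
    directory decides which CosyVoice class would be instantiated."""
    for yaml_name, version in [('cosyvoice.yaml', 'CosyVoice'),
                               ('cosyvoice2.yaml', 'CosyVoice2'),
                               ('cosyvoice3.yaml', 'CosyVoice3')]:
        if os.path.exists(os.path.join(model_dir, yaml_name)):
            return version
    raise TypeError('No valid model type found!')

with tempfile.TemporaryDirectory() as model_dir:
    open(os.path.join(model_dir, 'cosyvoice2.yaml'), 'w').close()
    detected = detect_model_version(model_dir)
print(detected)  # CosyVoice2
```

Because the probe checks `cosyvoice.yaml` first, a directory containing several configs resolves to the oldest matching version.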

View File

@@ -0,0 +1,224 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import partial
from typing import Generator
import json
import onnxruntime
import torch
import numpy as np
import whisper
from typing import Callable
import torchaudio.compliance.kaldi as kaldi
import os
import re
import inflect
from cosyvoice.utils.file_utils import logging, load_wav
from cosyvoice.utils.frontend_utils import contains_chinese, replace_blank, replace_corner_mark, remove_bracket, spell_out_number, split_paragraph, is_only_punctuation
class CosyVoiceFrontEnd:
def __init__(self,
get_tokenizer: Callable,
feat_extractor: Callable,
campplus_model: str,
speech_tokenizer_model: str,
spk2info: str = '',
allowed_special: str = 'all'):
self.tokenizer = get_tokenizer()
self.feat_extractor = feat_extractor
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
option = onnxruntime.SessionOptions()
option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
option.intra_op_num_threads = 1
self.campplus_session = onnxruntime.InferenceSession(campplus_model, sess_options=option, providers=["CPUExecutionProvider"])
self.speech_tokenizer_session = onnxruntime.InferenceSession(speech_tokenizer_model, sess_options=option,
providers=["CUDAExecutionProvider" if torch.cuda.is_available() else
"CPUExecutionProvider"])
if os.path.exists(spk2info):
self.spk2info = torch.load(spk2info, map_location=self.device, weights_only=True)
else:
self.spk2info = {}
self.allowed_special = allowed_special
self.inflect_parser = inflect.engine()
# NOTE stay compatible when no text frontend tool is available
try:
import ttsfrd
self.frd = ttsfrd.TtsFrontendEngine()
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
assert self.frd.initialize('{}/../../pretrained_models/CosyVoice-ttsfrd/resource'.format(ROOT_DIR)) is True, \
'failed to initialize ttsfrd resource'
self.frd.set_lang_type('pinyinvg')
self.text_frontend = 'ttsfrd'
logging.info('use ttsfrd frontend')
except Exception:
try:
from wetext import Normalizer as ZhNormalizer
from wetext import Normalizer as EnNormalizer
self.zh_tn_model = ZhNormalizer(remove_erhua=False)
self.en_tn_model = EnNormalizer()
self.text_frontend = 'wetext'
logging.info('use wetext frontend')
except Exception:
self.text_frontend = ''
logging.info('no text frontend is available')
def _extract_text_token(self, text):
if isinstance(text, Generator):
logging.info('get tts_text generator, will return _extract_text_token_generator!')
# NOTE add a dummy text_token_len for compatibility
return self._extract_text_token_generator(text), torch.tensor([0], dtype=torch.int32).to(self.device)
else:
text_token = self.tokenizer.encode(text, allowed_special=self.allowed_special)
text_token = torch.tensor([text_token], dtype=torch.int32).to(self.device)
text_token_len = torch.tensor([text_token.shape[1]], dtype=torch.int32).to(self.device)
return text_token, text_token_len
def _extract_text_token_generator(self, text_generator):
for text in text_generator:
text_token, _ = self._extract_text_token(text)
for i in range(text_token.shape[1]):
yield text_token[:, i: i + 1]
def _extract_speech_token(self, prompt_wav):
speech = load_wav(prompt_wav, 16000)
assert speech.shape[1] / 16000 <= 30, 'extracting speech tokens is not supported for audio longer than 30s'
feat = whisper.log_mel_spectrogram(speech, n_mels=128)
speech_token = self.speech_tokenizer_session.run(None,
{self.speech_tokenizer_session.get_inputs()[0].name:
feat.detach().cpu().numpy(),
self.speech_tokenizer_session.get_inputs()[1].name:
np.array([feat.shape[2]], dtype=np.int32)})[0].flatten().tolist()
speech_token = torch.tensor([speech_token], dtype=torch.int32).to(self.device)
speech_token_len = torch.tensor([speech_token.shape[1]], dtype=torch.int32).to(self.device)
return speech_token, speech_token_len
def _extract_spk_embedding(self, prompt_wav):
speech = load_wav(prompt_wav, 16000)
feat = kaldi.fbank(speech,
num_mel_bins=80,
dither=0,
sample_frequency=16000)
feat = feat - feat.mean(dim=0, keepdim=True)
embedding = self.campplus_session.run(None,
{self.campplus_session.get_inputs()[0].name: feat.unsqueeze(dim=0).cpu().numpy()})[0].flatten().tolist()
embedding = torch.tensor([embedding]).to(self.device)
return embedding
def _extract_speech_feat(self, prompt_wav):
speech = load_wav(prompt_wav, 24000)
speech_feat = self.feat_extractor(speech).squeeze(dim=0).transpose(0, 1).to(self.device)
speech_feat = speech_feat.unsqueeze(dim=0)
speech_feat_len = torch.tensor([speech_feat.shape[1]], dtype=torch.int32).to(self.device)
return speech_feat, speech_feat_len
def text_normalize(self, text, split=True, text_frontend=True):
if isinstance(text, Generator):
logging.info('get tts_text generator, will skip text_normalize!')
return [text]
# NOTE skip text_frontend when ssml symbol in text
if '<|' in text and '|>' in text:
text_frontend = False
if text_frontend is False or text == '':
return [text] if split is True else text
text = text.strip()
if self.text_frontend == 'ttsfrd':
texts = [i["text"] for i in json.loads(self.frd.do_voicegen_frd(text))["sentences"]]
text = ''.join(texts)
else:
if contains_chinese(text):
if self.text_frontend == 'wetext':
text = self.zh_tn_model.normalize(text)
text = text.replace("\n", "")
text = replace_blank(text)
text = replace_corner_mark(text)
text = text.replace(".", "")
text = text.replace(" - ", "")
text = remove_bracket(text)
text = re.sub(r'[,、]+$', '', text)
texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "zh", token_max_n=80,
token_min_n=60, merge_len=20, comma_split=False))
else:
if self.text_frontend == 'wetext':
text = self.en_tn_model.normalize(text)
text = spell_out_number(text, self.inflect_parser)
texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "en", token_max_n=80,
token_min_n=60, merge_len=20, comma_split=False))
texts = [i for i in texts if not is_only_punctuation(i)]
return texts if split is True else text
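`text_normalize` disables the text frontend whenever the input already carries `<|...|>` control tokens (e.g. `<|endofprompt|>`), so normalization cannot mangle them. Pulled out as a standalone helper (the function name is illustrative, not part of CosyVoice), the guard is just:

```python
def should_skip_frontend(text):
    """Mirror the SSML-style guard in text_normalize: any <|...|>
    marker disables text normalization for the whole input."""
    return '<|' in text and '|>' in text

print(should_skip_frontend('你好<|endofprompt|>hello'))  # True
print(should_skip_frontend('plain text'))                # False
```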
def frontend_sft(self, tts_text, spk_id):
tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
embedding = self.spk2info[spk_id]['embedding']
model_input = {'text': tts_text_token, 'text_len': tts_text_token_len, 'llm_embedding': embedding, 'flow_embedding': embedding}
return model_input
def frontend_zero_shot(self, tts_text, prompt_text, prompt_wav, resample_rate, zero_shot_spk_id):
tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
if zero_shot_spk_id == '':
prompt_text_token, prompt_text_token_len = self._extract_text_token(prompt_text)
speech_feat, speech_feat_len = self._extract_speech_feat(prompt_wav)
speech_token, speech_token_len = self._extract_speech_token(prompt_wav)
if resample_rate == 24000:
# cosyvoice2: force speech_feat length to be exactly 2x speech_token length
token_len = min(int(speech_feat.shape[1] / 2), speech_token.shape[1])
speech_feat, speech_feat_len[:] = speech_feat[:, :2 * token_len], 2 * token_len
speech_token, speech_token_len[:] = speech_token[:, :token_len], token_len
embedding = self._extract_spk_embedding(prompt_wav)
model_input = {'prompt_text': prompt_text_token, 'prompt_text_len': prompt_text_token_len,
'llm_prompt_speech_token': speech_token, 'llm_prompt_speech_token_len': speech_token_len,
'flow_prompt_speech_token': speech_token, 'flow_prompt_speech_token_len': speech_token_len,
'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
'llm_embedding': embedding, 'flow_embedding': embedding}
else:
model_input = {**self.spk2info[zero_shot_spk_id]}
model_input['text'] = tts_text_token
model_input['text_len'] = tts_text_token_len
return model_input
def frontend_cross_lingual(self, tts_text, prompt_wav, resample_rate, zero_shot_spk_id):
model_input = self.frontend_zero_shot(tts_text, '', prompt_wav, resample_rate, zero_shot_spk_id)
# in cross lingual mode, we remove prompt in llm
del model_input['prompt_text']
del model_input['prompt_text_len']
del model_input['llm_prompt_speech_token']
del model_input['llm_prompt_speech_token_len']
return model_input
def frontend_instruct(self, tts_text, spk_id, instruct_text):
model_input = self.frontend_sft(tts_text, spk_id)
# in instruct mode, we remove spk_embedding in llm due to information leakage
del model_input['llm_embedding']
instruct_text_token, instruct_text_token_len = self._extract_text_token(instruct_text)
model_input['prompt_text'] = instruct_text_token
model_input['prompt_text_len'] = instruct_text_token_len
return model_input
def frontend_instruct2(self, tts_text, instruct_text, prompt_wav, resample_rate, zero_shot_spk_id):
model_input = self.frontend_zero_shot(tts_text, instruct_text, prompt_wav, resample_rate, zero_shot_spk_id)
del model_input['llm_prompt_speech_token']
del model_input['llm_prompt_speech_token_len']
return model_input
def frontend_vc(self, source_speech_16k, prompt_wav, resample_rate):
prompt_speech_token, prompt_speech_token_len = self._extract_speech_token(prompt_wav)
prompt_speech_feat, prompt_speech_feat_len = self._extract_speech_feat(prompt_wav)
embedding = self._extract_spk_embedding(prompt_wav)
source_speech_token, source_speech_token_len = self._extract_speech_token(source_speech_16k)
model_input = {'source_speech_token': source_speech_token, 'source_speech_token_len': source_speech_token_len,
'flow_prompt_speech_token': prompt_speech_token, 'flow_prompt_speech_token_len': prompt_speech_token_len,
'prompt_speech_feat': prompt_speech_feat, 'prompt_speech_feat_len': prompt_speech_feat_len,
'flow_embedding': embedding}
return model_input
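At a 24 kHz resample rate, the zero-shot frontend above trims the prompt so the mel-feature sequence is exactly twice as long as the speech-token sequence. With plain lists standing in for the tensors, the trimming rule is:

```python
def align_prompt(speech_feat, speech_token):
    """Trim mel frames and speech tokens so that
    len(feat) == 2 * len(token), mirroring the 24 kHz branch
    of frontend_zero_shot (list sketch, not the tensor code)."""
    token_len = min(len(speech_feat) // 2, len(speech_token))
    return speech_feat[:2 * token_len], speech_token[:token_len]

feat, token = align_prompt(list(range(101)), list(range(49)))
print(len(feat), len(token))  # 98 49
```

Taking the `min` of the two candidate lengths guarantees both sequences can be cut to the shared ratio regardless of which one is longer.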

View File

@@ -0,0 +1,450 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
# 2025 Alibaba Inc (authors: Xiang Lyu, Bofan Zhou)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from typing import Generator
import torch
import numpy as np
import threading
import time
from torch.nn import functional as F
from contextlib import nullcontext
import uuid
from cosyvoice.utils.common import fade_in_out
from cosyvoice.utils.file_utils import convert_onnx_to_trt, export_cosyvoice2_vllm
from cosyvoice.utils.common import TrtContextWrapper
class CosyVoiceModel:
def __init__(self,
llm: torch.nn.Module,
flow: torch.nn.Module,
hift: torch.nn.Module,
fp16: bool = False):
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.llm = llm
self.flow = flow
self.hift = hift
self.fp16 = fp16
self.token_min_hop_len = 2 * self.flow.input_frame_rate
self.token_max_hop_len = 4 * self.flow.input_frame_rate
self.token_overlap_len = 20
# mel fade in out
self.mel_overlap_len = int(self.token_overlap_len / self.flow.input_frame_rate * 22050 / 256)
self.mel_window = np.hamming(2 * self.mel_overlap_len)
# hift cache
self.mel_cache_len = 20
self.source_cache_len = int(self.mel_cache_len * 256)
# speech fade in out
self.speech_window = np.hamming(2 * self.source_cache_len)
# rtf and decoding related
self.stream_scale_factor = 1
assert self.stream_scale_factor >= 1, 'stream_scale_factor should be greater than or equal to 1, change it according to your actual rtf'
self.llm_context = torch.cuda.stream(torch.cuda.Stream(self.device)) if torch.cuda.is_available() else nullcontext()
self.lock = threading.Lock()
# dict used to store session related variable
self.tts_speech_token_dict = {}
self.llm_end_dict = {}
self.mel_overlap_dict = {}
self.flow_cache_dict = {}
self.hift_cache_dict = {}
self.silent_tokens = []
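The `np.hamming(2 * overlap)` windows initialized above are split in half to crossfade consecutive streamed chunks: the rising first half weights the incoming chunk and the falling second half weights the outgoing one. A sketch of that presumed blend (the `crossfade` helper is illustrative, not the actual `fade_in_out` implementation):

```python
import numpy as np

overlap = 8
window = np.hamming(2 * overlap)
fade_in = window[:overlap]    # rising half: weights the incoming chunk
fade_out = window[overlap:]   # falling half: weights the outgoing chunk

def crossfade(prev_tail, next_head):
    """Blend an overlap region: the next chunk fades in while the
    previous chunk fades out under mirror-image window halves."""
    return next_head * fade_in + prev_tail * fade_out

# The Hamming window is symmetric, so the two halves mirror each other.
print(np.allclose(fade_in[::-1], fade_out))  # True
```

The mirror symmetry is what makes the transition smooth: wherever the outgoing chunk's weight drops, the incoming chunk's weight rises correspondingly.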
def load(self, llm_model, flow_model, hift_model):
self.llm.load_state_dict(torch.load(llm_model, map_location=self.device, weights_only=True), strict=True)
self.llm.to(self.device).eval()
self.flow.load_state_dict(torch.load(flow_model, map_location=self.device, weights_only=True), strict=True)
self.flow.to(self.device).eval()
# in case hift_model is a hifigan model
hift_state_dict = {k.replace('generator.', ''): v for k, v in torch.load(hift_model, map_location=self.device, weights_only=True).items()}
self.hift.load_state_dict(hift_state_dict, strict=True)
self.hift.to(self.device).eval()
def load_jit(self, llm_text_encoder_model, llm_llm_model, flow_encoder_model):
llm_text_encoder = torch.jit.load(llm_text_encoder_model, map_location=self.device)
self.llm.text_encoder = llm_text_encoder
llm_llm = torch.jit.load(llm_llm_model, map_location=self.device)
self.llm.llm = llm_llm
flow_encoder = torch.jit.load(flow_encoder_model, map_location=self.device)
self.flow.encoder = flow_encoder
def load_trt(self, flow_decoder_estimator_model, flow_decoder_onnx_model, trt_concurrent, fp16):
assert torch.cuda.is_available(), 'tensorrt only supports gpu!'
if not os.path.exists(flow_decoder_estimator_model) or os.path.getsize(flow_decoder_estimator_model) == 0:
convert_onnx_to_trt(flow_decoder_estimator_model, self.get_trt_kwargs(), flow_decoder_onnx_model, fp16)
del self.flow.decoder.estimator
import tensorrt as trt
with open(flow_decoder_estimator_model, 'rb') as f:
estimator_engine = trt.Runtime(trt.Logger(trt.Logger.INFO)).deserialize_cuda_engine(f.read())
assert estimator_engine is not None, 'failed to load trt {}'.format(flow_decoder_estimator_model)
self.flow.decoder.estimator = TrtContextWrapper(estimator_engine, trt_concurrent=trt_concurrent, device=self.device)
def get_trt_kwargs(self):
min_shape = [(2, 80, 4), (2, 1, 4), (2, 80, 4), (2, 80, 4)]
opt_shape = [(2, 80, 500), (2, 1, 500), (2, 80, 500), (2, 80, 500)]
max_shape = [(2, 80, 3000), (2, 1, 3000), (2, 80, 3000), (2, 80, 3000)]
input_names = ["x", "mask", "mu", "cond"]
return {'min_shape': min_shape, 'opt_shape': opt_shape, 'max_shape': max_shape, 'input_names': input_names}
def llm_job(self, text, prompt_text, llm_prompt_speech_token, llm_embedding, uuid):
cur_silent_token_num, max_silent_token_num = 0, 5
with self.llm_context, torch.cuda.amp.autocast(self.fp16 is True and hasattr(self.llm, 'vllm') is False):
if isinstance(text, Generator):
assert (self.__class__.__name__ != 'CosyVoiceModel') and not hasattr(self.llm, 'vllm'), 'streaming input text is only implemented for CosyVoice2/3 and does not support vllm!'
token_generator = self.llm.inference_bistream(text=text,
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device))
else:
token_generator = self.llm.inference(text=text.to(self.device),
text_len=torch.tensor([text.shape[1]], dtype=torch.int32).to(self.device),
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device),
uuid=uuid)
for i in token_generator:
if i in self.silent_tokens:
cur_silent_token_num += 1
if cur_silent_token_num > max_silent_token_num:
continue
else:
cur_silent_token_num = 0
self.tts_speech_token_dict[uuid].append(i)
self.llm_end_dict[uuid] = True
def vc_job(self, source_speech_token, uuid):
self.tts_speech_token_dict[uuid] = source_speech_token.flatten().tolist()
self.llm_end_dict[uuid] = True
def token2wav(self, token, prompt_token, prompt_feat, embedding, uuid, finalize=False, speed=1.0):
with torch.cuda.amp.autocast(self.fp16):
tts_mel, self.flow_cache_dict[uuid] = self.flow.inference(token=token.to(self.device, dtype=torch.int32),
token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
prompt_token=prompt_token.to(self.device),
prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(self.device),
prompt_feat=prompt_feat.to(self.device),
prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(self.device),
embedding=embedding.to(self.device),
flow_cache=self.flow_cache_dict[uuid])
# mel overlap fade in out
if self.mel_overlap_dict[uuid].shape[2] != 0:
tts_mel = fade_in_out(tts_mel, self.mel_overlap_dict[uuid], self.mel_window)
# append hift cache
if self.hift_cache_dict[uuid] is not None:
hift_cache_mel, hift_cache_source = self.hift_cache_dict[uuid]['mel'], self.hift_cache_dict[uuid]['source']
tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
else:
hift_cache_source = torch.zeros(1, 1, 0)
# keep overlap mel and hift cache
if finalize is False:
self.mel_overlap_dict[uuid] = tts_mel[:, :, -self.mel_overlap_len:]
tts_mel = tts_mel[:, :, :-self.mel_overlap_len]
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
self.hift_cache_dict[uuid] = {'mel': tts_mel[:, :, -self.mel_cache_len:],
'source': tts_source[:, :, -self.source_cache_len:],
'speech': tts_speech[:, -self.source_cache_len:]}
tts_speech = tts_speech[:, :-self.source_cache_len]
else:
if speed != 1.0:
assert self.hift_cache_dict[uuid] is None, 'speed change only supports non-streaming inference mode'
tts_mel = F.interpolate(tts_mel, size=int(tts_mel.shape[2] / speed), mode='linear')
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
return tts_speech
def tts(self, text=torch.zeros(1, 0, dtype=torch.int32), flow_embedding=torch.zeros(0, 192), llm_embedding=torch.zeros(0, 192),
prompt_text=torch.zeros(1, 0, dtype=torch.int32),
llm_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
flow_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
prompt_speech_feat=torch.zeros(1, 0, 80), source_speech_token=torch.zeros(1, 0, dtype=torch.int32), stream=False, speed=1.0, **kwargs):
# this_uuid is used to track variables related to this inference thread
this_uuid = str(uuid.uuid1())
with self.lock:
self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = [], False
self.hift_cache_dict[this_uuid] = None
self.mel_overlap_dict[this_uuid] = torch.zeros(1, 80, 0)
self.flow_cache_dict[this_uuid] = torch.zeros(1, 80, 0, 2)
if source_speech_token.shape[1] == 0:
p = threading.Thread(target=self.llm_job, args=(text, prompt_text, llm_prompt_speech_token, llm_embedding, this_uuid))
else:
p = threading.Thread(target=self.vc_job, args=(source_speech_token, this_uuid))
p.start()
if stream is True:
token_hop_len = self.token_min_hop_len
while True:
time.sleep(0.1)
if len(self.tts_speech_token_dict[this_uuid]) >= token_hop_len + self.token_overlap_len:
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_hop_len + self.token_overlap_len]) \
.unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=False)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict[this_uuid] = self.tts_speech_token_dict[this_uuid][token_hop_len:]
# increase token_hop_len for better speech quality
token_hop_len = min(self.token_max_hop_len, int(token_hop_len * self.stream_scale_factor))
if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) < token_hop_len + self.token_overlap_len:
break
p.join()
# deal with remaining tokens, make sure the remaining token length equals token_hop_len when cache_speech is not None
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=True)
yield {'tts_speech': this_tts_speech.cpu()}
else:
# deal with all tokens
p.join()
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=True,
speed=speed)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict.pop(this_uuid)
self.llm_end_dict.pop(this_uuid)
self.mel_overlap_dict.pop(this_uuid)
self.hift_cache_dict.pop(this_uuid)
self.flow_cache_dict.pop(this_uuid)
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.current_stream().synchronize()
class CosyVoice2Model(CosyVoiceModel):
def __init__(self,
llm: torch.nn.Module,
flow: torch.nn.Module,
hift: torch.nn.Module,
fp16: bool = False):
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.llm = llm
self.flow = flow
self.hift = hift
self.fp16 = fp16
# NOTE must match the static_chunk_size used in training
self.token_hop_len = 25
# NOTE increase token_hop_len incrementally to avoid duplicate inference
self.token_max_hop_len = 4 * self.token_hop_len
self.stream_scale_factor = 2
assert self.stream_scale_factor >= 1, 'stream_scale_factor should be no less than 1, change it according to your actual rtf'
# hift cache
self.mel_cache_len = 8
self.source_cache_len = int(self.mel_cache_len * 480)
# speech fade in out
self.speech_window = np.hamming(2 * self.source_cache_len)
# rtf and decoding related
self.llm_context = torch.cuda.stream(torch.cuda.Stream(self.device)) if torch.cuda.is_available() else nullcontext()
self.lock = threading.Lock()
# dicts used to store session-related variables
self.tts_speech_token_dict = {}
self.llm_end_dict = {}
self.hift_cache_dict = {}
self.silent_tokens = []
def load_jit(self, flow_encoder_model):
flow_encoder = torch.jit.load(flow_encoder_model, map_location=self.device)
self.flow.encoder = flow_encoder
def load_vllm(self, model_dir):
export_cosyvoice2_vllm(self.llm, model_dir, self.device)
from vllm import EngineArgs, LLMEngine
engine_args = EngineArgs(model=model_dir,
skip_tokenizer_init=True,
enable_prompt_embeds=True,
gpu_memory_utilization=0.2)
self.llm.vllm = LLMEngine.from_engine_args(engine_args)
self.llm.lock = threading.Lock()
del self.llm.llm.model.model.layers
def token2wav(self, token, prompt_token, prompt_feat, embedding, token_offset, uuid, stream=False, finalize=False, speed=1.0):
with torch.cuda.amp.autocast(self.fp16):
tts_mel, _ = self.flow.inference(token=token.to(self.device, dtype=torch.int32),
token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
prompt_token=prompt_token.to(self.device),
prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(self.device),
prompt_feat=prompt_feat.to(self.device),
prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(self.device),
embedding=embedding.to(self.device),
streaming=stream,
finalize=finalize)
tts_mel = tts_mel[:, :, token_offset * self.flow.token_mel_ratio:]
# append hift cache
if self.hift_cache_dict[uuid] is not None:
hift_cache_mel, hift_cache_source = self.hift_cache_dict[uuid]['mel'], self.hift_cache_dict[uuid]['source']
tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
else:
hift_cache_source = torch.zeros(1, 1, 0)
# keep overlap mel and hift cache
if finalize is False:
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
self.hift_cache_dict[uuid] = {'mel': tts_mel[:, :, -self.mel_cache_len:],
'source': tts_source[:, :, -self.source_cache_len:],
'speech': tts_speech[:, -self.source_cache_len:]}
tts_speech = tts_speech[:, :-self.source_cache_len]
else:
if speed != 1.0:
assert self.hift_cache_dict[uuid] is None, 'speed change only supports non-streaming inference mode'
tts_mel = F.interpolate(tts_mel, size=int(tts_mel.shape[2] / speed), mode='linear')
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
return tts_speech
def tts(self, text=torch.zeros(1, 0, dtype=torch.int32), flow_embedding=torch.zeros(0, 192), llm_embedding=torch.zeros(0, 192),
prompt_text=torch.zeros(1, 0, dtype=torch.int32),
llm_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
flow_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
prompt_speech_feat=torch.zeros(1, 0, 80), source_speech_token=torch.zeros(1, 0, dtype=torch.int32), stream=False, speed=1.0, **kwargs):
# this_uuid is used to track variables related to this inference thread
this_uuid = str(uuid.uuid1())
with self.lock:
self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = [], False
self.hift_cache_dict[this_uuid] = None
if source_speech_token.shape[1] == 0:
p = threading.Thread(target=self.llm_job, args=(text, prompt_text, llm_prompt_speech_token, llm_embedding, this_uuid))
else:
p = threading.Thread(target=self.vc_job, args=(source_speech_token, this_uuid))
p.start()
if stream is True:
token_offset = 0
prompt_token_pad = int(np.ceil(flow_prompt_speech_token.shape[1] / self.token_hop_len) * self.token_hop_len - flow_prompt_speech_token.shape[1])
while True:
time.sleep(0.1)
this_token_hop_len = self.token_hop_len + prompt_token_pad if token_offset == 0 else self.token_hop_len
if len(self.tts_speech_token_dict[this_uuid]) - token_offset >= this_token_hop_len + self.flow.pre_lookahead_len:
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_offset + this_token_hop_len + self.flow.pre_lookahead_len]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
token_offset=token_offset,
uuid=this_uuid,
stream=stream,
finalize=False)
token_offset += this_token_hop_len
self.token_hop_len = min(self.token_max_hop_len, self.token_hop_len * self.stream_scale_factor)
yield {'tts_speech': this_tts_speech.cpu()}
if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) - token_offset < this_token_hop_len + self.flow.pre_lookahead_len:
break
p.join()
# deal with remaining tokens, make sure the remaining token length equals token_hop_len when cache_speech is not None
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
token_offset=token_offset,
uuid=this_uuid,
finalize=True)
yield {'tts_speech': this_tts_speech.cpu()}
else:
# deal with all tokens
p.join()
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
token_offset=0,
uuid=this_uuid,
finalize=True,
speed=speed)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict.pop(this_uuid)
self.llm_end_dict.pop(this_uuid)
self.hift_cache_dict.pop(this_uuid)
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.current_stream().synchronize()
class CosyVoice3Model(CosyVoice2Model):
def __init__(self,
llm: torch.nn.Module,
flow: torch.nn.Module,
hift: torch.nn.Module,
fp16: bool = False):
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.llm = llm
self.flow = flow
self.hift = hift
self.fp16 = fp16
# NOTE must match the static_chunk_size used in training
self.token_hop_len = 25
# NOTE increase token_hop_len incrementally to avoid duplicate inference
self.token_max_hop_len = 4 * self.token_hop_len
self.stream_scale_factor = 2
assert self.stream_scale_factor >= 1, 'stream_scale_factor should be no less than 1, change it according to your actual rtf'
# rtf and decoding related
self.llm_context = torch.cuda.stream(torch.cuda.Stream(self.device)) if torch.cuda.is_available() else nullcontext()
self.lock = threading.Lock()
# dicts used to store session-related variables
self.tts_speech_token_dict = {}
self.llm_end_dict = {}
self.hift_cache_dict = {}
# FSQ silent and breath token
self.silent_tokens = [1, 2, 28, 29, 55, 248, 494, 2241, 2242, 2322, 2323]
def token2wav(self, token, prompt_token, prompt_feat, embedding, token_offset, uuid, stream=False, finalize=False, speed=1.0):
with torch.cuda.amp.autocast(self.fp16):
tts_mel, _ = self.flow.inference(token=token.to(self.device, dtype=torch.int32),
token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
prompt_token=prompt_token.to(self.device),
prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(self.device),
prompt_feat=prompt_feat.to(self.device),
prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(self.device),
embedding=embedding.to(self.device),
streaming=stream,
finalize=finalize)
tts_mel = tts_mel[:, :, token_offset * self.flow.token_mel_ratio:]
# append mel cache
if self.hift_cache_dict[uuid] is not None:
hift_cache_mel = self.hift_cache_dict[uuid]['mel']
tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
self.hift_cache_dict[uuid]['mel'] = tts_mel
else:
self.hift_cache_dict[uuid] = {'mel': tts_mel, 'speech_offset': 0}
if speed != 1.0:
assert token_offset == 0 and finalize is True, 'speed change only supports non-streaming inference mode'
tts_mel = F.interpolate(tts_mel, size=int(tts_mel.shape[2] / speed), mode='linear')
tts_speech, _ = self.hift.inference(speech_feat=tts_mel, finalize=finalize)
tts_speech = tts_speech[:, self.hift_cache_dict[uuid]['speech_offset']:]
self.hift_cache_dict[uuid]['speech_offset'] += tts_speech.shape[1]
return tts_speech
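The streaming `token2wav` paths above stitch consecutive chunks together by crossfading the cached tail of the previous chunk with the head of the new one through a Hamming window (`self.mel_window` / `self.speech_window`). The numpy sketch below illustrates that overlap-add step in isolation; `crossfade` is an illustrative stand-in consistent with how `fade_in_out` from `cosyvoice.utils.common` is used here, not the library function itself.

```python
import numpy as np

def crossfade(new_chunk, cached_tail, window):
    # For a window of length 2*ov, window[:ov] ramps up (fade-in for the new
    # chunk) while window[ov:] ramps down (fade-out for the cached tail).
    ov = cached_tail.shape[-1]
    out = new_chunk.copy()
    out[..., :ov] = new_chunk[..., :ov] * window[:ov] + cached_tail * window[ov:]
    return out

ov = 4
window = np.hamming(2 * ov)      # same window family as self.speech_window
prev_tail = np.ones(ov)          # cached tail of the previous speech chunk
chunk = np.zeros(8)              # head of the next chunk (silence here)
out = crossfade(chunk, prev_tail, window)
# the cached tail decays smoothly into the new chunk instead of clicking
```

Because the cached region is `source_cache_len` samples long, each streamed chunk also trims that many samples off its own tail (`tts_speech[:, :-self.source_cache_len]`) so the next call can redo the junction with fresh context.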


@@ -0,0 +1,155 @@
# Copyright (c) 2021 Mobvoi Inc. (authors: Binbin Zhang)
# 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import math
from functools import partial
import torch
import torch.distributed as dist
from torch.utils.data import IterableDataset
from cosyvoice.utils.file_utils import read_lists
class Processor(IterableDataset):
def __init__(self, source, f, *args, **kw):
assert callable(f)
self.source = source
self.f = f
self.args = args
self.kw = kw
def set_epoch(self, epoch):
self.source.set_epoch(epoch)
def __iter__(self):
""" Return an iterator over the source dataset processed by the
given processor.
"""
assert self.source is not None
assert callable(self.f)
return self.f(iter(self.source), *self.args, **self.kw)
def apply(self, f):
assert callable(f)
return Processor(self, f, *self.args, **self.kw)
class DistributedSampler:
def __init__(self, shuffle=True, partition=True):
self.epoch = -1
self.update()
self.shuffle = shuffle
self.partition = partition
def update(self):
assert dist.is_available()
if dist.is_initialized():
self.rank = dist.get_rank()
self.world_size = dist.get_world_size()
else:
self.rank = 0
self.world_size = 1
worker_info = torch.utils.data.get_worker_info()
if worker_info is None:
self.worker_id = 0
self.num_workers = 1
else:
self.worker_id = worker_info.id
self.num_workers = worker_info.num_workers
return dict(rank=self.rank,
world_size=self.world_size,
worker_id=self.worker_id,
num_workers=self.num_workers)
def set_epoch(self, epoch):
self.epoch = epoch
def sample(self, data):
""" Sample data according to rank/world_size/num_workers
Args:
data(List): input data list
Returns:
List: data list after sample
"""
data = list(range(len(data)))
# pad the data list so it divides evenly across ranks and workers
if self.partition:
if self.shuffle:
random.Random(self.epoch).shuffle(data)
if len(data) < self.world_size:
data = data * math.ceil(self.world_size / len(data))
data = data[:self.world_size]
data = data[self.rank::self.world_size]
if len(data) < self.num_workers:
data = data * math.ceil(self.num_workers / len(data))
data = data[:self.num_workers]
data = data[self.worker_id::self.num_workers]
return data
class DataList(IterableDataset):
def __init__(self, lists, shuffle=True, partition=True):
self.lists = lists
self.sampler = DistributedSampler(shuffle, partition)
def set_epoch(self, epoch):
self.sampler.set_epoch(epoch)
def __iter__(self):
sampler_info = self.sampler.update()
indexes = self.sampler.sample(self.lists)
for index in indexes:
data = dict(src=self.lists[index])
data.update(sampler_info)
yield data
def Dataset(data_list_file,
data_pipeline,
mode='train',
gan=False,
dpo=False,
shuffle=True,
partition=True):
""" Construct dataset from arguments
We have two shuffle stages in the Dataset. The first is a global
shuffle at the shard/raw file level. The second is a buffered local
shuffle at the training-sample level.
Args:
data_list_file(str): shard/raw list file
data_pipeline(List): processing functions applied in order
mode(str): train/inference mode
gan(bool): whether the dataset is used for gan training
dpo(bool): whether to keep reject samples for dpo training
shuffle(bool): whether to shuffle the shard list
partition(bool): whether to do data partition in terms of rank
"""
lists = read_lists(data_list_file)
dataset = DataList(lists,
shuffle=shuffle,
partition=partition)
# map partial arg to padding func
for i in range(1, len(data_pipeline)):
if data_pipeline[i].func.__name__ == 'compute_fbank' and gan is True:
data_pipeline[i] = partial(data_pipeline[i], token_mel_ratio=0)
if data_pipeline[i].func.__name__ == 'padding':
data_pipeline[i] = partial(data_pipeline[i], gan=gan, dpo=dpo)
for func in data_pipeline:
dataset = Processor(dataset, func, mode=mode)
return dataset
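`Dataset` builds its pipeline by wrapping the source in one `Processor` per stage, and each stage is just a function that takes an iterator and yields transformed samples, so the whole chain stays lazy. A minimal self-contained sketch of that chaining pattern (the stage names `upper_stage` and `suffix_stage` are illustrative, not from the codebase):

```python
class MiniProcessor:
    """Simplified stand-in for the Processor class above."""
    def __init__(self, source, f):
        self.source = source
        self.f = f

    def __iter__(self):
        # lazy: each stage pulls samples from the previous one on demand
        return self.f(iter(self.source))

def upper_stage(samples):
    for s in samples:
        yield s.upper()

def suffix_stage(samples):
    for s in samples:
        yield s + '!'

# same wiring as the `for func in data_pipeline` loop in Dataset()
dataset = ['a', 'b']
for func in [upper_stage, suffix_stage]:
    dataset = MiniProcessor(dataset, func)
result = list(dataset)  # ['A!', 'B!']
```

Because every stage is a generator over the previous one, nothing is materialized until iteration starts, which is what lets the real pipeline stream parquet shards without loading them all into memory.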


@@ -0,0 +1,431 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import random
import pyarrow.parquet as pq
from io import BytesIO
import numpy as np
import whisper
import torch
import torchaudio
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
import pyworld as pw
from cosyvoice.utils.onnx import embedding_extractor, online_feature
AUDIO_FORMAT_SETS = {'flac', 'mp3', 'm4a', 'ogg', 'opus', 'wav', 'wma'}
def parquet_opener(data, mode='train'):
""" Given a url or local parquet file, yield samples row by row.
Inplace operation.
Args:
data(Iterable[str]): url or local file list
Returns:
Iterable[dict]: sample dicts with the parquet row fields merged in
"""
for sample in data:
assert 'src' in sample
url = sample['src']
try:
for df in pq.ParquetFile(url).iter_batches(batch_size=64):
df = df.to_pandas()
for i in range(len(df)):
sample.update(dict(df.loc[i]))
# NOTE do not return sample directly, must initialize a new dict
yield {**sample}
except Exception as ex:
logging.warning('Failed to open {}, ex info {}'.format(url, ex))
def filter(data,
max_length=10240,
min_length=10,
token_max_length=200,
token_min_length=1,
min_output_input_ratio=0.0005,
max_output_input_ratio=1,
mode='train'):
""" Filter sample according to feature and label length
Inplace operation.
Args:
data: Iterable[{key, wav, label, sample_rate}]
max_length: drop utterance which is greater than max_length (10ms)
min_length: drop utterance which is less than min_length (10ms)
token_max_length: drop utterance which is greater than
token_max_length, especially when using char units for
english modeling
token_min_length: drop utterance which is less than
token_min_length
min_output_input_ratio: minimal ratio of
token_length / feats_length (10ms)
max_output_input_ratio: maximum ratio of
token_length / feats_length (10ms)
Returns:
Iterable[{key, wav, label, sample_rate}]
"""
for sample in data:
sample['speech'], sample['sample_rate'] = torchaudio.load(BytesIO(sample['audio_data']))
sample['speech'] = sample['speech'].mean(dim=0, keepdim=True)
del sample['audio_data']
# sample['speech'] is a torch.Tensor; the feature rate is 100 frames per second
num_frames = sample['speech'].size(1) / sample['sample_rate'] * 100
if num_frames < min_length:
continue
if num_frames > max_length:
continue
if len(sample['text_token']) < token_min_length:
continue
if len(sample['text_token']) > token_max_length:
continue
if online_feature is False and len(sample['speech_token']) == 0:
continue
if online_feature is False and 'reject_speech_token' in sample and len(sample['reject_speech_token']) == 0:
continue
if num_frames != 0:
if len(sample['text_token']) / num_frames < min_output_input_ratio:
continue
if len(sample['text_token']) / num_frames > max_output_input_ratio:
continue
yield sample
def resample(data, resample_rate=22050, min_sample_rate=16000, mode='train'):
""" Resample data.
Inplace operation.
Args:
data: Iterable[{key, wav, label, sample_rate}]
resample_rate: target resample rate
Returns:
Iterable[{key, wav, label, sample_rate}]
"""
for sample in data:
assert 'sample_rate' in sample
assert 'speech' in sample
sample_rate = sample['sample_rate']
waveform = sample['speech']
if sample_rate != resample_rate:
if sample_rate < min_sample_rate:
continue
sample['sample_rate'] = resample_rate
sample['speech'] = torchaudio.transforms.Resample(
orig_freq=sample_rate, new_freq=resample_rate)(waveform)
max_val = sample['speech'].abs().max()
if max_val > 1:
sample['speech'] /= max_val
yield sample
def truncate(data, truncate_length=24576, mode='train'):
""" Truncate data.
Args:
data: Iterable[{key, wav, label, sample_rate}]
truncate_length: truncate length
Returns:
Iterable[{key, wav, label, sample_rate}]
"""
for sample in data:
waveform = sample['speech']
if waveform.shape[1] > truncate_length:
start = random.randint(0, waveform.shape[1] - truncate_length)
waveform = waveform[:, start: start + truncate_length]
else:
waveform = torch.concat([waveform, torch.zeros(1, truncate_length - waveform.shape[1])], dim=1)
sample['speech'] = waveform
yield sample
def compute_fbank(data,
feat_extractor,
num_frames=-1,
mode='train'):
""" Extract fbank
Args:
data: Iterable[{key, wav, label, sample_rate}]
Returns:
Iterable[{key, feat, label}]
"""
for sample in data:
assert 'sample_rate' in sample
assert 'speech' in sample
assert 'utt' in sample
assert 'text_token' in sample
# NOTE in cosyvoice2/3, we support online token extraction, so we need to align speech to 25 Hz first
if num_frames != -1:
index = int(np.ceil(sample['speech'].shape[1] / num_frames))
sample['speech'] = torch.concat([sample['speech'], torch.zeros(1, index * num_frames - sample['speech'].shape[1])], dim=1)
sample['speech_feat'] = feat_extractor(sample['speech']).squeeze(dim=0).transpose(0, 1)
yield sample
def compute_whisper_fbank(data, num_frames=-1, mode='train'):
""" Extract whisper fbank
Args:
data: Iterable[{key, wav, label, sample_rate}]
Returns:
Iterable[{key, feat, label}]
"""
for sample in data:
if num_frames != -1:
assert sample['speech'].shape[1] % num_frames == 0, 'speech length is not aligned with speech_token'
sample['speech_16k'] = torchaudio.transforms.Resample(orig_freq=sample['sample_rate'], new_freq=16000)(sample['speech'])
sample['whisper_feat'] = whisper.log_mel_spectrogram(sample['speech_16k'], n_mels=128).squeeze(dim=0).transpose(0, 1)
yield sample
def compute_f0(data, sample_rate, hop_size, mode='train'):
""" Extract f0
Args:
data: Iterable[{key, wav, label, sample_rate}]
Returns:
Iterable[{key, feat, label}]
"""
frame_period = hop_size * 1000 / sample_rate
for sample in data:
assert 'sample_rate' in sample
assert 'speech' in sample
assert 'utt' in sample
assert 'text_token' in sample
waveform = sample['speech']
_f0, t = pw.harvest(waveform.squeeze(dim=0).numpy().astype('double'), sample_rate, frame_period=frame_period)
if sum(_f0 != 0) < 5: # this happens when the algorithm fails
_f0, t = pw.dio(waveform.squeeze(dim=0).numpy().astype('double'), sample_rate, frame_period=frame_period) # if harvest fails, try dio
f0 = pw.stonemask(waveform.squeeze(dim=0).numpy().astype('double'), _f0, t, sample_rate)
f0 = F.interpolate(torch.from_numpy(f0).view(1, 1, -1), size=sample['speech_feat'].shape[0], mode='linear').view(-1)
sample['pitch_feat'] = f0
yield sample
def parse_embedding(data, normalize, mode='train'):
""" Parse utt_embedding/spk_embedding
Args:
data: Iterable[{key, wav, label, sample_rate}]
Returns:
Iterable[{key, feat, label}]
"""
for sample in data:
if 'utt_embedding' not in sample and 'spk_embedding' not in sample:
sample['speech_16k'] = torchaudio.transforms.Resample(orig_freq=sample['sample_rate'], new_freq=16000)(sample['speech'])
embedding = embedding_extractor.inference(sample['speech_16k'])
sample['spk_embedding'] = sample['utt_embedding'] = embedding
else:
sample['utt_embedding'] = torch.tensor(sample['utt_embedding'], dtype=torch.float32)
sample['spk_embedding'] = torch.tensor(sample['spk_embedding'], dtype=torch.float32)
if normalize:
sample['utt_embedding'] = F.normalize(sample['utt_embedding'], dim=0)
sample['spk_embedding'] = F.normalize(sample['spk_embedding'], dim=0)
yield sample
def tokenize(data, get_tokenizer, allowed_special, mode='train'):
""" Decode text to chars or BPE
Inplace operation
Args:
data: Iterable[{key, wav, txt, sample_rate}]
Returns:
Iterable[{key, wav, txt, tokens, label, sample_rate}]
"""
tokenizer = get_tokenizer()
for sample in data:
assert 'text' in sample
sample['text_token'] = tokenizer.encode(sample['text'], allowed_special=allowed_special)
if 'instruct' in sample:
sample['instruct_token'] = tokenizer.encode(sample['instruct'], allowed_special=allowed_special)
yield sample
def shuffle(data, shuffle_size=10000, mode='train'):
""" Local shuffle the data
Args:
data: Iterable[{key, feat, label}]
shuffle_size: buffer size for shuffle
Returns:
Iterable[{key, feat, label}]
"""
buf = []
yield_size = int(shuffle_size / 2)
for sample in data:
buf.append(sample)
if len(buf) >= shuffle_size:
random.shuffle(buf)
for x in buf[:yield_size]:
yield x
buf = buf[yield_size:]
# The sample left over
random.shuffle(buf)
for x in buf:
yield x
def sort(data, sort_size=500, mode='train'):
""" Sort the data by feature length.
Sort is used after shuffle and before batch, so we can group
utts with similar lengths into a batch, and `sort_size` should
be less than `shuffle_size`
Args:
data: Iterable[{key, feat, label}]
sort_size: buffer size for sort
Returns:
Iterable[{key, feat, label}]
"""
buf = []
for sample in data:
buf.append(sample)
if len(buf) >= sort_size:
buf.sort(key=lambda x: x['speech_feat'].size(0))
for x in buf:
yield x
buf = []
# The sample left over
buf.sort(key=lambda x: x['speech_feat'].size(0))
for x in buf:
yield x
def static_batch(data, batch_size=16):
""" Static batch the data by `batch_size`
Args:
data: Iterable[{key, feat, label}]
batch_size: batch size
Returns:
Iterable[List[{key, feat, label}]]
"""
buf = []
for sample in data:
buf.append(sample)
if len(buf) >= batch_size:
yield buf
buf = []
if len(buf) > 0:
yield buf
def dynamic_batch(data, max_frames_in_batch=12000, mode='train'):
""" Dynamic batch the data until the total frames in batch
reach `max_frames_in_batch`
Args:
data: Iterable[{key, feat, label}]
max_frames_in_batch: max_frames in one batch
Returns:
Iterable[List[{key, feat, label}]]
"""
buf = []
longest_frames = 0
for sample in data:
assert 'speech_feat' in sample
assert isinstance(sample['speech_feat'], torch.Tensor)
new_sample_frames = sample['speech_feat'].size(0)
longest_frames = max(longest_frames, new_sample_frames)
frames_after_padding = longest_frames * (len(buf) + 1)
if frames_after_padding > max_frames_in_batch:
yield buf
buf = [sample]
longest_frames = new_sample_frames
else:
buf.append(sample)
if len(buf) > 0:
yield buf
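`dynamic_batch` caps the *padded* size of a batch: each candidate batch is charged at the longest length seen so far times the batch size, which is exactly what zero-padding will cost. The same accounting on bare lengths (hypothetical `dynamic_batch_lengths`, not part of this codebase):

```python
def dynamic_batch_lengths(lengths, max_frames_in_batch=10):
    """Group lengths so that max(batch) * len(batch) never exceeds the cap."""
    buf, longest = [], 0
    for n in lengths:
        longest = max(longest, n)
        if longest * (len(buf) + 1) > max_frames_in_batch:
            yield buf
            buf, longest = [n], n
        else:
            buf.append(n)
    if buf:
        yield buf

batches = list(dynamic_batch_lengths([3, 4, 2, 6, 1, 5], max_frames_in_batch=10))
```

Note the cap applies to the padded total, so a single long sample (the `6` here) forms a batch by itself.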
def batch(data, batch_type='static', batch_size=16, max_frames_in_batch=12000, mode='train'):
""" Wrapper for static/dynamic batch
"""
if batch_type == 'static':
return static_batch(data, batch_size)
elif batch_type == 'dynamic':
return dynamic_batch(data, max_frames_in_batch)
else:
raise ValueError('Unsupported batch type {}'.format(batch_type))
def padding(data, use_spk_embedding, mode='train', gan=False, dpo=False):
""" Padding the data into training data
Args:
data: Iterable[List[{key, feat, label}]]
Returns:
Iterable[Tuple(keys, feats, labels, feats lengths, label lengths)]
"""
for sample in data:
assert isinstance(sample, list)
order = torch.argsort(torch.tensor([x['speech'].size(1) for x in sample], dtype=torch.int32), descending=True)
batch = {}
batch['utts'] = [sample[i]['utt'] for i in order]
batch['text'] = [sample[i]['text'] for i in order]
text_token = [torch.tensor(sample[i]['text_token']) for i in order]
batch['text_token_len'] = torch.tensor([i.size(0) for i in text_token], dtype=torch.int32)
batch['text_token'] = pad_sequence(text_token, batch_first=True, padding_value=0)
speech_feat = [sample[i]['speech_feat'] for i in order]
batch['speech_feat_len'] = torch.tensor([i.size(0) for i in speech_feat], dtype=torch.int32)
batch['speech_feat'] = pad_sequence(speech_feat, batch_first=True, padding_value=0)
batch['utt_embedding'] = torch.stack([sample[i]['utt_embedding'] for i in order], dim=0)
batch['spk_embedding'] = torch.stack([sample[i]['spk_embedding'] for i in order], dim=0)
if torch.tensor(['instruct_token' in sample[i] for i in order]).all():
instruct_token = [torch.tensor(sample[i]['instruct_token']) for i in order]
batch['instruct_token_len'] = torch.tensor([i.size(0) for i in instruct_token], dtype=torch.int32)
batch['instruct_token'] = pad_sequence(instruct_token, batch_first=True, padding_value=0)
if torch.tensor(['whisper_feat' in sample[i] for i in order]).all():
whisper_feat = [sample[i]['whisper_feat'] for i in order]
batch['whisper_feat_len'] = torch.tensor([i.size(0) for i in whisper_feat], dtype=torch.int32)
batch['whisper_feat'] = pad_sequence(whisper_feat, batch_first=True, padding_value=0)
if torch.tensor(['speech_token' in sample[i] for i in order]).all():
speech_token = [torch.tensor(sample[i]['speech_token']) for i in order]
batch['speech_token_len'] = torch.tensor([i.size(0) for i in speech_token], dtype=torch.int32)
batch['speech_token'] = pad_sequence(speech_token, batch_first=True, padding_value=0)
if gan is True:
# in gan train, we need speech/pitch_feat
speech = [sample[i]['speech'].squeeze(dim=0) for i in order]
batch['speech_len'] = torch.tensor([i.size(0) for i in speech], dtype=torch.int32)
batch['speech'] = pad_sequence(speech, batch_first=True, padding_value=0)
pitch_feat = [sample[i]['pitch_feat'] for i in order]
batch['pitch_feat_len'] = torch.tensor([i.size(0) for i in pitch_feat], dtype=torch.int32)
batch['pitch_feat'] = pad_sequence(pitch_feat, batch_first=True, padding_value=0)
if dpo is True:
reject_speech_token = [torch.tensor(sample[i]['reject_speech_token']) for i in order]
batch['reject_speech_token_len'] = torch.tensor([i.size(0) for i in reject_speech_token], dtype=torch.int32)
batch['reject_speech_token'] = pad_sequence(reject_speech_token, batch_first=True, padding_value=0)
if use_spk_embedding is True:
batch["embedding"] = batch["spk_embedding"]
else:
batch["embedding"] = batch["utt_embedding"]
yield batch
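`padding` sorts each batch by length (descending) before padding, so the first row fixes the padded width and the length tensors come out monotone. The core collate logic without torch (hypothetical `pad_batch`, using `0` as the pad value like the `pad_sequence` calls above):

```python
def pad_batch(token_lists, pad_value=0):
    """Sort descending by length, then right-pad to the longest sequence."""
    # mirrors torch.argsort(..., descending=True) in the collate above
    order = sorted(range(len(token_lists)),
                   key=lambda i: len(token_lists[i]), reverse=True)
    seqs = [token_lists[i] for i in order]
    lengths = [len(s) for s in seqs]
    max_len = lengths[0]          # longest sequence comes first
    padded = [s + [pad_value] * (max_len - len(s)) for s in seqs]
    return padded, lengths, order

padded, lengths, order = pad_batch([[1, 2], [3, 4, 5, 6], [7]])
```

The returned `order` is what the real collate uses to gather every other field (`utts`, `text`, embeddings) consistently.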

View File

@@ -0,0 +1,176 @@
"""
ein notation:
b - batch
n - sequence
nt - text sequence
nw - raw wave length
d - dimension
"""
from __future__ import annotations
import torch
from torch import nn
import torch.nn.functional as F
from einops import repeat
from x_transformers.x_transformers import RotaryEmbedding
from cosyvoice.utils.mask import add_optional_chunk_mask
from cosyvoice.flow.DiT.modules import (
TimestepEmbedding,
ConvNeXtV2Block,
CausalConvPositionEmbedding,
DiTBlock,
AdaLayerNormZero_Final,
precompute_freqs_cis,
get_pos_embed_indices,
)
# Text embedding
class TextEmbedding(nn.Module):
def __init__(self, text_num_embeds, text_dim, conv_layers=0, conv_mult=2):
super().__init__()
self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
if conv_layers > 0:
self.extra_modeling = True
self.precompute_max_pos = 4096 # ~44s of 24khz audio
self.register_buffer("freqs_cis", precompute_freqs_cis(text_dim, self.precompute_max_pos), persistent=False)
self.text_blocks = nn.Sequential(
*[ConvNeXtV2Block(text_dim, text_dim * conv_mult) for _ in range(conv_layers)]
)
else:
self.extra_modeling = False
def forward(self, text: int["b nt"], seq_len, drop_text=False): # noqa: F722
batch, text_len = text.shape[0], text.shape[1]
text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
text = F.pad(text, (0, seq_len - text_len), value=0)
if drop_text: # cfg for text
text = torch.zeros_like(text)
text = self.text_embed(text) # b n -> b n d
# possible extra modeling
if self.extra_modeling:
# sinus pos emb
batch_start = torch.zeros((batch,), dtype=torch.long)
pos_idx = get_pos_embed_indices(batch_start, seq_len, max_pos=self.precompute_max_pos)
text_pos_embed = self.freqs_cis[pos_idx]
text = text + text_pos_embed
# convnextv2 blocks
text = self.text_blocks(text)
return text
# noised input audio and context mixing embedding
class InputEmbedding(nn.Module):
def __init__(self, mel_dim, text_dim, out_dim, spk_dim=None):
super().__init__()
spk_dim = 0 if spk_dim is None else spk_dim
self.spk_dim = spk_dim
self.proj = nn.Linear(mel_dim * 2 + text_dim + spk_dim, out_dim)
self.conv_pos_embed = CausalConvPositionEmbedding(dim=out_dim)
def forward(
self,
x: float["b n d"],
cond: float["b n d"],
text_embed: float["b n d"],
spks: float["b d"],
):
to_cat = [x, cond, text_embed]
if self.spk_dim > 0:
spks = repeat(spks, "b c -> b t c", t=x.shape[1])
to_cat.append(spks)
x = self.proj(torch.cat(to_cat, dim=-1))
x = self.conv_pos_embed(x) + x
return x
# Transformer backbone using DiT blocks
class DiT(nn.Module):
def __init__(
self,
*,
dim,
depth=8,
heads=8,
dim_head=64,
dropout=0.1,
ff_mult=4,
mel_dim=80,
mu_dim=None,
long_skip_connection=False,
spk_dim=None,
out_channels=None,
static_chunk_size=50,
num_decoding_left_chunks=2
):
super().__init__()
self.time_embed = TimestepEmbedding(dim)
if mu_dim is None:
mu_dim = mel_dim
self.input_embed = InputEmbedding(mel_dim, mu_dim, dim, spk_dim)
self.rotary_embed = RotaryEmbedding(dim_head)
self.dim = dim
self.depth = depth
self.transformer_blocks = nn.ModuleList(
[DiTBlock(dim=dim, heads=heads, dim_head=dim_head, ff_mult=ff_mult, dropout=dropout) for _ in range(depth)]
)
self.long_skip_connection = nn.Linear(dim * 2, dim, bias=False) if long_skip_connection else None
self.norm_out = AdaLayerNormZero_Final(dim) # final modulation
self.proj_out = nn.Linear(dim, mel_dim)
self.out_channels = out_channels
self.static_chunk_size = static_chunk_size
self.num_decoding_left_chunks = num_decoding_left_chunks
def forward(self, x, mask, mu, t, spks=None, cond=None, streaming=False):
x = x.transpose(1, 2)
mu = mu.transpose(1, 2)
cond = cond.transpose(1, 2)
spks = spks.unsqueeze(dim=1)
batch, seq_len = x.shape[0], x.shape[1]
if t.ndim == 0:
t = t.repeat(batch)
# t: conditioning time, c: context (text + masked cond audio), x: noised input audio
t = self.time_embed(t)
x = self.input_embed(x, cond, mu, spks.squeeze(1))
rope = self.rotary_embed.forward_from_seq_len(seq_len)
if self.long_skip_connection is not None:
residual = x
if streaming is True:
attn_mask = add_optional_chunk_mask(x, mask.bool(), False, False, 0, self.static_chunk_size, -1).unsqueeze(dim=1)
else:
attn_mask = add_optional_chunk_mask(x, mask.bool(), False, False, 0, 0, -1).repeat(1, x.size(1), 1).unsqueeze(dim=1)
for block in self.transformer_blocks:
x = block(x, t, mask=attn_mask.bool(), rope=rope)
if self.long_skip_connection is not None:
x = self.long_skip_connection(torch.cat((x, residual), dim=-1))
x = self.norm_out(x, t)
output = self.proj_out(x).transpose(1, 2)
return output

View File

@@ -0,0 +1,616 @@
"""
ein notation:
b - batch
n - sequence
nt - text sequence
nw - raw wave length
d - dimension
"""
from __future__ import annotations
from typing import Optional
import math
import torch
from torch import nn
import torch.nn.functional as F
import torchaudio
from x_transformers.x_transformers import apply_rotary_pos_emb
# raw wav to mel spec
class MelSpec(nn.Module):
def __init__(
self,
filter_length=1024,
hop_length=256,
win_length=1024,
n_mel_channels=100,
target_sample_rate=24_000,
normalize=False,
power=1,
norm=None,
center=True,
):
super().__init__()
self.n_mel_channels = n_mel_channels
self.mel_stft = torchaudio.transforms.MelSpectrogram(
sample_rate=target_sample_rate,
n_fft=filter_length,
win_length=win_length,
hop_length=hop_length,
n_mels=n_mel_channels,
power=power,
center=center,
normalized=normalize,
norm=norm,
)
self.register_buffer("dummy", torch.tensor(0), persistent=False)
def forward(self, inp):
if len(inp.shape) == 3:
inp = inp.squeeze(1) # 'b 1 nw -> b nw'
assert len(inp.shape) == 2
if self.dummy.device != inp.device:
self.to(inp.device)
mel = self.mel_stft(inp)
mel = mel.clamp(min=1e-5).log()
return mel
# sinusoidal position embedding
class SinusPositionEmbedding(nn.Module):
def __init__(self, dim):
super().__init__()
self.dim = dim
def forward(self, x, scale=1000):
device = x.device
half_dim = self.dim // 2
emb = math.log(10000) / (half_dim - 1)
emb = torch.exp(torch.arange(half_dim, device=device).float() * -emb)
emb = scale * x.unsqueeze(1) * emb.unsqueeze(0)
emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
return emb
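`SinusPositionEmbedding` maps a scalar timestep to `dim` values: `dim // 2` geometrically decaying frequencies, sines in the first half and cosines in the second, with `scale=1000` stretching the typical [0, 1] diffusion timestep. The same computation for a single scalar in plain Python (hypothetical `sinus_position_embedding`, not part of this codebase):

```python
import math

def sinus_position_embedding(t, dim, scale=1000.0):
    """Stdlib re-derivation of the forward above for one scalar t."""
    half_dim = dim // 2
    step = math.log(10000) / (half_dim - 1)
    freqs = [math.exp(-step * i) for i in range(half_dim)]  # 1 .. 1/10000, geometric
    args = [scale * t * f for f in freqs]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

emb = sinus_position_embedding(0.5, dim=8)
```

The first frequency is exactly 1, so the leading sine/cosine pair encodes `scale * t` directly.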
# convolutional position embedding
class ConvPositionEmbedding(nn.Module):
def __init__(self, dim, kernel_size=31, groups=16):
super().__init__()
assert kernel_size % 2 != 0
self.conv1d = nn.Sequential(
nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=kernel_size // 2),
nn.Mish(),
nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=kernel_size // 2),
nn.Mish(),
)
def forward(self, x: float["b n d"], mask: bool["b n"] | None = None): # noqa: F722
if mask is not None:
mask = mask[..., None]
x = x.masked_fill(~mask, 0.0)
x = x.permute(0, 2, 1)
x = self.conv1d(x)
out = x.permute(0, 2, 1)
if mask is not None:
out = out.masked_fill(~mask, 0.0)
return out
class CausalConvPositionEmbedding(nn.Module):
def __init__(self, dim, kernel_size=31, groups=16):
super().__init__()
assert kernel_size % 2 != 0
self.kernel_size = kernel_size
self.conv1 = nn.Sequential(
nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=0),
nn.Mish(),
)
self.conv2 = nn.Sequential(
nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=0),
nn.Mish(),
)
def forward(self, x: float["b n d"], mask: bool["b n"] | None = None): # noqa: F722
if mask is not None:
mask = mask[..., None]
x = x.masked_fill(~mask, 0.0)
x = x.permute(0, 2, 1)
x = F.pad(x, (self.kernel_size - 1, 0, 0, 0))
x = self.conv1(x)
x = F.pad(x, (self.kernel_size - 1, 0, 0, 0))
x = self.conv2(x)
out = x.permute(0, 2, 1)
if mask is not None:
out = out.masked_fill(~mask, 0.0)
return out
# rotary positional embedding related
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, theta_rescale_factor=1.0):
# proposed by reddit user bloc97, to rescale rotary embeddings to longer sequence length without fine-tuning
# has some connection to NTK literature
# https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
# https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py
theta *= theta_rescale_factor ** (dim / (dim - 2))
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
t = torch.arange(end, device=freqs.device) # type: ignore
freqs = torch.outer(t, freqs).float() # type: ignore
freqs_cos = torch.cos(freqs) # real part
freqs_sin = torch.sin(freqs) # imaginary part
return torch.cat([freqs_cos, freqs_sin], dim=-1)
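`precompute_freqs_cis` builds an `(end, dim)` table: `dim // 2` inverse-power frequencies, each multiplied by every position index, with the cosines and sines concatenated along the last axis. A stdlib re-derivation (hypothetical `rope_freqs`, not part of this codebase; note that with `theta_rescale_factor=1.0` the NTK rescale line is a no-op):

```python
import math

def rope_freqs(dim, end, theta=10000.0, theta_rescale_factor=1.0):
    """Rotary frequency table: cos/sin of position * theta^(-2i/dim)."""
    theta *= theta_rescale_factor ** (dim / (dim - 2))  # NTK-aware rescale
    inv = [1.0 / theta ** (2 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(t * f) for f in inv] for t in range(end)]
    sin = [[math.sin(t * f) for f in inv] for t in range(end)]
    return cos, sin

cos, sin = rope_freqs(dim=8, end=4)
```

Position 0 always yields all-ones cosines and all-zeros sines, which is why rotary embedding leaves the first token's query/key untouched.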
def get_pos_embed_indices(start, length, max_pos, scale=1.0):
# length = length if isinstance(length, int) else length.max()
scale = scale * torch.ones_like(start, dtype=torch.float32) # in case scale is a scalar
pos = (
start.unsqueeze(1)
+ (torch.arange(length, device=start.device, dtype=torch.float32).unsqueeze(0) * scale.unsqueeze(1)).long()
)
# avoid extra long error.
pos = torch.where(pos < max_pos, pos, max_pos - 1)
return pos
# Global Response Normalization (GRN) layer, from ConvNeXt-V2
class GRN(nn.Module):
def __init__(self, dim):
super().__init__()
self.gamma = nn.Parameter(torch.zeros(1, 1, dim))
self.beta = nn.Parameter(torch.zeros(1, 1, dim))
def forward(self, x):
Gx = torch.norm(x, p=2, dim=1, keepdim=True)
Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
return self.gamma * (x * Nx) + self.beta + x
# ConvNeXt-V2 Block https://github.com/facebookresearch/ConvNeXt-V2/blob/main/models/convnextv2.py
# ref: https://github.com/bfs18/e2_tts/blob/main/rfwave/modules.py#L108
class ConvNeXtV2Block(nn.Module):
def __init__(
self,
dim: int,
intermediate_dim: int,
dilation: int = 1,
):
super().__init__()
padding = (dilation * (7 - 1)) // 2
self.dwconv = nn.Conv1d(
dim, dim, kernel_size=7, padding=padding, groups=dim, dilation=dilation
) # depthwise conv
self.norm = nn.LayerNorm(dim, eps=1e-6)
self.pwconv1 = nn.Linear(dim, intermediate_dim) # pointwise/1x1 convs, implemented with linear layers
self.act = nn.GELU()
self.grn = GRN(intermediate_dim)
self.pwconv2 = nn.Linear(intermediate_dim, dim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
residual = x
x = x.transpose(1, 2) # b n d -> b d n
x = self.dwconv(x)
x = x.transpose(1, 2) # b d n -> b n d
x = self.norm(x)
x = self.pwconv1(x)
x = self.act(x)
x = self.grn(x)
x = self.pwconv2(x)
return residual + x
# AdaLayerNormZero
# return with modulated x for attn input, and params for later mlp modulation
class AdaLayerNormZero(nn.Module):
def __init__(self, dim):
super().__init__()
self.silu = nn.SiLU()
self.linear = nn.Linear(dim, dim * 6)
self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
def forward(self, x, emb=None):
emb = self.linear(self.silu(emb))
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = torch.chunk(emb, 6, dim=1)
x = self.norm(x) * (1 + scale_msa[:, None]) + shift_msa[:, None]
return x, gate_msa, shift_mlp, scale_mlp, gate_mlp
# AdaLayerNormZero for final layer
# return only with modulated x for attn input, cuz no more mlp modulation
class AdaLayerNormZero_Final(nn.Module):
def __init__(self, dim):
super().__init__()
self.silu = nn.SiLU()
self.linear = nn.Linear(dim, dim * 2)
self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
def forward(self, x, emb):
emb = self.linear(self.silu(emb))
scale, shift = torch.chunk(emb, 2, dim=1)
x = self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
return x
# FeedForward
class FeedForward(nn.Module):
def __init__(self, dim, dim_out=None, mult=4, dropout=0.0, approximate: str = "none"):
super().__init__()
inner_dim = int(dim * mult)
dim_out = dim_out if dim_out is not None else dim
activation = nn.GELU(approximate=approximate)
project_in = nn.Sequential(nn.Linear(dim, inner_dim), activation)
self.ff = nn.Sequential(project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out))
def forward(self, x):
return self.ff(x)
# Attention with possible joint part
# modified from diffusers/src/diffusers/models/attention_processor.py
class Attention(nn.Module):
def __init__(
self,
processor: JointAttnProcessor | AttnProcessor,
dim: int,
heads: int = 8,
dim_head: int = 64,
dropout: float = 0.0,
context_dim: Optional[int] = None, # if not None -> joint attention
context_pre_only=None,
):
super().__init__()
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError("Attention equires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")
self.processor = processor
self.dim = dim
self.heads = heads
self.inner_dim = dim_head * heads
self.dropout = dropout
self.context_dim = context_dim
self.context_pre_only = context_pre_only
self.to_q = nn.Linear(dim, self.inner_dim)
self.to_k = nn.Linear(dim, self.inner_dim)
self.to_v = nn.Linear(dim, self.inner_dim)
if self.context_dim is not None:
self.to_k_c = nn.Linear(context_dim, self.inner_dim)
self.to_v_c = nn.Linear(context_dim, self.inner_dim)
if self.context_pre_only is not None:
self.to_q_c = nn.Linear(context_dim, self.inner_dim)
self.to_out = nn.ModuleList([])
self.to_out.append(nn.Linear(self.inner_dim, dim))
self.to_out.append(nn.Dropout(dropout))
if self.context_pre_only is not None and not self.context_pre_only:
self.to_out_c = nn.Linear(self.inner_dim, dim)
def forward(
self,
x: float["b n d"], # noised input x # noqa: F722
c: float["b n d"] = None, # context c # noqa: F722
mask: bool["b n"] | None = None, # noqa: F722
rope=None, # rotary position embedding for x
c_rope=None, # rotary position embedding for c
) -> torch.Tensor:
if c is not None:
return self.processor(self, x, c=c, mask=mask, rope=rope, c_rope=c_rope)
else:
return self.processor(self, x, mask=mask, rope=rope)
# Attention processor
class AttnProcessor:
def __init__(self):
pass
def __call__(
self,
attn: Attention,
x: float["b n d"], # noised input x # noqa: F722
mask: bool["b n"] | None = None, # noqa: F722
rope=None, # rotary position embedding
) -> torch.FloatTensor:
batch_size = x.shape[0]
# `sample` projections.
query = attn.to_q(x)
key = attn.to_k(x)
value = attn.to_v(x)
# apply rotary position embedding
if rope is not None:
freqs, xpos_scale = rope
q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
# attention
inner_dim = key.shape[-1]
head_dim = inner_dim // attn.heads
query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
# mask. e.g. inference got a batch with different target durations, mask out the padding
if mask is not None:
attn_mask = mask
if attn_mask.dim() == 2:
attn_mask = attn_mask.unsqueeze(1).unsqueeze(1) # 'b n -> b 1 1 n'
attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
else:
attn_mask = None
x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
x = x.to(query.dtype)
# linear proj
x = attn.to_out[0](x)
# dropout
x = attn.to_out[1](x)
if mask is not None:
if mask.dim() == 2:
mask = mask.unsqueeze(-1)
else:
mask = mask[:, 0, -1].unsqueeze(-1)
x = x.masked_fill(~mask, 0.0)
return x
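At its core the processor computes softmax(QKᵀ/√d)·V per head via `F.scaled_dot_product_attention`. For a single query vector the arithmetic reduces to a few lines of stdlib Python (hypothetical `sdpa_row`, single head, no mask, not part of this codebase):

```python
import math

def sdpa_row(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                        # subtract max for a stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights

out, weights = sdpa_row([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

The query aligns with the first key, so the first value gets the larger weight and the output is a convex mix of the two value rows.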
# Joint Attention processor for MM-DiT
# modified from diffusers/src/diffusers/models/attention_processor.py
class JointAttnProcessor:
def __init__(self):
pass
def __call__(
self,
attn: Attention,
x: float["b n d"], # noised input x # noqa: F722
c: float["b nt d"] = None, # context c, here text # noqa: F722
mask: bool["b n"] | None = None, # noqa: F722
rope=None, # rotary position embedding for x
c_rope=None, # rotary position embedding for c
) -> torch.FloatTensor:
residual = x
batch_size = c.shape[0]
# `sample` projections.
query = attn.to_q(x)
key = attn.to_k(x)
value = attn.to_v(x)
# `context` projections.
c_query = attn.to_q_c(c)
c_key = attn.to_k_c(c)
c_value = attn.to_v_c(c)
# apply rope for context and noised input independently
if rope is not None:
freqs, xpos_scale = rope
q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
if c_rope is not None:
freqs, xpos_scale = c_rope
q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
c_query = apply_rotary_pos_emb(c_query, freqs, q_xpos_scale)
c_key = apply_rotary_pos_emb(c_key, freqs, k_xpos_scale)
# attention
query = torch.cat([query, c_query], dim=1)
key = torch.cat([key, c_key], dim=1)
value = torch.cat([value, c_value], dim=1)
inner_dim = key.shape[-1]
head_dim = inner_dim // attn.heads
query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
# mask. e.g. inference got a batch with different target durations, mask out the padding
if mask is not None:
attn_mask = F.pad(mask, (0, c.shape[1]), value=True) # no mask for c (text)
attn_mask = attn_mask.unsqueeze(1).unsqueeze(1) # 'b n -> b 1 1 n'
attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
else:
attn_mask = None
x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
x = x.to(query.dtype)
# Split the attention outputs.
x, c = (
x[:, : residual.shape[1]],
x[:, residual.shape[1]:],
)
# linear proj
x = attn.to_out[0](x)
# dropout
x = attn.to_out[1](x)
if not attn.context_pre_only:
c = attn.to_out_c(c)
if mask is not None:
mask = mask.unsqueeze(-1)
x = x.masked_fill(~mask, 0.0)
# c = c.masked_fill(~mask, 0.) # no mask for c (text)
return x, c
# DiT Block
class DiTBlock(nn.Module):
def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1):
super().__init__()
self.attn_norm = AdaLayerNormZero(dim)
self.attn = Attention(
processor=AttnProcessor(),
dim=dim,
heads=heads,
dim_head=dim_head,
dropout=dropout,
)
self.ff_norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
self.ff = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
def forward(self, x, t, mask=None, rope=None): # x: noised input, t: time embedding
# pre-norm & modulation for attention input
norm, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.attn_norm(x, emb=t)
# attention
attn_output = self.attn(x=norm, mask=mask, rope=rope)
# process attention output for input x
x = x + gate_msa.unsqueeze(1) * attn_output
ff_norm = self.ff_norm(x) * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
ff_output = self.ff(ff_norm)
x = x + gate_mlp.unsqueeze(1) * ff_output
return x
# MMDiT Block https://arxiv.org/abs/2403.03206
class MMDiTBlock(nn.Module):
r"""
modified from diffusers/src/diffusers/models/attention.py
notes.
_c: context related. text, cond, etc. (left part in sd3 fig2.b)
_x: noised input related. (right part)
context_pre_only: last layer only do prenorm + modulation cuz no more ffn
"""
def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1, context_pre_only=False):
super().__init__()
self.context_pre_only = context_pre_only
self.attn_norm_c = AdaLayerNormZero_Final(dim) if context_pre_only else AdaLayerNormZero(dim)
self.attn_norm_x = AdaLayerNormZero(dim)
self.attn = Attention(
processor=JointAttnProcessor(),
dim=dim,
heads=heads,
dim_head=dim_head,
dropout=dropout,
context_dim=dim,
context_pre_only=context_pre_only,
)
if not context_pre_only:
self.ff_norm_c = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
self.ff_c = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
else:
self.ff_norm_c = None
self.ff_c = None
self.ff_norm_x = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
self.ff_x = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
def forward(self, x, c, t, mask=None, rope=None, c_rope=None): # x: noised input, c: context, t: time embedding
# pre-norm & modulation for attention input
if self.context_pre_only:
norm_c = self.attn_norm_c(c, t)
else:
norm_c, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.attn_norm_c(c, emb=t)
norm_x, x_gate_msa, x_shift_mlp, x_scale_mlp, x_gate_mlp = self.attn_norm_x(x, emb=t)
# attention
x_attn_output, c_attn_output = self.attn(x=norm_x, c=norm_c, mask=mask, rope=rope, c_rope=c_rope)
# process attention output for context c
if self.context_pre_only:
c = None
else: # if not last layer
c = c + c_gate_msa.unsqueeze(1) * c_attn_output
norm_c = self.ff_norm_c(c) * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]
c_ff_output = self.ff_c(norm_c)
c = c + c_gate_mlp.unsqueeze(1) * c_ff_output
# process attention output for input x
x = x + x_gate_msa.unsqueeze(1) * x_attn_output
norm_x = self.ff_norm_x(x) * (1 + x_scale_mlp[:, None]) + x_shift_mlp[:, None]
x_ff_output = self.ff_x(norm_x)
x = x + x_gate_mlp.unsqueeze(1) * x_ff_output
return c, x
# time step conditioning embedding
class TimestepEmbedding(nn.Module):
def __init__(self, dim, freq_embed_dim=256):
super().__init__()
self.time_embed = SinusPositionEmbedding(freq_embed_dim)
self.time_mlp = nn.Sequential(nn.Linear(freq_embed_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
def forward(self, timestep: float["b"]): # noqa: F821
time_hidden = self.time_embed(timestep)
time_hidden = time_hidden.to(timestep.dtype)
time = self.time_mlp(time_hidden) # b d
return time

View File

@@ -0,0 +1,494 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Tuple
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import pack, rearrange, repeat
from cosyvoice.utils.common import mask_to_bias
from cosyvoice.utils.mask import add_optional_chunk_mask
from matcha.models.components.decoder import SinusoidalPosEmb, Block1D, ResnetBlock1D, Downsample1D, TimestepEmbedding, Upsample1D
from matcha.models.components.transformer import BasicTransformerBlock
class Transpose(torch.nn.Module):
def __init__(self, dim0: int, dim1: int):
super().__init__()
self.dim0 = dim0
self.dim1 = dim1
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = torch.transpose(x, self.dim0, self.dim1)
return x
class CausalConv1d(torch.nn.Conv1d):
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int,
stride: int = 1,
dilation: int = 1,
groups: int = 1,
bias: bool = True,
padding_mode: str = 'zeros',
device=None,
dtype=None
) -> None:
super(CausalConv1d, self).__init__(in_channels, out_channels,
kernel_size, stride,
padding=0, dilation=dilation,
groups=groups, bias=bias,
padding_mode=padding_mode,
device=device, dtype=dtype)
assert stride == 1
self.causal_padding = kernel_size - 1
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = F.pad(x, (self.causal_padding, 0), value=0.0)
x = super(CausalConv1d, self).forward(x)
return x
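`CausalConv1d` replaces symmetric padding with `kernel_size - 1` zeros on the left only, so output frame `t` sees inputs up to `t` and the sequence length is preserved. The padding arithmetic on plain lists (hypothetical `causal_conv1d`, cross-correlation like `nn.Conv1d`, not part of this codebase):

```python
def causal_conv1d(x, kernel):
    """1-D causal convolution: left-pad by k-1 zeros, no future leakage."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]

y = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel=[0.5, 0.5])
```

With an averaging kernel, each output frame mixes the current input with the previous one only, and the first frame sees a zero where a centered convolution would peek ahead.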
class CausalBlock1D(Block1D):
def __init__(self, dim: int, dim_out: int):
super(CausalBlock1D, self).__init__(dim, dim_out)
self.block = torch.nn.Sequential(
CausalConv1d(dim, dim_out, 3),
Transpose(1, 2),
nn.LayerNorm(dim_out),
Transpose(1, 2),
nn.Mish(),
)
def forward(self, x: torch.Tensor, mask: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
output = self.block(x * mask)
return output * mask
class CausalResnetBlock1D(ResnetBlock1D):
def __init__(self, dim: int, dim_out: int, time_emb_dim: int, groups: int = 8):
super(CausalResnetBlock1D, self).__init__(dim, dim_out, time_emb_dim, groups)
self.block1 = CausalBlock1D(dim, dim_out)
self.block2 = CausalBlock1D(dim_out, dim_out)
class ConditionalDecoder(nn.Module):
def __init__(
self,
in_channels,
out_channels,
channels=(256, 256),
dropout=0.05,
attention_head_dim=64,
n_blocks=1,
num_mid_blocks=2,
num_heads=4,
act_fn="snake",
):
"""
This decoder requires an input with the same shape of the target. So, if your text content
is shorter or longer than the outputs, please re-sampling it before feeding to the decoder.
"""
super().__init__()
channels = tuple(channels)
self.in_channels = in_channels
self.out_channels = out_channels
self.time_embeddings = SinusoidalPosEmb(in_channels)
time_embed_dim = channels[0] * 4
self.time_mlp = TimestepEmbedding(
in_channels=in_channels,
time_embed_dim=time_embed_dim,
act_fn="silu",
)
self.down_blocks = nn.ModuleList([])
self.mid_blocks = nn.ModuleList([])
self.up_blocks = nn.ModuleList([])
output_channel = in_channels
for i in range(len(channels)): # pylint: disable=consider-using-enumerate
input_channel = output_channel
output_channel = channels[i]
is_last = i == len(channels) - 1
resnet = ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
transformer_blocks = nn.ModuleList(
[
BasicTransformerBlock(
dim=output_channel,
num_attention_heads=num_heads,
attention_head_dim=attention_head_dim,
dropout=dropout,
activation_fn=act_fn,
)
for _ in range(n_blocks)
]
)
downsample = (
Downsample1D(output_channel) if not is_last else nn.Conv1d(output_channel, output_channel, 3, padding=1)
)
self.down_blocks.append(nn.ModuleList([resnet, transformer_blocks, downsample]))
for _ in range(num_mid_blocks):
input_channel = channels[-1]
out_channels = channels[-1]
resnet = ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
transformer_blocks = nn.ModuleList(
[
BasicTransformerBlock(
dim=output_channel,
num_attention_heads=num_heads,
attention_head_dim=attention_head_dim,
dropout=dropout,
activation_fn=act_fn,
)
for _ in range(n_blocks)
]
)
self.mid_blocks.append(nn.ModuleList([resnet, transformer_blocks]))
channels = channels[::-1] + (channels[0],)
for i in range(len(channels) - 1):
input_channel = channels[i] * 2
output_channel = channels[i + 1]
is_last = i == len(channels) - 2
resnet = ResnetBlock1D(
dim=input_channel,
dim_out=output_channel,
time_emb_dim=time_embed_dim,
)
transformer_blocks = nn.ModuleList(
[
BasicTransformerBlock(
dim=output_channel,
num_attention_heads=num_heads,
attention_head_dim=attention_head_dim,
dropout=dropout,
activation_fn=act_fn,
)
for _ in range(n_blocks)
]
)
upsample = (
Upsample1D(output_channel, use_conv_transpose=True)
if not is_last
else nn.Conv1d(output_channel, output_channel, 3, padding=1)
)
self.up_blocks.append(nn.ModuleList([resnet, transformer_blocks, upsample]))
self.final_block = Block1D(channels[-1], channels[-1])
self.final_proj = nn.Conv1d(channels[-1], self.out_channels, 1)
self.initialize_weights()
def initialize_weights(self):
for m in self.modules():
if isinstance(m, nn.Conv1d):
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.GroupNorm):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
if m.bias is not None:
nn.init.constant_(m.bias, 0)
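The down/up channel bookkeeping above is easy to lose track of: each up-block consumes twice its nominal width because the matching down-block's hidden state is concatenated back in (`input_channel = channels[i] * 2`). A sketch of the resulting channel plan (hypothetical `unet_channel_plan`, not part of this codebase; the `320` input width is an illustrative value):

```python
def unet_channel_plan(in_channels, channels):
    """Trace (in, out) channel pairs for the down and up paths above."""
    down, out = [], in_channels
    for c in channels:                 # down path: previous width -> c
        down.append((out, c))
        out = c
    # up path mirrors channels[::-1] + (channels[0],) in the constructor
    rev = tuple(reversed(channels)) + (channels[0],)
    up = [(rev[i] * 2, rev[i + 1])     # *2 from the skip-connection concat
          for i in range(len(rev) - 1)]
    return down, up

down, up = unet_channel_plan(320, (256, 256))
```

For the default two-level `channels=(256, 256)`, every up-block therefore starts from 512 channels even though the nominal width never exceeds 256.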
def forward(self, x, mask, mu, t, spks=None, cond=None, streaming=False):
"""Forward pass of the UNet1DConditional model.
Args:
x (torch.Tensor): shape (batch_size, in_channels, time)
mask (torch.Tensor): shape (batch_size, 1, time)
mu (torch.Tensor): conditioning features, shape (batch_size, channels, time)
t (torch.Tensor): shape (batch_size)
spks (torch.Tensor, optional): shape (batch_size, condition_channels). Defaults to None.
cond (torch.Tensor, optional): placeholder for future use. Defaults to None.
Returns:
torch.Tensor: output of shape (batch_size, out_channels, time)
"""
t = self.time_embeddings(t).to(t.dtype)
t = self.time_mlp(t)
x = pack([x, mu], "b * t")[0]
if spks is not None:
spks = repeat(spks, "b c -> b c t", t=x.shape[-1])
x = pack([x, spks], "b * t")[0]
if cond is not None:
x = pack([x, cond], "b * t")[0]
hiddens = []
masks = [mask]
for resnet, transformer_blocks, downsample in self.down_blocks:
mask_down = masks[-1]
x = resnet(x, mask_down, t)
x = rearrange(x, "b c t -> b t c").contiguous()
attn_mask = add_optional_chunk_mask(x, mask_down.bool(), False, False, 0, 0, -1).repeat(1, x.size(1), 1)
attn_mask = mask_to_bias(attn_mask, x.dtype)
for transformer_block in transformer_blocks:
x = transformer_block(
hidden_states=x,
attention_mask=attn_mask,
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
hiddens.append(x) # Save hidden states for skip connections
x = downsample(x * mask_down)
masks.append(mask_down[:, :, ::2])
masks = masks[:-1]
mask_mid = masks[-1]
for resnet, transformer_blocks in self.mid_blocks:
x = resnet(x, mask_mid, t)
x = rearrange(x, "b c t -> b t c").contiguous()
attn_mask = add_optional_chunk_mask(x, mask_mid.bool(), False, False, 0, 0, -1).repeat(1, x.size(1), 1)
attn_mask = mask_to_bias(attn_mask, x.dtype)
for transformer_block in transformer_blocks:
x = transformer_block(
hidden_states=x,
attention_mask=attn_mask,
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
for resnet, transformer_blocks, upsample in self.up_blocks:
mask_up = masks.pop()
skip = hiddens.pop()
x = pack([x[:, :, :skip.shape[-1]], skip], "b * t")[0]
x = resnet(x, mask_up, t)
x = rearrange(x, "b c t -> b t c").contiguous()
attn_mask = add_optional_chunk_mask(x, mask_up.bool(), False, False, 0, 0, -1).repeat(1, x.size(1), 1)
attn_mask = mask_to_bias(attn_mask, x.dtype)
for transformer_block in transformer_blocks:
x = transformer_block(
hidden_states=x,
attention_mask=attn_mask,
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
x = upsample(x * mask_up)
x = self.final_block(x, mask_up)
output = self.final_proj(x * mask_up)
return output * mask
class CausalConditionalDecoder(ConditionalDecoder):
def __init__(
self,
in_channels,
out_channels,
channels=(256, 256),
dropout=0.05,
attention_head_dim=64,
n_blocks=1,
num_mid_blocks=2,
num_heads=4,
act_fn="snake",
static_chunk_size=50,
num_decoding_left_chunks=2,
):
"""
This decoder requires an input with the same shape as the target, so if your text content
is shorter or longer than the output, please re-sample it before feeding it to the decoder.
"""
torch.nn.Module.__init__(self)
channels = tuple(channels)
self.in_channels = in_channels
self.out_channels = out_channels
self.time_embeddings = SinusoidalPosEmb(in_channels)
time_embed_dim = channels[0] * 4
self.time_mlp = TimestepEmbedding(
in_channels=in_channels,
time_embed_dim=time_embed_dim,
act_fn="silu",
)
self.static_chunk_size = static_chunk_size
self.num_decoding_left_chunks = num_decoding_left_chunks
self.down_blocks = nn.ModuleList([])
self.mid_blocks = nn.ModuleList([])
self.up_blocks = nn.ModuleList([])
output_channel = in_channels
for i in range(len(channels)): # pylint: disable=consider-using-enumerate
input_channel = output_channel
output_channel = channels[i]
is_last = i == len(channels) - 1
resnet = CausalResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
transformer_blocks = nn.ModuleList(
[
BasicTransformerBlock(
dim=output_channel,
num_attention_heads=num_heads,
attention_head_dim=attention_head_dim,
dropout=dropout,
activation_fn=act_fn,
)
for _ in range(n_blocks)
]
)
downsample = (
Downsample1D(output_channel) if not is_last else CausalConv1d(output_channel, output_channel, 3)
)
self.down_blocks.append(nn.ModuleList([resnet, transformer_blocks, downsample]))
for _ in range(num_mid_blocks):
input_channel = channels[-1]
output_channel = channels[-1]
resnet = CausalResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
transformer_blocks = nn.ModuleList(
[
BasicTransformerBlock(
dim=output_channel,
num_attention_heads=num_heads,
attention_head_dim=attention_head_dim,
dropout=dropout,
activation_fn=act_fn,
)
for _ in range(n_blocks)
]
)
self.mid_blocks.append(nn.ModuleList([resnet, transformer_blocks]))
channels = channels[::-1] + (channels[0],)
for i in range(len(channels) - 1):
input_channel = channels[i] * 2
output_channel = channels[i + 1]
is_last = i == len(channels) - 2
resnet = CausalResnetBlock1D(
dim=input_channel,
dim_out=output_channel,
time_emb_dim=time_embed_dim,
)
transformer_blocks = nn.ModuleList(
[
BasicTransformerBlock(
dim=output_channel,
num_attention_heads=num_heads,
attention_head_dim=attention_head_dim,
dropout=dropout,
activation_fn=act_fn,
)
for _ in range(n_blocks)
]
)
upsample = (
Upsample1D(output_channel, use_conv_transpose=True)
if not is_last
else CausalConv1d(output_channel, output_channel, 3)
)
self.up_blocks.append(nn.ModuleList([resnet, transformer_blocks, upsample]))
self.final_block = CausalBlock1D(channels[-1], channels[-1])
self.final_proj = nn.Conv1d(channels[-1], self.out_channels, 1)
self.initialize_weights()
def forward(self, x, mask, mu, t, spks=None, cond=None, streaming=False):
"""Forward pass of the UNet1DConditional model.
Args:
x (torch.Tensor): shape (batch_size, in_channels, time)
mask (_type_): shape (batch_size, 1, time)
t (_type_): shape (batch_size)
spks (_type_, optional): shape: (batch_size, condition_channels). Defaults to None.
cond (_type_, optional): placeholder for future use. Defaults to None.
Raises:
ValueError: _description_
ValueError: _description_
Returns:
_type_: _description_
"""
t = self.time_embeddings(t).to(t.dtype)
t = self.time_mlp(t)
x = pack([x, mu], "b * t")[0]
if spks is not None:
spks = repeat(spks, "b c -> b c t", t=x.shape[-1])
x = pack([x, spks], "b * t")[0]
if cond is not None:
x = pack([x, cond], "b * t")[0]
hiddens = []
masks = [mask]
for resnet, transformer_blocks, downsample in self.down_blocks:
mask_down = masks[-1]
x = resnet(x, mask_down, t)
x = rearrange(x, "b c t -> b t c").contiguous()
if streaming is True:
attn_mask = add_optional_chunk_mask(x, mask_down.bool(), False, False, 0, self.static_chunk_size, -1)
else:
attn_mask = add_optional_chunk_mask(x, mask_down.bool(), False, False, 0, 0, -1).repeat(1, x.size(1), 1)
attn_mask = mask_to_bias(attn_mask, x.dtype)
for transformer_block in transformer_blocks:
x = transformer_block(
hidden_states=x,
attention_mask=attn_mask,
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
hiddens.append(x) # Save hidden states for skip connections
x = downsample(x * mask_down)
masks.append(mask_down[:, :, ::2])
masks = masks[:-1]
mask_mid = masks[-1]
for resnet, transformer_blocks in self.mid_blocks:
x = resnet(x, mask_mid, t)
x = rearrange(x, "b c t -> b t c").contiguous()
if streaming is True:
attn_mask = add_optional_chunk_mask(x, mask_mid.bool(), False, False, 0, self.static_chunk_size, -1)
else:
attn_mask = add_optional_chunk_mask(x, mask_mid.bool(), False, False, 0, 0, -1).repeat(1, x.size(1), 1)
attn_mask = mask_to_bias(attn_mask, x.dtype)
for transformer_block in transformer_blocks:
x = transformer_block(
hidden_states=x,
attention_mask=attn_mask,
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
for resnet, transformer_blocks, upsample in self.up_blocks:
mask_up = masks.pop()
skip = hiddens.pop()
x = pack([x[:, :, :skip.shape[-1]], skip], "b * t")[0]
x = resnet(x, mask_up, t)
x = rearrange(x, "b c t -> b t c").contiguous()
if streaming is True:
attn_mask = add_optional_chunk_mask(x, mask_up.bool(), False, False, 0, self.static_chunk_size, -1)
else:
attn_mask = add_optional_chunk_mask(x, mask_up.bool(), False, False, 0, 0, -1).repeat(1, x.size(1), 1)
attn_mask = mask_to_bias(attn_mask, x.dtype)
for transformer_block in transformer_blocks:
x = transformer_block(
hidden_states=x,
attention_mask=attn_mask,
timestep=t,
)
x = rearrange(x, "b t c -> b c t").contiguous()
x = upsample(x * mask_up)
x = self.final_block(x, mask_up)
output = self.final_proj(x * mask_up)
return output * mask
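In both decoders above, the down path records a per-resolution mask with `mask_down[:, :, ::2]`, relying on the fact that a stride-2 slice has the same length as a stride-2 convolution with kernel 3 and padding 1 (both yield ceil(T/2), matching `Downsample1D`). A minimal pure-Python sketch of that length bookkeeping; the helper names are illustrative, not from the source:

```python
import math

def strided_slice_len(t: int) -> int:
    # length of x[::2] for a length-t sequence
    return (t + 1) // 2

def conv_out_len(t: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    # standard Conv1d output-length formula
    return (t + 2 * padding - kernel) // stride + 1

# the two agree for every length, so the sliced mask stays aligned with the features
for t in range(1, 100):
    assert strided_slice_len(t) == conv_out_len(t) == math.ceil(t / 2)
```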


@@ -0,0 +1,443 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import logging
import random
from typing import Dict, Optional
import torch
import torch.nn as nn
from torch.nn import functional as F
from omegaconf import DictConfig
from cosyvoice.utils.mask import make_pad_mask
from cosyvoice.utils.onnx import SpeechTokenExtractor, online_feature, onnx_path
class MaskedDiffWithXvec(torch.nn.Module):
def __init__(self,
input_size: int = 512,
output_size: int = 80,
spk_embed_dim: int = 192,
output_type: str = "mel",
vocab_size: int = 4096,
input_frame_rate: int = 50,
only_mask_loss: bool = True,
encoder: torch.nn.Module = None,
length_regulator: torch.nn.Module = None,
decoder: torch.nn.Module = None,
decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1,
'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine',
'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}),
'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64,
'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}}):
super().__init__()
self.input_size = input_size
self.output_size = output_size
self.decoder_conf = decoder_conf
self.vocab_size = vocab_size
self.output_type = output_type
self.input_frame_rate = input_frame_rate
logging.info(f"input frame rate={self.input_frame_rate}")
self.input_embedding = nn.Embedding(vocab_size, input_size)
self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, output_size)
self.encoder = encoder
self.encoder_proj = torch.nn.Linear(self.encoder.output_size(), output_size)
self.decoder = decoder
self.length_regulator = length_regulator
self.only_mask_loss = only_mask_loss
def forward(
self,
batch: dict,
device: torch.device,
) -> Dict[str, Optional[torch.Tensor]]:
token = batch['speech_token'].to(device)
token_len = batch['speech_token_len'].to(device)
feat = batch['speech_feat'].to(device)
feat_len = batch['speech_feat_len'].to(device)
embedding = batch['embedding'].to(device)
# xvec projection
embedding = F.normalize(embedding, dim=1)
embedding = self.spk_embed_affine_layer(embedding)
# concat text and prompt_text
mask = (~make_pad_mask(token_len)).float().unsqueeze(-1).to(device)
token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode
h, h_lengths = self.encoder(token, token_len)
h = self.encoder_proj(h)
h, h_lengths = self.length_regulator(h, feat_len)
# get conditions
conds = torch.zeros(feat.shape, device=token.device)
for i, j in enumerate(feat_len):
if random.random() < 0.5:
continue
index = random.randint(0, int(0.3 * j))
conds[i, :index] = feat[i, :index]
conds = conds.transpose(1, 2)
mask = (~make_pad_mask(feat_len)).to(h)
# NOTE this is unnecessary; feat and h already have the same shape
loss, _ = self.decoder.compute_loss(
feat.transpose(1, 2).contiguous(),
mask.unsqueeze(1),
h.transpose(1, 2).contiguous(),
embedding,
cond=conds
)
return {'loss': loss}
@torch.inference_mode()
def inference(self,
token,
token_len,
prompt_token,
prompt_token_len,
prompt_feat,
prompt_feat_len,
embedding,
flow_cache):
assert token.shape[0] == 1
# xvec projection
embedding = F.normalize(embedding, dim=1)
embedding = self.spk_embed_affine_layer(embedding)
# concat speech token and prompt speech token
token_len1, token_len2 = prompt_token.shape[1], token.shape[1]
token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len
mask = (~make_pad_mask(token_len)).unsqueeze(-1).to(embedding)
token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode
h, h_lengths = self.encoder(token, token_len)
h = self.encoder_proj(h)
mel_len1, mel_len2 = prompt_feat.shape[1], int(token_len2 / self.input_frame_rate * 22050 / 256)
h, h_lengths = self.length_regulator.inference(h[:, :token_len1], h[:, token_len1:], mel_len1, mel_len2, self.input_frame_rate)
# get conditions
conds = torch.zeros([1, mel_len1 + mel_len2, self.output_size], device=token.device).to(h.dtype)
conds[:, :mel_len1] = prompt_feat
conds = conds.transpose(1, 2)
mask = (~make_pad_mask(torch.tensor([mel_len1 + mel_len2]))).to(h)
feat, flow_cache = self.decoder(
mu=h.transpose(1, 2).contiguous(),
mask=mask.unsqueeze(1),
spks=embedding,
cond=conds,
n_timesteps=10,
prompt_len=mel_len1,
cache=flow_cache
)
feat = feat[:, :, mel_len1:]
assert feat.shape[2] == mel_len2
return feat.float(), flow_cache
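`MaskedDiffWithXvec.inference` above converts token counts to mel-frame counts via `token_len / input_frame_rate * 22050 / 256`: tokens arrive at `input_frame_rate` Hz, while mel frames are produced at sample_rate / hop_size frames per second (22.05 kHz audio with hop 256 is hard-coded above). A small sketch of the conversion, with the helper name being illustrative:

```python
def token_to_mel_len(n_tokens: int, input_frame_rate: int = 50,
                     sample_rate: int = 22050, hop_size: int = 256) -> int:
    # seconds of speech covered by the tokens, times mel frames per second
    seconds = n_tokens / input_frame_rate
    return int(seconds * sample_rate / hop_size)

# 100 tokens at 50 Hz = 2 s of audio -> 172 mel frames at hop 256
assert token_to_mel_len(100) == 172
```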
class CausalMaskedDiffWithXvec(torch.nn.Module):
def __init__(self,
input_size: int = 512,
output_size: int = 80,
spk_embed_dim: int = 192,
output_type: str = "mel",
vocab_size: int = 4096,
input_frame_rate: int = 50,
only_mask_loss: bool = True,
token_mel_ratio: int = 2,
pre_lookahead_len: int = 3,
encoder: torch.nn.Module = None,
decoder: torch.nn.Module = None,
decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1,
'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine',
'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}),
'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64,
'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}}):
super().__init__()
self.input_size = input_size
self.output_size = output_size
self.decoder_conf = decoder_conf
self.vocab_size = vocab_size
self.output_type = output_type
self.input_frame_rate = input_frame_rate
logging.info(f"input frame rate={self.input_frame_rate}")
self.input_embedding = nn.Embedding(vocab_size, input_size)
self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, output_size)
self.encoder = encoder
self.encoder_proj = torch.nn.Linear(self.encoder.output_size(), output_size)
self.decoder = decoder
self.only_mask_loss = only_mask_loss
self.token_mel_ratio = token_mel_ratio
self.pre_lookahead_len = pre_lookahead_len
if online_feature is True:
self.speech_token_extractor = SpeechTokenExtractor(model_path=os.path.join(onnx_path, 'speech_tokenizer_v2.batch.onnx'))
def forward(
self,
batch: dict,
device: torch.device,
) -> Dict[str, Optional[torch.Tensor]]:
if 'speech_token' not in batch:
token, token_len = self.speech_token_extractor.inference(batch['whisper_feat'], batch['whisper_feat_len'], device)
else:
token = batch['speech_token'].to(device)
token_len = batch['speech_token_len'].to(device)
feat = batch['speech_feat'].to(device)
feat_len = batch['speech_feat_len'].to(device)
embedding = batch['embedding'].to(device)
# NOTE unified training: randomly mix streaming (static_chunk_size > 0) and non-streaming (static_chunk_size == 0) modes
streaming = random.random() < 0.5
# xvec projection
embedding = F.normalize(embedding, dim=1)
embedding = self.spk_embed_affine_layer(embedding)
# concat text and prompt_text
mask = (~make_pad_mask(token_len)).float().unsqueeze(-1).to(device)
token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode
h, h_lengths = self.encoder(token, token_len, streaming=streaming)
h = self.encoder_proj(h)
# get conditions
conds = torch.zeros(feat.shape, device=token.device)
for i, j in enumerate(feat_len):
if random.random() < 0.5:
continue
index = random.randint(0, int(0.3 * j))
conds[i, :index] = feat[i, :index]
conds = conds.transpose(1, 2)
mask = (~make_pad_mask(h_lengths.sum(dim=-1).squeeze(dim=1))).to(h)
loss, _ = self.decoder.compute_loss(
feat.transpose(1, 2).contiguous(),
mask.unsqueeze(1),
h.transpose(1, 2).contiguous(),
embedding,
cond=conds,
streaming=streaming,
)
return {'loss': loss}
@torch.inference_mode()
def inference(self,
token,
token_len,
prompt_token,
prompt_token_len,
prompt_feat,
prompt_feat_len,
embedding,
streaming,
finalize):
assert token.shape[0] == 1
# xvec projection
embedding = F.normalize(embedding, dim=1)
embedding = self.spk_embed_affine_layer(embedding)
# concat text and prompt_text
token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len
mask = (~make_pad_mask(token_len)).unsqueeze(-1).to(embedding)
token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode
if finalize is True:
h, h_lengths = self.encoder(token, token_len, streaming=streaming)
else:
token, context = token[:, :-self.pre_lookahead_len], token[:, -self.pre_lookahead_len:]
h, h_lengths = self.encoder(token, token_len, context=context, streaming=streaming)
mel_len1, mel_len2 = prompt_feat.shape[1], h.shape[1] - prompt_feat.shape[1]
h = self.encoder_proj(h)
# get conditions
conds = torch.zeros([1, mel_len1 + mel_len2, self.output_size], device=token.device).to(h.dtype)
conds[:, :mel_len1] = prompt_feat
conds = conds.transpose(1, 2)
mask = (~make_pad_mask(torch.tensor([mel_len1 + mel_len2]))).to(h)
feat, _ = self.decoder(
mu=h.transpose(1, 2).contiguous(),
mask=mask.unsqueeze(1),
spks=embedding,
cond=conds,
n_timesteps=10,
streaming=streaming
)
feat = feat[:, :, mel_len1:]
assert feat.shape[2] == mel_len2
return feat.float(), None
class CausalMaskedDiffWithDiT(torch.nn.Module):
def __init__(self,
input_size: int = 512,
output_size: int = 80,
spk_embed_dim: int = 192,
output_type: str = "mel",
vocab_size: int = 4096,
input_frame_rate: int = 50,
only_mask_loss: bool = True,
token_mel_ratio: int = 2,
pre_lookahead_len: int = 3,
pre_lookahead_layer: torch.nn.Module = None,
decoder: torch.nn.Module = None,
decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1,
'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine',
'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}),
'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64,
'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}}):
super().__init__()
self.input_size = input_size
self.output_size = output_size
self.decoder_conf = decoder_conf
self.vocab_size = vocab_size
self.output_type = output_type
self.input_frame_rate = input_frame_rate
logging.info(f"input frame rate={self.input_frame_rate}")
self.input_embedding = nn.Embedding(vocab_size, input_size)
self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, output_size)
self.pre_lookahead_len = pre_lookahead_len
self.pre_lookahead_layer = pre_lookahead_layer
self.decoder = decoder
self.only_mask_loss = only_mask_loss
self.token_mel_ratio = token_mel_ratio
if online_feature is True:
self.speech_token_extractor = SpeechTokenExtractor(model_path=os.path.join(onnx_path, 'speech_tokenizer_v3.batch.onnx'))
def forward(
self,
batch: dict,
device: torch.device,
) -> Dict[str, Optional[torch.Tensor]]:
if 'speech_token' not in batch:
token, token_len = self.speech_token_extractor.inference(batch['whisper_feat'], batch['whisper_feat_len'], device)
else:
token = batch['speech_token'].to(device)
token_len = batch['speech_token_len'].to(device)
feat = batch['speech_feat'].to(device)
feat_len = batch['speech_feat_len'].to(device)
embedding = batch['embedding'].to(device)
# NOTE unified training: randomly mix streaming (static_chunk_size > 0) and non-streaming (static_chunk_size == 0) modes
streaming = random.random() < 0.5
# xvec projection
embedding = F.normalize(embedding, dim=1)
embedding = self.spk_embed_affine_layer(embedding)
# concat text and prompt_text
mask = (~make_pad_mask(token_len)).float().unsqueeze(-1).to(device)
token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode
h = self.pre_lookahead_layer(token)
h = h.repeat_interleave(self.token_mel_ratio, dim=1)
mask = mask.repeat_interleave(self.token_mel_ratio, dim=1).squeeze(dim=-1)
# get conditions
conds = torch.zeros(feat.shape, device=token.device)
for i, j in enumerate(feat_len):
if random.random() < 0.5:
continue
index = random.randint(0, int(0.3 * j))
conds[i, :index] = feat[i, :index]
conds = conds.transpose(1, 2)
loss, _ = self.decoder.compute_loss(
feat.transpose(1, 2).contiguous(),
mask.unsqueeze(1),
h.transpose(1, 2).contiguous(),
embedding,
cond=conds,
streaming=streaming,
)
return {'loss': loss}
@torch.inference_mode()
def inference(self,
token,
token_len,
prompt_token,
prompt_token_len,
prompt_feat,
prompt_feat_len,
embedding,
streaming,
finalize):
assert token.shape[0] == 1
# xvec projection
embedding = F.normalize(embedding, dim=1)
embedding = self.spk_embed_affine_layer(embedding)
# concat text and prompt_text
token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len
mask = (~make_pad_mask(token_len)).unsqueeze(-1).to(embedding)
token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode
if finalize is True:
h = self.pre_lookahead_layer(token)
else:
h = self.pre_lookahead_layer(token[:, :-self.pre_lookahead_len], context=token[:, -self.pre_lookahead_len:])
h = h.repeat_interleave(self.token_mel_ratio, dim=1)
mel_len1, mel_len2 = prompt_feat.shape[1], h.shape[1] - prompt_feat.shape[1]
# get conditions
conds = torch.zeros([1, mel_len1 + mel_len2, self.output_size], device=token.device).to(h.dtype)
conds[:, :mel_len1] = prompt_feat
conds = conds.transpose(1, 2)
mask = (~make_pad_mask(torch.tensor([mel_len1 + mel_len2]))).to(h)
feat, _ = self.decoder(
mu=h.transpose(1, 2).contiguous(),
mask=mask.unsqueeze(1),
spks=embedding,
cond=conds,
n_timesteps=10,
streaming=streaming
)
feat = feat[:, :, mel_len1:]
assert feat.shape[2] == mel_len2
return feat.float(), None
if __name__ == '__main__':
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
from hyperpyyaml import load_hyperpyyaml
with open('./pretrained_models/Fun-CosyVoice3-0.5B/cosyvoice3.yaml', 'r') as f:
configs = load_hyperpyyaml(f, overrides={'llm': None, 'hift': None})
model = configs['flow']
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
model.eval()
max_len = 10 * model.decoder.estimator.static_chunk_size
chunk_size = model.decoder.estimator.static_chunk_size
context_size = model.pre_lookahead_layer.pre_lookahead_len
token = torch.randint(0, 6561, size=(1, max_len)).to(device)
token_len = torch.tensor([max_len]).to(device)
prompt_token = torch.randint(0, 6561, size=(1, chunk_size)).to(device)
prompt_token_len = torch.tensor([chunk_size]).to(device)
prompt_feat = torch.rand(1, chunk_size * 2, 80).to(device)
prompt_feat_len = torch.tensor([chunk_size * 2]).to(device)
prompt_embedding = torch.rand(1, 192).to(device)
pred_gt, _ = model.inference(token, token_len, prompt_token, prompt_token_len, prompt_feat, prompt_feat_len, prompt_embedding, streaming=True, finalize=True)
for i in range(0, max_len, chunk_size):
finalize = i + chunk_size + context_size >= max_len
pred_chunk, _ = model.inference(token[:, :i + chunk_size + context_size], torch.tensor([token[:, :i + chunk_size + context_size].shape[1]]).to(device),
prompt_token, prompt_token_len, prompt_feat, prompt_feat_len, prompt_embedding, streaming=True, finalize=finalize)
pred_chunk = pred_chunk[:, :, i * model.token_mel_ratio:]
print((pred_gt[:, :, i * model.token_mel_ratio: i * model.token_mel_ratio + pred_chunk.shape[2]] - pred_chunk).abs().max().item())
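All three `forward` methods above build `conds` the same way: with probability 0.5 a sample gets no condition at all, otherwise a random prefix of up to 30% of its frames is copied in as a mel prompt. A pure-Python sketch of that sampling rule, operating on lists instead of tensors (names and values are illustrative):

```python
import random

def build_conds(feat: list, p_drop: float = 0.5, max_frac: float = 0.3) -> list:
    # start from an all-zero condition of the same length as the target
    conds = [0.0] * len(feat)
    if random.random() < p_drop:
        return conds  # no prompt for this sample
    index = random.randint(0, int(max_frac * len(feat)))
    conds[:index] = feat[:index]  # copy a random prefix as the prompt
    return conds

random.seed(0)
c = build_conds([1.0] * 100)
# the prompt, if any, is a prefix of at most 30 frames
assert len(c) == 100 and sum(c) <= 30
```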


@@ -0,0 +1,227 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
# 2025 Alibaba Inc (authors: Xiang Lyu, Bofan Zhou)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn.functional as F
from matcha.models.components.flow_matching import BASECFM
from cosyvoice.utils.common import set_all_random_seed
class ConditionalCFM(BASECFM):
def __init__(self, in_channels, cfm_params, n_spks=1, spk_emb_dim=64, estimator: torch.nn.Module = None):
super().__init__(
n_feats=in_channels,
cfm_params=cfm_params,
n_spks=n_spks,
spk_emb_dim=spk_emb_dim,
)
self.t_scheduler = cfm_params.t_scheduler
self.training_cfg_rate = cfm_params.training_cfg_rate
self.inference_cfg_rate = cfm_params.inference_cfg_rate
in_channels = in_channels + (spk_emb_dim if n_spks > 0 else 0)
# Just change the architecture of the estimator here
self.estimator = estimator
@torch.inference_mode()
def forward(self, mu, mask, n_timesteps, temperature=1.0, spks=None, cond=None, prompt_len=0, cache=torch.zeros(1, 80, 0, 2)):
"""Forward diffusion
Args:
mu (torch.Tensor): output of encoder
shape: (batch_size, n_feats, mel_timesteps)
mask (torch.Tensor): output_mask
shape: (batch_size, 1, mel_timesteps)
n_timesteps (int): number of diffusion steps
temperature (float, optional): temperature for scaling noise. Defaults to 1.0.
spks (torch.Tensor, optional): speaker embedding. Defaults to None.
shape: (batch_size, spk_emb_dim)
cond (torch.Tensor, optional): prompt mel frames used as condition
shape: (batch_size, n_feats, mel_timesteps)
Returns:
sample: generated mel-spectrogram
shape: (batch_size, n_feats, mel_timesteps)
"""
z = torch.randn_like(mu).to(mu.device).to(mu.dtype) * temperature
cache_size = cache.shape[2]
# fix prompt and overlap part mu and z
if cache_size != 0:
z[:, :, :cache_size] = cache[:, :, :, 0]
mu[:, :, :cache_size] = cache[:, :, :, 1]
z_cache = torch.concat([z[:, :, :prompt_len], z[:, :, -34:]], dim=2)
mu_cache = torch.concat([mu[:, :, :prompt_len], mu[:, :, -34:]], dim=2)
cache = torch.stack([z_cache, mu_cache], dim=-1)
t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device, dtype=mu.dtype)
if self.t_scheduler == 'cosine':
t_span = 1 - torch.cos(t_span * 0.5 * torch.pi)
return self.solve_euler(z, t_span=t_span, mu=mu, mask=mask, spks=spks, cond=cond), cache
def solve_euler(self, x, t_span, mu, mask, spks, cond, streaming=False):
"""
Fixed euler solver for ODEs.
Args:
x (torch.Tensor): random noise
t_span (torch.Tensor): n_timesteps interpolated
shape: (n_timesteps + 1,)
mu (torch.Tensor): output of encoder
shape: (batch_size, n_feats, mel_timesteps)
mask (torch.Tensor): output_mask
shape: (batch_size, 1, mel_timesteps)
spks (torch.Tensor, optional): speaker embedding. Defaults to None.
shape: (batch_size, spk_emb_dim)
cond (torch.Tensor, optional): prompt mel frames used as condition
shape: (batch_size, n_feats, mel_timesteps)
"""
t, _, dt = t_span[0], t_span[-1], t_span[1] - t_span[0]
t = t.unsqueeze(dim=0)
# I am storing this because I can later plot it by putting a debugger here and saving it to a file
# Or in future might add like a return_all_steps flag
sol = []
# Do not use torch.concat here; it can change the memory format and make TensorRT inference return wrong results!
# NOTE when flow runs in amp mode, x.dtype is float32, which causes NaN in TensorRT fp16 inference, so use dtype=spks.dtype
x_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=spks.dtype)
mask_in = torch.zeros([2, 1, x.size(2)], device=x.device, dtype=spks.dtype)
mu_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=spks.dtype)
t_in = torch.zeros([2], device=x.device, dtype=spks.dtype)
spks_in = torch.zeros([2, 80], device=x.device, dtype=spks.dtype)
cond_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=spks.dtype)
for step in range(1, len(t_span)):
# Classifier-Free Guidance inference introduced in VoiceBox
x_in[:] = x
mask_in[:] = mask
mu_in[0] = mu
t_in[:] = t.unsqueeze(0)
spks_in[0] = spks
cond_in[0] = cond
dphi_dt = self.forward_estimator(
x_in, mask_in,
mu_in, t_in,
spks_in,
cond_in,
streaming
)
dphi_dt, cfg_dphi_dt = torch.split(dphi_dt, [x.size(0), x.size(0)], dim=0)
dphi_dt = ((1.0 + self.inference_cfg_rate) * dphi_dt - self.inference_cfg_rate * cfg_dphi_dt)
x = x + dt * dphi_dt
t = t + dt
sol.append(x)
if step < len(t_span) - 1:
dt = t_span[step + 1] - t
return sol[-1].float()
def forward_estimator(self, x, mask, mu, t, spks, cond, streaming=False):
if isinstance(self.estimator, torch.nn.Module):
return self.estimator(x, mask, mu, t, spks, cond, streaming=streaming)
else:
[estimator, stream], trt_engine = self.estimator.acquire_estimator()
# NOTE need to synchronize when switching stream
torch.cuda.current_stream().synchronize()
with stream:
estimator.set_input_shape('x', (2, 80, x.size(2)))
estimator.set_input_shape('mask', (2, 1, x.size(2)))
estimator.set_input_shape('mu', (2, 80, x.size(2)))
estimator.set_input_shape('t', (2,))
estimator.set_input_shape('spks', (2, 80))
estimator.set_input_shape('cond', (2, 80, x.size(2)))
data_ptrs = [x.contiguous().data_ptr(),
mask.contiguous().data_ptr(),
mu.contiguous().data_ptr(),
t.contiguous().data_ptr(),
spks.contiguous().data_ptr(),
cond.contiguous().data_ptr(),
x.data_ptr()]
for i, j in enumerate(data_ptrs):
estimator.set_tensor_address(trt_engine.get_tensor_name(i), j)
# run trt engine
assert estimator.execute_async_v3(torch.cuda.current_stream().cuda_stream) is True
torch.cuda.current_stream().synchronize()
self.estimator.release_estimator(estimator, stream)
return x
def compute_loss(self, x1, mask, mu, spks=None, cond=None, streaming=False):
"""Computes diffusion loss
Args:
x1 (torch.Tensor): Target
shape: (batch_size, n_feats, mel_timesteps)
mask (torch.Tensor): target mask
shape: (batch_size, 1, mel_timesteps)
mu (torch.Tensor): output of encoder
shape: (batch_size, n_feats, mel_timesteps)
spks (torch.Tensor, optional): speaker embedding. Defaults to None.
shape: (batch_size, spk_emb_dim)
Returns:
loss: conditional flow matching loss
y: conditional flow
shape: (batch_size, n_feats, mel_timesteps)
"""
b, _, t = mu.shape
# random timestep
t = torch.rand([b, 1, 1], device=mu.device, dtype=mu.dtype)
# sample noise p(x_0)
z = torch.randn_like(x1)
y = (1 - (1 - self.sigma_min) * t) * z + t * x1
u = x1 - (1 - self.sigma_min) * z
# during training, we randomly drop condition to trade off mode coverage and sample fidelity
if self.training_cfg_rate > 0:
cfg_mask = torch.rand(b, device=x1.device) > self.training_cfg_rate
mu = mu * cfg_mask.view(-1, 1, 1)
spks = spks * cfg_mask.view(-1, 1)
cond = cond * cfg_mask.view(-1, 1, 1)
pred = self.estimator(y, mask, mu, t.squeeze(), spks, cond, streaming=streaming)
loss = F.mse_loss(pred * mask, u * mask, reduction="sum") / (torch.sum(mask) * u.shape[1])
return loss, y
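`compute_loss` above regresses the estimator onto the optimal-transport flow-matching target: along the path y(t) = (1 − (1 − σ_min)·t)·z + t·x₁, the velocity dy/dt is the constant u = x₁ − (1 − σ_min)·z. A scalar sketch checking that identity numerically (σ_min and the sample values are illustrative):

```python
sigma_min = 1e-6
z, x1 = -0.7, 1.3   # noise sample and target (illustrative scalars)

def y(t: float) -> float:
    # interpolation path between noise z (t=0) and target x1 (t=1)
    return (1 - (1 - sigma_min) * t) * z + t * x1

u = x1 - (1 - sigma_min) * z  # flow-matching regression target

# the finite-difference derivative of the path matches u at any t
for t in (0.1, 0.5, 0.9):
    h = 1e-6
    dy_dt = (y(t + h) - y(t - h)) / (2 * h)
    assert abs(dy_dt - u) < 1e-6
```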
class CausalConditionalCFM(ConditionalCFM):
def __init__(self, in_channels, cfm_params, n_spks=1, spk_emb_dim=64, estimator: torch.nn.Module = None):
super().__init__(in_channels, cfm_params, n_spks, spk_emb_dim, estimator)
set_all_random_seed(0)
self.rand_noise = torch.randn([1, 80, 50 * 300])
@torch.inference_mode()
def forward(self, mu, mask, n_timesteps, temperature=1.0, spks=None, cond=None, streaming=False):
"""Forward diffusion
Args:
mu (torch.Tensor): output of encoder
shape: (batch_size, n_feats, mel_timesteps)
mask (torch.Tensor): output_mask
shape: (batch_size, 1, mel_timesteps)
n_timesteps (int): number of diffusion steps
temperature (float, optional): temperature for scaling noise. Defaults to 1.0.
spks (torch.Tensor, optional): speaker embedding. Defaults to None.
shape: (batch_size, spk_emb_dim)
cond (torch.Tensor, optional): prompt mel frames used as condition
shape: (batch_size, n_feats, mel_timesteps)
Returns:
sample: generated mel-spectrogram
shape: (batch_size, n_feats, mel_timesteps)
"""
z = self.rand_noise[:, :, :mu.size(2)].to(mu.device).to(mu.dtype) * temperature
# fix prompt and overlap part mu and z
t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device, dtype=mu.dtype)
if self.t_scheduler == 'cosine':
t_span = 1 - torch.cos(t_span * 0.5 * torch.pi)
return self.solve_euler(z, t_span=t_span, mu=mu, mask=mask, spks=spks, cond=cond, streaming=streaming), None
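When `t_scheduler == 'cosine'`, the uniform grid is warped by t ↦ 1 − cos(π·t/2), which takes smaller steps near t = 0 (pure noise) and larger steps near t = 1. A plain-Python sketch of that warp (the helper name is illustrative):

```python
import math

def cosine_t_span(n_timesteps):
    """Uniform [0, 1] grid warped by 1 - cos(t * pi / 2), as in forward() above."""
    lin = [i / n_timesteps for i in range(n_timesteps + 1)]
    return [1 - math.cos(t * 0.5 * math.pi) for t in lin]
```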


@@ -0,0 +1,70 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Zhihao Du)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Tuple
import torch.nn as nn
import torch
from torch.nn import functional as F
from cosyvoice.utils.mask import make_pad_mask
class InterpolateRegulator(nn.Module):
def __init__(
self,
channels: int,
sampling_ratios: Tuple,
out_channels: int = None,
groups: int = 1,
):
super().__init__()
self.sampling_ratios = sampling_ratios
out_channels = out_channels or channels
model = nn.ModuleList([])
if len(sampling_ratios) > 0:
for _ in sampling_ratios:
module = nn.Conv1d(channels, channels, 3, 1, 1)
norm = nn.GroupNorm(groups, channels)
act = nn.Mish()
model.extend([module, norm, act])
model.append(
nn.Conv1d(channels, out_channels, 1, 1)
)
self.model = nn.Sequential(*model)
def forward(self, x, ylens=None):
# x in (B, T, D)
mask = (~make_pad_mask(ylens)).to(x).unsqueeze(-1)
x = F.interpolate(x.transpose(1, 2).contiguous(), size=ylens.max(), mode='linear')
out = self.model(x).transpose(1, 2).contiguous()
olens = ylens
return out * mask, olens
def inference(self, x1, x2, mel_len1, mel_len2, input_frame_rate=50):
# in inference mode, interpolate the prompt tokens and the generated tokens (head/mid/tail) separately, so we get a clean separation point in the mel
# NOTE 20 corresponds to token_overlap_len in cosyvoice/cli/model.py
# x in (B, T, D)
if x2.shape[1] > 40:
x2_head = F.interpolate(x2[:, :20].transpose(1, 2).contiguous(), size=int(20 / input_frame_rate * 22050 / 256), mode='linear')
x2_mid = F.interpolate(x2[:, 20:-20].transpose(1, 2).contiguous(), size=mel_len2 - int(20 / input_frame_rate * 22050 / 256) * 2,
mode='linear')
x2_tail = F.interpolate(x2[:, -20:].transpose(1, 2).contiguous(), size=int(20 / input_frame_rate * 22050 / 256), mode='linear')
x2 = torch.concat([x2_head, x2_mid, x2_tail], dim=2)
else:
x2 = F.interpolate(x2.transpose(1, 2).contiguous(), size=mel_len2, mode='linear')
if x1.shape[1] != 0:
x1 = F.interpolate(x1.transpose(1, 2).contiguous(), size=mel_len1, mode='linear')
x = torch.concat([x1, x2], dim=2)
else:
x = x2
out = self.model(x).transpose(1, 2).contiguous()
return out, mel_len1 + mel_len2
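The recurring expression `int(20 / input_frame_rate * 22050 / 256)` converts the 20-token overlap into mel frames: 20 tokens at 50 Hz is 0.4 s of audio, which at a 22050 Hz sample rate with hop length 256 covers 34 frames. A plain-Python sketch of that arithmetic (the helper name is illustrative):

```python
def tokens_to_mel_frames(n_tokens=20, input_frame_rate=50, sample_rate=22050, hop_length=256):
    """Number of mel frames spanned by n_tokens speech tokens."""
    return int(n_tokens / input_frame_rate * sample_rate / hop_length)
```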


@@ -0,0 +1,230 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
try:
from torch.nn.utils.parametrizations import weight_norm, spectral_norm
except ImportError:
from torch.nn.utils import weight_norm, spectral_norm
from typing import List, Optional, Tuple
from einops import rearrange
from torchaudio.transforms import Spectrogram
LRELU_SLOPE = 0.1
class MultipleDiscriminator(nn.Module):
def __init__(
self, mpd: nn.Module, mrd: nn.Module
):
super().__init__()
self.mpd = mpd
self.mrd = mrd
def forward(self, y: torch.Tensor, y_hat: torch.Tensor):
y_d_rs, y_d_gs, fmap_rs, fmap_gs = [], [], [], []
this_y_d_rs, this_y_d_gs, this_fmap_rs, this_fmap_gs = self.mpd(y.unsqueeze(dim=1), y_hat.unsqueeze(dim=1))
y_d_rs += this_y_d_rs
y_d_gs += this_y_d_gs
fmap_rs += this_fmap_rs
fmap_gs += this_fmap_gs
this_y_d_rs, this_y_d_gs, this_fmap_rs, this_fmap_gs = self.mrd(y, y_hat)
y_d_rs += this_y_d_rs
y_d_gs += this_y_d_gs
fmap_rs += this_fmap_rs
fmap_gs += this_fmap_gs
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class MultiResolutionDiscriminator(nn.Module):
def __init__(
self,
fft_sizes: Tuple[int, ...] = (2048, 1024, 512),
num_embeddings: Optional[int] = None,
):
"""
Multi-Resolution Discriminator module adapted from https://github.com/descriptinc/descript-audio-codec.
Additionally, it allows incorporating conditional information with a learned embeddings table.
Args:
fft_sizes (tuple[int]): Tuple of window lengths for FFT. Defaults to (2048, 1024, 512).
num_embeddings (int, optional): Number of embeddings. None means non-conditional discriminator.
Defaults to None.
"""
super().__init__()
self.discriminators = nn.ModuleList(
[DiscriminatorR(window_length=w, num_embeddings=num_embeddings) for w in fft_sizes]
)
def forward(
self, y: torch.Tensor, y_hat: torch.Tensor, bandwidth_id: torch.Tensor = None
) -> Tuple[List[torch.Tensor], List[torch.Tensor], List[List[torch.Tensor]], List[List[torch.Tensor]]]:
y_d_rs = []
y_d_gs = []
fmap_rs = []
fmap_gs = []
for d in self.discriminators:
y_d_r, fmap_r = d(x=y, cond_embedding_id=bandwidth_id)
y_d_g, fmap_g = d(x=y_hat, cond_embedding_id=bandwidth_id)
y_d_rs.append(y_d_r)
fmap_rs.append(fmap_r)
y_d_gs.append(y_d_g)
fmap_gs.append(fmap_g)
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class DiscriminatorR(nn.Module):
def __init__(
self,
window_length: int,
num_embeddings: Optional[int] = None,
channels: int = 32,
hop_factor: float = 0.25,
bands: Tuple[Tuple[float, float], ...] = ((0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)),
):
super().__init__()
self.window_length = window_length
self.hop_factor = hop_factor
self.spec_fn = Spectrogram(
n_fft=window_length, hop_length=int(window_length * hop_factor), win_length=window_length, power=None
)
n_fft = window_length // 2 + 1
bands = [(int(b[0] * n_fft), int(b[1] * n_fft)) for b in bands]
self.bands = bands
convs = lambda: nn.ModuleList(
[
weight_norm(nn.Conv2d(2, channels, (3, 9), (1, 1), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 3), (1, 1), padding=(1, 1))),
]
)
self.band_convs = nn.ModuleList([convs() for _ in range(len(self.bands))])
if num_embeddings is not None:
self.emb = torch.nn.Embedding(num_embeddings=num_embeddings, embedding_dim=channels)
torch.nn.init.zeros_(self.emb.weight)
self.conv_post = weight_norm(nn.Conv2d(channels, 1, (3, 3), (1, 1), padding=(1, 1)))
def spectrogram(self, x):
# Remove DC offset
x = x - x.mean(dim=-1, keepdims=True)
# Peak normalize the volume of input audio
x = 0.8 * x / (x.abs().max(dim=-1, keepdim=True)[0] + 1e-9)
x = self.spec_fn(x)
x = torch.view_as_real(x)
x = rearrange(x, "b f t c -> b c t f")
# Split into bands
x_bands = [x[..., b[0]: b[1]] for b in self.bands]
return x_bands
def forward(self, x: torch.Tensor, cond_embedding_id: torch.Tensor = None):
x_bands = self.spectrogram(x)
fmap = []
x = []
for band, stack in zip(x_bands, self.band_convs):
for i, layer in enumerate(stack):
band = layer(band)
band = torch.nn.functional.leaky_relu(band, 0.1)
if i > 0:
fmap.append(band)
x.append(band)
x = torch.cat(x, dim=-1)
if cond_embedding_id is not None:
emb = self.emb(cond_embedding_id)
h = (emb.view(1, -1, 1, 1) * x).sum(dim=1, keepdims=True)
else:
h = 0
x = self.conv_post(x)
fmap.append(x)
x += h
return x, fmap
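`DiscriminatorR` maps the fractional bands above to FFT-bin ranges by scaling each fraction by n_fft = window_length // 2 + 1 and truncating. A plain-Python sketch of that bin arithmetic (the helper name is illustrative), showing the bands tile the spectrum without gaps:

```python
def band_bins(window_length,
              bands=((0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0))):
    """Map fractional frequency bands to FFT-bin index ranges, as in DiscriminatorR.__init__."""
    n_fft = window_length // 2 + 1
    return [(int(lo * n_fft), int(hi * n_fft)) for lo, hi in bands]
```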
class MultiResSpecDiscriminator(torch.nn.Module):
def __init__(self,
fft_sizes=[1024, 2048, 512],
hop_sizes=[120, 240, 50],
win_lengths=[600, 1200, 240],
window="hann_window"):
super(MultiResSpecDiscriminator, self).__init__()
self.discriminators = nn.ModuleList([
SpecDiscriminator(fft_sizes[0], hop_sizes[0], win_lengths[0], window),
SpecDiscriminator(fft_sizes[1], hop_sizes[1], win_lengths[1], window),
SpecDiscriminator(fft_sizes[2], hop_sizes[2], win_lengths[2], window)])
def forward(self, y, y_hat):
y_d_rs = []
y_d_gs = []
fmap_rs = []
fmap_gs = []
for _, d in enumerate(self.discriminators):
y_d_r, fmap_r = d(y)
y_d_g, fmap_g = d(y_hat)
y_d_rs.append(y_d_r)
fmap_rs.append(fmap_r)
y_d_gs.append(y_d_g)
fmap_gs.append(fmap_g)
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
def stft(x, fft_size, hop_size, win_length, window):
"""Perform STFT and convert to magnitude spectrogram.
Args:
x (Tensor): Input signal tensor (B, T).
fft_size (int): FFT size.
hop_size (int): Hop size.
win_length (int): Window length.
window (str): Window function type.
Returns:
Tensor: Magnitude spectrogram (B, #frames, fft_size // 2 + 1).
"""
x_stft = torch.stft(x, fft_size, hop_size, win_length, window, return_complex=True)
# NOTE(kan-bayashi): clamp is needed to avoid nan or inf
return torch.abs(x_stft).transpose(2, 1)
class SpecDiscriminator(nn.Module):
"""docstring for Discriminator."""
def __init__(self, fft_size=1024, shift_size=120, win_length=600, window="hann_window", use_spectral_norm=False):
super(SpecDiscriminator, self).__init__()
norm_f = weight_norm if use_spectral_norm is False else spectral_norm
self.fft_size = fft_size
self.shift_size = shift_size
self.win_length = win_length
self.window = getattr(torch, window)(win_length)
self.discriminators = nn.ModuleList([
norm_f(nn.Conv2d(1, 32, kernel_size=(3, 9), padding=(1, 4))),
norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4))),
norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4))),
norm_f(nn.Conv2d(32, 32, kernel_size=(3, 9), stride=(1, 2), padding=(1, 4))),
norm_f(nn.Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))),
])
self.out = norm_f(nn.Conv2d(32, 1, 3, 1, 1))
def forward(self, y):
fmap = []
y = y.squeeze(1)
y = stft(y, self.fft_size, self.shift_size, self.win_length, self.window.to(y.device))
y = y.unsqueeze(1)
for _, d in enumerate(self.discriminators):
y = d(y)
y = F.leaky_relu(y, LRELU_SLOPE)
fmap.append(y)
y = self.out(y)
fmap.append(y)
return torch.flatten(y, 1, -1), fmap
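With `torch.stft`'s default `center=True`, each resolution produces 1 + T // hop_size frames, so the three (fft_size, hop_size, win_length) settings above view the same waveform at different time/frequency trade-offs. A plain-Python sketch of the frame count (assuming the default centered padding):

```python
def stft_frames(n_samples, hop_size):
    """Frame count of torch.stft with its default center=True padding."""
    return 1 + n_samples // hop_size
```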


@@ -0,0 +1,103 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu, Kai Hu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
try:
from torch.nn.utils.parametrizations import weight_norm
except ImportError:
from torch.nn.utils import weight_norm
from cosyvoice.transformer.convolution import CausalConv1d
class ConvRNNF0Predictor(nn.Module):
def __init__(self,
num_class: int = 1,
in_channels: int = 80,
cond_channels: int = 512
):
super().__init__()
self.num_class = num_class
self.condnet = nn.Sequential(
weight_norm(
nn.Conv1d(in_channels, cond_channels, kernel_size=3, padding=1)
),
nn.ELU(),
weight_norm(
nn.Conv1d(cond_channels, cond_channels, kernel_size=3, padding=1)
),
nn.ELU(),
weight_norm(
nn.Conv1d(cond_channels, cond_channels, kernel_size=3, padding=1)
),
nn.ELU(),
weight_norm(
nn.Conv1d(cond_channels, cond_channels, kernel_size=3, padding=1)
),
nn.ELU(),
weight_norm(
nn.Conv1d(cond_channels, cond_channels, kernel_size=3, padding=1)
),
nn.ELU(),
)
self.classifier = nn.Linear(in_features=cond_channels, out_features=self.num_class)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.condnet(x)
x = x.transpose(1, 2)
return torch.abs(self.classifier(x).squeeze(-1))
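The `condnet` above stacks five kernel-3, padding-1 convolutions, so each predicted F0 value sees 1 + 5·(3 − 1) = 11 input mel frames. A plain-Python sketch of that receptive-field arithmetic (the helper name is illustrative):

```python
def receptive_field(num_layers=5, kernel_size=3, dilation=1):
    """Receptive field of a stack of stride-1 convolutions with equal dilation."""
    return 1 + num_layers * (kernel_size - 1) * dilation
```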
class CausalConvRNNF0Predictor(nn.Module):
def __init__(self,
num_class: int = 1,
in_channels: int = 80,
cond_channels: int = 512
):
super().__init__()
self.num_class = num_class
self.condnet = nn.Sequential(
weight_norm(
CausalConv1d(in_channels, cond_channels, kernel_size=4, causal_type='right')
),
nn.ELU(),
weight_norm(
CausalConv1d(cond_channels, cond_channels, kernel_size=3, causal_type='left')
),
nn.ELU(),
weight_norm(
CausalConv1d(cond_channels, cond_channels, kernel_size=3, causal_type='left')
),
nn.ELU(),
weight_norm(
CausalConv1d(cond_channels, cond_channels, kernel_size=3, causal_type='left')
),
nn.ELU(),
weight_norm(
CausalConv1d(cond_channels, cond_channels, kernel_size=3, causal_type='left')
),
nn.ELU(),
)
self.classifier = nn.Linear(in_features=cond_channels, out_features=self.num_class)
def forward(self, x: torch.Tensor, finalize: bool = True) -> torch.Tensor:
if finalize is True:
x = self.condnet[0](x)
else:
x = self.condnet[0](x[:, :, :-self.condnet[0].causal_padding], x[:, :, -self.condnet[0].causal_padding:])
for i in range(1, len(self.condnet)):
x = self.condnet[i](x)
x = x.transpose(1, 2)
return torch.abs(self.classifier(x).squeeze(-1))

Some files were not shown because too many files have changed in this diff.