From 2543a270c154bbdd27db4030385beec21177f307 Mon Sep 17 00:00:00 2001
From: Kevin Wong
Date: Wed, 21 Jan 2026 10:40:07 +0800
Subject: [PATCH] Update documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 Docs/DEPLOY_MANUAL.md       |  14 +
 Docs/DevLogs/Day6.md        |  33 +++
 Docs/Logs.md                |  46 ---
 Docs/MuseTalk.md            | 544 ------------------------------------
 Docs/implementation_plan.md |  46 +++
 Docs/task_complete.md       |   4 +-
 README.md                   |   7 +-
 7 files changed, 101 insertions(+), 593 deletions(-)
 delete mode 100644 Docs/Logs.md
 delete mode 100644 Docs/MuseTalk.md

diff --git a/Docs/DEPLOY_MANUAL.md b/Docs/DEPLOY_MANUAL.md
index 1281bab..d9ebbad 100644
--- a/Docs/DEPLOY_MANUAL.md
+++ b/Docs/DEPLOY_MANUAL.md
@@ -87,6 +87,19 @@ playwright install chromium
 
 ---
 
+## Step 5: Start the LatentSync Persistent Acceleration Service (Optional)
+
+To eliminate the 30-40 s model-loading cost on every video generation, we recommend starting the persistent service:
+
+```bash
+cd /home/rongye/ProgramFiles/ViGent2/models/LatentSync
+
+# Start the service in the background (GPU settings are read automatically from backend/.env)
+nohup python -m scripts.server > server.log 2>&1 &
+```
+
+---
+
 ## Step 7: Configure Environment Variables
 
 ```bash
@@ -102,6 +115,7 @@ cp .env.example .env
 | Setting | Default | Description |
 |--------|--------|------|
 | `LATENTSYNC_GPU_ID` | 1 | GPU selection (0 or 1) |
+| `LATENTSYNC_USE_SERVER` | false | Set to true to enable persistent-service acceleration |
 | `LATENTSYNC_INFERENCE_STEPS` | 20 | Inference steps (20-50) |
 | `LATENTSYNC_GUIDANCE_SCALE` | 1.5 | Guidance scale (1.0-3.0) |
 | `DEBUG` | true | Set to false in production |
diff --git a/Docs/DevLogs/Day6.md b/Docs/DevLogs/Day6.md
index 591d0f4..e2f9378 100644
--- a/Docs/DevLogs/Day6.md
+++ b/Docs/DevLogs/Day6.md
@@ -208,3 +208,36 @@ CUDA_VISIBLE_DEVICES=1 python -m scripts.inference \
 - [LatentSync GitHub](https://github.com/bytedance/LatentSync)
 - [HuggingFace model](https://huggingface.co/ByteDance/LatentSync-1.6)
 - [Paper](https://arxiv.org/abs/2412.09262)
+
+---
+
+## 🐛 Fix: Output Resolution Downgrade (17:30)
+
+**Problem**: the generated video does not keep the resolution of the original video (pre-compression of the source forced the output down to 720p).
+**Cause**: an earlier performance optimization compressed every video to 720p to speed up inference, so 1080p inputs were downsampled on output.
+**Fix**: disabled the `_preprocess_video` call in `lipsync_service.py` and run inference directly on the original video; `LatentSync` then outputs at the same resolution as the input.
+**Result**:
+- ✅ Output videos keep their original resolution (1080p).
+- ⚠️ Inference time increases accordingly (roughly 20-30% longer).
+
+---
+
+## ⚡ Performance Optimization Follow-Up (18:00)
+
+### 1. Persistent Model Service (Persistent Server)
+**Goal**: eliminate the 30-40 s model-loading time on every video generation.
+**Implementation**:
+- Added `models/LatentSync/scripts/server.py` (a FastAPI service)
+- Automatically loads the backend `.env` configuration
+- The service stays resident in GPU memory and supports hot calls
+**Effect**:
+- First request: normal load (~40 s)
+- Subsequent requests: **0 s load**, straight to inference
+
+### 2. GPU Concurrency Control (Queue)
+**Goal**: prevent OOM (GPU memory overflow) when multiple users send requests at the same time.
+**Implementation**:
+- Introduced an `asyncio.Lock` in `lipsync_service.py` (see the sketch below)
+- Built a global serial queue: both remote and local calls are forced to line up
+**Effect**:
+- Even if the front end triggers several generations at once, the backend processes them one by one, keeping the system stable.
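+
+For reference, the queueing pattern above boils down to a module-level `asyncio.Lock` around the inference call. The sketch below is a minimal illustration of that idea with hypothetical names — the actual `lipsync_service.py` wires the lock into its real remote/local code paths:
+
+```python
+import asyncio
+
+# One per-process lock acts as a global FIFO queue for the single GPU:
+# waiters are woken in acquisition order, so requests run strictly one at a time.
+_gpu_lock = asyncio.Lock()
+
+async def generate_lipsync(video_path: str, audio_path: str) -> str:
+    async with _gpu_lock:  # concurrent requests wait here instead of causing OOM
+        return await _run_inference(video_path, audio_path)
+
+async def _run_inference(video_path: str, audio_path: str) -> str:
+    # Stand-in for the real subprocess / persistent-server call.
+    await asyncio.sleep(0)
+    return "output.mp4"
+```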
diff --git a/Docs/Logs.md b/Docs/Logs.md
deleted file mode 100644
index 86e79b0..0000000
--- a/Docs/Logs.md
+++ /dev/null
@@ -1,46 +0,0 @@
-(venv) rongye@r730-ubuntu:~/ProgramFiles/ViGent2/backend$ uvicorn app.main:app --host 0.0.0.0 --port 8006
-INFO: Started server process [2398255]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-INFO: Uvicorn running on http://0.0.0.0:8006 (Press CTRL+C to quit)
-INFO: 192.168.110.188:5826 - "GET /api/materials/?t=1768899244071 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "GET /api/materials/?t=1768899248452 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "GET /api/materials/?t=1768899250145 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "GET /api/materials/?t=1768899250420 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "GET /api/materials/?t=1768899250774 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "GET /api/materials/?t=1768899251257 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "OPTIONS /api/videos/generate HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "POST /api/videos/generate HTTP/1.1" 200 OK
-2026-01-20 16:54:13.143 | INFO | app.services.tts_service:generate_audio:20 - TTS Generating: 大家好,欢迎来到我的频道,今天给大家分享... (zh-CN-YunxiNeural)
-INFO: 192.168.110.188:5826 - "GET /api/videos/tasks/33c43a79-6e25-471f-873d-54d651d13474 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:5826 - "GET /api/videos/tasks/33c43a79-6e25-471f-873d-54d651d13474 HTTP/1.1" 200 OK
-[Pipeline] TTS completed in 1.4s
-2026-01-20 16:54:14.547 | INFO | app.services.lipsync_service:_check_weights:56 - ✅ LatentSync 权重文件已就绪
-[LipSync] Health check: ready=True
-[LipSync] Starting LatentSync inference...
-2026-01-20 16:54:16.799 | INFO | app.services.lipsync_service:generate:172 - 🎬 唇形同步任务: 0bc1aa95-c567-4022-8d8b-cd3e439c78c0.mov + 33c43a79-6e25-471f-873d-54d651d13474_audio.mp3
-2026-01-20 16:54:16.799 | INFO | app.services.lipsync_service:_local_generate:200 - 🔄 调用 LatentSync 推理 (subprocess)...
-2026-01-20 16:54:17.004 | INFO | app.services.lipsync_service:_preprocess_video:111 - 📹 原始视频分辨率: 1920×1080
-2026-01-20 16:54:17.005 | INFO | app.services.lipsync_service:_preprocess_video:128 - 📹 预处理视频: 1080p → 720p
-2026-01-20 16:54:18.285 | INFO | app.services.lipsync_service:_preprocess_video:152 - ✅ 视频压缩完成: 14.9MB → 1.1MB
-2026-01-20 16:54:18.285 | INFO | app.services.lipsync_service:_local_generate:237 - 🖥️ 执行命令: /home/rongye/ProgramFiles/miniconda3/envs/latentsync/bin/python -m scripts.inference --unet_config_path configs/unet/stage2_512.yaml --inference_ckpt_path checkpoints/latentsync_unet.pt --inference_steps...
-2026-01-20 16:54:18.285 | INFO | app.services.lipsync_service:_local_generate:238 - 🖥️ GPU: CUDA_VISIBLE_DEVICES=1
-2026-01-20 16:57:52.285 | INFO | app.services.lipsync_service:_local_generate:257 - LatentSync 输出:
-: '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'use_ep_level_unified_stream': '0', 'device_id': '0', 'gpu_external_alloc': '0', 'sdpa_kernel': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_free': '0', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1'}}
-model ignore: checkpoints/auxiliary/models/buffalo_l/w600k_r50.onnx recognition
-set det-size: (512, 512)
-video in 25 FPS, audio idx in 50FPS
-Affine transforming 135 faces...
-Restoring 135 faces...
-
-2026-01-20 16:57:52.287 | INFO | app.services.lipsync_service:_local_generate:262 - ✅ 唇形同步完成: /home/rongye/ProgramFiles/ViGent2/backend/outputs/33c43a79-6e25-471f-873d-54d651d13474_lipsync.mp4
-[Pipeline] LipSync completed in 217.7s
-2026-01-20 16:57:52.616 | DEBUG | app.services.video_service:_run_ffmpeg:17 - FFmpeg CMD: ffmpeg -y -i /home/rongye/ProgramFiles/ViGent2/backend/outputs/33c43a79-6e25-471f-873d-54d651d13474_lipsync.mp4 -i /home/rongye/ProgramFiles/ViGent2/backend/outputs/33c43a79-6e25-471f-873d-54d651d13474_audio.mp3 -c:v libx264 -c:a aac -shortest -map 0:v -map 1:a /home/rongye/ProgramFiles/ViGent2/backend/outputs/33c43a79-6e25-471f-873d-54d651d13474_output.mp4
-[Pipeline] Total generation time: 220.4s
-INFO: 192.168.110.188:5826 - "GET /api/videos/tasks/33c43a79-6e25-471f-873d-54d651d13474 HTTP/1.1" 200 OK
-INFO: 192.168.110.188:10104 - "GET /outputs/33c43a79-6e25-471f-873d-54d651d13474_output.mp4 HTTP/1.1" 206 Partial Content
-INFO: 192.168.110.188:6759 - "GET /outputs/33c43a79-6e25-471f-873d-54d651d13474_output.mp4 HTTP/1.1" 206 Partial Content
-INFO: 192.168.110.188:6759 - "GET /outputs/33c43a79-6e25-471f-873d-54d651d13474_output.mp4 HTTP/1.1" 304 Not Modified
-INFO: 192.168.110.188:6759 - "GET /outputs/33c43a79-6e25-471f-873d-54d651d13474_output.mp4 HTTP/1.1" 206 Partial Content
-INFO: 192.168.110.188:6759 - "GET /outputs/33c43a79-6e25-471f-873d-54d651d13474_output.mp4 HTTP/1.1" 206 Partial Content
-INFO: 192.168.110.188:10233 - "GET /outputs/33c43a79-6e25-471f-873d-54d651d13474_output.mp4 HTTP/1.1" 304 Not Modified
\ No newline at end of file
diff --git a/Docs/MuseTalk.md b/Docs/MuseTalk.md
deleted file mode 100644
index 49e76c4..0000000
--- a/Docs/MuseTalk.md
+++ /dev/null
@@ -1,544 +0,0 @@
-# MuseTalk
-
-MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling
-
-Yue Zhang\*,
-Zhizhou Zhong\*,
-Minhao Liu\*,
-Zhaokang Chen,
-Bin Wu,
-Yubin Zeng,
-Chao Zhan,
-Junxin Huang,
-Yingjie He,
-Wenjiang Zhou
-(*Equal Contribution, Corresponding Author, benbinwu@tencent.com)
-
-Lyra Lab, Tencent Music Entertainment
-
-**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[space](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **[Technical report](https://arxiv.org/abs/2410.10122)**
-
-We introduce `MuseTalk`, a **real-time, high-quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual-human solution.
-
-## 🔥 Updates
-We're excited to unveil MuseTalk 1.5.
-This version **(1)** integrates training with perceptual loss, GAN loss, and sync loss, significantly boosting its overall performance, and **(2)** adopts a two-stage training strategy and a spatio-temporal data sampling approach to strike a balance between visual quality and lip-sync accuracy.
-Learn more details [here](https://arxiv.org/abs/2410.10122).
-**The inference code, training code, and model weights of MuseTalk 1.5 are all available now!** 🚀
-
-# Overview
-`MuseTalk` is a real-time, high-quality, audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which
-
-1. modifies an unseen face according to the input audio, with a face region of size `256 x 256`.
-1. supports audio in various languages, such as Chinese, English, and Japanese.
-1. supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
-1. supports modifying the center point of the proposed face region, which **SIGNIFICANTLY** affects the generation results.
-1. provides a checkpoint trained on the HDTF dataset and a private dataset.
-
-# News
-- [04/05/2025] :mega: We are excited to announce that the training code is now open-sourced! You can now train your own MuseTalk model using our provided training scripts and configurations.
-- [03/28/2025] We are thrilled to announce the release of version 1.5. It is a significant improvement over version 1.0, with enhanced clarity, identity consistency, and precise lip-speech synchronization. We have updated the [technical report](https://arxiv.org/abs/2410.10122) with more details.
-- [10/18/2024] We release the [technical report](https://arxiv.org/abs/2410.10122v2). The report details a model superior to the open-source L1-loss version: it adds GAN and perceptual losses for improved clarity, and sync loss for better lip-sync performance.
-- [04/17/2024] We release a pipeline that utilizes MuseTalk for real-time inference.
-- [04/16/2024] Release Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to the HF team for their community grant).
-- [04/02/2024] Release MuseTalk project and pretrained models.
-
-## Model
-![Model Structure](https://github.com/user-attachments/assets/02f4a214-1bdd-4326-983c-e70b478accba)
-
-MuseTalk was trained in latent space, where the images were encoded by a frozen VAE and the audio by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`; the audio embeddings are fused with the image embeddings by cross-attention.
-
-Note that although we use an architecture very similar to Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.
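-
-In pseudocode, one generation step is therefore a single UNet forward pass rather than an iterative denoising loop. The sketch below is only an illustration of that idea under stated assumptions — `mask_lower_half` and the module interfaces are hypothetical, and the real pipeline adds face detection, cropping, and blending:
-
-```python
-import torch
-
-def mask_lower_half(img: torch.Tensor) -> torch.Tensor:
-    # Hypothetical helper: blank the mouth region so the UNet inpaints it.
-    masked = img.clone()
-    masked[..., img.shape[-2] // 2 :, :] = 0
-    return masked
-
-@torch.no_grad()
-def dub_frame(vae, audio_encoder, unet, face_image, audio_window):
-    latent = vae.encode(mask_lower_half(face_image))  # frozen VAE encoder
-    audio_emb = audio_encoder(audio_window)           # frozen whisper-tiny features
-    # Single forward pass: audio embeddings enter via cross-attention and the
-    # UNet directly predicts the inpainted latent -- no diffusion timesteps.
-    out_latent = unet(latent, context=audio_emb)
-    return vae.decode(out_latent)                     # back to pixel space
-```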
-## Cases
-
-(The n-th link under each heading below belongs to the same test case.)
-
-### Input Video
-1. https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107
-2. https://github.com/user-attachments/assets/1ce3e850-90ac-4a31-a45f-8dfa4f2960ac
-3. https://github.com/user-attachments/assets/fa3b13a1-ae26-4d1d-899e-87435f8d22b3
-4. https://github.com/user-attachments/assets/15800692-39d1-4f4c-99f2-aef044dc3251
-5. https://github.com/user-attachments/assets/a843f9c9-136d-4ed4-9303-4a7269787a60
-6. https://github.com/user-attachments/assets/6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb
-
-### MuseTalk 1.0
-1. https://github.com/user-attachments/assets/c04f3cd5-9f77-40e9-aafd-61978380d0ef
-2. https://github.com/user-attachments/assets/2051a388-1cef-4c1d-b2a2-3c1ceee5dc99
-3. https://github.com/user-attachments/assets/b5f56f71-5cdc-4e2e-a519-454242000d32
-4. https://github.com/user-attachments/assets/a5843835-04ab-4c31-989f-0995cfc22f34
-5. https://github.com/user-attachments/assets/3dc7f1d7-8747-4733-bbdd-97874af0c028
-6. https://github.com/user-attachments/assets/3c78064e-faad-4637-83ae-28452a22b09a
-
-### MuseTalk 1.5
-1. https://github.com/user-attachments/assets/999a6f5b-61dd-48e1-b902-bb3f9cbc7247
-2. https://github.com/user-attachments/assets/d26a5c9a-003c-489d-a043-c9a331456e75
-3. https://github.com/user-attachments/assets/471290d7-b157-4cf6-8a6d-7e899afa302c
-4. https://github.com/user-attachments/assets/1ee77c4c-8c70-4add-b6db-583a12faa7dc
-5. https://github.com/user-attachments/assets/370510ea-624c-43b7-bbb0-ab5333e0fcc4
-6. https://github.com/user-attachments/assets/b011ece9-a332-4bc1-b8b7-ef6e383d7bde
-
-# TODO:
-- [x] trained models and inference codes.
-- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
-- [x] codes for real-time inference.
-- [x] [technical report](https://arxiv.org/abs/2410.10122v2).
-- [x] a better model with updated [technical report](https://arxiv.org/abs/2410.10122).
-- [x] real-time inference code for version 1.5.
-- [x] training and data preprocessing codes.
-- [ ] You are **always** welcome to submit issues and PRs to improve this repository! 😊
-
-# Getting Started
-We provide a detailed tutorial covering the installation and basic usage of MuseTalk for new users:
-
-## Third party integration
-Thanks to the community for the third-party integrations, which make installation and use more convenient for everyone.
-Please note that we have not verified, maintained, or updated these third-party integrations; refer to the respective project for specifics.
-
-### [ComfyUI](https://github.com/chaojie/ComfyUI-MuseTalk)
-
-## Installation
-To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
-
-### Build environment
-We recommend Python 3.10 and CUDA 11.8 (matching the PyTorch builds below). Set up your environment as follows:
-
-```shell
-conda create -n MuseTalk python=3.10
-conda activate MuseTalk
-```
-
-### Install PyTorch 2.0.1
-Choose one of the following installation methods:
-
-```shell
-# Option 1: Using pip
-pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
-
-# Option 2: Using conda
-conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
-```
-
-### Install Dependencies
-Install the remaining required packages:
-
-```shell
-pip install -r requirements.txt
-```
-
-### Install MMLab Packages
-Install the MMLab ecosystem packages:
-
-```bash
-pip install --no-cache-dir -U openmim
-mim install mmengine
-mim install "mmcv==2.0.1"
-mim install "mmdet==3.1.0"
-mim install "mmpose==1.1.0"
-```
-
-### Setup FFmpeg
-1. [Download](https://github.com/BtbN/FFmpeg-Builds/releases) the ffmpeg-static package.
-
-2. Configure FFmpeg based on your operating system:
-
-For Linux:
-```bash
-export FFMPEG_PATH=/path/to/ffmpeg
-# Example:
-export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
-```
-
-For Windows:
-Add the `ffmpeg-xxx\bin` directory to your system's PATH environment variable. Verify the installation by running `ffmpeg -version` in the command prompt - it should display the ffmpeg version information.
-
-### Download weights
-You can download the weights in two ways:
-
-#### Option 1: Using Download Scripts
-We provide two scripts for automatic downloading:
-
-For Linux:
-```bash
-sh ./download_weights.sh
-```
-
-For Windows:
-```batch
-:: Run the script
-download_weights.bat
-```
-
-#### Option 2: Manual Download
-You can also download the weights manually from the following links:
-
-1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk/tree/main)
-2. 
Download the weights of other components: - - [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main) - - [whisper](https://huggingface.co/openai/whisper-tiny/tree/main) - - [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main) - - [syncnet](https://huggingface.co/ByteDance/LatentSync/tree/main) - - [face-parse-bisent](https://drive.google.com/file/d/154JgKpzCPW82qINcVieuPH3fZ2e0P812/view?pli=1) - - [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth) - -Finally, these weights should be organized in `models` as follows: -``` -./models/ -├── musetalk -│ └── musetalk.json -│ └── pytorch_model.bin -├── musetalkV15 -│ └── musetalk.json -│ └── unet.pth -├── syncnet -│ └── latentsync_syncnet.pt -├── dwpose -│ └── dw-ll_ucoco_384.pth -├── face-parse-bisent -│ ├── 79999_iter.pth -│ └── resnet18-5c106cde.pth -├── sd-vae -│ ├── config.json -│ └── diffusion_pytorch_model.bin -└── whisper - ├── config.json - ├── pytorch_model.bin - └── preprocessor_config.json - -``` -## Quickstart - -### Inference -We provide inference scripts for both versions of MuseTalk: - -#### Prerequisites -Before running inference, please ensure ffmpeg is installed and accessible: -```bash -# Check ffmpeg installation -ffmpeg -version -``` -If ffmpeg is not found, please install it first: -- Windows: Download from [ffmpeg-static](https://github.com/BtbN/FFmpeg-Builds/releases) and add to PATH -- Linux: `sudo apt-get install ffmpeg` - -#### Normal Inference -##### Linux Environment -```bash -# MuseTalk 1.5 (Recommended) -sh inference.sh v1.5 normal - -# MuseTalk 1.0 -sh inference.sh v1.0 normal -``` - -##### Windows Environment - -Please ensure that you set the `ffmpeg_path` to match the actual location of your FFmpeg installation. - -```bash -# MuseTalk 1.5 (Recommended) -python -m scripts.inference --inference_config configs\inference\test.yaml --result_dir results\test --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin - -# For MuseTalk 1.0, change: -# - models\musetalkV15 -> models\musetalk -# - unet.pth -> pytorch_model.bin -# - --version v15 -> --version v1 -``` - -#### Real-time Inference -##### Linux Environment -```bash -# MuseTalk 1.5 (Recommended) -sh inference.sh v1.5 realtime - -# MuseTalk 1.0 -sh inference.sh v1.0 realtime -``` - -##### Windows Environment -```bash -# MuseTalk 1.5 (Recommended) -python -m scripts.realtime_inference --inference_config configs\inference\realtime.yaml --result_dir results\realtime --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --fps 25 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin - -# For MuseTalk 1.0, change: -# - models\musetalkV15 -> models\musetalk -# - unet.pth -> pytorch_model.bin -# - --version v15 -> --version v1 -``` - -The configuration file `configs/inference/test.yaml` contains the inference settings, including: -- `video_path`: Path to the input video, image file, or directory of images -- `audio_path`: Path to the input audio file - -Note: For optimal results, we recommend using input videos with 25fps, which is the same fps used during model training. If your video has a lower frame rate, you can use frame interpolation or convert it to 25fps using ffmpeg. - -Important notes for real-time inference: -1. Set `preparation` to `True` when processing a new avatar -2. 
After preparation, the avatar will generate videos using audio clips from `audio_clips` -3. The generation process can achieve 30fps+ on an NVIDIA Tesla V100 -4. Set `preparation` to `False` for generating more videos with the same avatar - -For faster generation without saving images, you can use: -```bash -python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images -``` - -## Gradio Demo -We provide an intuitive web interface through Gradio for users to easily adjust input parameters. To optimize inference time, users can generate only the **first frame** to fine-tune the best lip-sync parameters, which helps reduce facial artifacts in the final output. -![para](assets/figs/gradio_2.png) -For minimum hardware requirements, we tested the system on a Windows environment using an NVIDIA GeForce RTX 3050 Ti Laptop GPU with 4GB VRAM. In fp16 mode, generating an 8-second video takes approximately 5 minutes. ![speed](assets/figs/gradio.png) - -Both Linux and Windows users can launch the demo using the following command. Please ensure that the `ffmpeg_path` parameter matches your actual FFmpeg installation path: - -```bash -# You can remove --use_float16 for better quality, but it will increase VRAM usage and inference time -python app.py --use_float16 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin -``` - -## Training - -### Data Preparation -To train MuseTalk, you need to prepare your dataset following these steps: - -1. **Place your source videos** - - For example, if you're using the HDTF dataset, place all your video files in `./dataset/HDTF/source`. - -2. **Run the preprocessing script** - ```bash - python -m scripts.preprocess --config ./configs/training/preprocess.yaml - ``` - This script will: - - Extract frames from videos - - Detect and align faces - - Generate audio features - - Create the necessary data structure for training - -### Training Process -After data preprocessing, you can start the training process: - -1. **First Stage** - ```bash - sh train.sh stage1 - ``` - -2. **Second Stage** - ```bash - sh train.sh stage2 - ``` - -### Configuration Adjustment -Before starting the training, you should adjust the configuration files according to your hardware and requirements: - -1. **GPU Configuration** (`configs/training/gpu.yaml`): - - `gpu_ids`: Specify the GPU IDs you want to use (e.g., "0,1,2,3") - - `num_processes`: Set this to match the number of GPUs you're using - -2. **Stage 1 Configuration** (`configs/training/stage1.yaml`): - - `data.train_bs`: Adjust batch size based on your GPU memory (default: 32) - - `data.n_sample_frames`: Number of sampled frames per video (default: 1) - -3. 
**Stage 2 Configuration** (`configs/training/stage2.yaml`): - - `random_init_unet`: Must be set to `False` to use the model from stage 1 - - `data.train_bs`: Smaller batch size due to high GPU memory cost (default: 2) - - `data.n_sample_frames`: Higher value for temporal consistency (default: 16) - - `solver.gradient_accumulation_steps`: Increase to simulate larger batch sizes (default: 8) - - -### GPU Memory Requirements -Based on our testing on a machine with 8 NVIDIA H20 GPUs: - -#### Stage 1 Memory Usage -| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation | -|:----------:|:----------------------:|:--------------:|:--------------:| -| 8 | 1 | ~32GB | | -| 16 | 1 | ~45GB | | -| 32 | 1 | ~74GB | ✓ | - -#### Stage 2 Memory Usage -| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation | -|:----------:|:----------------------:|:--------------:|:--------------:| -| 1 | 8 | ~54GB | | -| 2 | 2 | ~80GB | | -| 2 | 8 | ~85GB | ✓ | - -
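-
-As a sanity check on these tables, the batch size the optimizer effectively sees is the product of the per-GPU batch, the accumulation steps, and the GPU count. A back-of-envelope calculation with the stage 2 defaults above (assuming accelerate-style accumulation on the 8-GPU machine):
-
-```python
-# Effective optimizer batch under gradient accumulation (stage 2 defaults).
-train_bs = 2                     # per-GPU batch size
-gradient_accumulation_steps = 8  # weights update once every 8 micro-batches
-num_gpus = 8                     # e.g. the 8x NVIDIA H20 machine above
-
-effective_batch = train_bs * gradient_accumulation_steps * num_gpus
-print(effective_batch)           # 128 clips per optimizer step,
-                                 # each clip holding n_sample_frames = 16 frames
-```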
-## TestCases For 1.0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-*(Demo table omitted: the original HTML table paired each input image with the corresponding MuseV and MuseTalk result videos; the embedded media did not survive extraction.)*
-
-#### Use of bbox_shift for adjustable results (for 1.0)
-:mag_right: We have found that the upper bound of the mask has an important impact on mouth openness. To control the mask region, we suggest using the `bbox_shift` parameter. Positive values (shifting towards the lower half of the face) increase mouth openness, while negative values (shifting towards the upper half) decrease it.
-
-You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.
-
-For example, in the case of `Xinying Sun`, running the default configuration shows that the adjustable value range is [-9, 9]. Then, to decrease the mouth openness, we set the value to `-7`:
-```
-python -m scripts.inference --inference_config configs/inference/test.yaml --bbox_shift -7
-```
-:pushpin: More technical details can be found in [bbox_shift](assets/BBOX_SHIFT.md).
-
-#### Combining MuseV and MuseTalk
-
-As a complete virtual-human generation solution, we suggest first applying [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video, or pose-to-video) by referring to [this guide](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase the frame rate. Then, use `MuseTalk` to generate a lip-synced video by referring to [this guide](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).
-
-# Acknowledgement
-1. We thank the open-source components [whisper](https://github.com/openai/whisper), [dwpose](https://github.com/IDEA-Research/DWPose), [face-alignment](https://github.com/1adrianb/face-alignment), [face-parsing](https://github.com/zllrunning/face-parsing.PyTorch), [S3FD](https://github.com/yxlijun/S3FD.pytorch) and [LatentSync](https://huggingface.co/ByteDance/LatentSync/tree/main).
-1. MuseTalk has drawn heavily on [diffusers](https://github.com/huggingface/diffusers) and [isaacOnline/whisper](https://github.com/isaacOnline/whisper/tree/extract-embeddings).
-1. MuseTalk was built on the [HDTF](https://github.com/MRzzm/HDTF) dataset.
-
-Thanks for open-sourcing!
-
-# Limitations
-- Resolution: Though MuseTalk uses a face region size of 256 x 256, which makes it better than other open-source methods, it has not yet reached the theoretical resolution bound. We will continue to work on this problem.
-If you need higher resolution, you could apply super-resolution models such as [GFPGAN](https://github.com/TencentARC/GFPGAN) in combination with MuseTalk.
-
-- Identity preservation: Some details of the original face, such as the mustache and the lip shape and color, are not well preserved.
-
-- Jitter: Some jitter exists because the current pipeline adopts single-frame generation.
-
-# Citation
-```bib
-@article{musetalk,
-  title={MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling},
-  author={Zhang, Yue and Zhong, Zhizhou and Liu, Minhao and Chen, Zhaokang and Wu, Bin and Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},
-  journal={arxiv},
-  year={2025}
-}
-```
-# Disclaimer/License
-1. `code`: The code of MuseTalk is released under the MIT License. There is no restriction on academic or commercial usage.
-1. `model`: The trained models are available for any purpose, even commercially.
-1. `other opensource model`: Other open-source models used must comply with their own licenses, such as `whisper`, `ft-mse-vae`, `dwpose`, `S3FD`, etc.
-1. The test data are collected from the internet and are available for non-commercial research purposes only.
-1. `AIGC`: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.
diff --git a/Docs/implementation_plan.md b/Docs/implementation_plan.md
index 3ead6dc..b0e2354 100644
--- a/Docs/implementation_plan.md
+++ b/Docs/implementation_plan.md
@@ -225,6 +225,52 @@ cp -r SuperIPAgent/social-auto-upload backend/social_upload
 
 ---
 
+### Stage 6: MuseTalk Server Deployment (Day 2-3) ✅
+
+> **Goal**: deploy the MuseTalk environment on the dual-GPU server
+
+- [x] Conda environment setup (musetalk)
+- [x] Model weights download (~7GB)
+- [x] Subprocess invocation implemented
+- [x] Health-check feature
+
+### Stage 7: MuseTalk Full Repair (Day 4) ✅
+
+> **Goal**: resolve the inference script's various compatibility issues
+
+- [x] Weight-detection path fix (symlink)
+- [x] Audio/video length mismatch fix
+- [x] Better error logging in the inference script
+- [x] Verified MP4 generation in video compositing
+
+### Stage 8: Frontend Enhancements (Day 5) ✅
+
+> **Goal**: improve the user experience
+
+- [x] Web video upload
+- [x] Upload progress display
+- [x] Auto-refresh of the material list
+
+### Stage 9: Lip-Sync Model Upgrade (Day 6) ✅
+
+> **Goal**: migrate from MuseTalk to LatentSync 1.6
+
+- [x] MuseTalk → LatentSync 1.6 migration
+- [x] Backend code adaptation (config.py, lipsync_service.py)
+- [x] Latent Diffusion architecture (512x512 high definition)
+- [x] End-to-end verification on the server
+
+### Stage 10: Performance Optimization (Day 6) ✅
+
+> **Goal**: improve system responsiveness and stability
+
+- [x] Video pre-compression (automatic 1080p → 720p adaptation)
+- [x] Finer-grained progress updates (real-time feedback)
+- [x] **Persistent model service** (Persistent Server, 0 s load)
+- [x] **GPU concurrency control** (serial queue to prevent crashes)
+
+---
+
 ## Project Directory Structure (Final)
 
 ```
diff --git a/Docs/task_complete.md b/Docs/task_complete.md
index b01fcd9..752a26a 100644
--- a/Docs/task_complete.md
+++ b/Docs/task_complete.md
@@ -86,8 +86,8 @@
 - [x] LipSync service singleton cache
 - [x] Health-check cache (5 minutes)
 - [x] Async subprocess fix (subprocess.run → asyncio)
-- [ ] Preloaded model service (optional)
-- [ ] Batch queue processing (optional)
+- [x] Preloaded model service (persistent server + FastAPI)
+- [x] Batch queue processing (GPU concurrency control)
 
 ---
 
diff --git a/README.md b/README.md
index aa3ba15..21744e7 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@
 - 🎙️ **TTS Dubbing** - EdgeTTS multi-voice support (Yunxi, Xiaoxiao, etc.)
 - 📱 **One-Click Publishing** - Playwright auto-publishes to Douyin, Xiaohongshu, Bilibili, etc.
 - 🖥️ **Web UI** - modern Next.js interface
-- 🚀 **Performance** - video pre-compression, health-check cache
+- 🚀 **Performance** - video pre-compression, persistent model service (0 s load)
 
 ## 🛠️ Tech Stack
 
@@ -102,6 +102,10 @@ uvicorn app.main:app --host 0.0.0.0 --port 8006
 # Terminal 2: frontend (port 3002)
 cd frontend
 npm run dev -- -p 3002
+
+# Terminal 3: LatentSync service (port 8007, recommended)
+cd models/LatentSync
+nohup python -m scripts.server > server.log 2>&1 &
 ```
 
 ---
@@ -130,6 +134,7 @@ npm run dev -- -p 3002
 | Video generation | http://<server-IP>:3002 |
 | Publishing management | http://<server-IP>:3002/publish |
 | API docs | http://<server-IP>:8006/docs |
+| Model API | http://<server-IP>:8007/docs |
 
 ---
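
---

For reference, the persistent-server pattern this patch documents (pay the ~40 s model load once at startup, then serve every request from resident GPU memory) reduces to a load-once FastAPI app. The sketch below is a minimal illustration, not the actual `models/LatentSync/scripts/server.py`: `load_pipeline` and `run_inference` are hypothetical stand-ins, and the real endpoint names may differ.

```python
# Minimal load-once inference server (illustrative sketch only).
# Run with: uvicorn sketch_server:app --port 8007
from contextlib import asynccontextmanager

from fastapi import FastAPI
from pydantic import BaseModel

PIPELINE = None  # stays resident in GPU memory between requests


def load_pipeline():
    # Placeholder for the real weight loading (UNet checkpoint, VAE, audio encoder).
    return object()


def run_inference(pipeline, video_path: str, audio_path: str) -> str:
    # Placeholder for the real LatentSync inference call.
    return video_path.replace(".mp4", "_lipsync.mp4")


@asynccontextmanager
async def lifespan(app: FastAPI):
    global PIPELINE
    PIPELINE = load_pipeline()  # ~40 s, paid once at startup
    yield


app = FastAPI(lifespan=lifespan)


class LipsyncRequest(BaseModel):
    video_path: str
    audio_path: str


@app.post("/generate")
def generate(req: LipsyncRequest):
    # No per-request model loading: requests after the first go straight to inference.
    return {"output": run_inference(PIPELINE, req.video_path, req.audio_path)}
```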