Getting Started with MTVCraft: A Comprehensive Guide
A step-by-step tutorial for setting up MTVCraft, running your first generation, and understanding the key parameters that control output quality.
Prerequisites
Before diving into MTVCraft, make sure your system meets the following requirements (a quick GPU sanity check follows the list):
- GPU: NVIDIA GPU with at least 16GB VRAM (RTX 3090, A5000, or better)
- CUDA: CUDA 12.1 or a compatible version
- Python: Python 3.10 or newer
- Storage: At least 50GB free space for models and outputs
- RAM: 32GB system RAM recommended
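Once PyTorch is installed (Step 3 below), you can sanity-check the GPU and CUDA setup with a few lines of Python:
import torch

# Confirm CUDA is visible and report total VRAM on the first GPU
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected")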
You'll also need API keys for:
- Qwen/DashScope: For prompt decomposition (get it from dashscope.aliyuncs.com)
- ElevenLabs: For audio synthesis (get it from elevenlabs.io)
Installation
Step 1: Clone the Repository
git clone https://github.com/baaivision/MTVCraft.git
cd MTVCraft
Step 2: Create Conda Environment
We recommend using Conda to manage dependencies:
conda create -n mtv python=3.10
conda activate mtv
Step 3: Install Dependencies
Install PyTorch and other required packages:
pip install -r requirements.txt
This will install all necessary dependencies including PyTorch, transformers, diffusers, and more. The installation may take several minutes depending on your internet connection.
Step 4: Install FFmpeg
FFmpeg is required for video processing:
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Or use conda
conda install -c conda-forge ffmpeg
Downloading Pretrained Models
MTVCraft requires several pretrained models. The easiest way to download them is using the Hugging Face CLI:
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTVCraft --local-dir ./pretrained_models
This will download approximately 30GB of model files including:
- MTV framework checkpoints (single-stream, multi-stream, and accumulative models)
- T5-XXL text encoder
- CogVideoX 3D VAE
- Wav2Vec2 audio encoder
The download may take 30-60 minutes depending on your connection speed. Coffee break recommended!
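If you'd rather script the download, the Hugging Face Hub Python API provides an equivalent to the CLI command above:
from huggingface_hub import snapshot_download

# Mirrors the huggingface-cli command: fetches BAAI/MTVCraft into ./pretrained_models
snapshot_download(repo_id="BAAI/MTVCraft", local_dir="./pretrained_models")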
Configuration
Setting Up API Keys
Open mtv/utils.py and add your API keys:
# Qwen API configuration
qwen_model_name = "qwen-plus"
qwen_api_key = "YOUR_QWEN_API_KEY_HERE"
# ElevenLabs configuration
elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY_HERE"
)
Important: Never commit API keys to version control. Consider using environment variables in production, as shown below:
export QWEN_API_KEY="your_key_here"
export ELEVENLABS_API_KEY="your_key_here"
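With those variables set, a minimal sketch of what mtv/utils.py could look like after switching to environment variables (the ElevenLabs import is shown for completeness; the file's actual structure may differ):
import os
from elevenlabs import ElevenLabs  # same client class as in the snippet above

# Read keys from the environment instead of hardcoding them
qwen_model_name = "qwen-plus"
qwen_api_key = os.environ["QWEN_API_KEY"]

elevenlabs = ElevenLabs(
    api_key=os.environ["ELEVENLABS_API_KEY"]
)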
Running Your First Generation
Using the Command Line
The simplest way to generate videos is using the batch inference script:
bash scripts/inference_long.sh ./examples/samples.txt ./output
This command reads prompts from samples.txt and saves videos to the ./output directory.
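The prompt file is plain text; assuming the usual one-prompt-per-line layout, a minimal samples.txt looks like this (the prompts are from the Example Prompts section below):
A pianist performing in a concert hall with audience applause
A chef cooking in a busy restaurant kitchen with sizzling sounds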
Using the Gradio Interface
For a more interactive experience, launch the Gradio web interface:
bash scripts/app.sh ./output
This will start a local web server (usually at http://localhost:7860) where you can enter prompts and see results in real-time.
Example Prompts
Here are some effective prompts to try:
- "A pianist performing in a concert hall with audience applause"
- "A chef cooking in a busy restaurant kitchen with sizzling sounds"
- "A person walking through a forest with birds chirping"
- "A barista making coffee in a cozy cafe with background chatter"
- "A gamer celebrating victory with excited shouts and game sounds"
Understanding Key Parameters
Video Duration
MTVCraft generates 4-6 second clips by default. This is the sweet spot for quality and coherence. Longer videos can be created by generating multiple clips and stitching them together.
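If you end up stitching clips yourself, FFmpeg's concat demuxer (FFmpeg is already installed from Step 4) handles this well. A small Python wrapper, with illustrative clip names:
import subprocess

# Illustrative file names; substitute your actual generated clips
clips = ["output/clip_0.mp4", "output/clip_1.mp4", "output/clip_2.mp4"]

# The concat demuxer reads its inputs from a list file
with open("concat_list.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# -c copy avoids re-encoding, so stitching is nearly instantaneous
subprocess.run([
    "ffmpeg", "-f", "concat", "-safe", "0",
    "-i", "concat_list.txt", "-c", "copy", "stitched.mp4",
], check=True)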
Inference Steps
The number of diffusion steps controls the quality-speed tradeoff. More steps generally produce better quality but take longer:
- 20 steps: Fast preview (2-3 minutes)
- 50 steps: Good quality (5-7 minutes) - recommended default
- 100 steps: High quality (10-15 minutes)
Guidance Scale
Controls how closely the generation follows the audio conditioning; a code sketch showing both this and the step count follows the list:
- Low (1.0-3.0): More creative, less constrained
- Medium (4.0-7.0): Balanced (recommended)
- High (8.0+): Very strict audio following, may reduce diversity
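Where exactly you set these depends on the inference script, but since MTVCraft ships CogVideoX components (see the model list above), a diffusers-style call illustrates how the two knobs typically surface. This is a hedged sketch, not MTVCraft's actual entry point, and the model path is illustrative:
import torch
from diffusers import CogVideoXPipeline  # illustrative; MTVCraft uses its own loader

pipe = CogVideoXPipeline.from_pretrained(
    "./pretrained_models", torch_dtype=torch.float16  # path is illustrative
).to("cuda")

video = pipe(
    prompt="A pianist performing in a concert hall with audience applause",
    num_inference_steps=50,  # the quality/speed tradeoff discussed above
    guidance_scale=5.0,      # medium adherence, the recommended range
).frames[0]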
Troubleshooting Common Issues
Out of Memory Errors
If you encounter CUDA out-of-memory errors, try the following (a short diagnostic snippet follows the list):
- Reducing batch size to 1
- Using lower resolution settings
- Closing other GPU-intensive applications
- Enabling mixed precision (FP16) inference
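Between runs, clearing PyTorch's allocator cache and checking what is actually resident can help you see which of these fixes applies:
import torch

# Release cached allocator blocks so memory from a previous run is freed
torch.cuda.empty_cache()

# Allocated = tensors in use; reserved = allocated plus cache held by PyTorch
print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GB allocated")
print(f"{torch.cuda.memory_reserved() / 1024**3:.1f} GB reserved")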
API Rate Limiting
Both the Qwen and ElevenLabs APIs enforce rate limits. If you hit them, try the following; a retry helper sketch follows the list:
- Add delays between requests
- Upgrade your API plan for higher limits
- Process prompts in smaller batches
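The first point is easy to automate. Here is a small exponential-backoff helper; call_api is a stand-in for whichever Qwen or ElevenLabs call you are making:
import time

def with_backoff(call, max_retries=5, base_delay=2.0):
    # Retry a rate-limited API call, doubling the delay after each failure
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # narrow this to the SDK's rate-limit exception
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage: result = with_backoff(lambda: call_api(prompt))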
Poor Audio-Visual Sync
If synchronization is off:
- Increase the number of inference steps
- Try adjusting the guidance scale
- Ensure your prompt clearly describes temporal relationships
Next Steps
Now that you have MTVCraft up and running, here are some ways to deepen your understanding:
- Experiment with different prompt styles and structures
- Read the architecture deep dive to understand how it works internally
- Try modifying the code to add custom features
- Join the community discussions on GitHub
- Contribute improvements back to the project