Getting Started with MTVCraft: A Comprehensive Guide
A step-by-step tutorial for setting up MTVCraft, running your first generation, and understanding the key parameters that control output quality.
Prerequisites
Before diving into MTVCraft, make sure your system meets the following requirements (a quick GPU sanity check follows the list):
- GPU: NVIDIA GPU with at least 16GB VRAM (RTX 3090, A5000, or better)
- CUDA: CUDA 12.1 or a compatible version
- Python: Python 3.10 or newer
- Storage: At least 50GB free space for models and outputs
- RAM: 32GB system RAM recommended
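Once PyTorch is installed (Step 3 below), you can sanity-check the GPU and CUDA setup with a few lines of Python:
import torch

# Confirm CUDA is visible and report total VRAM on the first GPU
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected")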
You'll also need API keys for:
- Qwen/DashScope: For prompt decomposition (get it from dashscope.aliyuncs.com)
- ElevenLabs: For audio synthesis (get it from elevenlabs.io)
Installation
Step 1: Clone the Repository
git clone https://github.com/baaivision/MTVCraft.git
cd MTVCraft
Step 2: Create Conda Environment
We recommend using Conda to manage dependencies:
conda create -n mtv python=3.10
conda activate mtv
Step 3: Install Dependencies
Install PyTorch and other required packages:
pip install -r requirements.txt
This will install all necessary dependencies including PyTorch, transformers, diffusers, and more. The installation may take several minutes depending on your internet connection.
Step 4: Install FFmpeg
FFmpeg is required for video processing:
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Or use conda
conda install -c conda-forge ffmpeg
Downloading Pretrained Models
MTVCraft requires several pretrained models. The easiest way to download them is using the Hugging Face CLI:
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTVCraft --local-dir ./pretrained_models
This will download approximately 30GB of model files including:
- MTV framework checkpoints (single-stream, multi-stream, and accumulative models)
- T5-XXL text encoder
- CogVideoX 3D VAE
- Wav2Vec2 audio encoder
The download may take 30-60 minutes depending on your connection speed. Coffee break recommended!
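If you'd rather script the download, the Hugging Face Hub Python API provides an equivalent to the CLI command above:
from huggingface_hub import snapshot_download

# Mirrors the huggingface-cli command: fetches BAAI/MTVCraft into ./pretrained_models
snapshot_download(repo_id="BAAI/MTVCraft", local_dir="./pretrained_models")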
Configuration
Setting Up API Keys
Open mtv/utils.py and add your API keys:
# Qwen API configuration
qwen_model_name = "qwen-plus"
qwen_api_key = "YOUR_QWEN_API_KEY_HERE"
# ElevenLabs configuration
elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY_HERE"
)
Important: Never commit API keys to version control. Consider using environment variables in production, as shown below:
export QWEN_API_KEY="your_key_here"
export ELEVENLABS_API_KEY="your_key_here"
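With those variables set, a minimal sketch of what mtv/utils.py could look like after switching to environment variables (the ElevenLabs import is shown for completeness; the file's actual structure may differ):
import os
from elevenlabs import ElevenLabs  # same client class as in the snippet above

# Read keys from the environment instead of hardcoding them
qwen_model_name = "qwen-plus"
qwen_api_key = os.environ["QWEN_API_KEY"]

elevenlabs = ElevenLabs(
    api_key=os.environ["ELEVENLABS_API_KEY"]
)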
Running Your First Generation
Using the Command Line
The simplest way to generate videos is using the batch inference script:
bash scripts/inference_long.sh ./examples/samples.txt ./output
This command reads prompts from samples.txt and saves videos to the ./output directory.
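The prompt file is plain text; assuming the usual one-prompt-per-line layout, a minimal samples.txt looks like this (the prompts are from the Example Prompts section below):
A pianist performing in a concert hall with audience applause
A chef cooking in a busy restaurant kitchen with sizzling sounds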
Using the Gradio Interface
For a more interactive experience, launch the Gradio web interface:
bash scripts/app.sh ./output
This will start a local web server (usually at http://localhost:7860) where you can enter prompts and see results in real-time.
Example Prompts
Here are some effective prompts to try:
- "A pianist performing in a concert hall with audience applause"
- "A chef cooking in a busy restaurant kitchen with sizzling sounds"
- "A person walking through a forest with birds chirping"
- "A barista making coffee in a cozy cafe with background chatter"
- "A gamer celebrating victory with excited shouts and game sounds"
Understanding Key Parameters
Video Duration
MTVCraft generates 4-6 second clips by default. This is the sweet spot for quality and coherence. Longer videos can be created by generating multiple clips and stitching them together.
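If you end up stitching clips yourself, FFmpeg's concat demuxer (FFmpeg is already installed from Step 4) handles this well. A small Python wrapper, with illustrative clip names:
import subprocess

# Illustrative file names; substitute your actual generated clips
clips = ["output/clip_0.mp4", "output/clip_1.mp4", "output/clip_2.mp4"]

# The concat demuxer reads its inputs from a list file
with open("concat_list.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# -c copy avoids re-encoding, so stitching is nearly instantaneous
subprocess.run([
    "ffmpeg", "-f", "concat", "-safe", "0",
    "-i", "concat_list.txt", "-c", "copy", "stitched.mp4",
], check=True)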
Inference Steps
The number of diffusion steps controls the quality-speed tradeoff. More steps generally produce better quality but take longer:
- 20 steps: Fast preview (2-3 minutes)
- 50 steps: Good quality (5-7 minutes) - recommended default
- 100 steps: High quality (10-15 minutes)
Guidance Scale
Controls how closely the generation follows the audio conditioning; a code sketch showing both this and the step count follows the list:
- Low (1.0-3.0): More creative, less constrained
- Medium (4.0-7.0): Balanced (recommended)
- High (8.0+): Very strict audio following, may reduce diversity
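Where exactly you set these depends on the inference script, but since MTVCraft ships CogVideoX components (see the model list above), a diffusers-style call illustrates how the two knobs typically surface. This is a hedged sketch, not MTVCraft's actual entry point, and the model path is illustrative:
import torch
from diffusers import CogVideoXPipeline  # illustrative; MTVCraft uses its own loader

pipe = CogVideoXPipeline.from_pretrained(
    "./pretrained_models", torch_dtype=torch.float16  # path is illustrative
).to("cuda")

video = pipe(
    prompt="A pianist performing in a concert hall with audience applause",
    num_inference_steps=50,  # the quality/speed tradeoff discussed above
    guidance_scale=5.0,      # medium adherence, the recommended range
).frames[0]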
Troubleshooting Common Issues
Out of Memory Errors
If you encounter CUDA out-of-memory errors, try the following (a short diagnostic snippet follows the list):
- Reducing batch size to 1
- Using lower resolution settings
- Closing other GPU-intensive applications
- Enabling mixed precision (FP16) inference
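Between runs, clearing PyTorch's allocator cache and checking what is actually resident can help you see which of these fixes applies:
import torch

# Release cached allocator blocks so memory from a previous run is freed
torch.cuda.empty_cache()

# Allocated = tensors in use; reserved = allocated plus cache held by PyTorch
print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GB allocated")
print(f"{torch.cuda.memory_reserved() / 1024**3:.1f} GB reserved")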
API Rate Limiting
Both the Qwen and ElevenLabs APIs enforce rate limits. If you hit them, try the following; a retry helper sketch follows the list:
- Add delays between requests
- Upgrade your API plan for higher limits
- Process prompts in smaller batches
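The first point is easy to automate. Here is a small exponential-backoff helper; call_api is a stand-in for whichever Qwen or ElevenLabs call you are making:
import time

def with_backoff(call, max_retries=5, base_delay=2.0):
    # Retry a rate-limited API call, doubling the delay after each failure
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # narrow this to the SDK's rate-limit exception
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage: result = with_backoff(lambda: call_api(prompt))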
Poor Audio-Visual Sync
If synchronization is off:
- Increase the number of inference steps
- Try adjusting the guidance scale
- Ensure your prompt clearly describes temporal relationships
Next Steps
Now that you have MTVCraft up and running, here are some ways to deepen your understanding:
- Experiment with different prompt styles and structures
- Read the architecture deep dive to understand how it works internally
- Try modifying the code to add custom features
- Join the community discussions on GitHub
- Contribute improvements back to the project