MM Audio

mmaudio

MMAudio generates synchronized audio given video and/or text inputs.

L40S 45GB

Fast Inference

REST API


Model Information

Response Time: ~5 sec

Status: Active

Version: 0.0.1

Updated: about 2 months ago


Input

Configure model parameters

Negative Prompt: Negative prompt to avoid certain sounds. Example value: music

Video: Optional video file (MP4) for video-to-audio generation.


Cost is calculated based on execution time. The model is charged at $0.0011 per second, so a run at the average execution time of 5 seconds costs about $0.0055. With a $1 budget, you can run this model approximately 181 times.
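The budget arithmetic above can be reproduced with a short sketch. The function name and its parameters are illustrative and not part of the Eachlabs API; the rate and average runtime are the figures quoted on this page, and actual runtimes will vary.

```python
import math

def estimate_runs(budget_usd, price_per_second=0.0011, avg_seconds=5.0):
    """Return (cost_per_run, whole_runs_within_budget)."""
    cost_per_run = price_per_second * avg_seconds
    # Only complete runs count, so round down.
    runs = math.floor(budget_usd / cost_per_run)
    return cost_per_run, runs

cost, runs = estimate_runs(1.0)
print(f"Cost per run: ${cost:.4f}")
print(f"Runs for $1: {runs}")
```

With the default figures this gives 181 runs, matching the estimate above. Longer generations (for example a larger duration or num_steps) would raise the per-run cost proportionally to their execution time.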

API Reference


Prerequisites

Create an API Key from the Eachlabs Console
Install the required dependencies for your chosen language (e.g., requests for Python)

API Integration Steps

1. Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.

import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key
HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json",
}

def create_prediction():
    """Submit a prediction request and return its prediction ID."""
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "mmaudio",
            "version": "0.0.1",
            "input": {
                "seed": -1,
                "video": "your_file.video/mp4",  # Placeholder; replace with your video input
                "prompt": "your prompt here",
                "duration": 8,
                "num_steps": 25,
                "cfg_strength": 4.5,
                "negative_prompt": "music",
            },
        },
        timeout=30,  # Avoid hanging indefinitely on a stalled connection
    )
    prediction = response.json()

    # The API reports whether the prediction was accepted in the "status" field
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")

    return prediction["predictionID"]
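Because the video input is described as optional, a text-only request can presumably leave it out. The sketch below assembles the "input" object either way. `build_input` is a hypothetical helper, the example prompts and the video URL are made up, and whether the API accepts a request with no "video" key is an assumption; the field names and default values are taken from the request above.

```python
def build_input(prompt, video=None, negative_prompt="music",
                duration=8, num_steps=25, cfg_strength=4.5, seed=-1):
    """Assemble the "input" object for a prediction request.

    Assumption: the "video" key may be omitted for text-only generation.
    """
    payload = {
        "seed": seed,
        "prompt": prompt,
        "duration": duration,
        "num_steps": num_steps,
        "cfg_strength": cfg_strength,
        "negative_prompt": negative_prompt,
    }
    if video is not None:
        payload["video"] = video
    return payload

# Text-only request: no "video" key is sent.
text_only = build_input("rain falling on a tin roof")

# Video-conditioned request; the URL is a hypothetical placeholder.
with_video = build_input("footsteps on gravel",
                         video="https://example.com/clip.mp4")
```

The returned dictionary would be passed as the "input" value in the JSON body of the POST request.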

2. Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready: repeat the request, pausing between attempts, until the status is success, and stop if the status is error.

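def get_prediction(prediction_id, max_wait=300):
    """Poll until the prediction succeeds, fails, or max_wait seconds elapse."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS,
            timeout=30,
        ).json()

        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")

        time.sleep(1)  # Wait before polling again

    raise TimeoutError(f"Prediction {prediction_id} not ready after {max_wait}s")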
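The polling loop above waits a fixed second between checks. For longer jobs, a growing delay keeps the number of requests down. The following is a minimal sketch of capped exponential backoff, not something the Eachlabs documentation prescribes: the function names and the base, factor, and cap values are all assumptions chosen for illustration. The status check and the sleep function are passed in so the loop can be exercised without network access.

```python
import itertools

def backoff_delays(base=1.0, factor=2.0, cap=8.0):
    """Yield an endless sequence of delays: base, base*factor, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor

def poll_with_backoff(check, sleep, max_attempts=10):
    """Call check() until it returns a non-None value, sleeping between attempts.

    check: callable returning the finished result, or None if not ready yet.
    sleep: callable taking a number of seconds (for example time.sleep).
    """
    for delay in itertools.islice(backoff_delays(), max_attempts):
        result = check()
        if result is not None:
            return result
        sleep(delay)
    raise TimeoutError(f"No result after {max_attempts} attempts")
```

In practice, check would wrap the GET request shown above and return the JSON body once its status is success (raising on error), and sleep would be time.sleep.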

3. Complete Example

Here's a complete example that puts it all together: it creates a prediction, waits for the result, and reports any error along the way.

try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")
    
    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")

Additional Information

The API uses a two-step process: create prediction and poll for results
Response time: ~5 seconds
Rate limit: 60 requests/minute
Concurrent requests: 10 maximum
Poll the prediction status repeatedly until completion
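The request limits listed above can be respected on the client side. The sketch below is one possible sliding-window limiter for the 60 requests/minute figure; the class name is hypothetical and nothing here is part of the Eachlabs API. The clock and sleep functions are injectable so the behavior can be checked without real waiting. It is not thread-safe as written.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def acquire(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Wait until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

A client would call acquire() on a shared limiter before each POST or GET. The limit of 10 concurrent requests is a separate concern; one way to cap it is to submit work through concurrent.futures.ThreadPoolExecutor(max_workers=10), with a lock around acquire().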

Overview

MMAudio is a multimodal AI model that generates audio synchronized with video and/or text inputs. It supports video-to-audio and text-to-audio generation, controlled through a prompt, an optional negative prompt, and sampling parameters such as num_steps and cfg_strength. This makes it suited to applications in media, research, and interactive systems.

Technical Specifications

Architecture: Combines convolutional neural networks (CNNs) with transformer-based architectures for robust audio analysis and synthesis.
Supported Tasks:
- Video-to-audio generation
- Text-to-audio generation
Dataset Training: Trained on diverse audio datasets including speech, music, and environmental sounds.

Key Considerations

Video Quality: Use high-resolution videos for better audio alignment.
Prompt Clarity: Ambiguous prompts may lead to less desirable outcomes. Be descriptive and precise.
Processing Time: Higher num_steps improves quality but increases processing time.
Negative Prompt Usage: Avoid distractions by specifying what not to include in the audio.

Tips & Tricks

Optimize CFG Strength:
- High values (e.g., 10): Strict adherence to the prompt.
- Low values (e.g., 2-5): More creative and flexible outputs.
Leverage Negative Prompts: To refine results, list the sounds you want to exclude, such as "human voices" or "loud background music." The negative prompt names what to avoid, so it does not need a leading "no."
Experiment with Seeds: Fixed seeds ensure repeatability, while varying seeds can inspire new outcomes.
Balance Steps and Speed: Start with moderate num_steps (e.g., 50) for efficiency and adjust based on quality needs.
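The tips above suggest comparing several cfg_strength values and seeds. A small sketch of building one input per combination follows. The helper name, the prompt, and the specific values are illustrative assumptions; the accepted ranges for these parameters are not stated on this page.

```python
import itertools

BASE_INPUT = {
    "prompt": "waves crashing on a rocky shore",  # illustrative prompt
    "negative_prompt": "music",
    "duration": 8,
    "num_steps": 25,
}

def sweep_inputs(base, cfg_strengths, seeds):
    """Return one input dict per (cfg_strength, seed) combination."""
    return [
        {**base, "cfg_strength": cfg, "seed": seed}
        for cfg, seed in itertools.product(cfg_strengths, seeds)
    ]

variants = sweep_inputs(BASE_INPUT, cfg_strengths=[2.0, 4.5, 10.0], seeds=[1, 2])
for v in variants:
    print(v["cfg_strength"], v["seed"])
```

Each dictionary can be sent as the "input" of a separate prediction. Using fixed seeds keeps the comparison repeatable, so differences between outputs can be attributed to cfg_strength rather than to random variation.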

Capabilities

Audio for Silent Films: Enhance silent footage with contextual soundscapes.
Nature Ambiance: Generate immersive environmental audio for landscapes and wildlife videos.
Content Creation: Add professional-quality sound to video projects.
Virtual Reality: Create synchronized audio for VR environments, boosting immersion.

What can I use it for?

Media Production: Automate the addition of soundtracks to silent videos, enriching content without manual audio editing.
Gaming and VR: Create immersive environments by generating context-specific audio that responds dynamically to visual cues.

Educational Content: Enhance instructional videos with appropriate sound effects, aiding in better comprehension and engagement.

Things to be aware of

Silent Film Enhancement: Apply MMAudio to silent films to generate authentic soundtracks, revitalizing classic cinema.
Nature Documentary Soundscapes: Use the model to add realistic environmental sounds to nature footage, creating an immersive experience.
Action Sequence Audio: Generate dynamic sound effects for action scenes in videos, enhancing excitement and realism.

Custom Narration: Input textual descriptions to produce corresponding audio narrations, useful for documentaries and presentations.

Limitations

Complex Scenes: May encounter challenges when processing videos with rapid scene changes or intricate visual details.
Unique Sound Effects: Certain distinctive sound effects might require additional customization beyond the model's standard capabilities.

Resource Intensive: Processing high-resolution videos can be computationally demanding.
Output Format: MP4
