Open Voice

openvoice

Updated to OpenVoice v2: Versatile Instant Voice Cloning

A100 40GB

Fast Inference

REST API


Model Information

Response Time: ~14 sec

Status: Active

Version: 0.0.1

Updated: about 1 month ago

Live Demo

Average runtime: ~14 seconds

Input

Text: Input text. Example: "Did you ever hear a folk tale about a giant turtle?"
Language: An enumeration.
Speed: Speed scale of the output audio.
Audio: Input reference audio (MP3 or WAV).


Cost is calculated based on execution time. The model is charged at $0.0015 per second, so a run with the average execution time of 14 seconds costs about $0.021. With a $1 budget, you can run this model approximately 47 times at that average.
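The estimate above is simple arithmetic on the per-second price. A minimal sketch of that calculation follows; the function names are illustrative and not part of any Eachlabs SDK, and the constants are the figures quoted on this page.

```python
import math

PRICE_PER_SECOND = 0.0015  # USD per second, as quoted on this page
AVG_RUNTIME_SECONDS = 14   # average execution time quoted on this page


def cost_per_run(seconds: float = AVG_RUNTIME_SECONDS) -> float:
    """Return the estimated cost in USD of one run lasting `seconds`."""
    return PRICE_PER_SECOND * seconds


def runs_within_budget(budget_usd: float, seconds: float = AVG_RUNTIME_SECONDS) -> int:
    """Return how many whole runs fit in `budget_usd` at the given runtime."""
    return math.floor(budget_usd / cost_per_run(seconds))


print(f"Cost per run: ${cost_per_run():.3f}")
print(f"Runs for $1: {runs_within_budget(1.0)}")
```

Actual cost depends on the measured execution time of each run, so longer text or audio inputs will reduce the number of runs a budget covers.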

API Reference


Prerequisites

Create an API Key from the Eachlabs Console
Install the required dependencies for your chosen language (e.g., requests for Python)

API Integration Steps

1. Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.

import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key
HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    """Submit the inputs and return the ID of the new prediction."""
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "openvoice",
            "version": "0.0.1",
            "input": {
                "text": "Did you ever hear a folk tale about a giant turtle?",
                "audio": "your_file.audio/mp3",  # Placeholder: replace with your reference audio
                "speed": 1,
                "language": "EN_NEWEST"
            }
        },
        timeout=30  # Avoid hanging indefinitely on a stalled connection
    )
    prediction = response.json()

    # Any status other than "success" means the prediction was not created
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")

    return prediction["predictionID"]
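The request body in step 1 can also be assembled by a small helper so that obvious mistakes are caught before the request is sent. The sketch below is an assumption-laden illustration: `build_payload` is a hypothetical name, and the two validation rules (non-empty text, positive speed) are guesses at sensible limits, not documented API constraints.

```python
from typing import Any, Dict


def build_payload(
    text: str,
    audio: str,
    speed: float = 1,
    language: str = "EN_NEWEST",
    model: str = "openvoice",
    version: str = "0.0.1",
) -> Dict[str, Any]:
    """Return a request body shaped like the one used in step 1."""
    # Assumed checks: the API's real validation rules are not documented here.
    if not text.strip():
        raise ValueError("text must not be empty")
    if speed <= 0:
        raise ValueError("speed must be positive")
    return {
        "model": model,
        "version": version,
        "input": {
            "text": text,
            "audio": audio,
            "speed": speed,
            "language": language,
        },
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post` in place of the inline literal.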

2. Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Each request returns the current status, so you need to check repeatedly until you receive a success status or an error status.

def get_prediction(prediction_id):
    """Poll once per second until the prediction succeeds or fails."""
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS,
            timeout=30  # Avoid hanging indefinitely on a stalled connection
        ).json()

        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")

        time.sleep(1)  # Wait before polling again
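The loop in get_prediction runs until the API reports success or error, so a prediction that never reaches either state would block forever. One way to bound the wait is sketched below. `poll_until_done` and its parameters are hypothetical, the 120-second default is an arbitrary choice, and any status other than "success" or "error" is assumed to mean the prediction is still running.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call `fetch_status` until it reports success or error, or time runs out."""
    deadline = clock() + timeout_s
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "success":
            return result
        if status == "error":
            raise RuntimeError(f"Prediction failed: {result}")
        # Assumption: every other status means "still running".
        if clock() >= deadline:
            raise TimeoutError(f"No result after {timeout_s} seconds")
        sleep(interval_s)
```

In practice `fetch_status` would wrap the GET request from step 2, for example a lambda that calls `requests.get(...).json()` for a fixed prediction ID. Passing the clock and sleep functions as parameters keeps the helper testable without real delays.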

3. Complete Example

Here's a complete example that puts the two functions together. It creates a prediction, waits for the result, prints the output URL and the processing time, and reports any error raised along the way.

try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")
    
    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")
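The example above prints the output URL but does not fetch the audio. A minimal sketch for saving it locally follows, assuming that `result['output']` is a directly downloadable URL that needs no authentication header; that behavior is not confirmed by this page. `save_output` is a hypothetical helper that uses only the standard library.

```python
import shutil
import urllib.request
from pathlib import Path


def save_output(url: str, dest: str, timeout_s: float = 60.0) -> Path:
    """Stream the resource at `url` into the file `dest` and return its path."""
    target = Path(dest)
    # Copy in chunks so the whole audio file is never held in memory at once.
    with urllib.request.urlopen(url, timeout=timeout_s) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A call such as `save_output(result["output"], "speech.wav")` would then write the generated audio to the working directory, with the file name chosen by the caller.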

Additional Information

The API uses a two-step process: create prediction and poll for results
Response time: ~14 seconds
Rate limit: 60 requests/minute
Concurrent requests: 10 maximum
Poll the prediction status until it reports success or error
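The limits listed above (60 requests per minute, 10 concurrent requests) can be respected on the client side. The sketch below is one possible approach, not an official client: `Throttle` is a hypothetical class that spaces request starts evenly and caps the number of calls in flight. Whether status polls count toward the rate limit is not stated here, so the sketch treats every call the same.

```python
import threading
import time


class Throttle:
    """Space calls evenly and cap how many run at the same time."""

    def __init__(self, per_minute=60, max_concurrent=10,
                 clock=time.monotonic, sleep=time.sleep):
        self._min_gap = 60.0 / per_minute  # seconds between call starts
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._next_start = 0.0
        self._clock = clock
        self._sleep = sleep

    def run(self, fn, *args, **kwargs):
        """Run `fn` once a concurrency slot and a start time are available."""
        with self._slots:  # blocks while max_concurrent calls are in flight
            with self._lock:  # reserve the next free start time
                now = self._clock()
                start = max(now, self._next_start)
                self._next_start = start + self._min_gap
            if start > now:
                self._sleep(start - now)
            return fn(*args, **kwargs)
```

A single shared instance would wrap each API call, for example `throttle.run(create_prediction)`, so that worker threads cannot exceed either limit together.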

Overview

OpenVoice is a text-to-speech (TTS) model with instant voice cloning: given input text and a reference audio sample, it synthesizes natural, expressive speech in the reference voice. OpenVoice supports a variety of languages, tones, and emotions, making it suitable for media, accessibility, and virtual assistants.

Technical Specifications

Architecture: Built on Transformer-based neural networks optimized for high-fidelity speech synthesis.
Custom Voices: Offers the ability to fine-tune and create custom voices using domain-specific datasets.

Key Considerations

Audio Input Duration: For efficient processing and accurate cloning, the audio input should ideally be approximately 60 seconds long. Aim to provide a clean and uninterrupted audio sample for better results.
Processing Efficiency: Longer inputs, whether text or audio, may significantly increase processing time. Optimizing input size ensures faster and more reliable results.
Clarity and Quality: Clear, high-quality inputs, both text and audio, are critical for achieving accurate and natural-sounding output. Avoid noisy or overly complex data.
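The 60-second guideline for reference audio can be checked locally before submitting a request. The sketch below reads the duration of a WAV file with Python's standard `wave` module. The helper names and the 15-second tolerance are assumptions chosen for illustration, and MP3 files would need a third-party library that is not shown here.

```python
import wave


def wav_duration_seconds(path_or_file) -> float:
    """Return the length in seconds of a WAV file (path or binary file object)."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_near_target(duration_s: float, target_s: float = 60.0,
                   tolerance_s: float = 15.0) -> bool:
    """Return True if the duration is within `tolerance_s` of `target_s`."""
    # The tolerance is an arbitrary illustrative value, not an API rule.
    return abs(duration_s - target_s) <= tolerance_s
```

For example, `is_near_target(wav_duration_seconds("reference.wav"))` reports whether a local sample is close to the recommended length.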

Tips & Tricks

Punctuation Matters: Use punctuation effectively to control pauses and intonation for more natural speech.
Custom Lexicons: Define custom pronunciations for domain-specific terms or uncommon words.
Experiment with Speed and Pitch: Adjust the speed and pitch parameters to match your desired output style.
Voice Blending: Combine multiple voices for dialogue or multi-character narration.
Input Quality: Ensure your input text is grammatically correct and properly punctuated for the most natural-sounding speech.
Voice Selection: Experiment with different voices and accents to find the best fit for your project.

Capabilities

Real-Time Synthesis: Stream text-to-speech output for live applications.
High-Fidelity Audio: Produces clear, natural-sounding speech suitable for professional use.

What can I use it for?

Content Creation: Generate voiceovers for videos, podcasts, or e-learning materials.
Virtual Assistants: Power conversational agents and virtual assistants with realistic speech.
Customer Support: Create automated responses for customer service applications.

Things to be aware of

Dynamic Narration: Generate audiobooks with expressive narration using custom voices.
Language Experiments: Test the model’s capabilities across different languages and accents.
Interactive Applications: Use real-time synthesis for interactive voice applications like games or chatbots.

Limitations

Highly Complex Text: May struggle with synthesizing speech for highly technical or ambiguous text.
Emotion Range: While capable of expressive speech, it may not fully capture nuanced emotions.
Background Noise: Generated speech may sound less natural when combined with inconsistent background audio.
Output Format: WAV
