XTTS

xtts-v2

XTTS is a voice generation model that clones a voice into different languages from a 6-second audio clip.

A100 80GB

Fast Inference

REST API


Model Information

Response Time: ~20 sec

Status: Active

Version: 0.0.1

Updated: about 1 month ago

Live Demo

Average runtime: ~20 seconds

Input

Configure model parameters

Speaker

This determines the specific voice or persona that will speak the provided text.

Accepted formats: MP3, WAV

Language

The language code used for the text-to-speech synthesis.

Text

This is the written input that you want to be converted into spoken words.

Hola, ahora estás en Eachlabs AI. Si necesita ayuda, simplemente contáctenos.


Cost is calculated based on execution time. The model is charged at $0.002 per second. With a $1 budget, you can run this model approximately 25 times, assuming an average execution time of 20 seconds per run.
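
The pricing arithmetic above can be reproduced with a short sketch. The function names are illustrative and not part of the Eachlabs API; only the $0.002 per-second rate and the 20-second average come from this page. `Decimal` is used so that the result is not affected by floating-point rounding.

```python
from decimal import Decimal

PRICE_PER_SECOND = Decimal("0.002")  # USD per second, from the pricing note


def cost_per_run(seconds):
    """Cost in USD of one run lasting `seconds` seconds."""
    return PRICE_PER_SECOND * Decimal(str(seconds))


def runs_for_budget(budget_usd, seconds_per_run=20):
    """Whole number of runs a budget covers at a given average runtime."""
    return int(Decimal(str(budget_usd)) // cost_per_run(seconds_per_run))


print(cost_per_run(20))    # 0.040
print(runs_for_budget(1))  # 25
```

Actual cost depends on the measured execution time of each run, so treat the result as an estimate.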

API Reference


Prerequisites

Create an API Key from the Eachlabs Console
Install the required dependencies for your chosen language (e.g., requests for Python)
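
One way to keep the API key out of source code is to read it from an environment variable. This is a minimal sketch: the variable name `EACHLABS_API_KEY` and the helper `build_headers` are hypothetical choices, not something the API requires. The header names match the ones used in the examples on this page.

```python
import os


def build_headers(env_var="EACHLABS_API_KEY"):
    """Build request headers from an API key stored in the environment.

    The variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"X-API-Key": api_key, "Content-Type": "application/json"}
```

The returned dictionary can be used wherever the examples pass `HEADERS`.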

API Integration Steps

1. Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.

import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key
HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    # Submit the model inputs; the response carries the prediction ID
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "xtts-v2",
            "version": "0.0.1",
            "input": {
                "text": "Hello, you are now at Eachlabs AI. If you need any support, just contact us.",
                "speaker": "your_file.audio/mp3",
                "language": "en",
                "cleanup_voice": False  # Python boolean; sent as JSON false
            }
        }
    )
    prediction = response.json()

    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")

    return prediction["predictionID"]

2. Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.

def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        
        time.sleep(1)  # Wait before polling again
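
The loop above never gives up if a prediction stays pending. The following sketch adds a timeout and a growing delay between polls. It is an illustration under assumptions, not the official client: `wait_for_result` and its parameters are hypothetical, and any status other than "success" or "error" is treated as still pending. The `fetch` argument is any zero-argument callable that returns the decoded JSON, which keeps the helper independent of the HTTP library.

```python
import time


def wait_for_result(fetch, timeout=120.0, interval=1.0, max_interval=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch()` until it reports success, an error, or the timeout passes.

    `fetch` returns the decoded JSON dict for the prediction, for example
    lambda: requests.get(url, headers=HEADERS).json().
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        status = result.get("status")
        if status == "success":
            return result
        if status == "error":
            raise RuntimeError(f"Prediction failed: {result}")
        if clock() >= deadline:
            raise TimeoutError(f"No result after {timeout} seconds")
        sleep(interval)
        interval = min(interval * 2, max_interval)  # back off between polls
```

With the roughly 20-second response time stated on this page, a timeout of a minute or two leaves some margin; the default of 120 seconds is an arbitrary choice.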

3. Complete Example

Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.

try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")
    
    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")

Additional Information

The API uses a two-step process: create prediction and poll for results
Response time: ~20 seconds
Rate limit: 60 requests/minute
Concurrent requests: 10 maximum
Use long-polling to check prediction status until completion
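
If status checks count toward the rate limit (an assumption; this page does not say which requests are counted), the polling interval has to grow with the number of predictions being polled at once. The sketch below computes the smallest safe interval; `min_poll_interval` is a hypothetical helper name.

```python
def min_poll_interval(active_predictions, rate_limit_per_minute=60):
    """Smallest per-prediction polling interval, in seconds, that keeps the
    combined request rate at or below the limit."""
    if active_predictions < 1:
        raise ValueError("active_predictions must be at least 1")
    return active_predictions * 60 / rate_limit_per_minute


print(min_poll_interval(1))   # 1.0
print(min_poll_interval(10))  # 10.0
```

Under that assumption, polling ten concurrent predictions once per second each would exceed 60 requests per minute, while polling each every 10 seconds would not.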

Overview

XTTS is a state-of-the-art text-to-speech (TTS) model that enables high-quality, natural-sounding voice generation in multiple languages. The model is designed for generating lifelike speech while maintaining clarity, emotion, and linguistic precision. It supports a wide range of languages and offers fine-tuned controls to customize voice output to suit various use cases.

Technical Specifications

Multilingual Support: The model supports the following languages:

English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh), Hungarian (hu), Korean (ko), Hindi (hi).
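
A client can check the language code before sending a request. This sketch uses only the codes listed above; the helper name `normalize_language` is illustrative, and the API itself may accept or reject codes differently.

```python
SUPPORTED_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh", "hu", "ko", "hi",
}


def normalize_language(code):
    """Return a lowercase, validated language code or raise ValueError."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized
```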

Speaker Personalization: Allows the use of external speaker files to mimic specific voice profiles or styles.

Voice Cleanup: A refinement process to enhance the smoothness and quality of generated speech.

Key Considerations

Language-Specific Nuances: Ensure the text input aligns with the selected language to avoid unnatural pronunciation.

Speaker File Quality: Poor-quality or noisy speaker files can negatively impact the generated output. Use clean recordings for better results.

Output Clarity: Long or overly complex text inputs may produce less natural results.

Tips & Tricks

Text:

Keep sentences concise and grammatically correct.
Avoid abbreviations or symbols that may confuse the model.
Example: Use "Please proceed to the next step." instead of "Pls proc nxt step."
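
One way to keep inputs short is to split long text at sentence boundaries and submit each chunk as its own prediction. The sketch below is a simple illustration, not part of the model or the API: `split_into_chunks` is a hypothetical helper, the 200-character default is an arbitrary choice, and the regular expression only recognizes ".", "!" and "?" as sentence endings.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent as the text input of a separate request, and the resulting audio files joined afterwards.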

Speaker:

Use high-quality audio files for better mimicry.
Ensure the recording has a neutral tone without excessive background noise or distortion.

Language:

Select the correct code for the desired language (e.g., en for English, fr for French).
Match the text language with the selected language code for natural intonation.

Cleanup Voice:

Enable this option for smoother output with fewer artifacts, especially when working with synthesized or noisy speaker profiles.

Capabilities

Narration for audiobooks or educational content.

Voiceovers for videos and presentations.

Real-time communication in multilingual scenarios.

What can I use it for?

Creating customized voice profiles for specific use cases.

Generating speech in multiple languages with high clarity and natural tone.

Refining synthesized speech using advanced cleanup features.

Things to be aware of

Multilingual Speech:

Input: "Bonjour, comment allez-vous?"
Language: fr
Output: High-quality French speech.

Voice Personalization:

Provide a custom speaker file to replicate a specific voice style.

Enhanced Cleanup:

Enable the cleanup_voice feature to polish the generated audio.

Limitations

Accent and Dialect Variations: The model may not fully replicate regional accents or dialects within a language.

Speaker Diversity: The quality of voice mimicry depends heavily on the provided speaker file's clarity and characteristics.

Complex Text Handling: Highly technical or domain-specific jargon may result in inconsistent pronunciation.

Output Format: WAV
