Kokoro 82M
kokoro-82m
Kokoro 82M is an advanced text-to-speech AI model designed to convert written text into natural-sounding voice output.
Prerequisites
- Create an API Key from the Eachlabs Console
- Install the required dependencies for your chosen language (e.g., requests for Python)
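If you are following the Python examples below, the only third-party dependency is requests, which is typically installed with pip:

pip install requests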
API Integration Steps
1. Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "kokoro-82m",
            "version": "0.0.1",
            "input": {
                "text": "your text here",
                "speed": "1",
                "voice": "af"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]
2. Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)  # Wait before polling again
3. Complete Example
Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.
try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")

    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")
Additional Information
- The API uses a two-step process: create prediction and poll for results
- Response time: ~21 seconds
- Rate limit: 60 requests/minute
- Concurrent requests: 10 maximum
- Use long-polling to check prediction status until completion
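Since a typical prediction takes around 21 seconds and requests are rate-limited, it can help to poll a little less aggressively and to give up after a reasonable deadline. The sketch below is one way to do that, assuming the HEADERS dictionary from step 1 is already defined; the 120-second timeout and 3-second poll interval are arbitrary choices, not API requirements.

import time
import requests

def wait_for_prediction(prediction_id, timeout_s=120, poll_interval_s=3):
    """Poll the prediction until it succeeds, fails, or the deadline passes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS  # defined in step 1
        ).json()
        if result["status"] == "success":
            return result
        if result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(poll_interval_s)  # a longer interval keeps you well under the rate limit
    raise TimeoutError(f"Prediction {prediction_id} did not finish within {timeout_s}s")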
Overview
Kokoro 82M is a state-of-the-art text-to-speech model designed to produce high-quality, natural-sounding audio from text input. It offers flexibility in voice selection, speed adjustment, and seamless control over the output, making it well suited to lifelike voiceovers, audio content, and any scenario that requires synthesized speech with precision and clarity.
Technical Specifications
- Advanced Neural Architecture: Kokoro 82M leverages cutting-edge technology to analyze and synthesize text into natural speech.
- Flexible Input Handling: Kokoro 82M supports text of varying lengths and complexities, ensuring consistent performance across use cases.
- Voice Variety: Includes multiple pre-trained voices with distinct tonal qualities, offering diversity for different needs.
- Speed Control: Kokoro 82M allows for dynamic pacing adjustments, enabling applications ranging from audiobooks to quick announcements.
- High Fidelity Output: Kokoro 82M is designed to deliver clean, noise-free audio with clear enunciation and natural intonation.
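The voice and speed options above correspond to the input fields of the create-prediction request from step 1. As a quick reference, the payload breaks down like this (the values are simply the ones used earlier; additional voice IDs are listed in the Eachlabs Console):

payload = {
    "model": "kokoro-82m",
    "version": "0.0.1",
    "input": {
        "text": "your text here",  # the text to synthesize
        "speed": "1",              # speaking-rate multiplier, passed as a string in the examples
        "voice": "af"              # voice ID; "af" is the voice used in the examples on this page
    }
}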
Key Considerations
- Text Structure Matters: Ensure that the input text is grammatically correct and well-structured to produce the best audio output.
- Speed Extremes: Setting the speed parameter too high or low may affect intelligibility. Moderate adjustments are recommended.
- Output Consistency: Shorter sentences and clear punctuation improve clarity and reduce the risk of unnatural pauses.
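One practical way to apply these considerations is to split long input into shorter, punctuation-delimited segments before sending it to the model. The sketch below uses only the Python standard library and is an illustration, not part of the API; each segment can then be submitted as its own prediction.

import re

def split_into_segments(text, max_chars=250):
    """Split text at sentence boundaries, keeping each segment under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments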
Tips & Tricks
- Optimize Text: Avoid overly complex or ambiguous text. Break long sentences into smaller, clear segments for better results.
- Speed Parameter:
  - For formal content, keep speed values moderate (e.g., 0.8 to 1.2) to ensure clarity and professionalism.
  - For dynamic or energetic outputs, experiment with slightly higher values (e.g., 1.3 to 1.5).
- Voice Selection:
  - Use deeper tones for authoritative or serious contexts.
  - Lighter or more vibrant voices work well for engaging or casual content.
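These tips are easy to apply by turning the hard-coded request from step 1 into a small helper that accepts text, voice, and speed. The synthesize() name below is chosen for illustration, "af" (the voice used earlier) is the only voice ID taken from this page, and the snippet assumes HEADERS from step 1 is already defined.

def synthesize(text, voice="af", speed="1"):
    """Create a kokoro-82m prediction with configurable voice and speed."""
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,  # defined in step 1
        json={
            "model": "kokoro-82m",
            "version": "0.0.1",
            "input": {"text": text, "speed": speed, "voice": voice}
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]

# Moderate speed for formal narration, slightly faster for energetic copy
formal_id = synthesize("Welcome to the quarterly earnings call.", speed="1.0")
energetic_id = synthesize("Don't miss our weekend sale!", speed="1.4")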
Capabilities
- High-Quality Synthesis: Produces lifelike, natural-sounding speech that closely mimics human intonation and rhythm.
- Flexible Parameter Control: Enables users to tailor outputs with adjustable speed and diverse voice options.
What can I use it for?
- Voiceovers: Generate professional-grade voiceovers for videos, presentations, or tutorials.
- Audiobooks: Create engaging and clear narrations for storytelling or educational content.
- Announcements: Produce dynamic audio for announcements or alerts in public or private settings.
Things to be aware of
- Create a fast-paced announcement by setting the speed to 1.3 and using concise text.
- Generate an audiobook snippet by selecting a steady speed (e.g., 1.0) and a calm voice.
- Test how punctuation affects output by trying variations like pauses (commas) or emphasis (exclamation points).
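With the synthesize() helper sketched under Tips & Tricks (or the raw request from step 1), these experiments map to calls like the following. The speed values come straight from the bullets above; the voice is left at the default because "af" is the only voice ID shown on this page.

# Fast-paced announcement: concise text at speed 1.3
announcement_id = synthesize("Doors close in five minutes.", speed="1.3")

# Audiobook snippet: steady speed 1.0 (pick a calm voice in the console)
audiobook_id = synthesize("It was a quiet morning in the valley.", speed="1.0")

# Punctuation test: compare pauses (commas) and emphasis (exclamation points)
plain_id = synthesize("Welcome back everyone")
emphatic_id = synthesize("Welcome back, everyone!")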
Limitations
- Text Complexity: While highly capable, overly intricate or poorly formatted text may result in suboptimal audio.
- Speed and Comprehension: Extreme speed settings can hinder clarity and make the output difficult to understand.
- Voice Availability: The pre-trained voices, while diverse, might not cover every niche use case or accent preference.
Output Format: WAV
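Because the complete example above prints result['output'] as a URL, the generated audio can be saved to disk with a short download step. This sketch assumes the output field is a direct link to the WAV file:

import requests

def download_wav(output_url, path="kokoro_output.wav"):
    """Download the generated audio to a local WAV file."""
    response = requests.get(output_url)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    return path

# e.g. download_wav(result["output"])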