OmniHuman
OmniHuman is an image-to-video generation model that creates realistic videos or animations from a single image and lip-syncs them to an audio track.
Prerequisites
- Create an API Key from the Eachlabs Console
- Install the required dependencies for your chosen language (e.g., requests for Python)
API Integration Steps
1. Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "omnihuman",
            "version": "0.0.1",
            "input": {
                "mode": "normal",
                "audio_url": "https://storage.googleapis.com/magicpoint/inputs/omnihuman_audio.mp3",
                "image_url": "https://storage.googleapis.com/magicpoint/models/women.png"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]
2. Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. Keep checking at a short interval until the status comes back as success (or error).
def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)  # Wait before polling again
3. Complete Example
Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.
try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")

    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")
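Once the prediction succeeds, result['output'] holds a URL to the generated MP4 video. The sketch below is one way to save it locally; the helper name and output filename are illustrative, and it assumes the output URL is directly downloadable:

def download_output(output_url, filename="omnihuman_output.mp4"):
    # Stream the generated video to disk so large files are not held in memory
    with requests.get(output_url, stream=True) as response:
        response.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return filename

# Example usage after get_prediction() returns:
# download_output(result["output"])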
Additional Information
- The API uses a two-step process: create prediction and poll for results
- Response time: ~200 seconds
- Rate limit: 60 requests/minute
- Concurrent requests: 10 maximum
- Keep polling the prediction status until completion; a polling sketch with an overall timeout is shown below
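Because a generation typically takes around 200 seconds, polling every second (as in the basic example) spends most of the rate limit on status checks. The sketch below polls less aggressively and gives up after a deadline; it reuses HEADERS from the integration steps, and the interval and timeout values are illustrative rather than documented requirements:

def wait_for_prediction(prediction_id, poll_interval=5, timeout=600):
    # Check status every poll_interval seconds and stop after timeout seconds
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        if result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(poll_interval)
    raise TimeoutError(f"Prediction {prediction_id} did not finish within {timeout}s")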
Overview
OmniHuman is an advanced technology developed by ByteDance researchers that creates highly realistic human videos from a single image and a motion signal, such as audio or video. It can animate portraits, half-body, or full-body images with natural movements and lifelike gestures. By combining different inputs, like images and sound, OmniHuman brings still images to life with remarkable detail and realism.
Technical Specifications
- Modes (example request payloads for both are sketched after this list):
- Normal: Standard output generation with balanced processing speed and accuracy.
- Dynamic: More flexible and adaptive response with a focus on contextual awareness.
- Input Handling: Supports multiple formats and performs pre-processing for enhanced output quality.
- Output Generation: Produces coherent, high-fidelity video of the subject based on the provided image and audio.
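Mode selection happens through the mode field of the request input. A minimal sketch of the two payload variants, reusing the example asset URLs from the integration steps; only the mode value differs:

def build_payload(mode="normal"):
    # mode is either "normal" or "dynamic"
    return {
        "model": "omnihuman",
        "version": "0.0.1",
        "input": {
            "mode": mode,
            "audio_url": "https://storage.googleapis.com/magicpoint/inputs/omnihuman_audio.mp3",
            "image_url": "https://storage.googleapis.com/magicpoint/models/women.png"
        }
    }

normal_payload = build_payload("normal")
dynamic_payload = build_payload("dynamic")  # 1:1 cropped, 512x512 output (see Tips & Tricks)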
Key Considerations
- High-resolution images yield better performance compared to low-quality images.
- Background noise in audio files can impact accuracy.
- Dynamic mode may require more processing time but offers better adaptability.
- The model is optimized for faces; images without a clear, visible face may lead to unexpected results.
- Ensure URLs are accessible and not restricted by security settings.
Tips & Tricks
- Mode Selection:
- Use normal mode for standard, structured responses.
- Use dynamic mode for more adaptive and nuanced outputs.
- Audio Input (audio_url):
- Prefer lossless formats (e.g., WAV) over compressed formats (e.g., MP3) for better clarity.
- Keep audio length within a reasonable range to avoid processing delays.
- Ensure the speech is clear, with minimal background noise.
- Audio Length Limit (normal mode): The maximum supported audio length is 180 seconds.
- Audio Length Limit (dynamic mode): The maximum supported audio length is 90 seconds for pet images and 180 seconds for real-person images. A client-side duration check is sketched after this list.
- Image Input (image_url):
- Use high-resolution, well-lit, front-facing images.
- Avoid extreme facial angles or obstructions (e.g., sunglasses, masks) for best results.
- Images with neutral expressions tend to produce more reliable outputs.
- Supported Input Types (normal and dynamic modes): Both modes can drive all types of images, including real people, anime characters, and pets.
- Output:
- Normal Mode Output Feature: The output preserves the original image's aspect ratio.
- Dynamic Mode Output Feature: The original image is cropped to a fixed 1:1 aspect ratio, with an output resolution of 512×512.
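The audio length limits above can be enforced client-side before a request is sent. A minimal sketch, assuming you already know the clip duration in seconds (for example from your editing tool or an ffprobe call) and the subject type of the image; the helper and its names are illustrative, and the fallback for subjects other than pets uses the 180-second real-person limit as an assumption:

NORMAL_MAX_SECONDS = 180
DYNAMIC_MAX_SECONDS = {"real_person": 180, "pet": 90}

def check_audio_length(duration_seconds, mode="normal", subject="real_person"):
    # Raise before submitting if the clip exceeds the documented limit
    if mode == "normal":
        limit = NORMAL_MAX_SECONDS
    else:
        limit = DYNAMIC_MAX_SECONDS.get(subject, DYNAMIC_MAX_SECONDS["real_person"])
    if duration_seconds > limit:
        raise ValueError(
            f"Audio is {duration_seconds:.0f}s, but {mode} mode allows at most {limit}s for {subject} images"
        )

# A 120-second clip with a pet image in dynamic mode exceeds the 90s limit and raises:
# check_audio_length(120, mode="dynamic", subject="pet")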
Capabilities
- Processes paired audio and image inputs to generate realistic, lip-synced human video.
- Adapts to different scenarios using configurable modes.
- Supports real-time and batch processing (a batch submission sketch follows this list).
- Handles a variety of input formats for flexible usage.
- Ensures coherence between audio and image-based outputs.
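For batch work, the 10-concurrent-request limit noted under Additional Information is the main constraint. The sketch below caps a thread pool at 10 workers and reuses HEADERS and the wait_for_prediction helper sketched earlier; polling every 10 seconds is an illustrative choice to keep ten parallel jobs roughly within the 60 requests/minute rate limit:

from concurrent.futures import ThreadPoolExecutor

def run_one(image_url, audio_url):
    # Submit one prediction, then block until it finishes
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "omnihuman",
            "version": "0.0.1",
            "input": {"mode": "normal", "audio_url": audio_url, "image_url": image_url}
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return wait_for_prediction(prediction["predictionID"], poll_interval=10)

def run_batch(pairs):
    # Cap the pool at 10 workers to respect the concurrent-request limit
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(lambda p: run_one(*p), pairs))

# Example usage with hypothetical asset URLs:
# results = run_batch([("https://example.com/face.png", "https://example.com/voice.mp3")])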
What can I use it for?
- Voice and facial recognition-based response systems.
- Interactive AI-driven conversational agents.
- Enhanced multimedia content creation.
- Automated dubbing and voice sync applications.
- Contextually aware AI-based character simulation.
Things to be aware of
- Experiment with different image angles to observe variations in output.
- Use high-quality audio inputs to test response accuracy.
- Compare normal and dynamic modes for different response behaviors.
- Process multiple inputs to evaluate consistency in generated outputs.
- Try combining varied voice tones and facial expressions to analyze adaptability.
Limitations
- Performance may vary based on the quality of input data.
- Complex or noisy backgrounds in images can lead to inaccurate outputs with OmniHuman by ByteDance.
- Poor audio quality may result in misinterpretations.
- Processing time for OmniHuman by ByteDance may increase for larger files or complex scenarios.
- The model is primarily trained on human faces; other objects may yield unexpected results.
- Audio Normal Mode Length Limit: In normal mode, the maximum supported audio length is 180 seconds.
- Audio Dynamic Mode Length Limit: In dynamic mode, the maximum supported audio length is 90 seconds for pet images and 180 seconds for real-person images.
Output Format: MP4