CogVLM2

cogvlm2-video

CogVLM2 is a model that combines image and video understanding, enabling tasks like captioning, visual question answering, and multimodal analysis.

L40S (45 GB) · Fast Inference · REST API

Model Information

Response Time: ~12 sec (average)
Status: Active
Version: 0.0.1
Updated: about 2 months ago

Example output:

"In the video, we see a large elephant walking across a dry grassland. The elephant's skin is covered in a vibrant, rainbow-colored pattern. The elephant's ears are large and floppy, and it has a long, curved trunk. The elephant's eyes are visible, and it appears to be moving purposefully. The background is a clear blue sky, and there are no other objects or creatures in sight. The elephant's colorful skin stands out against the natural surroundings, creating a striking visual contrast."
Cost is calculated based on execution time. The model is charged at $0.0011 per second, so with a $1 budget you can run it approximately 75 times, assuming an average execution time of 12 seconds per run.
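As a quick sanity check on that estimate, a minimal sketch using only the figures quoted above:

```python
# Rough cost estimate from the pricing quoted above.
RATE_PER_SECOND = 0.0011   # USD per second of execution
AVG_RUNTIME_SECONDS = 12   # average run time
BUDGET = 1.00              # USD

cost_per_run = RATE_PER_SECOND * AVG_RUNTIME_SECONDS  # ~$0.0132
runs = int(BUDGET // cost_per_run)                    # ~75 runs

print(f"Cost per run: ${cost_per_run:.4f}")
print(f"Runs on a ${BUDGET:.2f} budget: ~{runs}")
```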

Overview

CogVLM2-Video is an advanced visual language model designed for comprehensive image and video understanding tasks. Building upon the previous generation, it delivers significant improvements across various benchmarks and supports longer content lengths and higher image resolutions.

Technical Specifications

Base Model: Built upon Meta's Llama 3 with 8 billion parameters.

Multimodal Input: Capable of processing both textual and visual data, including images and videos.

Key Considerations

Performance: While the model achieves state-of-the-art results in many benchmarks, real-world performance may vary based on input quality and complexity.

Tips & Tricks

Effective Input Preparation:

  • Text Prompts: Clearly articulate your prompts to guide the model effectively.
  • Visual Inputs: Use high-quality images or videos within the supported resolution and length to enhance output accuracy.

Combining Modalities: Leverage both text and visual inputs simultaneously to enrich the context and improve the model's understanding.
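Putting these tips together, here is a minimal sketch of a combined text-plus-video request over the REST API. The endpoint URL, header, and field names below are illustrative assumptions, not the documented schema; check the provider's API reference for the actual request format.

```python
# Hypothetical request sketch -- the endpoint and field names are
# placeholders, not the documented API schema.
import requests

API_URL = "https://api.example.com/v1/models/cogvlm2-video/predict"  # assumed
API_KEY = "YOUR_API_KEY"

payload = {
    "prompt": "Describe what happens in this video.",    # clear, specific prompt
    "video_url": "https://example.com/clips/sample.mp4", # keep under 1 minute
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,  # runs average ~12 s; leave headroom
)
resp.raise_for_status()
print(resp.json())
```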

Input Length: The model supports content lengths up to 8,000 tokens. Ensure your inputs stay within this limit to maintain optimal performance.
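A simple pre-flight check helps stay under that limit. The 4-characters-per-token ratio below is a common rule of thumb, not the model's actual tokenizer, so treat the result as an approximation:

```python
# Approximate token count using the rough 4-chars-per-token heuristic.
MAX_TOKENS = 8000

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(prompt: str) -> bool:
    return approx_tokens(prompt) <= MAX_TOKENS

prompt = "Summarize the key events in this video."
print(approx_tokens(prompt), fits_context(prompt))
```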

Image Resolution: For image inputs, resolutions up to 1344×1344 pixels are supported. Providing images within this resolution range will yield the best results.
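For example, a small helper using Pillow (one option among many image libraries) can shrink oversized images while preserving aspect ratio:

```python
# Downscale an image so neither side exceeds the supported
# 1344x1344 resolution, preserving aspect ratio (requires Pillow).
from PIL import Image

MAX_SIDE = 1344

def fit_to_model(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path)
    img.thumbnail((MAX_SIDE, MAX_SIDE), Image.LANCZOS)  # no-op if already small enough
    img.save(dst_path)

fit_to_model("frame_4k.png", "frame_model_ready.png")
```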

Language Support: CogVLM2-Video is proficient in both Chinese and English. You can input prompts in either language based on your requirements.

Capabilities

Image Understanding: Analyzes and interprets high-resolution images, providing detailed insights and descriptions.

Video Understanding: Processes videos by analyzing keyframes, enabling comprehension of dynamic visual content.
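To illustrate the keyframe idea, the sketch below samples a handful of evenly spaced frames with OpenCV. The service performs its own frame selection server-side, so this is purely a local illustration of reducing a video to representative frames:

```python
# Uniformly sample n frames from a video as a stand-in for
# keyframe selection (requires opencv-python).
import cv2

def sample_frames(path: str, n: int = 8):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

print(len(sample_frames("clip.mp4")))
```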

What can I use it for?

Visual Question Answering: Obtain answers to questions based on video content.

Content Analysis: Analyze visual media to extract meaningful information and summaries.

Interactive Applications: Create chatbots or virtual assistants that can interpret and respond to video input.

Educational Tools: Develop frameworks that use the model's capabilities to provide explanations or summaries of video content for learning purposes.

Content Creation: Use the model to create descriptive content or narratives based on videos to assist with creative projects.

Limitations

Video Length: The model can process videos up to 1 minute in duration. Longer videos need to be truncated or segmented appropriately.
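A minimal sketch for trimming a longer clip down to the limit, assuming the ffmpeg CLI is installed and on your PATH:

```python
# Trim a video to the model's 1-minute limit before upload.
import subprocess

def trim_to_limit(src: str, dst: str, max_seconds: int = 60) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", str(max_seconds), "-c", "copy", dst],
        check=True,
    )

trim_to_limit("long_clip.mp4", "clip_60s.mp4")
```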

Resolution Constraints: Images exceeding 1344×1344 pixels may require downscaling to fit within the supported resolution.

Language Limitations: While proficient in Chinese and English, performance in other languages may be limited or unsupported.

Output Format: Text