Skip to content

Image Detection API Documentation

Advanced AI-powered object detection for images using local Ollama models, Google Gemini, AWS Rekognition, and OpenAI Vision APIs.

Table of Contents

Overview

The detection package provides a unified interface for multiple AI vision providers, allowing you to: - Detect objects and labels in images - Extract text (OCR) - Detect faces and facial attributes - Analyze image properties (colors, quality, sharpness) - Check for inappropriate content - Get natural language descriptions

All providers return results in a consistent format, making it easy to switch between providers or compare results.

Supported Providers

Provider API Key Required Features
Ollama (default) None (local server) Labels, Text, Faces, Description, Moderation
Google Gemini GEMINI_API_KEY Labels, Text, Faces, Description, Web detection, Landmarks
AWS Rekognition AWS credentials Labels, Text, Faces, Image Properties, Moderation
OpenAI Vision OPENAI_API_KEY Labels, Description, Text, Faces (via GPT-4o)

Setup & Authentication

Ollama (Local)

  1. Install Ollama and start the daemon:
    ollama serve
    
  2. Pull the default multimodal model:
    ollama pull gemma3
    
  3. (Optional) Override the host or model:
    export OLLAMA_HOST="http://192.168.1.50:11434"
    export IMGX_OLLAMA_MODEL="llava"
    

Google Gemini

  1. Get API key from Google AI Studio
  2. Set environment variable:
    export GEMINI_API_KEY="your-api-key"
    

AWS Rekognition

AWS uses the standard credential chain:

Option 1: Environment Variables

export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"

Option 2: AWS CLI Configuration

aws configure

Option 3: AWS Profile

export AWS_PROFILE="myprofile"

Option 4: IAM Roles (automatic on EC2/ECS/Lambda)

OpenAI Vision

  1. Get API key from OpenAI Platform
  2. Set environment variable:
    export OPENAI_API_KEY="sk-..."
    

Quick Start

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/razzkumar/imgx"
)

func main() {
    // Load an image
    img, err := imgx.Load("photo.jpg")
    if err != nil {
        log.Fatal(err)
    }

    // Detect objects using the default local Ollama model
    ctx := context.Background()
    result, err := img.Detect(ctx, "ollama")
    if err != nil {
        log.Fatal(err)
    }

    // Print detected labels
    fmt.Println("Detected objects:")
    for _, label := range result.Labels {
        fmt.Printf("- %s (%.1f%% confidence)\n",
            label.Name, label.Confidence*100)
    }
}

Detection Features

Available Features

const (
    FeatureLabels      Feature = "labels"       // Object/label detection
    FeatureObjects     Feature = "objects"      // Alias for labels
    FeatureText        Feature = "text"         // OCR text extraction
    FeatureFaces       Feature = "faces"        // Face detection
    FeatureDescription Feature = "description"  // Natural language description
    FeatureWeb         Feature = "web"          // Web entities (Gemini only)
    FeatureLandmarks   Feature = "landmarks"    // Landmark detection (Gemini only)
    FeatureProperties  Feature = "properties"   // Image properties (AWS only)
    FeatureSafeSearch  Feature = "safesearch"   // Content moderation
)

Feature Support Matrix

Feature Ollama Gemini AWS OpenAI
Labels
Text (OCR)
Faces
Description
Web Detection
Landmarks
Properties
SafeSearch/Moderation

API Reference

Core Types

DetectionResult

type DetectionResult struct {
    Provider      string                 // Provider used (ollama, gemini, aws, openai)
    Labels        []Label                // Detected objects/labels
    Description   string                 // Natural language description
    Text          []TextBlock            // Extracted text
    Faces         []Face                 // Detected faces
    Web           *WebDetection          // Web entities (Gemini)
    BoundingBoxes []BoundingBox          // Object locations
    Properties    map[string]string      // Image properties (AWS)
    Confidence    float32                // Overall confidence (0.0-1.0)
    ProcessedAt   time.Time              // Processing timestamp
    RawResponse   string                 // Raw API response (if requested)
}

Label

type Label struct {
    Name       string   // Label name (e.g., "Dog", "Car")
    Confidence float32  // Confidence score (0.0-1.0)
    Categories []string // Parent categories
}

TextBlock

type TextBlock struct {
    Text       string       // Extracted text
    Confidence float32      // Confidence score (0.0-1.0)
    BoundingBox *BoundingBox // Text location
    Language   string       // Detected language (if available)
}

Face

type Face struct {
    Confidence  float32            // Confidence score (0.0-1.0)
    BoundingBox *BoundingBox       // Face location
    Landmarks   []FaceLandmark     // Facial landmarks (eyes, nose, etc.)
    Emotions    map[string]float32 // Emotion scores
    AgeRange    string             // Estimated age range
    Gender      string             // Detected gender
}

Detection Options

type DetectOptions struct {
    Features           []Feature // Features to detect
    MaxResults         int       // Maximum labels to return (default: 10)
    MinConfidence      float32   // Minimum confidence threshold (0.0-1.0, default: 0.5)
    CustomPrompt       string    // Custom prompt (Gemini/OpenAI)
    IncludeRawResponse bool      // Include raw API response
}

// Create default options
opts := detection.DefaultDetectOptions()

// Customize options
opts := &detection.DetectOptions{
    Features:      []detection.Feature{detection.FeatureLabels, detection.FeatureText},
    MaxResults:    20,
    MinConfidence: 0.7,
}

Methods

Image.Detect()

func (img *Image) Detect(ctx context.Context, provider string,
    opts ...*detection.DetectOptions) (*detection.DetectionResult, error)

High-level method on *imgx.Image instances.

Parameters: - ctx: Context for cancellation and timeouts - provider: Provider name ("ollama", "gemma3", "qwen3-vl", "gemini", "google", "aws", "rekognition", "openai") - opts: Optional detection options

Returns: - *DetectionResult: Detection results - error: Error if detection fails

Provider Interface

type Provider interface {
    Detect(ctx context.Context, img *image.NRGBA,
        opts *DetectOptions) (*DetectionResult, error)
    Name() string
    IsConfigured() bool
}

Direct provider access for advanced use cases:

provider, err := detection.GetProvider("ollama")
if err != nil {
    log.Fatal(err)
}

result, err := provider.Detect(ctx, img.ToNRGBA(), opts)

Examples

Basic Detection

img, _ := imgx.Load("photo.jpg")
ctx := context.Background()

result, err := img.Detect(ctx, "ollama")
if err != nil {
    log.Fatal(err)
}

for _, label := range result.Labels {
    fmt.Printf("%s: %.1f%%\n", label.Name, label.Confidence*100)
}

Multiple Features

opts := &detection.DetectOptions{
    Features: []detection.Feature{
        detection.FeatureLabels,
        detection.FeatureText,
        detection.FeatureFaces,
    },
    MaxResults:    15,
    MinConfidence: 0.7,
}

result, err := img.Detect(ctx, "aws", opts)
if err != nil {
    log.Fatal(err)
}

// Labels
fmt.Println("Objects:", len(result.Labels))

// Text extraction
fmt.Println("Text found:")
for _, text := range result.Text {
    fmt.Printf("- %s\n", text.Text)
}

// Faces
fmt.Printf("Found %d faces\n", len(result.Faces))

AWS Image Properties

// Get image quality metrics and dominant colors
opts := &detection.DetectOptions{
    Features: []detection.Feature{detection.FeatureProperties},
}

result, err := img.Detect(ctx, "aws", opts)
if err != nil {
    log.Fatal(err)
}

// Access properties
fmt.Println("Brightness:", result.Properties["brightness"])
fmt.Println("Sharpness:", result.Properties["sharpness"])
fmt.Println("Contrast:", result.Properties["contrast"])
fmt.Println("Dominant colors:", result.Properties["dominant_colors"])
fmt.Println("Color 1 (hex):", result.Properties["color_1_hex"])
fmt.Println("Color 1 (rgb):", result.Properties["color_1_rgb"])

Custom Prompt (Gemini/OpenAI)

opts := &detection.DetectOptions{
    CustomPrompt: "Is there a dog in this image? What breed might it be?",
}

result, err := img.Detect(ctx, "gemini", opts)
if err != nil {
    log.Fatal(err)
}

fmt.Println("Description:", result.Description)

Compare Multiple Providers

providers := []string{"gemini", "aws", "openai"}

for _, provider := range providers {
    result, err := img.Detect(ctx, provider)
    if err != nil {
        fmt.Printf("%s error: %v\n", provider, err)
        continue
    }

    fmt.Printf("\n%s results:\n", strings.ToUpper(provider))
    for _, label := range result.Labels[:5] {
        fmt.Printf("  - %s (%.1f%%)\n", label.Name, label.Confidence*100)
    }
}

Error Handling

result, err := img.Detect(ctx, "aws")
if err != nil {
    // Check for specific error types
    if errors.Is(err, detection.ErrProviderNotConfigured) {
        log.Fatal("AWS credentials not configured. Run: aws configure")
    }

    // Check error message for details
    if strings.Contains(err.Error(), "invalid credentials") {
        log.Fatal("AWS credentials are invalid or expired")
    }

    log.Fatal(err)
}

With Context Timeout

// Set 10 second timeout
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

result, err := img.Detect(ctx, "gemini")
if err != nil {
    if errors.Is(err, context.DeadlineExceeded) {
        log.Fatal("Detection timed out")
    }
    log.Fatal(err)
}

Batch Processing

images := []string{"photo1.jpg", "photo2.jpg", "photo3.jpg"}
ctx := context.Background()

for _, imagePath := range images {
    img, err := imgx.Load(imagePath)
    if err != nil {
        log.Printf("Failed to load %s: %v", imagePath, err)
        continue
    }

    result, err := img.Detect(ctx, "gemini")
    if err != nil {
        log.Printf("Failed to detect %s: %v", imagePath, err)
        continue
    }

    fmt.Printf("\n%s:\n", imagePath)
    for i, label := range result.Labels {
        if i >= 3 { break } // Top 3 labels
        fmt.Printf("  %s (%.1f%%)\n", label.Name, label.Confidence*100)
    }
}

Best Practices

1. Choose the Right Provider

  • Gemini: Best for general-purpose detection, custom prompts, and web entity detection
  • AWS Rekognition: Best for production workloads, image properties, and when you need reliable face detection
  • OpenAI Vision: Best for natural language descriptions and when you need GPT-4o's reasoning

2. Set Appropriate Confidence Thresholds

// For critical applications, use higher confidence
opts := &detection.DetectOptions{
    MinConfidence: 0.8, // Only high-confidence results
}

// For exploratory analysis, use lower confidence
opts := &detection.DetectOptions{
    MinConfidence: 0.3, // Catch more possibilities
}

3. Handle Rate Limits

import "time"

for _, img := range images {
    result, err := img.Detect(ctx, "gemini")
    if err != nil {
        if strings.Contains(err.Error(), "rate limit") {
            time.Sleep(2 * time.Second)
            result, err = img.Detect(ctx, "gemini") // Retry
        }
        if err != nil {
            log.Printf("Error: %v", err)
            continue
        }
    }
    // Process result...
}

4. Use Context for Cancellation

ctx, cancel := context.WithCancel(context.Background())

// Cancel on Ctrl+C
go func() {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, os.Interrupt)
    <-sigChan
    cancel()
}()

result, err := img.Detect(ctx, "gemini")

5. Cache Results

type DetectionCache struct {
    mu    sync.RWMutex
    cache map[string]*detection.DetectionResult
}

func (c *DetectionCache) Get(key string) (*detection.DetectionResult, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    result, ok := c.cache[key]
    return result, ok
}

func (c *DetectionCache) Set(key string, result *detection.DetectionResult) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.cache[key] = result
}

Pricing Considerations

AWS Rekognition

Important: The properties feature has separate pricing.

  • Labels only: --features labels → Standard DetectLabels pricing
  • Properties only: --features properties → Image Properties pricing only
  • Both: --features labels,properties → Charged for BOTH APIs

Example costs (as of 2024): - DetectLabels: ~$0.001 per image (first 1M images/month) - Image Properties: Additional charge when combined with labels - DetectText, DetectFaces: Separate pricing

Recommendation: Only request properties when needed to minimize costs.

Gemini

  • Free tier available with rate limits
  • Pay-as-you-go pricing after free tier
  • Custom prompts may use more tokens

OpenAI Vision

  • Charged per API call
  • GPT-4o has different pricing than GPT-4
  • Image size affects cost

Troubleshooting

Common Issues

1. "Provider not configured"

Gemini:

# Check if key is set
echo $GEMINI_API_KEY

# Set the key
export GEMINI_API_KEY="your-key"

AWS:

# Test credentials
aws sts get-caller-identity

# If that fails, run
aws configure

OpenAI:

# Check if key is set
echo $OPENAI_API_KEY

# Set the key
export OPENAI_API_KEY="sk-..."

2. "Invalid AWS credentials"

This error means your AWS credentials are incorrect or expired.

# Verify credentials are correct
aws sts get-caller-identity

# If using temporary credentials, they may have expired
# Get new credentials from your IAM administrator

# Check which credential source is being used
AWS_PROFILE=default aws sts get-caller-identity

3. "Access denied" (AWS)

Your IAM user/role lacks necessary permissions.

Required IAM permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "rekognition:DetectLabels",
        "rekognition:DetectText",
        "rekognition:DetectFaces",
        "rekognition:DetectModerationLabels"
      ],
      "Resource": "*"
    }
  ]
}

4. Rate Limiting

If you hit rate limits, implement exponential backoff:

func detectWithRetry(img *imgx.Image, ctx context.Context, provider string) (*detection.DetectionResult, error) {
    maxRetries := 3
    baseDelay := time.Second

    for i := 0; i < maxRetries; i++ {
        result, err := img.Detect(ctx, provider)
        if err == nil {
            return result, nil
        }

        if !strings.Contains(err.Error(), "rate limit") {
            return nil, err
        }

        if i < maxRetries-1 {
            delay := baseDelay * time.Duration(1<<uint(i))
            time.Sleep(delay)
        }
    }

    return nil, fmt.Errorf("max retries exceeded")
}

5. Large Images

If you get errors about image size:

// Resize large images before detection
if img.Bounds().Dx() > 2048 || img.Bounds().Dy() > 2048 {
    img = img.Fit(2048, 2048, imgx.Lanczos)
}

result, err := img.Detect(ctx, "gemini")

Additional Resources

See Also