
Prometheus Metrics API

Champa Intelligence exposes comprehensive metrics in Prometheus format for integration with Grafana, Prometheus, and other monitoring tools.


Endpoints

Light Metrics Endpoint

GET /health/light/metrics
Authorization: Bearer <api_token>

Returns essential health metrics with minimal overhead (~100ms response time).

Use for: High-frequency scraping (every 15-30 seconds)

Full Metrics Endpoint

GET /health/full/metrics
Authorization: Bearer <api_token>

Returns complete metrics including analytics data (~500ms response time).

Use for: Detailed monitoring (every 1-5 minutes)

Portfolio Metrics Endpoint

GET /portfolio/overview/metrics
Authorization: Bearer <api_token>

Returns portfolio-level KPIs for all processes.

Use for: Business dashboards (every 5-15 minutes)


Authentication

All metrics endpoints require API token authentication:

# Using Bearer token
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://your-domain.com/health/light/metrics

# Using basic auth (for compatibility)
curl -u api_user:api_token \
  https://your-domain.com/health/light/metrics
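
The same Bearer-token request can be issued from a script. The following is a minimal Python sketch using only the standard library. The base URL and token are placeholders, and build_metrics_request and fetch_metrics are hypothetical helper names, not part of Champa Intelligence.

```python
import urllib.request


def build_metrics_request(base_url: str, token: str,
                          path: str = "/health/light/metrics") -> urllib.request.Request:
    """Build a GET request for a metrics endpoint with Bearer authentication."""
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(base_url: str, token: str,
                  path: str = "/health/light/metrics", timeout: float = 10.0) -> str:
    """Fetch the metrics text; urlopen raises HTTPError on 4xx/5xx responses."""
    request = build_metrics_request(base_url, token, path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_metrics("https://your-domain.com", "YOUR_API_TOKEN") would return the raw Prometheus text shown in the sections below. Passing path="/health/full/metrics" targets the full endpoint instead.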

API Token Creation

  1. Create an API user in User Management
  2. Set is_api_user = true
  3. Configure the token TTL (7-365 days, or never expire)
  4. Copy the generated JWT token
  5. Use it in the Authorization header
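
Because the token is a JWT with a configurable TTL, it can help to check how long a token remains valid before a scrape starts failing. The sketch below assumes the token is a standard three-part JWT carrying the registered exp claim; this page does not confirm which claims Champa Intelligence sets, so treat that as an assumption. The function names are hypothetical, and the signature is not verified.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> Optional[int]:
    """Return the exp claim (Unix seconds) from a JWT payload, or None if absent.

    The signature is NOT verified; this only inspects the payload.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload_b64 = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")


def days_until_expiry(token: str, now: Optional[float] = None) -> Optional[float]:
    """Days remaining before the token expires, or None if it has no exp claim."""
    exp = jwt_expiry(token)
    if exp is None:
        return None
    current = time.time() if now is None else now
    return (exp - current) / 86400
```

A rotation job could call days_until_expiry on the stored token and warn when the result drops below a chosen margin.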

Available Metrics

Cluster Health Metrics

camunda_cluster_info

Type: gauge
Description: Cluster metadata
Labels: version

camunda_cluster_info{version="7.20+"} 3

camunda_cluster_running_nodes

Type: gauge
Description: Number of active Camunda nodes

camunda_cluster_running_nodes 3

camunda_active_instances

Type: gauge
Description: Total active process instances across cluster

camunda_active_instances 1247

camunda_incidents

Type: gauge
Description: Total open incidents

camunda_incidents 23

camunda_user_tasks

Type: gauge
Description: Active user tasks

camunda_user_tasks 456

camunda_external_tasks

Type: gauge
Description: Active external tasks

camunda_external_tasks 89

camunda_total_jobs

Type: gauge
Description: Total jobs in queue

camunda_total_jobs 342

camunda_failed_jobs

Type: gauge
Description: Jobs with no retries left

camunda_failed_jobs 5

Per-Node Metrics

camunda_node_status

Type: gauge
Description: Node operational status (1=RUNNING, 0=DOWN)
Labels: node, url

camunda_node_status{node="production_node_1",url="http://10.0.1.10:8080"} 1
camunda_node_status{node="production_node_2",url="http://10.0.1.11:8080"} 1
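
Sample lines like the ones above follow the Prometheus text exposition format: a metric name, an optional label set in braces, and a numeric value. As an illustration of that structure, here is a simplified Python parser for a single sample line. It ignores optional timestamps and does not unescape label values; parse_sample is a hypothetical helper, and for real use the prometheus_client package ships a complete parser (prometheus_client.parser.text_string_to_metric_families).

```python
import re
from typing import Dict, Tuple

# Matches lines such as: name{label="value",other="x"} 1.0   or   name 42
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Parse one sample line into (metric name, labels, value)."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    # Unlabelled metrics have no brace group, so fall back to an empty string.
    labels = dict(_LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)
```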

camunda_node_response_time_ms

Type: gauge
Description: Node response time in milliseconds
Labels: node

camunda_node_response_time_ms{node="production_node_1"} 45.3

camunda_node_workload_score

Type: gauge
Description: Workload score (job acquisitions + decisions + activities)
Labels: node

camunda_node_workload_score{node="production_node_1"} 1250

camunda_node_job_acquisition_success_rate

Type: gauge
Description: Job acquisition success rate percentage
Labels: node

camunda_node_job_acquisition_success_rate{node="production_node_1"} 98.5

camunda_node_job_success_rate

Type: gauge
Description: Job execution success rate percentage
Labels: node

camunda_node_job_success_rate{node="production_node_1"} 99.2

JVM Metrics (Per Node)

camunda_jvm_heap_used_mb

Type: gauge
Description: JVM heap memory used in MB
Labels: node

camunda_jvm_heap_used_mb{node="production_node_1"} 1024.5

camunda_jvm_heap_max_mb

Type: gauge
Description: JVM maximum heap size in MB
Labels: node

camunda_jvm_heap_max_mb{node="production_node_1"} 2048.0

camunda_jvm_heap_utilization_percent

Type: gauge
Description: Heap utilization percentage
Labels: node

camunda_jvm_heap_utilization_percent{node="production_node_1"} 50.0

camunda_jvm_gc_minor_collections

Type: counter
Description: Minor GC collection count
Labels: node

rate(camunda_jvm_gc_minor_collections{node="production_node_1"}[5m])

camunda_jvm_gc_major_collections

Type: counter
Description: Major GC collection count
Labels: node

rate(camunda_jvm_gc_major_collections{node="production_node_1"}[5m])
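
The GC metrics are counters, so their raw values only grow (and drop to zero when a node restarts); that is why the examples wrap them in rate(). To make the idea concrete, the following Python sketch computes a per-second rate from a list of (timestamp, value) samples. It is a simplification of what PromQL does: it handles counter resets but does not extrapolate to the window boundaries. simple_rate is a hypothetical name used only for illustration.

```python
from typing import Sequence, Tuple


def simple_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. JVM restart): count from zero.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed
```

For example, a counter that goes from 100 to 160 over 120 seconds has a rate of 0.5 collections per second.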

camunda_jvm_threads_current

Type: gauge
Description: Current thread count
Labels: node

camunda_jvm_threads_current{node="production_node_1"} 125

camunda_jvm_cpu_load_percent

Type: gauge
Description: CPU load percentage
Labels: node

camunda_jvm_cpu_load_percent{node="production_node_1"} 35.7

Database Metrics

camunda_db_latency_ms

Type: gauge
Description: Database query latency in milliseconds

camunda_db_latency_ms 12.5

camunda_db_active_connections

Type: gauge
Description: Active database connections

camunda_db_active_connections 15

camunda_db_max_connections

Type: gauge
Description: Maximum database connections configured

camunda_db_max_connections 100

camunda_db_connection_utilization_percent

Type: gauge
Description: Connection pool utilization percentage

camunda_db_connection_utilization_percent 15.0

camunda_db_table_size_bytes

Type: gauge
Description: Database table size in bytes
Labels: table

camunda_db_table_size_bytes{table="act_hi_procinst"} 1073741824

camunda_db_archivable_instances

Type: gauge
Description: Completed instances eligible for archival (90+ days old)

camunda_db_archivable_instances 45230

Process-Level Metrics

camunda_process_active_instances

Type: gauge
Description: Active instances per process
Labels: process, version

camunda_process_active_instances{process="order_to_cash",version="3"} 145

camunda_process_open_incidents

Type: gauge
Description: Open incidents per process
Labels: process, version

camunda_process_open_incidents{process="order_to_cash",version="3"} 5

camunda_process_health_score

Type: gauge
Description: Health score (0-100)
Labels: process, version

camunda_process_health_score{process="order_to_cash",version="3"} 87.5

camunda_process_started_last_30_days

Type: counter
Description: Instances started in last 30 days
Labels: process, version

camunda_process_started_last_30_days{process="order_to_cash",version="3"} 2341

camunda_process_incident_rate

Type: gauge
Description: Incident rate percentage
Labels: process, version

camunda_process_incident_rate{process="order_to_cash",version="3"} 3.4

Grafana Integration

Setup Data Source

  1. Navigate to Grafana → Configuration → Data Sources
  2. Click "Add data source"
  3. Select "Prometheus"
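  4. Configure:
    Name: Champa Intelligence
    URL: http://prometheus:9090  (example hostname; use the Prometheus server that scrapes Champa Intelligence)
    
    Grafana queries the Prometheus HTTP API, so the URL must point to a
    Prometheus server rather than directly to a Champa metrics endpoint.
    The Champa API token belongs in the Prometheus scrape configuration.
    
    Auth (only if your Prometheus server itself requires a token):
    - Custom HTTP Headers
    Header: Authorization
    Value: Bearer YOUR_PROMETHEUS_TOKEN
    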
  5. Click "Save & Test"

Example Dashboard Panels

Cluster Health Overview

# Active Instances
camunda_active_instances

# Incidents
camunda_incidents

# Node Count
camunda_cluster_running_nodes

# User Tasks
camunda_user_tasks

Per-Process Health

# Health Scores
camunda_process_health_score

# Incident Rate (processes with >5%)
camunda_process_incident_rate > 5

# Throughput (instances started per second, averaged over 1 hour)
rate(camunda_process_started_last_30_days[1h])

JVM Memory Pressure

# Heap Usage %
(camunda_jvm_heap_used_mb / camunda_jvm_heap_max_mb) * 100

# Alert on >80%
(camunda_jvm_heap_used_mb / camunda_jvm_heap_max_mb) * 100 > 80

GC Pressure

# Minor GC Rate
rate(camunda_jvm_gc_minor_collections[5m])

# Major GC Rate (should be rare)
rate(camunda_jvm_gc_major_collections[5m])

Database Health

# Connection Pool Usage
camunda_db_connection_utilization_percent

# Latency
camunda_db_latency_ms

# Archivable Data (thousands of instances)
camunda_db_archivable_instances / 1000

Alerting Rules

Prometheus Alert Rules

groups:
  - name: camunda_alerts
    interval: 30s
    rules:
      # High Incident Count
      - alert: HighIncidentCount
        expr: camunda_incidents > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of incidents"
          description: "{{ $value }} incidents are currently open"

      # Node Down
      - alert: CamundaNodeDown
        expr: camunda_node_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camunda node is down"
          description: "Node {{ $labels.node }} is not responding"

      # High Heap Usage
      - alert: HighHeapUsage
        expr: (camunda_jvm_heap_used_mb / camunda_jvm_heap_max_mb) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage is high"
          description: "Node {{ $labels.node }} heap usage is {{ $value }}%"

      # Process Health Degradation
      - alert: ProcessHealthDegraded
        expr: camunda_process_health_score < 70
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Process health score is low"
          description: "Process {{ $labels.process }} v{{ $labels.version }} health: {{ $value }}"

      # High DB Connection Usage
      - alert: HighDBConnectionUsage
        expr: camunda_db_connection_utilization_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Connection usage: {{ $value }}%"

      # Slow Database
      - alert: SlowDatabaseQueries
        expr: camunda_db_latency_ms > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database queries are slow"
          description: "Average latency: {{ $value }}ms"
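
Each rule above combines a threshold expression with a for: duration, which means the condition must hold continuously for that long before the alert fires; a single evaluation below the threshold resets the timer. The Python sketch below models that behaviour over a list of (timestamp, value) evaluations. It is an illustration of the semantics only, not Prometheus code, and first_firing_time is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple


def first_firing_time(samples: Sequence[Tuple[float, float]],
                      threshold: float, for_seconds: float) -> Optional[float]:
    """Return the timestamp at which value > threshold has held for
    for_seconds without interruption, or None if it never does.

    samples are (timestamp_seconds, value) pairs in ascending time order,
    standing in for successive rule evaluations.
    """
    pending_since: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition became true: pending
            if timestamp - pending_since >= for_seconds:
                return timestamp  # held long enough: firing
        else:
            pending_since = None  # condition cleared: the timer resets
    return None
```

With the HighIncidentCount settings (threshold 50, for: 5m, evaluated every 30 seconds), a steady value of 60 would fire after 300 seconds, while a single dip below 50 restarts the wait.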

Best Practices

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'champa-intelligence'
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /health/light/metrics

    # Prometheus 2.26+; older versions use `bearer_token: YOUR_API_TOKEN`
    authorization:
      type: Bearer
      credentials: YOUR_API_TOKEN

    static_configs:
      - targets: ['champa-intelligence:8088']
        labels:
          environment: 'production'
          service: 'champa-intelligence'

Performance Tips

  1. Use the light endpoint for frequent scraping (every 15-30 seconds)
  2. Scrape the full endpoint less often (every 1-5 minutes)
  3. Enable caching in the Champa configuration
  4. Monitor scrape duration in Prometheus (scrape_duration_seconds)
  5. Use recording rules for expensive queries

Security

  1. Use a dedicated API user with the api_access permission
  2. Set an appropriate token TTL (30-90 days)
  3. Rotate tokens regularly
  4. Monitor token usage in the audit logs
  5. Use HTTPS in production

Example Queries

Top 5 Processes by Incident Rate

topk(5, camunda_process_incident_rate)

Node with Highest Workload

topk(1, camunda_node_workload_score)

Processes with Low Health Scores

camunda_process_health_score < 75

Total Throughput (instances/hour)

sum(rate(camunda_process_started_last_30_days[1h])) * 3600

Average Heap Usage Across Cluster

avg(camunda_jvm_heap_utilization_percent)

GC Frequency Per Node

rate(camunda_jvm_gc_minor_collections[5m]) + 
rate(camunda_jvm_gc_major_collections[5m])
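
These queries can also be run outside Grafana through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The Python sketch below builds the request URL and returns the result list. The Prometheus address is a placeholder for whichever server scrapes Champa Intelligence, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build the URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    query_string = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + query_string


def run_instant_query(prometheus_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run the query and return the data.result list from the JSON response."""
    url = instant_query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = json.load(response)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return body["data"]["result"]
```

For instance, run_instant_query("http://prometheus:9090", "topk(5, camunda_process_incident_rate)") would return one entry per process, each with its labels and current value.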

Troubleshooting

Metrics Not Available

Check authentication:

curl -v -H "Authorization: Bearer TOKEN" \
  http://localhost:8088/health/light/metrics

Empty Metrics

Verify data collection:

# Check logs
docker logs champa-intelligence | grep "Engine health"

# Test database connection
curl http://localhost:8088/health/db
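
The checks for missing and empty metrics can be combined into a small triage helper. The Python sketch below maps an HTTP status and response body to a likely cause. It assumes the endpoint follows common HTTP conventions (401 or 403 for rejected tokens, 5xx for server faults); this page does not document the exact status codes Champa Intelligence returns, and diagnose_metrics_response is a hypothetical name.

```python
def diagnose_metrics_response(status: int, body: str) -> str:
    """Map an HTTP status and response body to a likely cause."""
    if status in (401, 403):
        return "authentication: check the token, its TTL, and the API user's permissions"
    if status >= 500:
        return "server error: check the Champa Intelligence logs"
    if status != 200:
        return f"unexpected status {status}"
    # Count sample lines, skipping blanks and '# HELP' / '# TYPE' comments.
    samples = [line for line in body.splitlines()
               if line.strip() and not line.startswith("#")]
    if not samples:
        return "empty metrics: verify data collection and the database connection"
    return f"ok: {len(samples)} samples"
```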

Slow Scrapes

  1. Use /health/light/metrics instead of /health/full/metrics
  2. Increase scrape_timeout in the Prometheus scrape configuration
  3. Check database performance (camunda_db_latency_ms)
  4. Enable Redis caching
