Prometheus Metrics API¶
Champa Intelligence exposes comprehensive metrics in Prometheus format for integration with Grafana, Prometheus, and other monitoring tools.
Endpoints¶
Light Metrics Endpoint¶
Path: /health/light/metrics
Returns essential health metrics with minimal overhead (~100ms response time).
Use for: High-frequency scraping (every 15-30 seconds)
Full Metrics Endpoint¶
Path: /health/full/metrics
Returns complete metrics including analytics data (~500ms response time).
Use for: Detailed monitoring (every 1-5 minutes)
Portfolio Metrics Endpoint¶
Returns portfolio-level KPIs for all processes.
Use for: Business dashboards (every 5-15 minutes)
Authentication¶
All metrics endpoints require API token authentication:
# Using Bearer token
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://your-domain.com/health/light/metrics
# Using basic auth (for compatibility)
curl -u api_user:api_token \
  https://your-domain.com/health/light/metrics
API Token Creation
- Create an API user in User Management
- Set is_api_user = true
- Configure TTL (7-365 days or never)
- Copy generated JWT token
- Use in Authorization header
Available Metrics¶
Cluster Health Metrics¶
camunda_cluster_info¶
Type: gauge
Description: Cluster metadata
Labels: version
camunda_cluster_running_nodes¶
Type: gauge
Description: Number of active Camunda nodes
camunda_active_instances¶
Type: gauge
Description: Total active process instances across cluster
camunda_incidents¶
Type: gauge
Description: Total open incidents
camunda_user_tasks¶
Type: gauge
Description: Active user tasks
camunda_external_tasks¶
Type: gauge
Description: Active external tasks
camunda_total_jobs¶
Type: gauge
Description: Total jobs in queue
camunda_failed_jobs¶
Type: gauge
Description: Jobs with no retries left
Per-Node Metrics¶
camunda_node_status¶
Type: gauge
Description: Node operational status (1=RUNNING, 0=DOWN)
Labels: node, url
camunda_node_status{node="production_node_1",url="http://10.0.1.10:8080"} 1
camunda_node_status{node="production_node_2",url="http://10.0.1.11:8080"} 1
camunda_node_response_time_ms¶
Type: gauge
Description: Node response time in milliseconds
Labels: node
camunda_node_workload_score¶
Type: gauge
Description: Workload score (job acquisitions + decisions + activities)
Labels: node
camunda_node_job_acquisition_success_rate¶
Type: gauge
Description: Job acquisition success rate percentage
Labels: node
camunda_node_job_success_rate¶
Type: gauge
Description: Job execution success rate percentage
Labels: node
JVM Metrics (Per Node)¶
camunda_jvm_heap_used_mb¶
Type: gauge
Description: JVM heap memory used in MB
Labels: node
camunda_jvm_heap_max_mb¶
Type: gauge
Description: JVM maximum heap size in MB
Labels: node
camunda_jvm_heap_utilization_percent¶
Type: gauge
Description: Heap utilization percentage
Labels: node
camunda_jvm_gc_minor_collections¶
Type: counter
Description: Minor GC collection count
Labels: node
camunda_jvm_gc_major_collections¶
Type: counter
Description: Major GC collection count
Labels: node
camunda_jvm_threads_current¶
Type: gauge
Description: Current thread count
Labels: node
camunda_jvm_cpu_load_percent¶
Type: gauge
Description: CPU load percentage
Labels: node
Database Metrics¶
camunda_db_latency_ms¶
Type: gauge
Description: Database query latency in milliseconds
camunda_db_active_connections¶
Type: gauge
Description: Active database connections
camunda_db_max_connections¶
Type: gauge
Description: Maximum database connections configured
camunda_db_connection_utilization_percent¶
Type: gauge
Description: Connection pool utilization percentage
camunda_db_table_size_bytes¶
Type: gauge
Description: Database table size in bytes
Labels: table
camunda_db_archivable_instances¶
Type: gauge
Description: Completed instances eligible for archival (90+ days old)
Process-Level Metrics¶
camunda_process_active_instances¶
Type: gauge
Description: Active instances per process
Labels: process, version
camunda_process_open_incidents¶
Type: gauge
Description: Open incidents per process
Labels: process, version
camunda_process_health_score¶
Type: gauge
Description: Health score (0-100)
Labels: process, version
camunda_process_started_last_30_days¶
Type: counter
Description: Instances started in last 30 days
Labels: process, version
camunda_process_incident_rate¶
Type: gauge
Description: Incident rate percentage
Labels: process, version
Grafana Integration¶
Setup Data Source¶
- Navigate to Grafana → Configuration → Data Sources
- Click "Add data source"
- Select "Prometheus"
- Configure the connection settings (at minimum, the URL of your Prometheus server)
- Click "Save & Test"
Example Dashboard Panels¶
Cluster Health Overview¶
# Active Instances
camunda_active_instances
# Incidents
camunda_incidents
# Node Count
camunda_cluster_running_nodes
# User Tasks
camunda_user_tasks
Per-Process Health¶
# Health Scores
camunda_process_health_score
# Incident Rate (processes with >5%)
camunda_process_incident_rate > 5
# Throughput
rate(camunda_process_started_last_30_days[1h])
JVM Memory Pressure¶
# Heap Usage %
(camunda_jvm_heap_used_mb / camunda_jvm_heap_max_mb) * 100
# Alert on >80%
(camunda_jvm_heap_used_mb / camunda_jvm_heap_max_mb) * 100 > 80
GC Pressure¶
# Minor GC Rate
rate(camunda_jvm_gc_minor_collections[5m])
# Major GC Rate (should be rare)
rate(camunda_jvm_gc_major_collections[5m])
Database Health¶
# Connection Pool Usage
camunda_db_connection_utilization_percent
# Latency
camunda_db_latency_ms
# Archivable Data (in thousands of instances)
camunda_db_archivable_instances / 1000
Alerting Rules¶
Prometheus Alert Rules¶
groups:
  - name: camunda_alerts
    interval: 30s
    rules:
      # High Incident Count
      - alert: HighIncidentCount
        expr: camunda_incidents > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of incidents"
          description: "{{ $value }} incidents are currently open"
      # Node Down
      - alert: CamundaNodeDown
        expr: camunda_node_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camunda node is down"
          description: "Node {{ $labels.node }} is not responding"
      # High Heap Usage
      - alert: HighHeapUsage
        expr: (camunda_jvm_heap_used_mb / camunda_jvm_heap_max_mb) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage is high"
          description: "Node {{ $labels.node }} heap usage is {{ $value }}%"
      # Process Health Degradation
      - alert: ProcessHealthDegraded
        expr: camunda_process_health_score < 70
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Process health score is low"
          description: "Process {{ $labels.process }} v{{ $labels.version }} health: {{ $value }}"
      # High DB Connection Usage
      - alert: HighDBConnectionUsage
        expr: camunda_db_connection_utilization_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Connection usage: {{ $value }}%"
      # Slow Database
      - alert: SlowDatabaseQueries
        expr: camunda_db_latency_ms > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database queries are slow"
          description: "Average latency: {{ $value }}ms"
Best Practices¶
Scrape Configuration¶
# prometheus.yml
scrape_configs:
  - job_name: 'champa-intelligence'
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /health/light/metrics
    bearer_token: YOUR_API_TOKEN
    static_configs:
      - targets: ['champa-intelligence:8088']
        labels:
          environment: 'production'
          service: 'champa-intelligence'
Performance Tips¶
- Use light endpoint for frequent scraping
- Scrape the full endpoint at most once every 5 minutes
- Enable caching in Champa config
- Monitor scrape duration in Prometheus
- Use recording rules for expensive queries
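Scrape duration can be checked with the `scrape_duration_seconds` series that Prometheus records for every target. The sketch below assumes the job name `champa-intelligence` and the 10s scrape timeout from the scrape configuration above; the 8-second threshold is an illustrative choice, not a documented limit.

```promql
# Duration of the most recent scrape, in seconds
scrape_duration_seconds{job="champa-intelligence"}

# Scrapes approaching the 10s timeout (threshold is an assumption)
scrape_duration_seconds{job="champa-intelligence"} > 8
```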
Security¶
- Use a dedicated API user with the api_access permission
- Set an appropriate TTL (30-90 days)
- Rotate tokens regularly
- Monitor token usage in audit logs
- Use HTTPS in production
Example Queries¶
Top 5 Processes by Incident Rate¶
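A minimal sketch, assuming `camunda_process_incident_rate` is exposed per process and version as documented above. `topk` keeps the five series with the highest current value.

```promql
# Five process/version series with the highest incident rate
topk(5, camunda_process_incident_rate)
```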
Node with Highest Workload¶
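One possible query, assuming `camunda_node_workload_score` carries the `node` label as documented above:

```promql
# The single node series with the highest workload score
topk(1, camunda_node_workload_score)
```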
Processes with Low Health Scores¶
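A sketch that reuses the cutoff of 70 from the ProcessHealthDegraded alert rule. The threshold is an assumption and can be adjusted to your own definition of "low".

```promql
# Processes scoring below 70, lowest score first
sort(camunda_process_health_score < 70)
```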
Total Throughput (instances/hour)¶
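Two hedged variants follow, because the right query depends on how `camunda_process_started_last_30_days` behaves. The first assumes it is a monotonically increasing counter, as its declared type suggests. The second assumes it reports a rolling 30-day count, as its name suggests.

```promql
# Variant 1: treat the metric as a true counter.
# rate() yields instances per second; multiply by 3600 for instances per hour.
sum(rate(camunda_process_started_last_30_days[1h])) * 3600

# Variant 2: treat the metric as a rolling 30-day count.
# Average instances per hour over the window (30 days * 24 hours = 720 hours).
sum(camunda_process_started_last_30_days) / 720
```

Variant 2 gives a 30-day average rather than the current hourly rate.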
Average Heap Usage Across Cluster¶
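A sketch using the per-node heap metrics documented above. The two forms answer slightly different questions: the first averages the per-node percentages, while the second weights each node by its heap size.

```promql
# Mean of the per-node utilization percentages
avg(camunda_jvm_heap_utilization_percent)

# Cluster-wide ratio: total heap used over total heap available
sum(camunda_jvm_heap_used_mb) / sum(camunda_jvm_heap_max_mb) * 100
```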
GC Frequency Per Node¶
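One way to express this, assuming both GC counters carry the `node` label as documented above. The result is in collections per second, averaged over a 5-minute window.

```promql
# Combined minor + major GC collections per second, per node
sum by (node) (rate(camunda_jvm_gc_minor_collections[5m]))
  + sum by (node) (rate(camunda_jvm_gc_major_collections[5m]))
```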
Troubleshooting¶
Metrics Not Available¶
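Check authentication: repeat the Bearer-token request from the Authentication section and confirm that the token is valid and has not passed its TTL.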
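From the Prometheus side, the `up` series that Prometheus records for every target shows whether the last scrape succeeded. This sketch assumes the job name `champa-intelligence` from the scrape configuration above. A value of 0 means the scrape failed, which includes requests rejected for authentication reasons.

```promql
# 1 = last scrape succeeded, 0 = last scrape failed
up{job="champa-intelligence"}
```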
Empty Metrics¶
Verify data collection:
# Check logs
docker logs champa-intelligence | grep "Engine health"
# Test database connection
curl http://localhost:8088/health/db
Slow Scrapes¶
- Use /health/light/metrics instead of /health/full/metrics
- Increase scrape timeout in Prometheus
- Check database performance
- Enable Redis caching
Next Steps¶
- Health Monitoring API - Detailed health endpoints
- Deployment Guide - Production monitoring setup