Monitoring & Alerting¶
Champa Intelligence is designed for enterprise-grade observability and integrates seamlessly with monitoring stacks like Prometheus and Grafana.
Prometheus Integration¶
The platform exposes two primary Prometheus metrics endpoints, both requiring an API token for authentication.
Metrics Endpoints¶
- Light Metrics (`/health/light/metrics`):
    - Purpose: Essential, low-overhead metrics for high-frequency scraping.
    - Scrape Interval: Recommended 15-30 seconds.
    - Use Case: Real-time alerting on critical health indicators such as node status, JVM heap, and database connectivity.
- Full Metrics (`/health/full/metrics`):
    - Purpose: A comprehensive set of all application and analytics metrics.
    - Scrape Interval: Recommended 5 minutes.
    - Use Case: Populating detailed Grafana dashboards for historical analysis and performance trending.
Prometheus Configuration¶
Here is an example `prometheus.yml` configuration to scrape the Champa Intelligence endpoints.

```yaml
scrape_configs:
  - job_name: 'champa_light'
    scrape_interval: 30s
    metrics_path: /health/light/metrics
    scheme: http  # or https if using TLS
    # Use bearer token authentication
    bearer_token: 'YOUR_LONG_LIVED_API_TOKEN'
    static_configs:
      - targets: ['champa-intelligence-host:8088']

  - job_name: 'champa_full'
    scrape_interval: 5m
    metrics_path: /health/full/metrics
    scheme: http  # or https
    bearer_token: 'YOUR_LONG_LIVED_API_TOKEN'
    static_configs:
      - targets: ['champa-intelligence-host:8088']
```
Creating an API Token
- In Champa Intelligence, go to Admin → User Management.
- Create a new user (e.g., `prometheus_user`).
- Check the "Is API User?" box and set a long TTL (e.g., 365 days).
- Assign a role with `api_access` and `health_monitor_data` permissions.
- Copy the generated JWT and use it as the `bearer_token`.
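Once the token exists, it is worth verifying it outside Prometheus before wiring up the scrape jobs. The following is a minimal sketch using only the Python standard library; the host, port, and token are the placeholders from the configuration example, and the parser handles only simple exposition-format sample lines (it skips `# HELP`/`# TYPE` comments).

```python
# Sketch: scrape the light metrics endpoint with a bearer token and
# parse the Prometheus exposition-format response.
import re
import urllib.request

def parse_metric_line(line: str):
    """Parse one sample line into (name, labels, value), or None for
    comments, blanks, and HELP/TYPE lines."""
    m = re.match(r'^(\w+)(?:\{([^}]*)\})?\s+(\S+)$', line)
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = {}
    if raw_labels:
        for kv in raw_labels.split(','):
            key, val = kv.split('=', 1)
            labels[key.strip()] = val.strip().strip('"')
    return name, labels, float(value)

def scrape(url: str, token: str):
    """Fetch a metrics endpoint and return the parsed samples."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read().decode()
    return [p for p in (parse_metric_line(l) for l in body.splitlines()) if p]

if __name__ == "__main__":
    # Placeholder host and token from the prometheus.yml example above
    for name, labels, value in scrape(
        "http://champa-intelligence-host:8088/health/light/metrics",
        "YOUR_LONG_LIVED_API_TOKEN",
    ):
        print(name, labels, value)
```

A `401` response here usually means the token lacks the `api_access` or `health_monitor_data` permission, or has expired.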
Grafana Dashboards¶
You can use the exposed metrics to build powerful Grafana dashboards. Here are some example panels and PromQL queries.
Example Panel: Cluster Health¶
Visualization: Stat Panel
Queries:
- Query for Running Nodes: `camunda_cluster_running_nodes`
- Query for Active Instances: `camunda_active_instances`
- Query for Open Incidents: `camunda_incidents`
Example Panel: JVM Heap Usage per Node¶
Visualization: Time series Graph
Query: `camunda_jvm_heap_utilization_percent`
Legend: `{{node}}`
Example Panel: Database Connection Pool¶
Visualization: Gauge
Query: `camunda_db_connection_utilization_percent`
Thresholds: Set thresholds to visualize when usage is high (e.g., 80%).
Example Panel: Top 5 Processes by Incidents¶
Visualization: Bar chart
Query: `topk(5, camunda_incidents)`
Legend: `{{process}}`
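Before committing a PromQL expression to a panel, it can help to run it against Prometheus's standard instant-query endpoint (`/api/v1/query`) and inspect the raw result. A minimal sketch; the Prometheus URL is a placeholder for your deployment:

```python
# Sketch: run an instant PromQL query against the Prometheus HTTP API
# and flatten the JSON response into {label-tuple: value}.
import json
import urllib.parse
import urllib.request

def extract_instant_values(response: dict):
    """Turn an /api/v1/query JSON body into {sorted-label-tuple: float}."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    out = {}
    for sample in response["data"]["result"]:
        labels = tuple(sorted(sample["metric"].items()))
        out[labels] = float(sample["value"][1])  # value is [timestamp, "str"]
    return out

def instant_query(prometheus_url: str, promql: str):
    """Execute an instant query and return the flattened result."""
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_instant_values(json.load(resp))

# Example usage (placeholder URL):
# instant_query("http://localhost:9090", "topk(5, camunda_incidents)")
```

If a panel renders empty in Grafana, this is a quick way to tell whether the query returns no series or the dashboard's data source is misconfigured.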
Alerting with Alertmanager¶
You can configure Prometheus alerts based on the scraped metrics. Here are some recommended alert rules.
```yaml
groups:
  - name: champa_alerts
    rules:
      - alert: CamundaNodeDown
        expr: camunda_node_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camunda node {{ $labels.node }} is down."
          description: "The Camunda node at URL {{ $labels.url }} is not responding."
      - alert: HighJvmHeapUsage
        expr: camunda_jvm_heap_utilization_percent > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High JVM heap usage on node {{ $labels.node }}."
          description: "JVM heap utilization is {{ $value | printf \"%.2f\" }}% on node {{ $labels.node }}."
      - alert: HighIncidentCount
        expr: camunda_incidents > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of open incidents in Camunda."
          description: "There are currently {{ $value }} open incidents across the cluster."
      - alert: HighDatabaseConnectionUsage
        expr: camunda_db_connection_utilization_percent > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool is nearly exhausted."
          description: "The connection pool utilization is at {{ $value | printf \"%.2f\" }}%."
```
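Rules like these can be unit-tested with `promtool test rules` before deployment. A hedged example, assuming the rule group is saved as `champa_alerts.yml` and that `camunda_node_status` carries the `node` and `url` labels the annotations reference:

```yaml
# champa_alert_tests.yml -- promtool rule unit test (illustrative)
rule_files:
  - champa_alerts.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Node reports down (0) for six consecutive samples
      - series: 'camunda_node_status{node="node-1", url="http://champa-intelligence-host:8088"}'
        values: '0x5'
    alert_rule_test:
      - eval_time: 3m  # past the 2m "for" duration, so the alert should fire
        alertname: CamundaNodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              node: node-1
              url: http://champa-intelligence-host:8088
            exp_annotations:
              summary: "Camunda node node-1 is down."
              description: "The Camunda node at URL http://champa-intelligence-host:8088 is not responding."
```

Run it with `promtool test rules champa_alert_tests.yml`.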
Best Practices¶
Metric Retention¶
Consider different retention policies for light vs. full metrics:
- Light metrics: Retain for 30-90 days (high frequency data)
- Full metrics: Retain for 6-12 months (detailed analytics)
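One way to realize split retention, assuming you run two dedicated Prometheus instances (one per scrape job) rather than a remote-storage tier, is Prometheus's standard `--storage.tsdb.retention.time` flag:

```
# Light-metrics instance: high-frequency data, short retention
prometheus --config.file=prometheus-light.yml --storage.tsdb.retention.time=90d

# Full-metrics instance: 5-minute data, long retention
prometheus --config.file=prometheus-full.yml --storage.tsdb.retention.time=365d
```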
Alert Fatigue Prevention¶
- Use appropriate `for` durations to avoid transient alerts
- Set meaningful thresholds based on your workload patterns
- Use severity levels to prioritize alerts
- Group related alerts to reduce notification volume
Dashboard Organization¶
Organize your Grafana dashboards by audience:
- Operations Dashboard: Real-time health, incidents, node status
- Performance Dashboard: Response times, throughput, resource usage
- Business Dashboard: Process metrics, SLA compliance, journey patterns
- Capacity Planning: Trend analysis, growth projections
Sample Alertmanager Configuration¶
Configure Alertmanager to route alerts to appropriate channels:
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops-team@example.com'
  - name: 'slack'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#camunda-alerts'
        title: 'Champa Alert: {{ .GroupLabels.alertname }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
```