Monitoring & Alerting

Champa Intelligence exposes Prometheus-compatible metrics endpoints for enterprise-grade observability, so it integrates with standard monitoring stacks such as Prometheus and Grafana.


Prometheus Integration

The platform exposes two primary Prometheus metrics endpoints, both requiring an API token for authentication.

Metrics Endpoints

  • Light Metrics (/health/light/metrics):

    • Purpose: Essential, low-overhead metrics for high-frequency scraping.
    • Scrape Interval: 15-30 seconds (recommended).
    • Use Case: Real-time alerting on critical health indicators like node status, JVM heap, and database connectivity.
  • Full Metrics (/health/full/metrics):

    • Purpose: A comprehensive set of all application and analytics metrics.
    • Scrape Interval: 5 minutes (recommended).
    • Use Case: Populating detailed Grafana dashboards for historical analysis and performance trending.

Prometheus Configuration

Here is an example prometheus.yml configuration to scrape the Champa Intelligence endpoints.

scrape_configs:
  - job_name: 'champa_light'
    scrape_interval: 30s
    metrics_path: /health/light/metrics
    scheme: http # or https if using TLS
    static_configs:
      - targets: ['champa-intelligence-host:8088']
    # Use bearer token authentication (recent Prometheus releases prefer the
    # equivalent authorization block with credentials/credentials_file)
    bearer_token: 'YOUR_LONG_LIVED_API_TOKEN'

  - job_name: 'champa_full'
    scrape_interval: 5m
    metrics_path: /health/full/metrics
    scheme: http # or https
    static_configs:
      - targets: ['champa-intelligence-host:8088']
    bearer_token: 'YOUR_LONG_LIVED_API_TOKEN'

Creating an API Token

  1. In Champa Intelligence, go to Admin → User Management.
  2. Create a new user (e.g., prometheus_user).
  3. Check the "Is API User?" box and set a long TTL (e.g., 365 days).
  4. Assign a role with api_access and health_monitor_data permissions.
  5. Copy the generated JWT and use it as the bearer_token.
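To avoid embedding the raw JWT directly in prometheus.yml, you can store it in a file and reference it from the scrape config instead. A minimal sketch (the token path is illustrative; the authorization block is the form preferred by recent Prometheus releases):

```yaml
scrape_configs:
  - job_name: 'champa_light'
    metrics_path: /health/light/metrics
    # Read the token from a file instead of inlining it in the config
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/champa_token
    static_configs:
      - targets: ['champa-intelligence-host:8088']
```

Keep the token file readable only by the Prometheus process, and rotate the token before its TTL expires.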

Grafana Dashboards

You can use the exposed metrics to build powerful Grafana dashboards. Here are some example panels and PromQL queries.

Example Panel: Cluster Health

Visualization: Stat Panel

Queries:

  • Query for Running Nodes: camunda_cluster_running_nodes
  • Query for Active Instances: camunda_active_instances
  • Query for Open Incidents: camunda_incidents

Example Panel: JVM Heap Usage per Node

Visualization: Time series Graph

Query:

camunda_jvm_heap_utilization_percent

Legend: {{node}}

Example Panel: Database Connection Pool

Visualization: Gauge

Query:

camunda_db_connection_utilization_percent

Thresholds: Set thresholds to visualize when usage is high (e.g., 80%).

Example Panel: Top 5 Processes by Incidents

Visualization: Bar chart

Query:

topk(5, camunda_process_open_incidents)

Legend: {{process}}
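Dashboards built from panels like these can be kept in version control and loaded automatically through Grafana's file-based provisioning. A minimal provisioning sketch (the provider name, folder, and paths are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/champa.yaml
apiVersion: 1
providers:
  - name: 'champa-dashboards'
    folder: 'Champa Intelligence'
    type: file
    options:
      # Directory containing the exported dashboard JSON files
      path: /var/lib/grafana/dashboards/champa
```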


Alerting with Alertmanager

You can configure Prometheus alerts based on the scraped metrics. Here are some recommended alert rules.

groups:
  - name: champa_alerts
    rules:
      - alert: CamundaNodeDown
        expr: camunda_node_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camunda node {{ $labels.node }} is down."
          description: "The Camunda node at URL {{ $labels.url }} is not responding."

      - alert: HighJvmHeapUsage
        expr: camunda_jvm_heap_utilization_percent > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High JVM Heap Usage on node {{ $labels.node }}."
          description: "JVM heap utilization is {{ $value | printf \"%.2f\" }}% on node {{ $labels.node }}."

      - alert: HighIncidentCount
        expr: camunda_incidents > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of open incidents in Camunda."
          description: "There are currently {{ $value }} open incidents across the cluster."

      - alert: HighDatabaseConnectionUsage
        expr: camunda_db_connection_utilization_percent > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool is nearly exhausted."
          description: "The connection pool utilization is at {{ $value | printf \"%.2f\" }}%."
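For Prometheus to evaluate these rules, save them to a rules file and reference it from prometheus.yml via rule_files (the filename below is illustrative):

```yaml
rule_files:
  - /etc/prometheus/rules/champa_alerts.yml
```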

Best Practices

Metric Retention

Consider different retention policies for light vs. full metrics:

  • Light metrics: Retain for 30-90 days (high frequency data)
  • Full metrics: Retain for 6-12 months (detailed analytics)
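Prometheus applies a single retention policy per server, so keeping light and full metrics for different periods typically means running two instances (or offloading long-term data to remote storage). A minimal docker-compose sketch, assuming the official prom/prometheus image and a separate scrape config file per job:

```yaml
services:
  prometheus-light:
    image: prom/prometheus
    command:
      - '--config.file=/etc/prometheus/light.yml'   # scrapes /health/light/metrics
      - '--storage.tsdb.retention.time=90d'
  prometheus-full:
    image: prom/prometheus
    command:
      - '--config.file=/etc/prometheus/full.yml'    # scrapes /health/full/metrics
      - '--storage.tsdb.retention.time=365d'
```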

Alert Fatigue Prevention

  1. Set appropriate for durations so that transient spikes do not trigger alerts
  2. Set meaningful thresholds based on your workload patterns
  3. Use severity levels to prioritize alerts
  4. Group related alerts to reduce notification volume
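Beyond routing-level grouping, Alertmanager inhibition can suppress warnings while a related critical alert is already firing. A sketch, assuming the related alerts carry a matching node label:

```yaml
inhibit_rules:
  # Mute warning-level alerts on a node that already has a critical alert firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['node']
```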

Dashboard Organization

Organize your Grafana dashboards by audience:

  • Operations Dashboard: Real-time health, incidents, node status
  • Performance Dashboard: Response times, throughput, resource usage
  • Business Dashboard: Process metrics, SLA compliance, journey patterns
  • Capacity Planning: Trend analysis, growth projections

Sample Alertmanager Configuration

Configure Alertmanager to route alerts to appropriate channels:

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 5m  # wait before notifying about new alerts added to a group
  repeat_interval: 12h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops-team@example.com'

  - name: 'slack'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#camunda-alerts'
        title: 'Champa Alert: {{ .GroupLabels.alertname }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
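Finally, point Prometheus at Alertmanager so that firing rules are actually delivered (the hostname is illustrative; 9093 is the default Alertmanager port):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-host:9093']
```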