Monitoring & Alerting¶
Champa Intelligence is designed for enterprise-grade observability and integrates seamlessly with monitoring stacks like Prometheus and Grafana.
Prometheus Integration¶
The platform exposes two primary Prometheus metrics endpoints, both requiring an API token for authentication.
Metrics Endpoints¶
- Light Metrics (`/health/light/metrics`):
    - Purpose: Essential, low-overhead metrics for high-frequency scraping.
    - Scrape Interval: Recommended 15-30 seconds.
    - Use Case: Real-time alerting on critical health indicators such as node status, JVM heap, and database connectivity.
- Full Metrics (`/health/full/metrics`):
    - Purpose: A comprehensive set of all application and analytics metrics.
    - Scrape Interval: Recommended 5 minutes.
    - Use Case: Populating detailed Grafana dashboards for historical analysis and performance trending.
Prometheus Configuration¶
Here is an example `prometheus.yml` configuration to scrape the Champa Intelligence endpoints.

```yaml
scrape_configs:
  - job_name: 'champa_light'
    scrape_interval: 30s
    metrics_path: /health/light/metrics
    scheme: http  # or https if using TLS
    # Use bearer token authentication
    bearer_token: 'YOUR_LONG_LIVED_API_TOKEN'
    static_configs:
      - targets: ['champa-intelligence-host:8088']

  - job_name: 'champa_full'
    scrape_interval: 5m
    metrics_path: /health/full/metrics
    scheme: http  # or https
    bearer_token: 'YOUR_LONG_LIVED_API_TOKEN'
    static_configs:
      - targets: ['champa-intelligence-host:8088']
```
Creating an API Token
- In Champa Intelligence, go to Admin → User Management.
- Create a new user (e.g., `prometheus_user`).
- Check the "Is API User?" box and set a long TTL (e.g., 365 days).
- Assign a role with `api_access` and `health_monitor_data` permissions.
- Copy the generated JWT and use it as the `bearer_token`.
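Once the token exists, it is worth verifying it outside Prometheus before wiring up the scrape jobs. The following is a minimal sketch using only the Python standard library; the host, port, and token are the placeholders from the configuration example, and the parser handles only simple exposition-format sample lines (it skips `# HELP`/`# TYPE` comments).

```python
# Sketch: scrape the light metrics endpoint with a bearer token and
# parse the Prometheus exposition-format response.
import re
import urllib.request

def parse_metric_line(line: str):
    """Parse one sample line into (name, labels, value), or None for
    comments, blanks, and HELP/TYPE lines."""
    m = re.match(r'^(\w+)(?:\{([^}]*)\})?\s+(\S+)$', line)
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = {}
    if raw_labels:
        for kv in raw_labels.split(','):
            key, val = kv.split('=', 1)
            labels[key.strip()] = val.strip().strip('"')
    return name, labels, float(value)

def scrape(url: str, token: str):
    """Fetch a metrics endpoint and return the parsed samples."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read().decode()
    return [p for p in (parse_metric_line(l) for l in body.splitlines()) if p]

if __name__ == "__main__":
    # Placeholder host and token from the prometheus.yml example above
    for name, labels, value in scrape(
        "http://champa-intelligence-host:8088/health/light/metrics",
        "YOUR_LONG_LIVED_API_TOKEN",
    ):
        print(name, labels, value)
```

A `401` response here usually means the token lacks the `api_access` or `health_monitor_data` permission, or has expired.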
Grafana Dashboards¶
You can use the exposed metrics to build powerful Grafana dashboards. Here are some example panels and PromQL queries.
Example Panel: Cluster Health¶
Visualization: Stat Panel
Queries:
- Query for Running Nodes: `camunda_cluster_running_nodes`
- Query for Active Instances: `camunda_active_instances`
- Query for Open Incidents: `camunda_incidents`
Example Panel: JVM Heap Usage per Node¶
Visualization: Time series Graph
Query: `camunda_jvm_heap_utilization_percent`
Legend: `{{node}}`
Example Panel: Database Connection Pool¶
Visualization: Gauge
Query: `camunda_db_connection_utilization_percent`
Thresholds: Set thresholds to visualize when usage is high (e.g., 80%).
Example Panel: Top 5 Processes by Incidents¶
Visualization: Bar chart
Query: `topk(5, camunda_incidents)`
Legend: `{{process}}`
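Before committing a PromQL expression to a panel, it can help to run it against Prometheus's standard instant-query endpoint (`/api/v1/query`) and inspect the raw result. A minimal sketch; the Prometheus URL is a placeholder for your deployment:

```python
# Sketch: run an instant PromQL query against the Prometheus HTTP API
# and flatten the JSON response into {label-tuple: value}.
import json
import urllib.parse
import urllib.request

def extract_instant_values(response: dict):
    """Turn an /api/v1/query JSON body into {sorted-label-tuple: float}."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    out = {}
    for sample in response["data"]["result"]:
        labels = tuple(sorted(sample["metric"].items()))
        out[labels] = float(sample["value"][1])  # value is [timestamp, "str"]
    return out

def instant_query(prometheus_url: str, promql: str):
    """Execute an instant query and return the flattened result."""
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_instant_values(json.load(resp))

# Example usage (placeholder URL):
# instant_query("http://localhost:9090", "topk(5, camunda_incidents)")
```

If a panel renders empty in Grafana, this is a quick way to tell whether the query returns no series or the dashboard's data source is misconfigured.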
Alerting with Alertmanager¶
You can configure Prometheus alerts based on the scraped metrics. Here are some recommended alert rules.
```yaml
groups:
  - name: champa_alerts
    rules:
      - alert: CamundaNodeDown
        expr: camunda_node_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Camunda node {{ $labels.node }} is down."
          description: "The Camunda node at URL {{ $labels.url }} is not responding."
      - alert: HighJvmHeapUsage
        expr: camunda_jvm_heap_utilization_percent > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High JVM heap usage on node {{ $labels.node }}."
          description: "JVM heap utilization is {{ $value | printf \"%.2f\" }}% on node {{ $labels.node }}."
      - alert: HighIncidentCount
        expr: camunda_incidents > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of open incidents in Camunda."
          description: "There are currently {{ $value }} open incidents across the cluster."
      - alert: HighDatabaseConnectionUsage
        expr: camunda_db_connection_utilization_percent > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool is nearly exhausted."
          description: "The connection pool utilization is at {{ $value | printf \"%.2f\" }}%."
```
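Rules like these can be unit-tested with `promtool test rules` before deployment. A hedged example, assuming the rule group is saved as `champa_alerts.yml` and that `camunda_node_status` carries the `node` and `url` labels the annotations reference:

```yaml
# champa_alert_tests.yml -- promtool rule unit test (illustrative)
rule_files:
  - champa_alerts.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Node reports down (0) for six consecutive samples
      - series: 'camunda_node_status{node="node-1", url="http://champa-intelligence-host:8088"}'
        values: '0x5'
    alert_rule_test:
      - eval_time: 3m  # past the 2m "for" duration, so the alert should fire
        alertname: CamundaNodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              node: node-1
              url: http://champa-intelligence-host:8088
            exp_annotations:
              summary: "Camunda node node-1 is down."
              description: "The Camunda node at URL http://champa-intelligence-host:8088 is not responding."
```

Run it with `promtool test rules champa_alert_tests.yml`.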
Best Practices¶
Metric Retention¶
Consider different retention policies for light vs. full metrics:
- Light metrics: Retain for 30-90 days (high frequency data)
- Full metrics: Retain for 6-12 months (detailed analytics)
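One way to realize split retention, assuming you run two dedicated Prometheus instances (one per scrape job) rather than a remote-storage tier, is Prometheus's standard `--storage.tsdb.retention.time` flag:

```
# Light-metrics instance: high-frequency data, short retention
prometheus --config.file=prometheus-light.yml --storage.tsdb.retention.time=90d

# Full-metrics instance: 5-minute data, long retention
prometheus --config.file=prometheus-full.yml --storage.tsdb.retention.time=365d
```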
Alert Fatigue Prevention¶
- Use appropriate `for` durations to avoid transient alerts
- Set meaningful thresholds based on your workload patterns
- Use severity levels to prioritize alerts
- Group related alerts to reduce notification volume
Dashboard Organization¶
Organize your Grafana dashboards by audience:
- Operations Dashboard: Real-time health, incidents, node status
- Performance Dashboard: Response times, throughput, resource usage
- Business Dashboard: Process metrics, SLA compliance, journey patterns
- Capacity Planning: Trend analysis, growth projections
Sample Alertmanager Configuration¶
Configure Alertmanager to route alerts to appropriate channels:
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops-team@example.com'
  - name: 'slack'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#camunda-alerts'
        title: 'Champa Alert: {{ .GroupLabels.alertname }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
```