Skip to content

Health Monitoring

Health Monitoring Dashboard Real-time cluster health overview with JVM metrics and performance insights


Overview

The Health Monitoring dashboard provides comprehensive, real-time visibility into your Camunda 7 cluster's health and performance. It consolidates metrics from all cluster nodes, JVM telemetry, database health, and operational insights into a single, unified interface designed for DevOps, SRE, and operations teams.


Key Capabilities

🎯 Cluster-Wide Visibility

Monitor the status and performance of all Camunda nodes in your cluster simultaneously:

  • Node Status Tracking - Real-time status indicators (RUNNING, DOWN, ERROR) for each configured node
  • Response Time Monitoring - Track API response latency for each node to identify performance bottlenecks
  • Workload Distribution - Visualize workload scores across nodes to detect load imbalances
  • Job Executor Health - Monitor job acquisition rates, execution success rates, and rejection patterns

🔍 Deep JVM Insights

Get detailed visibility into the Java Virtual Machine performance for each node:

  • Memory Management - Real-time heap usage, utilization percentages, and memory pressure indicators
  • Garbage Collection - Track minor and major GC events, total GC time, and potential memory issues
  • Thread Monitoring - Current, peak, and daemon thread counts to identify thread exhaustion
  • CPU & System Load - Process CPU load, system load averages, and overall resource utilization
  • File Descriptors - Monitor open file descriptors to prevent resource exhaustion

The dashboard automatically detects whether your nodes expose JVM metrics via JMX exporters or Micrometer (Quarkus) and adapts accordingly.

💾 Database Health

Monitor the PostgreSQL database that powers your Camunda cluster:

  • Connection Health - Database latency measurements via simple query probes
  • Connection Pool Monitoring - Active vs. max connections with utilization percentages
  • Storage Analysis - Identify the largest Camunda tables for capacity planning
  • Archival Opportunities - Count completed process instances eligible for archival (90+ days old)
  • Slow Query Detection - Identify long-running queries impacting performance (requires pg_stat_statements extension)

📊 Process Analytics

Gain operational insights into your process execution patterns:

  • Process Performance - View total instances, completion rates, failure rates, and average durations for each process definition
  • Activity Hotspots - Identify slow or frequently-executed activities that may need optimization
  • Error Patterns - Analyze incident types, frequencies, and affected processes to prioritize fixes
  • Long-Running Instances - Track processes running longer than expected with runtime breakdowns

⚡ Real-Time Metrics

Stay informed about current system activity and throughput:

  • Business Hours Activity - Compare instance starts during business hours vs. off-hours
  • Completion Rates - Track instances completed in the last hour
  • Average Lifetime - Monitor typical process execution times
  • Throughput Trends - Compare current load against peak activity over the last 7 days

🚨 SLA & Resource Alerts

Proactively identify issues before they impact operations:

  • Overdue Tasks - Identify user tasks overdue by 24h, 72h, or more
  • Unassigned Tasks - Track tasks without assignees that are aging
  • Dead Letter Jobs - Monitor jobs that have exhausted all retries
  • Resource Usage - Track total variables, blob variables, and execution counts

📈 System Health Indicators

Understand your system's operational status:

  • Deployment Activity - Recent deployments (24h, 7 days) and total deployment count
  • Core Metrics - Active instances, user tasks, external tasks, and incidents at a glance
  • Job Executor Stats - Total jobs, failed jobs, active timer jobs, and external task states
  • Definitions - Count of deployed process definitions and DMN decision tables

Features in Detail

Auto-Refresh Mode

Enable continuous monitoring with configurable auto-refresh:

  • Automatically refreshes light metrics every 30 seconds
  • Keeps dashboard current without manual intervention
  • Toggle on/off with a single click

Dark Mode Support

Switch between light and dark themes for comfortable viewing in any environment. Your preference is automatically saved.

Lazy Loading Architecture

The dashboard uses intelligent lazy loading to minimize initial load time:

  • Initial Load - Displays essential metrics immediately (cluster status, database health, core counts)
  • On-Demand Loading - Heavy analytics sections load only when viewed
  • Progressive Enhancement - Each section loads independently without blocking others
  • Smart Caching - Already-loaded sections refresh efficiently

Responsive Design

The dashboard adapts seamlessly to different screen sizes, from desktop monitors to tablets.


Prometheus Integration

For teams using enterprise monitoring stacks, Champa Intelligence exposes all metrics via native Prometheus endpoints.

Available Endpoints

Light Metrics Endpoint

GET /health/light/metrics
- Purpose: Essential, low-overhead metrics for real-time alerting - Recommended Scrape Interval: 15-30 seconds - Includes: Node status, JVM heap, DB latency, active instances, incidents

Full Metrics Endpoint

GET /health/full/metrics
- Purpose: Comprehensive metrics for detailed dashboarding and historical analysis - Recommended Scrape Interval: 5 minutes - Includes: All light metrics plus detailed job executor stats, per-process KPIs, database table sizes, slow queries, and more

Example Grafana Queries

Monitor your cluster with powerful PromQL queries:

# Average heap utilization across all nodes
avg(camunda_jvm_heap_utilization_percent)

# Alert when a node goes down
camunda_node_status == 0

# Top 5 processes by incident count
topk(5, camunda_process_open_incidents)

# Database connection pool usage alert
camunda_db_connection_utilization_percent > 80

# Job success rate per node
camunda_node_job_success_rate < 95

Integration Benefits

  • Unified Monitoring - Correlate Camunda health with other system metrics
  • Custom Dashboards - Build tailored views in Grafana or your preferred tool
  • Historical Trending - Analyze performance patterns over time
  • Alert Management - Set up sophisticated alerting rules based on any metric

Using the Dashboard

Initial View

When you first access the dashboard, you'll see:

  1. Cluster Status Panel - Overview of running vs. total nodes with version info
  2. Database Status Panel - Connection health, latency, and pool utilization
  3. Job Executor Panel - Current job counts and execution states
  4. External Tasks Panel - Active external tasks and retry status
  5. Definitions Panel - Deployed process and decision definitions
  6. Core Metrics Panel - Active instances, user tasks, and incidents

Node Details

Expand any node card to view:

  • JVM health metrics (heap, CPU, threads, GC stats)
  • Memory usage with visual indicators
  • File descriptor utilization
  • Activity rates and workload scores
  • Job acquisition and execution statistics
  • Process activity breakdown

Color-coded indicators help you quickly identify:

  • 🟢 Green: Healthy, normal operation
  • 🟡 Yellow: Warning thresholds exceeded
  • 🔴 Red: Critical issues requiring attention

Analytics Sections

Click on any collapsible section to load detailed analytics:

  • Process Analytics - Performance breakdown per process definition
  • Activity Hotspots - Slow or high-volume activities
  • Error Patterns - Incident analysis and troubleshooting insights
  • Long Running Instances - Processes exceeding expected durations
  • Dead Letter Jobs - Failed jobs requiring intervention
  • Database Storage - Table sizes and archival candidates
  • Slow Queries - Database performance bottlenecks

Each section loads independently and caches results for fast subsequent access.


Technical Architecture

The Health Monitoring system is built on a parallel data collection architecture for maximum performance:

  • Concurrent Node Polling - All cluster nodes are queried simultaneously using thread pools
  • JMX/Micrometer Integration - Automatic detection and parsing of JVM metrics from either source
  • Intelligent Caching - Reduces database load through smart query optimization
  • RESTful API Design - Each metric group has a dedicated endpoint for granular loading
  • Prometheus-Native - First-class support for Prometheus scraping and PromQL

Configuration

The system is configured via environment variables and config files:

  • CAMUNDA_NODES - Dictionary of node names and REST API URLs
  • JMX_EXPORTER_ENDPOINTS - JMX exporter URLs per node
  • JVM_METRICS_SOURCE - Set to 'jmx' or 'micrometer' based on your setup
  • STUCK_INSTANCE_DAYS - Threshold for considering instances "stuck" (default: 7)

Best Practices

Monitoring Strategy

  1. Set Up Baseline Metrics - Understand your normal operating parameters
  2. Enable Auto-Refresh - Keep the dashboard open during critical operations
  3. Configure Alerting - Use Prometheus endpoints to trigger alerts for critical conditions
  4. Review Regularly - Check activity hotspots and error patterns weekly
  5. Plan Capacity - Use database storage metrics to plan archival and scaling

Performance Tips

  • Use the light metrics endpoint for high-frequency Prometheus scraping
  • Enable lazy loading (default) to minimize initial dashboard load
  • Consider database archival when completed instances exceed 90 days old
  • Monitor heap utilization and plan JVM tuning before reaching 80%
  • Track slow queries and add indexes where beneficial

Troubleshooting

Node shows ERROR status

  • Verify the node is reachable via the configured REST API URL
  • Check authentication credentials (CAMUNDA_API_USER, CAMUNDA_API_PASSWORD)
  • Review node logs for startup issues

JVM metrics show NO_JMX_DATA

  • Confirm JMX exporter or Micrometer is properly configured on the node
  • Verify JMX_EXPORTER_ENDPOINTS contains the correct URL
  • Check JVM_METRICS_SOURCE matches your setup (jmx vs. micrometer)

Database latency is high

  • Review slow queries if pg_stat_statements is enabled
  • Check database connection pool size vs. active connections
  • Consider optimizing or indexing identified slow queries

API Reference

REST Endpoints

Endpoint Method Description
/health GET Main dashboard page
/health/api/full GET Complete health data (JSON)
/health/api/metrics/<group> GET Specific metric group (lazy load)
/health/api/individual/<metric> GET Single metric value
/health/api/block/<block> GET Individual analytics block
/health/light/metrics GET Prometheus metrics (light)
/health/full/metrics GET Prometheus metrics (full)

Metric Groups

  • process-analytics - Process definitions, completions, failures
  • system-health - Deployments, dead letter jobs, long-running instances
  • quick-metrics - Business hours activity, lifecycle metrics
  • sla-metrics - Overdue tasks, resource alerts
  • throughput-metrics - Load trends, peak comparisons
  • jmx-metrics - JVM health for all nodes
  • database-metrics - Table sizes, slow queries, archival data

Individual Metrics

  • stuck-instances - Count of processes stuck beyond threshold
  • job-throughput - Jobs executed per minute
  • pending-messages - Message event subscriptions waiting
  • pending-signals - Signal event subscriptions waiting

FAQ

Q: How often should I refresh the dashboard?
A: The light metrics refresh every 30 seconds in auto-refresh mode, which is suitable for most monitoring scenarios. For Prometheus scraping, use 15-30 seconds for light metrics and 5 minutes for full metrics.

Q: Can I monitor multiple clusters?
A: Currently, each Champa Intelligence instance monitors one Camunda cluster. Configure CAMUNDA_NODES with all nodes in your target cluster.

Q: What if my nodes don't expose JMX metrics?
A: The dashboard will function without JVM metrics, showing node status and Camunda-specific metrics. JVM insights simply won't be available for those nodes.

Q: How do I enable pg_stat_statements for slow query detection?
A: Add shared_preload_libraries = 'pg_stat_statements' to your PostgreSQL configuration and restart. Then run CREATE EXTENSION pg_stat_statements; in your database.

Q: Can I export the dashboard data?
A: Yes, use the /health/api/full endpoint to retrieve all metrics as JSON, or scrape the Prometheus endpoints for time-series data.