Health Monitoring API

The Health Monitoring API provides a rich set of endpoints for observing the real-time status of your Camunda cluster, JVMs, database, and overall application health.


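Base Path: /health (endpoint paths below are relative to this base path, so GET /api/full is served at /health/api/full)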

Required Permission: health_monitor_data


Main Endpoints

Get Full Engine Health Data

This endpoint collects a comprehensive snapshot of the entire cluster's health. It is used to populate the main Health Monitoring dashboard but can also be called directly.

GET /api/full

Response: 200 OK

A complex JSON object containing several nested structures:

{
  "cluster_nodes": [
    {
      "name": "instance1",
      "status": "RUNNING",
      "response_time_ms": 120.5,
      "jvm_metrics": {
        "status": "HEALTHY",
        "memory": {
          "heap_used_mb": 512,
          "heap_max_mb": 2048,
          "heap_utilization_pct": 25.0
        },
        "gc": {
          "minor_collections": 150,
          "major_collections": 5
        },
        "threads": {
          "current": 85
        }
      },
      "job_acquisition_success_rate": 99.5,
      "job_success_rate": 98.9,
      "workload_score": 1500
    }
  ],
  "cluster_status": {
    "total_nodes": 1,
    "running_nodes": 1,
    "engine_version": "7.20+",
    "issues": []
  },
  "totals": {
    "active_instances": 125,
    "user_tasks": 30,
    "incidents": 5
  },
  "db_metrics": {
    "connectivity": "OK",
    "latency_ms": 15,
    "active_connections": 10,
    "max_connections": 100,
    "connection_utilization": 10.0
  },
  "process_analytics": {
    "activity_hotspots": [
      {
        "activity_name": "Process Payment",
        "avg_duration_ms": 5500
      }
    ]
  },
  "timestamp": "2025-10-28T10:30:00Z"
}
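
As a rough illustration of how a client might read this payload, the sketch below reduces it to a few headline values. The helper name `summarize_health` and the choice of values are assumptions made for this example and are not part of the API; only the field names come from the response above.

```python
from typing import Any, Dict, List, Optional


def summarize_health(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a full health payload to a few headline values (hypothetical helper)."""
    status = payload.get("cluster_status") or {}
    total = status.get("total_nodes") or 0
    running = status.get("running_nodes") or 0

    # Highest heap utilisation reported by any node; nodes without JVM data are skipped.
    heap_values: List[float] = []
    for node in payload.get("cluster_nodes") or []:
        memory = (node.get("jvm_metrics") or {}).get("memory") or {}
        if "heap_utilization_pct" in memory:
            heap_values.append(memory["heap_utilization_pct"])

    # Recompute connection pool utilisation from the raw connection counts.
    db = payload.get("db_metrics") or {}
    max_conn = db.get("max_connections") or 0
    pool_pct: Optional[float] = (
        100.0 * db.get("active_connections", 0) / max_conn if max_conn else None
    )

    return {
        "all_nodes_running": total > 0 and running == total,
        "max_heap_utilization_pct": max(heap_values, default=None),
        "db_pool_utilization_pct": pool_pct,
        "incidents": (payload.get("totals") or {}).get("incidents"),
    }
```

Every lookup falls back to an empty value, so a payload with missing sections yields `None` entries instead of raising.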

Lazy-Loading API Endpoints

These endpoints are used by the frontend to load dashboard sections on demand.

Get Metric Group

Fetches a specific group of related metrics.

GET /api/metrics/<metric_group>

Path Parameters:

| Parameter | Description |
| --- | --- |
| metric_group | The name of the metric group to fetch. Available groups: process-analytics, system-health, quick-metrics, sla-metrics, throughput-metrics, jmx-metrics, database-metrics |

Response: 200 OK

A JSON object containing the data for the requested group. For example, /api/metrics/system-health:

{
  "deployment_health": {
    "total_deployments": 50,
    "recent_deployments": 2
  },
  "dead_letter_jobs": [
    {
      "type_": "message",
      "failed_job_count": 3
    }
  ],
  "long_running_instances": [
    {
      "process_key": "order-process",
      "long_running_count": 5
    }
  ],
  "timestamp": "2025-10-28T10:30:00Z"
}
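
Because the group name is part of the URL, a client may want to reject unknown names before sending a request. The sketch below shows one way to do that; `METRIC_GROUPS` and `metric_group_path` are hypothetical names for this example, and the list of groups is copied from the parameter description above.

```python
# Metric groups listed in the path parameter description.
METRIC_GROUPS = frozenset({
    "process-analytics",
    "system-health",
    "quick-metrics",
    "sla-metrics",
    "throughput-metrics",
    "jmx-metrics",
    "database-metrics",
})


def metric_group_path(group: str, base_path: str = "/health") -> str:
    """Build the request path for a metric group, rejecting unknown names (hypothetical helper)."""
    if group not in METRIC_GROUPS:
        raise ValueError(
            f"Unknown metric group {group!r}; expected one of {sorted(METRIC_GROUPS)}"
        )
    return f"{base_path}/api/metrics/{group}"
```

For example, `metric_group_path("system-health")` produces the path used in the response example above, prefixed with the base path.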

Get Individual Metric

Fetches a single, specific metric that may be slow to compute.

GET /api/individual/<metric_name>

Path Parameters:

| Parameter | Description |
| --- | --- |
| metric_name | The name of the individual metric. Available metrics: stuck-instances, job-throughput, pending-messages, pending-signals |

Response: 200 OK

{
  "value": 15,
  "timestamp": "2025-10-28T10:30:00Z"
}
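
A client will usually want the timestamp as a real datetime instead of a string. The following sketch parses the response shown above into a small typed record; `MetricReading` and `parse_metric` are hypothetical names, and the code assumes the timestamp is always ISO 8601 in UTC with a trailing `Z`, as in the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class MetricReading:
    """A single metric value and the time it was taken (hypothetical type)."""
    value: float
    timestamp: datetime


def parse_metric(payload: Dict[str, Any]) -> MetricReading:
    raw = payload["timestamp"]
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so rewrite it as an explicit UTC offset for older interpreters.
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    return MetricReading(value=payload["value"], timestamp=datetime.fromisoformat(raw))


reading = parse_metric({"value": 15, "timestamp": "2025-10-28T10:30:00Z"})
age_reference = datetime(2025, 10, 28, 10, 31, tzinfo=timezone.utc)
seconds_old = (age_reference - reading.timestamp).total_seconds()
```

Keeping the timestamp timezone-aware makes it straightforward to compute how old a reading is before deciding whether to refresh it.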

Get Dashboard Block

Fetches data for a specific visual block on the dashboard.

GET /api/block/<block_name>

Path Parameters:

| Parameter | Description |
| --- | --- |
| block_name | The name of the dashboard block. Available blocks: process-definitions, long-running, activity-hotspots, error-patterns, dead-letter-jobs, database-storage, slow-queries |

Response: 200 OK

A JSON object keyed by the block name, containing the relevant data. For example, /api/block/dead-letter-jobs:

{
  "dead_letter_jobs": [
    {
      "type_": "async-continuation",
      "failed_job_count": 5,
      "error_messages": "java.lang.NullPointerException; ..."
    }
  ],
  "timestamp": "2025-10-28T10:30:00Z"
}
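
The sketch below shows how a client might pull the rows out of a block response and rank them. It assumes that hyphens in the block name become underscores in the response key, which is inferred only from the dead-letter-jobs example above and may not hold for every block; `block_rows` and `worst_job_types` are hypothetical helper names.

```python
from typing import Any, Dict, List


def block_rows(block_name: str, payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the rows stored under a block's key, or an empty list if the key is absent."""
    # Assumption: "dead-letter-jobs" is returned under the key "dead_letter_jobs".
    key = block_name.replace("-", "_")
    return list(payload.get(key) or [])


def worst_job_types(payload: Dict[str, Any], limit: int = 3) -> List[Dict[str, Any]]:
    """Rank dead-letter job types by failed_job_count, highest first."""
    rows = block_rows("dead-letter-jobs", payload)
    return sorted(rows, key=lambda row: row.get("failed_job_count", 0), reverse=True)[:limit]
```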

Response Fields Reference

Cluster Node Object

| Field | Type | Description |
| --- | --- | --- |
| name | string | Node identifier |
| status | string | Node status: RUNNING, DOWN, UNHEALTHY |
| response_time_ms | float | API response time in milliseconds |
| jvm_metrics.status | string | JVM health status |
| jvm_metrics.memory | object | Memory usage statistics |
| job_acquisition_success_rate | float | Percentage of successful job acquisitions |
| job_success_rate | float | Percentage of successfully executed jobs |
| workload_score | integer | Relative workload indicator |

Database Metrics Object

| Field | Type | Description |
| --- | --- | --- |
| connectivity | string | Connection status: OK, ERROR |
| latency_ms | float | Database query latency in milliseconds |
| active_connections | integer | Current active connections |
| max_connections | integer | Maximum allowed connections |
| connection_utilization | float | Percentage of connection pool in use |

Error Responses

All endpoints may return the following error responses:

401 Unauthorized

{
  "error": "Authentication required",
  "message": "Valid API token or session required"
}

403 Forbidden

{
  "error": "Permission denied",
  "message": "health_monitor_data permission required"
}

500 Internal Server Error

{
  "error": "Health check failed",
  "message": "Unable to connect to Camunda node",
  "details": "Connection timeout after 30s"
}
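
One way to handle these responses in a client is to map each status code to its own exception type, so callers can tell an authentication problem from a failed health check. This is a minimal sketch under that assumption; all class and function names are hypothetical, and only the status codes and body fields come from the examples above.

```python
from typing import Any, Dict, Optional


class HealthApiError(Exception):
    """Base class for error responses from the health endpoints (hypothetical)."""

    def __init__(self, status: int, error: str, message: str, details: Optional[str] = None):
        super().__init__(f"{status} {error}: {message}")
        self.status = status
        self.error = error
        self.message = message
        self.details = details


class AuthenticationRequired(HealthApiError):
    """401: no valid API token or session."""


class PermissionDenied(HealthApiError):
    """403: the caller lacks the health_monitor_data permission."""


class HealthCheckFailed(HealthApiError):
    """500: the health check itself failed."""


_ERRORS_BY_STATUS = {401: AuthenticationRequired, 403: PermissionDenied, 500: HealthCheckFailed}


def check_response(status: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Return the body for 200 OK; otherwise raise the matching exception."""
    if status == 200:
        return body
    error_cls = _ERRORS_BY_STATUS.get(status, HealthApiError)
    raise error_cls(
        status,
        body.get("error", "Unknown error"),
        body.get("message", ""),
        body.get("details"),
    )
```

Status codes not listed above fall back to the base `HealthApiError`, so unexpected responses still surface as errors.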

Usage Examples

Using cURL

# Get full health data
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/full

# Get specific metric group
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/metrics/system-health

# Get individual metric
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/individual/stuck-instances

Using Python

import requests

API_TOKEN = "your_api_token"
BASE_URL = "https://champa.example.com"

headers = {
    "Authorization": f"Bearer {API_TOKEN}"
}

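# Get full health data; health checks can take 5-15 seconds, so set an explicit timeout
response = requests.get(f"{BASE_URL}/health/api/full", headers=headers, timeout=30)
response.raise_for_status()  # Raise on 401, 403, or 500 instead of parsing an error body
health_data = response.json()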

print(f"Running nodes: {health_data['cluster_status']['running_nodes']}")
print(f"Open incidents: {health_data['totals']['incidents']}")

Using JavaScript

const API_TOKEN = 'your_api_token';
const BASE_URL = 'https://champa.example.com';

async function getHealthData() {
  const response = await fetch(`${BASE_URL}/health/api/full`, {
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`
    }
  });

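  if (!response.ok) {
    // Surface 401, 403, and 500 responses instead of treating the error body as health data
    throw new Error(`Health API request failed: ${response.status}`);
  }

  const data = await response.json();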
  console.log('Cluster status:', data.cluster_status);
  return data;
}

Best Practices

Performance Optimization

  1. Use Lazy-Loading Endpoints: For dashboard implementations, use the specific metric group endpoints (/api/metrics/*) instead of always fetching the full data set.

  2. Cache Responses: Health data is expensive to compute. Cache responses client-side for 10 to 30 seconds before requesting fresh data.

  3. Parallel Requests: When fetching multiple metric groups, make parallel requests instead of sequential ones.
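
The caching and parallel-request advice above can be sketched as follows. The `fetch` callable stands in for whatever HTTP client you use and is an assumption of this example, as are the names `TTLCache` and `fetch_groups` and the 15-second default, which is simply a value inside the 10 to 30 second range.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple


class TTLCache:
    """Cache responses per path for a fixed number of seconds (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, path: str) -> Any:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # Still fresh: skip the expensive request.
        value = self._fetch(path)
        self._entries[path] = (now, value)
        return value


def fetch_groups(
    fetch: Callable[[str], Any], groups: List[str], max_workers: int = 4
) -> Dict[str, Any]:
    """Request several metric groups concurrently and return results keyed by group."""
    paths = [f"/health/api/metrics/{group}" for group in groups]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `groups`.
        results = list(pool.map(fetch, paths))
    return dict(zip(groups, results))
```

Passing `cache.get` as the `fetch` argument of `fetch_groups` combines both techniques. This cache takes no lock, so two threads that miss on the same path at the same moment may both send a request.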

Monitoring Integration

  1. Set Appropriate Timeouts: Health checks can take 5-15 seconds depending on cluster size. Set HTTP timeouts accordingly.

  2. Handle Partial Failures: Some nodes may be down while others are healthy. Handle partial data gracefully.

  3. Alert Thresholds: Establish baseline metrics for your environment before setting alert thresholds.
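
The timeout and partial-failure points above can be sketched with the standard library alone. The 20-second default is an assumption chosen to sit a little above the 5-15 second range, and `fetch_full_health` and `split_nodes` are hypothetical names; the `opener` parameter exists only so the function can be exercised without a live server.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List, Tuple


def fetch_full_health(
    base_url: str,
    token: str,
    timeout: float = 20.0,  # Assumption: slightly above the documented 5-15 second range.
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Dict[str, Any]:
    """Fetch the full health payload with an explicit timeout."""
    request = urllib.request.Request(
        f"{base_url}/health/api/full",
        headers={"Authorization": f"Bearer {token}"},
    )
    # The timeout applies to blocking socket operations such as connecting and reading;
    # urlopen raises an exception when it is exceeded.
    with opener(request, timeout=timeout) as response:
        return json.load(response)


def split_nodes(payload: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Partition node names into (running, not running) without failing on missing fields."""
    running: List[str] = []
    degraded: List[str] = []
    for node in payload.get("cluster_nodes") or []:
        name = node.get("name", "<unnamed>")
        (running if node.get("status") == "RUNNING" else degraded).append(name)
    return running, degraded
```

Treating any status other than RUNNING as degraded keeps the data from healthy nodes usable while still flagging nodes reported as DOWN or UNHEALTHY.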