Health Monitoring API¶
The Health Monitoring API provides a rich set of endpoints for observing the real-time status of your Camunda cluster, JVMs, database, and overall application health.
Base Path: /health
Required Permission: health_monitor_data
Main Endpoints¶
Get Full Engine Health Data¶
This endpoint collects a comprehensive snapshot of the entire cluster's health. It populates the main Health Monitoring dashboard, but can also be called directly.

GET /api/full

Response: 200 OK
A complex JSON object containing several nested structures:
```json
{
  "cluster_nodes": [
    {
      "name": "instance1",
      "status": "RUNNING",
      "response_time_ms": 120.5,
      "jvm_metrics": {
        "status": "HEALTHY",
        "memory": {
          "heap_used_mb": 512,
          "heap_max_mb": 2048,
          "heap_utilization_pct": 25.0
        },
        "gc": {
          "minor_collections": 150,
          "major_collections": 5
        },
        "threads": {
          "current": 85
        }
      },
      "job_acquisition_success_rate": 99.5,
      "job_success_rate": 98.9,
      "workload_score": 1500
    }
  ],
  "cluster_status": {
    "total_nodes": 1,
    "running_nodes": 1,
    "engine_version": "7.20+",
    "issues": []
  },
  "totals": {
    "active_instances": 125,
    "user_tasks": 30,
    "incidents": 5
  },
  "db_metrics": {
    "connectivity": "OK",
    "latency_ms": 15,
    "active_connections": 10,
    "max_connections": 100,
    "connection_utilization": 10.0
  },
  "process_analytics": {
    "activity_hotspots": [
      {
        "activity_name": "Process Payment",
        "avg_duration_ms": 5500
      }
    ]
  },
  "timestamp": "2025-10-28T10:30:00Z"
}
```
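A dashboard client will usually scan this payload for problem nodes before rendering. A minimal sketch in Python (the helper name and the 90% heap threshold are illustrative choices, not part of the API):

```python
def find_unhealthy_nodes(health_data, heap_pct_threshold=90.0):
    """Return names of cluster nodes that are not RUNNING, or whose
    heap utilization meets or exceeds the threshold (illustrative cutoff)."""
    flagged = []
    for node in health_data.get("cluster_nodes", []):
        heap_pct = (node.get("jvm_metrics", {})
                        .get("memory", {})
                        .get("heap_utilization_pct", 0.0))
        if node.get("status") != "RUNNING" or heap_pct >= heap_pct_threshold:
            flagged.append(node.get("name"))
    return flagged
```

With the example response above this returns an empty list, since instance1 is RUNNING at 25% heap utilization.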
Lazy-Loading API Endpoints¶
These endpoints are used by the frontend to load dashboard sections on-demand.
Get Metric Group¶
Fetches a specific group of related metrics.

GET /api/metrics/{metric_group}
Path Parameters:
| Parameter | Description |
|---|---|
| metric_group | The name of the metric group to fetch. Available groups: process-analytics, system-health, quick-metrics, sla-metrics, throughput-metrics, jmx-metrics, database-metrics |
Response: 200 OK
A JSON object containing the data for the requested group. For example, /api/metrics/system-health:
```json
{
  "deployment_health": {
    "total_deployments": 50,
    "recent_deployments": 2
  },
  "dead_letter_jobs": [
    {
      "type_": "message",
      "failed_job_count": 3
    }
  ],
  "long_running_instances": [
    {
      "process_key": "order-process",
      "long_running_count": 5
    }
  ],
  "timestamp": "2025-10-28T10:30:00Z"
}
```
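Fetching every group means seven separate requests, one URL per group name. A sketch of the URL construction (BASE_URL and the bearer token are placeholders; the path mirrors the cURL examples later on this page):

```python
BASE_URL = "https://champa.example.com"  # placeholder host

METRIC_GROUPS = [
    "process-analytics", "system-health", "quick-metrics",
    "sla-metrics", "throughput-metrics", "jmx-metrics", "database-metrics",
]

def metric_group_url(base_url, group):
    """Build the lazy-loading URL for one metric group."""
    return f"{base_url}/health/api/metrics/{group}"

# Usage with requests (not executed here):
#   requests.get(metric_group_url(BASE_URL, "system-health"),
#                headers={"Authorization": "Bearer YOUR_API_TOKEN"},
#                timeout=20)
```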
Get Individual Metric¶
Fetches a single, specific metric that may be slow to compute.

GET /api/individual/{metric_name}
Path Parameters:
| Parameter | Description |
|---|---|
| metric_name | The name of the individual metric. Available metrics: stuck-instances, job-throughput, pending-messages, pending-signals |
Response: 200 OK
Get Dashboard Block¶
Fetches data for a specific visual block on the dashboard.

GET /api/block/{block_name}
Path Parameters:
| Parameter | Description |
|---|---|
| block_name | The name of the dashboard block. Available blocks: process-definitions, long-running, activity-hotspots, error-patterns, dead-letter-jobs, database-storage, slow-queries |
Response: 200 OK
A JSON object keyed by the block name, containing the relevant data. For example, /api/block/dead-letter-jobs:
```json
{
  "dead_letter_jobs": [
    {
      "type_": "async-continuation",
      "failed_job_count": 5,
      "error_messages": "java.lang.NullPointerException; ..."
    }
  ],
  "timestamp": "2025-10-28T10:30:00Z"
}
```
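Note that error_messages arrives as one joined string rather than an array. A small parsing helper, assuming the semicolon separator shown in the example above:

```python
def split_error_messages(dead_letter_job):
    """Split error_messages into a list of individual messages.
    The ';' separator is an assumption based on the example response,
    not a documented guarantee."""
    raw = dead_letter_job.get("error_messages", "")
    return [part.strip() for part in raw.split(";") if part.strip()]
```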
Response Fields Reference¶
Cluster Node Object¶
| Field | Type | Description |
|---|---|---|
| name | string | Node identifier |
| status | string | Node status: RUNNING, DOWN, UNHEALTHY |
| response_time_ms | float | API response time in milliseconds |
| jvm_metrics.status | string | JVM health status |
| jvm_metrics.memory | object | Memory usage statistics |
| job_acquisition_success_rate | float | Percentage of successful job acquisitions |
| job_success_rate | float | Percentage of successfully executed jobs |
| workload_score | integer | Relative workload indicator |
Database Metrics Object¶
| Field | Type | Description |
|---|---|---|
| connectivity | string | Connection status: OK, ERROR |
| latency_ms | float | Database query latency in milliseconds |
| active_connections | integer | Current active connections |
| max_connections | integer | Maximum allowed connections |
| connection_utilization | float | Percentage of the connection pool in use |
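connection_utilization is derivable from the two connection counts, so clients that only store active_connections and max_connections can compute it the same way. A sketch, guarding against a zero pool size:

```python
def connection_utilization(active_connections, max_connections):
    """Percentage of the connection pool in use, matching the
    connection_utilization field (e.g. 10 active of 100 max -> 10.0)."""
    if max_connections <= 0:
        return 0.0
    return round(100.0 * active_connections / max_connections, 1)
```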
Error Responses¶
All endpoints may return the following error responses:
401 Unauthorized¶
Returned when the request lacks a valid authentication token.
403 Forbidden¶
Returned when the authenticated user lacks the health_monitor_data permission.
500 Internal Server Error¶
```json
{
  "error": "Health check failed",
  "message": "Unable to connect to Camunda node",
  "details": "Connection timeout after 30s"
}
```
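Clients typically branch on these three status codes. A hedged sketch of the dispatch (the outcome labels are illustrative, not part of the API):

```python
def classify_health_response(status_code, body):
    """Map a health endpoint response to a coarse outcome. The labels are
    illustrative; adapt them to your client's retry/alerting logic."""
    if status_code == 200:
        return ("ok", body)
    if status_code in (401, 403):
        # Check the token and the health_monitor_data permission.
        return ("auth_error", body)
    if status_code == 500:
        return ("server_error", body.get("message", "Health check failed"))
    return ("unexpected", body)
```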
Usage Examples¶
Using cURL¶
```bash
# Get full health data
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/full

# Get a specific metric group
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/metrics/system-health

# Get an individual metric
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/individual/stuck-instances
```
Using Python¶
```python
import requests

API_TOKEN = "your_api_token"
BASE_URL = "https://champa.example.com"

headers = {
    "Authorization": f"Bearer {API_TOKEN}"
}

# Get full health data
response = requests.get(f"{BASE_URL}/health/api/full", headers=headers)
health_data = response.json()

print(f"Running nodes: {health_data['cluster_status']['running_nodes']}")
print(f"Open incidents: {health_data['totals']['incidents']}")
```
Using JavaScript¶
```javascript
const API_TOKEN = 'your_api_token';
const BASE_URL = 'https://champa.example.com';

async function getHealthData() {
  const response = await fetch(`${BASE_URL}/health/api/full`, {
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`
    }
  });
  const data = await response.json();
  console.log('Cluster status:', data.cluster_status);
  return data;
}
```
Best Practices¶
Performance Optimization¶
- Use Lazy-Loading Endpoints: For dashboard implementations, use the specific metric group endpoints (/api/metrics/*) instead of always fetching the full data set.
- Cache Responses: Health data is expensive to compute. Cache responses client-side for at least 10-30 seconds.
- Parallel Requests: When fetching multiple metric groups, make parallel requests instead of sequential ones.
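The parallel-request advice can be sketched with a thread pool; `fetch` here is any callable mapping a group name to its parsed JSON (for example, a requests-based getter), so the helper itself makes no network calls:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_groups_parallel(fetch, groups, max_workers=4):
    """Fetch several metric groups concurrently instead of sequentially.
    `fetch` is supplied by the caller; max_workers is an illustrative default."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, groups))
    return dict(zip(groups, results))
```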
Monitoring Integration¶
- Set Appropriate Timeouts: Health checks can take 5-15 seconds depending on cluster size. Set HTTP timeouts accordingly.
- Handle Partial Failures: Some nodes may be down while others are healthy. Handle partial data gracefully.
- Alert Thresholds: Establish baseline metrics for your environment before setting alert thresholds.
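The caching advice above (10-30 seconds client-side) can be wrapped around any fetcher in a few lines; the class name and TTL default below are illustrative, not part of the API:

```python
import time

class HealthCache:
    """Cache one expensive health fetch for a short TTL so repeated
    dashboard refreshes do not re-run the full cluster check."""

    def __init__(self, fetch, ttl_seconds=30.0):
        self._fetch = fetch          # callable returning parsed health JSON
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = float("-inf")

    def get(self):
        now = time.monotonic()
        if now - self._fetched_at > self._ttl:
            self._value = self._fetch()   # may take 5-15s on large clusters
            self._fetched_at = now
        return self._value
```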