Health Monitoring API¶
The Health Monitoring API provides a rich set of endpoints for observing the real-time status of your Camunda cluster, JVMs, database, and overall application health.
Base Path: /health
Required Permission: health_monitor_data
Main Endpoints¶
Get Full Engine Health Data¶
This endpoint collects a comprehensive snapshot of the entire cluster's health. It populates the main Health Monitoring dashboard, but can also be called directly.

GET /api/full

Response: 200 OK
A complex JSON object containing several nested structures:
```json
{
  "cluster_nodes": [
    {
      "name": "instance1",
      "status": "RUNNING",
      "response_time_ms": 120.5,
      "jvm_metrics": {
        "status": "HEALTHY",
        "memory": {
          "heap_used_mb": 512,
          "heap_max_mb": 2048,
          "heap_utilization_pct": 25.0
        },
        "gc": {
          "minor_collections": 150,
          "major_collections": 5
        },
        "threads": {
          "current": 85
        }
      },
      "job_acquisition_success_rate": 99.5,
      "job_success_rate": 98.9,
      "workload_score": 1500
    }
  ],
  "cluster_status": {
    "total_nodes": 1,
    "running_nodes": 1,
    "engine_version": "7.20+",
    "issues": []
  },
  "totals": {
    "active_instances": 125,
    "user_tasks": 30,
    "incidents": 5
  },
  "db_metrics": {
    "connectivity": "OK",
    "latency_ms": 15,
    "active_connections": 10,
    "max_connections": 100,
    "connection_utilization": 10.0
  },
  "process_analytics": {
    "activity_hotspots": [
      {
        "activity_name": "Process Payment",
        "avg_duration_ms": 5500
      }
    ]
  },
  "timestamp": "2025-10-28T10:30:00Z"
}
```
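A dashboard client will usually scan this payload for problem nodes before rendering. A minimal sketch in Python (the helper name and the 90% heap threshold are illustrative choices, not part of the API):

```python
def find_unhealthy_nodes(health_data, heap_pct_threshold=90.0):
    """Return names of cluster nodes that are not RUNNING, or whose
    heap utilization meets or exceeds the threshold (illustrative cutoff)."""
    flagged = []
    for node in health_data.get("cluster_nodes", []):
        heap_pct = (node.get("jvm_metrics", {})
                        .get("memory", {})
                        .get("heap_utilization_pct", 0.0))
        if node.get("status") != "RUNNING" or heap_pct >= heap_pct_threshold:
            flagged.append(node.get("name"))
    return flagged
```

With the example response above this returns an empty list, since instance1 is RUNNING at 25% heap utilization.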
Lazy-Loading API Endpoints¶
These endpoints are used by the frontend to load dashboard sections on-demand.
Get Metric Group¶
Fetches a specific group of related metrics.

GET /api/metrics/{metric_group}
Path Parameters:
| Parameter | Description |
|---|---|
| metric_group | The name of the metric group to fetch. Available groups: process-analytics, system-health, quick-metrics, sla-metrics, throughput-metrics, jmx-metrics, database-metrics |
Response: 200 OK
A JSON object containing the data for the requested group. For example, /api/metrics/system-health:
```json
{
  "deployment_health": {
    "total_deployments": 50,
    "recent_deployments": 2
  },
  "dead_letter_jobs": [
    {
      "type_": "message",
      "failed_job_count": 3
    }
  ],
  "long_running_instances": [
    {
      "process_key": "order-process",
      "long_running_count": 5
    }
  ],
  "timestamp": "2025-10-28T10:30:00Z"
}
```
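Fetching every group means seven separate requests, one URL per group name. A sketch of the URL construction (BASE_URL and the bearer token are placeholders; the path mirrors the cURL examples later on this page):

```python
BASE_URL = "https://champa.example.com"  # placeholder host

METRIC_GROUPS = [
    "process-analytics", "system-health", "quick-metrics",
    "sla-metrics", "throughput-metrics", "jmx-metrics", "database-metrics",
]

def metric_group_url(base_url, group):
    """Build the lazy-loading URL for one metric group."""
    return f"{base_url}/health/api/metrics/{group}"

# Usage with requests (not executed here):
#   requests.get(metric_group_url(BASE_URL, "system-health"),
#                headers={"Authorization": "Bearer YOUR_API_TOKEN"},
#                timeout=20)
```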
Get Individual Metric¶
Fetches a single, specific metric that may be slow to compute.

GET /api/individual/{metric_name}
Path Parameters:
| Parameter | Description |
|---|---|
| metric_name | The name of the individual metric. Available metrics: stuck-instances, job-throughput, pending-messages, pending-signals |
Response: 200 OK
Get Dashboard Block¶
Fetches data for a specific visual block on the dashboard.

GET /api/block/{block_name}
Path Parameters:
| Parameter | Description |
|---|---|
| block_name | The name of the dashboard block. Available blocks: process-definitions, long-running, activity-hotspots, error-patterns, dead-letter-jobs, database-storage, slow-queries |
Response: 200 OK
A JSON object keyed by the block name, containing the relevant data. For example, /api/block/dead-letter-jobs:
```json
{
  "dead_letter_jobs": [
    {
      "type_": "async-continuation",
      "failed_job_count": 5,
      "error_messages": "java.lang.NullPointerException; ..."
    }
  ],
  "timestamp": "2025-10-28T10:30:00Z"
}
```
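Note that error_messages arrives as one joined string rather than an array. A small parsing helper, assuming the semicolon separator shown in the example above:

```python
def split_error_messages(dead_letter_job):
    """Split error_messages into a list of individual messages.
    The ';' separator is an assumption based on the example response,
    not a documented guarantee."""
    raw = dead_letter_job.get("error_messages", "")
    return [part.strip() for part in raw.split(";") if part.strip()]
```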
Response Fields Reference¶
Cluster Node Object¶
| Field | Type | Description |
|---|---|---|
| name | string | Node identifier |
| status | string | Node status: RUNNING, DOWN, UNHEALTHY |
| response_time_ms | float | API response time in milliseconds |
| jvm_metrics.status | string | JVM health status |
| jvm_metrics.memory | object | Memory usage statistics |
| job_acquisition_success_rate | float | Percentage of successful job acquisitions |
| job_success_rate | float | Percentage of successfully executed jobs |
| workload_score | integer | Relative workload indicator |
Database Metrics Object¶
| Field | Type | Description |
|---|---|---|
| connectivity | string | Connection status: OK, ERROR |
| latency_ms | float | Database query latency in milliseconds |
| active_connections | integer | Current active connections |
| max_connections | integer | Maximum allowed connections |
| connection_utilization | float | Percentage of the connection pool in use |
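connection_utilization is derivable from the two connection counts, so clients that only store active_connections and max_connections can compute it the same way. A sketch, guarding against a zero pool size:

```python
def connection_utilization(active_connections, max_connections):
    """Percentage of the connection pool in use, matching the
    connection_utilization field (e.g. 10 active of 100 max -> 10.0)."""
    if max_connections <= 0:
        return 0.0
    return round(100.0 * active_connections / max_connections, 1)
```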
Error Responses¶
All endpoints may return the following error responses:
401 Unauthorized¶
Returned when the request lacks a valid authentication token.
403 Forbidden¶
Returned when the authenticated user lacks the health_monitor_data permission.
500 Internal Server Error¶
```json
{
  "error": "Health check failed",
  "message": "Unable to connect to Camunda node",
  "details": "Connection timeout after 30s"
}
```
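Clients typically branch on these three status codes. A hedged sketch of the dispatch (the outcome labels are illustrative, not part of the API):

```python
def classify_health_response(status_code, body):
    """Map a health endpoint response to a coarse outcome. The labels are
    illustrative; adapt them to your client's retry/alerting logic."""
    if status_code == 200:
        return ("ok", body)
    if status_code in (401, 403):
        # Check the token and the health_monitor_data permission.
        return ("auth_error", body)
    if status_code == 500:
        return ("server_error", body.get("message", "Health check failed"))
    return ("unexpected", body)
```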
Usage Examples¶
Using cURL¶
```bash
# Get full health data
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/full

# Get a specific metric group
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/metrics/system-health

# Get an individual metric
curl -H "Authorization: Bearer YOUR_API_TOKEN" \
  https://champa.example.com/health/api/individual/stuck-instances
```
Using Python¶
```python
import requests

API_TOKEN = "your_api_token"
BASE_URL = "https://champa.example.com"

headers = {
    "Authorization": f"Bearer {API_TOKEN}"
}

# Get full health data
response = requests.get(f"{BASE_URL}/health/api/full", headers=headers)
health_data = response.json()

print(f"Running nodes: {health_data['cluster_status']['running_nodes']}")
print(f"Open incidents: {health_data['totals']['incidents']}")
```
Using JavaScript¶
```javascript
const API_TOKEN = 'your_api_token';
const BASE_URL = 'https://champa.example.com';

async function getHealthData() {
  const response = await fetch(`${BASE_URL}/health/api/full`, {
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`
    }
  });
  const data = await response.json();
  console.log('Cluster status:', data.cluster_status);
  return data;
}
```
Best Practices¶
Performance Optimization¶
- Use Lazy-Loading Endpoints: For dashboard implementations, use the specific metric group endpoints (/api/metrics/*) instead of always fetching the full data set.
- Cache Responses: Health data is expensive to compute. Cache responses client-side for at least 10-30 seconds.
- Parallel Requests: When fetching multiple metric groups, make parallel requests instead of sequential ones.
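The parallel-request advice can be sketched with a thread pool; `fetch` here is any callable mapping a group name to its parsed JSON (for example, a requests-based getter), so the helper itself makes no network calls:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_groups_parallel(fetch, groups, max_workers=4):
    """Fetch several metric groups concurrently instead of sequentially.
    `fetch` is supplied by the caller; max_workers is an illustrative default."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, groups))
    return dict(zip(groups, results))
```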
Monitoring Integration¶
- Set Appropriate Timeouts: Health checks can take 5-15 seconds depending on cluster size. Set HTTP timeouts accordingly.
- Handle Partial Failures: Some nodes may be down while others are healthy. Handle partial data gracefully.
- Alert Thresholds: Establish baseline metrics for your environment before setting alert thresholds.
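The caching advice above (10-30 seconds client-side) can be wrapped around any fetcher in a few lines; the class name and TTL default below are illustrative, not part of the API:

```python
import time

class HealthCache:
    """Cache one expensive health fetch for a short TTL so repeated
    dashboard refreshes do not re-run the full cluster check."""

    def __init__(self, fetch, ttl_seconds=30.0):
        self._fetch = fetch          # callable returning parsed health JSON
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = float("-inf")

    def get(self):
        now = time.monotonic()
        if now - self._fetched_at > self._ttl:
            self._value = self._fetch()   # may take 5-15s on large clusters
            self._fetched_at = now
        return self._value
```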