Version: Next

Dual-region operational procedure

info

This procedure has been updated in the Camunda 8.6 release. The procedure used in Camunda 8.5 has been deprecated, and compatibility will be removed in the 8.7 release.

Introduction

This operational blueprint procedure is a step-by-step guide on how to restore operations in the case of a total region failure. It explains how to temporarily restore functionality in the surviving region and how to ultimately do a full recovery to restore the dual-region setup. The operational procedure builds on top of the dual-region AWS setup guidance, but is generally applicable for any dual-region setup.

Before proceeding with the operational procedure, thoroughly review and understand the contents of the dual-region concept page. This page outlines various limitations and requirements pertinent to the procedure, which are crucial for successful execution.

Disclaimer

caution

Running a dual-region configuration requires users to detect and manage any regional failures, and implement the operational procedure for failover and failback that matches their environment.

Prerequisites

  • A dual-region Camunda 8 setup installed in two different regions, preferably derived from our AWS dual-region concept.
  • (deprecated) zbctl to interact with the Zeebe cluster.
  • cURL or similar to interact with the REST API.

Terminology

  • Surviving region
    • A surviving region refers to a region within a dual-region setup that remains operational and unaffected by a failure or disaster that affects other regions.
  • Lost region
    • A lost region is a region within a dual-region setup that becomes unavailable or unusable due to a failure or disaster.
  • Recreated region
    • A recreated region is a region within a dual-region setup that was previously lost but has been restored or recreated to resume its operational state.
    • We assume this region does not contain Camunda 8 deployments or related persistent volumes. Ensure this is the case before executing the failover procedure.

Procedure

We use the same procedure to handle the loss of both active and passive regions. For clarity, this section focuses on the scenario where the passive region is lost while the active region remains operational. The same procedure will be valid in case of active region loss.

Temporary Loss Scenario: If a region loss is temporary — such as from transient network issues — Zeebe can handle this situation without initiating recovery procedures, provided there is sufficient free space on the persistent disk. However, processing may halt due to a loss of quorum during this time.

Key steps to handle passive region loss

  1. Traffic rerouting: Use DNS to reroute traffic to the surviving active region. (Details on managing DNS rerouting depend on your specific DNS setup and are not covered in this guide.)
  2. Failover phase: Temporarily restores Camunda 8 functionality by removing the lost brokers and handling the export to the unreachable Elasticsearch instance.
  3. Failback phase: Fully restores the failed region to its original functionality. This phase requires the region to be ready for the redeployment of Camunda 8.

caution

For the failback procedure, the recreated region must not include any active Camunda 8 deployments or residual persistent volumes associated with Camunda 8 or its Elasticsearch instance. It is essential to initiate a clean deployment to prevent data replication and state conflicts.

Prerequisites

The following procedures assume that the dual-region deployment has been created using the AWS setup guide. We assume you have your own copy of the c8-multi-region repository and have already adjusted camunda-values.yml to match your setup.

Follow the dual-region cluster deployment guide to install Camunda 8, configure a dual-region setup, and have the general environment variables (see environment prerequisites) already set up.

Rather than describing the loss of Region 0 and the loss of Region 1 separately, we have generalized the commands. They require a one-time setup of environment variables so that you can execute the procedure in terms of the surviving region and the region that needs to be recreated. Depending on which region you lost, export the matching environment variables to your terminal. The values below apply when Region 0 was lost and Region 1 survived; if Region 1 was lost, swap the 0 and 1 suffixes.

export CLUSTER_SURVIVING=$CLUSTER_1
export CLUSTER_RECREATED=$CLUSTER_0
export CAMUNDA_NAMESPACE_SURVIVING=$CAMUNDA_NAMESPACE_1
export CAMUNDA_NAMESPACE_RECREATED=$CAMUNDA_NAMESPACE_0
export REGION_SURVIVING=region1
export REGION_RECREATED=region0
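
The one-time setup can also be scripted. The following is a minimal sketch, assuming CLUSTER_0, CLUSTER_1, CAMUNDA_NAMESPACE_0, and CAMUNDA_NAMESPACE_1 are already set as described in the environment prerequisites. The function name select_regions is a hypothetical helper and is not part of the c8-multi-region repository.

```shell
# Hypothetical helper (not part of the c8-multi-region repository).
# Usage: select_regions <index of the lost region: 0 or 1>
select_regions() {
  case "$1" in
    0) _s=1; _r=0 ;;   # Region 0 lost, Region 1 survives
    1) _s=0; _r=1 ;;   # Region 1 lost, Region 0 survives
    *) echo "usage: select_regions <0|1>" >&2; return 1 ;;
  esac
  # Resolve CLUSTER_<n> and CAMUNDA_NAMESPACE_<n> by index.
  eval "export CLUSTER_SURVIVING=\"\${CLUSTER_$_s}\""
  eval "export CLUSTER_RECREATED=\"\${CLUSTER_$_r}\""
  eval "export CAMUNDA_NAMESPACE_SURVIVING=\"\${CAMUNDA_NAMESPACE_$_s}\""
  eval "export CAMUNDA_NAMESPACE_RECREATED=\"\${CAMUNDA_NAMESPACE_$_r}\""
  export REGION_SURVIVING="region$_s"
  export REGION_RECREATED="region$_r"
}
```

Calling select_regions 0 should yield the same values as the exports shown above; select_regions 1 yields the mirrored values.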

Failover phase

The Failover phase outlines steps for removing lost brokers, redistributing load, disabling Elasticsearch export to a failed region, and restoring user interaction with Camunda 8 to ensure smooth recovery and continued functionality.

Remove lost brokers from Zeebe cluster in the surviving region

Current state:

You have confirmed that a region is fully lost and want to start the temporary recovery.

One of the regions is lost, meaning Zeebe:

  • Has not lost any data, thanks to Zeebe data replication.
  • Is unable to process new requests due to losing the quorum.
  • Stops exporting new data to Elasticsearch in the lost region.
  • Stops exporting new data to Elasticsearch in the surviving region.

Desired state:

The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention.

Procedure

Start by creating a port-forward from your local machine to the Zeebe Gateway in the surviving region so that you can interact with the Gateway.

The following alternatives to port-forwarding are possible:

  • If the Zeebe Gateway is exposed to the outside of the Kubernetes cluster, you can skip port-forwarding and use the URL directly
  • exec into an existing pod (such as Elasticsearch), and execute curl commands from inside of the pod
  • run an Ubuntu pod in the cluster to execute curl commands from inside the Kubernetes cluster

In our example, we use port-forwarding to localhost, but any of the alternatives above also works.

  1. Use the REST API to retrieve the list of the remaining brokers
kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 8080:8080 -n $CAMUNDA_NAMESPACE_SURVIVING

curl -L -X GET 'http://localhost:8080/v2/topology' \
-H 'Accept: application/json'
Example output
{
"brokers": [
{
"nodeId": 0,
"host": "camunda-zeebe-0.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 6,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 7,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 8,
"role": "follower",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 2,
"host": "camunda-zeebe-1.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 2,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 3,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 8,
"role": "leader",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 4,
"host": "camunda-zeebe-2.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 2,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 3,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 4,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 5,
"role": "follower",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 6,
"host": "camunda-zeebe-3.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 4,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 5,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 6,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 7,
"role": "leader",
"health": "healthy"
}
],
"version": "8.6.0"
}
],
"clusterSize": 8,
"partitionsCount": 8,
"replicationFactor": 4,
"gatewayVersion": "8.6.0"
}
  2. Port-forward the service of the Zeebe Gateway to access the management REST API
kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING
  3. Based on the Cluster Scaling APIs, send a request to the Zeebe Gateway to redistribute the load to the remaining brokers, thereby removing the lost brokers. In our example, we have lost region 1 and with it the odd-numbered brokers, so we redistribute to the remaining even-numbered brokers.
curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?force=true' -H 'Content-Type: application/json' -d '["0", "2", "4", "6"]'
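
The broker list in the request body can be derived rather than typed by hand. The sketch below assumes that region 0 hosts the even-numbered brokers and region 1 the odd-numbered ones, which is inferred from the example above and may not hold for a customized setup. The function name surviving_brokers is hypothetical.

```shell
# Hypothetical helper: print the JSON array of broker IDs that remain
# after losing one region.
# Assumption: region 0 hosts even broker IDs, region 1 hosts odd broker IDs.
# Usage: surviving_brokers <clusterSize> <index of the lost region: 0 or 1>
surviving_brokers() (
  cluster_size="$1"
  lost_region="$2"
  # Survivors have the opposite parity of the lost region's brokers.
  i=$(( (lost_region + 1) % 2 ))
  out=""
  while [ "$i" -lt "$cluster_size" ]; do
    out="${out:+$out, }\"$i\""
    i=$((i + 2))
  done
  printf '[%s]\n' "$out"
)

# Example (requires the 9600 port-forward from the previous step):
#   curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?force=true' \
#     -H 'Content-Type: application/json' -d "$(surviving_brokers 8 1)"
```

For a clusterSize of 8 with region 1 lost, this prints the same list used in the request above. Compare the result with the topology output before sending it, because the request forcefully removes every broker that is not listed.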

Verification

Port-forwarding the Zeebe Gateway via kubectl and printing the topology should reveal that the cluster size has decreased to 4, partitions have been redistributed over the remaining brokers, and new leaders have been elected.

kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 8080:8080 -n $CAMUNDA_NAMESPACE_SURVIVING

curl -L -X GET 'http://localhost:8080/v2/topology' \
-H 'Accept: application/json'
Example output
{
"brokers": [
{
"nodeId": 0,
"host": "camunda-zeebe-0.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 6,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 7,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 8,
"role": "follower",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 2,
"host": "camunda-zeebe-1.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 2,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 3,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 8,
"role": "leader",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 4,
"host": "camunda-zeebe-2.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 2,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 3,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 4,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 5,
"role": "follower",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 6,
"host": "camunda-zeebe-3.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 4,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 5,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 6,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 7,
"role": "leader",
"health": "healthy"
}
],
"version": "8.6.0"
}
],
"clusterSize": 4,
"partitionsCount": 8,
"replicationFactor": 2,
"gatewayVersion": "8.6.0"
}
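
To compare the remaining brokers against the list sent to the scaling API, the node IDs can be pulled out of the topology response. This is a rough sketch that relies on the "nodeId": N formatting seen in the example output; a JSON-aware tool such as jq is more robust if it is available. The function name list_node_ids is hypothetical.

```shell
# Hypothetical helper: read a topology response on stdin and print the
# broker node IDs on one line, separated by spaces.
# Assumption: the response contains fields formatted like "nodeId": 4.
list_node_ids() {
  grep -o '"nodeId": *[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$' \
    | tr '\n' ' ' \
    | sed 's/ $//'
}

# Example (requires the 8080 port-forward shown above):
#   curl -s -L 'http://localhost:8080/v2/topology' -H 'Accept: application/json' | list_node_ids
```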

You can also use the Zeebe Gateway's management REST API to confirm that the scaling operation has completed. For better output readability, we use jq.

kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING
curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange
Example output
{
"id": 2,
"status": "COMPLETED",
"startedAt": "2024-08-23T11:33:08.355681311Z",
"completedAt": "2024-08-23T11:33:09.170531963Z"
}
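
Because the change may still be in progress when you first query it, the check can be repeated until the status reads COMPLETED. The following is a minimal polling sketch; wait_for_status and last_change_status are hypothetical helper names, and POLL_INTERVAL is an assumed variable for the delay in seconds between attempts.

```shell
# Hypothetical helper: run a command repeatedly until its output equals
# the expected value, or give up after a number of attempts.
# Usage: wait_for_status <expected> <max_attempts> <command> [args...]
wait_for_status() (
  expected="$1"
  attempts="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      exit 0
    fi
    n=$((n + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  exit 1
)

# Reads the status of the last change; requires the 9600 port-forward and jq.
last_change_status() {
  curl -s 'http://localhost:9600/actuator/cluster' | jq -r '.lastChange.status'
}

# Example: wait_for_status COMPLETED 30 last_change_status
```

A non-zero exit code means the status was not reached within the given number of attempts, in which case inspect the full lastChange output manually.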

Failback phase

Deploy Camunda 8 in the newly created region

Current state and desired state

  • Camunda 8
    • Current state: A standalone region with a fully functional Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch.
    • Desired state: Restore dual-region functionality by deploying Camunda 8 (Zeebe and Elasticsearch) to the newly restored region.
  • Operate and Tasklist
    • Current state: Operate and Tasklist are operational in the standalone region.
    • Desired state: Operate and Tasklist need to stay disabled to avoid interference during the database backup and restore process.

Procedure

This procedure requires your Helm values file, camunda-values.yml, in aws/dual-region/kubernetes, which was used to deploy the dual-region Camunda clusters.

Ensure that the values for ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL and ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL correctly point to their respective regions. The placeholder in ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS should contain the Zeebe endpoints for both regions, as produced by the aws/dual-region/scripts/generate_zeebe_helm_values.sh script.

Additionally, execute the following Helm command to disable Operate and Tasklist. These components will only be enabled at the end of the region recovery. Keeping them disabled in the newly created region is necessary to avoid data duplication by their Elasticsearch importers.

From the terminal context of aws/dual-region/kubernetes execute:

helm install $HELM_RELEASE_NAME camunda/camunda-platform \
--version $HELM_CHART_VERSION \
--kube-context $CLUSTER_RECREATED \
--namespace $CAMUNDA_NAMESPACE_RECREATED \
-f camunda-values.yml \
-f $REGION_RECREATED/camunda-values.yml \
--set operate.enabled=false \
--set tasklist.enabled=false

Verification

The following command will show the pods deployed in the newly created region.

kubectl --context $CLUSTER_RECREATED get pods -n $CAMUNDA_NAMESPACE_RECREATED

Half of the configured clusterSize is provisioned as Zeebe brokers in the newly created region.

For example, in the case of clusterSize: 8, four Zeebe brokers are provisioned in the newly created region.
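
The expected number of broker pods can be checked with simple arithmetic. The sketch below uses hypothetical helper names. The label selector app.kubernetes.io/component=zeebe-broker is an assumption about how the Helm chart labels broker pods, so verify it against your deployment before relying on the count.

```shell
# Hypothetical helper: number of Zeebe brokers expected per region.
# Usage: expected_brokers_per_region <clusterSize>
expected_brokers_per_region() {
  echo $(( $1 / 2 ))
}

# Hypothetical helper: count the broker pods in the recreated region.
# Assumption: broker pods carry the label app.kubernetes.io/component=zeebe-broker.
count_broker_pods() {
  kubectl --context "$CLUSTER_RECREATED" get pods -n "$CAMUNDA_NAMESPACE_RECREATED" \
    -l app.kubernetes.io/component=zeebe-broker --no-headers | wc -l
}

# Example:
#   [ "$(count_broker_pods)" -eq "$(expected_brokers_per_region 8)" ] && echo "broker count matches"
```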

danger

It is expected that the Zeebe broker pods will not reach the "Ready" state since they are not yet part of a Zeebe cluster and, therefore, not considered healthy by the readiness probe.

Port-forwarding the Zeebe Gateway via kubectl and printing the topology should reveal that the new Zeebe brokers are recognized but are not yet full members of the Zeebe cluster.

kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 8080:8080 -n $CAMUNDA_NAMESPACE_SURVIVING

curl -L -X GET 'http://localhost:8080/v2/topology' \
-H 'Accept: application/json'
Example output
{
"brokers": [
{
"nodeId": 0,
"host": "camunda-zeebe-0.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 6,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 7,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 8,
"role": "follower",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 1,
"host": "camunda-zeebe-0.camunda-zeebe.camunda-paris",
"port": 26501,
"partitions": [],
"version": "8.6.0"
},
{
"nodeId": 2,
"host": "camunda-zeebe-1.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 1,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 2,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 3,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 8,
"role": "leader",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 3,
"host": "camunda-zeebe-1.camunda-zeebe.camunda-paris",
"port": 26501,
"partitions": [],
"version": "8.6.0"
},
{
"nodeId": 4,
"host": "camunda-zeebe-2.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 2,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 3,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 4,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 5,
"role": "follower",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 5,
"host": "camunda-zeebe-2.camunda-zeebe.camunda-paris",
"port": 26501,
"partitions": [],
"version": "8.6.0"
},
{
"nodeId": 6,
"host": "camunda-zeebe-3.camunda-zeebe.camunda-london",
"port": 26501,
"partitions": [
{
"partitionId": 4,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 5,
"role": "leader",
"health": "healthy"
},
{
"partitionId": 6,
"role": "follower",
"health": "healthy"
},
{
"partitionId": 7,
"role": "leader",
"health": "healthy"
}
],
"version": "8.6.0"
},
{
"nodeId": 7,
"host": "camunda-zeebe-3.camunda-zeebe.camunda-paris",
"port": 26501,
"partitions": [],
"version": "8.6.0"
}
],
"clusterSize": 4,
"partitionsCount": 8,
"replicationFactor": 2,
"gatewayVersion": "8.6.0"
}

In conclusion, adhering to this updated operational procedure ensures a structured and efficient recovery process for maintaining operational continuity in dual-region deployments. Please remain cautious in managing dual-region environments and be prepared to implement the outlined steps for successful failover and failback.