Version: 8.5

Dual-region operational procedure

Introduction

This operational procedure is a step-by-step guide on how to restore operations in the case of a total region failure. It explains how to temporarily restore functionality in the surviving region, and how to ultimately do a full recovery to restore the dual-region setup. The operational procedure builds on top of the dual-region AWS setup guide, but is generally applicable for any dual-region setup.

Before proceeding with the operational procedure, thoroughly review and understand the contents of the dual-region concept page. This page outlines various limitations and requirements pertinent to the procedure, which are crucial for successful execution.

Disclaimer

danger
  • Customers must develop and test the operational procedure described below in non-production environments based on the framework steps outlined by Camunda before applying them in production setups.
  • Before advancing to production go-live, validating these procedures with Camunda is strongly recommended.
  • Customers are solely responsible for detecting any regional failures and implementing the necessary operational procedure described.

Prerequisites

Terminology

  • Surviving region
    • A surviving region refers to a region within a dual-region setup that remains operational and unaffected by a failure or disaster that affects other regions.
  • Lost region
    • A lost region refers to a region within a dual-region setup that becomes unavailable or unusable due to a failure or disaster.
  • Recreated region
    • A recreated region refers to a region within a dual-region setup that was previously lost but has been restored or recreated to resume its operational state.
    • We assume this region contains no Camunda 8 deployments or related persistent volumes. Ensure this is the case before executing the failover procedure.

Procedure

We don't differentiate between the active and passive regions, as the procedure is the same for either loss. For illustration, we focus on losing the passive region while the active region remains available.

You'll need to reroute traffic to the surviving region via DNS (the details depend on your DNS setup and are not covered in this guide).

After you've identified a region loss and before beginning the region restoration procedure, ensure the lost region cannot reconnect; an unexpected reconnection would hinder a successful recovery during failover and failback execution.

If the region is only lost temporarily (for example, due to network hiccups), Zeebe can survive the outage but will stop processing due to the loss of quorum and will ultimately fill up its persistent disks, resulting in data loss once the volumes run out of space.

The failover phase temporarily restores Camunda 8 functionality by redeploying it within the surviving region to resume Zeebe engine operation. Until this phase is complete, Zeebe cannot process or export new data: it first needs to regain quorum, and the configured Elasticsearch endpoints for its exporters need to become reachable again, both of which are outcomes of the failover procedure.

The failback phase completely restores the failed region to its full functionality. It requires the lost region to be available again for the redeployment of Camunda 8.

danger

For the failback procedure, your recreated region cannot contain any active Camunda 8 deployments or leftover persistent volumes related to Camunda 8 or its Elasticsearch instance. You must start from a clean slate and not bring old data from the lost region, as states may have diverged.
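One way to verify this is to check the recreated region for leftover Helm releases, pods, or persistent volume claims before starting the failback. The following is a minimal sketch using the $CLUSTER_RECREATED and $CAMUNDA_NAMESPACE_RECREATED environment variables defined below under environment prerequisites; none of the commands should return Camunda-related resources:

# check for leftover Helm releases and workloads in the recreated region
helm list --kube-context $CLUSTER_RECREATED -n $CAMUNDA_NAMESPACE_RECREATED
kubectl --context $CLUSTER_RECREATED get pods,pvc -n $CAMUNDA_NAMESPACE_RECREATED
# persistent volumes are cluster-scoped, so list them separately
kubectl --context $CLUSTER_RECREATED get pv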

The following procedures build on top of the AWS setup guide for deploying Camunda 8 to two Kubernetes clusters in different regions. We assume you have your own copy of the c8-multi-region repository and have previously adjusted camunda-values.yml to match your setup.

Ensure you have followed deploy Camunda 8 to the clusters to have Camunda 8 installed and configured for a dual-region setup.

Environment prerequisites

Ensure you have followed environment prerequisites to have the general environment variables set up already.

To avoid spelling out both possible scenarios (losing either region 0 or region 1), the commands are generalized. A one-time setup of environment variables lets you execute the procedure in terms of the surviving region and the region to be recreated.

Depending on which region you lost, export the corresponding environment variables to your terminal for a smoother procedure execution. The following example assumes region 0 was lost and region 1 survived:

export CLUSTER_SURVIVING=$CLUSTER_1
export CLUSTER_RECREATED=$CLUSTER_0
export CAMUNDA_NAMESPACE_SURVIVING=$CAMUNDA_NAMESPACE_1
export CAMUNDA_NAMESPACE_FAILOVER=$CAMUNDA_NAMESPACE_1_FAILOVER
export CAMUNDA_NAMESPACE_RECREATED=$CAMUNDA_NAMESPACE_0
export REGION_SURVIVING=region1
export REGION_RECREATED=region0
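
If instead region 1 was lost and region 0 survived, the exports mirror the above. This sketch assumes a $CAMUNDA_NAMESPACE_0_FAILOVER variable was defined analogously in the environment prerequisites:

export CLUSTER_SURVIVING=$CLUSTER_0
export CLUSTER_RECREATED=$CLUSTER_1
export CAMUNDA_NAMESPACE_SURVIVING=$CAMUNDA_NAMESPACE_0
export CAMUNDA_NAMESPACE_FAILOVER=$CAMUNDA_NAMESPACE_0_FAILOVER
export CAMUNDA_NAMESPACE_RECREATED=$CAMUNDA_NAMESPACE_1
export REGION_SURVIVING=region0
export REGION_RECREATED=region1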

Failover

Ensure network isolation between Kubernetes clusters

Current state

One of the regions is lost, meaning Zeebe:

  • Is unable to process new requests due to the loss of quorum
  • Stops exporting new data to Elasticsearch in the lost region
  • Stops exporting new data to Elasticsearch in the surviving region

Desired state

For the failover procedure, ensure the lost region cannot accidentally reconnect. Confirm that it is truly lost, and then take measures to prevent it from reconnecting, for example by using one of the approaches suggested below to isolate your active environment.

How to get there

Depending on your architecture, possible approaches are:

  • Configuring Kubernetes Network Policies to disable traffic flow between the clusters (see the sketch after this list).
  • Configuring firewall rules to disable traffic flow between the clusters.
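
As an illustration of the first approach, a NetworkPolicy similar to the following could be applied in the surviving region's Camunda namespace to block traffic to and from the other cluster. This is only a minimal sketch: the namespace name and the 10.192.0.0/16 CIDR standing in for the lost region's network are hypothetical placeholders, and it assumes your CNI plugin actually enforces NetworkPolicies.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-lost-region
  namespace: camunda-london        # hypothetical surviving-region namespace
spec:
  podSelector: {}                  # apply to all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # allow traffic from anywhere except the lost region's network
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.192.0.0/16      # placeholder CIDR of the lost region
  egress:
    # allow traffic to anywhere except the lost region's network
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.192.0.0/16      # placeholder CIDR of the lost region

This deny-by-exception pattern keeps in-cluster and external traffic working while cutting only the paths to the lost region.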

Failback

Deploy Camunda 8 in the failback mode in the newly created region

Current state

You have temporary Zeebe brokers deployed in failover mode together with a temporary Elasticsearch within the same surviving region.

Desired state

You want to restore the dual-region functionality again and deploy Zeebe in failback mode to the newly restored region.

Failback mode means clusterSize/2 new brokers will be installed in the restored region:

  • clusterSize/4 brokers run in normal mode, participating in processing and restoring the data.
  • clusterSize/4 brokers temporarily run in sleeping mode. They will switch to normal mode later, once the failover setup is removed.

An Elasticsearch cluster will also be deployed in the restored region, but it will not be used until the data from a backup of the surviving Elasticsearch cluster has been restored into it.

How to get there

The changes previously done in the base Helm values file camunda-values.yml in aws/dual-region/kubernetes should still be present from Failover - Step 2.

In particular, the values ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL and ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL should solely point at the surviving region.
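
As a reference, the relevant entries in camunda-values.yml might look like the following sketch, where both exporter URLs point at the Elasticsearch of the surviving region. The zeebe.env list and the service URL shown here are illustrative assumptions and must match your actual surviving-region Elasticsearch endpoint:

zeebe:
  env:
    # both region exporters point at the surviving region's Elasticsearch
    - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL
      value: http://elasticsearch-master-headless.camunda-london.svc.cluster.local:9200
    - name: ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION1_ARGS_URL
      value: http://elasticsearch-master-headless.camunda-london.svc.cluster.local:9200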

In addition, the following Helm command will disable Operate and Tasklist since those will only be enabled at the end of the full region restore. It's required to keep them disabled in the newly created region due to their Elasticsearch importers.

Lastly, the installationType is set to failBack to switch the behavior of Zeebe and prepare for this procedure.

  1. From the terminal context of aws/dual-region/kubernetes execute:
helm install $HELM_RELEASE_NAME camunda/camunda-platform \
--version $HELM_CHART_VERSION \
--kube-context $CLUSTER_RECREATED \
--namespace $CAMUNDA_NAMESPACE_RECREATED \
-f camunda-values.yml \
-f $REGION_RECREATED/camunda-values.yml \
--set global.multiregion.installationType=failBack \
--set operate.enabled=false \
--set tasklist.enabled=false

Verification

The following command will show the deployed pods of the newly created region.

Depending on your chosen clusterSize, you should see that the failback deployment contains some Zeebe instances that are ready and others that are unready. The unready instances are sleeping indefinitely, which is the expected behavior. This stems from the failback mode: the temporary failover setup still acts as a replacement for the lost region.

For example, with clusterSize: 8, you will find two ready Zeebe brokers and two unready brokers in the newly created region.

kubectl --context $CLUSTER_RECREATED get pods -n $CAMUNDA_NAMESPACE_RECREATED

Port-forwarding the Zeebe Gateway via kubectl and printing the topology should reveal that the failback brokers have joined the cluster.

kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING
zbctl status --insecure --address localhost:26500