Troubleshooting IAM Roles for Service Accounts (IRSA)
IRSA configuration validation for a Camunda 8 Helm deployment
The c8-sm-checks utility is designed to validate IAM Roles for Service Accounts (IRSA) configuration in Amazon EKS clusters. It ensures that key components in a Camunda 8 deployment, such as PostgreSQL and OpenSearch, are properly configured to securely interact with AWS resources via the appropriate IAM roles.
IRSA check script
The `/checks/kube/aws-irsa.sh` script verifies the IRSA setup in your AWS Kubernetes environment by performing two types of checks:
- Configuration Verification: Ensures key IRSA configurations are correctly set, using specific checks on IAM roles, policies, and mappings to service accounts.
- Namespace Commands and Job Execution: Runs commands within the specified namespace using Kubernetes jobs (if necessary) to verify network and access configurations.
This utility is non-intrusive and will not alter any deployment settings.
If the `-s` flag is provided, the script skips spawning debugging pods for network flow verification, which can be helpful if pod creation is restricted or not required for troubleshooting.
The script relies on Helm chart values and is compatible only with deployments installed or updated through standard Helm commands. It will not work with other deployment methods, such as those using `helm template` (e.g., ArgoCD).
Compatibility is confirmed for Camunda Helm chart releases version 11 and above.
Key features
- Helm values retrieval: Extracts deployment values using Helm to ensure all required configurations are set.
- EKS and OIDC configuration check: Confirms that EKS is configured with IAM and OIDC, matching the minimum required version for IRSA compatibility.
- Service account role validation: For each specified component, verifies that the service account exists and has the correct IAM role annotations.
- Network access verification: Ensures that PostgreSQL (Aurora) or OpenSearch instances are accessible from within the cluster. This step involves an `nmap` scan through a Kubernetes job. Use the `-s` option to skip this step if network flow verification is unnecessary.
- IRSA value check: Validates that the Helm deployment values are correctly configured to use IRSA for secure service interactions with AWS.
- Aurora PostgreSQL and OpenSearch IAM configuration: Confirms that these services support IAM login, ensuring secure access configurations.
- Access and Trust Policy verification: Checks that access and trust policies are correctly set. Note that the script performs basic checks; if issues arise with these policies, further manual verification may be needed.
- Service Account Role association test: Tests that the IAM role association with the service account is functioning as expected by spawning a job with the specified service account and validating the resulting ARN. This step can also be skipped using the `-s` option.
- OpenSearch Access Policy check: Validates that the OpenSearch access policy is configured correctly to support secure connections from the cluster.
Example usage
You can find the complete usage details in the c8-sm-checks repository. Below is a quick reference for common usage options:
```
Usage: ./checks/kube/aws-irsa.sh [-h] [-n NAMESPACE] [-e EXCLUDE_COMPONENTS] [-p COMPONENTS_PG] [-l COMPONENTS_OS] [-s]

Options:
  -h                      Display this help message
  -n NAMESPACE            Specify the namespace to use
  -e EXCLUDE_COMPONENTS   Comma-separated list of components to exclude from the check (the component reference is the root key used in the chart)
  -p COMPONENTS_PG        Comma-separated list of components to check IRSA for PostgreSQL (overrides the default list)
  -l COMPONENTS_OS        Comma-separated list of components to check IRSA for OpenSearch (overrides the default list)
  -s                      Disable pod spawning for IRSA and network flow verification
```
Example Command:

```bash
./checks/kube/aws-irsa.sh -n camunda-primary -p "identity,webModeler" -l "zeebe,operate"
```
In this example, the script checks the `identity` and `webModeler` components (component names as referenced in the Helm chart) for Aurora PostgreSQL access, and the `zeebe` and `operate` components for OpenSearch access, in the `camunda-primary` namespace.
Script output overview
The script offers detailed output to confirm that each component is properly configured for IRSA. Below is an outline of the checks it performs and the expected output format:
Example Output:

```
[OK] AWS CLI version 2.15.20 is compatible and user is logged in.
[OK] AWS environment detected. Proceeding with the script.
[INFO] Chart camunda-platform is deployed in namespace camunda-primary.
[INFO] Retrieved values for Helm deployment: camunda-platform-11.0.1.
[FAIL] The service account keycloak-sa does not have a valid eks.amazonaws.com/role-arn annotation. You must add it in the chart, see https://docs.camunda.io/docs/self-managed/setup/deploy/amazon/amazon-eks/eks-helm/
[FAIL] RoleArn name for component 'identityKeycloak' is empty. Skipping verification.
```
The script highlights errors with the `[FAIL]` prefix, and these are directed to `stderr` for easier filtering. We recommend capturing `stderr` output to quickly identify failed configurations.
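For example, one way to capture only the failures (using the namespace from the example above):

```bash
# Run the checks, redirecting stderr (where [FAIL] lines go) to a log file.
./checks/kube/aws-irsa.sh -n camunda-primary 2> irsa-failures.log

# List only the failed checks.
grep '\[FAIL\]' irsa-failures.log
```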
If the script returns a false positive—indicating success when issues are actually present—manually review each output line to ensure reported configuration details (like Role ARNs or annotations) are accurate. For example, ensure that each service account has the correct Role ARN and associated permissions to avoid undetected issues.
Advanced troubleshooting for IRSA configuration
The troubleshooting script provides essential checks but may not capture all potential issues, particularly those related to IAM policies and configurations. If IRSA is not functioning as expected and no errors are flagged by the script, follow the steps below for deeper troubleshooting.
Spawn a debug pod to simulate the pod environment
To troubleshoot in an environment identical to your pod, deploy a debug pod with the necessary service account. Here are examples of debug manifests you can customize for your needs:
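For illustration, here is a minimal sketch of such a debug pod; the pod name, label, image, and service account are placeholders to replace with your own values:

```bash
# Minimal debug pod sketch (name, label, image, and service account are placeholders).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: debug-client
  labels:
    app: debug-client
spec:
  serviceAccountName: aurora-access-sa  # the service account under test
  containers:
    - name: debug
      image: amazonlinux:2023
      # Sleep so there is time to exec into the pod for live debugging.
      command: ["sleep", "3600"]
EOF
```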
- Adapt the manifests to use the specific `serviceAccountName` (e.g., `aurora-access-sa`) you want to test.
- Insert a sleep timer in the command to allow time to exec into the pod for live debugging.
- Create the pod with the `kubectl apply` command: `kubectl apply -f debug-client.yaml`
- Once the pod is running, connect to it with a bash shell (make sure to adjust the `app` label to your value):

  ```bash
  kubectl exec -it $(kubectl get pods -l app=REPLACE-WITH-LABEL -o jsonpath='{.items[0].metadata.name}') -- /bin/bash
  ```
- Inside the pod, display all environment variables to check the IAM and AWS configuration by running `env`. This prints all environment variables, including those related to IRSA. Validate that the key variables are correctly injected:
  - `AWS_WEB_IDENTITY_TOKEN_FILE`: Path to the token (JWT) file for WebIdentity.
  - `AWS_ROLE_ARN`: ARN of the associated IAM role.
  - `AWS_REGION`, `AWS_STS_REGIONAL_ENDPOINTS`, and other AWS configuration variables.
To ensure that IRSA and role associations are functioning:
- Check that the expected `AWS_ROLE_ARN` and token are present.
- Decode the JWT token to validate the correct trust relationship with the service account and namespace, as sketched below.
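As a sketch, assuming `cut`, `tr`, and `base64` are available in the pod, you can decode the token payload and check that the `sub` claim matches `system:serviceaccount:<namespace>:<service-account>`:

```bash
# Extract the JWT payload (second dot-separated field) and convert base64url to base64.
payload=$(cut -d '.' -f2 "$AWS_WEB_IDENTITY_TOKEN_FILE" | tr '_-' '/+')

# Pad to a multiple of 4 characters so base64 -d accepts it.
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done

# Print the decoded claims; pipe through `jq .` if it is available.
echo "$payload" | base64 -d
```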
Verify OpenSearch fine-grained access control (FGAC) configuration
For OpenSearch clusters, ensure fine-grained access control is set up to allow the role’s access to the cluster. If you deployed OpenSearch with the Terraform reference architecture implementation for EKS, FGAC should already be configured. For manual deployments, follow the process outlined in the OpenSearch configuration guide to apply similar controls.
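As a quick check (the domain name is a placeholder), you can inspect whether fine-grained access control is enabled on the domain with the AWS CLI:

```bash
# Show the domain's fine-grained access control (advanced security) settings.
aws opensearch describe-domain --domain-name my-opensearch-domain \
  --query 'DomainStatus.AdvancedSecurityOptions' --output json
```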
Confirm PostgreSQL IAM role access
Verify that PostgreSQL roles are correctly configured to support IAM-based authentication. The database user should have the `rds_iam` role to allow IAM authentication (a sketch of granting it follows below). If the setup was automated with the Terraform reference architecture implementation for EKS, the necessary access configuration should already be in place. For manual configurations, refer to the PostgreSQL configuration instructions.
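For reference, a sketch of granting the `rds_iam` role; the host and user names are placeholders, and the statement must be run as a user with sufficient privileges:

```bash
# Connect as an admin user and grant the rds_iam role to the application user.
psql -h mydb.cluster-abc123.eu-central-1.rds.amazonaws.com -U postgres \
  -c 'GRANT rds_iam TO "camunda_irsa_user";'
```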
To test connectivity:
- Run a manual connection test using the PostgreSQL client manifest.
- Use `psql` within the pod to verify the correct roles are assigned. Run `SELECT * FROM pg_roles WHERE rolname='<your-username>';` and confirm that `rds_iam` is listed among the assigned roles. A connection sketch follows below.
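A minimal connectivity sketch from inside the debug pod; the endpoint, user, and region are placeholders to replace with your own values:

```bash
# Generate a short-lived IAM authentication token and use it as the password.
export PGHOST="mydb.cluster-abc123.eu-central-1.rds.amazonaws.com"
export PGUSER="camunda_irsa_user"
export PGPASSWORD="$(aws rds generate-db-auth-token \
  --hostname "$PGHOST" --port 5432 --username "$PGUSER" --region eu-central-1)"

# IAM authentication requires SSL.
psql "host=$PGHOST port=5432 dbname=postgres user=$PGUSER sslmode=require"
```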
Validate IAM policies for each role
Both trust and permission policies are crucial in configuring IAM Roles for Service Accounts (IRSA) in AWS. Each IAM role should have policies that precisely permit necessary actions and correctly trust the relevant Kubernetes service accounts associated with your components.
AssumeRole policies
In AWS, AssumeRole allows a user or service to assume a role and temporarily gain permissions to execute specific actions. Each role needs an AssumeRole policy that precisely matches AWS requirements for the specific services and actions your components perform.
For each IAM role, ensure the trust policy includes:
- The correct principal: for IRSA, this is the cluster’s OIDC provider referenced as a `Federated` principal, allowing the pod’s service account to assume the role.
- An `Action` of `sts:AssumeRoleWithWebIdentity`, as IRSA uses WebIdentity to enable IAM role assumption.
Verify that the policy is configured according to AWS’s role trust policy guidelines for Kubernetes IRSA.
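As a sketch (the role name is a placeholder), you can dump a role’s trust policy with the AWS CLI and confirm the `Federated` OIDC principal, the `sts:AssumeRoleWithWebIdentity` action, and a condition matching your namespace and service account:

```bash
# Print the trust (assume-role) policy document attached to the role.
aws iam get-role --role-name my-irsa-role \
  --query 'Role.AssumeRolePolicyDocument' --output json
```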
Trust policies
For each role, verify that the trust policy syntax is correct, allowing the appropriate service accounts to assume the role. Refer to AWS’s trust policy validation tool for accurate syntax and configuration.
Permission policies
Each IAM role should also have appropriate permission policies attached. These policies define what actions the role can perform on AWS resources. Verify that permission policies:
- Are configured correctly to allow the necessary operations for your resources (e.g., read and write access to S3 buckets or access to RDS).
- Align with your security model by only granting the minimum required permissions.
The AWS’s policy simulator is a valuable tool for testing how permissions are applied and for spotting misconfigurations.
If issues persist
If issues remain unresolved, compare your configuration with Camunda’s reference architecture deployed with Terraform. This setup has been validated to work with IRSA and contains the correct permissions. By comparing it to your setup, you may identify discrepancies that are causing your issues.
Instance Metadata Service (IMDS)
The Instance Metadata Service (IMDS) is a default fallback for the AWS SDK due to the default credentials provider chain. In the context of Amazon EKS, this means a pod will automatically assume the IAM role of its node. This can hide many problems, including an incorrect IRSA setup, since the SDK silently falls back to IMDS on failure and masks the actual error.
Thus, if nothing within your cluster relies on the implicit node role, we recommend disabling it, for example by defining `http_put_response_hop_limit` in Terraform.
Using a Terraform module like the Amazon EKS module, you can set the following to decrease the default value of two to one, which prevents pods from assuming the node’s role:
```hcl
eks_managed_node_group_defaults = {
  metadata_options = {
    http_put_response_hop_limit = 1
  }
}
```
Overall, this disables node role assumption for Kubernetes pods. As a result, components such as Operate, Zeebe, and Web Modeler will surface a clearer error when IRSA is misconfigured, making the problem easier to debug.
In the Terraform reference architecture, this setting is configured by default.
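To verify the effect from inside a pod, a minimal sketch (assuming `curl` is available in the container):

```bash
# With http_put_response_hop_limit = 1, this IMDSv2 token request should time out,
# confirming that pods can no longer reach the node's credentials.
curl -sS --max-time 5 -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60" \
  && echo "IMDS reachable (node role can still be assumed)" \
  || echo "IMDS not reachable (expected with hop limit 1)"
```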