Chaos Mesh
Chaos Mesh is a popular chaos testing tool built specifically with Kubernetes in mind. It supports a wide range of chaos experiments, and you can find the full list in the official docs. We will focus primarily on pod chaos in this lab. To start, let’s define our objectives.
First, we will have one or more pods forcefully killed (in a non-graceful manner). We then want to see whether new pods come up immediately to replace the killed pods, and how long they take to start. Once we have achieved this goal, we will look at automating the whole process like so:
- Before running the test, a new replica is created to minimize business disruption
- At a specified time during the week, a pod is killed as part of the test
- A script watches and waits to see if the replacement pod starts up
- If everything is fine, send an email or message to Slack to notify that the test succeeded, then get rid of the additional replica
- If it didn’t work as expected, keep the additional replica and send out an alert that recovery isn’t working
All of the above steps will be completely automated so that you can have several applications running chaos tests (preferably outside of peak business hours).
First, install Chaos Mesh into your cluster with Helm:
helm repo add chaos-mesh https://charts.chaos-mesh.org
kubectl create ns chaos-mesh
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --version 2.6.3 --set chaosDaemon.runtime=containerd
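Before moving on, it’s worth confirming that the Chaos Mesh components started correctly; a quick sanity check against the chaos-mesh namespace created above:
kubectl get pods -n chaos-mesh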
Now, let’s start up a basic nginx deployment.
Create the deployment running nginx with two replicas (this sets the label app: nginx, which the chaos selector below relies on):
kubectl create deployment nginx --image=nginx --port=80 --replicas=2
Access the app in a browser at http://localhost:8080 via port-forwarding:
kubectl port-forward deployment/nginx 8080:80
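While the port-forward is running, a quick check from another terminal should return the nginx welcome page (assuming the local port 8080 used above):
curl http://localhost:8080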
Now that we have a target to test chaos on, let’s define a basic pod kill chaos in a file called “pod-kill.yaml”:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: nginx
  duration: 30s
Now deploy this to Kubernetes:
kubectl apply -f pod-kill.yaml
Immediately upon deployment, you should see one of the two replicas get killed. You can use kubectl get po --watch
to see this happen in real time. You can then continue to observe as the pod recovers from this incident and determine whether it recovered within an acceptable time. The next step is to automate all of this so that the deployment and observation are handled on your behalf. For this, we will use a script stored in a ConfigMap and a CronJob that periodically triggers this script.
First, we will need an image that has both curl and kubectl. In an enterprise environment, you should build this image yourself with the necessary tools and push it to your organization’s private registry. Publicly available images can contain vulnerabilities, be deleted without your knowledge, or run into Docker Hub’s pull rate limits, which would prevent new images from being pulled. In a testing situation, however, feel free to use an image from Docker Hub that includes both tools. We will be using tranceh2/bash-curl-kubectl.
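If you do build your own image, a minimal sketch might look like the Dockerfile below. The base image, the pinned kubectl version, and the install path are assumptions you would adjust for your environment:
FROM alpine:3.19
# bash for the chaos script, curl for the Slack notifications
RUN apk add --no-cache bash curl
# Download a pinned kubectl release (pick a version that matches your cluster)
RUN curl -LO "https://dl.k8s.io/release/v1.29.0/bin/linux/amd64/kubectl" \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/kubectl
Build and push this to your private registry, then reference that image in the CronJob below instead of the public one.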
Next, we will create the script that performs the chaos test with reusability in mind. This means using arguments to pass information such as the deployment name, namespace, chaos namespace, and chaos name. Since we will be reporting the status of the test to a Slack channel, we also pass the Slack webhook URL this way. It is best to store the webhook URL in a Secret and reference it as an environment variable. The script itself is created inside a ConfigMap that is then mounted into the pod created by the CronJob as a volume. As a final touch, we set the restart policy to Never,
since we don’t want a job that introduces chaos restarting indefinitely and accidentally firing during peak times. Below is the finalized CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-chaos-test
  namespace: default
spec:
  schedule: "5 5 * * 2" # At 5:05 AM on Tuesday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: chaos-test
            image: tranceh2/bash-curl-kubectl
            command:
            - /bin/sh
            - -c
            - |
              cp /scripts/chaos.sh /tmp/chaos.sh
              chmod +x /tmp/chaos.sh
              /tmp/chaos.sh nginx default default pod-kill "$SLACK_WEBHOOK_URL"
            volumeMounts:
            - name: chaos-script
              mountPath: /scripts
            env:
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: slack-webhook-secret
                  key: SLACK_WEBHOOK_URL
          restartPolicy: Never
          volumes:
          - name: chaos-script
            configMap:
              name: chaos-script
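The CronJob expects a Secret named slack-webhook-secret to exist in the same namespace. If you don’t have one yet, you can create it along these lines (the webhook URL shown is a placeholder for your own):
kubectl create secret generic slack-webhook-secret -n default \
  --from-literal=SLACK_WEBHOOK_URL='https://hooks.slack.com/services/<your-webhook-path>'
Note also that the job’s service account needs RBAC permissions to get and patch Deployments and to create and delete Chaos Mesh objects; in a locked-down cluster you would bind an appropriate Role to it before the job can run.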
The above job calls the script responsible for running the chaos test. Now, let’s look at the script itself. We will use kubectl patch
to temporarily increase the replica count, followed by kubectl apply
to apply the chaos. Next, we will use kubectl wait
to see whether the pod returns and the required replica count is maintained. The result is then sent to Slack with a curl command. Finally, we use another kubectl patch
command to restore the replica count to its initial value and delete the chaos object that was created. Below is the script with all the mentioned items:
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-script
  namespace: default
data:
  chaos.sh: |
    #!/bin/bash
    # Define variables from arguments
    DEPLOYMENT_NAME=$1
    NAMESPACE=$2
    CHAOS_NAMESPACE=$3
    CHAOS_NAME=$4
    SLACK_WEBHOOK_URL=$5
    # Get current replica count
    current_replicas=$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}')
    echo "Current replicas = $current_replicas"
    # Increase replica count by 1
    new_replicas=$((current_replicas + 1))
    kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE --type='json' -p="[{\"op\": \"replace\", \"path\": \"/spec/replicas\", \"value\": $new_replicas}]"
    # Wait for the new pod to be created and the container to be ready
    kubectl wait --for=condition=available --timeout=300s deployment/$DEPLOYMENT_NAME -n $NAMESPACE
    # Remove any leftover chaos object from a previous run
    echo "Delete chaos"
    kubectl delete podchaos $CHAOS_NAME -n $CHAOS_NAMESPACE || true
    echo "Applying chaos"
    # Apply chaos mesh job
    kubectl apply -f - <<EOF
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: $CHAOS_NAME
      namespace: $CHAOS_NAMESPACE
    spec:
      action: pod-kill
      mode: one
      selector:
        labelSelectors:
          app: $DEPLOYMENT_NAME
      duration: 30s
    EOF
    # Wait for chaos to complete and check if the deployment recovers
    echo "Waiting until pod recovers"
    if kubectl wait --for=condition=available --timeout=300s deployment/$DEPLOYMENT_NAME -n $NAMESPACE; then
      curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"$DEPLOYMENT_NAME pod recovery successful within 5 minutes.\"}" $SLACK_WEBHOOK_URL
    else
      curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"$DEPLOYMENT_NAME pod recovery failed\"}" $SLACK_WEBHOOK_URL
    fi
    # Restore the original replica count and clean up the chaos object
    kubectl patch deployment $DEPLOYMENT_NAME -n $NAMESPACE --type='json' -p="[{\"op\": \"replace\", \"path\": \"/spec/replicas\", \"value\": $current_replicas}]"
    echo "Delete chaos"
    kubectl delete podchaos $CHAOS_NAME -n $CHAOS_NAMESPACE
Let’s go through it step by step. The script increases the number of replicas by 1 and waits for the deployment to become fully available. Once it is ready, it deletes any leftover chaos objects and applies the chaos YAML to kill one pod, then waits again for the deployment to recover. If it hasn’t recovered within 300 seconds, the script reports the failure to Slack; otherwise it sends a success message. It then scales the deployment back to its original replica count and finally deletes the PodChaos object.
Apply the above files to your Kubernetes cluster using kubectl apply, then trigger the CronJob manually.
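Since a CronJob only fires on its schedule, one way to run it immediately is to create a one-off Job from it. A minimal example, assuming the pod-chaos-test CronJob defined above (the job name manual-chaos-test is arbitrary):
kubectl create job manual-chaos-test --from=cronjob/pod-chaos-test -n default
You can then see the job that runs with: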
kubectl get jobs
Find the name of the job that is running the chaos test and use:
kubectl logs job/<job-name>
to follow the chaos test as it happens. Also use:
kubectl get pods -A --watch
to keep an eye on the pods as they get killed and auto-recover. Once the test is done, the script sends a message with the status to Slack.
This covers pod kill chaos and how we can automate it end to end. Now, let’s take a look at two other types of Chaos: memory chaos and CPU chaos.
Resource chaos
Killing a pod is the most basic type of chaos out there. When it comes to resource-based chaos, Chaos Mesh injects stress processes into a running container, which requires some additional permissions and configuration. First, when we installed Chaos Mesh, we set chaosDaemon.runtime=containerd
. This is needed because, by default, this value is set to docker
. Next, we need to watch the chaos objects as they run, since it is possible that not everything the chaos requires is available in your cluster.
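For example, you can inspect a chaos object’s status and events to see whether injection succeeded; a quick check, where <name> and <namespace> are placeholders for your own values:
kubectl describe stresschaos <name> -n <namespace>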
Now that we have a general idea of what to look out for, let’s start with CPU chaos. Unlike pod chaos, where there is a recovery event after the chaos ends, there is no such recovery event for CPU chaos. Since CPU is a compressible resource, Kubernetes lets CPU usage climb until the pod’s limit (or the node’s capacity) is reached, at which point your application simply gets throttled. So in this section we will instead build a system with a CPU-based HPA. To get an in-depth idea of HPAs, go to the scaling section.
First, add a CPU-based HPA to the nginx deployment. The HPA would look something like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
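Note that a Utilization target is measured against the pods’ resource requests, so the nginx deployment needs a CPU request set (and metrics-server must be running in the cluster) for this HPA to do anything. One way to add a request, with illustrative values:
kubectl set resources deployment nginx -n default --requests=cpu=100m --limits=cpu=500m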
This will start scaling your deployment once average CPU utilization across the pods hits 80% of the requested CPU. Now let’s take a look at the CPU chaos YAML.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: $CHAOS_NAME
  namespace: $CHAOS_NAMESPACE
spec:
  mode: one
  stressors:
    cpu:
      workers: 10
  selector:
    labelSelectors:
      app: $DEPLOYMENT_NAME
This will add 10 workers that rapidly drive up the CPU usage of the targeted pods. The ConfigMap that holds the script looks similar to the pod-kill one:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cpu-stress-chaos-script
  namespace: default
data:
  cpu-chaos.sh: |
    #!/bin/bash
    # Define variables from arguments
    DEPLOYMENT_NAME=$1
    NAMESPACE=$2
    CHAOS_NAMESPACE=$3
    CHAOS_NAME=$4
    SLACK_WEBHOOK_URL=$5
    # Get current replica count
    current_replicas=$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}')
    echo "Current replicas = $current_replicas"
    # Remove any leftover chaos object from a previous run
    echo "Delete chaos"
    kubectl delete stresschaos $CHAOS_NAME -n $CHAOS_NAMESPACE || true
    echo "Applying CPU stress chaos"
    # Apply CPU stress
    kubectl apply -f - <<EOF
    apiVersion: chaos-mesh.org/v1alpha1
    kind: StressChaos
    metadata:
      name: $CHAOS_NAME
      namespace: $CHAOS_NAMESPACE
    spec:
      mode: all
      stressors:
        cpu:
          workers: 10
      selector:
        labelSelectors:
          app: $DEPLOYMENT_NAME
    EOF
    # Give the HPA time to react to the increased CPU usage
    sleep 60
    # Get the replica count after the chaos and compare it to the original
    new_replicas=$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}')
    echo "Replicas after chaos = $new_replicas"
    if [ "$new_replicas" -gt "$current_replicas" ]; then
      curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"$DEPLOYMENT_NAME scaled from $current_replicas to $new_replicas replicas under CPU stress.\"}" $SLACK_WEBHOOK_URL
    else
      curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"$DEPLOYMENT_NAME did not scale up under CPU stress.\"}" $SLACK_WEBHOOK_URL
    fi
    echo "Delete chaos"
    kubectl delete stresschaos $CHAOS_NAME -n $CHAOS_NAMESPACE
Some notable differences: we no longer increase the replica count before running the chaos, since it is the HPA’s job to increase the count when the threshold is reached. We also use a different method to check whether the chaos test was successful. What we do is:
- Take note of the number of replicas before chaos starts
- Run chaos
- Wait a while
- Check to see if the scaler has kicked in and started scaling up the resources
- If yes, then successful
- If no, then send a failure message to Slack
- Delete the chaos
We also don’t perform any manual recovery steps here. Once the chaos is deleted, the CPU usage of the pods should drop, which should result in the replica count scaling back down to its previous level. Now that we have gone through all the steps, apply the files to your cluster with kubectl apply
and run the CronJob manually to see if it works as intended. You may have to watch the chaos object to ensure it runs as expected.
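If you want to schedule the CPU test the same way as the pod-kill test, a CronJob very similar to the earlier one can be used. The sketch below assumes the cpu-stress-chaos-script ConfigMap and cpu-chaos.sh script defined above; the CronJob name, schedule, and script arguments are placeholders to adjust:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cpu-chaos-test
  namespace: default
spec:
  schedule: "5 5 * * 3" # At 5:05 AM on Wednesday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: chaos-test
            image: tranceh2/bash-curl-kubectl
            command:
            - /bin/sh
            - -c
            - |
              cp /scripts/cpu-chaos.sh /tmp/cpu-chaos.sh
              chmod +x /tmp/cpu-chaos.sh
              /tmp/cpu-chaos.sh nginx default default cpu-stress "$SLACK_WEBHOOK_URL"
            volumeMounts:
            - name: chaos-script
              mountPath: /scripts
            env:
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: slack-webhook-secret
                  key: SLACK_WEBHOOK_URL
          restartPolicy: Never
          volumes:
          - name: chaos-script
            configMap:
              name: cpu-stress-chaos-script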
The final type of chaos we will be looking at is memory stress. Unlike CPU, memory is a bit tricky, especially in a Kubernetes context. Unless your application actively releases memory back to the operating system (many runtimes and garbage collectors don’t), memory consumed during the test may never be freed. As a result, memory-based scaling is generally not a good idea, since the application can keep scaling up until the maximum replica count is reached. However, if you have a critical application and want to prevent it from running out of memory due to a sudden spike in requests, you might want to use it anyway.
We will use roughly the same HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
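As with the CPU HPA, a Utilization target is measured against the pods’ requests, so the deployment also needs a memory request for this to work; for example, with illustrative values:
kubectl set resources deployment nginx -n default --requests=memory=64Mi --limits=memory=256Mi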
The memory stressor will also be largely similar to the CPU stress test.
apiVersion: v1
kind: ConfigMap
metadata:
  name: memory-stress-chaos-script
  namespace: default
data:
  memory-chaos.sh: |
    #!/bin/bash
    # Define variables from arguments
    DEPLOYMENT_NAME=$1
    NAMESPACE=$2
    CHAOS_NAMESPACE=$3
    CHAOS_NAME=$4
    SLACK_WEBHOOK_URL=$5
    # Get current replica count
    current_replicas=$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}')
    echo "Current replicas = $current_replicas"
    # Remove any leftover chaos object from a previous run
    echo "Delete chaos"
    kubectl delete stresschaos $CHAOS_NAME -n $CHAOS_NAMESPACE || true
    echo "Applying memory stress chaos"
    # Apply memory stress
    kubectl apply -f - <<EOF
    apiVersion: chaos-mesh.org/v1alpha1
    kind: StressChaos
    metadata:
      name: $CHAOS_NAME
      namespace: $CHAOS_NAMESPACE
    spec:
      mode: all
      stressors:
        memory:
          workers: 4
          size: 50MiB
      selector:
        labelSelectors:
          app: $DEPLOYMENT_NAME
    EOF
    # Give the HPA time to react to the increased memory usage
    sleep 60
    # Get the replica count after the chaos and compare it to the original
    new_replicas=$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o jsonpath='{.spec.replicas}')
    echo "Replicas after chaos = $new_replicas"
    if [ "$new_replicas" -gt "$current_replicas" ]; then
      curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"$DEPLOYMENT_NAME scaled from $current_replicas to $new_replicas replicas under memory stress.\"}" $SLACK_WEBHOOK_URL
    else
      curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"$DEPLOYMENT_NAME did not scale up under memory stress.\"}" $SLACK_WEBHOOK_URL
    fi
    echo "Delete chaos"
    kubectl delete stresschaos $CHAOS_NAME -n $CHAOS_NAMESPACE
Apply all the above files with kubectl apply. This works the same way as the CPU stressor, but when running this stress test, keep note of whether the memory that was consumed eventually gets released. If it doesn’t, you might have to handle the extra pods manually, since the HPA won’t scale back down while memory usage stays high.
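To see whether memory is actually released after the chaos is deleted, you can check pod memory usage via metrics-server a few times (assuming the app=nginx label used earlier):
kubectl top pods -l app=nginx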
And that covers 3 different types of chaos.
Conclusion
In this section we covered Chaos Mesh along with three different types of chaos that can be applied to your Kubernetes cluster. There are many other types of chaos listed in the official docs, and it’s recommended that you read through them to find the ones that best fit your requirements. As a best practice, always try out chaos testing in a test environment before moving it into production. Some chaos tests should not be run in a production environment at all, and there are a few that Chaos Mesh specifically warns against using. So make sure you don’t overdo it.