
Agent Skills for Claude Code | Chaos Engineer

Domain: DevOps & Operations
Role: specialist
Scope: implementation
Output: code

Triggers: chaos engineering, resilience testing, failure injection, game day, blast radius, chaos experiment, fault injection, Chaos Monkey, Litmus Chaos, antifragile

Related Skills: SRE Engineer · DevOps Engineer · Kubernetes Specialist

Use this skill for:

  • Designing and executing chaos experiments
  • Implementing failure injection frameworks (Chaos Monkey, Litmus, etc.)
  • Planning and conducting game day exercises
  • Building blast radius controls and safety mechanisms
  • Setting up continuous chaos testing in CI/CD
  • Improving system resilience based on experiment findings
Core workflow:

  1. System Analysis - Map architecture, dependencies, critical paths, and failure modes
  2. Experiment Design - Define hypothesis, steady state, blast radius, and safety controls
  3. Execute Chaos - Run controlled experiments with monitoring and quick rollback
  4. Learn & Improve - Document findings, implement fixes, enhance monitoring
  5. Automate - Integrate chaos testing into CI/CD for continuous resilience
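The five phases above can be sketched as a single gated runner. This is a minimal illustration, not a real tool: `verify_steady_state`, `inject_failure`, `rollback`, and `record_findings` are placeholder hooks you would implement against your own stack.

```shell
#!/usr/bin/env bash
# Minimal sketch of the workflow as a gated runner. All four hooks are
# placeholders, not real commands.
run_experiment() {
  verify_steady_state || { echo "baseline violated; not starting" >&2; return 1; }
  inject_failure &            # run the chaos tool in the background
  chaos_pid=$!
  if ! verify_steady_state; then
    rollback                  # abort path: scripted and tested beforehand
  fi
  wait "$chaos_pid"
  record_findings             # close the loop: learnings + tracked fixes
}
```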

Load detailed guidance based on context:

| Topic | Reference | Load When |
| --- | --- | --- |
| Experiments | references/experiment-design.md | Designing hypothesis, blast radius, rollback |
| Infrastructure | references/infrastructure-chaos.md | Server, network, zone, region failures |
| Kubernetes | references/kubernetes-chaos.md | Pod, node, Litmus, Chaos Mesh experiments |
| Tools & Automation | references/chaos-tools.md | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | references/game-days.md | Planning, executing, learning from game days |

Non-obvious constraints that must be enforced on every experiment:

  • Steady state first — define and verify baseline metrics before injecting any failure
  • Blast radius cap — start with the smallest possible impact scope; expand only after validation
  • Automated rollback ≤ 30 seconds — abort path must be scripted and tested before the experiment begins
  • Single variable — change only one failure condition at a time until behaviour is well understood
  • No production without safety nets — customer-facing environments require circuit breakers, feature flags, or canary isolation
  • Close the loop — every experiment must produce a written learning summary and at least one tracked improvement
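For the blast-radius cap, a tiny helper like this (a sketch, not part of any chaos tool) can compute the largest affected-percentage value that still guarantees a minimum number of surviving replicas:

```shell
# Sketch: largest percentage of replicas that chaos may touch while
# guaranteeing at least `keep` replicas stay up (integer math rounds down).
max_affected_perc() {
  local replicas=$1 keep=$2
  echo $(( (replicas - keep) * 100 / replicas ))
}

max_affected_perc 3 2    # with 3 replicas, keeping 2 -> 33 (a 33% cap)
```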

When implementing chaos engineering, provide:

  1. Experiment design document (hypothesis, metrics, blast radius)
  2. Implementation code (failure injection scripts/manifests)
  3. Monitoring setup and alert configuration
  4. Rollback procedures and safety controls
  5. Learning summary and improvement recommendations
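One lightweight way to capture item 1 is a structured template. The field names below are suggestions for illustration, not a standard schema:

```yaml
# Hypothetical experiment-design template (field names are illustrative)
experiment: my-service-pod-delete
hypothesis: >
  Deleting up to 33% of my-service pods keeps p99 latency under 200ms
  and the error rate under 0.1%.
steady_state:
  p99_latency_ms: { max: 200 }
  error_rate_pct: { max: 0.1 }
blast_radius:
  environment: staging            # validate here before production
  pods_affected_pct: 33
abort:
  trigger: any steady-state metric breached
  action: patch ChaosEngine engineState to "stop"   # scripted, pre-tested
learnings:
  summary_doc: ""                 # filled in after the run
  tracked_improvements: []        # at least one item required
```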

Concrete Example: Pod Failure Experiment (Litmus Chaos)


The following shows a complete experiment — from hypothesis to rollback — using Litmus Chaos on Kubernetes.

Step 1 — Verify the steady state baseline

```shell
# Verify baseline: p99 latency < 200ms, error rate < 0.1%
# (latency and error-rate numbers come from your monitoring stack)
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service
```
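kubectl alone shows pod health, not latency, so a baseline gate usually queries your metrics store. The sketch below is illustrative only: the Prometheus endpoint, metric name, and label are assumptions, not something this skill prescribes.

```shell
# Sketch: build the PromQL for p99 latency (ms) of a given app label.
# The metric name http_request_duration_seconds_bucket is an assumption.
p99_promql() {
  printf 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="%s"}[5m])) by (le)) * 1000' "$1"
}

# Usage against a Prometheus API (URL is hypothetical):
#   curl -sG "http://prometheus:9090/api/v1/query" \
#     --data-urlencode "query=$(p99_promql my-service)" \
#     | jq -r '.data.result[0].value[1]'
```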

Step 2 — Create and apply a Litmus ChaosEngine manifest

chaos-pod-delete.yaml:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-service-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=my-service"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"        # seconds
            - name: CHAOS_INTERVAL
              value: "20"        # delete one pod every 20s
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "33"        # blast radius cap: at most 33% of replicas
```
```shell
# Apply the experiment
kubectl apply -f chaos-pod-delete.yaml
# Watch experiment status
kubectl describe chaosengine my-service-pod-delete -n production
kubectl get chaosresult my-service-pod-delete-pod-delete -n production -w
```
Step 3 — Monitor steady state during the experiment

```shell
# Tail application logs for errors
kubectl logs -l app=my-service -n production --since=2m -f
# Check the ChaosResult verdict when complete
kubectl get chaosresult my-service-pod-delete-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}'
```

Step 4 — Rollback / abort if steady state is violated

```shell
# Immediately stop the experiment
kubectl patch chaosengine my-service-pod-delete \
  -n production --type merge -p '{"spec":{"engineState":"stop"}}'
# Confirm all pods are healthy
kubectl rollout status deployment/my-service -n production
```
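To keep the abort within the 30-second budget, the stop command can be wrapped in a small watchdog. This is a sketch under stated assumptions: `check_error_rate` is a placeholder for your own metrics query, and the threshold comes from your steady-state definition.

```shell
# Sketch: poll every 5s (6 polls keeps the abort under a 30s budget)
# and stop the ChaosEngine as soon as the steady state is breached.
ABORT_THRESHOLD="${ABORT_THRESHOLD:-0.1}"    # error-rate %, from steady state

breached() {                 # exit 0 when observed value > threshold
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

watchdog() {
  local i rate
  for i in 1 2 3 4 5 6; do
    rate="$(check_error_rate)"               # placeholder: current error %
    if breached "$rate" "$ABORT_THRESHOLD"; then
      kubectl patch chaosengine my-service-pod-delete -n production \
        --type merge -p '{"spec":{"engineState":"stop"}}'
      return 1
    fi
    sleep 5
  done
}
```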

Concrete Example: Network Latency with toxiproxy

```shell
# Install toxiproxy CLI
brew install toxiproxy             # macOS; use the binary release on Linux
# Start the toxiproxy server (runs alongside your service)
toxiproxy-server &
# Create a proxy for your downstream dependency
toxiproxy-cli create -l 0.0.0.0:22222 -u downstream-db:5432 db-proxy
# Inject 300ms latency with 30ms jitter — blast radius: this proxy only
toxiproxy-cli toxic add db-proxy -t latency -a latency=300 -a jitter=30
# Run your load test / observe metrics here ...
# Remove the toxic to restore normal behaviour
toxiproxy-cli toxic remove db-proxy -n latency_downstream
```

Concrete Example: Chaos Monkey (Spinnaker / standalone)

chaos-monkey-config.yml:

```yaml
deployment:
  enabled: true
  regionIndependence: false
chaos:
  enabled: true
  meanTimeBetweenKillsInWorkDays: 2
  minTimeBetweenKillsInWorkDays: 1
  grouping: APP                    # kill one instance per app, not per cluster
  exceptions:
    - account: production
      region: us-east-1
      detail: "*-canary"           # never kill canary instances
```

```shell
# Apply and trigger a manual kill for testing
chaos-monkey --app my-service --account staging --dry-run false
```
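To close the loop with workflow step 5 (continuous chaos in CI/CD), a scheduled pipeline job can run the Litmus experiment and gate on its verdict. The fragment below is a hedged GitHub Actions-style sketch: the workflow name, schedule, and cluster access are assumptions about your setup.

```yaml
# Hypothetical scheduled chaos job (names and timings are illustrative)
name: nightly-chaos
on:
  schedule:
    - cron: "0 3 * * 1-5"          # weekdays at 03:00, staging only
jobs:
  pod-delete:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Litmus experiment and gate on verdict
        run: |
          kubectl apply -f chaos-pod-delete.yaml
          sleep 120                # TOTAL_CHAOS_DURATION plus buffer
          verdict=$(kubectl get chaosresult my-service-pod-delete-pod-delete \
            -o jsonpath='{.status.experimentStatus.verdict}')
          [ "$verdict" = "Pass" ] || exit 1
```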