# Agent Skills for Claude Code | Chaos Engineer
| Attribute | Value |
|---|---|
| Domain | DevOps & Operations |
| Role | specialist |
| Scope | implementation |
| Output | code |
**Triggers:** chaos engineering, resilience testing, failure injection, game day, blast radius, chaos experiment, fault injection, Chaos Monkey, Litmus Chaos, antifragile

**Related Skills:** SRE Engineer · DevOps Engineer · Kubernetes Specialist
## When to Use This Skill

- Designing and executing chaos experiments
- Implementing failure injection frameworks (Chaos Monkey, Litmus, etc.)
- Planning and conducting game day exercises
- Building blast radius controls and safety mechanisms
- Setting up continuous chaos testing in CI/CD
- Improving system resilience based on experiment findings
## Core Workflow

1. **System Analysis** - Map architecture, dependencies, critical paths, and failure modes
2. **Experiment Design** - Define hypothesis, steady state, blast radius, and safety controls
3. **Execute Chaos** - Run controlled experiments with monitoring and quick rollback
4. **Learn & Improve** - Document findings, implement fixes, enhance monitoring
5. **Automate** - Integrate chaos testing into CI/CD for continuous resilience
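The "Execute Chaos" step hinges on detecting a steady-state breach and aborting fast. Below is a minimal sketch of such a guard, assuming an error-rate threshold of 0.1% and a Litmus-style abort via `kubectl patch`; the metric source, engine name, and namespace are illustrative, not part of any tool's API:

```shell
#!/usr/bin/env bash
# Minimal steady-state guard: succeeds while the observed error rate
# stays at or below the agreed threshold, fails once it is breached.
check_steady_state() {
  local error_rate="$1" threshold="${2:-0.1}"
  awk -v r="$error_rate" -v t="$threshold" 'BEGIN { exit (r > t) ? 1 : 0 }'
}

# During a real experiment you would poll a live metric and abort on
# breach, e.g. (illustrative names: adapt PROM_URL, engine, namespace):
#   rate=$(curl -s "$PROM_URL/api/v1/query?query=error_rate" \
#     | jq -r '.data.result[0].value[1]')
#   check_steady_state "$rate" || kubectl patch chaosengine my-service-pod-delete \
#     -n production --type merge -p '{"spec":{"engineState":"stop"}}'
```

Keeping the check a pure function makes it trivially testable before the experiment, which is the point: the abort logic itself must never be the untested part.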
## Reference Guide

Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Experiments | references/experiment-design.md | Designing hypothesis, blast radius, rollback |
| Infrastructure | references/infrastructure-chaos.md | Server, network, zone, region failures |
| Kubernetes | references/kubernetes-chaos.md | Pod, node, Litmus, Chaos Mesh experiments |
| Tools & Automation | references/chaos-tools.md | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | references/game-days.md | Planning, executing, learning from game days |
## Safety Checklist

Non-obvious constraints that must be enforced on every experiment:
- Steady state first — define and verify baseline metrics before injecting any failure
- Blast radius cap — start with the smallest possible impact scope; expand only after validation
- Automated rollback ≤ 30 seconds — abort path must be scripted and tested before the experiment begins
- Single variable — change only one failure condition at a time until behaviour is well understood
- No production without safety nets — customer-facing environments require circuit breakers, feature flags, or canary isolation
- Close the loop — every experiment must produce a written learning summary and at least one tracked improvement
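The 30-second rollback rule above can be enforced mechanically rather than trusted. A minimal sketch using coreutils `timeout`, assuming the abort path is a standalone, pre-tested script as the checklist requires (the script name is hypothetical):

```shell
#!/usr/bin/env bash
# Enforce "automated rollback <= 30 seconds": run the scripted abort
# path under a hard deadline and fail loudly if it overruns.
abort_within_budget() {
  local budget="$1"; shift
  timeout "$budget" "$@"   # GNU coreutils timeout; exits 124 on overrun
}

# Usage (illustrative: ./abort-experiment.sh is your tested abort script,
# e.g. a kubectl patch that sets engineState to stop):
#   abort_within_budget 30 ./abort-experiment.sh || echo "ROLLBACK OVERRAN BUDGET"
```

Rehearsing the abort under this wrapper before the experiment doubles as the required "abort path must be scripted and tested" check.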
## Output Templates

When implementing chaos engineering, provide:
- Experiment design document (hypothesis, metrics, blast radius)
- Implementation code (failure injection scripts/manifests)
- Monitoring setup and alert configuration
- Rollback procedures and safety controls
- Learning summary and improvement recommendations
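As a sketch of the first deliverable, the experiment design document can be a small structured file reviewed before any injection. All field names below are suggestions, not a Litmus or Gremlin schema:

```yaml
# experiment-design.yaml: illustrative template, not a tool schema
experiment: pod-delete-my-service
hypothesis: >
  Deleting up to 33% of my-service pods keeps p99 latency under 200ms
  and the error rate under 0.1%.
steady_state:
  p99_latency_ms: { max: 200 }
  error_rate_pct: { max: 0.1 }
blast_radius:
  environment: staging                 # never production without safety nets
  pods_affected_perc: 33
rollback:
  command: scripts/abort-pod-delete.sh # hypothetical, pre-tested abort script
  max_seconds: 30
```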
## Concrete Example: Pod Failure Experiment (Litmus Chaos)

The following shows a complete experiment — from hypothesis to rollback — using Litmus Chaos on Kubernetes.
### Step 1 — Define steady state and apply the experiment

```shell
# Verify baseline: p99 latency < 200ms, error rate < 0.1%
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service
```

### Step 2 — Create and apply a Litmus ChaosEngine manifest
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-service-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=my-service"
    appkind: deployment
  # Limit blast radius: only 1 replica at a time
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"   # seconds
            - name: CHAOS_INTERVAL
              value: "20"   # delete one pod every 20s
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "33"   # max 33% of replicas affected
```

```shell
# Apply the experiment
kubectl apply -f chaos-pod-delete.yaml
```
```shell
# Watch experiment status
kubectl describe chaosengine my-service-pod-delete -n production
kubectl get chaosresult my-service-pod-delete-pod-delete -n production -w
```

### Step 3 — Monitor during the experiment
```shell
# Tail application logs for errors
kubectl logs -l app=my-service -n production --since=2m -f

# Check ChaosResult verdict when complete
kubectl get chaosresult my-service-pod-delete-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}'
```

### Step 4 — Rollback / abort if steady state is violated
```shell
# Immediately stop the experiment
kubectl patch chaosengine my-service-pod-delete \
  -n production --type merge -p '{"spec":{"engineState":"stop"}}'

# Confirm all pods are healthy
kubectl rollout status deployment/my-service -n production
```

## Concrete Example: Network Latency with toxiproxy
```shell
# Install toxiproxy CLI
brew install toxiproxy   # macOS; use the binary release on Linux

# Start toxiproxy server (runs alongside your service)
toxiproxy-server &

# Create a proxy for your downstream dependency
toxiproxy-cli create -l 0.0.0.0:22222 -u downstream-db:5432 db-proxy

# Inject 300ms latency with 10% jitter — blast radius: this proxy only
toxiproxy-cli toxic add db-proxy -t latency -a latency=300 -a jitter=30

# ... run your load test / observe metrics here ...

# Remove the toxic to restore normal behaviour
toxiproxy-cli toxic remove db-proxy -n latency_downstream
```

## Concrete Example: Chaos Monkey (Spinnaker / standalone)
```yaml
# chaos-monkey-config.yml — restrict to a single ASG
deployment:
  enabled: true
  regionIndependence: false
chaos:
  enabled: true
  meanTimeBetweenKillsInWorkDays: 2
  minTimeBetweenKillsInWorkDays: 1
  grouping: APP   # kill one instance per app, not per cluster
  exceptions:
    - account: production
      region: us-east-1
      detail: "*-canary"   # never kill canary instances
```

```shell
# Apply and trigger a manual kill for testing
chaos-monkey --app my-service --account staging --dry-run false