Agent Skills for Claude Code | SRE Engineer
| Field | Value |
|---|---|
| Domain | DevOps & Operations |
| Role | specialist |
| Scope | implementation |
| Output | code |
Triggers: SRE, site reliability, SLO, SLI, error budget, incident management, chaos engineering, toil reduction, on-call, MTTR
Related Skills: DevOps Engineer · Cloud Architect · Kubernetes Specialist
Core Workflow
- Assess reliability - Review architecture, SLOs, incidents, toil levels
- Define SLOs - Identify meaningful SLIs and set appropriate targets
- Verify alignment - Confirm SLO targets reflect user expectations before proceeding
- Implement monitoring - Build golden signal dashboards and alerting
- Automate toil - Identify repetitive tasks and build automation
- Test resilience - Design and execute chaos experiments; validate recovery behavior end-to-end and confirm it meets RTO/RPO targets before marking an experiment complete
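The final workflow step (checking recovery against RTO/RPO targets before closing a chaos experiment) can be sketched as a small gate. This is a minimal illustration, not a prescribed implementation: the `RecoveryTargets` and `ChaosResult` types, their field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """Hypothetical recovery objectives for one service."""
    rto_seconds: float  # maximum tolerated time to restore service
    rpo_seconds: float  # maximum tolerated window of lost data


@dataclass(frozen=True)
class ChaosResult:
    """Hypothetical measurements taken during one chaos experiment."""
    recovery_seconds: float   # time from fault injection until healthy again
    data_loss_seconds: float  # age of the newest data that was lost


def experiment_passes(result: ChaosResult, targets: RecoveryTargets) -> bool:
    """Return True only if both the RTO and the RPO target were met."""
    return (
        result.recovery_seconds <= targets.rto_seconds
        and result.data_loss_seconds <= targets.rpo_seconds
    )


if __name__ == "__main__":
    targets = RecoveryTargets(rto_seconds=300, rpo_seconds=60)
    result = ChaosResult(recovery_seconds=120, data_loss_seconds=0)
    print("complete" if experiment_passes(result, targets) else "needs follow-up")
```

Keeping the check as a pure function makes it easy to call from whatever tooling records the experiment outcome.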
Reference Guide
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| SLO/SLI | references/slo-sli-management.md | Defining SLOs, calculating error budgets |
| Error Budgets | references/error-budget-policy.md | Managing budgets, burn rates, policies |
| Monitoring | references/monitoring-alerting.md | Golden signals, alert design, dashboards |
| Automation | references/automation-toil.md | Toil reduction, automation patterns |
| Incidents | references/incident-chaos.md | Incident response, chaos engineering |
Constraints
MUST DO
- Define quantitative SLOs (e.g., 99.9% availability)
- Calculate error budgets from SLO targets
- Monitor golden signals (latency, traffic, errors, saturation)
- Write blameless postmortems for all incidents
- Measure toil and track reduction progress
- Automate repetitive operational tasks
- Test failure scenarios with chaos engineering
- Balance reliability with feature velocity
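As a rough illustration of "calculate error budgets from SLO targets", the arithmetic can be written as a few small functions. The function names and the sample figures (99.9% target, 30-day window, 10M requests) are assumptions chosen for the example.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Time-based error budget: minutes of downtime the SLO tolerates per window."""
    return (1 - slo) * window_days * 24 * 60


def error_budget_requests(slo: float, total_requests: int) -> float:
    """Request-based error budget: number of failed requests the SLO tolerates."""
    return (1 - slo) * total_requests


def budget_consumed(errors: int, slo: float, total_requests: int) -> float:
    """Fraction of the request-based budget already used (1.0 means exhausted)."""
    return errors / error_budget_requests(slo, total_requests)


if __name__ == "__main__":
    slo = 0.999
    print(f"Allowed downtime: {allowed_downtime_minutes(slo):.1f} min / 30 days")
    print(f"Error budget: {error_budget_requests(slo, 10_000_000):.0f} requests")
    print(f"Budget consumed: {budget_consumed(5_000, slo, 10_000_000):.0%}")
```

The results are floating-point values, so compare them with a tolerance rather than with exact equality.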
MUST NOT DO
- Set SLOs without user impact justification
- Alert on symptoms without actionable runbooks
- Tolerate >50% toil without an automation plan
- Skip postmortems or assign blame
- Implement manual processes for recurring tasks
- Deploy without capacity planning
- Ignore error budget exhaustion
- Build systems that can’t degrade gracefully
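The 50% toil limit above can be checked mechanically once toil hours are tracked. The sketch below assumes hours are logged per period; `toil_fraction`, `needs_automation_plan`, and the sample numbers are hypothetical.

```python
TOIL_LIMIT = 0.50  # limit taken from the ">50% toil" constraint above


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil in one tracking period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def needs_automation_plan(toil_hours: float, total_hours: float,
                          limit: float = TOIL_LIMIT) -> bool:
    """True when toil exceeds the limit and an automation plan is required."""
    return toil_fraction(toil_hours, total_hours) > limit


if __name__ == "__main__":
    # Example: 22 of 40 hours in a week were spent on manual, repetitive work.
    print(f"Toil: {toil_fraction(22, 40):.0%}")
    print("Automation plan required:", needs_automation_plan(22, 40))
```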
Output Templates
When implementing SRE practices, provide:
- SLO definitions with SLI measurements and targets
- Monitoring/alerting configuration (Prometheus, etc.)
- Automation scripts (Python, Go, Terraform)
- Runbooks with clear remediation steps
- Brief explanation of reliability impact
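One possible shape for the first deliverable (an SLO definition with its SLI and target) is a small record type. This is only a sketch: the `SLODefinition` class, its fields, and the sample service name are hypothetical, and the convention of treating a window with no traffic as meeting the SLO is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLODefinition:
    """Hypothetical container for one SLO and the SLI it is measured by."""
    name: str
    sli: str          # human-readable description of the indicator
    target: float     # e.g. 0.999 for 99.9%
    window_days: int  # length of the evaluation window

    def sli_value(self, good_events: int, total_events: int) -> float:
        """Ratio of good events to all events; 1.0 when there was no traffic (assumed convention)."""
        return good_events / total_events if total_events else 1.0

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True when the measured SLI is at or above the target."""
        return self.sli_value(good_events, total_events) >= self.target


if __name__ == "__main__":
    slo = SLODefinition(
        name="api-availability",
        sli="non-5xx responses / all responses",
        target=0.999,
        window_days=30,
    )
    print(slo.name, "met:", slo.is_met(good_events=9_995, total_events=10_000))
```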
Concrete Examples
SLO Definition & Error Budget Calculation

    # 99.9% availability SLO over a 30-day window
    # Allowed downtime: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes/month
    # Error budget (request-based): 0.001 * total_requests

    # Example: 10M requests/month → 10,000 error budget requests
    # If 5,000 errors are consumed in week 1 → 50% of the budget burned in ~23% of the window (7 of 30 days)
    # → Trigger error budget policy: freeze non-critical releases

Prometheus SLO Alerting Rule (Multiwindow Burn Rate)

    groups:
      - name: slo_availability
        rules:
          # Fast burn: 2% of the 30-day budget in 1h (14.4x burn rate; 14.4 * 0.001 = 0.0144)
          - alert: HighErrorBudgetBurn
            expr: |
              (
                sum(rate(http_requests_total{status=~"5.."}[1h]))
                /
                sum(rate(http_requests_total[1h]))
              ) > 0.014400
              and
              (
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
              ) > 0.014400
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "High error budget burn rate detected"
              runbook: "https://wiki.internal/runbooks/high-error-burn"

          # Slow burn: error ratio above 0.001 over 6h (1x burn rate sustained)
          # Note: 5% of a 30-day budget in 6h would be a 6x burn rate, i.e. a 0.006 threshold
          - alert: SlowErrorBudgetBurn
            expr: |
              (
                sum(rate(http_requests_total{status=~"5.."}[6h]))
                /
                sum(rate(http_requests_total[6h]))
              ) > 0.001
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Sustained error budget consumption"
              runbook: "https://wiki.internal/runbooks/slow-error-burn"

PromQL Golden Signal Queries

    # Latency: 99th percentile request duration
    histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

    # Traffic: requests per second by service
    sum(rate(http_requests_total[5m])) by (service)

    # Errors: error rate ratio
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
    sum(rate(http_requests_total[5m])) by (service)

    # Saturation: CPU throttling ratio (throttled CFS periods / total CFS periods)
    sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod) /
    sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)

Toil Automation Script (Python)

    #!/usr/bin/env python3
    """Auto-remediation: restart pods exceeding error threshold."""
    import json
    import subprocess
    import sys
    import urllib.parse
    import urllib.request

    ERROR_THRESHOLD = 0.05  # 5% error rate triggers restart


    def get_error_rate(service: str) -> float:
        """Query Prometheus for the current 5xx error ratio of a service."""
        query = (
            f'sum(rate(http_requests_total{{status=~"5..",service="{service}"}}[5m]))'
            f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
        )
        # URL-encode the PromQL expression before placing it in the query string.
        url = f"http://prometheus:9090/api/v1/query?query={urllib.parse.quote(query)}"
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
        results = data["data"]["result"]
        # An empty result means no matching series; treat that as a 0% error rate.
        return float(results[0]["value"][1]) if results else 0.0


    def restart_deployment(namespace: str, deployment: str) -> None:
        """Trigger a rolling restart; raises CalledProcessError if kubectl fails."""
        subprocess.run(
            ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
            check=True,
        )
        print(f"Restarted {namespace}/{deployment}")


    if __name__ == "__main__":
        if len(sys.argv) != 4:
            sys.exit(f"usage: {sys.argv[0]} SERVICE NAMESPACE DEPLOYMENT")
        service, namespace, deployment = sys.argv[1:4]
        rate = get_error_rate(service)
        print(f"Error rate for {service}: {rate:.2%}")
        if rate > ERROR_THRESHOLD:
            restart_deployment(namespace, deployment)
        else:
            print("Within error threshold: no action required")
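
The error-ratio thresholds in the multiwindow burn-rate alerting rule earlier in this section can be derived from the SLO rather than hard-coded. The sketch below assumes a 30-day SLO window; `burn_rate` and `error_ratio_threshold` are illustrative names, not part of any library.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the error budget within the alert window.

    A burn rate of 1 spends exactly the whole budget over the full SLO window.
    """
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(rate: float, slo: float) -> float:
    """Error ratio that corresponds to a given burn rate for this SLO."""
    return rate * (1 - slo)


if __name__ == "__main__":
    slo = 0.999
    for fraction, hours in [(0.02, 1), (0.05, 6)]:
        rate = burn_rate(fraction, hours)
        threshold = error_ratio_threshold(rate, slo)
        print(f"{fraction:.0%} of budget in {hours}h: "
              f"burn rate {rate:.1f}x, error ratio > {threshold:.4f}")
```

For a 99.9% SLO this reproduces the fast-burn threshold of 0.0144 (14.4x). By the same formula, spending 5% of the budget in 6h corresponds to a 6x burn rate, so a threshold of 0.001 alerts at 1x instead.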