Agent Skills for Claude Code | Monitoring Expert
| Field | Value |
|---|---|
| Domain | DevOps & Operations |
| Role | specialist |
| Scope | implementation |
| Output | code |
Triggers: monitoring, observability, logging, metrics, tracing, alerting, Prometheus, Grafana, DataDog, APM, performance testing, load testing, profiling, capacity planning, bottleneck
Related Skills: DevOps Engineer · Debugging Wizard · Architecture Designer
Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.
Core Workflow
- Assess — Identify what needs monitoring (SLIs, critical paths, business metrics)
- Instrument — Add logging, metrics, and traces to the application (see examples below)
- Collect — Configure aggregation and storage (Prometheus scrape, log shipper, OTLP endpoint); verify data arrives before proceeding
- Visualize — Build dashboards using RED (Rate/Errors/Duration) or USE (Utilization/Saturation/Errors) methods
- Alert — Define threshold and anomaly alerts on critical paths; validate no false-positive flood before shipping
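
As a rough illustration of the RED method named in the Visualize step, the sketch below derives rate, error ratio, and p95 duration from a window of request records. The record shape (`status`, `durationMs`), the helper name `summarizeRed`, and the nearest-rank percentile are assumptions made for illustration; in a real setup these values would normally come from queries against the collected metrics rather than from application code.

```javascript
// Hypothetical helper: summarize a window of request records using RED.
// Each record is assumed to look like { status: 200, durationMs: 42 }.
function summarizeRed(records, windowSeconds) {
  if (records.length === 0) {
    return { ratePerSec: 0, errorRatio: 0, p95Ms: 0 };
  }
  // Rate: requests per second over the window.
  const ratePerSec = records.length / windowSeconds;
  // Errors: share of requests that returned a 5xx status.
  const errors = records.filter((r) => r.status >= 500).length;
  const errorRatio = errors / records.length;
  // Duration: nearest-rank 95th percentile of observed latencies.
  const sorted = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // 1-based rank
  const p95Ms = sorted[rank - 1];
  return { ratePerSec, errorRatio, p95Ms };
}

// Made-up sample data covering a 60-second window.
const sample = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 70 },
];
console.log(summarizeRed(sample, 60));
```

With so few samples the p95 is simply the slowest request, which is one reason dashboards usually compute percentiles from histogram buckets over a longer window.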
Quick-Start Examples
Structured Logging (Node.js / Pino)

    import pino from 'pino';

    const logger = pino({ level: 'info' });

    // Good: structured fields, includes correlation ID
    logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');

    // Bad: string interpolation, no correlation
    console.log(`Order created for user ${userId}`);

Prometheus Metrics (Node.js)

    import express from 'express';
    import { Counter, Histogram, register } from 'prom-client';

    const app = express();

    const httpRequests = new Counter({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'route', 'status'],
    });

    const httpDuration = new Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request latency',
      labelNames: ['method', 'route'],
      buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
    });

    // Instrument every route
    app.use((req, res, next) => {
      const end = httpDuration.startTimer();
      res.on('finish', () => {
        // Label with the matched route pattern (e.g. /orders/:id), not the raw
        // path, so IDs in URLs do not create an unbounded number of series.
        const route = req.route?.path ?? 'unmatched';
        httpRequests.inc({ method: req.method, route, status: res.statusCode });
        end({ method: req.method, route });
      });
      next();
    });

    // Expose scrape endpoint
    app.get('/metrics', async (req, res) => {
      res.set('Content-Type', register.contentType);
      res.end(await register.metrics());
    });

OpenTelemetry Tracing (Node.js)

    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
    import { trace, SpanStatusCode } from '@opentelemetry/api';

    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
    });
    sdk.start();

    // Manual span around a critical operation
    const tracer = trace.getTracer('order-service');

    async function processOrder(orderId) {
      const span = tracer.startSpan('order.process');
      span.setAttribute('order.id', orderId);
      try {
        // db stands in for the application's own data layer
        const result = await db.saveOrder(orderId);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
        throw err;
      } finally {
        span.end();
      }
    }

Prometheus Alerting Rule

    groups:
      - name: api.rules
        rules:
          - alert: HighErrorRate
            # Aggregate per route before dividing. Without sum by (route), each
            # 5xx series is divided by itself and the ratio is always 1.
            expr: |
              sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
                / sum by (route) (rate(http_requests_total[5m])) > 0.05
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Error rate above 5% on {{ $labels.route }}"

k6 Load Test

    import http from 'k6/http';
    import { check, sleep } from 'k6';

    export const options = {
      stages: [
        { duration: '1m', target: 50 }, // ramp up
        { duration: '5m', target: 50 }, // sustained load
        { duration: '1m', target: 0 },  // ramp down
      ],
      thresholds: {
        http_req_duration: ['p(95)<500'], // 95th percentile < 500 ms
        http_req_failed: ['rate<0.01'],   // error rate < 1%
      },
    };

    export default function () {
      const res = http.get('https://api.example.com/orders');
      check(res, { 'status is 200': (r) => r.status === 200 });
      sleep(1);
    }

Reference Guide
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Logging | references/structured-logging.md | Pino, JSON logging |
| Metrics | references/prometheus-metrics.md | Counter, Histogram, Gauge |
| Tracing | references/opentelemetry.md | OpenTelemetry, spans |
| Alerting | references/alerting-rules.md | Prometheus alerts |
| Dashboards | references/dashboards.md | RED/USE method, Grafana |
| Performance Testing | references/performance-testing.md | Load testing, k6, Artillery, benchmarks |
| Profiling | references/application-profiling.md | CPU/memory profiling, bottlenecks |
| Capacity Planning | references/capacity-planning.md | Scaling, forecasting, budgets |
Constraints
MUST DO
- Use structured logging (JSON)
- Include request IDs for correlation
- Set up alerts for critical paths
- Monitor business metrics, not just technical
- Use appropriate metric types (counter/gauge/histogram)
- Implement health check endpoints
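
A health check endpoint from the list above might be built around something like the following sketch. The `checkHealth` helper, the probe names, and the `ok`/`degraded` wording are all hypothetical; the only idea taken as given is that a dependency failure should surface as a non-200 response so that load balancers and orchestrators can react.

```javascript
// Hypothetical readiness check: runs each dependency probe and reports
// an HTTP-style status code plus per-dependency detail.
async function checkHealth(probes) {
  const checks = {};
  let healthy = true;
  for (const [name, probe] of Object.entries(probes)) {
    try {
      await probe(); // a probe signals failure by throwing or rejecting
      checks[name] = 'ok';
    } catch (err) {
      healthy = false;
      checks[name] = `failed: ${err.message}`;
    }
  }
  // 503 tells callers to stop routing traffic to this instance.
  return {
    statusCode: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'degraded', checks },
  };
}

// Example probes; real ones would ping the database, cache, and so on.
const probes = {
  database: async () => {},
  cache: async () => { throw new Error('connection refused'); },
};

checkHealth(probes).then((report) => console.log(JSON.stringify(report)));
```

In an Express app the result could be returned from a route such as `/healthz` (the path is an assumption) with `res.status(report.statusCode).json(report.body)`.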
MUST NOT DO
- Log sensitive data (passwords, tokens, PII)
- Alert on every error (alert fatigue)
- Use string interpolation in logs (use structured fields)
- Skip correlation IDs in distributed systems
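
To illustrate the rule against logging sensitive data, the sketch below masks a deny-list of field names before an event is serialized. The key list, the `redact` helper, and the `[REDACTED]` marker are assumptions for illustration, and the helper only handles plain JSON-like values. Pino also ships a built-in `redact` option, which is usually the better choice in a real service.

```javascript
// Hypothetical deny-list; adjust to the fields your payloads actually carry.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'ssn']);

// Returns a deep copy of a JSON-like value with sensitive fields masked.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;
  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    // Key names are compared case-insensitively against the deny-list.
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
  }
  return out;
}

const event = {
  requestId: 'req-123',
  user: { id: 42, password: 'hunter2' },
  headers: { Authorization: 'Bearer abc' },
};

// One JSON object per line, with structured fields instead of interpolation.
console.log(JSON.stringify({ level: 'info', msg: 'user.login', ...redact(event) }));
```

A deny-list only catches the field names it knows about, so it complements, rather than replaces, the habit of not passing secrets to the logger in the first place.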