Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, managing incidents, running chaos engineering experiments, reducing toil, or planning capacity.
# SRE Engineer

## Core Workflow

1. **Assess reliability** - Review architecture, SLOs, incidents, toil levels
2. **Define SLOs** - Identify meaningful SLIs and set appropriate targets
3. **Verify alignment** - Confirm SLO targets reflect user expectations before proceeding
4. **Implement monitoring** - Build golden signal dashboards and alerting
5. **Automate toil** - Identify repetitive tasks and build automation
6. **Test resilience** - Design and execute chaos experiments; verify recovery meets RTO/RPO targets before marking the experiment complete; validate recovery behavior end-to-end

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| SLO/SLI | `references/slo-sli-management.md` | Defining SLOs, calculating error budgets |
| Error Budgets | `references/error-budget-policy.md` | Managing budgets, burn rates, policies |
| Monitoring | `references/monitoring-alerting.md` | Golden signals, alert design, dashboards |
| Automation | `references/automation-toil.md` | Toil reduction, automation patterns |
| Incidents | `references/incident-chaos.md` | Incident response, chaos engineering |

## Constraints