Site Reliability Engineering Manager
About This Role
We're looking for a technical leader who can architect systems and grow people in equal measure. You'll build our SRE practice from the ground up: defining SLOs that matter, establishing incident response rituals, and mentoring engineers to think like operators. This is hands-on leadership: you'll split time between designing observability infrastructure and developing your team's capabilities. The challenge is real: take production systems to the next order of scale while building a culture where reliability isn't an afterthought, it's engineered in.

Our Stack
- Cloud & orchestration: Kubernetes · Terraform · AWS/GCP/Azure · Helm · service mesh patterns
- Observability: Prometheus · Grafana · Datadog · PagerDuty · distributed tracing
- CI/CD & automation: GitHub Actions · ArgoCD · GitOps workflows · infrastructure-as-code
- SRE tooling: Vault for secrets · Consul for service discovery · chaos engineering frameworks

What You'll Do
- Define SRE strategy and reliability roadmap aligned to business growth, translating uptime goals into actionable SLO/SLA frameworks and platform investments
- Build and lead a distributed SRE team of 4–6 engineers, owning hiring, mentoring, performance development, and career progression while maintaining technical credibility through hands-on contributions
- Drive incident response processes end-to-end: establish on-call rotations, facilitate blameless post-mortems, analyze root causes with data, and implement systemic fixes that prevent recurrence
- Architect observability infrastructure using Prometheus, Grafana, and Datadog to provide actionable insights into system health, performance bottlenecks, and capacity planning needs
- Partner with engineering, product, and executive stakeholders to balance feature velocity with operational stability, translating technical risk into business language and securing resources for platform investments
- Own infrastructure-as-code practices using Terraform and Kubernetes, setting standards for deployment automation, configuration management, and cloud resource governance
- Establish SRE culture rituals (error budgets, chaos engineering experiments, production readiness reviews) that shift reliability left without slowing teams down

What We're Looking For
- 8+ years of professional experience in SRE, DevOps, or infrastructure engineering roles, with demonstrated impact on system reliability at scale
- 2+ years managing and developing technical teams: you've hired engineers, run 1:1s, and grown people's careers while maintaining your own technical credibility
- Deep hands-on expertise with Kubernetes in production environments: you've debugged pod scheduling issues at 3am, not just followed tutorials
- Strong infrastructure-as-code skills with Terraform or similar tooling, with a focus on building reusable, well-documented modules that teams actually want to use
- Production experience designing and operating monitoring and observability systems (Prometheus, Grafana, Datadog, or equivalent): you know the difference between metrics, logs, and traces and when to use each
- Proven ability to lead incident response and build blameless post-mortem culture: calm under pressure, analytical when debugging cascading failures
- Track record of defining SLIs/SLOs/SLAs and using them to drive technical decisions and resource allocation, not just compliance checkboxes
- Effective communication skills that translate complex technical tradeoffs into clear recommendations for engineering leadership and product stakeholders

Nice to Have
- Multi-cloud experience across AWS, GCP, or Azure with architectural understanding of when to use managed services vs. self-hosted infrastructure
- Background in security or compliance domains (SOC 2, HIPAA, PCI-DSS) with practical experience integrating security controls into CI/CD pipelines
- Certifications such as AWS Solutions Architect, Google Cloud Professional, CKA (Certified Kubernetes Administrator), or equivalent

Bonus Points
- Open-source contributions to infrastructure tooling, observability projects, or Kubernetes ecosystem components
- Conference talks or blog posts about SRE practices, reliability engineering, or platform engineering
- Experience with chaos engineering practices or building resilience testing frameworks
- Prior work scaling infrastructure through 10x growth or major architectural migrations
