Site Reliability Engineering Manager

Engineering · Full time · Atlanta, GA · Senior level

About This Role

We're looking for a technical leader who can architect systems and grow people in equal measure. You'll build our SRE practice from the ground up - defining SLOs that matter, establishing incident response rituals, and mentoring engineers to think like operators. This is hands-on leadership: you'll split time between designing observability infrastructure and developing your team's capabilities. The challenge is real: take production systems to the next order of scale while building a culture where reliability isn't an afterthought, it's engineered in.

Our Stack

- Cloud & orchestration: Kubernetes · Terraform · AWS/GCP/Azure · Helm · service mesh patterns
- Observability: Prometheus · Grafana · Datadog · PagerDuty · distributed tracing
- CI/CD & automation: GitHub Actions · ArgoCD · GitOps workflows · infrastructure-as-code
- SRE tooling: Vault for secrets · Consul for service discovery · chaos engineering frameworks

What You'll Do

- Define SRE strategy and reliability roadmap aligned to business growth, translating uptime goals into actionable SLO/SLA frameworks and platform investments
- Build and lead a distributed SRE team of 4–6 engineers, owning hiring, mentoring, performance development, and career progression while maintaining technical credibility through hands-on contributions
- Drive incident response processes end-to-end - establish on-call rotations, facilitate blameless post-mortems, analyze root causes with data, and implement systemic fixes that prevent recurrence
- Architect observability infrastructure using Prometheus, Grafana, and Datadog to provide actionable insights into system health, performance bottlenecks, and capacity planning needs
- Partner with engineering, product, and executive stakeholders to balance feature velocity with operational stability, translating technical risk into business language and securing resources for platform investments
- Own infrastructure-as-code practices using Terraform and Kubernetes, setting standards for deployment automation, configuration management, and cloud resource governance
- Establish SRE culture rituals - error budgets, chaos engineering experiments, production readiness reviews - that shift reliability left without slowing teams down

What We're Looking For

- 8+ years of professional experience in SRE, DevOps, or infrastructure engineering roles, with demonstrated impact on system reliability at scale
- 2+ years managing and developing technical teams - you've hired engineers, run 1:1s, and grown people's careers while maintaining your own technical credibility
- Deep hands-on expertise with Kubernetes in production environments - you've debugged pod scheduling issues at 3am, not just followed tutorials
- Strong infrastructure-as-code skills with Terraform or similar tooling, with a focus on building reusable, well-documented modules that teams actually want to use
- Production experience designing and operating monitoring and observability systems (Prometheus, Grafana, Datadog, or equivalent) - you know the difference between metrics, logs, and traces and when to use each
- Proven ability to lead incident response and build blameless post-mortem culture - calm under pressure, analytical when debugging cascading failures
- Track record of defining SLIs/SLOs/SLAs and using them to drive technical decisions and resource allocation, not just compliance checkboxes
- Effective communication skills that translate complex technical tradeoffs into clear recommendations for engineering leadership and product stakeholders

Nice to Have

- Multi-cloud experience across AWS, GCP, or Azure with architectural understanding of when to use managed services vs. self-hosted infrastructure
- Background in security or compliance domains (SOC 2, HIPAA, PCI-DSS) with practical experience integrating security controls into CI/CD pipelines
- Certifications such as AWS Solutions Architect, Google Cloud Professional, CKA (Certified Kubernetes Administrator), or equivalent

Bonus Points

- Open-source contributions to infrastructure tooling, observability projects, or Kubernetes ecosystem components
- Conference talks or blog posts about SRE practices, reliability engineering, or platform engineering
- Experience with chaos engineering practices or building resilience testing frameworks
- Prior work scaling infrastructure through 10x growth or major architectural migrations

Skills

Kubernetes · Terraform · AWS · Prometheus · Incident Management · Team Leadership · SLO/SLA Design
