Senior Site Reliability Engineer
Remote
United States, Canada
$215,000 - $250,000/yearly
Senior Level
Top Benefits
Equity for all employees
Medical, dental, and vision coverage
401(k) retirement plan
About the role
Who you are
- This individual will bring deep experience in reliability engineering, distributed systems, and production operations—along with a forward-thinking mindset around AI-assisted development and infrastructure-as-code
- If you are passionate about building resilient systems, defining SLOs that actually matter, and leveraging AI tooling to accelerate operational excellence, this role is for you
- 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering
- Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale
- Deep experience building and managing on-call rotations and incident management processes
- Strong background in distributed systems and cloud-native architectures
- Hands-on experience with:
- Honeycomb
- Grafana
- AWS
- Vercel
- Supabase
- Strong experience with observability instrumentation and telemetry design
- Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar)
- Experience designing resilient CI/CD pipelines
- Deep understanding of high-availability, scalability, and performance engineering principles
- Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows
- Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations
- Strong interest in building AI-native operational practices
- Leadership & Communication
- Ability to operate as both strategic architect and hands-on implementer
- Strong written and verbal communication skills
- Experience influencing cross-functional teams
- Comfort working in fast-paced, high-growth environments
- Experience supporting AI/ML workloads in production
- Experience building internal developer platforms (IDP)
- Experience with cost observability and FinOps practices
- Experience scaling observability in high-growth SaaS environments
What the job involves
- We are seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineering leader to define, architect, and drive a modern observability and reliability strategy for an AI-first development organization
- This is a highly impactful technical leadership role responsible for establishing best-in-class operational practices, reliability standards, and AI-enabled infrastructure automation across our product ecosystem
- Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity
- Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads
- Define and implement a comprehensive observability framework across applications and infrastructure
- Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives
- Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms
- Drive best practices in error budgeting, alert design, and production health monitoring
- Operational Excellence
- Define and evolve incident management processes, including:
- On-call structures and escalation models
- Postmortems and blameless retrospectives
- Runbooks and operational playbooks
- Improve system reliability, performance, scalability, and cost efficiency
- Establish operational KPIs and reliability dashboards for engineering and leadership visibility
- Lead reliability reviews for new architecture and product initiatives
- Infrastructure Engineering
- Architect and implement scalable cloud infrastructure primarily within AWS
- Work closely with modern application platforms such as Vercel and Supabase
- Implement and improve Infrastructure-as-Code practices
- Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation
- Ensure production-grade security, compliance, and resilience standards
- AI-First Enablement
- Champion the use of AI tools to:
- Accelerate infrastructure provisioning
- Improve operational workflows
- Enhance observability signal quality
- Automate incident response and remediation
- Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability
- Technical Leadership
- Serve as a senior technical authority for reliability and infrastructure decisions
- Mentor engineers on production best practices
- Influence architectural decisions to improve system resilience and maintainability
- Drive a culture of reliability, accountability, and continuous improvement
- What Success Looks Like in the First 6 Months
- Clear SLO framework implemented across core services
- Observability tooling standardized and adopted organization-wide
- On-call and incident management processes running smoothly with measurable improvements
- AI-driven infrastructure workflows reducing operational toil
Benefits
- The Usuals: We offer equity for everyone, plus medical, dental, and vision coverage, a 401(k), short- and long-term disability, and life insurance
- Work-life synergy: We take time off and flexibility seriously. Our team is fully remote all across the US, enjoys unlimited PTO, and is entitled to 12 weeks of parental leave
- Mission-driven team: We are a team of experts and dedicated professionals, but more than that, we are a diverse group of individuals who support one another in our work and in our mission
- Transparent culture: With bi-weekly business updates, an open-door policy with the executive team, and a fireside chat series with our board and other stakeholders, we make sure that everyone on the team has access to what’s going on throughout the company
- Social & supportive environment: We work hard to ensure that Finite State is a fun, comfortable place to work no matter your background. Teams work together to ensure that everyone feels supported and has their needs met, and our vibrant social calendar gives us a myriad of opportunities to kick back and have a good time
About Finite State
Computer and Network Security
51-200
We enable product security teams – the guardians of the connected world – to protect the devices we rely on every day through market-leading software threat, vulnerability, and risk management.
Finite State is the leading provider of product cyber security solutions for connected devices and embedded systems, including IoT, medical devices, and OT/ICS.