Senior Site Reliability Engineer

Finite State 19 days ago

Remote

United States, Canada

$215,000 - $250,000/yearly

Senior Level

Top Benefits

Equity for all employees

Medical, dental, and vision coverage

401(k) retirement plan

About the role

Who you are

This individual will bring deep experience in reliability engineering, distributed systems, and production operations—along with a forward-thinking mindset around AI-assisted development and infrastructure-as-code
If you are passionate about building resilient systems, defining SLOs that actually matter, and leveraging AI tooling to accelerate operational excellence, this role is for you
10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering
Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale
Deep experience building and managing on-call rotations and incident management processes
Strong background in distributed systems and cloud-native architectures
Hands-on experience with:
Honeycomb
Grafana
AWS
Vercel
Supabase
Strong experience with observability instrumentation and telemetry design
Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar)
Experience designing resilient CI/CD pipelines
Deep understanding of high-availability, scalability, and performance engineering principles
Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows
Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations
Strong interest in building AI-native operational practices
Leadership & Communication
Ability to operate as both strategic architect and hands-on implementer
Strong written and verbal communication skills
Experience influencing cross-functional teams
Comfort working in fast-paced, high-growth environments
Experience supporting AI/ML workloads in production
Experience building internal developer platforms (IDP)
Experience with cost observability and FinOps practices
Experience scaling observability in high-growth SaaS environments

What the job involves

We are seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineering leader to define, architect, and drive a modern observability and reliability strategy for an AI-first development organization
This is a highly impactful technical leadership role responsible for establishing best-in-class operational practices, reliability standards, and AI-enabled infrastructure automation across our product ecosystem
Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity
Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads
Define and implement a comprehensive observability framework across applications and infrastructure
Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives
Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms
Drive best practices in error budgeting, alert design, and production health monitoring
Operational Excellence
Define and evolve incident management processes, including:
On-call structures and escalation models
Postmortems and blameless retrospectives
Runbooks and operational playbooks
Improve system reliability, performance, scalability, and cost efficiency
Establish operational KPIs and reliability dashboards for engineering and leadership visibility
Lead reliability reviews for new architecture and product initiatives
Infrastructure Engineering
Architect and implement scalable cloud infrastructure primarily within AWS
Work closely with modern application platforms such as Vercel and Supabase
Implement and improve Infrastructure-as-Code practices
Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation
Ensure production-grade security, compliance, and resilience standards
AI-First Enablement
Champion the use of AI tools to:
Accelerate infrastructure provisioning
Improve operational workflows
Enhance observability signal quality
Automate incident response and remediation
Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability
Technical Leadership
Serve as a senior technical authority for reliability and infrastructure decisions
Mentor engineers on production best practices
Influence architectural decisions to improve system resilience and maintainability
Drive a culture of reliability, accountability, and continuous improvement
What Success Looks Like in the First 6 Months
Clear SLO framework implemented across core services
Observability tooling standardized and adopted organization-wide
On-call and incident management processes running smoothly with measurable improvements
AI-driven infrastructure workflows reducing operational toil

Benefits

The Usuals: We offer equity for everyone, plus medical, dental, and vision coverage, a 401(k), short- and long-term disability, and life insurance
Work-life synergy: We take time off and flexibility seriously. Our team is fully remote all across the US, enjoys unlimited PTO, and is entitled to 12 weeks of parental leave
Mission driven team: We are a team of experts and dedicated professionals—but more than that, we are a diverse group of individuals who support one another in our work and in our mission
Transparent culture: With bi-weekly business updates, an open-door policy with the executive team, and a fireside chat series with our board and other stakeholders, we make sure that everyone on the team has access to what’s going on throughout the company
Social & supportive environment: We work hard to ensure that Finite State is a fun, comfortable place to work no matter your background. Teams work together to ensure that everyone feels supported and has their needs met, and our vibrant social calendar gives us a myriad of opportunities to kick back and have a good time

About Finite State

Computer and Network Security

51-200

We enable product security teams – the guardians of the connected world – to protect the devices we rely on every day through market-leading software threat, vulnerability, and risk management.

Finite State is the leading provider of product cyber security solutions for connected devices and embedded systems, including IoT, medical devices, and OT/ICS.
