Jobs.ca

Senior Site Reliability Engineer

Finite State
19 days ago
Remote
United States, Canada
$215,000 - $250,000 per year
Senior Level

Top Benefits

Equity for all employees
Medical, dental, and vision coverage
401(k) retirement plan

About the role

Who you are

  • This individual will bring deep experience in reliability engineering, distributed systems, and production operations—along with a forward-thinking mindset around AI-assisted development and infrastructure-as-code
  • If you are passionate about building resilient systems, defining SLOs that actually matter, and leveraging AI tooling to accelerate operational excellence, this role is for you
  • 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering
  • Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale
  • Deep experience building and managing on-call rotations and incident management processes
  • Strong background in distributed systems and cloud-native architectures
  • Hands-on experience with Honeycomb, Grafana, AWS, Vercel, and Supabase
  • Strong experience with observability instrumentation and telemetry design
  • Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar)
  • Experience designing resilient CI/CD pipelines
  • Deep understanding of high-availability, scalability, and performance engineering principles
  • Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows
  • Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations
  • Strong interest in building AI-native operational practices

Leadership & Communication

  • Ability to operate as both strategic architect and hands-on implementer
  • Strong written and verbal communication skills
  • Experience influencing cross-functional teams
  • Comfort working in fast-paced, high-growth environments
  • Experience supporting AI/ML workloads in production
  • Experience building internal developer platforms (IDP)
  • Experience with cost observability and FinOps practices
  • Experience scaling observability in high-growth SaaS environments

What the job involves

  • We are seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineering leader to define, architect, and drive a modern observability and reliability strategy for an AI-first development organization
  • This is a highly impactful technical leadership role responsible for establishing best-in-class operational practices, reliability standards, and AI-enabled infrastructure automation across our product ecosystem
  • Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity
  • Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads
  • Define and implement a comprehensive observability framework across applications and infrastructure
  • Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives
  • Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms
  • Drive best practices in error budgeting, alert design, and production health monitoring

Operational Excellence

  • Define and evolve incident management processes, including on-call structures and escalation models, postmortems and blameless retrospectives, and runbooks and operational playbooks
  • Improve system reliability, performance, scalability, and cost efficiency
  • Establish operational KPIs and reliability dashboards for engineering and leadership visibility
  • Lead reliability reviews for new architecture and product initiatives

Infrastructure Engineering

  • Architect and implement scalable cloud infrastructure primarily within AWS
  • Work closely with modern application platforms such as Vercel and Supabase
  • Implement and improve Infrastructure-as-Code practices
  • Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation
  • Ensure production-grade security, compliance, and resilience standards

AI-First Enablement

  • Champion the use of AI tools to accelerate infrastructure provisioning, improve operational workflows, enhance observability signal quality, and automate incident response and remediation
  • Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability

Technical Leadership

  • Serve as a senior technical authority for reliability and infrastructure decisions
  • Mentor engineers on production best practices
  • Influence architectural decisions to improve system resilience and maintainability
  • Drive a culture of reliability, accountability, and continuous improvement

What Success Looks Like in the First 6 Months

  • Clear SLO framework implemented across core services
  • Observability tooling standardized and adopted organization-wide
  • On-call and incident management processes running smoothly with measurable improvements
  • AI-driven infrastructure workflows reducing operational toil

Benefits

  • The Usuals: We offer equity for everyone, plus medical, dental, and vision coverage, a 401(k), short- and long-term disability, and life insurance
  • Work-life synergy: We take time off and flexibility seriously. Our team is fully remote across the US, enjoys unlimited PTO, and is entitled to 12 weeks of parental leave
  • Mission driven team: We are a team of experts and dedicated professionals—but more than that, we are a diverse group of individuals who support one another in our work and in our mission
  • Transparent culture: With bi-weekly business updates, an open-door policy with the executive team, and a fireside chat series with our board and other stakeholders, we make sure that everyone on the team has access to what’s going on throughout the company
  • Social & supportive environment: We work hard to ensure that Finite State is a fun, comfortable place to work no matter your background. Teams work together to ensure that everyone feels supported and has their needs met, and our vibrant social calendar gives us a myriad of opportunities to kick back and have a good time

About Finite State

Industry: Computer and Network Security
Company size: 51-200 employees

We enable product security teams – the guardians of the connected world – to protect the devices we rely on every day through market-leading software threat, vulnerability, and risk management.

Finite State is the leading provider of product cyber security solutions for connected devices and embedded systems, including IoT, medical devices, and OT/ICS.
