Senior Site Reliability Engineer

Tubi 5 days ago

Toronto

Senior Level

Top Benefits

Medical, dental, and vision coverage from day one

Generous parental leave, childcare, and eldercare support

Monthly wellness reimbursement, generous time off, extra holidays

About the role

Who you are

Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience
5+ years of professional experience in a Site Reliability Engineering, DevOps, or Software Engineering role with a focus on infrastructure and operations
Strong programming proficiency in one or more high-level languages such as Rust, Go, Python, or TypeScript. You should be comfortable writing, testing, and deploying production-grade code
Deep knowledge of AWS services (especially networking, IAM, EKS, ALBs/NLBs, Route 53, CloudWatch)
Proven experience with Kubernetes in production (EKS preferred), including service exposure, networking, and availability engineering
A solid understanding of Linux/Unix operating systems, networking fundamentals (TCP/IP, DNS, HTTP), and the architecture of modern distributed systems
Experience building and managing large-scale monitoring and observability systems using tools such as Datadog, Prometheus, and Grafana
Expertise in designing and implementing CI/CD pipelines using tools such as GitHub Actions and Argo CD
Experience with distributed storage technologies (e.g., Amazon S3) and databases (e.g., PostgreSQL, ScyllaDB, ClickHouse)
Contributions to open-source projects in the SRE, DevOps, or cloud-native ecosystem

What the job involves

Site Reliability Engineering (SRE) at Tubi is not a traditional operations team
We are a software engineering organization that applies a developer's mindset and toolkit to the challenges of building and running large-scale, distributed systems
Our mission is to engineer resilience from the ground up, enabling our product teams to innovate rapidly while ensuring our users have a stellar experience
We own the availability, latency, performance, and capacity of our platform, and we achieve our goals through a culture of data-driven decision-making, blameless learning, and relentless automation
As a Senior Site Reliability Engineer, you are a hands-on engineer who blends deep software development expertise with a passion for operational excellence
You will be responsible for designing, building, and running the resilient, scalable, and increasingly self-healing systems that power our products
You will apply sound engineering principles to solve our most complex reliability challenges, with a mandate to automate everything, eliminate toil, and write robust, maintainable code
You will be a force multiplier, mentoring other engineers and elevating the site reliability bar for the entire organization
System Architecture & Design: Design, build, and maintain scalable, highly available, and fault-tolerant distributed systems
Partner with development teams as a reliability consultant, reviewing designs and influencing architectural decisions to ensure new services are built with reliability, observability, and performance as core principles, not afterthoughts
Automation & Software Development: Write robust, performant, and maintainable code to automate operational tasks and CI/CD pipelines
Build the internal tools, libraries, and frameworks that enable engineering teams to self-service their observability needs, reducing cognitive load and increasing their velocity
Incident Response & Post-Mortem Analysis: Participate in a 24/7 on-call rotation, acting as a key technical leader and incident commander during critical service disruptions
Conduct deep, blameless root cause analyses (RCAs) that go beyond immediate fixes to identify and address systemic issues
Drive the implementation of corrective actions to prevent the recurrence of incidents
Performance & Capacity Planning: Proactively monitor, measure, and optimize system performance to ensure low latency and high efficiency
Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding
Analyze usage patterns and historical data to forecast capacity needs, ensuring our platform stays ahead of customer demand
As a Senior SRE, you will be at the forefront of applying AI to solve our most critical reliability challenges. This is a hands-on software development role where the "product" you build is an intelligent, automated reliability platform. Your responsibilities will include:
Building AI-Driven Automation: Building and integrating solutions that leverage our AIOps platform. This involves writing the code that consumes signals from the AI system, correlates disparate data sources, automates responses to AI-detected anomalies, and builds self-healing systems triggered by predictive alerts. You will transform AI insights into concrete reliability improvements
Leveraging AI for Code Development: Utilizing AI-assisted coding tools (e.g., Claude Code, Cursor) as a force multiplier in your daily workflow. You will leverage these assistants to write high-quality automation scripts, Terraform modules, Kubernetes manifests, and observability dashboards faster and more efficiently, while applying your expertise to validate and refine their output
Enriching our AI Knowledge Base: Developing and enriching our observability platform's internal knowledge base. You will be responsible for creating and documenting high-quality runbooks and procedural guides that can be ingested and used by AI assistants to provide context-aware troubleshooting guidance to the on-call engineer during an incident
Applying Data Science to Reliability: Treating reliability as a data science problem. You will analyze vast sets of telemetry data to identify trends, build predictive models for system capacity, and proactively identify performance bottlenecks and potential failure modes before they can impact our users

Benefits

Healthcare Coverage: We offer medical, dental, and vision coverage, effective from day one
Family Support: We’re proud to support families of all kinds, and offer generous parental leave, childcare support, and eldercare assistance whenever you need it
Wellness Programs: Monthly wellness reimbursement, generous time off, and additional Tubi Holidays help us support mental and physical wellbeing for you and your family
Continuing Education: From education reimbursement to leadership development to certification support, we’re invested in developing our talent so you can take your career to the next level
Financial Support: We offer resources to help keep you financially fit and invested in your future, from our highly rated retirement savings matches to financial advisors and planning services

About Tubi

Entertainment Providers

501-1000

Tubi is the most watched free TV and movie streaming service in the U.S., dedicated to providing all people access to all the world's stories. As a leading ad-supported video-on-demand service, the company engages diverse audiences through a personalized experience and the world’s largest content library of over 275,000 movies and TV episodes, a growing collection of Tubi Originals, and nearly 250 FAST channels. Tubi is part of the Tubi Media Group, a division of Fox Corporation that oversees the company’s digital businesses
