Site Reliability Engineer
Senior Site Reliability Engineer (SRE) with Team Lead Experience
This role may also be a strong fit if you've held titles such as Lead Production Engineer, Reliability Engineering Lead / Manager, Platform Reliability Engineer, Infrastructure Reliability Engineer, Systems Reliability Engineer, or Production Operations Lead. The discipline matters more than the title.
About XP Venture Labs
At XP Venture Labs, we partner with ambitious companies to solve complex technology challenges and accelerate growth. Our teams are composed of highly skilled engineers, architects, and technology leaders who bring deep technical expertise and real-world delivery experience. We don't operate as traditional consultants; we embed as strategic partners to design scalable systems, modernize platforms, improve reliability, and help our clients navigate high-impact technical decisions with confidence.
From cloud architecture and distributed systems to platform engineering and large-scale modernization, we specialize in the kinds of problems that demand precision, experience, and a relentless focus on outcomes.
About the Role
As our Senior SRE Team Lead, you own the reliability, availability, and performance of our production systems, and you lead the team responsible for keeping them healthy around the clock. This is a hands-on leadership role for someone who has already run an SRE function: you have defined SLOs, are comfortable being on call, have led incident bridges at 3 a.m., and have built the tooling, standards, and on-call culture that prevent the next outage rather than just reacting to it.
A few things that make this role what it is:
We move at startup speed. You'll be building core reliability infrastructure from the ground up rather than inheriting a mature platform, juggling several high-impact initiatives at once, and making consequential calls with incomplete information. If you're energized by fast-moving, high-ownership, sometimes-ambiguous environments, you'll thrive here. If you need a slow, tightly scoped, ticket-by-ticket setup, this likely isn't the right fit, and that's okay.
Our production systems span both Windows and Linux. We need someone genuinely fluent in both, not strong in one and passable in the other. You should be equally at home debugging an IIS / .NET issue and a Linux, systemd, or kernel-level one.
This is a reliability role, not a pipeline role. We're looking for an SRE: someone who applies software engineering to operations, owns production outcomes, and obsesses over SLOs, observability, and resilience. (More on what we mean by that below.)
You'll have the autonomy to evaluate and introduce new tools, establish best practices, and define the standards that guide the SRE function as we scale.
What You'll Own
The reliability, availability, and performance of production systems running on both Linux and Windows.
Defining and managing SLAs, SLOs, SLIs, and error budgets to drive measurable, accountable reliability improvements.
24/7 monitoring, observability, and on-call, including building the dashboards, structured logging pipelines, and actionable alerting that keep us ahead of problems.
Leading incident response, postmortems, and root-cause analysis to cut recurrence and mean time to recovery (MTTR).
Architecting and maintaining scalable, highly available AWS infrastructure.
Mentoring and developing the SRE team, including setting on-call rotations, operational-excellence standards, and a culture of reliability.
Proactive capacity planning and performance optimization before scale becomes a problem.
Partnering with Engineering, Security, and Product to embed reliability across the SDLC.
Evaluating and adopting tools, technologies, and operational frameworks that improve resilience and efficiency.
What We're Looking For
Dual Windows + Linux production expertise. Deep, hands-on experience operating, tuning, and troubleshooting production systems on both operating systems, not just one.
Deep AWS experience. You've architected and run secure, scalable, highly available environments on AWS, not just deployed into them.
Direct experience leading a team of SREs. You've owned the function, including on-call rotations, standards, hiring, and growth, not just mentored a junior or two as a senior IC.
A track record of production SLAs and 24/7 operations. You've maintained production-grade SLAs and built and run the monitoring and observability that back them.
Comfort in fast-paced, high-intensity environments. You've worked somewhere startup-like, building and maintaining systems from scratch while juggling many concurrent priorities, and you do your best work there.
Technical Requirements
Operating systems (both required):
    Linux: administration, performance tuning, networking, troubleshooting, and Bash (e.g., Ubuntu / RHEL, systemd).
    Windows Server: administration, IIS, and PowerShell for Windows/legacy automation.
AWS (deep): EC2, ECS/EKS, Lambda, RDS, DynamoDB, S3, IAM, VPC, and networking, with a proven ability to architect secure, scalable, highly available environments.
Containers & orchestration: strong hands-on experience with Docker and Kubernetes.
Observability & monitoring (core to this role): designing performance dashboards, structured logging pipelines, and actionable alerting. Bonus points for Grafana, BetterStack, and Kibana.
Reliability engineering practices: SLOs/SLIs/error budgets, incident response, postmortems, RCA, and MTTR reduction.
Infrastructure-as-Code: advanced Terraform experience required. (AWS CloudFormation / SAM a plus.)
Databases: MS SQL Server performance tuning and optimization.
Application context: experience operating and monitoring .NET (C#) backend services with Angular front-ends.
Distributed systems: strong understanding of microservice/API architecture, networking, and high-availability design principles.
Message brokers: experience operating and tuning brokers such as RabbitMQ and Kafka.
Automation & tooling: scripting (Python, Bash, Shell) to build reliability tooling and reduce operational toil.
Excellent written and verbal communication in English.
Must be physically located in Canada and legally authorized to work in Canada.
Nice to Have
Networking depth: DNS management, VPN configuration, and packet analysis.
Chaos/resilience testing and game-day experience.
Experience standing up an SRE practice or reliability tooling from zero.
What We Offer
Competitive compensation: $150,000 – $180,000 CAD annually, based on experience and expertise.
Meaningful growth: diverse, high-impact projects that expand your technical depth and leadership.
A high-performance culture: a team that values innovation, technical excellence, and shared success.
Fully remote: 100% work-from-home across Canada.
Modern stack: cutting-edge tools across observability, data visualization, and cloud infrastructure.
Human-centered hiring: no AI tools are used at any stage of the application or evaluation process.
How to Apply
Interested candidates are encouraged to submit their resume at their earliest convenience.
About XP Venture Labs
At XP Venture Labs, we believe the right understanding and technological edge can lead companies toward a successful future. We always seek valuable feedback from our clients to learn and evolve.