Site Reliability Engineer
About the role
Pay Range: CAD 60-65/hr
- Design and implement observability-as-code solutions using Terraform to deploy monitoring pipelines, dashboards, and alerting strategies across distributed systems.
- Drive observability improvements leveraging industry-leading tools (Dynatrace, ELK, Splunk, PagerDuty) to achieve real-time performance insights and comprehensive system visibility.
- Instrument applications for end-to-end observability, implementing distributed tracing, metrics collection, and log aggregation across Node.js and .NET microservices and event-driven architectures.
- Troubleshoot complex incidents in production environments, diagnosing root causes across multiple service layers, databases, caches, and APIs under load using SLI/SLO frameworks.
- Investigate and resolve Azure Kubernetes Service (AKS) infrastructure issues, ensuring reliability and scalability of containerized workloads with deep proficiency in Terraform and Azure managed services (SQL MI, Redis, Functions, Event Grid).
- Translate business requirements into observable, resilient systems that meet defined SLIs/SLOs and drive reliability improvements.
- Automate operational tasks to reduce toil and improve system resilience through infrastructure-as-code and CI/CD best practices.
- Lead incident response and remediation for mission-critical systems, conducting blameless postmortems and building resilience through chaos engineering and tabletop exercises.
- Collaborate cross-functionally with development, platform, and business teams to improve service availability, scalability, and operational excellence.
What do you need to succeed
Must-have
- 8 years of hands-on experience in observability, SRE, or DevOps roles, with proven expertise across infrastructure and application-level reliability.
- Deep expertise in observability tooling (Dynatrace, ELK, Splunk, and PagerDuty); demonstrated understanding of observability principles (instrumentation, correlation IDs, SLI/SLO frameworks).
- Advanced proficiency with Azure Kubernetes Service (AKS), Terraform, and Azure managed services (SQL MI, Redis, Functions, Event Grid); proven ability to design and implement infrastructure-as-code solutions.
- Strong hands-on experience instrumenting applications for comprehensive observability: distributed tracing, metrics collection, and log aggregation across Node.js and .NET applications in microservices and event-driven architectures.
- Proven troubleshooting expertise in distributed systems, diagnosing root causes across multiple service layers, databases, caches, and APIs in production environments.
- Excellent incident management skills; hands-on experience with PagerDuty and ServiceNow; ability to resolve high-severity incidents rapidly and conduct effective root cause analysis.
- Knowledge of incident, problem, and change management processes, including SRE principles, blameless postmortems, and chaos engineering practices.
- Exceptional communication and leadership skills to coordinate across business and IT teams; ability to lead remo
About LanceSoft, Inc.
Established in 2000, LanceSoft is a pioneer in delivering top-notch Global Workforce Solutions and IT Services to a diverse clientele. As a Certified MBE and Woman-Owned organization, we pride ourselves on fostering global cross-cultural connections that advance both the careers of our employees and the success of our clients' businesses.
At LanceSoft, our mission is clear: to leverage our global network to seamlessly connect businesses with the right talent and individuals with the right opportunities, all without bias. We believe in providing Global Workforce Solutions with a personalized, human touch.
Our comprehensive range of services spans various domains, encompassing temporary and permanent staffing, Statement of Work (SOW) arrangements, payrolling, Recruitment Process Outsourcing (RPO), application design and development, program/project management, and engineering solutions.
Currently, our team of over 5,000 professionals caters to 110+ enterprise clients worldwide, including Fortune companies. Our client base represents a diverse spectrum of industries, including Banking & Financial Services, Semiconductor/VLSI, Technology, Healthcare & Life Sciences, Government, Telecom & Media, Retail & Distribution, Oil & Gas, and Energy & Utilities.
Headquartered in Herndon, VA, LanceSoft operates 32+ regional offices across North America, Europe, Asia, and Australia. We also have nine delivery centers in India (Bangalore, Indore, Noida, Baroda, Hyderabad, Bhubaneshwar, Dehradun, Goa, and Aligarh) to further enhance our client service capabilities.