Sr. Site Reliability Engineer, Chaos Engineering
Job Description
What is the Opportunity?
We are seeking an experienced and innovative Lead Site Reliability Engineer (SRE) to spearhead the implementation of Chaos Engineering practices across all Digital Channels. This senior-level role is critical to ensuring the resilience, scalability, and availability of our systems in high-stress environments while driving operational excellence within the organization.
What will you do?
As the Lead SRE specialized in Chaos Engineering, your responsibilities will include:
Chaos Engineering Implementation:
- Design and execute chaos experiments using tools such as Gremlin to proactively test systems under stress.
- Simulate failure scenarios to identify potential risks and validate system behavior during degradation or outages.
- Ensure experiments yield actionable insights for improving resilience.
Assess and Validate Autoscaling:
- Analyze and validate autoscaling policies across individual systems to ensure optimal performance under variable loads.
- Collaborate with engineering teams to refine and implement dynamic scaling strategies based on experimental outcomes.
Resiliency Reporting:
- Develop comprehensive reports that outline system resiliency metrics, findings from chaos experiments, and recommendations for improvement.
- Provide insights to leadership and technical teams to guide decision-making on infrastructure updates and architectural enhancements.
High Availability Architectures:
- Embed redundancy patterns such as failover mechanisms and active-active configurations to achieve high availability across key systems.
- Lead efforts to integrate seamless failover processes during both planned and unplanned downtimes.
Collaboration and Communication:
- Act as a subject matter expert and promote Chaos Engineering practices across teams.
- Partner with DevOps, infrastructure, and application teams to ensure resilience objectives align with broader organizational goals.
- Facilitate training sessions and knowledge-sharing initiatives on Chaos Engineering concepts for technical staff.
What do you need to succeed?
Must Have:
- 5+ years of experience in Site Reliability Engineering or a related role, with a minimum of 2 years focused on Chaos Engineering.
- Hands-on experience in designing and implementing reliable, scalable, and fault-tolerant systems.
- Strong proficiency in Chaos Engineering tools like Gremlin, Chaos Monkey, or similar platforms.
- Deep understanding of cloud infrastructure (AWS, Azure, GCP) and concepts like load balancing, autoscaling, failover, and high availability.
- Proven expertise in monitoring and observability tools like Prometheus, Grafana, or Datadog.
Nice to Have:
- Excellent analytical, problem-solving, and decision-making abilities.
- Strong collaboration and communication skills to influence cross-functional teams.
- A proactive and innovative mindset to drive continuous improvements.
What’s in it for you?
We thrive on the challenge to be our best, progressive thinking to keep growing, and working together to deliver trusted advice to help our clients thrive and communities prosper. We care about each other, reaching our potential, making a difference to our communities, and achieving success that is mutual.
- A comprehensive Total Rewards Program including bonuses and flexible benefits, competitive compensation, commissions, and stock where applicable
- Leaders who support your development through coaching and managing opportunities
- Ability to make a difference and lasting impact
- Work in a dynamic, collaborative, progressive, and high-performing team
- A world-class training program in financial services
- Flexible work/life balance options
- Opportunities to do challenging work
- Opportunities to take on progressively greater accountabilities
- Opportunities to build close relationships with clients
About RBC
Royal Bank of Canada is a global financial institution with a purpose-driven, principles-led approach to delivering leading performance. Our success comes from the 94,000+ employees who leverage their imaginations and insights to bring our vision, values and strategy to life so we can help our clients thrive and communities prosper. As Canada's biggest bank and one of the largest in the world, based on market capitalization, we have a diversified business model with a focus on innovation and providing exceptional experiences to our more than 17 million clients in Canada, the U.S. and 27 other countries. Learn more at rbc.com. We are proud to support a broad range of community initiatives through donations, community investments and employee volunteer activities. See how at www.rbc.com/community-social-impact.