Associate Director, Automation/SRE, Production Engineering
About the role
Job Description
What is the opportunity?
This role reports to the Associate Director, Production Engineering Lead, within Production and Risk Services (P&RS). As part of the team, you will help operationalize her strategy to embed Site Reliability Engineering (SRE) principles and best practices and drive down toil and maintenance costs. This Associate Director, Automation/SRE, Production Engineering role will be vital in shaping the SRE function within P&RS (and the wider Quantitative and Technology Services, of which P&RS is part) to improve efficiency, recovery and resiliency.
What will you do?
- Report directly to the Production Engineering Lead, partnering closely to prioritize and address reliability and supportability gaps.
- Work with RBC’s Predictive Engineering tool.
- Focus on solving our operational challenges (improving supportability and enhancing the operation of a 24/7 environment); you’ll support applications written in Java, .NET, C++, and web stacks across cloud, on-prem Linux/Windows, and containerized environments.
- Engineer and maintain API integrations (e.g. REST/SOAP, OAuth, OpenAPI/Swagger) to enable robust inter-service communication
- Enhance and maintain database reliability and performance across relational (PostgreSQL, Oracle) and NoSQL (MongoDB, Cassandra, Redis) platforms: query optimization, replication resilience, automation.
- Develop and manage automation tools and frameworks to reduce operational toil and build self-healing workflows; implement monitoring, alerting, SLIs/SLOs, and dashboards using tools like ITRS Geneos, Prometheus, Grafana, ELK, Dynatrace.
- Apply long-term corrective actions and automations for recurring incidents; embed resilience patterns (retries, circuit breakers, graceful degradation) into services and systems deployed on cloud (Azure, AWS), on-prem Linux/Windows, and containers (Kubernetes/Docker).
- Contribute to documentation and adoption of reliability best practices across development and support teams; identify reliability gaps, help build standardized frameworks, and close those gaps through automation, monitoring, API/database integrations, and self-healing systems.
- Apply supportability engineering to automate manual tasks (alleviating toil) and build tools, such as intelligent operations/AIOps, that improve issue investigation and recovery.
- Lead Disciplined Operations and governance: define best practices and production standards for our toolkit, and build templates for monitoring, batch controls, runbooks, etc.
- Integrate Chaos Engineering best practices and outline front-to-back integration testing standards, accepting failure as the norm.
What do you need to succeed?
Must have:
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience
- 5+ years of software development experience, including 5+ years of Core Java in production applications
- 2+ years of experience with SRE (Site Reliability Engineering) principles and practices
- Proven experience with API integrations and inter-service communication frameworks (e.g., Spring REST, Swagger/OpenAPI)
- Real-world experience with SQL database systems: designing for performance and automating operations
- Strong coding and scripting abilities (Java, Python, Bash) to build automation tooling
- Familiarity with at least one observability tool, such as ITRS Geneos, Prometheus, Grafana, the ELK stack, or Dynatrace
- Solid understanding of Linux and Windows operating systems, networking, container orchestration (Kubernetes), and cloud architecture
- Must be extremely hands-on, detail-oriented, assertive and proactive with both day-to-day tasks and short- and long-term deliveries
- Proven ability to collaborate well with others, be strategically focused and realize continuous improvements
- Good organizational skills, with the ability to context-switch effectively and thrive in a fast-paced environment
- Strong interpersonal skills and self-starter attitude
- Excellent verbal and written communication skills
Nice to have:
- Cloud certification (Azure, AWS) or Kubernetes certification (CKA)
- Exposure to chaos engineering, GitOps workflows (ArgoCD, Flux, Helios), and platform-level error budgeting
- Prior in-depth experience in supporting large-scale, enterprise-wide, global trading and risk platforms
What is in it for you?
We thrive on the challenge to be our best, progressive thinking to keep growing, and working together to deliver trusted advice to help our clients thrive and communities prosper. We care about each other, reaching our potential, making a difference to our communities, and achieving success that is mutual.
- A comprehensive Total Rewards Program including bonuses, flexible benefits and competitive compensation
- Leaders who support your development through coaching and managing opportunities
- Opportunities to work with the best in the field
- Ability to make a difference and lasting impact
- Work in a dynamic, collaborative, progressive, and high-performing team
- A world-class training program in financial services
- Flexible working options fully supported.
About RBC
Royal Bank of Canada is a global financial institution with a purpose-driven, principles-led approach to delivering leading performance. Our success comes from the 94,000+ employees who leverage their imaginations and insights to bring our vision, values and strategy to life so we can help our clients thrive and communities prosper. As Canada's biggest bank and one of the largest in the world, based on market capitalization, we have a diversified business model with a focus on innovation and providing exceptional experiences to our more than 17 million clients in Canada, the U.S. and 27 other countries. Learn more at rbc.com. We are proud to support a broad range of community initiatives through donations, community investments and employee volunteer activities. See how at www.rbc.com/community-social-impact.