Principal AI Cloud Engineer
About the role
Application Deadline:
11/29/2025
Address:
100 King Street West
Job Family Group:
Data Analytics & Reporting
The Team
We accelerate BMO’s AI journey by building enterprise-grade, cloud-native AI solutions. Our team combines engineering excellence with cutting-edge AI to deliver scalable, secure, and responsible solutions that power business innovation across the bank. We enable and accelerate our partners on their AI journeys across the enterprise, helping teams across BMO unlock value at scale. We support one another in times of need and take pride in our work. We are engineers, AI practitioners, platform builders, thought leaders, multipliers, and coders. Above all, we are a global team of diverse individuals who enjoy working together to create smart, secure, and scalable solutions that make an impact across the enterprise. Our ambition is bold: deploy our capital and resources to their highest and most profitable use through a digital-first operating model, powered by data and AI-driven decisions.
The Impact
As a Principal AI Cloud Engineer, you are a hands-on technical leader who designs, builds, and scales cloud-native AI solutions and products. You set engineering standards, establish patterns, mentor senior engineers, and partner with multiple teams to deliver resilient, governed, and cost-efficient AI at enterprise scale. You’ll help shape and evolve our AI cloud strategy, from model serving and LLMOps to security, observability, and compliance, so teams across the bank can innovate safely and rapidly.
You will advance BMO’s Digital First strategy by:
- Defining reference and production-grade solutions for AI/GenAI on cloud (Azure preferred; multi-cloud aware).
- Building reusable, secure, and observable components (APIs, SDKs, microservices, pipelines).
- Operationalizing LLMs and RAG with strong controls and Responsible AI guardrails.
- Driving platform roadmaps that enable faster delivery, lower risk, and measurable business outcomes.
What’s In It for You
- Influence the technical direction of enterprise AI and the platform primitives others build on.
- Ship high-impact systems used across many business lines and products.
- Work across the full stack: cloud infra, data/feature pipelines, model serving, LLMOps, and DevSecOps.
- Partner with a leadership team invested in your growth and thought leadership.
Responsibilities
Infrastructure & Platform Builder
- Design, build, and operate cloud-native AI infrastructure for ML/GenAI workloads:
  - Compute: GPU/CPU clusters, autoscaling, spot instance strategies
  - Networking: Azure VNet, Private Link, peering, multi-region HA/DR
  - Storage & Databases: high-performance data lakes (e.g., Azure Data Lake Storage), relational DBs, vector DBs (FAISS, Milvus, Pinecone, pgvector)
  - Security: IAM, Key Vault-backed secrets management, encryption, policy-as-code
 
- Implement observability and reliability for AI infra:
  - Metrics (latency, throughput, GPU utilization, cost)
  - Logging/tracing (OpenTelemetry), SLOs/SLIs for infra services
 
- Build CI/CD and GitOps pipelines for infrastructure-as-code (Terraform/Bicep) and AI platform components
- Drive FinOps for AI infra: GPU rightsizing, caching, inference optimization, cost governance
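To make the FinOps bullet concrete, here is a minimal, hypothetical sketch of the kind of GPU rightsizing check a cost-governance pipeline might run against utilization metrics. The thresholds and the sizing heuristic are illustrative assumptions, not BMO tooling:

```python
# Hypothetical FinOps sketch: recommend a GPU count from observed
# utilization so the fleet runs near a target utilization level.
from statistics import mean

def rightsize_recommendation(gpu_util_samples: list[float],
                             current_gpus: int,
                             target_util: float = 0.70) -> int:
    """Recommend a GPU count that moves average utilization toward
    target_util. Always returns at least 1 GPU."""
    avg_util = mean(gpu_util_samples)
    # Effective demand, expressed in "fully busy GPU" units.
    demand = avg_util * current_gpus
    return max(1, round(demand / target_util))

# A cluster of 8 GPUs averaging 30% utilization is oversized:
print(rightsize_recommendation([0.25, 0.35, 0.30], current_gpus=8))  # → 3
```

In practice such a recommendation would feed an autoscaler or a rightsizing report rather than resize nodes directly.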
Application & Service Enablement
- Enable frontend and backend services for AI platforms:
  - Secure APIs, microservices, and event-driven architectures
  - Integration with custom model runtimes (TensorRT-LLM, vLLM, Triton/KServe)
 
- Provide infrastructure support for RAG systems: embeddings, chunking, retrieval pipelines
- Ensure scalable serving infrastructure for LLMs and ML models with caching and token optimization
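The RAG infrastructure bullet above (embeddings, chunking, retrieval) can be sketched end to end in a few lines. This is a toy illustration: real deployments use a learned embedding model and a vector DB such as FAISS, Milvus, or pgvector, whereas the bag-of-words "embedding" and in-memory search here are stand-ins:

```python
# Toy RAG retrieval sketch: fixed-size chunking + cosine-similarity lookup.
import math
from collections import Counter

def chunk(text: str, size: int = 5) -> list[str]:
    """Split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use a model."""
    return Counter(w.strip(".,").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("GPU clusters serve models. Vector databases store embeddings. "
       "Banks require audit trails.")
chunks = chunk(doc, size=4)
print(retrieve("where are embeddings stored", chunks))
# → ['Vector databases store embeddings.']
```

The retrieved chunks would then be injected into the LLM prompt; caching and token optimization (the next bullet) apply at that serving step.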
Strategy & Architecture
- Define and evolve AI infrastructure reference architecture for cloud (Azure preferred):
  - Container orchestration (Kubernetes), service mesh, ingress
  - Serverless/event-driven patterns for AI pipelines
  - Multi-region, HA/DR, compliance-ready designs
 
- Establish standards and best practices for containerization, IaC, and secure networking for AI systems
Security, Risk & Governance
- Implement defense-in-depth for AI infra:
  - IAM least privilege, private networking, KMS/Key Vault, SBOM, image signing
 
- Ensure compliance and Responsible AI controls at infra level:
  - Data residency, encryption, lineage, audit readiness
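As a hedged illustration of what a policy-as-code control at the infra level looks like: real enforcement would live in OPA/Rego or Azure Policy, but the shape of such a rule can be shown in plain Python. The field names and the allowed-region set below are hypothetical, not actual BMO policy:

```python
# Illustrative policy-as-code style check against a resource config dict.
def violations(resource: dict) -> list[str]:
    """Return the list of policy violations for a storage-like resource."""
    found = []
    if not resource.get("encryption_at_rest", False):
        found.append("encryption_at_rest must be enabled")
    if resource.get("public_network_access", True):
        found.append("public network access must be disabled")
    if resource.get("region") not in {"canadacentral", "canadaeast"}:
        found.append("data residency: region must be in Canada")
    return found

compliant = {"encryption_at_rest": True,
             "public_network_access": False,
             "region": "canadacentral"}
print(violations(compliant))  # → []
```

Running such checks in CI/CD, before Terraform/Bicep applies, is what makes the controls auditable rather than advisory.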
 
Delivery & Operations
- Lead infrastructure discovery and solution design with stakeholders
- Operate platforms with SRE principles: error budgets, incident response, chaos testing
- Mentor engineers; create reusable IaC modules, templates, and golden paths
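The SRE error-budget idea above reduces to simple arithmetic. This sketch uses hypothetical numbers, not a real BMO service:

```python
# SRE error-budget arithmetic: with an availability SLO, how many failed
# requests can a service still absorb this period before breaching it?
def error_budget_remaining(slo: float, total_requests: int, failed: int) -> float:
    """Failures still allowed before the SLO for the period is breached.
    A negative result means the budget is already exhausted."""
    allowed_failures = (1.0 - slo) * total_requests
    return allowed_failures - failed

# A 99.9% SLO over 1M requests allows ~1000 failures; 250 already used.
print(round(error_budget_remaining(0.999, 1_000_000, 250)))  # → 750
```

When the remaining budget nears zero, teams typically freeze risky releases and spend the slack on reliability work instead.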
Must-Have Qualifications
- Bachelor’s/Master’s/PhD in CS, Engineering, or related field
- 7+ years building large-scale distributed cloud infrastructure
- 5+ years hands-on with Azure (preferred); AWS/GCP nice to have
- Proven experience with AI/ML infra: GPU clusters, Kubernetes, CI/CD, observability
- Strong in IaC (Terraform/Bicep), Kubernetes, networking, security
- Expertise in cloud-native patterns: containers, service mesh, serverless
- Familiarity with MLOps/LLMOps infra: model serving, feature stores, vector DBs
- Programming in Python (infra automation) and one of Go/TypeScript for tooling
- Understanding of frontend/backend integration for AI services
Nice-to-Have
- GPU optimization (CUDA/NCCL, TensorRT-LLM)
- Observability tools (Prometheus, Grafana, OpenTelemetry)
- Event streaming (Kafka/Azure Event Hubs), real-time systems
- Experience with AI platform products (Azure ML, MLflow, KServe, Hugging Face)
Tech Stack
- Cloud & Infra: Azure (AKS, Functions, Event Hubs, Key Vault), Terraform/Bicep, GitHub Actions/Azure DevOps
- AI Infra: Kubernetes, KServe/Triton, vLLM, TensorRT-LLM, Ray, Spark
- Ops: Prometheus, Grafana, OpenTelemetry, ArgoCD, OPA
- Data: Feature stores (Feast), vector DBs (FAISS, Milvus, Pinecone), relational DBs
- App Layer: APIs, microservices, frontend/backend integration for AI systems
Success Metrics
- Reliability & Performance: SLOs met for infra services, GPU utilization optimized
- Security & Compliance: Zero critical findings, auditable infra
- Cost Efficiency: Reduced GPU/infra spend via FinOps strategies
- Developer Velocity: Faster provisioning and deployment of AI infra
- Technical Leadership: Influence on infra standards, mentorship, reusable patterns