About the role
LLMOps Lead (Canada)
Location: Remote / Toronto (Canada)
Level: Senior Engineer / Staff Engineer (depending on experience)
Role Overview
We are seeking an experienced LLMOps Lead to architect, manage, and scale the operational infrastructure behind our large language model workflows. You will be responsible for building robust pipelines, monitoring systems, and tooling that empower our AI teams to deploy, test, and iterate models rapidly and reliably.
Key Responsibilities
- Design, build, and maintain LLM infrastructure (model serving, versioning, rollout, inference pipelines)
- Automate model training, fine-tuning, evaluation, and retraining workflows
- Implement observability, monitoring, logging, and alerting around model performance, drift, latency, and reliability
- Work cross-functionally with ML research, product, and software engineering to integrate LLMs into production services
- Optimize cost, scaling, caching, batching, and hardware utilization for inference
- Lead best practices in MLOps: reproducibility, infrastructure as code, model lineage, experiment tracking
- Mentor junior engineers; set engineering standards and guidelines for LLM operations
Required Skills & Experience
- Master's or PhD in Computer Science, Machine Learning, or a related field (Waterloo / U of Toronto / Queen's preferred)
- 5+ years of backend / MLOps / infrastructure experience, with 2+ years working specifically with large language models or deep learning production systems
- Deep familiarity with frameworks such as PyTorch, TensorFlow, Transformers, Hugging Face, etc.
- Experience with model serving platforms (e.g., Triton, TorchServe, KServe (formerly KFServing), TensorFlow Serving)
- Strong knowledge of distributed systems, containerization (Docker), orchestration (Kubernetes), and cloud infrastructure (AWS, GCP, Azure)
- Hands-on experience with workflow orchestration and ML lifecycle tools (Airflow, Kubeflow, MLflow, etc.)
- Proven track record of building scalable, reliable, low-latency inference systems
- Familiar with cost optimization strategies for model inference (quantization, pruning, batching, caching)
- Excellent communication, problem-solving, debugging, and leadership skills
- Energetic and proactive, with the ability to adapt in a fast-paced, startup-style environment
Nice-to-Have
- Experience with agentic systems, autonomous AI, or multi-agent coordination
- Background in prompt engineering, reinforcement learning from human feedback (RLHF)
- Exposure to edge / on-device inference
- #IT2025