Jobs.ca

LLMOps Lead

Meraki7, 8 days ago
Hybrid
Toronto, ON
Senior Level
Full-time

About the role

LLMOps Lead (Canada)

Location: Remote / Toronto (Canada)
Level: Senior Engineer or Staff Engineer (depending on experience)

Role Overview

We are seeking an experienced LLMOps Lead to architect, manage, and scale the operational infrastructure behind our large language model workflows. You will build the pipelines, monitoring systems, and tooling that let our AI teams deploy, test, and iterate on models rapidly and reliably.

Key Responsibilities

  • Design, build, and maintain LLM infrastructure (model serving, versioning, rollout, inference pipelines)
  • Automate model training, fine-tuning, evaluation, and retraining workflows
  • Implement observability, monitoring, logging, and alerting around model performance, drift, latency, and reliability
  • Work cross-functionally with ML research, product, and software engineering to integrate LLMs into production services
  • Optimize cost, scaling, caching, batching, and hardware utilization for inference
  • Lead best practices in MLOps: reproducibility, infrastructure as code, model lineage, experiment tracking
  • Mentor junior engineers; set engineering standards and guidelines for LLM operations

Required Skills & Experience

  • Master's or PhD in Computer Science, Machine Learning, or a related field (Waterloo / U of Toronto / Queen's preferred)
  • 5+ years of backend / MLOps / infrastructure experience, with 2+ years working specifically with large language models or deep learning production systems
  • Deep familiarity with frameworks such as PyTorch, TensorFlow, and Hugging Face Transformers
  • Experience with model serving platforms (e.g. Triton, TorchServe, KServe (formerly KFServing), TensorFlow Serving)
  • Strong knowledge of distributed systems, containerization (Docker), orchestration (Kubernetes), and cloud infrastructure (AWS, GCP, Azure)
  • Hands-on experience with orchestration tools (Airflow, Kubeflow, MLflow, etc.)
  • Proven track record of building scalable, reliable, low-latency inference systems
  • Familiarity with cost optimization strategies for model inference (quantization, pruning, batching, caching)
  • Excellent communication, problem-solving, debugging, and leadership skills
  • Energetic and proactive, with the ability to adapt in a fast-paced, startup-style environment

Nice-to-Have

  • Experience with agentic systems, autonomic AI, or multi-agent coordination
  • Background in prompt engineering or reinforcement learning from human feedback (RLHF)
  • Exposure to edge / on-device inference

About Meraki7

Industry: IT Services and IT Consulting
Company size: 11-50 employees

Meraki7 Inc. is a national staffing and recruiting firm specializing in IT. We partner with leading companies to match in-demand talent with challenging information technology positions.