Senior DevOps Engineer

TextLayer · 8 days ago
Remote
CA$200,000 - CA$220,000/year
Senior Level
Full Time

About the role

Ready to solve the AI implementation gap that 85% of enterprises face? Join us!

About TextLayer

TextLayer helps enterprises and funded startups deploy advanced AI systems without rewriting their infrastructure. We work with organizations across fintech, healthtech, and other sectors to bridge the gap between AI potential and practical implementation.

Our approach combines deep technical expertise with proven frameworks like TextLayer Core to accelerate development and ensure production-ready results. From bespoke AI workflows to agentic systems, we help clients adopt AI that actually works in their existing tech stacks.

We're on a mission to help address the implementation gap that over 85% of enterprise clients experience in adding AI to their operations and products. We're looking for sharp, curious people who want to meaningfully shape how we build, operate, and deliver.

If you're excited to work on foundational AI infrastructure, solve complex problems for diverse clients, and help define what agentic software looks like in practice, we'd love to meet you.

The Role

The Senior DevOps Engineer will architect production-grade monitoring, logging, and tracing systems designed specifically for AI workloads, implement OpenTelemetry-based data collection pipelines, build robust deployment workflows using IaC, and create resilient observability solutions that provide deep insight into LLM applications and conversational AI systems. Observability into AI systems is central to the infrastructure that powers our client engagements and to an upcoming product launch.

Key Responsibilities

  • Design and maintain OpenTelemetry-based observability infrastructure for distributed AI systems and LLM applications
  • Build and scale ELK stack deployments (Elasticsearch, Logstash, Kibana) for log aggregation, search, and visualization of AI application data
  • Implement comprehensive tracing and monitoring solutions for LLM inference, RAG pipelines, and AI Agent workflows
  • Develop and maintain data ingestion pipelines for processing high-volume telemetry data from AI applications
  • Configure and optimize OpenSearch clusters for real-time analytics and trace reconstruction of conversational flows
  • Deploy and manage LLM observability platforms such as Langfuse and OpenLLMetry, along with custom monitoring solutions
  • Implement Infrastructure as Code (Terraform, CloudFormation) for reproducible observability and application stack deployments
  • Build automated alerting and incident response systems for AI application performance and reliability
  • Collaborate with engineering teams to instrument AI applications with proper telemetry and observability hooks (a minimal instrumentation sketch follows this list)
  • Optimize data retention policies, indexing strategies, and query performance for large-scale observability data
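For illustration only, the sketch below shows the kind of OpenTelemetry instrumentation this role involves, assuming the opentelemetry-sdk Python package. The tracer name, span name, attributes, and the call_llm / answer_question helpers are hypothetical, and a production setup would export spans to an OTLP collector feeding a stack like ELK or OpenSearch rather than to the console.

# Minimal OpenTelemetry tracing sketch for an LLM call (illustrative, not TextLayer's implementation).
# Assumes: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider that prints spans to stdout; in production this
# would point at an OTLP collector instead of the console exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ai-app")  # hypothetical service name

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned response."""
    return f"echo: {prompt}"

def answer_question(prompt: str) -> str:
    # Wrap the model call in a span and attach attributes that are useful
    # for later trace reconstruction and latency/cost analysis.
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt_chars", len(prompt))
        response = call_llm(prompt)
        span.set_attribute("llm.response_chars", len(response))
        return response

if __name__ == "__main__":
    print(answer_question("What does observability for LLM apps look like?"))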

What You Will Bring

To succeed in this role, you'll need deep expertise in observability infrastructure, hands-on experience with OpenTelemetry and ELK stack, and a strong understanding of AI/ML system monitoring challenges. You should be passionate about building scalable, reliable infrastructure that provides actionable insights into complex AI workloads.

Required Qualifications

  • 4+ years of DevOps/Infrastructure engineering experience with focus on observability and monitoring
  • Expert-level experience with OpenTelemetry implementation, configuration, and custom instrumentation
  • Production experience with ELK stack (Elasticsearch, Logstash, Kibana) including cluster management and optimization
  • Strong knowledge of distributed tracing, metrics collection, and log aggregation architectures
  • Experience with container orchestration (Kubernetes, Docker) and cloud infrastructure (AWS/GCP/Azure)
  • Proficiency with Infrastructure as Code tools (Terraform, Ansible, CloudFormation)
  • Experience building high-throughput data ingestion pipelines and real-time analytics systems
  • Strong scripting skills (Python, Bash/Sh) for automation and tooling
  • Knowledge of observability best practices, SLI/SLO definitions, and incident response
  • Experience with monitoring tools like Prometheus, Grafana, or DataDog

Bonus Points

  • Experience with LLMOps observability tools (Langfuse, LiteLLM, Weights & Biases, Phoenix, Braintrust)
  • Experience with Golang (Go), Rust, or C/C++
  • Knowledge of AI/ML system monitoring patterns and LLM application telemetry
  • Experience with OpenSearch and ClickHouse for analytics workloads
  • Familiarity with conversational AI analytics and trace reconstruction techniques
  • Experience instrumenting LLM applications, RAG systems, or AI Agent workflows
  • Background in time-series databases and vector search optimization
  • Contributions to open-source observability or LLMOps projects
  • Knowledge of eval-driven development and automated AI system testing frameworks

Employment Type: Full Time

Location: Remote - Canada

Compensation: $200,000 - $220,000 CAD base salary

Start Date: Flexible; immediate start preferred

How to Apply: Apply directly via our portal: https://jobs.ashbyhq.com/textlayer/e5dc51ed-3987-45e0-ae4f-3a27edb88838

About TextLayer

Company size: 1-10

TextLayer helps enterprises and ambitious teams build, deploy, and scale advanced AI systems—without rewriting their infrastructure.

We provide engineering teams with a modular, stable foundation so they can adopt AI without betting on the wrong tech. Our flagship stack, TextLayer Core, is maintainable, tailored to the environment, and deployed with Terraform and standardized APIs.

We work closely with platform teams and technical leaders to integrate LLMs, retrieval-augmented generation (RAG) pipelines, and agentic workflows directly into production environments.

From internal copilots to customer-facing features, TextLayer delivers fast, reliable implementation without compromising long-term maintainability.

We’re a small, fast-moving team on a mission to power enterprise clients with serious AI infrastructure. Modular. Scalable. Battle-tested.