Site Reliability Engineer

Denvr Dataworks 21 days ago

Hybrid

Toronto, Ontario

Mid Level

Full-time

About the role

Who We Are

Denvr is a vertically integrated AI Platform Services company with headquarters in Calgary, Canada. We provide foundational compute infrastructure and services to support the broader AI ecosystem and its end users. The platform includes cloud-native solutions for training, inference, high-performance computing, data processing, scalable storage, and a suite of software toolsets that accelerate the development, deployment, and integration of AI applications.

These capabilities are accessible via the public Denvr AI Cloud or through Private AI Platform Services, which offer fully dedicated, sovereign environments with enhanced security. Private deployments incorporate advanced data centers, optimized compute architectures, high-throughput storage fabrics, and tightly integrated platform operations software—engineered to meet the demands of large-scale, mission-critical AI workloads.

We design proprietary, ultra-efficient, modular AI data centers built for hyperscale AI deployments.

At Denvr, we’re driven to create exceptional customer experiences that empower AI innovators, entrepreneurs, and enterprises to achieve real results. Our mission is to help unlock breakthroughs that transform industries, enhance creativity, and shape a better future.

Why Join Us

Joining Denvr means being part of a world-class team in the fast-moving field of AI and high-performance computing. We value curiosity, collaboration, and continuous learning. Our people are proactive problem solvers who take pride in delivering great results, thrive in open and transparent environments, and enjoy learning by doing.

If you’re forward-thinking, motivated by innovation, and excited to help drive a rapidly growing business, Denvr is the place to make an impact—and grow your career alongside exceptional teammates.

About the Role

We are seeking a Site Reliability Engineer (SRE) with experience spanning cloud and data center environments to drive infrastructure reliability, observability, and scalability. In this role, you will design and operate resilient, high-performance systems that enable cutting-edge data solutions.

What You’ll Do

Observability & Monitoring: Design, implement, and maintain observability systems with Grafana, Prometheus, VictoriaMetrics, and PromQL to monitor system health and performance.
Industry Best Practices: Explore opportunities to improve the overall observability of the HPC environment using industry best practices.
Incident Management & Troubleshooting: Participate in on-call rotations, rapidly diagnose and resolve incidents, and perform postmortem reviews to drive continuous improvements.
DevOps & CI/CD: Automate DevOps pipelines using GitHub Actions (or similar tools).

Who You Are

Experience: 3-5 years in a Site Reliability Engineering (SRE) or DevOps role.
Software Development Background: Strong software development background and computer science fundamentals.
Infrastructure as Code (IaC): Familiarity with tools such as Terraform, Helm, Ansible, and Python for automated infrastructure provisioning.
Security Best Practices: Knowledge of security practices and compliance standards for enterprise environments.
HPC Knowledge: Familiarity with high-performance computing, specifically in administering GPU-related workloads.
Kubernetes Proficiency: Strong experience in managing Kubernetes clusters in production environments.
Observability Tools: Expertise in observability platforms (Grafana, Prometheus, PromQL) for tracking and analyzing system metrics.
Networking: Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, VPNs).
AWS Cloud/Hybrid Cloud: Hands-on experience developing and deploying production-grade applications on AWS within a hybrid cloud architecture.
Linux Systems: Proficiency in Linux administration, shell scripting, and performance tuning.
Programming Experience: Strong software development skills (e.g., Bash, Python, Golang) to automate infrastructure and operational tasks.

If you are passionate about technology and want to be part of a remote-first, forward-thinking company, Denvr would love to hear from you and learn more about your skills and capabilities. Click on the link to apply!

About Denvr Dataworks

IT Services and IT Consulting

11-50

Denvr Dataworks provides cloud services for the development and operations of AI technologies, including Inference-as-a-Service offerings, with options for accessing on-demand or dedicated compute instances, or hardware, that are designed for AI workloads.
