Jobs.ca

Staff Software Engineer, GPU Infrastructure (HPC)

Jobgether, about 20 hours ago
Remote
Staff
Full-time

Top Benefits

Comprehensive health, dental, and mental health benefits.
Six weeks of vacation (30 days).
100% parental leave top‑up up to six months.

About the role

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Software Engineer, GPU Infrastructure (HPC) in the United States or Canada.

As a Staff Software Engineer in GPU infrastructure, you will design, build, and operate high-performance computing clusters to accelerate AI and machine learning workloads. You will collaborate closely with researchers and engineers to ensure AI workloads run reliably, efficiently, and at scale across cloud environments. The role includes optimizing infrastructure for cost, performance, and stability, while providing self-service tools for ML teams. You will troubleshoot complex issues, implement automation and observability best practices, and drive innovations in distributed GPU/TPU systems. This position offers opportunities to mentor engineers, influence infrastructure strategy, and directly impact the development of cutting-edge AI models. You will work in a fast-paced, collaborative environment where technical excellence and scalability are key priorities.

Accountabilities:
• Design, deploy, and manage Kubernetes-based GPU/TPU superclusters across multiple clouds for AI/ML workloads.
• Optimize HPC infrastructure for distributed training frameworks such as JAX, PyTorch, and TensorFlow.
• Identify and resolve performance bottlenecks, system failures, and infrastructure issues.
• Build self-service tools to enable researchers to monitor, debug, and optimize AI/ML training jobs independently.
• Implement best practices for automation, observability, and infrastructure-as-code (IaC).
• Collaborate closely with AI researchers and ML engineers to translate emerging needs into robust infrastructure solutions.
• Mentor team members, conduct code reviews, document processes, and foster a culture of knowledge sharing.

Requirements

• Deep expertise in ML/HPC infrastructure, including GPU/TPU clusters and distributed training frameworks.
• Proven experience with cloud-native Kubernetes deployments at scale.
• Strong programming skills in Python and Go, with preference for open-source contributions.
• Knowledge of Linux internals, RDMA networking, and performance optimization for ML workloads.
• Demonstrated ability to collaborate with research teams and solve complex infrastructure challenges.
• Self-directed problem-solving mindset with ability to drive impact in fast-paced environments.
• Experience in building scalable, resilient, and maintainable infrastructure systems.

Benefits

• Inclusive and collaborative work culture.
• Opportunities to work on cutting-edge AI research and infrastructure projects.
• Weekly lunch stipend, in-office meals, and snacks.
• Comprehensive health and dental benefits, including mental health budget.
• 100% parental leave top-up for up to six months.
• Personal enrichment benefits for arts, fitness, and workspace improvement.
• Remote-flexible work options, co-working stipend, and offices in major global cities.
• Six weeks of vacation (30 working days).

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.
When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.
The process is transparent, skills-based, and free of bias, focusing solely on your fit for the role.
Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.
Thank you for your interest!


About Jobgether

Industry: Internet Marketplace Platforms
Company size: 11-50 employees

Your future of work, as you've always dreamt it, is now possible with Jobgether!

The COVID crisis accelerated a revolution that was already under way: work as we knew it no longer exists. Tomorrow, jobs will be hybrid, remote, and asynchronous. Flexibility will be the norm.

Jobgether helps you find your next remote job, wherever you are.