Jobs.ca

System Administrator - HPC (4 year term)

Hybrid
Edmonton, Alberta
Mid Level
Full-time

Top Benefits

Competitive compensation
Paid time off
Flexible health benefits

About the role

"Be part of something bigger. At Amii, our work pushes the boundaries of what’s possible in artificial intelligence. As our HPC System Administrator, you’ll design, build, and manage the high-performance computing systems that power groundbreaking research and innovation. This is your chance to shape the future of AI infrastructure and make a lasting impact. Apply today and help us drive discovery forward!"

  • Robert Craig | Director, IT

About Amii

Alberta Machine Intelligence Institute (Amii) is one of Canada’s three main institutes for artificial intelligence (AI) and machine learning. Our world-renowned researchers drive fundamental and applied research at the University of Alberta (and other academic institutions), training some of the world’s top scientific talent. Our cross-functional teams work collaboratively with Alberta-based businesses and organizations to build AI capacity and translate scientific advancement into industry adoption and economic impact.

About The Role

We are seeking an HPC System Administrator for a full-time, 4-year term position. The System Administrator, High-Performance Computing (HPC), is critical to maintaining the stability, security, and performance of our mission-critical infrastructure, enabling our AI researchers and engineers to focus on pioneering innovations that advance the mission of 'AI for good and for all'.

Reporting to the Director, IT, the System Administrator, HPC is responsible for the day-to-day operation and maintenance of the data center's systems infrastructure. This includes servers, storage, network devices, and related software. This role requires a proactive approach to problem-solving, a commitment to best practices, and the ability to work effectively in a fast-paced environment.

The position focuses on achieving excellence in four main accountabilities:

  • System Maintenance & Optimization
  • Security Management
  • Technical Support
  • HPC Administration & Support

Key Responsibilities

  • Assists in expanding HPC resources based on user needs and growth projections, and maintains capacity planning models for scalability and performance
  • Builds, configures, and maintains high-performance computing clusters, including compute, storage, and networking components
  • Oversees daily operations and maintenance of the High-Performance Computing (HPC) cluster running on Linux and Slurm, including monitoring system health and performance, and managing job queues and Slurm configurations for optimized scheduling and resource allocation
  • Designs, configures, and troubleshoots high-speed networking (InfiniBand, Ethernet, VLANs, etc.) to optimize cluster performance
  • Manages and maintains Linux-based and Windows servers, ensuring high availability, performance, and security by performing regular updates, patches, and backups, while also configuring and managing essential network services such as DNS, DHCP, NFS, and SNMP
  • Assists in the development and maintenance of comprehensive documentation for systems, configurations, procedures, and policies
  • Collaborates with other departments to align IT initiatives with organizational goals
  • Plans, tests, and deploys system upgrades and patches, keeping systems updated with security and performance enhancements, and coordinating maintenance to minimize user impact
  • Monitors system logs and performance metrics to resolve issues proactively, troubleshoots problems with vendors and support teams, and manages monitoring tools for real-time system health
  • Implements and maintains virtualization and containerization solutions (e.g., VMware) to optimize resource use and ensure secure, efficient operation
  • Recommends and updates standard technology packages for staff, taking into account job requirements, current technology, and budget, and deploys these packages to new staff
  • Monitors system performance, usage, and resource availability, proactively identifying and resolving issues that could impact performance or user experience
  • Collaborates with Researchers and Machine Learning Scientists to understand computational needs and implement solutions that enhance usability, throughput, and system efficiency
  • Administers workload management and scheduling systems (Slurm) to enable efficient resource allocation and job execution across the cluster
  • Drives continuous improvement of HPC services and support models, identifying opportunities to enhance efficiency, usability, and researcher experience
  • Prepares regular reports on system performance, security incidents, and project status for management review
  • Provides technical support to users by assisting with job submissions, troubleshooting issues, and resolving problems
  • Collaborates with researchers, developers and users to optimize HPC applications through performance tuning strategies
  • Evaluates and recommends new hardware and software solutions to enhance HPC capabilities

Qualifications

  • Post-secondary degree in Computer Science, Information Technology, or a related field (nice to have); equivalent experience will be considered
  • 3+ years of experience in system administration, preferably in an HPC environment
  • Strong understanding of Linux (e.g., CentOS, RHEL, Ubuntu) and Windows Server operating systems
  • Experience with virtualization technologies (e.g., VMware, Hyper-V) (Nice to have)
  • Knowledge of scripting languages (e.g., Bash, Python, PowerShell) (Nice to have)
  • Insight into HPC hardware components (CPUs, GPUs, memory, interconnects) and how to optimize their use.
  • Familiarity with storage systems (SAN, NAS) and backup/recovery solutions
  • Strong understanding of networking concepts and protocols
  • Ability to assist users in optimizing workflows and provide training on HPC resources and tools.

What You'll Love About Us

  • A professional yet casual work environment that encourages the growth and development of your skills.
  • Opportunities to participate in professional development activities.
  • Access to the Amii community and events.
  • A chance to learn from amazing teammates who support one another to succeed.
  • Competitive compensation, including paid time off and flexible health benefits.
  • A modern office located in downtown Edmonton, Alberta.

How to Apply

If this sounds like the opportunity you've been waiting for, please don’t wait for the closing date of September 25, 2025, to apply. We’re excited to add a new member to the Amii team for this role, and the posting may come down before the closing date if we find the right candidate. Please submit your resume and a cover letter indicating why you think you'd be a fit for Amii. In your cover letter, please include one professional accomplishment you are most proud of and why.

Applicants must be legally eligible to work in Canada at the time of application.

Amii is an equal opportunity employer and values a diverse workforce. We encourage applications from all qualified individuals without regard to ethnicity, religion, gender identity, sexual orientation, age or disability. Accommodations for disability-related needs throughout the recruitment and selection process are available upon request. Any information provided by you for accommodations will be kept confidential and won’t be used in the selection process.

About Alberta Machine Intelligence Institute (Amii)

Research Services
51-200

Amii grows machine intelligence capacity and capabilities in Alberta.

Originally formed in 2002, the Alberta Machine Intelligence Institute (Amii) explores the frontiers of scientific knowledge and drives business adoption in the fields of machine learning & artificial intelligence, together called machine intelligence.

Through world-class research, development and education, our team of experts, including over 120 staff and students, works to advance academic understanding at the University of Alberta and other affiliated institutions. The team also leverages world-leading academic talent and expertise to grow Alberta's machine intelligence literacy and build AI and machine learning capabilities in business, from early adoption to advanced research and development.

Amii is home to some of the world’s brightest minds in machine intelligence.