Senior DevOps & Site Reliability Engineer (Americas)

Appspace, posted about 2 months ago

Toronto, Ontario, Canada

Senior Level

Full-Time

About the role

Our Cloud Operations team is seeking a Senior DevOps & Site Reliability Engineer who will play a critical role in ensuring the reliability, performance, and scalability of our diverse SaaS applications.
This role is a specialized hybrid, bridging the gap between legacy VM-based architectures and modern cloud-native standards through aggressive automation and development-focused operations.
Unlike a traditional SRE, this role is deeply integrated with the software development lifecycle, focusing on the consolidation and optimization of platform operations.
You will be responsible for building the CI/CD frameworks, self-service tools, and AI-driven automation that allow our engineering teams to move faster while maintaining rock-solid stability.
Your mission is to maximize the ROI of our existing infrastructure by “automating away” manual toil.
On-call coverage is required on a weekly rotation.
In this role, you will be the technical anchor for a global platform footprint that includes a mix of Azure IaaS/PaaS, Google Cloud Platform (GCP), Kubernetes, and various data platforms. Your day will consist of:
Intelligent Automation & DevOps: Identifying manual “toil” and replacing it with automated workflows for monitoring, change management, and routine administration of large-scale VM environments to ensure a positive ROI
AI-Enhanced Operations: Leading the integration of AI tools for automated code reviews, development frameworks, and predictive log analysis to drive departmental velocity and efficiency
Scalable CI/CD & Provisioning: Designing and maintaining “self-service” deployment frameworks and CI/CD pipelines (GitHub Actions, Bamboo) using Infrastructure as Code (Bicep, Terraform)
Strategic ROI Projects: Evaluating platform components to determine the most cost-effective path: automating the current state or migrating features to modern, shared architectures
Unified Observability: Designing and maintaining a comprehensive observability stack across Azure and GCP (metrics, logs, traces) to identify performance bottlenecks and proactively address system defects
Cross-Functional Collaboration: Partnering with engineering, security, and operations teams to ensure new features are “born” with reliability, security, and automated delivery in mind; ensuring adherence to security best practices and compliance standards (SOC2, HIPAA, ISO 27001) while maintaining operational excellence and cost efficiency
Root Cause Analysis & Forensics: Investigating complex performance defects by following log trails across web, application, and database tiers (SQL Server, MongoDB, MySQL)
Governance & Security: Ensuring all platforms meet security standards (SOC2, HIPAA, ISO 27001) through automated policy enforcement across Azure and GCP

Benefits

Generous PTO
Flexible work schedules
A casual dress work environment
Paid company holidays
Remote work opportunities
Appspace Quiet Fridays (no non-essential internal meetings scheduled)

Qualifications

Experience with AI-driven log analysis or automated incident remediation
Knowledge of database tuning (SQL Server, MySQL, MongoDB)
6+ years in DevOps or SRE roles, with a proven track record of bridging development and operations in complex cloud environments
Expert-level PowerShell and Python skills. Hands-on experience with Bicep or Terraform is required
You are a problem-solver and an automator at heart
Experience with Atlassian suite (Jira, Confluence, Bitbucket)
Familiarity with various middleware and PaaS technologies (e.g., Event Hub, Service Bus, CosmosDB, RabbitMQ, MongoDB)
Must have a passion for life-long learning
Familiarity with compliance standards (SOC2, HIPAA, GDPR)
Extensive experience with Microsoft Azure (IaaS, PaaS, App Services, Networking) and/or Google Cloud Platform (GCP)
Strong background in Windows/Linux Server OS, Kubernetes (AKS/GKE), Helm, and container orchestration
Expert-level troubleshooting and the ability to reason through complex process workflows to identify faults in large-scale platform environments

About Appspace

Software Development

201-500 employees

Founded in 2002

Connect your people, places, and spaces.

Appspace is the workplace experience platform for your whole team that lets you manage it all – from employee communications to your physical office spaces. So, work-from-anywhere becomes an experience everyone loves. With offices in the US, UK, UAE, and Malaysia, plus additional experts in a dozen other countries, we provide global support to thousands of customers and help companies modernize their workplace experience.
