About the role
Element451 is building the AI-powered platform reshaping how colleges and universities recruit, enroll, and support their students - and the reliability of that platform is what earns the trust of the institutions that run on it. This is a rare chance to own reliability, operations, and security at the level where it actually happens: keeping a real production system healthy, fast, and safe, and building the automation that keeps it that way. If you're a broad operator who's happiest when a single week spans operations, delivery, security, infrastructure, and data - and you've done that work where things move fast and the edges go unowned - you'll feel at home here.
The Role

You'll keep Element451's platform reliable, secure, and operable - and build the delivery systems that let it scale without scaling the firefighting. This is a hands-on senior IC role, and a deliberately broad one: reliability and operations are the core, with CI/CD and delivery, security, infrastructure, and data reliability all real, recurring parts of the work. You'll partner closely with our Director of Platform Engineering, who owns the platform strategy; you own a large share of the operational and delivery execution that brings it to life. We hold a high bar - and we give you the ownership, context, and support to meet it. The work has range and a fair amount of unpredictability; the people who thrive in it tend to want exactly that.

What You'll Own

You own the operational health of the platform - that it's available, fast, observable, and safe in production - and you build the automation that makes those properties durable rather than heroic. In practice, you:

Own the reliability discipline in practice - define and track SLIs and SLOs, keep the observability stack sharp (we use CloudWatch, Sentry, Papertrail, and Langsmith), and make system health and customer impact legible in real time.
Carry production operations day to day - participate in on-call, lead incident response, run blameless post-incident reviews, and drive issues to root cause rather than patching symptoms.
Treat operational toil as engineering work to eliminate - relentlessly automate remediation, sharpen alert quality, and drive down MTTD and MTTR rather than absorbing manual load.
Own and evolve the CI/CD and delivery platform alongside the Director of Platform Engineering - build pipelines, deployment automation, environment management, and release tooling - so shipping is routine, safe, and low-drama, with progressive delivery, automated rollback, and production validation gates as standard.
Build the developer-facing automation and paved roads that cut friction for the product engineering team - treating them as the platform's customer and making the reliable, secure path the easy path.
Be the platform's hands-on security operator - IAM and least-privilege hygiene, secrets management, threat detection and response (WAF, GuardDuty), and vulnerability triage and remediation against SLA. This is the security function in practice today; you partner with the Director of Platform Engineering on strategy and standards, and you're trusted to set the operational bar where none exists yet.
Keep the platform audit-ready by default - produce the infrastructure and reliability evidence that SOC 2 Type II and FERPA obligations depend on (we use Vanta), so audits are a byproduct of good operations, not a scramble.
Build and operate cloud infrastructure as code - AWS (ECS/Fargate, Lambda, SQS/SNS, EventBridge, S3/CloudFront, VPC) managed with Terraform, with no manual or snowflake infrastructure in any environment.
Plan and execute scaling ahead of product growth, and keep operational documentation current - failure modes and recovery procedures included.
Own the operational health of our data stores - MongoDB Atlas backups and recovery, performance tuning, and monitoring at scale - so data stays durable, performant, and recoverable.

How You'll Show Up

We hire for behavior as much as for experience. Our values describe how we work, and we look for people who already operate this way:

Understand the "why" before the "what." You dig for the reason behind the work and tie operational, delivery, and security decisions to real customer and business impact - rather than executing tickets on faith.
Own the outcome. You're on the hook for the result, not the task - reliability, security, and operability end to end - and you're genuinely unsatisfied with "it mostly works."
Take initiative to move work forward. You see the unowned edge and close it, surface risks early, and instinctively ask "how do we make this permanent and self-healing?" rather than waiting to be told.
Engage collaboratively to solve problems together. You treat the engineering team as the platform's customer, work in the open without heroics or silos, and are energized rather than rattled when a day spans an incident, a deploy, a security finding, and a database.

What You Bring

7+ years across site reliability, operations, delivery, infrastructure, or platform engineering, with a track record of hands-on delivery - including real startup experience where you owned broad, shifting scope without a large team behind you.
A strong SRE foundation - SLI/SLO design, observability stack ownership, and incident response in a production environment with real customer impact.
Strong CI/CD and delivery-engineering chops - building and operating pipelines (GitHub Actions), Docker/ECR workflows, and ECS deployment automation, with progressive delivery and automated rollback - enough to co-own the delivery platform, not just consume it.
Security-operations depth you can own without a security team behind you - IAM governance and least-privilege, secrets management, network security and threat detection (WAF, GuardDuty), and vulnerability triage and remediation. You'll be the org's hands-on security practitioner, so the judgment to set an operational bar - not just follow one - matters here.
Deep, current AWS expertise - ECS/Fargate, Lambda, SQS/SNS, EventBridge, S3/CloudFront, VPC networking, IAM, and Secrets Manager - plus strong Terraform and infrastructure-as-code discipline across multi-environment systems.
Operational experience with MongoDB Atlas or a comparable managed database platform - backup and recovery, performance tuning, and monitoring at scale.
Working knowledge of compliance operations - SOC 2 Type II and FERPA - and producing audit evidence (we use Vanta) as a byproduct of good engineering rather than a separate effort.
Comfort operating as a high-output IC with broad domain ownership and a long execution horizon.
Current familiarity with AI-assisted operations - intelligent alerting, anomaly detection, or AI-augmented incident response - is a plus.

Our Values

Impactful not Immediate - We prioritize and invest in initiatives that will be most impactful.
Progress before Perfection - We are action-oriented people. We are empowered to make decisions and achieve our goals.
Learners before Masters - We are curious and humble people who strive to constantly improve.
Together not Alone - We rally behind each other and pitch in to support the greater whole.
Customer Success not Support - We solve partner goals and prioritize their success.

Perks & Benefits

We invest in our team the same way we invest in our product - thoughtfully, and for the long term.

Competitive pay and full benefits - a salary calibrated to the seniority of the role, plus comprehensive medical & dental coverage for you and your family.
Truly remote - We're built remote-first, not remote-tolerant.
Time to recharge - flexible PTO, paid company holidays, tenure milestones that reward your commitment, and your birthday off.
Work that matters - every release helps students find their path to college and succeed once they get there. Your craft has a real human impact.

Our Interview Process

Our process is rigorous and designed to be real signal - for you and for us. Expect a live, interactive technical assessment with our engineers - real systems and real operational scenarios, not take-home busywork or trivia. You'll show how you reason about reliability, security, and trade-offs under pressure, and how you apply AI tooling in the work. We move quickly for the right people and give you a clear view of the bar before you accept.

About Element451
Element451 is spearheading a shift in student engagement, marketing, and success through AI. With our three AI-driven packages, we're ushering institutions into a new era of interaction. Our AI chatbots and copilot assistants pave the way for hyper-personalized communications and a transformative engagement journey. It's not just a platform, but a doorway to a future where every interaction is amplified by AI, making the academic journey resonate at every touchpoint.